Teaching AI systems to learn language from letters, not words


What the research is:

A new approach to natural language processing (NLP) that teaches neural networks linguistic fundamentals by training them on unsegmented text, so that they learn from the interactions between individual letters rather than from whole words.

How it works:

Most recurrent neural networks (RNNs) that form the basis of NLP systems are trained on vocabularies of known words. To train RNNs in a way that more closely resembles how humans learn the fundamentals of language, we removed the word boundaries from training datasets and trained the networks at the character (instead of word) level. A multilingual study of this unsupervised character-level language modeling task used datasets of millions of words in English, German, and Italian. It showed that these “near tabula rasa” RNNs develop an impressive spectrum of linguistic knowledge, including segmenting groups of characters into words, distinguishing nouns from verbs, and even inducing simple forms of word meaning.
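
As a rough illustration of the data preparation described above, the following Python sketch strips the whitespace from a sentence and turns the resulting character stream into next-character prediction pairs. It is not the paper's code: the function names, the toy sentence, and the choice to remove only whitespace (leaving case and punctuation untouched) are assumptions made for illustration.

```python
def remove_word_boundaries(text):
    """Drop all whitespace so the text becomes one unbroken character stream."""
    return "".join(text.split())


def build_char_vocab(stream):
    """Map each distinct character to an integer id (sorted for determinism)."""
    return {ch: i for i, ch in enumerate(sorted(set(stream)))}


def next_char_pairs(stream, vocab):
    """Return (input_id, target_id) pairs for next-character prediction."""
    ids = [vocab[ch] for ch in stream]
    return list(zip(ids[:-1], ids[1:]))


# Hypothetical toy input; the study used corpora of millions of words.
sample = "the cat sat on the mat"
stream = remove_word_boundaries(sample)   # "thecatsatonthemat"
vocab = build_char_vocab(stream)          # character -> integer id
pairs = next_char_pairs(stream, vocab)    # training pairs for a character-level model
```

A network trained on pairs like these never sees where one word ends and the next begins, so any word segmentation it exhibits has to be induced from character statistics alone.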

Why it matters:

This work is part of Facebook AI's ongoing efforts to reduce the need for supervision in language systems, including for machine translation tasks. Our results suggest that, given enough input, AI systems can learn many linguistic rules effectively from scratch. This opens the door to future research into unsupervised language learning approaches that require less prior knowledge than present-day NLP systems.

Read the full paper:

Tabula nearly rasa: Probing the linguistic knowledge of character-level neural language models trained on unsegmented text