August 09, 2019
We present a new model for learning word embeddings (words or phrases mapped to dense vectors of numbers that represent their meaning) that are resilient to misspellings. Although popular methods such as word2vec and GloVe provide viable representations for words observed during training, they fail to yield embeddings for out-of-vocabulary (OOV) words, that is, words that were unseen at training time. This becomes particularly troublesome when dealing with text that contains abbreviations, slang, or misspellings. To address this deficiency, we propose Misspelling Oblivious Embeddings (MOE), a new model that combines our open source library fastText with a supervised task that embeds misspellings close to their correct variants. We evaluated this approach on several intrinsic and extrinsic tasks and found that MOE outperforms fastText on user-generated text.
The loss function of fastText aims to embed words that occur in the same context more closely together. We call this the semantic loss. In addition to the semantic loss, MOE considers a supervised loss that we call the spell correction loss, which aims to embed misspellings close to their correct versions. MOE is trained by minimizing the weighted sum of the semantic loss and the spell correction loss.
Minimizing the semantic loss corresponds to maximizing the probability of predicting the context given the word. Similarly, minimizing the spell correction loss corresponds to maximizing the probability of predicting the correct word given its misspelling. We trained the semantic loss on a Wikipedia dump and the spell correction loss on a misspelling dataset with 20 million examples. To support future research, we released the misspelling dataset.
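To make the weighted sum concrete, here is a minimal sketch of how the two losses could be combined. It is an illustration under stated assumptions, not the MOE training code: the weight `alpha`, the convex-combination form, the full softmax (fastText itself typically approximates this, for example with negative sampling), and all toy vectors are hypothetical choices made for readability.

```python
import math

def neg_log_likelihood(input_vec, output_vecs, target_index):
    """Return -log p(target | input) under a full softmax over dot products.

    Assumption: a full softmax is used here only to keep the sketch short.
    """
    scores = [sum(a * b for a, b in zip(input_vec, out)) for out in output_vecs]
    max_score = max(scores)  # subtract the max for numerical stability
    log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return log_norm - scores[target_index]

def combined_loss(semantic_loss, spell_loss, alpha=0.5):
    """Weighted sum of the two losses; `alpha` is a hypothetical weight."""
    return (1.0 - alpha) * semantic_loss + alpha * spell_loss

# Toy output vectors for a three-word vocabulary (illustrative values only).
output_vecs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

# Semantic loss: predict a context word (index 0) from a word vector.
word_vec = [2.0, 0.0]
semantic = neg_log_likelihood(word_vec, output_vecs, target_index=0)

# Spell correction loss: predict the correct word (index 1) from a misspelling vector.
misspelling_vec = [0.0, 1.0]
spell = neg_log_likelihood(misspelling_vec, output_vecs, target_index=1)

total = combined_loss(semantic, spell, alpha=0.25)
print(semantic, spell, total)
```

In this sketch, a larger `alpha` puts more emphasis on pulling misspellings toward their correct words, while a smaller `alpha` favors the context-based objective.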
To evaluate our approach, we want to know whether we are mapping misspellings close to the correct variants. Take the misspelling “samallest” as an example. If we retrieve the six closest neighbors of “samallest,” “smallest” doesn’t appear in the retrieved list for fastText. This shows that even changing one letter may significantly change the embedding that fastText generates. MOE, however, is able to retrieve “smallest” in the top three results.
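The neighbor check described above can be sketched as a cosine-similarity lookup. This is a rough illustration only: the function names are hypothetical, cosine similarity is an assumed distance measure, and the three-dimensional vectors are made-up toy values, not embeddings produced by fastText or MOE.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbors(query, embeddings, k=6):
    """Return the k words most similar to `query`, excluding the query itself."""
    query_vec = embeddings[query]
    scored = [
        (cosine_similarity(query_vec, vec), word)
        for word, vec in embeddings.items()
        if word != query
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [word for _, word in scored[:k]]

def rank_of(target, query, embeddings):
    """1-based position of `target` among all neighbors of `query`."""
    neighbors = nearest_neighbors(query, embeddings, k=len(embeddings))
    return neighbors.index(target) + 1

# Made-up toy vectors, chosen so the misspelling sits near its correction.
toy_embeddings = {
    "samallest": [0.9, 0.1, 0.2],
    "smallest": [0.8, 0.2, 0.1],
    "largest": [0.1, 0.9, 0.3],
    "banana": [-0.5, 0.2, 0.9],
}

print(nearest_neighbors("samallest", toy_embeddings, k=2))
print(rank_of("smallest", "samallest", toy_embeddings))
```

With real embeddings, the same check would ask whether the rank of the correct word falls within the top k neighbors of its misspelling.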
Existing word embedding methods are often unable to deal with malformed texts, which contain OOV words. In real-world tasks, input text generated by people often contains misspellings (a common source of OOV words), particularly in situations like web searches, chat messages, and social media posts. Misspellings appear in up to 15 percent of web search queries. Our approach will improve the ability to apply word embeddings to real-world scenarios. Future work could apply this method to train multilingual embeddings and contextual embeddings.