1. Smoothed n-gram based models for tweet language identification: A case study of the Brazilian and European Portuguese national varieties
- Author
- Dayvid Castro, Ellen Souza, Adriano L. I. Oliveira, Douglas Vitório, and Diego Santos
- Subjects
Language identification, Computer science, Speech recognition, Bigram, 02 engineering and technology, ComputingMethodologies_PATTERNRECOGNITION, n-gram, European Portuguese, 020204 information systems, Cache language model, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Artificial intelligence, Language model, Portuguese, tf–idf, Software, Word (computer architecture), Natural language processing
- Abstract
Identifying the language of a text is an important step for several natural language processing applications. State-of-the-art language identification (LID) systems perform very well when discriminating between unrelated languages on standard datasets. However, the LID task has a bottleneck when discriminating between similar languages or language varieties. Furthermore, LID has also proven to be very challenging when dealing with short texts such as the ones from Twitter. In this paper, we propose the use of smoothed n-gram language models to classify tweets written in the Brazilian and European Portuguese varieties. Word and character n-gram language models were combined and evaluated through five different classifiers. We also compared the smoothed n-gram language models with the Term Frequency-Inverse Document Frequency (TF-IDF) weighting scheme. This paper also proposes an ensemble model, in which the output class labels were combined using majority voting and algebraic combiners. The best configuration reached an accuracy of 92.71% using an ensemble model, which combines Lidstone (0.1) character 6-gram, Good–Turing word unigram, and Witten–Bell word bigram models, together with the Log-Likelihood Ratio estimation method.
- Published
- 2017
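The abstract mentions Lidstone-smoothed character n-gram models. As a rough illustration only, the sketch below trains one such model per variety and labels a text by the higher log-likelihood. It is not the authors' implementation: the class name `LidstoneCharModel`, the helper `classify`, the trigram order, and the four toy training sentences are all hypothetical, chosen to keep the example small (the paper's best configuration reports character 6-grams with a Lidstone parameter of 0.1, combined with word-level models in an ensemble).

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return the character n-grams of `text`, left-padded with spaces."""
    padded = " " * (n - 1) + text.lower()
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class LidstoneCharModel:
    """Character n-gram language model with Lidstone (add-gamma) smoothing."""

    def __init__(self, n=3, gamma=0.1):
        self.n = n
        self.gamma = gamma
        self.ngram_counts = Counter()    # counts of full n-grams
        self.context_counts = Counter()  # counts of (n-1)-character contexts
        self.vocab = set()               # characters seen in training

    def fit(self, texts):
        for text in texts:
            for gram in char_ngrams(text, self.n):
                self.ngram_counts[gram] += 1
                self.context_counts[gram[:-1]] += 1
                self.vocab.add(gram[-1])
        return self

    def log_prob(self, text, vocab_size):
        """Sum of log P(char | context) with add-gamma smoothing."""
        total = 0.0
        for gram in char_ngrams(text, self.n):
            numerator = self.ngram_counts[gram] + self.gamma
            denominator = self.context_counts[gram[:-1]] + self.gamma * vocab_size
            total += math.log(numerator / denominator)
        return total


def classify(text, models):
    """Pick the label whose model assigns the highest log-likelihood."""
    # Shared vocabulary size; the +1 reserves mass for unseen characters.
    vocab_size = len(set().union(*(m.vocab for m in models.values()))) + 1
    scores = {label: m.log_prob(text, vocab_size) for label, m in models.items()}
    return max(scores, key=scores.get), scores


# Hypothetical toy training data, far too small for real use.
training = {
    "pt-BR": ["você está pegando o ônibus", "a gente vai no trem"],
    "pt-PT": ["tu estás a apanhar o autocarro", "nós vamos no comboio"],
}
models = {label: LidstoneCharModel(n=3, gamma=0.1).fit(texts)
          for label, texts in training.items()}

label, scores = classify("autocarro", models)
print(label, scores)
```

An ensemble along the lines the abstract describes would train several such models (character and word level, with different smoothing methods) and combine their predicted labels, for example by majority vote.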