Processing Large Text Corpus Using N-Gram Language Modeling and Smoothing

Authors :: Sandhya Avasthi; D. P. Acharjya; Ritu Chauhan
Source :: Lecture Notes in Networks and Systems ISBN: 9789811596889
Publication Year :: 2021
Publisher :: Springer Singapore, 2021.
Abstract: Predicting the next word, letter, or phrase while a user is typing is a valuable tool for improving the user experience. Users frequently communicate, write reviews, and express opinions on social media platforms, often while on the move. An application that reduces typing effort and spelling errors is therefore useful when time is limited. Text data keeps growing because of the extensive use of all kinds of social media platforms, and the volume that must be processed for language modeling makes a text prediction application difficult to implement. The primary objective of this paper is to process a large text corpus and implement a probabilistic model, the N-gram model, to predict the next word from the user's input. In this exploratory research, n-gram models are discussed and evaluated using Good-Turing estimation, the perplexity measure, and the type-to-token ratio.
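
The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy corpus, the function names, and the choice of a bigram (n = 2) model are all assumptions made for illustration. Perplexity is computed here with add-one (Laplace) smoothing as a simpler stand-in for full Good-Turing smoothing. The Good-Turing idea appears only as the standard estimate of the probability mass reserved for unseen bigrams, N1 / N.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus; a real corpus would be far larger.
corpus = "the cat sat on the mat . the cat ate the fish .".split()

def type_token_ratio(tokens):
    """Distinct word types divided by total tokens."""
    return len(set(tokens)) / len(tokens)

def bigram_counts(tokens):
    """Map each word to a Counter of the words that follow it."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Most frequent follower of `word`, or None if the word was never seen."""
    followers = counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

def good_turing_unseen_mass(counts):
    """Good-Turing estimate of the total probability of unseen bigrams:
    N1 / N, where N1 is the number of bigram types seen exactly once
    and N is the total number of bigram tokens."""
    freqs = [c for followers in counts.values() for c in followers.values()]
    n1 = sum(1 for c in freqs if c == 1)
    return n1 / sum(freqs)

def perplexity(counts, vocab, tokens):
    """Bigram perplexity with add-one smoothing (a stand-in assumption,
    not Good-Turing smoothing)."""
    v = len(vocab)
    log_prob = 0.0
    n = 0
    for prev, nxt in zip(tokens, tokens[1:]):
        followers = counts.get(prev, Counter())
        p = (followers[nxt] + 1) / (sum(followers.values()) + v)
        log_prob += math.log(p)
        n += 1
    return math.exp(-log_prob / n)

counts = bigram_counts(corpus)
vocab = set(corpus)
held_out = "the cat sat on the fish .".split()

print("type-to-token ratio:", type_token_ratio(corpus))
print("prediction after 'the':", predict_next(counts, "the"))
print("unseen bigram mass:", good_turing_unseen_mass(counts))
print("held-out perplexity:", perplexity(counts, vocab, held_out))
```

Lower perplexity on held-out text indicates a better model. A low type-to-token ratio indicates that words repeat often, which is typical of large corpora.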