
Next word prediction for Urdu language using deep learning models.

Authors :
Shahid, Ramish
Wali, Aamir
Bashir, Maryam
Source :
Computer Speech & Language, Aug 2024, Vol. 87.
Publication Year :
2024

Abstract

Deep learning models are widely used for natural language processing, yet they have been applied to only a few languages. Pretrained models exist, but mostly for English. Low-resource languages such as Urdu cannot readily benefit from these pre-trained deep learning models, and their effectiveness for Urdu language processing remains an open question. This paper investigates the usefulness of deep learning models for next-word prediction and suggestion in Urdu. To this end, it proposes two word-prediction models for Urdu. First, we use an LSTM for neural language modeling of Urdu; LSTMs are a popular approach to language modeling because of their ability to process sequential data. Second, we employ BERT, which was specifically designed for natural language modeling. We train BERT from scratch on an Urdu corpus of 1.1 million sentences, paving the way for further studies in the Urdu language. We achieved an accuracy of 52.4% with LSTM and 73.7% with BERT. Our proposed BERT model outperformed two other pre-trained BERT models developed for Urdu. Since next-word prediction is a multi-class problem in which the number of classes equals the vocabulary size, this accuracy is promising. Based on this performance, BERT appears effective for the Urdu language, and this paper lays the groundwork for future studies.
• The first language model for Urdu using LSTM and BERT.
• The proposed model predicts the next word in the Urdu language.
• Presents a BERT for Urdu trained from scratch on 1.1 million Urdu sentences.
• Presents a pre-trained language model for use in many NLP tasks for Urdu.
• The BERT model trained from scratch outperforms other pre-trained BERT models developed for Urdu.
[ABSTRACT FROM AUTHOR]
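The abstract frames next-word prediction as a multi-class classification problem whose class set is the vocabulary itself. As a minimal illustration of that task framing only — not the authors' LSTM or BERT models — the sketch below ranks candidate next words by bigram frequency over a tiny hypothetical corpus (the sentences and function names are our own assumptions for demonstration):

```python
from collections import Counter, defaultdict

# Tiny hypothetical corpus standing in for a large training corpus
# such as the paper's 1.1M-sentence Urdu collection.
corpus = [
    "this is a test",
    "this is another test",
    "this was a test",
]

# Count bigrams: for each word, how often each following word occurs.
follows = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1

def predict_next(word, k=3):
    """Return up to k most frequent next-word candidates after `word`,
    i.e. the top classes out of the whole vocabulary."""
    return [w for w, _ in follows[word].most_common(k)]

print(predict_next("this"))  # ['is', 'was'] — "is" follows "this" twice
```

A neural language model such as the paper's LSTM or BERT replaces these raw counts with a learned probability distribution over the vocabulary, which is what makes accuracies like 73.7% notable given how many classes compete.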

Details

Language :
English
ISSN :
0885-2308
Volume :
87
Database :
Academic Search Index
Journal :
Computer Speech & Language
Publication Type :
Academic Journal
Accession number :
177037991
Full Text :
https://doi.org/10.1016/j.csl.2024.101635