1. Optimized Identification of Sentence-Level Multiclass Events on Urdu-Language-Text Using Machine Learning Techniques
- Author
- Somia Ali, Uzma Jamil, Muhammad Younas, Bushra Zafar, and Muhammad Kashif Hanif
- Subjects
Sentence level classification, deep learning, machine learning, Urdu language, event classification, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
- Abstract
In today’s digital world, social media platforms generate a plethora of unstructured information. For low-resource languages such as Urdu, however, well-structured data for specific tasks such as event classification is scarce. Urdu, a language prominent in South Asia, has a complex morphological structure with unique features but lacks standard linguistic resources such as datasets. Long-text classification demands more effort than short-text classification because of its expansive vocabulary, information redundancy, and noise. Text processing is an active research area in which machine learning and deep learning techniques are widely used. Multiclass classification has been applied to text in many languages for various purposes. In this research, multiclass classification for the Urdu language was performed using a text dataset collected from five social media platforms: Geo News, Samaa News, Dawn News, Express News, and Urdu Blogs, totaling 103,771 sentences. We used sentence-level classification to assign sentences to 16 event categories: terrorist attacks, national news, sports, entertainment, politics, safety, earthquakes, fraud and corruption, sexual assault, weather, accidents, forces, inflation, murder and death, education, and international news. Deep learning, transformer-based, and machine learning classifiers were used for event classification. The SMFCNN classifier achieved an accuracy of 88.29%, and the proposed transformer-based XLM-R+ model performed best, with an accuracy of 89.8%. Our results were compared with previously reported techniques based on traditional models, and our approaches showed significant improvements. The novelty of this research lies in the inclusion of 16 event categories to broaden coverage and in the implementation of the SMFCNN and transformer-based algorithms.
This study highlights the potential of deep learning and transformer-based models to improve the accuracy and generalizability of multiclass classification in low-resource languages such as Urdu.
- Published
- 2025