Classification of Sindhi Headline News Documents based on TF-IDF Text Analysis Scheme

Authors :
Qurban Ali Lakhan
Irfan Ali Kandhro
Ajab Ali Lashari
Sahar Zafar Jumani
Saima Sipy Nangraj
Mirza Taimoor Baig
Subhash Guriro
Source :
Indian Journal of Science and Technology. 12:1-10
Publication Year :
2019
Publisher :
Indian Society for Education and Environment, 2019.

Abstract

Objectives: Sindhi is a historically rich Indo-Aryan language with a diverse background and many dialects. The recent drive toward globalization, e-commerce and e-literacy has influenced languages as well. Many Sindhi magazines, books, newspapers and web materials are available online, but no proper dataset has yet been designed for Sindhi information processing. This study presents a dataset of Sindhi news headlines and an automated tool that classifies online texts according to predefined labels. Methods/Statistical Analysis: To collect the dataset, a scraping tool was designed to extract headlines from two of the most popular newspapers, Awami Awaz and Daily Jhoongar. The dataset contains 2800 Sindhi news headlines in categories labeled 0 to 5: 0. Entertainment, 1. Sports, 2. Science and Technology, 3. International, 4. National, 5. Sindhi news. The dataset is normalized by removing stop words and cleaning up spaces, punctuation and other unnecessary text. The language features are then analyzed using TF-IDF and a vector model. The paper presents a Sindhi headline news classification model implemented with the following machine learning classification algorithms: Multinomial NB, Linear SVC, Logistic Regression, MLP Classifier, SGD Classifier, Random Forest Classifier and Ridge Classifier. Findings: The results show that Linear SVC and MLP Classifier perform better on Sindhi news headline categorization than the other classification techniques. This study helps improve the automatic classification of Sindhi news headlines. Application/Improvements: It is recommended that the Linear SVC (LSVC) and MLP classifiers be used for Sindhi-language news headline classification.
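
The normalization and TF-IDF steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state which TF-IDF variant, tokenizer or stop-word list was used, so the weighting here is the common textbook form tf * log(N / df), and the stop-word set, the function names `normalize` and `tfidf_vectors`, and the English sample headlines are all hypothetical placeholders rather than data from the paper.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; the paper's Sindhi list is not given in the abstract.
STOP_WORDS = {"the", "in", "a"}


def normalize(text):
    """Lowercase, replace punctuation with spaces, split on whitespace, drop stop words."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df).

    tf is the term count divided by the document length, N is the number of
    documents, and df is the number of documents containing the term.
    This is an assumed textbook variant, not necessarily the paper's.
    """
    tokenized = [normalize(d) for d in docs]
    n_docs = len(tokenized)
    # Count each term once per document to get document frequencies.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Placeholder headlines (English stand-ins, not taken from the dataset).
headlines = [
    "The team wins the match!",
    "New film in the cinema",
    "Team loses a match",
]
vectors = tfidf_vectors(headlines)
```

In this sketch a term that appears in fewer headlines receives a higher weight than one shared across headlines, which is the property a downstream classifier such as those listed in the abstract would rely on. The resulting sparse vectors would still need to be mapped to a fixed vocabulary before being passed to a classifier.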

Details

ISSN :
09745645 and 09746846
Volume :
12
Database :
OpenAIRE
Journal :
Indian Journal of Science and Technology
Accession number :
edsair.doi...........5191afc1c45aa60f2ebbc32d12b62f01
Full Text :
https://doi.org/10.17485/ijst/2019/v12i33/146130