1. Classification of Sindhi Headline News Documents based on TF-IDF Text Analysis Scheme
- Author
Qurban Ali Lakhan, Irfan Ali Kandhro, Ajab Ali Lashari, Sahar Zafar Jumani, Saima Sipy Nangraj, Mirza Taimoor Baig, and Subhash Guriro
- Subjects
Multidisciplinary; Stop words; business.industry; Computer science; Headline; 02 engineering and technology; computer.software_genre; language.human_language; Newspaper; 030507 speech-language pathology & audiology; 03 medical and health sciences; Statistical classification; ComputingMethodologies_PATTERNRECOGNITION; 0202 electrical engineering, electronic engineering, information engineering; language; Feature (machine learning); 020201 artificial intelligence & image processing; Sindhi; Artificial intelligence; 0305 other medical science; business; tf–idf; computer; Classifier (UML); Natural language processing
- Abstract
Objectives: Sindhi is a historically rich Indo-Aryan language with a diverse background and many dialects. The recent drive in globalization, e-commerce and e-literacy has influenced languages as well. Many Sindhi magazines, books, newspapers and web materials are available online, but no proper dataset has yet been designed for Sindhi information processing. This study presents a dataset of Sindhi news headlines and an automated tool that classifies online texts by predefined labels. Methods/Statistical Analysis: To collect the dataset, a scraping tool was designed to extract headlines from the most popular newspapers, Awami Awaz and Daily Jhoongar. The dataset contains 2800 Sindhi news headlines labeled with the following categories: 0. Entertainment, 1. Sports, 2. Science and Technology, 3. International, 4. National, 5. Sindhi news. The dataset is normalized by removing stop words and cleaning spaces, punctuation and other unnecessary text. Language features are then analyzed using TF-IDF and a vector model. The paper presents a Sindhi headline classification model implemented with the following machine learning classification algorithms: Multinomial NB, Linear SVC, Logistic Regression, MLP Classifier, SGD Classifier, Random Forest Classifier and Ridge Classifier. Findings: Linear SVC and the MLP Classifier perform better on Sindhi news headline categorization than the other classification techniques. The study helps improve the automatic classification of Sindhi headline text. Application/Improvements: The Linear SVC (LSVC) and MLP classifiers are recommended for Sindhi news headline classification.
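The normalization and TF-IDF steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the stop-word list, the English stand-in headlines, and the function names `normalize` and `tfidf` are all hypothetical, and the weighting uses the plain textbook form (term frequency times ln(N / document frequency)), which may differ from the variant used in the study.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; the abstract does not give the Sindhi list used.
STOP_WORDS = {"the", "a", "in", "of"}


def normalize(text):
    """Lowercase, replace punctuation with spaces, and drop stop words."""
    tokens = re.sub(r"[^\w\s]", " ", text.lower()).split()
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Assumes tf = count / document length and idf = ln(N / df).
    Libraries often apply smoothed variants, so values will differ.
    """
    tokenized = [normalize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        vectors.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors


# Toy English headlines standing in for the Sindhi data (illustrative only).
headlines = [
    "The team wins the cricket final!",
    "New phone launched in the market",
    "Cricket team announces a new captain",
]
vectors = tfidf(headlines)
```

In this toy corpus a term that occurs in only one headline, such as "wins", receives a higher weight than a term shared by two headlines, such as "team". The resulting sparse vectors are the kind of input a classifier such as Linear SVC would then be trained on.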
- Published
- 2019