Classification of Sindhi Headline News Documents based on TF-IDF Text Analysis Scheme
- Source :
- Indian Journal of Science and Technology. 12:1-10
- Publication Year :
- 2019
- Publisher :
- Indian Society for Education and Environment, 2019.
Abstract
- Objectives: Sindhi is a historically rich Indo-Aryan language with a diverse background and many dialects. The recent drive toward globalization, e-commerce, and e-literacy has influenced languages as well. Many Sindhi magazines, books, newspapers, and web materials are available online, but no proper dataset has yet been designed for Sindhi information processing. This study focuses on a dataset of Sindhi news headline texts and an automated tool that classifies online texts according to predefined labels. Methods/Statistical Analysis: To collect the dataset, a scraping tool was designed to extract headline news from the most popular newspapers, Awami Awaz and Daily Jhoongar. The dataset contains 2800 Sindhi news headlines in the following categories (labeled 0 to 5): 0. Entertainment, 1. Sports, 2. Science and Technology, 3. International, 4. National, 5. Sindhi news. The dataset is normalized by removing stop words and cleaning up spaces, punctuation, and other unnecessary text. Language features are then analyzed using TF-IDF and a vector model. The paper presents a Sindhi headline news classification model implemented with the following machine learning classification algorithms: Multinomial NB, Linear SVC, Logistic Regression, MLP Classifier, SGD Classifier, Random Forest Classifier, and Ridge Classifier. Findings: The results show that Linear SVC and MLP Classifier perform better on Sindhi headline news categorization than the other classification techniques. This study helps improve the automatic classification of Sindhi headline news texts. Application/Improvements: It is recommended that the Linear SVC (LSVC) and MLP classifiers be used for Sindhi news headline classification.
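The normalization and TF-IDF steps named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The stop-word list, the English stand-in headlines, and all function names are hypothetical placeholders. The smoothed IDF formula is one common variant chosen here as an assumption, since the abstract does not specify one. A simple nearest-centroid rule stands in for the classifiers the paper evaluates (Linear SVC, MLP Classifier, and the others).

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical placeholder data: English stand-ins, NOT the paper's Sindhi dataset.
STOP_WORDS = {"the", "a", "in", "of", "to", "for"}
TRAIN = [
    ("the team wins the cricket final", "Sports"),
    ("a striker scores in the football match", "Sports"),
    ("new phone released to the market", "Science and Technology"),
    ("scientists test a new rocket engine", "Science and Technology"),
]


def normalize(text):
    """Lowercase, replace punctuation with spaces, split, drop stop words."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def fit_idf(token_lists):
    """Smoothed IDF (assumed variant): ln((1 + N) / (1 + df)) + 1."""
    n_docs = len(token_lists)
    doc_freq = Counter(tok for tokens in token_lists for tok in set(tokens))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1
            for tok, df in doc_freq.items()}


def unit(vec):
    """Scale a sparse vector (dict) to unit L2 length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {tok: w / norm for tok, w in vec.items()} if norm else {}


def tfidf_vector(tokens, idf):
    """Unit-length TF-IDF vector; terms unseen during fitting are ignored."""
    counts = Counter(tok for tok in tokens if tok in idf)
    return unit({tok: count * idf[tok] for tok, count in counts.items()})


def cosine(a, b):
    """Dot product of two unit-length sparse vectors, i.e. cosine similarity."""
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


def train(examples):
    """Fit IDF weights and one unit-length centroid per category label."""
    token_lists = [normalize(text) for text, _ in examples]
    idf = fit_idf(token_lists)
    sums = defaultdict(Counter)
    for tokens, (_, label) in zip(token_lists, examples):
        sums[label].update(tfidf_vector(tokens, idf))
    return idf, {label: unit(vec) for label, vec in sums.items()}


def predict(text, idf, centroids):
    """Return the label whose centroid is most similar to the headline."""
    vec = tfidf_vector(normalize(text), idf)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


idf, centroids = train(TRAIN)
print(predict("cricket match final", idf, centroids))
```

In this sketch, a term that occurs in fewer training headlines receives a larger IDF weight, so category-specific words contribute more to the similarity score than words shared across categories.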
- Subjects :
- Multidisciplinary
Stop words
business.industry
Computer science
Headline
02 engineering and technology
computer.software_genre
language.human_language
Newspaper
030507 speech-language pathology & audiology
03 medical and health sciences
Statistical classification
ComputingMethodologies_PATTERNRECOGNITION
0202 electrical engineering, electronic engineering, information engineering
language
Feature (machine learning)
020201 artificial intelligence & image processing
Sindhi
Artificial intelligence
0305 other medical science
business
tf–idf
computer
Classifier (UML)
Natural language processing
Details
- ISSN :
- 09745645 and 09746846
- Volume :
- 12
- Database :
- OpenAIRE
- Journal :
- Indian Journal of Science and Technology
- Accession number :
- edsair.doi...........5191afc1c45aa60f2ebbc32d12b62f01
- Full Text :
- https://doi.org/10.17485/ijst/2019/v12i33/146130