Analyzing LDA and NMF Topic Models for Urdu Tweets via Automatic Labeling
- Source :
- IEEE Access, Vol 9, Pp 127531-127547 (2021)
- Publication Year :
- 2021
- Publisher :
- IEEE, 2021.
Abstract
- Understanding and analyzing the content available on social media platforms such as Twitter and Facebook through topic modeling is an unsupervised task. However, conventional techniques have had limited success when applied directly to filtering and quickly comprehending short-text content, owing to text sparseness and noise. It has therefore always been challenging to discover reliable latent topics in online discussion texts, which are characterized by low word co-occurrence and by the limited availability of large social media benchmark datasets, even for resource-rich languages. The existing literature lacks such work for Urdu text, even with conventional topic models, mainly because of the lack of benchmark datasets, the limited availability of pre-processing tools and algorithms, and the time and compute constraints of large datasets. This work presents experiments with multiple topic modeling approaches, namely Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (PLSA), Latent Dirichlet Allocation (LDA), and Non-negative Matrix Factorization (NMF), on 0.8 million Urdu tweets. The tweets were collected through the Twitter API using various hashtags as queries, to avoid the dominance of a single topic in the dataset. In addition, we pre-processed the tweet text, prepared three variants of the collected dataset, and extracted multiple features to represent documents over different n-grams. All of these techniques are compared and evaluated on the dataset variants using both qualitative and quantitative measures. We also present the results of these approaches through visualization methods, graphs depicting the number of tweets per topic, word clouds, and hashtag analysis, giving insight into each algorithm's performance on the finalized topics. The observed results reveal that NMF with TF-IDF feature vectors outperformed the other techniques on Urdu tweet text, while LDA performed best when short texts were merged into long pseudo-documents.
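The two ideas named at the end of the abstract, merging short texts into long pseudo-documents and representing documents as TF-IDF vectors, can be illustrated with a minimal Python sketch. It is not the authors' pipeline: pooling tweets by shared hashtag is an assumption (the abstract does not say how the texts were merged), the function names `pool_by_hashtag` and `tf_idf` are hypothetical, whitespace tokenization stands in for real Urdu pre-processing, and the romanized sample tweets are made-up toy data.

```python
import math
import re
from collections import Counter, defaultdict

HASHTAG = re.compile(r"#(\w+)")


def pool_by_hashtag(tweets):
    """Merge tweets that share a hashtag into one pseudo-document.

    Assumption: hashtag pooling is only one possible merging strategy.
    Returns a dict mapping hashtag -> list of tokens.
    """
    pools = defaultdict(list)
    for text in tweets:
        tags = HASHTAG.findall(text)
        # Drop the hashtags themselves, then tokenize on whitespace.
        tokens = HASHTAG.sub(" ", text).split()
        for tag in tags or ["_untagged"]:
            pools[tag].extend(tokens)
    return dict(pools)


def tf_idf(docs):
    """Compute TF-IDF weights for a dict of name -> token list.

    Uses relative term frequency and idf = log(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter(t for tokens in docs.values() for t in set(tokens))
    weights = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        weights[name] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights


# Toy data (hypothetical, romanized for readability).
tweets = [
    "#cricket match aaj shandar",
    "#cricket team jeet gayi",
    "#mausam barish aaj",
]
pools = pool_by_hashtag(tweets)
weights = tf_idf(pools)
```

In this toy run the two `#cricket` tweets collapse into a single six-token pseudo-document, and a term that occurs in every pseudo-document (here `aaj`) receives a weight of zero. Matrices of this kind are what a factorization method such as NMF, or a probabilistic model such as LDA, would then take as input.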
- Subjects :
- Topic model
General Computer Science
Computer science
Feature vector
topic modeling
Semantics
Latent Dirichlet allocation
Non-negative matrix factorization
General Materials Science
Social media
topic evaluation
Electrical and Electronic Engineering
short-text topic model
Probabilistic latent semantic analysis
Latent semantic analysis
Natural language processing
General Engineering
TK1-9971
ComputingMethodologies_PATTERNRECOGNITION
Urdu text processing
Artificial intelligence
Electrical engineering. Electronics. Nuclear engineering
Details
- Language :
- English
- ISSN :
- 21693536
- Volume :
- 9
- Database :
- OpenAIRE
- Journal :
- IEEE Access
- Accession number :
- edsair.doi.dedup.....629e0ef94d11f72e233f339a650aefed