Finding the Number of Latent Topics With Semantic Non-Negative Matrix Factorization
- Source :
- IEEE Access, Vol 9, Pp 117217-117231 (2021)
- Publication Year :
- 2021
- Publisher :
- IEEE, 2021.
Abstract
- Topic modeling, or identifying the set of topics that occur in a collection of articles, is one of the primary objectives of text mining. One of the big challenges in topic modeling is determining the correct number of topics: underestimating the number of topics results in a loss of information (omitted topics, i.e., underfitting), while overestimating it leads to noisy, unexplainable topics and overfitting. In this paper, we consider a semantic-assisted non-negative matrix factorization (NMF) topic model, which we call SeNMFk, based on Kullback-Leibler (KL) divergence and integrated with a method for determining the number of latent topics. SeNMFk involves (i) creating a random ensemble of pairs of matrices whose means are equal, respectively, to the initial words-by-documents matrix representing the text corpus and to the Shifted Positive Pointwise Mutual Information (SPPMI) matrix, which encodes the context information, and (ii) jointly factorizing each of these pairs with different numbers of topics to acquire sets of latent topics that are stable to noise. We demonstrate the performance of our method by identifying the number of topics in several benchmark text corpora and comparing it to other state-of-the-art techniques. We also show that the number of document classes in the input text corpus may differ from the number of extracted latent topics, but these classes can be retrieved by clustering the column vectors of one of the factor matrices. Additionally, we introduce a software package called pyDNMFk to estimate the number of topics. We demonstrate that our unsupervised method, SeNMFk, not only determines the correct number of topics, but also extracts topics with high coherence and accurately classifies the documents of the corpus.
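
The two steps named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation and does not use the pyDNMFk API. All function names (`sppmi`, `perturb`, `kl_nmf`, `joint_factorize`) are hypothetical. The sketch assumes multiplicative uniform noise for the random ensemble, and it approximates the joint factorization by concatenating the two matrices so that they share one word factor `W`. The stability scoring that selects the number of topics is omitted.

```python
import numpy as np

def sppmi(cooc, shift=1.0):
    """Shifted positive PMI of a word-by-word co-occurrence count matrix."""
    cooc = np.asarray(cooc, dtype=float)
    total = cooc.sum()
    row = cooc.sum(axis=1, keepdims=True)
    col = cooc.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(cooc * total / (row * col))
    pmi[~np.isfinite(pmi)] = -np.inf  # zero counts carry no association
    return np.maximum(pmi - np.log(shift), 0.0)

def perturb(X, eps, rng):
    """One ensemble member: multiplicative noise whose expectation is X (assumed noise model)."""
    return X * rng.uniform(1.0 - eps, 1.0 + eps, size=X.shape)

def kl_divergence(X, WH, tiny=1e-12):
    """Generalized KL divergence D(X || WH)."""
    return float(np.sum(X * np.log((X + tiny) / (WH + tiny)) - X + WH))

def kl_nmf(X, k, n_iter=200, seed=0, tiny=1e-12):
    """NMF with the standard multiplicative updates for the KL objective."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k)) + tiny
    H = rng.random((k, X.shape[1])) + tiny
    for _ in range(n_iter):
        W *= ((X / (W @ H + tiny)) @ H.T) / (H.sum(axis=1) + tiny)
        H *= (W.T @ (X / (W @ H + tiny))) / (W.sum(axis=0)[:, None] + tiny)
    return W, H

def joint_factorize(X, M, k, eps=0.03, seed=0):
    """Factorize a perturbed (X, M) pair with a shared word factor W.

    Simplifying assumption: [X | M] ~ W [H | G], i.e. plain concatenation.
    """
    rng = np.random.default_rng(seed)
    stacked = np.hstack([perturb(X, eps, rng), perturb(M, eps, rng)])
    W, HG = kl_nmf(stacked, k, seed=seed)
    n_docs = X.shape[1]
    return W, HG[:, :n_docs], HG[:, n_docs:]

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    X = rng.poisson(1.0, size=(30, 20)).astype(float)  # toy words-by-documents counts
    M = sppmi(X @ X.T)                                  # toy document-level co-occurrence
    for k in (2, 3, 4):
        W, H, G = joint_factorize(X, M, k, seed=k)
        print(k, round(kl_divergence(X, W @ H), 3))
```

In a fuller treatment, each candidate number of topics would be factorized over many perturbed pairs, and the candidate whose topic vectors cluster most consistently across the ensemble would be kept. The sketch stops short of that step.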
- Subjects :
- Text corpus
Topic model
General Computer Science
Computer science
topic modeling
Context (language use)
Overfitting
Pointwise mutual information
computer.software_genre
NLP
Matrix decomposition
Non-negative matrix factorization
Machine learning
General Materials Science
Analytical models
Cluster analysis
Probabilistic logic
business.industry
semantic non-negative matrix factorization
General Engineering
Data models
Stability analysis
Minimization
TK1-9971
Semantics
ComputingMethodologies_PATTERNRECOGNITION
Artificial intelligence
Electrical engineering. Electronics. Nuclear engineering
business
computer
Coherence
Natural language processing
Details
- Database :
- OpenAIRE
- Journal :
- IEEE Access, Vol 9, Pp 117217-117231 (2021)
- Accession number :
- edsair.doi.dedup.....307cd1f27e32f35faa42922f9c7de381