
Finding the Number of Latent Topics With Semantic Non-Negative Matrix Factorization

Authors :
Hristo N. Djidjev
Erik Skau
Boian S. Alexandrov
James P. Smith
Manish Bhattarai
Tom Tierney
Raviteja Vangara
Valentin Stanev
Gopinath Chennupati
Source :
IEEE Access, Vol 9, Pp 117217-117231 (2021)
Publication Year :
2021
Publisher :
IEEE, 2021.

Abstract

Topic modeling, or identifying the set of topics that occur in a collection of articles, is one of the primary objectives of text mining. One of the big challenges in topic modeling is determining the correct number of topics: underestimating the number of topics results in a loss of information (omitted topics, i.e., underfitting), while overestimating leads to noisy, unexplainable topics (overfitting). In this paper, we consider a semantic-assisted non-negative matrix factorization (NMF) topic model, which we call SeNMFk, based on Kullback-Leibler (KL) divergence and integrated with a method for determining the number of latent topics. SeNMFk involves (i) creating a random ensemble of pairs of matrices whose means equal, respectively, the initial words-by-documents matrix representing the text corpus and the Shifted Positive Pointwise Mutual Information (SPPMI) matrix, which encodes context information, and (ii) jointly factorizing each of these pairs with different numbers of topics to acquire sets of latent topics that are stable to noise. We demonstrate the performance of our method by identifying the number of topics in several benchmark text corpora and comparing the results with other state-of-the-art techniques. We also show that the number of document classes in the input text corpus may differ from the number of extracted latent topics, but these classes can be retrieved by clustering the column vectors of one of the factor matrices. Additionally, we introduce a software package, pyDNMFk, to estimate the number of topics. We demonstrate that our unsupervised method, SeNMFk, not only determines the correct number of topics, but also extracts topics with high coherence and accurately classifies the documents of the corpus.
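
As a rough illustration of step (i) only, the sketch below builds an SPPMI matrix from a small word co-occurrence table and draws a random ensemble of (words-by-documents, SPPMI) pairs whose entrywise mean approaches the original pair. It is not taken from the paper or from pyDNMFk: the function names `sppmi`, `perturb`, and `make_ensemble`, the `shift` parameter, the toy matrices, and the choice of multiplicative uniform noise are all assumptions made for this example. The SPPMI formula used is the common definition max(PMI - log(shift), 0). The joint KL-divergence factorization and the stability analysis of step (ii) are not shown.

```python
import math
import random


def sppmi(cooc, shift=1.0):
    """Shifted positive PMI: max(PMI(w, c) - log(shift), 0) per word pair."""
    total = sum(sum(row) for row in cooc)
    row_sums = [sum(row) for row in cooc]
    col_sums = [sum(col) for col in zip(*cooc)]
    out = []
    for i, row in enumerate(cooc):
        out_row = []
        for j, count in enumerate(row):
            if count == 0:
                out_row.append(0.0)  # PMI is -inf here, so the positive part is 0
                continue
            pmi = math.log(count * total / (row_sums[i] * col_sums[j]))
            out_row.append(max(pmi - math.log(shift), 0.0))
        out.append(out_row)
    return out


def perturb(matrix, eps, rng):
    """Scale each entry by an independent factor drawn from U(1 - eps, 1 + eps).

    The factor has mean 1, so the ensemble mean matches the input in expectation.
    (Assumed noise model, chosen for illustration.)
    """
    return [[x * rng.uniform(1.0 - eps, 1.0 + eps) for x in row] for row in matrix]


def make_ensemble(words_by_docs, context, size, eps, rng):
    """Return `size` perturbed (words-by-documents, context) matrix pairs."""
    return [(perturb(words_by_docs, eps, rng), perturb(context, eps, rng))
            for _ in range(size)]


# Toy data (hypothetical): 3 words, 2 documents, symmetric co-occurrence counts.
X = [[2.0, 0.0], [1.0, 3.0], [0.0, 4.0]]
cooc = [[0, 4, 1], [4, 0, 1], [1, 1, 0]]
M = sppmi(cooc, shift=1.0)

rng = random.Random(0)
ensemble = make_ensemble(X, M, size=2000, eps=0.1, rng=rng)

# Entrywise mean of the perturbed words-by-documents matrices.
mean_X = [[sum(pair[0][i][j] for pair in ensemble) / len(ensemble)
           for j in range(len(X[0]))] for i in range(len(X))]
```

Each pair in `ensemble` would then be factorized jointly for a range of candidate topic counts; topics that persist across the perturbed pairs are the ones the abstract describes as stable to noise.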

Details

Database :
OpenAIRE
Accession number :
edsair.doi.dedup.....307cd1f27e32f35faa42922f9c7de381