A nonparametric model for online topic discovery with word embeddings.
- Source :
- Information Sciences. Dec 2019, Vol. 504, p32-47. 16p.
- Publication Year :
- 2019
Abstract
- With the explosive growth of short documents generated from streaming textual sources (e.g., Twitter), latent topic discovery has become a critical task for short text stream clustering. However, most online clustering models determine the probability of producing a new topic by manually setting a hyper-parameter/threshold, which becomes a barrier to achieving better topic discovery results. Moreover, topics generated by existing models often cover a wide span of the vocabulary, which is not suitable for online social media analysis. Therefore, we propose a nonparametric model (NPMM) which exploits auxiliary word embeddings to infer the topic number and employs a "spike and slab" function to alleviate the sparsity problem of topic-word distributions in online short text analyses. NPMM can automatically decide whether a given document belongs to existing topics, as measured by the squared Mahalanobis distance. Hence, the proposed model is free from tuning a hyper-parameter to obtain the probability of generating new topics. Additionally, we propose a nonparametric sampling strategy to discover representative terms for each topic. To perform inference, we introduce a one-pass Gibbs sampling algorithm based on Cholesky decomposition of covariance matrices, which can further be sped up using a Metropolis-Hastings step. Our experiments demonstrate that NPMM significantly outperforms the state-of-the-art algorithms. [ABSTRACT FROM AUTHOR]
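- The abstract states that NPMM decides whether a document belongs to an existing topic via the squared Mahalanobis distance, and that its Gibbs sampler relies on Cholesky decomposition of covariance matrices. A minimal sketch of that distance computation is below; the function name and NumPy-based formulation are illustrative assumptions, not code from the paper.

```python
import numpy as np

def sq_mahalanobis(x, mu, cov):
    """Squared Mahalanobis distance (x - mu)^T cov^{-1} (x - mu),
    computed via the Cholesky factor of cov (illustrative sketch,
    not the paper's implementation).

    With cov = L @ L.T (L lower triangular), solving L z = (x - mu)
    gives the distance as z @ z, avoiding an explicit matrix inverse.
    """
    L = np.linalg.cholesky(cov)          # cov must be symmetric positive definite
    z = np.linalg.solve(L, x - mu)       # triangular system L z = (x - mu)
    return float(z @ z)
```

- For example, with an identity covariance the distance reduces to the squared Euclidean distance, so a document embedding far from every topic's mean (under each topic's covariance) would trigger the creation of a new topic.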
Details
- Language :
- English
- ISSN :
- 0020-0255
- Volume :
- 504
- Database :
- Academic Search Index
- Journal :
- Information Sciences
- Publication Type :
- Periodical
- Accession number :
- 138180004
- Full Text :
- https://doi.org/10.1016/j.ins.2019.07.048