A Biterm-based Dirichlet Process Topic Model for Short Texts
- Source :
- Proceedings of the 3rd International Conference on Computer Science and Service System.
- Publication Year :
- 2014
- Publisher :
- Atlantis Press, 2014.
Abstract
- Topic models are prevalent in many fields (e.g. context analysis), where they are applied to discover latent topics. In document modeling, conventional topic models (e.g. latent Dirichlet allocation and its variants) perform well on normal-length documents. However, severe data sparsity makes topic modeling in short texts difficult and unreliable. To tackle this problem, an effective approach, the biterm topic model, was recently proposed; it learns topics by directly modeling the generation of word co-occurrence patterns at the corpus level rather than the document level. However, it requires human intervention to determine the number of topics. In this paper, we propose a Dirichlet process based on word co-occurrence to make topic mining from short texts more automatic. We also design a Markov chain Monte Carlo sampling scheme for posterior inference in our model, an extension of the sampling algorithm based on the Chinese restaurant process. Finally, we conduct experiments on real data. The results show that our method outperforms the baseline in topic quality and perplexity, and is more flexible.
- Keywords: Dirichlet Process; Clustering; Biterm; Short Texts; Topic Mining
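The abstract's core idea (letting a Chinese restaurant process pick the number of topics while assigning whole biterms, i.e. unordered word pairs, to topics) can be illustrated with a minimal collapsed Gibbs sampler. This is an illustrative sketch only, not the paper's exact algorithm; the hyperparameters `alpha` (CRP concentration) and `beta` (topic-word smoothing) and all function names are assumptions for the example.

```python
import random
from collections import defaultdict

def sample_biterm_crp(biterms, vocab_size, alpha=1.0, beta=0.1, iters=50, seed=0):
    """Sketch of CRP-based Gibbs sampling over biterms.

    Each biterm (w1, w2) is assigned to one topic; the CRP term lets the
    sampler open a new topic with probability proportional to alpha.
    """
    rng = random.Random(seed)
    z = [-1] * len(biterms)                        # topic assignment per biterm
    n_b = defaultdict(int)                         # number of biterms per topic
    n_wk = defaultdict(lambda: defaultdict(int))   # word counts per topic

    def remove(i):
        # Remove biterm i's current assignment; drop empty topics.
        k = z[i]
        if k < 0:
            return
        n_b[k] -= 1
        for w in biterms[i]:
            n_wk[k][w] -= 1
        if n_b[k] == 0:
            del n_b[k]
            del n_wk[k]

    def word_prob(k, w):
        # Smoothed probability of word w under topic k
        # (each biterm contributes two word tokens to its topic).
        total = 2 * n_b[k]
        return (n_wk[k][w] + beta) / (total + vocab_size * beta)

    for _ in range(iters):
        for i, (w1, w2) in enumerate(biterms):
            remove(i)
            topics = list(n_b.keys())
            weights = [n_b[k] * word_prob(k, w1) * word_prob(k, w2)
                       for k in topics]
            # CRP: weight for seating the biterm at a brand-new topic,
            # under a uniform prior over word pairs.
            topics.append(max(n_b.keys(), default=-1) + 1)
            weights.append(alpha * (1.0 / vocab_size) ** 2)
            k = rng.choices(topics, weights=weights)[0]
            z[i] = k
            n_b[k] += 1
            n_wk[k][w1] += 1
            n_wk[k][w2] += 1
    return z
```

Because the number of topics is never fixed in advance, the sampler grows or shrinks the topic set as the data demand, which is the flexibility the abstract claims over the original biterm topic model.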
- Subjects :
- Topic model; Perplexity; Computer science; Inference; Latent Dirichlet allocation; Dirichlet process; Chinese restaurant process; Cluster analysis; Artificial intelligence; Natural language processing
Details
- ISSN :
- 19516851
- Database :
- OpenAIRE
- Journal :
- Proceedings of the 3rd International Conference on Computer Science and Service System
- Accession number :
- edsair.doi...........01332b4ce607fd8c6d9981cba2f76408
- Full Text :
- https://doi.org/10.2991/csss-14.2014.71