1. A LDA-Based Approach for Semi-Supervised Document Clustering
- Author
- Li Zhang, Ruizhang Huang, and Ping Zhou
- Subjects
Information Systems and Management, Estimation theory, Computer science, Computer Science::Information Retrieval, Inference, Pattern recognition, Document clustering, Latent Dirichlet allocation, Computer Science Applications, Normal distribution, Generative model, ComputingMethodologies_PATTERNRECOGNITION, Artificial Intelligence, Artificial intelligence, Data mining, Cluster analysis, Representation (mathematics)
- Abstract
In this paper, we develop an approach for semi-supervised document clustering based on Latent Dirichlet Allocation (LDA), namely LLDA. A small number of labeled documents is used to indicate the user's document grouping preference. A generative model is investigated to jointly model the documents and this small set of document labels. A variational inference algorithm is developed to infer the document collection structure. We explore the performance of our proposed approach on both a synthetic dataset and realistic document datasets. Our experiments indicate that the proposed approach performs well at grouping documents according to different user grouping preferences. A comparison between our approach and state-of-the-art semi-supervised clustering algorithms that use labeled instances shows that our approach is effective.
... clustering model designed with the LDA model. Considering the effectiveness of the LDA model on the document clustering problem, in this paper we investigate an LDA-based model for semi-supervised document clustering, namely LLDA. Labeled documents serve as the supervised information and indicate the user's document grouping preferences. We investigate a generative model in which documents are partitioned by maximizing the joint generative likelihood of the text documents and the user-provided document labels. These labels are treated as normally distributed variables and are regressed on the topic proportions. The computational cost of parameter estimation is also a concern in developing the LLDA model for semi-supervised document clustering. Traditionally, there are two algorithms for inferring LLDA parameters: the variational inference algorithm and the Gibbs sampling algorithm. Compared with the Gibbs sampling algorithm, the variational inference algorithm shows better computational performance due to the high-dimensional representation of text documents.
In this paper, we also derive a variational inference algorithm for the LLDA model. We have conducted extensive experiments on the proposed LLDA model using both synthetic and realistic datasets. We also compare our approach with state-of-the-art semi-supervised document clustering algorithms that use labeled documents as supervised information. Experimental results show that the LLDA model is effective for semi-supervised document clustering.
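The generative assumption stated in the abstract, that a document's label is normally distributed around a regression on its topic proportions, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a supervised-LDA-style process, and the names `alpha`, `beta`, `eta`, and `sigma`, as well as the use of empirical topic proportions as the regressors, are illustrative assumptions that do not come from the record above.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(alpha, beta, eta, sigma, n_words, rng):
    """Generate one document and a real-valued label (hypothetical sketch).

    alpha: Dirichlet prior over K topics
    beta:  K rows, each a word distribution over the vocabulary
    eta:   K regression coefficients (assumed name)
    sigma: standard deviation of the label noise (assumed name)
    """
    topics = list(range(len(alpha)))
    vocab = list(range(len(beta[0])))
    theta = sample_dirichlet(alpha, rng)  # per-document topic mixture

    words = []
    counts = [0] * len(alpha)
    for _ in range(n_words):
        z = rng.choices(topics, weights=theta)[0]   # topic assignment
        w = rng.choices(vocab, weights=beta[z])[0]  # word drawn from topic z
        words.append(w)
        counts[z] += 1

    # Empirical topic proportions of this document.
    z_bar = [c / n_words for c in counts]
    # Label: normal variable whose mean is a linear function of z_bar.
    mean = sum(e * p for e, p in zip(eta, z_bar))
    label = rng.gauss(mean, sigma)
    return words, z_bar, label


if __name__ == "__main__":
    rng = random.Random(0)
    beta = [[0.9, 0.1, 0.0], [0.0, 0.1, 0.9]]
    words, z_bar, label = generate_document(
        [0.5, 0.5], beta, [-1.0, 1.0], 0.1, 20, rng
    )
    print(words, z_bar, label)
```

In this sketch, documents dominated by the first topic tend to receive labels near -1 and those dominated by the second topic labels near +1, which is one way a handful of labeled documents could steer the grouping toward a user's preference.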
- Published
- 2014