Graph Attention Topic Modeling Network
- Authors
- Yuanfang Guo, Di Jin, Chuan Wang, Xiaochun Cao, Fan Wu, Junhua Gu, and Liang Yang
- Subjects
- Independent and identically distributed random variables, Topic model, Theoretical computer science, Word embedding, Computer science, Inference, Latent variable, Overfitting, Latent Dirichlet allocation, Dirichlet distribution, Stochastic block model, Probabilistic latent semantic analysis, Document classification, Graph (abstract data type), Topological graph theory, Latent semantic indexing
- Abstract
Existing topic modeling approaches suffer from several issues, including the overfitting of Probabilistic Latent Semantic Indexing (pLSI), the failure of Latent Dirichlet Allocation (LDA) to capture rich correlations among topics, and high inference complexity. In this paper, we provide a new method to overcome the overfitting issue of pLSI by using amortized inference with word embeddings as input, instead of the Dirichlet prior in LDA. For generative topic models, the large number of free latent variables is the root cause of overfitting. To reduce the number of parameters, amortized inference replaces the inference of latent variables with a function that has shared (amortized) learnable parameters. The number of shared parameters is fixed and independent of the scale of the corpus. To overcome the limitation that amortized inference applies only to independent and identically distributed (i.i.d.) data, a novel graph neural network, the Graph Attention TOpic Network (GATON), is proposed to model the topic structure of non-i.i.d. documents, based on two observations. First, pLSI can be interpreted as a stochastic block model (SBM) on a specific bipartite graph. Second, the graph attention network (GAT) can be explained as semi-amortized inference of the SBM, which relaxes the i.i.d. data assumption of vanilla amortized inference. GATON provides a novel, graph-convolution-based scheme to integrate word similarity and word co-occurrence structure. Specifically, the bag-of-words document representation is modeled as a bipartite graph topology. Meanwhile, word embeddings, which capture word similarity, are modeled as the attributes of the word nodes, and term frequency vectors are adopted as the attributes of the document nodes. Based on the weighted (attention) graph convolution operation, word co-occurrence structure and word similarity patterns are seamlessly integrated for topic identification. Extensive experiments demonstrate that the effectiveness of GATON on topic identification not only benefits document classification but also significantly refines the input word embeddings.
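To make the scheme concrete, the following is a minimal, hypothetical PyTorch sketch of one attention-weighted convolution over the bipartite document-word graph described in the abstract: word embeddings attach to word nodes, term-frequency rows attach to document nodes, and nonzero bag-of-words entries define the edges. The class name `BipartiteGATLayer` and all dimensions are illustrative assumptions, not the authors' implementation; multi-head attention and the word-side update are omitted.

```python
# Hypothetical sketch (not the paper's code) of an attention-weighted
# graph convolution on a bipartite document-word graph.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BipartiteGATLayer(nn.Module):
    """Propagates word-node features to document nodes over
    co-occurrence edges, weighted by learned attention."""
    def __init__(self, word_dim, doc_dim, out_dim):
        super().__init__()
        self.w_word = nn.Linear(word_dim, out_dim, bias=False)
        self.w_doc = nn.Linear(doc_dim, out_dim, bias=False)
        self.attn = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, doc_x, word_x, edges):
        # edges: (E, 2) long tensor of (doc_index, word_index) pairs,
        # one per nonzero bag-of-words entry (the bipartite topology).
        d, w = edges[:, 0], edges[:, 1]
        hw = self.w_word(word_x)   # projected word embeddings
        hd = self.w_doc(doc_x)     # projected term-frequency vectors
        # Attention score for each document-word edge.
        e = F.leaky_relu(self.attn(torch.cat([hd[d], hw[w]], dim=-1))).squeeze(-1)
        # Softmax-normalize scores over each document's incident edges.
        alpha = torch.zeros_like(e)
        for doc in d.unique():
            mask = d == doc
            alpha[mask] = F.softmax(e[mask], dim=0)
        # Aggregate attended word features into document representations.
        out = torch.zeros_like(hd)
        out.index_add_(0, d, alpha.unsqueeze(-1) * hw[w])
        return out

# Toy usage: 2 documents over a 4-word vocabulary.
tf = torch.tensor([[2., 1., 0., 0.],
                   [0., 1., 1., 3.]])   # term frequencies (doc attributes)
emb = torch.randn(4, 8)                 # pretrained word embeddings
edges = tf.nonzero()                    # bipartite doc-word edges
layer = BipartiteGATLayer(word_dim=8, doc_dim=4, out_dim=16)
doc_repr = layer(tf, emb, edges)        # (2, 16) document representations
```

In a topic-modeling setting, `out_dim` would correspond to the number of topics, so that the normalized attention weights on a document's edges play the role of per-document word-topic responsibilities; that mapping is an interpretation of the abstract, not a detail it states.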
- Published
- 2020