101. Network-Based Bag-of-Words Model for Text Classification
- Author
- Shuang Gu, Dongyang Yan, Liu Yang, and Keping Li
- Subjects
Dynamic network analysis, General Computer Science, Relation (database), Computer science, Property (programming), Bag-of-words, KNN, computer.software_genre, 01 natural sciences, 010305 fluids & plasmas, complex network, 0103 physical sciences, text correlation, General Materials Science, 010306 general physics, Representation (mathematics), business.industry, General Engineering, Task (computing), classification, Bag-of-words model, The Internet, Data mining, lcsh:Electrical engineering. Electronics. Nuclear engineering, Element (category theory), business, computer, lcsh:TK1-9971
- Abstract
The rapidly developing internet and other media have produced a tremendous amount of text data, making effective machine analysis of text both a challenging and a valuable task. Text representation is the first step for a machine to understand text, and the most commonly used text representation method is the Bag-of-Words (BoW) model. To form the vector representation of a document, the BoW model matches and counts each element in the document separately, neglecting much of the correlation information among words. In this paper, we propose a network-based bag-of-words model, which captures high-level structural and semantic information about the words. Because the structural and semantic information of a network reflects the relationships between nodes, the proposed model can distinguish the relations among words. We apply the proposed model to text classification and compare its performance with different text representation methods on four document datasets. The results show that the proposed method achieves the best performance with high efficiency. Using the eccentricity property of the network as the feature yields the highest accuracy. We also investigate the influence of different network structures on the proposed method. Experimental results reveal that, for text classification, the dynamic network is more suitable than the static network and the hybrid network.
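To make the contrast in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It compares a plain BoW count vector with a vector built from node eccentricity in a word network. The abstract does not say how the network is constructed, so the adjacent-word co-occurrence graph, the `window` parameter, and all function names here are assumptions made purely for illustration.

```python
from collections import Counter, deque


def bag_of_words(tokens, vocabulary):
    """Classic BoW: count each vocabulary word independently."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def cooccurrence_graph(tokens, window=2):
    """Undirected graph linking words that occur within `window` positions.

    Assumption: this construction is illustrative only; the abstract does
    not specify how the paper builds its network.
    """
    graph = {word: set() for word in tokens}
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def eccentricity(graph, source):
    """Largest shortest-path distance from `source` to any reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return max(dist.values())


def network_features(tokens, vocabulary, window=2):
    """One eccentricity value per vocabulary word (0 if the word is absent)."""
    graph = cooccurrence_graph(tokens, window)
    return [eccentricity(graph, w) if w in graph else 0 for w in vocabulary]


tokens = "the cat sat on the mat".split()
vocabulary = sorted(set(tokens))  # ['cat', 'mat', 'on', 'sat', 'the']

bow_vector = bag_of_words(tokens, vocabulary)
net_vector = network_features(tokens, vocabulary)

print(bow_vector)  # [1, 1, 1, 1, 2]
print(net_vector)  # [2, 3, 2, 3, 2]
```

In this toy document the BoW vector only records that "the" appears twice, while the eccentricity vector depends on where each word sits relative to the others. Either vector could then be passed to a classifier such as KNN, which the subject keywords mention.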
- Published
- 2020