
Phrase2Vec: Phrase embedding based on parsing.

Authors :: Wu, Yongliang
Zhao, Shuliang
Li, Wenbin
Source :: Information Sciences. May 2020, Vol. 517, pp. 100-127. 28 pp.
Publication Year :: 2020
Abstract:
• BOP (bag of phrases) is a better text representation than BOW (bag of words).
• Phrase2Vec can be solved in two stages: phrase mining and phrase embedding.
• Hierarchical phrases in sentences can be mined by traversing the parse tree.
• Phrase2Vec can effectively solve the problem of phrase embedding.
• BOP can improve the performance of downstream natural language processing tasks, e.g. text categorization and text clustering.

Text is one of the most common forms of unstructured data, and the primary task in text mining is usually to transform text into a structured representation. However, existing text representation models split complete semantic units and neglect word order, which ultimately leads to biased understanding. In this paper, we propose a novel phrase-based text representation method that takes into account the integrity of semantic units and uses vectors to represent the similarity relationships between texts. First, we propose HPMBP (Hierarchical Phrase Mining Based on Parsing), which mines hierarchical phrases by parsing and uses BOP (Bag Of Phrases) to represent text. Then, we put forward three phrase embedding models, collectively called Phrase2Vec: Skip-Phrase, CBOP (Continuous Bag Of Phrases), and GloVeFP (Global Vectors For Phrase Representation). These models learn phrase vectors that capture semantic similarity, from which the vector representation of a text is obtained. Based on Phrase2Vec, we propose PETC (Phrase Embedding based Text Classification) and PETCLU (Phrase Embedding based Text Clustering). PETC uses the phrase embeddings to obtain the text vector, which is fed to a neural network for text classification. PETCLU uses Phrase2Vec to obtain vector representations of texts and cluster centers, and extends the K-means model for text clustering. To the best of our knowledge, this is the first work that focuses on phrase-based English text representation. Experiments show that Phrase2Vec outperforms state-of-the-art phrase embedding models on the similarity task and the analogical reasoning task on the Enwiki, DBLP, and Yelp datasets. PETC is superior to the baseline text classification methods in F1 score by about 4%. PETCLU is also ahead of the prevalent text clustering methods on the entropy and purity indicators. In summary, Phrase2Vec is a promising approach to text mining. [ABSTRACT FROM AUTHOR]
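
The abstract states that hierarchical phrases can be mined by traversing the parse tree but does not give the HPMBP procedure itself. The following is only a minimal illustrative sketch of that general idea, not the authors' algorithm. It assumes a constituency parse is already available as nested tuples of the form (label, child, ...), and the phrase label set, the tree format, and the names `leaves` and `mine_phrases` are all hypothetical choices made for this example.

```python
from typing import List, Tuple, Union

# Assumed tree format: a leaf is a word (str); an inner node is
# (label, child, child, ...). This is not the paper's data structure.
Tree = Union[str, tuple]

# Assumption: which constituent labels count as phrases.
PHRASE_LABELS = {"NP", "VP", "PP"}


def leaves(tree: Tree) -> List[str]:
    """Return the words under a node, left to right."""
    if isinstance(tree, str):
        return [tree]
    words: List[str] = []
    for child in tree[1:]:
        words.extend(leaves(child))
    return words


def mine_phrases(tree: Tree, depth: int = 0) -> List[Tuple[int, str, str]]:
    """Pre-order traversal collecting (depth, label, text) for multi-word phrases."""
    if isinstance(tree, str):
        return []
    found: List[Tuple[int, str, str]] = []
    if tree[0] in PHRASE_LABELS:
        words = leaves(tree)
        if len(words) > 1:  # single words are not treated as phrases here
            found.append((depth, tree[0], " ".join(words)))
    for child in tree[1:]:
        found.extend(mine_phrases(child, depth + 1))
    return found


# Hand-written parse of "the parse tree yields hierarchical phrases".
parse = (
    "S",
    ("NP", ("DT", "the"), ("JJ", "parse"), ("NN", "tree")),
    ("VP", ("VBZ", "yields"),
     ("NP", ("JJ", "hierarchical"), ("NNS", "phrases"))),
)

phrases = mine_phrases(parse)
for depth, label, text in phrases:
    print(depth, label, text)
```

In this sketch the nested noun phrase is reported at a greater depth than the verb phrase that contains it, which is one simple way to keep the hierarchy when building a bag of phrases for a sentence.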