Graph-based clustering of extracted paraphrases for labelling crime reports

Authors :
Priyanka Das
Asit Kumar Das
Source :
Knowledge-Based Systems. 179:55-76
Publication Year :
2019
Publisher :
Elsevier BV, 2019.

Abstract

Paraphrases are commonly understood as synonymous expressions that convey the same context in different wording. Extracting paraphrases from a large text corpus is a challenging task in Natural Language Processing applications. The present work proposes a graph-based clustering technique for discovering labels of crime reports from paraphrases extracted from large untagged crime corpora. Initially, the entity pairs are represented as shallow parse trees, where the headword in each tree reflects the actual meaning of the phrase between the entities. Although phrases with similar headwords are collected together, many phrases between the entities express a similar context without sharing the same headword. Therefore, clustering is performed to create groups of phrases with similar meaning, termed paraphrases. A complete weighted graph is constructed with the phrases as nodes and the cosine similarity between each pair of phrases as the weight of the edge joining them. The graph is made sparse by removing edges whose weights fall below a threshold value, and a clustering coefficient is calculated for each node. The subgraph(s) comprising the node(s) with the highest clustering coefficient are extracted together with their adjacent edges. The remaining nodes, with their adjacent edges, are added one at a time to an extracted subgraph if and only if the average clustering coefficient of the resulting subgraph increases, and an agglomerative merging technique is applied to merge the extracted subgraphs until no further merging takes place. Finally, each subgraph represents a cluster of phrases that yields one aspect of crime. Based on the extracted paraphrases, the reports can be easily labelled. The proposed work deals with crime reports from the United States of America (USA), the United Arab Emirates (UAE) and India, and the evaluation is performed using various supervised and unsupervised techniques.
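
The graph steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several details are assumptions made only for illustration: phrases are compared with a plain bag-of-words cosine similarity, the seed subgraph is taken to be the highest-coefficient node plus its neighbours, the clustering coefficient of a subgraph is measured on the induced subgraph, and the final agglomerative merging step is omitted. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Bag-of-words cosine similarity between two phrases (assumed representation)."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0


def sparse_graph(phrases, threshold):
    """Build the complete similarity graph, keeping only edges with weight >= threshold."""
    adj = {p: set() for p in phrases}
    for p, q in combinations(phrases, 2):
        if cosine(p, q) >= threshold:
            adj[p].add(q)
            adj[q].add(p)
    return adj


def clustering_coefficient(adj, node, within=None):
    """Local clustering coefficient of node, optionally restricted to a node subset."""
    nbrs = adj[node] if within is None else adj[node] & within
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


def average_cc(adj, nodes):
    """Average clustering coefficient of the subgraph induced by nodes."""
    return sum(clustering_coefficient(adj, n, nodes) for n in nodes) / len(nodes)


def grow_cluster(adj):
    """Seed with the highest-coefficient node and its neighbours, then add a
    remaining node only if the average clustering coefficient increases."""
    seed = max(sorted(adj), key=lambda n: clustering_coefficient(adj, n))
    cluster = {seed} | adj[seed]
    best = average_cc(adj, cluster)
    improved = True
    while improved:
        improved = False
        for cand in sorted(set(adj) - cluster):
            score = average_cc(adj, cluster | {cand})
            if score > best:
                cluster.add(cand)
                best = score
                improved = True
    return cluster
```

Calling `grow_cluster(sparse_graph(phrases, threshold))` on a list of distinct phrases returns one candidate paraphrase cluster. The threshold value is a free parameter here, since the abstract does not state the one used.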

Details

ISSN :
09507051
Volume :
179
Database :
OpenAIRE
Journal :
Knowledge-Based Systems
Accession number :
edsair.doi...........c44e8bd21caca7270100ca6bce7af015