
Graph-based clustering of extracted paraphrases for labelling crime reports.

Authors :
Das, Priyanka
Das, Asit Kumar
Source :
Knowledge-Based Systems. Sep2019, Vol. 179, p55-76. 22p.
Publication Year :
2019

Abstract

Paraphrases are expressions that convey the same context in different wording. Extracting paraphrases from a large text corpus is a challenging task in Natural Language Processing applications. The present work proposes a graph-based clustering technique for discovering labels of crime reports based on paraphrases extracted from large untagged crime corpora. Initially, the entity pairs are represented as shallow parse trees, where the headword in each tree reflects the actual meaning of the phrase between the entities. Although phrases with similar headwords are collected together, many phrases between the entities express a similar context without sharing the same headword. Therefore, clustering is performed to create groups of phrases with similar meaning, termed paraphrases. A complete weighted graph is constructed with the phrases as nodes, where the weight of each edge is the cosine similarity between the two phrases it connects. The graph is made sparse by removing edges with weights below a threshold value, and the clustering coefficient is calculated for each node. The subgraph(s) comprising the node(s) with the highest clustering coefficient are extracted along with their adjacent edges. The remaining nodes, with their adjacent edges in the graph, are added one at a time to an extracted subgraph if and only if the average clustering coefficient of the resultant subgraph increases. An agglomerative merging technique is then applied to merge the extracted subgraphs until no further merging takes place. Finally, each subgraph represents a cluster of phrases that yields one aspect of crime. Based on the extracted paraphrases, the reports can be easily labelled. The proposed work deals with crime reports for the United States of America (USA), the United Arab Emirates (UAE) and India, and the evaluation is performed in terms of various supervised and unsupervised techniques.

• Novel approach for labelling crime reports.
• Three important crime aspects are considered.
• Graph based hierarchical clustering of paraphrases is done.
• Proposed sparsity scheme is compared with edge density and Gini index based sparsity.
• The experimental results show the effectiveness of the work.

[ABSTRACT FROM AUTHOR]
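The graph steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions the abstract does not specify: phrases are turned into simple bag-of-words count vectors, the clustering coefficient is the standard unweighted local one, a candidate node must touch the subgraph before it can be added, and all function names (`sparse_graph`, `clustering_coefficient`, `seed_subgraphs`, `grow`) are hypothetical. The final agglomerative merging step is left out.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sparse_graph(phrases, threshold):
    """Score every phrase pair (complete graph) and keep edges >= threshold."""
    # Assumption: whitespace-tokenised bag-of-words vectors.
    vectors = [Counter(p.lower().split()) for p in phrases]
    adj = {i: set() for i in range(len(phrases))}
    for i, j in combinations(range(len(phrases)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def clustering_coefficient(adj, node, members=None):
    """Unweighted local clustering coefficient, optionally inside `members`."""
    nbrs = adj[node] if members is None else adj[node] & members
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


def average_cc(adj, members):
    """Mean clustering coefficient of the subgraph induced by `members`."""
    return sum(clustering_coefficient(adj, n, members) for n in members) / len(members)


def seed_subgraphs(adj):
    """Node(s) with the highest clustering coefficient plus their neighbours."""
    cc = {n: clustering_coefficient(adj, n) for n in adj}
    best = max(cc.values())
    seeds = {frozenset({n} | adj[n]) for n in adj if cc[n] == best}
    return [set(s) for s in seeds]


def grow(adj, seed):
    """Add remaining nodes one at a time if the average coefficient increases."""
    cluster = set(seed)
    for node in sorted(set(adj) - cluster):
        candidate = cluster | {node}
        # Assumption: only nodes adjacent to the current subgraph are considered.
        if adj[node] & cluster and average_cc(adj, candidate) > average_cc(adj, cluster):
            cluster = candidate
    return cluster
```

As a usage illustration with made-up phrases, calling `sparse_graph(["shot and killed", "shot and wounded", "shot and injured", "stole cash from", "stole jewellery from"], 0.5)` links the three "shot and ..." phrases into a triangle, which `seed_subgraphs` then returns as the single seed cluster, while the two "stole ... from" phrases remain a separate pair.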

Details

Language :
English
ISSN :
09507051
Volume :
179
Database :
Academic Search Index
Journal :
Knowledge-Based Systems
Publication Type :
Academic Journal
Accession number :
136936074
Full Text :
https://doi.org/10.1016/j.knosys.2019.05.004