Graph-based clustering of extracted paraphrases for labelling crime reports
- Source :
- Knowledge-Based Systems. 179:55-76
- Publication Year :
- 2019
- Publisher :
- Elsevier BV, 2019.
Abstract
- Paraphrases are expressions that convey the same context in different wording. Extracting paraphrases from a large text corpus is a challenging task in Natural Language Processing applications. The present work proposes a graph-based clustering technique for discovering labels of crime reports based on paraphrases extracted from large untagged crime corpora. Initially, the entity pairs are represented as shallow parse trees, where the headword in each tree reflects the actual meaning of the phrase between the entities. Although phrases with similar headwords are collected together, many phrases between the entities express a similar context without sharing the same headword. Therefore, clustering is performed to group phrases of similar meaning, termed paraphrases. A complete weighted graph is constructed with the phrases as nodes and the cosine similarity between each pair of phrases as the weight of the edge joining them. The graph is made sparse by removing edges with weights below a threshold value, and the clustering coefficient is calculated for each node. The subgraph(s) comprising the node(s) with the highest clustering coefficient are extracted together with their adjacent edges. The remaining nodes, with their adjacent edges, are added one at a time to an extracted subgraph if and only if the average clustering coefficient of the resulting subgraph increases; an agglomerative merging technique is then applied to merge the extracted subgraphs until no further merging takes place. Finally, each subgraph represents a cluster of phrases and yields one aspect of crime. Based on the extracted paraphrases, the reports can be easily labelled. The proposed work deals with crime reports from the United States of America (USA), the United Arab Emirates (UAE) and India, and the evaluation is performed with various supervised and unsupervised techniques.
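The first graph steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the bag-of-words phrase vectors, the threshold of 0.5, the toy phrases, and all function names are assumptions made for illustration. It covers only the thresholded similarity graph, the per-node clustering coefficient, and the extraction of the highest-coefficient subgraphs; the node-by-node growth and agglomerative merging steps are omitted.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors (assumed representation)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sparse_graph(phrases, threshold):
    """Adjacency sets of the complete similarity graph after dropping
    every edge whose cosine weight is below `threshold`."""
    vecs = [Counter(p.lower().split()) for p in phrases]
    adj = {i: set() for i in range(len(phrases))}
    for i, j in combinations(range(len(phrases)), 2):
        if cosine(vecs[i], vecs[j]) >= threshold:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def clustering_coefficient(adj, node):
    """Local clustering coefficient: fraction of neighbour pairs that are linked."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


def seed_subgraphs(adj):
    """Node sets formed by each highest-coefficient node plus its neighbours."""
    cc = {n: clustering_coefficient(adj, n) for n in adj}
    best = max(cc.values())
    seeds = []
    for n in adj:
        sub = {n} | adj[n]
        if cc[n] == best and sub not in seeds:
            seeds.append(sub)
    return seeds


# Hypothetical phrases, not taken from the paper's corpora.
phrases = [
    "shot dead", "was shot dead", "shot dead yesterday",
    "robbed at gunpoint", "was robbed at gunpoint", "robbed at knifepoint",
]
adj = sparse_graph(phrases, threshold=0.5)
seeds = seed_subgraphs(adj)
for sub in seeds:
    print(sorted(phrases[i] for i in sub))
```

On this toy input the sparse graph splits into two triangles, so the extracted seeds are the three "shot dead" phrases and the three "robbed" phrases. A real corpus would need a better phrase representation and a threshold tuned to the data.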
- Subjects :
- Text corpus
Information Systems and Management
Phrase
Computer science
Cosine similarity
02 engineering and technology
Graph
Management Information Systems
Hierarchical clustering
Artificial Intelligence
020204 information systems
0202 electrical engineering, electronic engineering, information engineering
Graph (abstract data type)
020201 artificial intelligence & image processing
Artificial intelligence
Cluster analysis
Software
Natural language processing
Clustering coefficient
Details
- ISSN :
- 09507051
- Volume :
- 179
- Database :
- OpenAIRE
- Journal :
- Knowledge-Based Systems
- Accession number :
- edsair.doi...........c44e8bd21caca7270100ca6bce7af015