Back to Search Start Over

FrepJoin: an efficient partition-based algorithm for edit similarity join

Authors :
Jizhou Luo
Jianzhong Li
Shengfei Shi
Hongzhi Wang
Source :
Frontiers of Information Technology & Electronic Engineering. 18:1499-1510
Publication Year :
2017
Publisher :
Zhejiang University Press, 2017.

Abstract

String similarity join (SSJ) is essential for many applications where near-duplicate objects need to be found. This paper targets SSJ with edit distance constraints. The existing algorithms usually adopt the filter-andrefine framework. They cannot catch the dissimilarity between string subsets, and do not fully exploit the statistics such as the frequencies of characters. We investigate to develop a partition-based algorithm by using such statistics. The frequency vectors are used to partition datasets into data chunks with dissimilarity between them being caught easily. A novel algorithm is designed to accelerate SSJ via the partitioned data. A new filter is proposed to leverage the statistics to avoid computing edit distances for a noticeable proportion of candidate pairs which survive the existing filters. Our algorithm outperforms alternative methods notably on real datasets.

Details

ISSN :
20959230 and 20959184
Volume :
18
Database :
OpenAIRE
Journal :
Frontiers of Information Technology & Electronic Engineering
Accession number :
edsair.doi...........84c20d8f547b5af39bd5befab27d06a3
Full Text :
https://doi.org/10.1631/fitee.1601347