Back to Search
Start Over
FrepJoin: an efficient partition-based algorithm for edit similarity join
- Source :
- Frontiers of Information Technology & Electronic Engineering. 18:1499-1510
- Publication Year :
- 2017
- Publisher :
- Zhejiang University Press, 2017.
-
Abstract
- String similarity join (SSJ) is essential for many applications where near-duplicate objects need to be found. This paper targets SSJ with edit distance constraints. The existing algorithms usually adopt the filter-andrefine framework. They cannot catch the dissimilarity between string subsets, and do not fully exploit the statistics such as the frequencies of characters. We investigate to develop a partition-based algorithm by using such statistics. The frequency vectors are used to partition datasets into data chunks with dissimilarity between them being caught easily. A novel algorithm is designed to accelerate SSJ via the partitioned data. A new filter is proposed to leverage the statistics to avoid computing edit distances for a noticeable proportion of candidate pairs which survive the existing filters. Our algorithm outperforms alternative methods notably on real datasets.
- Subjects :
- Alternative methods
Exploit
Computer Networks and Communications
Computer science
020207 software engineering
02 engineering and technology
Hardware and Architecture
020204 information systems
Signal Processing
0202 electrical engineering, electronic engineering, information engineering
Leverage (statistics)
Edit distance
Electrical and Electronic Engineering
String metric
Algorithm
Subjects
Details
- ISSN :
- 20959230 and 20959184
- Volume :
- 18
- Database :
- OpenAIRE
- Journal :
- Frontiers of Information Technology & Electronic Engineering
- Accession number :
- edsair.doi...........84c20d8f547b5af39bd5befab27d06a3
- Full Text :
- https://doi.org/10.1631/fitee.1601347