reclin2: a Toolkit for Record Linkage and Deduplication.

Authors :
van der Laan, D. Jan
Source :
R Journal. Jun2022, Vol. 14 Issue 2, p320-328. 9p.
Publication Year :
2022

Abstract

The goal of record linkage and deduplication is to detect which records belong to the same object in data sets where the identifiers of the objects contain errors and missing values. The main design considerations of reclin2 are: modularity/flexibility, speed and the ability to handle large data sets. The first point makes it easy for users to extend the package with custom process steps. This flexibility is obtained by using simple data structures and by following common interfaces in R as closely as possible. For large problems it is possible to distribute the work over multiple worker nodes. A benchmark comparison with other record linkage packages for R shows that, for this specific benchmark, the fastLink package performs best. However, fastLink only performs one specific type of record linkage model. The performance of reclin2 is not far behind that of fastLink, while allowing for much greater flexibility. [ABSTRACT FROM AUTHOR]

Subjects

Subjects :
*BIG data

Details

Language :
English
ISSN :
2073-4859
Volume :
14
Issue :
2
Database :
Academic Search Index
Journal :
R Journal
Publication Type :
Academic Journal
Accession number :
162463005
Full Text :
https://doi.org/10.32614/rj-2022-038