The R Journal: article published in 2010, volume 2:2

The RecordLinkage Package: Detecting Errors in Data
Murat Sariyar and Andreas Borg, The R Journal (2010) 2:2, pages 61-67.

Abstract: Record linkage deals with detecting homonyms and, mainly, synonyms in data. The package RecordLinkage provides means to perform and evaluate different record linkage methods. A stochastic framework is implemented which calculates weights through an EM algorithm. The necessary thresholds in this model can be determined with tools from extreme value theory. Furthermore, machine learning methods are utilized, including decision trees (rpart), bootstrap aggregating (bagging), AdaBoost (ada), neural nets (nnet) and support vector machines (svm). Functions for generating record pairs and comparison patterns from single data items are provided as well. Comparison patterns can be chosen to be binary or based on string metrics. To reduce computation time and memory usage, blocking can be used. Future development will concentrate on additional and refined methods, performance improvements and input/output facilities needed for real-world application.
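
To make the terms "comparison pattern" and "blocking" concrete, here is a minimal Python sketch of the general idea. It does not use the RecordLinkage package or its API. The records, the field names and the helper functions `candidate_pairs` and `binary_pattern` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Toy records; all field names and values are made up for illustration.
records = [
    {"id": 1, "first": "CARSTEN", "last": "MEIER", "birth_year": 1949},
    {"id": 2, "first": "CARSTEN", "last": "MEYER", "birth_year": 1949},
    {"id": 3, "first": "GERD", "last": "BAUER", "birth_year": 1968},
    {"id": 4, "first": "GERD", "last": "BAUER", "birth_year": 1968},
    {"id": 5, "first": "ANNA", "last": "BAUER", "birth_year": 1971},
]
FIELDS = ("first", "last", "birth_year")


def candidate_pairs(recs, block_field=None):
    """Return record pairs; with blocking, only pairs that agree on block_field."""
    if block_field is None:
        return list(combinations(recs, 2))
    blocks = defaultdict(list)
    for rec in recs:
        blocks[rec[block_field]].append(rec)
    pairs = []
    for block in blocks.values():
        pairs.extend(combinations(block, 2))
    return pairs


def binary_pattern(a, b, fields=FIELDS):
    """Binary comparison pattern: 1 where the two records agree, 0 otherwise."""
    return tuple(int(a[f] == b[f]) for f in fields)


all_pairs = candidate_pairs(records)
blocked_pairs = candidate_pairs(records, block_field="birth_year")
patterns = {(a["id"], b["id"]): binary_pattern(a, b) for a, b in blocked_pairs}

print(len(all_pairs), "pairs without blocking,", len(blocked_pairs), "with blocking")
for ids, pattern in patterns.items():
    print(ids, pattern)
```

Blocking on the hypothetical `birth_year` field shrinks the candidate set because only records sharing a birth year are compared. The price is that true matches with a mistyped birth year would be missed. A string-metric variant would replace the exact-equality test in `binary_pattern` with a similarity score between 0 and 1.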



@article{RJ-2010-017,
  author = {Murat Sariyar and Andreas Borg},
  title = {{The RecordLinkage Package: Detecting Errors in Data}},
  year = {2010},
  journal = {{The R Journal}},
  doi = {10.32614/RJ-2010-017},
  url = {https://doi.org/10.32614/RJ-2010-017},
  pages = {61--67},
  volume = {2},
  number = {2}
}