Bilenko & Mooney (2002):
The problem of identifying duplicate records in databases was originally described by Newcombe as record linkage, in the context of identifying medical records of the same individual from different time periods. Fellegi and Sunter developed a formal mathematical formulation of record linkage and offered statistical methods for estimating matching parameters and error rates. In more recent statistical work, Winkler proposed EM-based methods for estimating error rates and optimal matching rules. That work studied duplicate detection in the specialized domain of census records, so all similarity metrics were hand-tuned for optimal performance in that domain.
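To make the agreement-weight idea behind the Fellegi-Sunter framework concrete, the following is a minimal sketch of how per-field match (m) and non-match (u) probabilities combine into a log-likelihood weight for a record pair. The field names and probability values are illustrative assumptions, and parameter estimation itself (e.g., Winkler's EM-based approach) is not shown.

```python
import math

# Hypothetical m- and u-probabilities for three fields; values are illustrative,
# not taken from the census-record studies cited above.
m_probs = {"name": 0.95, "birth_year": 0.90, "zip": 0.85}   # P(field agrees | true match)
u_probs = {"name": 0.01, "birth_year": 0.10, "zip": 0.05}   # P(field agrees | non-match)

def match_weight(agreements):
    """Sum of per-field log-likelihood ratios for an observed agreement pattern."""
    weight = 0.0
    for field, agrees in agreements.items():
        m, u = m_probs[field], u_probs[field]
        weight += math.log(m / u) if agrees else math.log((1 - m) / (1 - u))
    return weight

# A pair agreeing on name and zip but disagreeing on birth year.
print(match_weight({"name": True, "birth_year": False, "zip": True}))
```

Pairs whose total weight exceeds an upper threshold are declared matches, those below a lower threshold non-matches, with the middle region left for clerical review.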
Hernández and Stolfo developed the sorted neighborhood method for limiting the number of potential duplicate pairs that require distance computation, while McCallum et al. proposed the canopies clustering algorithm for the task of matching scientific citations. Monge and Elkan developed an iterative merging algorithm based on the union-find data structure and showed the advantages of using a string distance metric that allows gaps. Cohen et al. described the problem of duplicate detection as database hardening: inferring the most likely underlying duplicate-free database (a "hard" database) given a database containing duplicates (a "soft" database). They proved that solving this problem optimally is NP-hard and proposed a nearly linear-time algorithm for finding a local optimum using the union-find data structure.
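As a rough illustration of the union-find-based merging step mentioned above, the sketch below unions record pairs that some similarity metric judges close enough, then reads off duplicate clusters. The threshold, the placeholder similarity function, and the all-pairs comparison loop are illustrative assumptions; they are not the cited algorithms themselves, which avoid comparing all pairs.

```python
def find(parent, i):
    """Find the cluster representative of record i, with path compression."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def union(parent, i, j):
    """Merge the clusters containing records i and j."""
    ri, rj = find(parent, i), find(parent, j)
    if ri != rj:
        parent[rj] = ri

def cluster_duplicates(records, similarity, threshold=0.9):
    """Group records into duplicate clusters using a fixed-cost similarity metric."""
    parent = list(range(len(records)))
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similarity(records[i], records[j]) >= threshold:
                union(parent, i, j)
    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(parent, i), []).append(i)
    return list(clusters.values())
```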
All of these approaches use fixed-cost similarity metrics to compare database records. The only previous work on adaptive duplicate detection that we know of is the approach described by Cohen, which learns how to combine multiple similarity metrics to identify duplicates but does not adaptively tune the underlying field-similarity metrics themselves.
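To illustrate the general idea of adaptively combining several fixed field-similarity scores (as opposed to tuning the field metrics themselves), the sketch below fits a simple classifier over per-field similarities for labeled pairs. This stands in for the combination step only; it is not Cohen's actual algorithm, and the field names and training values are hypothetical.

```python
from sklearn.linear_model import LogisticRegression

# Each row holds (name, address, phone) similarity scores for one labeled record pair.
X = [[0.95, 0.80, 1.00],   # duplicate pair
     [0.90, 0.75, 0.90],   # duplicate pair
     [0.30, 0.20, 0.00],   # distinct pair
     [0.55, 0.10, 0.00]]   # distinct pair
y = [1, 1, 0, 0]

# Learn a weighting of the fixed field-similarity metrics from labeled pairs.
model = LogisticRegression().fit(X, y)
print(model.predict([[0.85, 0.70, 0.95]]))   # likely classified as a duplicate
```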