Bilenko & Mooney (2002):

The problem of identifying duplicate records in databases was originally described by Newcombe [17] as record linkage in the context of identifying medical records of the same individual from different time periods. Fellegi and Sunter [7] developed a formal mathematical problem description for record linkage and offered statistical methods for estimating matching parameters and error rates. In more recent work in statistics, Winkler proposed using EM-based methods for estimating error rates and optimal matching rules [24]. This work studied the duplicate detection problem for the specialized domain of census records; therefore, all similarity metrics were hand-tuned for optimal performance in this domain.
Hernández and Stolfo [10] developed the sorted neighborhood method for limiting the number of potential duplicate pairs that require distance computation, while McCallum et al. proposed the canopies clustering algorithm [12] for the task of matching scientific citations. Monge and Elkan developed the iterative merging algorithm based on the union-find data structure [14] and showed the advantages of using a string distance metric that allows gaps [13]. Cohen et al. described the problem of duplicate detection as database hardening: inferring the most likely underlying databases without duplicates (a “hard” database) given a database containing duplicates (a “soft” database) [3]. They proved NP-hardness of solving the problem optimally and proposed a nearly linear-time algorithm for finding a local optimum using the union-find data structure.
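
To make the two ideas in this paragraph concrete, here is a minimal Python sketch (not the code of any of the cited authors) that combines a sorted-neighborhood style window with union-find merging. The sort key (the lowercased record string), the window size, the threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions chosen for illustration; in particular, this is not the gap-allowing string distance of Monge and Elkan. The function names are hypothetical.

```python
from difflib import SequenceMatcher


def find(parent, i):
    """Return the representative of i's set, halving the path as we go."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def union(parent, i, j):
    """Merge the sets containing i and j (smaller root index wins)."""
    ri, rj = find(parent, i), find(parent, j)
    if ri != rj:
        parent[max(ri, rj)] = min(ri, rj)


def cluster_duplicates(records, window=3, threshold=0.85):
    """Group record indices into clusters of likely duplicates.

    Records are sorted by an assumed key (lowercased text), and each
    record is compared only with the next `window - 1` records in that
    order, so far fewer than all n*(n-1)/2 pairs are scored.
    """
    order = sorted(range(len(records)), key=lambda i: records[i].lower())
    parent = list(range(len(records)))
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            a, b = records[i].lower(), records[j].lower()
            # Stand-in similarity in [0, 1]; any string metric could go here.
            if SequenceMatcher(None, a, b).ratio() >= threshold:
                union(parent, i, j)
    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(parent, i), []).append(i)
    return sorted(clusters.values())


records = ["John Smith", "Jon Smith", "Alice Brown", "john smith", "Bob Jones"]
print(cluster_duplicates(records))  # [[0, 1, 3], [2], [4]]
```

Because union-find merges transitively, two records can end up in the same cluster without ever being compared directly, as long as a chain of matching pairs connects them inside the sliding window.
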
In all of these approaches, fixed-cost similarity metrics were used to compare database records. The only previous work on adaptive duplicate detection that we know of is the approach described by Cohen in [2], which learns how to combine multiple similarity metrics to identify duplicates, but does not adaptively tune the underlying field-similarity metrics themselves.
