Dirty Entity Resolution starring Locality-Sensitive Hashing on Incoherent and Incomplete Distributed Big Data: A Feasibility Study


Abstract

In real-world scenarios, users provide invaluable data; however, this data is inherently incoherent, incomplete, and duplicated, i.e., different data rows refer to the same real-world object. Merging duplications into a single entry broadens the knowledge of a given real-world object represented within a data set. Applying a straightforward cross-join operation to check for duplications is computationally intractable in big data scenarios. Instead of comparing all rows, similarity can be approximated with Locality-Sensitive Hashing (LSH). MinHash, an LSH scheme, produces colliding hash values for rows that share tokens. With enough matching hashes, two rows can be deemed similar. The applications of LSH within an Entity Resolution (ER) pipeline on incoherent and incomplete big data are thoroughly investigated in this work. An ER pipeline is introduced to make the deduplication problem tractable. This pipeline consists of four phases: preprocessing, blocking, matching, and clustering. In the preprocessing phase, the data is made coherent. In the blocking phase, the data is divided into smaller main blocks by blocking on complete properties. These main blocks contain the majority of True-Matches. The created blocks can be augmented with additional blocks that rely on incomplete properties. In the matching phase, only the Matches found within the blocks are compared, significantly reducing the required number of comparisons. The incomplete properties cannot be transformed into features for every Match; therefore, classifiers are trained per property combination, allowing for optimal Match classification. In the clustering phase, the duplicated rows are merged into a single Entity. These Entities are created from disconnected sub-graphs of Positive-Matches. These sub-graphs are particularly susceptible to noise; therefore, novel implementations of MinHash are used for noise removal and for finding missed True-Matches. The created Entities were tested on an existing model. Blocks constructed with MinHash on N-gram tokens allow spelling mistakes in the user data to be overcome, which improves the Pairs Completeness (PC) at the cost of Pairs Quality (PQ) and Reduction Ratio (RR). Introducing augmentative blocks increases PC significantly at the cost of PQ and a small cost in RR. Multiple runs are investigated throughout the pipeline. The runs relying on MinHash showed increased PC at the cost of PQ and RR. A novel concept of a double-pass pipeline is introduced, in which the baseline ER pipeline is applied first, followed by a more expensive error-corrective MinHash ER pipeline on the Entities created by the first pass, combining the strengths of the investigated runs.
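As a rough illustration of the blocking idea described above, the following Python sketch shows how MinHash over character N-gram tokens produces colliding minimum hash values for rows that share many tokens, so a spelling mistake still leaves enough collisions to treat two rows as candidate Matches. This is an assumed, simplified sketch, not the thesis implementation; the helper names (`ngrams`, `minhash_signature`, `estimated_jaccard`), the salted SHA-1 hash family, and the example records are illustrative choices only.

```python
# Minimal MinHash-over-N-grams sketch (illustrative only, not the thesis code).
import hashlib


def ngrams(text: str, n: int = 3) -> set[str]:
    """Split a lowercased string into overlapping character N-gram tokens."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def minhash_signature(tokens: set[str], num_hashes: int = 64) -> list[int]:
    """For each of `num_hashes` salted hash functions, keep the minimum hash
    value over all tokens; rows sharing tokens tend to share these minima."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{tok}".encode()).digest()[:8], "big")
            for tok in tokens
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of colliding signature positions approximates the Jaccard
    similarity of the underlying token sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


# Two rows referring to the same real-world person, one with a spelling mistake,
# plus an unrelated row (hypothetical example data).
row_a = minhash_signature(ngrams("Jonathan Smithe, Amsterdam"))
row_b = minhash_signature(ngrams("Jonathon Smith, Amsterdam"))
row_c = minhash_signature(ngrams("completely different record"))

print(estimated_jaccard(row_a, row_b))  # high: many collisions -> candidate Match
print(estimated_jaccard(row_a, row_c))  # low: few collisions -> not blocked together
```

In a full LSH blocking scheme, the signature would additionally be split into bands so that rows colliding in any band land in the same block, avoiding pairwise signature comparisons across the whole data set.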