Holistic Schema Matching at Scale

Master Thesis (2020)
Author(s)

Kyriakos Psarakis (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

A. Katsifodimos – Mentor (TU Delft - Web Information Systems)

G.J.P.M. Houben – Graduation committee member (TU Delft - Web Information Systems)

A. van Deursen – Graduation committee member (TU Delft - Software Technology)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2020
Language
English
Graduation Date
03-12-2020
Awarding Institution
Delft University of Technology
Programme
Computer Science | Software Technology
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Schema matching is a fundamental task in the data integration pipeline and has been studied extensively over the past decades, producing many novel schema matching methods. However, these methods do not follow a standard evaluation process, so it remains unclear which method performs best in terms of matching accuracy and runtime, for which schema matching category, and with which hyperparameters. To resolve this uncertainty, a scalable benchmarking suite that measures the field's progress is needed; such a suite for schema matching tasks is the first contribution of this work. Meanwhile, we realized that the literature also lacked a scalable holistic schema matching system, which led to our second contribution. Using the knowledge gained from the proposed benchmark, we developed a system that can incorporate any algorithm and data source while running schema matching jobs in parallel across multiple machines. Furthermore, we gave the users of the system a leading role: the benchmark showed that no algorithm is perfect in every situation, and mission-critical applications cannot afford mistakes. Users therefore have to approve the proposed matches, and we focused on making this task scalable, fast, and straightforward.
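
The architecture summarized in the abstract (pluggable matching algorithms, parallel matching jobs, and a ranked list of proposals for users to approve) can be illustrated with a minimal sketch. It is not the thesis implementation. Every name in it (`Table`, `Matcher`, `jaccard_value_matcher`, `holistic_match`) is hypothetical, the Jaccard value-overlap matcher is only a stand-in for whichever algorithm is plugged in, and a local thread pool stands in for distribution across multiple machines.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
from typing import Callable, Dict, List, Tuple

# Hypothetical types: a table is {column_name: list_of_values}.
Table = Dict[str, List[str]]
Match = Tuple[str, str, float]  # (source column, target column, score)
Matcher = Callable[[Table, Table], List[Match]]


def jaccard_value_matcher(source: Table, target: Table) -> List[Match]:
    """Stand-in algorithm: Jaccard overlap of the distinct values of two columns."""
    matches = []
    for s_col, s_vals in source.items():
        for t_col, t_vals in target.items():
            s_set, t_set = set(s_vals), set(t_vals)
            union = s_set | t_set
            score = len(s_set & t_set) / len(union) if union else 0.0
            matches.append((s_col, t_col, score))
    return matches


def holistic_match(
    tables: Dict[str, Table], matcher: Matcher, workers: int = 4
) -> List[Match]:
    """Run one matching job per table pair in parallel and rank all proposals."""
    pairs = list(combinations(sorted(tables), 2))
    # Assumption: threads on one machine stand in for a distributed job queue.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda p: matcher(tables[p[0]], tables[p[1]]), pairs)
        ranked = [
            (f"{a}.{s}", f"{b}.{t}", score)
            for (a, b), job in zip(pairs, results)
            for s, t, score in job
        ]
    # Highest score first; a user would then approve or reject each proposal.
    return sorted(ranked, key=lambda m: m[2], reverse=True)
```

Because the matcher is passed in as a plain callable, a different algorithm can be swapped in without changing the job scheduling, and the ranked output is the list a reviewer would work through from the top.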
