Holistic Schema Matching at Scale

Abstract

Schema matching is a fundamental task in the data integration pipeline and has been studied extensively over the past decades, producing many novel schema matching methods. However, these methods do not follow a standard evaluation process, leaving it unclear which method performs best in terms of matching accuracy and runtime, for which specific schema matching category, and with which hyperparameters. To resolve this uncertainty, a scalable way to assess the field's progress was needed, which motivates the first contribution of this work: a scalable benchmarking suite for schema matching tasks. While building the benchmark, we also observed that the literature lacked a scalable holistic schema matching system, which motivates our second contribution. Guided by the knowledge gained from our proposed benchmark, we developed a system that can incorporate any matching algorithm and data source while running schema matching jobs in parallel across multiple machines in a scalable fashion. Furthermore, we give the users of such a system a leading role: the benchmark made apparent that no algorithm performs well in every situation, and in mission-critical applications mistakes cannot be afforded. Users therefore approve the proposed matches, and we focused on making this approval step scalable, fast, and straightforward.