Tabular Schema Matching for Modern Settings
Abstract
Schema matching is a critical data integration process that aims to capture relevance between elements of different datasets; when datasets are tabular, it amounts to discovering related columns among them. Accurately discovering column matches is integral to several applications, such as entity resolution, data cleaning and data augmentation. While a multitude of schema matching methods exists in the literature, we identify three major issues: i) there is no comprehensive study comparing them in terms of effectiveness and efficiency, owing to unavailable implementations and a lack of evaluation datasets, ii) existing methods might be impractical or even inapplicable in certain modern settings, and iii) the heterogeneity and complexity of data can hinder existing methods from capturing relevance among columns, as certain assumptions might not hold for the entirety of the underlying datasets. In this thesis, we tackle these issues by reviewing existing schema matching techniques and proposing novel methods capable of addressing the challenges imposed by modern settings.
Starting with Chapter 2, we present an extensive comparison study of existing schema matching methods by introducing Valentine, an open-source experimental suite that encompasses several state-of-the-art schema matching solutions. To guide the evaluation process towards modern applications, we extract four relatedness scenarios from the dataset discovery literature. To tackle the lack of existing datasets with ground truth, we devise a principled fabrication process. Our findings lead to insights that can help improve future research in the field of schema matching, and they inform the design choices we make for the novel methods presented in the following chapters.
Next, in Chapter 3, we turn our focus to applying schema matching among datasets stored in different data silos, which cannot be colocated and each of which contains information about column matches. To this end, we introduce SiMa, a matching method that leverages the existing matches in each silo to build a column match prediction model powered by a Graph Neural Network (GNN). To do so, SiMa transforms the columns in each silo, and the matches among them, into a graph, and it performs targeted negative edge sampling and incremental training to enhance the learning process. In our experimental evaluation, we show the benefits of using SiMa over state-of-the-art techniques, both in terms of effectiveness and efficiency.
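To make the graph view of a silo concrete, the following minimal sketch treats columns as nodes and known matches as edges, and then draws negative examples from unmatched column pairs. It is an illustration only and not SiMa's implementation: the function names, the example columns and the uniform random sampling are assumptions, whereas the thesis describes a targeted sampling strategy whose details are not given here.

```python
import itertools
import random


def build_silo_graph(columns, matches):
    """Return adjacency sets: nodes are columns, edges are known matches."""
    graph = {c: set() for c in columns}
    for a, b in matches:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def sample_negative_edges(graph, k, seed=0):
    """Sample up to k column pairs that are NOT connected by a match edge.

    Assumption: uniform sampling, used here as a stand-in for the
    targeted strategy mentioned in the text.
    """
    rng = random.Random(seed)
    nodes = sorted(graph)
    candidates = [
        (a, b)
        for a, b in itertools.combinations(nodes, 2)
        if b not in graph[a]
    ]
    return rng.sample(candidates, min(k, len(candidates)))


# Hypothetical silo with four columns and one known match.
columns = ["customers.id", "orders.customer_id", "orders.total", "customers.name"]
matches = [("customers.id", "orders.customer_id")]

graph = build_silo_graph(columns, matches)
negatives = sample_negative_edges(graph, k=3)
```

The positive edges in `matches` and the sampled `negatives` together form the kind of labelled pairs on which a link prediction model could be trained.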
Finally, Chapter 4 discusses the problem of discovering join relationships among datasets in a repository. To ameliorate the shortcomings of previous methods, we propose OmniMatch, a self-supervised method that can effectively capture both equi- and fuzzy-joins among tabular data. At the core of the method is the exploitation of a comprehensive set of similarity signals among columns, which are then transformed into a similarity graph. This graph, in conjunction with automatically generated positive and negative column match examples, enables the use of a Relational Graph Convolution Network (RGCN) to train a generalizable join prediction model. We compare the effectiveness of OmniMatch with several other state-of-the-art matching and column representation methods, and we verify the usefulness of utilizing a wide spectrum of similarity signals to capture joins.
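As a rough illustration of how several similarity signals can be turned into a similarity graph, the sketch below computes two signals per column pair and emits one typed edge per signal that exceeds a threshold. The two signals (Jaccard overlap of values and column name similarity), the threshold of 0.5 and all names are assumptions chosen for the example; the actual set of signals used by OmniMatch is not specified in this abstract.

```python
from difflib import SequenceMatcher
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap between the distinct values of two columns."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def name_similarity(a, b):
    """Case-insensitive string similarity between two column names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def similarity_graph(tables, threshold=0.5):
    """Build typed edges between columns of different tables.

    tables: {table_name: {column_name: [values]}}
    Returns a list of (relation, column_a, column_b, score) tuples,
    one per signal whose score reaches the threshold.
    """
    cols = [(t, c, v) for t, cs in tables.items() for c, v in cs.items()]
    edges = []
    for (t1, c1, v1), (t2, c2, v2) in combinations(cols, 2):
        if t1 == t2:
            continue  # only relate columns across tables
        signals = {
            "value_overlap": jaccard(v1, v2),
            "name": name_similarity(c1, c2),
        }
        for relation, score in signals.items():
            if score >= threshold:
                edges.append((relation, f"{t1}.{c1}", f"{t2}.{c2}", score))
    return edges


# Hypothetical repository with two small tables.
tables = {
    "customers": {"id": [1, 2, 3, 4], "city": ["Delft", "Leiden"]},
    "orders": {"customer_id": [2, 3, 4, 5], "total": [9.5, 20.0]},
}
edges = similarity_graph(tables)
```

Keeping one edge type per signal gives a multi-relational graph, which is the kind of input a relational graph convolution model operates on.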
We conclude the thesis by reviewing our main findings, reflecting on our contributions and discussing potential limitations of the methods and approaches presented. Moreover, based on the insights we gain from surveying and developing novel matching methods, we discuss challenges and future directions in the field.