C. Koutras | TU Delft Repository

Tabular Schema Matching for Modern Settings

Doctoral thesis (2024) - C. Koutras, G.J.P.M. Houben, A. Katsifodimos, C. Lofi

Schema matching is a critical data integration process that aims to capture relevance between elements of different datasets; when datasets are tabular, it amounts to discovering related columns among them. Accurately discovering column matches is integral to several applications, such as entity resolution, data cleaning and data augmentation. While a multitude of schema matching methods exists in the literature, we identify three major issues: i) there is no comprehensive study comparing them in terms of effectiveness and efficiency, owing to unavailable implementations and a lack of evaluation datasets, ii) existing methods might be impractical or even inapplicable in certain modern settings, and iii) the heterogeneity and complexity of data can hinder existing methods in capturing relevance among columns, as certain assumptions might not hold for the entirety of the underlying datasets. In this thesis, we tackle these issues by reviewing existing schema matching techniques and proposing novel methods capable of addressing the challenges imposed by modern settings.
Starting with Chapter 2, we present an extensive comparison study of existing schema matching methods by introducing Valentine, an open-source experimental suite that encompasses several state-of-the-art schema matching solutions. To guide the evaluation process towards modern applications, we extract four relatedness scenarios from the dataset discovery literature. To tackle the lack of existing datasets with ground truth, we devise a principled fabrication process. Our findings lead to insights that can help improve future research in the field of schema matching, and they inform the design choices we make for the novel methods presented in the following chapters.
Next, in Chapter 3, we turn our focus to applying schema matching among datasets stored in different data silos, which cannot be collocated and each of which contains information about column matches. To this end, we introduce SiMa, a matching method that leverages the existing matches in each silo to build a column match prediction model powered by a Graph Neural Network (GNN). To do so, SiMa transforms the columns in each silo, and the matches among them, into a graph, and performs targeted negative edge sampling and incremental training to enhance the learning process. In our experimental evaluation, we show the benefits of using SiMa over state-of-the-art techniques in terms of both effectiveness and efficiency.
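As a rough illustration of the graph construction and negative edge sampling mentioned above, the following sketch treats columns as nodes and known matches as positive edges, then draws non-matching column pairs as negative examples. It samples uniformly at random, which is an assumption made for brevity; the abstract describes a targeted sampling strategy but gives no details of it. The column names and the `sample_negative_edges` helper are hypothetical.

```python
import itertools
import random


def sample_negative_edges(columns, matches, k, seed=0):
    """Return up to k column pairs that are not known matches.

    Assumption: uniform random sampling, not the targeted strategy
    that the abstract refers to.
    """
    # Normalise edges so that (a, b) and (b, a) count as the same pair.
    positive = {frozenset(m) for m in matches}
    candidates = [
        pair
        for pair in itertools.combinations(sorted(columns), 2)
        if frozenset(pair) not in positive
    ]
    rng = random.Random(seed)
    return rng.sample(candidates, min(k, len(candidates)))


# Hypothetical columns and known matches inside one silo.
columns = ["cust.id", "cust.name", "orders.customer_id", "orders.total"]
matches = [("cust.id", "orders.customer_id")]

negatives = sample_negative_edges(columns, matches, k=3)
```

The positive and negative pairs together would form the labelled edge set on which a link prediction model could be trained.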
Finally, Chapter 4 discusses the problem of discovering join relationships among datasets in a repository. To ameliorate the shortcomings of previous methods, we propose OmniMatch, a self-supervised method that can effectively capture both equi- and fuzzy-joins among tabular data. At the core of the method is the exploitation of a comprehensive set of similarity signals among columns, which are then transformed into a similarity graph. This graph, in conjunction with automatically generated positive and negative column match examples, enables the use of a Relational Graph Convolution Network (RGCN) to train a generalizable join prediction model. We compare the effectiveness of OmniMatch with several other state-of-the-art matching and column representation methods, and we verify the usefulness of utilizing a wide spectrum of similarity signals to capture joins.
We conclude the thesis by reviewing our main findings, reflecting on our contributions and discussing potential limitations of the methods and approaches presented. Moreover, based on the insights we gain from surveying and developing novel matching methods, we discuss challenges and future directions in the field.

Data Lakes

A Survey of Functions and Systems

Journal article (2023) - Rihan Hai, Christos Koutras, Christoph Quix, Matthias Jarke

Data lakes are becoming increasingly prevalent for Big Data management and data analytics. In contrast to traditional 'schema-on-write' approaches such as data warehouses, data lakes are repositories that store raw data in its original formats and provide a common access interface. Despite the strong interest from both academia and industry, there is considerable ambiguity regarding the definition, functions and available technologies for data lakes. A complete, coherent picture of data lake challenges and solutions is still missing. This survey reviews the development, architectures, and systems of data lakes. We provide a comprehensive overview of research questions for designing and building data lakes. We classify the existing approaches and systems based on the functions they provide for data lakes, which makes this survey a useful technical reference for designing, implementing and deploying data lakes. We hope that the thorough comparison of existing solutions and the discussion of open research challenges in this survey will motivate the future development of data lake research and practice. ...

Amalur

Data Integration Meets Machine Learning

Conference paper (2023) - Rihan Hai, Christos Koutras, Andra Ionescu, Ziyu Li, Wenbo Sun, Jessie van Schijndel, Yan Kang, Asterios Katsifodimos

Machine learning (ML) training data is often scattered across disparate collections of datasets, called data silos. This fragmentation poses a major challenge for data-intensive ML applications: integrating and transforming data residing in different sources demands a lot of manual work and computational resources. With data privacy and security constraints, data often cannot leave the premises of data silos, hence model training should proceed in a decentralized manner. In this work, we present a vision of how to bridge traditional data integration (DI) techniques with the requirements of modern machine learning. We explore the possibilities of utilizing metadata obtained from data integration processes for improving the effectiveness and efficiency of ML models. Towards this direction, we analyze two common use cases over data silos: feature augmentation and federated learning. Bringing data integration and machine learning together, we highlight new research opportunities from the aspects of systems, representations, factorized learning and federated learning. ...

Amalur

Next-generation Data Integration in Data Lakes

Abstract (2022) - Rihan Hai, Christos Koutras, Andra Ionescu, Asterios Katsifodimos

Data science workflows often require extracting, preparing and integrating data from multiple data sources. This is a cumbersome and slow process: most of the time, data scientists prepare data in a data processing system or a data lake and export it as a table so that it can be consumed by a Machine Learning (ML) algorithm. Recent advances in the area of factorized ML allow us to push down certain linear algebra (LA) operators, executing them closer to the data sources. With this work, we revisit classic data integration (DI) systems and see how these fit into modern data lakes that are meant to support LA as a first-class citizen. ...
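To make the idea of pushing down a linear algebra operator concrete, the sketch below computes a matrix-vector product over the join of two tables without materialising the joined table. It assumes a key-foreign-key join in which each row of a fact table `S` references exactly one row of a dimension table `R` through the index list `fk`. The data, the weight vectors, and the function names are all hypothetical; this is a toy illustration of the general technique, not the system described in the abstract.

```python
def dot(row, weights):
    """Inner product of two equal-length sequences."""
    return sum(x * w for x, w in zip(row, weights))


def materialized_product(S, fk, R, w_s, w_r):
    """Baseline: build each joined row [S_i, R_fk[i]] and multiply."""
    return [dot(s_row + R[j], w_s + w_r) for s_row, j in zip(S, fk)]


def factorized_product(S, fk, R, w_s, w_r):
    """Push the product down: compute R @ w_r once per R row, then reuse."""
    partial_r = [dot(r_row, w_r) for r_row in R]  # one value per R row
    return [dot(s_row, w_s) + partial_r[j] for s_row, j in zip(S, fk)]


# Hypothetical data: 4 fact rows referencing 2 dimension rows.
S = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
fk = [0, 1, 0, 1]
R = [[10.0, 20.0, 30.0], [40.0, 50.0, 60.0]]
w_s = [0.5, -1.0]
w_r = [1.0, 0.0, 2.0]

result = factorized_product(S, fk, R, w_s, w_r)
```

Both functions return the same vector, but the factorized version touches each row of `R` once instead of once per referencing row of `S`, which is where the savings come from when many fact rows share a dimension row.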

Valentine in Action

Matching Tabular Data at Scale

Journal article (2021) - Christos Koutras, Kyriakos Psarakis, George Siachamis, Andra Ionescu, Marios Fragkoulis, Angela Bonifati, Asterios Katsifodimos

Capturing relationships among heterogeneous datasets in large data lakes - traditionally termed schema matching - is one of the most challenging problems that corporations and institutions face nowadays. Discovering and integrating datasets relies heavily on the effectiveness of the schema matching methods in use. However, despite the wealth of research, evaluation of schema matching methods is still a daunting task: there is a lack of openly-available datasets with ground truth, reference method implementations, and comprehensible GUIs that would facilitate development of both novel state-of-the-art schema matching techniques and novel data discovery methods. Our recently proposed Valentine is the first system to offer an open-source experiment suite to organize, execute and orchestrate large-scale matching experiments. In this demonstration we present its functionalities and enhancements: i) a scalable system, with a user-centric GUI, that enables the fabrication of datasets and the evaluation of matching methods on schema matching scenarios tailored to the scope of tabular dataset discovery, ii) a scalable holistic matching system that can receive tabular datasets from heterogeneous sources and provide similarity scores among their columns, in order to facilitate modern procedures in data lakes, such as dataset discovery. ...

Valentine: Evaluating Matching Techniques for Dataset Discovery

Conference paper (2021) - Christos Koutras, George Siachamis, Andra Ionescu, Kyriakos Psarakis, Jerry Brons, Marios Fragkoulis, Christoph Lofi, Angela Bonifati, Asterios Katsifodimos

Data scientists today search large data lakes to discover and integrate datasets. In order to bring together disparate data sources, dataset discovery methods rely on some form of schema matching: the process of establishing correspondences between datasets. Traditionally, schema matching has been used to find matching pairs of columns between a source and a target schema. However, the use of schema matching in dataset discovery methods differs from its original use. Nowadays schema matching serves as a building block for indicating and ranking inter-dataset relationships. Surprisingly, although a discovery method’s success relies highly on the quality of the underlying matching algorithms, the latest discovery methods employ existing schema matching algorithms in an ad-hoc fashion due to the lack of openly-available datasets with ground truth, reference method implementations, and evaluation metrics. In this paper, we aim to rectify the problem of evaluating the effectiveness and efficiency of schema matching methods for the specific needs of dataset discovery. To this end, we propose Valentine, an extensible open-source experiment suite to execute and organize large-scale automated matching experiments on tabular data. Valentine includes implementations of seminal schema matching methods that we either implemented from scratch (due to absence of open source code) or imported from open repositories. The contributions of Valentine are: i) the definition of four schema matching scenarios as encountered in dataset discovery methods, ii) a principled dataset fabrication process tailored to the scope of dataset discovery methods and iii) the most comprehensive evaluation of schema matching techniques to date, offering insight on the strengths and weaknesses of existing techniques, that can serve as a guide for employing schema matching in future dataset discovery methods. ...
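For readers unfamiliar with the task, the following sketch shows one simple instance-based baseline for finding matching column pairs between a source and a target table: scoring every pair by the Jaccard similarity of their distinct values. This is a generic textbook technique chosen for illustration, not Valentine's API or any specific method it evaluates; the tables and the function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between the distinct values of two columns."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_column_matches(source, target):
    """Score every (source column, target column) pair, best first."""
    scored = [
        (jaccard(s_vals, t_vals), s_name, t_name)
        for s_name, s_vals in source.items()
        for t_name, t_vals in target.items()
    ]
    return sorted(scored, reverse=True)


# Hypothetical tables, given as column name -> list of values.
source = {"country": ["NL", "DE", "FR"], "year": [2019, 2020, 2021]}
target = {"nation": ["NL", "DE", "BE"], "yr": [2020, 2021, 2022]}

ranking = rank_column_matches(source, target)
```

A ranked list of scored pairs like this is the kind of output that can be compared against ground truth when evaluating a matcher.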

REMA

Graph embeddings-based relational schema matching

Abstract (2020) - Christos Koutras, Marios Fragkoulis, Asterios Katsifodimos, Christoph Lofi

Schema matching is the process of capturing correspondences between attributes of different datasets, and it is one of the most important prerequisite steps for analyzing heterogeneous data collections. State-of-the-art schema matching algorithms that use simple schema- or instance-based similarity measures struggle to find matches beyond the trivial cases. Semantics-based algorithms require the use of domain-specific knowledge encoded in a knowledge graph or an ontology. As a result, schema matching remains a largely manual process, which is performed by a few domain experts. In this paper we present the Relational Embeddings MAtcher, or REMA for short. REMA is a novel schema matching approach which captures semantic similarity of attributes using relational embeddings: a technique which embeds database rows, columns and schema information into multidimensional vectors that can reveal semantic similarity. This paper aims at communicating our latest findings, and at demonstrating REMA's potential with a preliminary experimental evaluation. ...

Data as a language

A novel approach to data integration

Abstract (2019) - Christos Koutras

In modern enterprises, both operational and organizational data is typically spread across multiple heterogeneous systems, databases and file systems. Recognizing the value of their data assets, companies and institutions construct data lakes, storing disparate datasets from different departments and systems. However, for those datasets to become useful, they need to be cleaned and integrated. Data can be well documented, structured and encoded in different schemata, but also unstructured with implicit, human-understandable semantics. Due to the sheer scale of the data itself but also the multitude of representations and schemata, data integration techniques need to scale without relying heavily on human labor. Existing integration approaches fail to address hidden semantics without human input or some form of ontology, making large scale integration a daunting task. The goal of my doctoral work is to devise scalable data integration methods, employing modern machine learning to exploit semantics and facilitate discovery of novel relationship types. In order to capture semantics with minimal human intervention, we propose a new approach which we call Data as a Language (DaaL). By leveraging embeddings from the Natural Language Processing (NLP) literature, DaaL aims at extracting semantics from structured and semi-structured data, allowing the exploration of relevance and similarity among different data sources. This paper discusses existing data integration mechanisms and elaborates on how NLP techniques can be used in data integration, alongside challenges and research directions. ...