Valentine in Action

Matching Tabular Data at Scale

More Info


Capturing relationships among heterogeneous datasets in large data lakes - traditionally termed schema matching - is one of the most challenging problems that corporations and institutions face nowadays. Discovering and integrating datasets heavily relies on the effectiveness of the schema matching methods in use. However, despite the wealth of research, evaluation of schema matching methods is still a daunting task: there is a lack of openly-available datasets with ground truth, reference method implementations, and comprehensible GUIs that would facilitate development of both novel state-of-the-art schema matching techniques and novel data discovery methods.Our recently proposed Valentine is the first system to offer an open-source experiment suite to organize, execute and orchestrate large-scale matching experiments. In this demonstration we present its functionalities and enhancements: i) a scalable system, with a user-centric GUI, that enables the fabrication of datasets and the evaluation of matching methods on schema matching scenarios tailored to the scope of tabular dataset discovery, ii) a scalable holistic matching system that can receive tabular datasets from heterogeneous sources and provide with similarity scores among their columns, in order to facilitate modern procedures in data lakes, such as dataset discovery.