Domain-focused dataset discovery for tabular datasets, using easily-available information about the domain

Master Thesis (2022)
Author(s)

R.M.S. Mokiem (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Christoph Lofi – Mentor (TU Delft - Web Information Systems)

G.J. Houben – Graduation committee member (TU Delft - Web Information Systems)

J.M. Weber – Coach (TU Delft - Pattern Recognition and Bioinformatics)

Faculty
Electrical Engineering, Mathematics and Computer Science
Copyright
© 2022 Riaas Mokiem
Publication Year
2022
Language
English
Graduation Date
09-11-2022
Awarding Institution
Delft University of Technology
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Early dataset discovery techniques required all datasets to belong to the same domain, which made them unsuitable for use at a larger scale. To avoid this requirement, newer techniques draw on additional information, beyond the datasets being processed, to better understand the data. They might rely on a knowledge base that describes the meaning of the data, or on a lexical database that defines meaningful relations between the words contained in the data. The main problem with this approach is that these types of additional information have poor coverage of the data being analyzed.

I propose to use a type of information that I call dataset domain terms. These are terms, or data values, that represent the domain of a dataset. I provide a technique that derives these dataset domain terms automatically from existing datasets, which makes them easily available. The problem of poor coverage can also be mitigated by discovering datasets only for the domain that these terms represent. I provide a dataset discovery technique that takes this approach.
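To make the idea concrete, the following is a minimal sketch, not the technique from the thesis. It assumes that a dataset domain term is a data value shared by a minimum share of a few seed tables known to belong to the domain, and that a candidate table is scored by the fraction of its distinct values that are domain terms. The function names `derive_domain_terms` and `domain_score`, the `min_share` parameter, and the sample tables are all hypothetical.

```python
from collections import Counter


def _distinct_values(table):
    """Collect the distinct, normalized cell values of a table (a list of rows)."""
    values = {str(cell).strip().lower() for row in table for cell in row}
    values.discard("")
    return values


def derive_domain_terms(seed_tables, min_share=0.5):
    """Return values that occur in at least min_share of the seed tables.

    Assumption for illustration only: the thesis may derive its terms differently.
    """
    counts = Counter()
    for table in seed_tables:
        # Each distinct value counts once per table.
        counts.update(_distinct_values(table))
    threshold = min_share * len(seed_tables)
    return {term for term, n in counts.items() if n >= threshold}


def domain_score(table, domain_terms):
    """Fraction of the table's distinct values that are domain terms."""
    values = _distinct_values(table)
    if not values:
        return 0.0
    return len(values & domain_terms) / len(values)


# Hypothetical seed tables from a gene-related domain.
seed_tables = [
    [["gene", "BRCA1"], ["gene", "TP53"]],
    [["symbol", "TP53"], ["symbol", "BRCA1"]],
]
terms = derive_domain_terms(seed_tables, min_share=1.0)

in_domain = [["TP53", "0.8"], ["BRCA1", "0.3"]]
out_of_domain = [["Amsterdam", "NL"]]
print(sorted(terms))
print(domain_score(in_domain, terms), domain_score(out_of_domain, terms))
```

In this sketch, only tables whose score exceeds some chosen cut-off would be kept for the domain. Choosing that cut-off is left open here.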

Through an evaluation, I show that these dataset domain terms are sufficiently representative of the domain to be used for dataset discovery. The accuracy of the dataset discovery technique is also shown to be comparable to that of state-of-the-art dataset discovery techniques, though its precision is lacking.
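Accuracy and precision can diverge when most candidate datasets fall outside the domain. The short calculation below illustrates this with invented confusion-matrix counts. The numbers are hypothetical and are not taken from the evaluation in the thesis.

```python
def accuracy(tp, fp, tn, fn):
    """Share of all decisions (in-domain or not) that were correct."""
    return (tp + tn) / (tp + fp + tn + fn)


def precision(tp, fp):
    """Share of datasets flagged as in-domain that really are in-domain."""
    flagged = tp + fp
    return tp / flagged if flagged else 0.0


# Hypothetical counts for 1000 candidate datasets, 50 of which are in-domain.
tp, fp, tn, fn = 40, 60, 890, 10

print(f"accuracy:  {accuracy(tp, fp, tn, fn):.2f}")
print(f"precision: {precision(tp, fp):.2f}")
```

With these made-up counts, most decisions are correct because the many out-of-domain datasets are rejected, yet more than half of the datasets flagged as in-domain are false positives.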

This makes the technique highly suitable for filtering datasets before other dataset discovery techniques are applied to them. The data in the filtered datasets should also cover a limited range of domains, so subsequent dataset discovery techniques should be less affected by the poor coverage of the additional information they use to understand the data. This allows dataset discovery to be performed on a larger scale.

Files

Thesis_Riaas_Mokiem.pdf
(pdf | 0.833 Mb)
License info not available