Domain-focused dataset discovery for tabular datasets, using easily-available information about the domain

Master thesis (2022)

Authors

R.M.S. Mokiem (Electrical Engineering, Mathematics and Computer Science)

Contributors

C. Lofi, Web Information Systems (supervisor 1)

G.J.P.M. Houben, Web Information Systems (supervisor 2)

J.M. Weber, Pattern Recognition and Bioinformatics (coach)

Faculty

Electrical Engineering, Mathematics and Computer Science

To reference this document use:

http://resolver.tudelft.nl/uuid:012f7697-16be-4c93-965b-b4f8ebd391b3

Published Date

09-11-2022

Language

English

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Dataset discovery techniques originally required datasets to share the same domain, which made them unsuitable for use on a larger scale. To avoid this requirement, newer techniques use additional information, beyond the datasets being processed, to better understand the data. They might rely on a knowledge base that describes the meaning of data, or on a lexical database that defines meaningful relations between the words contained in the data. The main problem with this approach is that these types of additional information have poor coverage of the data being analyzed.

I propose to use a type of information that I call dataset domain terms. These are terms, or data values, that represent the domain of a dataset. I provide a technique that derives these dataset domain terms automatically from existing datasets, which makes them easily available. The problem of poor coverage can also be mitigated by discovering datasets only for the domain that these terms represent. I provide a dataset discovery technique that takes this approach.
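
The abstract does not say how the domain terms are derived or how datasets are matched against them, so the following is only a minimal sketch under stated assumptions: terms are taken to be cell values that recur across a set of seed datasets, and a candidate dataset is kept when enough of its distinct values are domain terms. All function names, parameters, and thresholds here are hypothetical and are not taken from the thesis.

```python
from collections import Counter


def distinct_values(table):
    """Return the normalized, non-empty cell values of a table (a list of rows)."""
    values = {str(cell).strip().lower() for row in table for cell in row}
    values.discard("")
    return values


def derive_domain_terms(seed_tables, min_share=0.5):
    """Assumed heuristic: keep values found in at least min_share of the seed tables."""
    counts = Counter()
    for table in seed_tables:
        counts.update(distinct_values(table))  # each table counts a value once
    threshold = min_share * len(seed_tables)
    return {term for term, n in counts.items() if n >= threshold}


def domain_score(table, domain_terms):
    """Fraction of a table's distinct values that are domain terms."""
    values = distinct_values(table)
    if not values:
        return 0.0
    return len(values & domain_terms) / len(values)


def filter_datasets(candidates, domain_terms, min_score=0.2):
    """Keep the names of candidate tables whose score reaches min_score."""
    return [
        name
        for name, table in candidates.items()
        if domain_score(table, domain_terms) >= min_score
    ]


# Tiny illustrative tables (made-up values, not data from the thesis).
seed_tables = [
    [["amsterdam", "nl"], ["delft", "nl"]],
    [["delft", "nl"], ["utrecht", "nl"]],
]
candidates = {
    "cities": [["Delft", "NL"], ["Leiden", "NL"]],
    "fruit": [["apple", "red"]],
}
terms = derive_domain_terms(seed_tables, min_share=1.0)
kept = filter_datasets(candidates, terms)
```

In this sketch a low `min_score` favors keeping datasets over rejecting them, which fits a filter placed in front of other, more precise discovery techniques.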

Through an evaluation, I show that these dataset domain terms are sufficiently representative of the domain to be used for dataset discovery. The accuracy of the dataset discovery technique is also shown to be comparable to state-of-the-art dataset discovery techniques, though its precision is lacking.

This makes the technique highly suitable for filtering datasets before other dataset discovery techniques are applied to them. The data in the filtered datasets should also cover a limited range of domains, so subsequent dataset discovery techniques should be less affected by the poor coverage of the additional information they use to understand the data. This allows dataset discovery to be performed on a larger scale.

Files

Thesis_Riaas_Mokiem.pdf

(.pdf | 0.833 Mb)