Domain-focused dataset discovery for tabular datasets, using easily-available information about the domain

Abstract

Dataset discovery techniques originally required datasets to share the same domain, which made them unsuitable for use on a larger scale. To avoid this requirement, newer techniques use additional information, beyond the datasets being processed, to better understand the data. They might rely on a knowledge base that describes the meaning of the data, or on a lexical database that defines meaningful relations between the words contained in the data. The main problem with this approach is that these types of additional information have poor coverage of the data being analyzed.

I propose to use a type of information that I call dataset domain terms. These are terms, or data values, that represent the domain of a dataset. I provide a technique that derives these dataset domain terms automatically from existing datasets, which makes them easily available. The problem of poor coverage can also be mitigated by discovering datasets only for the domain that these terms represent. I provide a dataset discovery technique that takes this approach.

Through an evaluation, I show that these dataset domain terms are sufficiently representative of the domain to be used for dataset discovery. The accuracy of the dataset discovery technique is also shown to be comparable to state-of-the-art dataset discovery techniques, though its precision is lacking.

This makes the technique highly suitable for filtering datasets before other dataset discovery techniques are applied to them. The data from these filtered datasets should also span a limited range of domains, so subsequent dataset discovery techniques should be less affected by the poor coverage of the additional information they use to understand the data. This allows dataset discovery to be performed on a larger scale.