Title: Domain-focused dataset discovery for tabular datasets, using easily-available information about the domain
Author: Mokiem, Riaas (TU Delft Electrical Engineering, Mathematics and Computer Science)
Contributor: Lofi, C. (mentor); Houben, G.J.P.M. (graduation committee); Weber, J.M. (graduation committee)
Degree granting institution: Delft University of Technology
Date: 2022-11-09

Abstract:

Dataset discovery techniques originally required datasets to have the same domain, which made them unsuitable to be used on a larger scale. To avoid this requirement, newer techniques use additional information, aside from the datasets being processed, to better understand the data. They might rely on a knowledge base that describes the meaning of data, or a lexical database that defines meaningful relations between the words contained within the data. The main problem with this approach is that these types of additional information have poor coverage of the data being analyzed.

I propose to use a type of information that I call dataset domain terms. These are terms, or data values, that represent the domain of a dataset. I provide a technique that can derive these dataset domain terms automatically from existing datasets, which means they are easily available. The problem of poor coverage can also be mitigated by only discovering datasets for the domain represented by these dataset domain terms. I provide a dataset discovery technique that takes this approach with these dataset domain terms. Through an evaluation, I show that these dataset domain terms are sufficiently representative of the domain to be used for dataset discovery.
The accuracy of the dataset discovery technique is also shown to be comparable to state-of-the-art dataset discovery techniques, though its precision is lacking.

This makes it highly suitable for filtering datasets before other dataset discovery techniques are performed on them. The data from these filtered datasets should also have a limited range of domains, so subsequent dataset discovery techniques should be less affected by the poor coverage of the additional information they use to understand the data. This allows dataset discovery to be performed on a larger scale.

Subject: Dataset discovery; Dataset domain terms; Data-driven
To reference this document use: http://resolver.tudelft.nl/uuid:012f7697-16be-4c93-965b-b4f8ebd391b3
Part of collection: Student theses
Document type: master thesis
Rights: © 2022 Riaas Mokiem
Files: Thesis_Riaas_Mokiem.pdf (PDF, 852.69 KB)