Optimizing Dataset Quality for Enhanced Machine Learning Performance

A Study on the Impact of Dataset Metrics

Abstract

As machine learning applications become increasingly common in everyday life, high-quality datasets are essential for training accurate and reliable models. This research investigates the factors that contribute to a high-quality dataset and examines how different dataset metrics affect the performance of machine learning models, focusing in particular on Graph Neural Networks (GNNs), Tabular Transformers, and Large Language Models (LLMs). The metrics under scrutiny include graph sparsity, missing data cells, modularity, and text length. Various datasets are adjusted along these dimensions to assess how each metric impacts model performance.
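To make these metrics concrete, the short sketch below shows one way they could be computed; the choice of libraries (networkx, pandas), the function names, and the toy data are illustrative assumptions, not the study's actual tooling.

```python
# Illustrative sketch: computing the four dataset metrics on toy data.
import networkx as nx
import pandas as pd
from networkx.algorithms import community

def graph_metrics(G: nx.Graph) -> dict:
    """Sparsity and modularity of a graph."""
    sparsity = 1.0 - nx.density(G)                      # fraction of absent edges
    parts = community.greedy_modularity_communities(G)  # detected communities
    mod = community.modularity(G, parts)                # quality of that partition
    return {"sparsity": sparsity, "modularity": mod}

def missing_cell_fraction(df: pd.DataFrame) -> float:
    """Fraction of missing cells in a tabular dataset."""
    return df.isna().sum().sum() / df.size

def mean_text_length(texts: list[str]) -> float:
    """Average token count per document (whitespace tokenization)."""
    return sum(len(t.split()) for t in texts) / len(texts)

if __name__ == "__main__":
    G = nx.karate_club_graph()
    df = pd.DataFrame({"a": [1.0, None, 3.0], "b": [None, 2.0, 2.5]})
    docs = ["short example", "a slightly longer example sentence"]
    print(graph_metrics(G), missing_cell_fraction(df), mean_text_length(docs))
```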

The experimental results reveal that sparse graphs can still preserve relational information; increasing density does not necessarily improve performance, as the added edges can introduce noise. The models maintained high accuracy and low error rates even in the presence of substantial missing data, indicating that suitable imputation strategies and architectural design allow them to handle incomplete information and generalize well. Higher modularity helped the models capture community-level patterns, but it also introduced complexity that could hinder performance. Notably, text length emerged as an influential factor, with longer texts supplying additional contextual detail.
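As a hedged illustration of the missing-data experiments described above, the sketch below blanks out a growing fraction of cells at random, mean-imputes them, and tracks held-out error; the synthetic dataset, ridge regressor, and mean imputation are assumptions for demonstration, not the study's exact setup.

```python
# Sketch of a missing-data perturbation experiment (illustrative assumptions only).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=1000, n_features=20, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for missing_frac in (0.0, 0.2, 0.4, 0.6):
    X_masked = X_tr.copy()
    mask = rng.random(X_masked.shape) < missing_frac   # cells to blank out
    X_masked[mask] = np.nan
    imputer = SimpleImputer(strategy="mean")           # simple imputation strategy
    model = Ridge().fit(imputer.fit_transform(X_masked), y_tr)
    err = mean_absolute_error(y_te, model.predict(imputer.transform(X_te)))
    print(f"missing={missing_frac:.0%}  MAE={err:.2f}")
```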

These insights underscore the importance of accounting for dataset attributes when designing machine learning models for complex predictive tasks. By systematically experimenting with and optimizing these metrics, we can improve model robustness and accuracy for real-world applications.