Optimizing Dataset Quality for Enhanced Machine Learning Performance

A Study on the Impact of Dataset Metrics

Abstract

As machine learning applications become increasingly common in everyday life, high-quality datasets are essential for training accurate and reliable models. This research investigates the factors that contribute to a high-quality dataset and examines how different dataset metrics affect the performance of machine learning models, focusing in particular on Graph Neural Networks (GNNs), Tabular Transformers, and Large Language Models (LLMs). The metrics under scrutiny include graph sparsity, missing data cells, modularity, and text length. Various datasets are adjusted along these dimensions to assess how each metric impacts model performance.
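To make these metrics concrete, the short sketch below shows one way they could be computed; the choice of libraries (networkx, pandas), the function names, and the toy data are illustrative assumptions, not the study's actual tooling.

```python
# Illustrative sketch: computing the four dataset metrics on toy data.
import networkx as nx
import pandas as pd
from networkx.algorithms import community

def graph_metrics(G: nx.Graph) -> dict:
    """Sparsity and modularity of a graph."""
    sparsity = 1.0 - nx.density(G)                      # fraction of absent edges
    parts = community.greedy_modularity_communities(G)  # detected communities
    mod = community.modularity(G, parts)                # quality of that partition
    return {"sparsity": sparsity, "modularity": mod}

def missing_cell_fraction(df: pd.DataFrame) -> float:
    """Fraction of missing cells in a tabular dataset."""
    return df.isna().sum().sum() / df.size

def mean_text_length(texts: list[str]) -> float:
    """Average token count per document (whitespace tokenization)."""
    return sum(len(t.split()) for t in texts) / len(texts)

if __name__ == "__main__":
    G = nx.karate_club_graph()
    df = pd.DataFrame({"a": [1.0, None, 3.0], "b": [None, 2.0, 2.5]})
    docs = ["short example", "a slightly longer example sentence"]
    print(graph_metrics(G), missing_cell_fraction(df), mean_text_length(docs))
```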

The experimental results reveal that sparse graphs can still preserve relational information; increasing density does not necessarily improve performance, as the added edges can introduce noise. The models maintained high accuracy and low error rates even in the presence of substantial missing data, indicating that suitable imputation strategies and architectural design allow them to handle incomplete information and generalize well. Higher modularity helped the models capture community-level patterns, but it also introduced complexity that could hinder performance. Notably, text length emerged as an influential factor, with longer texts supplying additional contextual detail.
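As a hedged illustration of the missing-data experiments described above, the sketch below blanks out a growing fraction of cells at random, mean-imputes them, and tracks held-out error; the synthetic dataset, ridge regressor, and mean imputation are assumptions for demonstration, not the study's exact setup.

```python
# Sketch of a missing-data perturbation experiment (illustrative assumptions only).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=1000, n_features=20, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for missing_frac in (0.0, 0.2, 0.4, 0.6):
    X_masked = X_tr.copy()
    mask = rng.random(X_masked.shape) < missing_frac   # cells to blank out
    X_masked[mask] = np.nan
    imputer = SimpleImputer(strategy="mean")           # simple imputation strategy
    model = Ridge().fit(imputer.fit_transform(X_masked), y_tr)
    err = mean_absolute_error(y_te, model.predict(imputer.transform(X_te)))
    print(f"missing={missing_frac:.0%}  MAE={err:.2f}")
```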

These insights underscore the importance of accounting for dataset attributes when designing machine learning models for complex predictive tasks. By systematically experimenting with and optimizing these metrics, we can improve model robustness and accuracy for real-world applications.