Data-Driven Empirical Analysis of Correlation-Based Feature Selection Techniques

Bachelor thesis (2023)

Authors

I. Buşe Electrical Engineering, Mathematics and Computer Science

Contributors

A. Ionescu Web Information Systems - (mentor)

A. Katsifodimos Web Information Systems - (mentor)

E. Isufi (graduation committee member)

Faculty

Electrical Engineering, Mathematics and Computer Science

Machine learning, Correlation, Feature selection, AutoML, Feature Engineering, Pearson correlation, Symmetric Uncertainty, Cramér's V, Spearman correlation, Data-driven activities

To reference this document use:

http://resolver.tudelft.nl/uuid:ea4b4691-bf10-4f93-b8d0-200ff2a12dec

Published Date

26-06-2023

Language

English

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Thus far, the democratization of machine learning, which gave rise to the field of AutoML, has focused on automating model selection and hyperparameter optimization. Nevertheless, the need for high-quality databases to increase performance has sparked interest in correlation-based feature selection, a simple, fast, yet effective approach to removing noise and redundancy in relational data. However, little to no attention has been paid to which correlation metric to choose in order to maximize the performance of ML systems. Our research investigates the effectiveness and efficiency of four widely known correlation measures, namely Pearson, Spearman, Cramér's V, and Symmetric Uncertainty, in a manner that simulates an AutoML-like setting. We show that the exact theoretical assumptions of the methods do not always hold in practice, and we shed light on the main aspects that need to be considered when integrating correlation-based feature selection into ML systems. Notably, the results indicate that the performance obtained by correlation-based methods is closely tied to the types and number of features present in the underlying database rather than to the choice of ML algorithm. We draw conclusions that can further the advancement of AutoML systems by making feature selection fully automatic and computationally tractable.
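For readers unfamiliar with the four measures named in the abstract, the following is a minimal sketch of how each could be computed and used to rank features against a target. It is not the thesis's implementation. It uses the textbook definitions with only the Python standard library, and every function name (`pearson`, `spearman`, `cramers_v`, `symmetric_uncertainty`, `rank_features`) is a hypothetical choice made for illustration. Cramér's V is computed without bias correction, and Symmetric Uncertainty assumes the inputs are already discrete.

```python
import math
from collections import Counter


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    # A constant column has no defined correlation; report 0.0 here.
    return cov / (sx * sy) if sx and sy else 0.0


def _ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(_ranks(x), _ranks(y))


def cramers_v(x, y):
    """Cramér's V of two categorical sequences (no bias correction)."""
    n = len(x)
    cx, cy, cxy = Counter(x), Counter(y), Counter(zip(x, y))
    chi2 = 0.0
    for a in cx:
        for b in cy:
            expected = cx[a] * cy[b] / n
            chi2 += (cxy.get((a, b), 0) - expected) ** 2 / expected
    k = min(len(cx), len(cy)) - 1
    return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0


def _entropy(counts, n):
    """Shannon entropy in bits from a collection of category counts."""
    return -sum(c / n * math.log2(c / n) for c in counts if c)


def symmetric_uncertainty(x, y):
    """SU = 2 * I(X;Y) / (H(X) + H(Y)) for discrete sequences."""
    n = len(x)
    hx = _entropy(Counter(x).values(), n)
    hy = _entropy(Counter(y).values(), n)
    hxy = _entropy(Counter(zip(x, y)).values(), n)
    return 2 * (hx + hy - hxy) / (hx + hy) if hx + hy else 0.0


def rank_features(features, target, measure):
    """Sort feature names by |measure(feature, target)|, strongest first."""
    scores = {name: abs(measure(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Toy data, invented for this illustration only.
    target = [2, 4, 6, 8, 10]
    features = {"signal": [1, 2, 3, 4, 5], "noise": [3, 1, 4, 1, 5]}
    print(rank_features(features, target, pearson))
```

Pearson and Spearman apply to numeric columns, while Cramér's V and Symmetric Uncertainty apply to categorical or discretized ones, so a practical pipeline would need to pick a measure per feature type or discretize first. That choice is an assumption of this sketch rather than something stated in the abstract.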

Files

Data_Driven_Empirical_Analysis... (pdf)

(pdf | 0.886 Mb)

Unknown license