PCADA: Partial Correlation Aware Data Augmentation for random forest classifier

Bachelor Thesis (2022)
Author(s)

Oskar Lorek (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

A. Ionescu – Mentor (TU Delft - Web Information Systems)

R. Hai – Mentor (TU Delft - Web Information Systems)

D.H.J. Epema – Graduation committee member (TU Delft - Data-Intensive Systems)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2022
Language
English
Graduation Date
22-06-2022
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Machine learning models require rich, high-quality data sets to achieve high accuracy. With the current exponential growth in generated data, it is becoming increasingly hard to prepare high-quality tables within a reasonable time frame. To combat this issue, automated data augmentation methods have emerged in recent years. However, existing solutions do not focus on the specific ML algorithm that is trained on the augmented data.

In this paper, we propose PCADA, a data augmentation framework designed specifically for the random forest classifier. The algorithm uses sample joins to estimate the partial correlation between features in the neighbouring tables and the target column, while controlling for all other features.
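
As a rough illustration of the idea, and not the thesis' actual implementation, the sketch below joins a random sample of base-table rows with a neighbouring table and then estimates partial correlations with the target from the inverse covariance (precision) matrix, which is one standard way to control for all other columns at once. The table layout, the column names (`base_feature`, `useful`, `redundant`) and the helper `partial_correlations` are hypothetical and chosen only for this example.

```python
import numpy as np


def partial_correlations(data):
    """Partial correlation of every column pair, controlling for all other columns.

    Uses the precision matrix P: rho_ij = -P_ij / sqrt(P_ii * P_jj).
    """
    precision = np.linalg.pinv(np.cov(data, rowvar=False))
    scale = np.sqrt(np.diag(precision))
    pcorr = -precision / np.outer(scale, scale)
    np.fill_diagonal(pcorr, 1.0)
    return pcorr


rng = np.random.default_rng(0)
n_rows = 20_000

# Hypothetical base table: one feature, a foreign key, and a binary target.
base_feature = rng.normal(size=n_rows)
foreign_key = rng.permutation(n_rows)

# Hypothetical neighbouring table with two candidate features:
# `useful` carries new signal, `redundant` only echoes `base_feature`.
useful = rng.normal(size=n_rows)
redundant = base_feature + 0.5 * rng.normal(size=n_rows)
neighbour = np.empty((n_rows, 2))
neighbour[foreign_key] = np.column_stack([useful, redundant])

target = (base_feature + useful + 0.5 * rng.normal(size=n_rows) > 0).astype(float)

# Sample join: look up neighbour rows for a random sample of base rows only.
sample = rng.choice(n_rows, size=1_000, replace=False)
joined = np.column_stack(
    [base_feature[sample], neighbour[foreign_key[sample]], target[sample]]
)  # columns: base_feature, useful, redundant, target

plain = np.corrcoef(joined, rowvar=False)[:, -1]
partial = partial_correlations(joined)[:, -1]

for name, r, pr in zip(["base_feature", "useful", "redundant"], plain, partial):
    print(f"{name:>12}: correlation={r:+.2f}  partial correlation={pr:+.2f}")
```

In this synthetic setup the `redundant` column is clearly correlated with the target on its own, yet its partial correlation is close to zero once `base_feature` is controlled for, which is the kind of candidate feature a partial-correlation-aware approach would be expected to skip.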

Moreover, we show that partial correlation is the most effective characteristic for determining feature importance for the random forest classifier. In addition, we demonstrate that PCADA can improve accuracy and run-time compared with baseline data augmentation approaches.

Finally, we show that the framework can also be used for other decision-tree-based classifiers (CART, XGBoost) and for a linear classifier (Support Vector Machine).
