Human Interaction in Tabular Data Augmentation in Data Science Workflows

Abstract

The advancement of artificial intelligence (AI) has increased the demand for data of both greater volume and higher quality. In many companies, data is dispersed across multiple tables, yet AI models typically require data in a single table. This necessitates merging these tables and selecting the optimal features for the model, a process known as Tabular Data Augmentation (TDA). With the rapid growth of TDA, automated tools have been developed to streamline this process. However, these state-of-the-art tools often make assumptions about user workflows that may not align with the actual needs of data specialists, which can make them efficient yet not fully user-friendly. Additionally, because these tools have not been thoroughly evaluated through user studies, they may overlook critical steps in the TDA process.

This thesis is divided into two main parts. The first part uncovers the assumptions and oversights in current TDA research through an exhaustive review of recent literature, followed by interviews with 19 data specialists. These interviews aim to verify the identified assumptions and reveal any elements missing from state-of-the-art research. The second part focuses on creating a new tool that meets the requirements derived from the validated assumptions and the gaps discovered. This tool is then assessed through evaluation interviews.

The findings indicate that data specialists prefer a TDA tool that offers greater control over, and deeper insight into, the data augmentation process. To meet these preferences, Human in the Loop AutoTDA was developed to provide the desired functionality. Feedback from the evaluation phase confirmed that data specialists find Human in the Loop AutoTDA suitable for their TDA workflows, marking a significant advancement in the field.