A. Ionescu

10 records found

The advancement of artificial intelligence (AI) has increased the demand for both a greater volume and a higher quality of data. In many companies, data is dispersed across multiple tables, yet AI models typically require data in a single table format. This necessitates the merging of these tables and the selection of optimal features for the model, a process known as Tabular Data Augmentation (TDA). With the rapid growth of TDA, automated tools have been developed to streamline this process. However, these state-of-the-art tools often make assumptions about user workflows that may not align with the actual needs of data specialists, potentially making them efficient yet not fully user-friendly. Additionally, without thorough evaluation through user studies, these tools may overlook critical steps in the TDA process.

This thesis is divided into two main parts. The first part is dedicated to uncovering the assumptions and oversights within current TDA research through an exhaustive review of recent literature. This is followed by conducting interviews with 19 data specialists. These discussions aim to verify the identified assumptions and reveal any missing elements in state-of-the-art research. The second part focuses on creating a new tool to meet the requirements identified from validated assumptions and the gaps discovered. This tool is then subjected to evaluation interviews to assess its effectiveness.

The findings indicate that data specialists prefer a TDA tool that offers enhanced control and deeper insights into the data augmentation process. To meet these preferences, Human in the Loop AutoTDA was developed, embodying the desired functionalities. Feedback from the evaluation phase confirmed that data specialists find Human in the Loop AutoTDA suitable for their TDA workflows, marking a significant advancement in the field.
...

A comparative analysis for linear models, decision trees, and support vector machines

Bachelor thesis (2023) - A. Udilă, A. Ionescu, A. Katsifodimos, E. Isufi
This paper presents a comprehensive evaluation and comparison of encoding methods for categorical data in the context of machine learning. The study focuses on five popular encoding techniques: one-hot, ordinal, target, catboost, and count encoders. These methods are evaluated using linear models, decision trees, and support vector machines (SVMs).

The results demonstrate that one-hot encoding consistently achieves the highest accuracy across all evaluated machine learning algorithms. However, it also incurs a higher runtime, especially when feature cardinality is high. Catboost encoding emerges as a promising alternative, striking a balance between accuracy and runtime efficiency. The ordinal, target, and catboost encoders perform similarly, with small variations depending on the specific machine learning algorithm used.

Based on the findings, practitioners are advised to select one-hot encoding when accuracy is of utmost importance and computational resources are sufficient. For scenarios where runtime efficiency is critical, the catboost encoder offers competitive accuracy while minimizing training time. The ordinal encoder can be a suitable alternative when dealing with high feature cardinality. ...
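To make the encoders named in this abstract concrete, the following is a minimal sketch of one-hot, ordinal, and count encoding using only the Python standard library. It is an illustration, not the thesis implementation: the function names and the toy `colors` column are hypothetical, and the ordinal encoder's use of sorted order is an arbitrary choice made for this example.

```python
from collections import Counter

def one_hot_encode(values):
    """Map each category to a 0/1 indicator vector (one column per category)."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

def ordinal_encode(values):
    """Map each category to an integer index (sorted order is an assumption here)."""
    index = {c: i for i, c in enumerate(sorted(set(values)))}
    return [index[v] for v in values]

def count_encode(values):
    """Replace each category with the number of times it occurs in the column."""
    counts = Counter(values)
    return [counts[v] for v in values]

colors = ["red", "green", "red", "blue"]  # hypothetical toy column
one_hot = one_hot_encode(colors)
ordinal = ordinal_encode(colors)
counts = count_encode(colors)
```

The one-hot output has one column per distinct category, so its width grows with feature cardinality, which is consistent with the higher runtime the abstract reports for one-hot encoding on high-cardinality features. The ordinal and count encoders always produce a single column.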
Bachelor thesis (2023) - K.V. Vasilev, A. Katsifodimos, A. Ionescu, E. Isufi
The data used in machine learning algorithms strongly influences the algorithms' capabilities. Feature selection techniques can choose a set of columns that meet a certain learning goal. There is a wide variety of feature selection methods; however, the ones we cover in this comparative analysis are part of the information-theoretical-based family. We evaluate MIFS, MRMR, CIFE, and JMI using the machine learning algorithms Logistic Regression, XGBoost, and Support Vector Machines.
Multiple datasets with a variety of feature types are used during evaluation. We find that MIFS and MRMR are 2-4 times faster than CIFE and JMI. MRMR and JMI choose columns that lead to significantly higher accuracy and lower root mean squared error earlier. The results we present here can help data scientists pick the right feature selection method depending on the datasets used. ...
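As an illustration of the information-theoretic family discussed in this abstract, the sketch below shows a greedy MRMR-style selection over discrete features. It assumes the common "relevance minus mean redundancy" formulation of MRMR; the thesis may use a different variant. The helper names and the toy data (`a`, `a_copy`, `b`, `y`) are hypothetical.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def mrmr(features, target, k):
    """Greedily pick k features: high relevance to the target,
    low mean redundancy with the features already selected."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        def score(name):
            relevance = mutual_information(features[name], target)
            if not selected:
                return relevance
            redundancy = sum(mutual_information(features[name], features[s])
                             for s in selected) / len(selected)
            return relevance - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical toy data: "a" is informative, "a_copy" duplicates it,
# "b" is weakly informative but nearly independent of "a".
y = [0, 0, 0, 0, 1, 1, 1, 1]
a = [0, 0, 0, 0, 1, 1, 1, 0]
b = [0, 1, 0, 1, 0, 1, 1, 1]
selected = mrmr({"a": a, "a_copy": list(a), "b": b}, y, k=2)
```

On this toy data the duplicate column is skipped in the second step because its redundancy with `a` outweighs its relevance, so the weaker but complementary feature `b` is chosen instead.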

A comparative study between filter and wrapper feature selection techniques

The curse of dimensionality is a common challenge in machine learning, and feature selection techniques are commonly employed to address this issue by selecting a subset of relevant features. However, there is no consistently superior approach for choosing the most significant subset of features. We conducted a comprehensive analysis comparing filter and wrapper techniques to guide future work in selecting the most appropriate method based on specific circumstances. We quantified the performance of these techniques using a diverse collection of datasets. We utilised simple decision trees, linear machine learning algorithms, and support vector machines to assess the performance with varying percentages of features selected by the filter and wrapper techniques. The findings demonstrate that filter methods (Chi-Squared and ANOVA) perform better than wrapper methods (Forward Selection and Backward Elimination) regarding the classification accuracy, regression root mean squared error, and runtime. ...
Bachelor thesis (2023) - Florena Buse, A. Ionescu, A. Katsifodimos, E. Isufi
Thus far the democratization of machine learning, which resulted in the field of AutoML, has focused on the automation of model selection and hyperparameter optimization. Nevertheless, the need for high-quality databases to increase performance has sparked interest in correlation-based feature selection, a simple and fast, yet effective approach to removing noise and redundancy in relational data. However, little to no attention has been paid to what correlation metric to choose in order to maximize the performance of ML systems. Our research investigates the effectiveness and efficiency of four widely-known correlation measures, namely Pearson, Spearman, Cramér's V, and Symmetric Uncertainty, in a manner that simulates an AutoML-like setting. We show that the exact theoretical assumptions of the methods do not always hold in practice, and we shed light on the main aspects that need to be considered when integrating correlation-based feature selection in ML systems. Notably, the results indicate that the performance obtained by correlation-based methods is highly tied to the types and number of features present in the underlying database rather than the choice of ML algorithm. We devise promising conclusions that can further serve the advancement of AutoML systems by making feature selection fully automatic and computationally tractable. ...
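To illustrate why the choice of correlation measure can matter, the following sketch contrasts Pearson and Spearman correlation on a monotone but non-linear relationship. It uses only the standard library and is not taken from the thesis; the function names and the toy data are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation: strength of the linear relationship."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]  # monotone but non-linear (toy data)
r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
```

Here Spearman reports a perfect monotone association while Pearson is lower because the relationship is not linear, a small example of how the assumptions behind each measure affect the score a feature receives.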
Bachelor thesis (2023) - D. Anceaux, A. Katsifodimos, A. Ionescu
Since every day more and more data is collected, it becomes more and more expensive to process. To reduce these costs, you can use dimensionality reduction to reduce the number of features per instance in a given dataset.

In this paper, we compare four methods of dimensionality reduction: the feature extraction methods PCA, LDA, and GDA, and the feature selection method Lasso. We mainly compare how the number of features left over by these methods affects the accuracy of certain classification algorithms, and how long the methods take to complete their task.

Our research highlights LDA as a highly effective method for significantly reducing the dimensionality of data used in logistic regression and Support Vector Machines (SVMs). Additionally, we identified Lasso as the preferred choice for situations involving a limited training dataset or when utilizing the random forest algorithm for classification. Notably, Principal Component Analysis (PCA) was observed to occupy a middle ground between LDA’s strengths in aggressive data reduction and Lasso’s accuracy while retaining. GDA (with a linear kernel function) turned out to be significantly slower than the other methods, while its results were most of the time on par with LDA. ...
Bachelor thesis (2022) - E. Cruset Pla, R. Hai, A. Ionescu, D.H.J. Epema
The democratization of data science, and in particular of the machine learning pipeline, has focused on the automation of model selection, feature processing, and hyperparameter tuning. Nevertheless, the need for high-quality data for increased performance has sparked interest in the inclusion of data augmentation in these automatic machine learning techniques. This research approaches this topic by examining different feature selection techniques that will ultimately allow devising what makes a feature desirable. We introduce an automatic data augmentation process, tailored for support vector machines, that employs sample joins. This approach is evaluated through different setups, datasets, and other machine learning models: CART, random forests, and XGBoost. The results are mixed: the algorithm identifies the features containing the signal, resulting in accuracy scores close to the models trained with all the data. However, the computational time is higher. A theoretical analysis suggests that the methodology might be helpful in particular cases where data is structured in specific ways. ...
Bachelor thesis (2022) - Oskar Lorek, A. Ionescu, R. Hai, D.H.J. Epema
Machine learning models require rich, high-quality data sets to achieve high accuracy. With the current exponential growth of generated data, it is becoming increasingly hard to prepare high-quality tables within a reasonable time frame. To combat this issue, automated data augmentation methods have emerged in recent years. However, existing solutions do not focus on the specific ML algorithm used for training.

In this paper we propose a data augmentation framework designed specifically for the random forest classifier. The algorithm uses sample joins to estimate the partial correlation between features in the neighbouring tables and the target column, while controlling for all other features.

Moreover, we show that partial correlation is the most suitable characteristic for determining feature importance for the random forest classifier. In addition, we demonstrate that PCADA can improve accuracy and run-time in comparison with other baseline data augmentation approaches.

Finally, we show that the framework can also be used for other decision tree classifiers (CART, XGBoost) and a linear classifier (Support Vector Machine). ...
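The partial correlation this abstract relies on can be illustrated with the standard first-order formula, which controls for a single third variable. The sketch below is not the PCADA implementation (which, per the abstract, controls for all other features and works on sample joins); the function names and the toy data are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def partial_correlation(xs, ys, zs):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy, r_xz, r_yz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Toy data: x and y both follow z plus unrelated noise,
# so their correlation is explained entirely by z.
z = [1, 2, 3, 4, 5, 6, 7, 8]
e1 = [1, -1, -1, 1, 1, -1, -1, 1]
e2 = [1, 1, -1, -1, -1, -1, 1, 1]
x = [zi + ei for zi, ei in zip(z, e1)]
y = [zi + ei for zi, ei in zip(z, e2)]

r_plain = pearson(x, y)
r_partial = partial_correlation(x, y, z)
```

In this example the plain correlation between `x` and `y` is high, but it vanishes once `z` is controlled for, which is the kind of redundancy a partial-correlation criterion is meant to expose when ranking candidate features.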
Bachelor thesis (2022) - O.L.C. Neut, A. Ionescu, R. Hai, D.H.J. Epema
Automatic machine learning is a subfield of machine learning that automates the common procedures faced in predictive tasks. One such procedure is automatic data augmentation, where one desires to enrich the existing data to increase model performance. In relational data repositories, the data is stored in normal form. This causes problems, since joining all tables and subsequently performing feature selection is highly inefficient. This paper provides AFAR, an approach to efficiently and effectively perform automated feature augmentation by ranking candidate joins in a data repository. Additionally, an experimental evaluation that validates the approach’s capabilities is presented.
...
Master thesis (2022) - W.H. Wang, A. Katsifodimos, G.J.P.M. Houben, Y. Chen, A. Ionescu
The speed of data growth has increased exponentially over the past decade, highlighting the need of modern organizations for data discovery systems. Several (automated) schema matching approaches have been proposed to find related data, exploiting different parts of schema information (e.g. data type, data distribution, column name, etc.). However, research has shown that single schema matching techniques fail to match schemas effectively, whilst combinatorial schema matching systems show more promise. With the introduction of combinatorial schema matching systems, new challenges arise regarding selection and combining strategies. This research explores different techniques for determining the importance of each matcher in a combinatorial schema matching system by determining the weights of each matcher and comparing them through a comprehensive evaluation. ...