R. Hai | TU Delft Repository

Feature Discovery for Data-Centric AI

Doctoral thesis (2025) - Andra Ionescu (author) , G.J. Houben (promotor) , A. Katsifodimos (copromotor) , R. Hai (copromotor)

We are witnessing a paradigm shift in machine learning (ML) and artificial intelligence (AI) from a focus primarily on innovating ML models, the model-centric paradigm, to prioritising high-quality, reliable data for AI/ML applications, the data-centric paradigm. This emphasis on data has led to the development of an economy around data, creating data marketplace platforms where data is traded as a commodity. However, trading data involves constraints that reflect the specific needs of users, such as enriching or augmenting their datasets or creating datasets with particular properties. These constraints pose challenges the data management community has already addressed independently of the marketplace platform context. As such, in this thesis, as a first act of research, we integrate approaches and practices from the data management community into the context of an open-source data marketplace platform, following a survey of industry professionals who produce, trade, and purchase data assets.

Aligned with the objectives of the data-centric AI paradigm to create high-quality training datasets, our research is focused on developing automated methods to identify relevant and related features (e.g., columns) that can be added to a given dataset. This effort has led to the research and design of feature discovery, which sits at the intersection of dataset discovery by discovering related datasets, data integration by joining datasets, and feature selection by selecting highly predictive features for ML models. We have developed an automated approach for feature discovery that improves upon existing automated data augmentation techniques in both the effectiveness and the efficiency of finding the most relevant features.
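As a rough illustration of the three steps named above (finding a related table, joining it, and selecting predictive features), the following minimal Python sketch joins a hypothetical candidate table onto a base table and ranks the new columns by their absolute Pearson correlation with the label. The function names, the toy tables, and the use of correlation as a relevance score are assumptions made for illustration; they are not the method developed in the thesis.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long numeric lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def discover_features(base, candidate, key, label):
    """Inner-join `candidate` onto `base` via `key`, then rank the candidate's columns.

    Returns (feature_name, score) pairs sorted by descending score, where the
    score is the absolute correlation with the label (an illustrative choice).
    """
    lookup = {row[key]: row for row in candidate}
    joined = [(b, lookup[b[key]]) for b in base if b[key] in lookup]
    labels = [b[label] for b, _ in joined]
    feature_names = [name for name in candidate[0] if name != key]
    scores = {
        name: abs(pearson([c[name] for _, c in joined], labels))
        for name in feature_names
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: a base table with a label and one candidate table.
base_table = [
    {"id": 1, "price": 10}, {"id": 2, "price": 20},
    {"id": 3, "price": 30}, {"id": 4, "price": 40},
]
candidate_table = [
    {"id": 1, "area": 1, "noise": 5}, {"id": 2, "area": 2, "noise": 1},
    {"id": 3, "area": 3, "noise": 1}, {"id": 4, "area": 4, "noise": 5},
]
ranking = discover_features(base_table, candidate_table, key="id", label="price")
```

In this toy example `area` ranks above `noise`, because it varies linearly with the label while `noise` is uncorrelated with it.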

However, with the adoption of automatic approaches, we discovered that in moving towards data-centric AI, we risk detaching not only from model-centric but also from user-centric AI. To assess the extent to which users (e.g., data scientists, data engineers, ML engineers) rely on and trust automatic approaches and to determine their feature discovery pipeline, we conducted 19 interviews based on a use-case study. The results revealed that users doubt the automated methods and want to be involved in the process instead. Consequently, we decided to incorporate the users into the feature discovery process and to explore whether their involvement (e.g., by adding domain and business knowledge) improves the quality of the resulting dataset and the feature discovery process.

Thus, we created a human-in-the-loop approach for feature discovery, which was evaluated by conducting interviews with a subset of our initial candidate pool. The results confirmed that a human-in-the-loop method is more approachable for users as it provides control over and insights into the process, as well as the opportunity to inject their knowledge, ensuring that the resulting dataset is relevant for their data tasks.

With this thesis, we make scientific contributions to the field of data management by offering novel insights into users' workflows and designing and developing resources that enhance feature discovery. We hope our contributions will serve as a valuable resource for future work in user-centric and data-centric feature discovery.

Large Language Models Meet Commit Message Generation: An Empirical Study

Master thesis (2024) - Y. Tang (author) , Ujwal Gadiraju (mentor) , Rihan Hai (mentor) , Maliheh Izadi (graduation committee member)

In the realm of software development, commit messages are vital for understanding code changes, enhancing maintainability, and improving collaboration. Despite their importance, generating high-quality commit messages remains a challenging task, with existing methods often facing issues such as limited flexibility and high training costs. This paper addresses the research gaps in automated commit message generation (CMG) by exploring the capabilities of large language models (LLMs) in this domain. We specifically investigate the potential of the prompt engineering method to enhance LLM performance compared to state-of-the-art (SOTA) techniques such as RACE.

Our research begins with a comprehensive literature review of CMG methodologies, focusing on the effectiveness of various message formats and the limitations of existing approaches. To fill the gap, we introduce a unified commit message format dataset and evaluate the previous LLM-based method on it, using the GPT model as a representative example of LLMs. By comparing the LLM zero-shot method with previous retrieval-based and hybrid methods, we provide a detailed analysis of the strengths and weaknesses of LLM-based approaches.

We further explore the impact of different retrieval augmented generation (RAG) configurations on CMG performance and investigate what constitutes a good demonstration for the LLM RAG method in the CMG task. We then propose a new prompt engineering method called Adaptive Retrieval Augmented Generation With Commit Type Classification And Partitioned Retrieval (ARC-PR), which adds a classification module and a database partitioning module to the LLM RAG system. Comprehensive testing on the unified message format dataset shows that the proposed method significantly improves message effectiveness over previous LLM-based methods, and that it surpasses state-of-the-art methods in informativeness, message format consistency, and the balance between precision and recall. A further generalizability study illustrates the robustness of the proposed method. A human evaluation further confirms its superiority over state-of-the-art methods in terms of informativeness and expressiveness.
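The general idea of combining a classification step with partitioned retrieval can be sketched as follows. This is a minimal illustration under stated assumptions, not the ARC-PR implementation: the keyword-based classifier, the Jaccard token similarity, the example store, and all function names are hypothetical stand-ins.

```python
def tokenize(text):
    """Lower-cased set of whitespace-separated tokens."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify_commit_type(diff):
    """Hypothetical keyword rule standing in for a real classification module."""
    lowered = diff.lower()
    if "fix" in lowered or "bug" in lowered:
        return "fix"
    if "readme" in lowered or "doc" in lowered:
        return "docs"
    return "feat"


def retrieve(diff, partitions, k=1):
    """Search only the partition matching the predicted commit type."""
    commit_type = classify_commit_type(diff)
    pool = partitions.get(commit_type, [])
    query = tokenize(diff)
    ranked = sorted(
        pool, key=lambda ex: jaccard(query, tokenize(ex["diff"])), reverse=True
    )
    return commit_type, ranked[:k]


def build_prompt(diff, partitions, k=1):
    """Assemble a few-shot prompt from the retrieved demonstrations."""
    commit_type, examples = retrieve(diff, partitions, k)
    shots = "\n".join(
        f"Diff: {ex['diff']}\nMessage: {ex['message']}" for ex in examples
    )
    return f"Commit type: {commit_type}\n{shots}\nDiff: {diff}\nMessage:"


# Hypothetical example store, partitioned by commit type.
example_partitions = {
    "fix": [
        {"diff": "fix null pointer in parser", "message": "fix: handle null input in parser"},
        {"diff": "fix typo in logger output", "message": "fix: correct logger typo"},
    ],
    "docs": [
        {"diff": "update readme install steps", "message": "docs: update install steps"},
    ],
}
prompt = build_prompt("fix crash in parser on null token", example_partitions)
```

Restricting retrieval to one partition keeps the demonstrations in the same commit category as the query, which is the intuition behind partitioning; a real system would replace both the classifier and the similarity function with learned components.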

In summary, this study makes several key contributions: it provides a thorough comparison of previous LLM-based methods with existing techniques, proposes an enhanced LLM prompt engineering approach specifically tailored for commit message generation (CMG) tasks that addresses the low informativeness and expressiveness seen in past state-of-the-art methods, and demonstrates performance that surpasses other LLM-based methods.

Finding the Needle in the Pre-Trained Model Zoo

The Use of Rich Metadata and Graph Learning to Estimate Task Transferability

Master thesis (2024) - Hilco Van Der Wilk (author) , Rihan Hai (mentor) , Ziyu Li (mentor) , A. Anand (graduation committee member) , Q. Song (graduation committee member)

The democratization of machine learning through public repositories, often known as model zoos, has significantly increased the availability of pre-trained models for practitioners. However, this abundance can make it difficult to choose the most suitable pre-trained model for fi ...

Watermarking Time Series Diffusion Models

Bachelor thesis (2024) - L. Fatas Lynas (author) , R. Hai (mentor) , Lydia Y. Chen (mentor) , Jeroen Galjaard (mentor) , C. Zhu (mentor)

In many scientific fields, time series data is essential, yet maintaining the integrity and legitimacy of such data is still difficult. Traditional watermarking techniques have mainly been used for multimedia. Although approaches for watermarking non-media data have been develo ...

Optimizing Database Joins

Cost Models and Benchmarking for CPU and GPU Systems

Master thesis (2024) - M. Matušovič (author) , R. Hai (mentor) , Christoph Lofi (graduation committee member) , Jeremie Decouchant (graduation committee member) , W. Sun (coach)

Optimizing SQL query execution through effective cost models is a critical challenge in database management systems (DBMS). This thesis introduces a modular benchmarking system for cost models, with a pluggable architecture for both cost models and execution engines, enabling com ...

Cost Estimation for Factorized Machine Learning

Master thesis (2024) - P.H. te Marvelde (author) , R. Hai (mentor) , W. Sun (mentor) , A Katsifodimos (graduation committee member) , S.S. Chakraborty (graduation committee member)

In the realm of machine learning (ML), the need for efficiency in training processes is paramount. The conventional first step in an ML workflow involves collecting data from various sources and merging them into a single table, a process known as materialization, which can introduce inefficiencies caused by redundant data. Factorized ML strives to reduce this by maintaining the original data forms and performing model training on the separate source tables. This approach can lead to significant increases in training efficiency.
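To make the contrast concrete, the sketch below computes a linear model's predictions in two ways: by first joining a source table `S` with a second table `R` through a foreign key (materialization), and by computing partial results on each table separately and combining them (factorization). The table contents, weights, and function names are assumptions chosen for illustration; this is not the framework used in the thesis.

```python
def dot(row, weights):
    """Dot product of a feature row and a weight vector."""
    return sum(x * w for x, w in zip(row, weights))


def materialized_predict(S, fk, R, w_s, w_r):
    """Join first: every row of S is widened with its matching row of R."""
    joined = [s_row + R[k] for s_row, k in zip(S, fk)]
    return [dot(row, w_s + w_r) for row in joined]


def factorized_predict(S, fk, R, w_s, w_r):
    """No join: each distinct row of R is multiplied once and then reused."""
    r_partial = [dot(r_row, w_r) for r_row in R]
    return [dot(s_row, w_s) + r_partial[k] for s_row, k in zip(S, fk)]


# Hypothetical toy tables: three rows in S, each pointing to one of two rows in R.
S = [[1, 2], [3, 4], [5, 6]]
fk = [0, 1, 0]          # foreign key: index of the matching row in R
R = [[10, 1], [20, 2]]
w_s, w_r = [2, -1], [3, 5]

same_result = (
    materialized_predict(S, fk, R, w_s, w_r)
    == factorized_predict(S, fk, R, w_s, w_r)
)
```

Both functions return the same predictions, but the factorized version touches each row of `R` only once, which is where the savings come from when many rows of `S` share the same row of `R`.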

However, factorized training does not always reduce cost compared to traditional materialized training. This research tackles this issue by examining the multidimensional cost optimization problem that emerges when deciding between factorized and traditional materialized learning methods. It fills in gaps left by prior research, which focused on CPU-based training, by investigating the cost estimation landscape for factorized ML, with a special emphasis on GPU performance compared to CPUs. The factorized ML framework we use is extended to incorporate GPU training, a topic not explored in previous research. We demonstrate that GPU training exhibits significantly different cost characteristics than CPU training, which has substantial implications for the design of cost models and the optimization of factorized ML.
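Why factorization is not always cheaper can be seen with a deliberately simple operation count for one pass of a linear model over two tables. The formulas below are an assumed toy cost model for illustration only; they ignore hardware effects and are not the ML-based cost model developed in the thesis.

```python
def materialized_cost(n_s, d_s, d_r):
    """Multiplications for one pass over the joined table (n_s rows, d_s + d_r columns)."""
    return n_s * (d_s + d_r)


def factorized_cost(n_s, d_s, n_r, d_r):
    """Multiplications on each table separately, plus one addition per output row."""
    return n_s * d_s + n_r * d_r + n_s


def choose_strategy(n_s, d_s, n_r, d_r):
    """Pick the strategy with the lower estimated operation count."""
    if factorized_cost(n_s, d_s, n_r, d_r) < materialized_cost(n_s, d_s, d_r):
        return "factorized"
    return "materialized"


# High redundancy: 1000 rows in S share only 10 wide rows of R.
high_redundancy = choose_strategy(n_s=1000, d_s=5, n_r=10, d_r=50)
# No redundancy: every row of S has its own narrow row of R.
no_redundancy = choose_strategy(n_s=100, d_s=5, n_r=100, d_r=1)
```

Under this toy count the first case favours factorized training and the second favours materialization, which mirrors the observation that the better choice depends on the data.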

Through an empirical study, an ML-based cost model is developed that can accurately predict the faster training method for a wide range of scenarios. In an extensive evaluation with real-world datasets, this model achieves an average speedup of 3.8x, versus the state-of-the-art's 0.9x. We also show that it generalizes to scenarios with datasets and hardware settings on which the model is not trained, retaining 82% of training set performance.

Our innovative cost model for factorized ML enables significant time savings in training-intensive scenarios and further underlines the benefits of factorized training. However, effort should be invested into incorporating factorized training into existing ML frameworks so this method of training a model, and our cost model, can be evaluated in a larger set of realistic scenarios.

LLM-PQA

A Natural Language Interface for Machine Learning Model Selection and Training

Master thesis (2024) - W. Zhao (author) , R. Hai (mentor) , M.S. Pera (graduation committee member) , Huijuan Wang (graduation committee member)

Many domain experts encounter significant challenges in leveraging machine learning due to the technical complexity of model selection and development. This thesis presents LLM-PQA, a system that enables natural language interaction with machine learning functionalities through l ...

Rainbow RAG: An LLM-Powered RAG System for Contract Review

Master thesis (2024) - J. Wang (author) , R. Hai (mentor) , M.S. Pera (graduation committee member) , Huijuan Wang (graduation committee member)

Contract review is a critical yet time-consuming process in legal practice, with significant financial implications when errors occur. While Large Language Models (LLMs) have shown promise in legal document processing, they still face challenges with lengthy contracts and complex legal relationships. This research presents an advanced approach to automated contract review by integrating knowledge graphs into Retrieval-Augmented Generation (RAG) frameworks, addressing the limitations of current methodologies.

Through a comprehensive literature review of contract review automation and RAG systems, we conducted systematic experiments comparing RAG approaches with LLMs' in-context learning capabilities. Our empirical analysis validated that RAG-based methods significantly enhance long-context text analysis and information extraction in legal documents, particularly in terms of accuracy and consistency.

Building on these findings, we extensively investigated optimization techniques for the RAG retrieval phase, recognizing its critical role in contract review accuracy. Our experimental evaluation encompassed various chunking strategies, query expansion methods, and re-ranking approaches, establishing best practices for legal document processing.
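One of the simplest chunking strategies in a RAG retrieval phase is a fixed-size sliding window with overlap. The sketch below shows that baseline only; the word-based splitting, the parameter names, and the sample clause are assumptions for illustration, and the abstract does not state which strategies the thesis found to work best.

```python
def chunk_words(text, size, overlap):
    """Split text into chunks of `size` words, with `overlap` words shared by neighbours."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Hypothetical contract clause, chunked into windows of 6 words with 2 words of overlap.
clause = "The supplier shall deliver the goods within thirty days of the order date"
clause_chunks = chunk_words(clause, size=6, overlap=2)
```

The overlap keeps a clause that straddles a chunk boundary retrievable from at least one chunk, at the cost of indexing some text twice.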

Our primary contribution is a novel KG-RAG system that enhances contextual understanding in legal document analysis. We evaluate our approach using the Contract Understanding Atticus Dataset (CUAD) and ContractNLI dataset, demonstrating improved performance over traditional RAG implementations and long-context models. The research also explores optimal chunking strategies and investigates the efficiency-effectiveness trade-offs between different model architectures.

Results indicate that our KG-enhanced RAG framework achieves superior performance in identifying and analyzing complex legal relationships while maintaining computational efficiency. The integration of knowledge graphs particularly excels in capturing hierarchical and cross-referential relationships within legal documents, a crucial aspect often overlooked by conventional approaches.

This work advances the field of legal AI by providing a more robust and context-aware approach to contract review, while offering practical insights for implementing AI systems in legal practice. Our findings suggest promising directions for future research in legal document processing, particularly in areas requiring deep contextual understanding and relationship modeling.

Optimised Private Set Intersection for Vertical Federated Tree Models

Master thesis (2024) - M.C.H. Li (author) , R. Hai (mentor) , Danning Zhan (mentor) , C. Lofi (mentor) , Jeremie Decouchant (graduation committee member)

In recent years, the rapid advancements in big data, machine learning, and artificial intelligence have led to a corresponding rise in privacy concerns. One of the solutions to address these concerns is federated learning. In this thesis, we will look at the setting of vertical f ...

Enriching Machine Learning Model Metadata

Collecting performance metadata through automatic evaluation

Master thesis (2023) - H.G.J. Kant (author) , A Bozzon (mentor) , Asterios Katsifodimos (mentor) , R. Hai (mentor) , Ziyu Li (mentor) , S. Proksch (graduation committee member)

As the sharing of machine learning (ML) models has increased in popularity, more so-called model zoos are being created. These repositories facilitate the sharing of models and their metadata, and allow other people to find and reuse an existing model. However, the metadata provided for mod ...

Reference-free biomarker mining in metagenomic data using language embedding

Master thesis (2023) - I. Agrawal (author) , Thomas Abeel (mentor) , R. Hai (coach) , Chengyao Peng (coach)

Metagenomic Next-Generation Sequencing (mNGS) presents a promising avenue to generate massive volumes of sequence reads in a short period of time. This has opened opportunities for disease diagnosis based on individual variations and mutations by considering the microbiome profile of each patient. However, the effective use of this data requires the design of appropriate algorithms that can represent the metagenomic data in an accurate and condensed manner.

In this work, we acknowledged the efficiency of current approaches such as reference-based methods and frequency encoding. However, we also recognized the limitations of current methods, such as limiting findings to pre-existing knowledge and inadequate representation of reads and metagenomic samples. Accordingly, we explored a natural language embedding technique, called Doc2Vec, as a potential embedding approach for metagenomic study and phenotype prediction.

We introduced some modifications to the original Doc2Vec architecture to remove a bottleneck in analysing long reads. This was done by replacing k-mer-level encoding with nucleotide-level representation. We used the embeddings obtained from this method as input to a logistic classifier and ridge regression models. We compared the results with Kraken2 on colorectal cancer and type-2 diabetes classification, and for regression tasks on type-2 diabetes-related measures.
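The difference between the two encodings can be illustrated at the tokenization level. The sketch below only shows how a read is split into tokens before any embedding model sees it; the function names and the sample read are hypothetical, and the modified Doc2Vec architecture itself is not reproduced here.

```python
def kmer_tokens(read, k):
    """Overlapping k-mers of a read; over A/C/G/T the vocabulary can reach 4**k tokens."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def nucleotide_tokens(read):
    """One token per base; over A/C/G/T the vocabulary has at most 4 tokens."""
    return list(read)


# Hypothetical short read.
read = "ACGTA"
as_kmers = kmer_tokens(read, k=3)
as_bases = nucleotide_tokens(read)
```

For the sample read, the 3-mer view yields three overlapping tokens while the nucleotide view yields five single-base tokens drawn from a much smaller vocabulary.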

The results suggest comparable performance between the proposed method and the reference-based method for colorectal cancer classification. For the type-2 diabetes dataset, the reference-based method performs significantly better. In regression tasks to predict various metrics associated with type-2 diabetes, the proposed representation was comparable to the reference-based method for some phenotypes but lacked flexibility in others, indicating that the applicability of the proposed approach strongly depends on the objective, dataset, and target phenotype.

PCADA: Partial Correlation Aware Data Augmentation for random forest classifier

Bachelor thesis (2022) - Oskar Lorek (author) , A. Ionescu (mentor) , R. Hai (mentor) , D.H.J. Epema (graduation committee member)

Machine learning models require rich, quality data sets to achieve high accuracy. With the current exponential growth of generated data, it is becoming increasingly hard to prepare high-quality tables within a reasonable time frame. To combat this issue automated data augmentation ...

From Feature Selection to Data Augmentation: the ADA Algorithm

Bachelor thesis (2022) - E. Cruset Pla (author) , R. Hai (mentor) , Andra Ionescu (mentor) , D.H.J. Epema (graduation committee member)

The democratization of data science, and in particular of the machine learning pipeline, has focused on the automation of model selection, feature processing, and hyperparameter tuning. Nevertheless, the need for high-quality data for increased performance has sparked interest in ...

Efficient and effective feature discovery for CART decision tree model

Bachelor thesis (2022) - A.B.C. Bien (author) , R. Hai (mentor)

A common challenge in feature discovery and feature selection is the trade-off between effectiveness and efficiency. The paper proposes a solution that is efficient and effective at ranking features for feature discovery.
This paper aims to improve feature discovery tech ...

Automatic feature augmentation ranking: XGBoost

Bachelor thesis (2022) - O.L.C. Neut (author) , Andra Ionescu (mentor) , R. Hai (mentor) , D.H.J. Epema (graduation committee member)

Automatic machine learning is a subfield of machine learning that automates the common procedures faced in predictive tasks. One such procedure is automatic data augmentation, where one desires to enrich the existing data to increase model performance. In relationa ...

Cost Estimation for Factorized Machine Learning in Data Integration Scenarios

Master thesis (2022) - J. van Schijndel (author) , R. Hai (mentor) , A Katsifodimos (mentor) , Y. Chen (graduation committee member) , G.J.P.M. Houben (graduation committee member)

The workflow of a data science practitioner includes gathering information from different sources and applying machine learning (ML) models. Such dispersed information can be combined through a process known as Data Integration (DI), which defines relations between entities and a ...