R. Hai
16 records found
We are witnessing a paradigm shift in machine learning (ML) and artificial intelligence (AI) from a focus primarily on innovating ML models, the model-centric paradigm, to prioritising high-quality, reliable data for AI/ML applications, the data-centric paradigm. This emphasis on
...
In the realm of software development, commit messages are vital for understanding code changes, enhancing maintainability, and improving collaboration. Despite their importance, generating high-quality commit messages remains a challenging task, with existing methods often facing
...
Finding the Needle in the Pre-Trained Model Zoo
The Use of Rich Metadata and Graph Learning to Estimate Task Transferability
The democratization of machine learning through public repositories, often known as model zoos, has significantly increased the availability of pre-trained models for practitioners. However, this abundance can make it difficult to choose the most suitable pre-trained model for fi
...
In many scientific fields, time series data is essential, yet maintaining the integrity and legitimacy of such data is still difficult. Traditional watermarking techniques have mainly been used for multimedia. Although approaches for watermarking non-media data have been develo
...
Optimizing Database Joins
Cost Models and Benchmarking for CPU and GPU Systems
Optimizing SQL query execution through effective cost models is a critical challenge in database management systems (DBMS). This thesis introduces a modular benchmarking system for cost models, with a pluggable architecture for both cost models and execution engines, enabling com
...
In the realm of machine learning (ML), the need for efficiency in training processes is paramount. The conventional first step in an ML workflow involves collecting data from various sources and merging them into a single table, a process known as materialization, which can intro
...
LLM-PQA
A Natural Language Interface for Machine Learning Model Selection and Training
Many domain experts encounter significant challenges in leveraging machine learning due to the technical complexity of model selection and development. This thesis presents LLM-PQA, a system that enables natural language interaction with machine learning functionalities through l
...
Contract review is a critical yet time-consuming process in legal practice, with significant financial implications when errors occur. While Large Language Models (LLMs) have shown promise in legal document processing, they still face challenges with lengthy contracts and complex
...
In recent years, the rapid advancements in big data, machine learning, and artificial intelligence have led to a corresponding rise in privacy concerns. One of the solutions to address these concerns is federated learning. In this thesis, we will look at the setting of vertical f
...
Enriching Machine Learning Model Metadata
Collecting performance metadata through automatic evaluation
As the sharing of machine learning (ML) models has increased in popularity, more so-called model zoos have been created. These repositories facilitate the sharing of models and their metadata, and allow other people to find and re-use existing models. However, the metadata provided for mod
...
Metagenomic Next-Generation Sequencing (mNGS) presents a promising avenue for generating a massive volume of sequence reads in a short period of time. This has opened opportunities for disease diagnosis based on individual variations and mutations by considering the microbiome profile
...
Machine learning models require rich, high-quality data sets to achieve high accuracy. With the current exponential growth of generated data, it is becoming increasingly hard to prepare high-quality tables within a reasonable time frame. To combat this issue, automated data augmentation
...
The democratization of data science, and in particular of the machine learning pipeline, has focused on the automation of model selection, feature processing, and hyperparameter tuning. Nevertheless, the need for high-quality data for increased performance has sparked interest in
...
A common challenge in feature discovery and feature selection is the trade-off between effectiveness and efficiency. The paper proposes a solution that is efficient and effective at ranking features for feature discovery.
This paper aims to improve feature discovery tech ...
Automatic machine learning is a subfield of machine learning that automates the common procedures faced in predictive tasks. One such procedure is automatic data augmentation, where one desires to enrich the existing data to increase model performance. In relationa
...
The workflow of a data science practitioner includes gathering information from different sources and applying machine learning (ML) models. Such dispersed information can be combined through a process known as Data Integration (DI), which defines relations between entities and a
...