R. Hai | TU Delft Repository

Accelerating machine learning queries with linear algebra query processing

Journal article (2025) - Wenbo Sun (author) , Asterios Katsifodimos (author) , R. Hai (author)

The rapid growth of large-scale machine learning (ML) models has led numerous commercial companies to utilize ML models for generating predictive results to help business decision-making. As two primary components in traditional predictive pipelines, data processing, and model pr ...

Human-in-the-Loop Feature Discovery for Tabular Data

Conference paper (2024) - A. Ionescu (author) , Zeger Mouw (author) , E. Aivaloglou (author) , Rihan Hai (author) , Asterios Katsifodimos (author)

In recent years, researchers have developed several methods to automate discovering datasets and augmenting features for training Machine Learning (ML) models. Together with feature selection, these efforts have paved the way towards what is termed the feature discovery process. ...

Amalur

The Convergence of Data Integration and Machine Learning

Journal article (2024) - Ziyu Li (author) , Wenbo Sun (author) , D. Zhan (author) , Yan Kang (author) , Y. Chen (author) , A. Bozzon (author) , Rihan Hai (author)

Machine learning (ML) training data is often scattered across disparate collections of datasets, called data silos. This fragmentation poses a major challenge for data-intensive ML applications: integrating and transforming data residing in different ...

SiloFuse

Cross-silo Synthetic Data Generation with Latent Tabular Diffusion Models

Conference paper (2024) - A. Shankar (author) , J.C. Brouwer (author) , R. Hai (author) , Lydia Y. Chen (author)

Synthetic tabular data is crucial for sharing and augmenting data across silos, especially for enterprises with proprietary data. However, existing synthesizers are designed for centrally stored data. Hence, they struggle with real-world scenarios where features are distributed a ...

LLM-PQA

LLM-enhanced Prediction Query Answering

Conference paper (2024) - Z. Li (author) , Wenjie Zhao (author) , Asterios Katsifodimos (author) , R. Hai (author)

The advent of Large Language Models (LLMs) provides an opportunity to change the way queries are processed, moving beyond the constraints of conventional SQL-based database systems. However, using an LLM to answer a prediction query is still challenging, since an external ML mode ...

Model Selection with Model Zoo via Graph Learning

Conference paper (2024) - Z. Li (author) , Hilco Van Der Wilk (author) , D. Zhan (author) , M. Khosla (author) , A. Bozzon (author) , Rihan Hai (author)

Pre-trained deep learning (DL) models are increasingly accessible in public repositories, i.e., model zoos. Given a new prediction task, finding the best model to fine-tune can be computationally intensive and costly, especially when the number of pre-trained models is large. Sel ...

Quantum Data Management

From Theory to Opportunities

Conference paper (2024) - Rihan Hai (author) , Shih Han Hung (author) , S. Feld (author)

Quantum computing has emerged as a transformative tool for future data management. Classical problems in database domains, including query optimization, data integration, and transaction management, have recently been addressed using quantum computing techniques. This tutorial ai ...

Macaroni: Crawling and Enriching Metadata from Public Model Zoos

Conference paper (2023) - Z. Li (author) , R. Hai (author) , A Katsifodimos (author) , Alessandro Bozzon (author)

Machine learning (ML) researchers and practitioners are building repositories of pre-trained models, called model zoos. These model zoos contain metadata that detail various properties of the ML models and datasets, which are useful for reporting, auditing, reproducibility, and i ...

Amalur

Data Integration Meets Machine Learning

Conference paper (2023) - R. Hai (author) , Christos Koutras (author) , A. Ionescu (author) , Ziyu Li (author) , Wenbo Sun (author) , Jessie van Schijndel (author) , Yan Kang (author) , A. Katsifodimos (author)

Machine learning (ML) training data is often scattered across disparate collections of datasets, called data silos. This fragmentation poses a major challenge for data-intensive ML applications: integrating and transforming data residing in different sources demand a lot of manua ...

Accelerating Machine Learning Queries with Linear Algebra Query Processing

Conference paper (2023) - Wenbo Sun (author) , Asterios Katsifodimos (author) , R. Hai (author)

The rapid growth of large-scale machine learning (ML) models has led numerous commercial companies to utilize ML models for generating predictive results to help business decision-making. As two primary components in traditional predictive pipelines, data processing, and model pr ...

An Empirical Performance Comparison between Matrix Multiplication Join and Hash Join on GPUs

Conference paper (2023) - Wenbo Sun (author) , Asterios Katsifodimos (author) , R. Hai (author)

Recent advances in Graphics Processing Units (GPUs) have facilitated a significant performance boost for database operators, in particular, joins. It has been intensively studied how conventional join implementations, such as hash joins, benefit from the massive parallelism of GPU ...

Optimizing Machine Learning Inference Queries for Multiple Objectives

Conference paper (2023) - Z. Li (author) , Mariette Schonfeld (author) , R. Hai (author) , Alessandro Bozzon (author) , A Katsifodimos (author)

Given a set of pre-trained Machine Learning (ML) models, can we solve complex analytic tasks that make use of those models by formulating ML inference queries? Can we mitigate different tradeoffs, e.g., high accuracy, low execution costs and memory footprint, when optimizing the ...

Metadata Representations for Queryable Repositories of Machine Learning Models

Journal article (2023) - Z. Li (author) , Henk Kant (author) , R. Hai (author) , A Katsifodimos (author) , Marco Brambilla (author) , Alessandro Bozzon (author)

Machine learning (ML) practitioners and organizations are building model repositories of pre-trained models, referred to as model zoos. These model zoos contain metadata describing the properties of the ML models and datasets. The metadata serves crucial roles for reporting, audi ...

Data Lakes

A Survey of Functions and Systems

Journal article (2023) - R. Hai (author) , C. Koutras (author) , Christoph Quix (author) , Matthias Jarke (author)

Data lakes are becoming increasingly prevalent for Big Data management and data analytics. In contrast to traditional 'schema-on-write' approaches such as data warehouses, data lakes are repositories storing raw data in its original formats and providing a common access interface ...

Topio Marketplace: Search and Discovery of Geospatial Data

Conference paper (2023) - Andra Ionescu (author) , Alexandra Alexandridou (author) , K. Psarakis (author) , Kostas Patroumpas (author) , Georgios Chatzigeorgakidis (author) , Dimitrios Skoutas (author) , Spiros Athanasiou (author) , Rihan Hai (author) , A Katsifodimos (author)

The increasing need for data trading has created a high demand for data marketplaces. These marketplaces require a set of value-added services, such as advanced search and discovery, that have been proposed in the database research community for years, but are yet to be put to pra ...

Optimizing ML Inference Queries Under Constraints

Conference paper (2023) - Z. Li (author) , Wenbo Sun (author) , Rihan Hai (author) , Alessandro Bozzon (author) , A Katsifodimos (author)

The proliferation of pre-trained ML models in public Web-based model zoos facilitates the engineering of ML pipelines to address complex inference queries over datasets and streams of unstructured content. Constructing an optimal plan for a query is hard, especially when constraints ...

Join Path-Based Data Augmentation for Decision Trees

Conference paper (2022) - A. Ionescu (author) , R. Hai (author) , Marios Fragkoulis (author) , A Katsifodimos (author)

Machine Learning (ML) applications require high-quality datasets. Automated data augmentation techniques can help increase the richness of training data, thus increasing the ML model accuracy. Existing solutions focus on efficiency and ML model accuracy but do not exploit the ric ...

Amalur

Next-generation Data Integration in Data Lakes

Abstract (2022) - Rihan Hai (author) , Christos Koutras (author) , Andra Ionescu (author) , A. Katsifodimos (author)

Data science workflows often require extracting, preparing and integrating data from multiple data sources. This is a cumbersome and slow process: most of the time, data scientists prepare data in a data processing system or a data lake, and export it as a table, in order for it ...

Metadata Representations for Queryable ML Model Zoos

Conference paper (2022) - Z. Li (author) , R. Hai (author) , Alessandro Bozzon (author) , A Katsifodimos (author)

Machine learning (ML) practitioners and organizations are building model zoos of pre-trained models, containing metadata describing properties of the ML models and datasets that are useful for reporting, auditing, reproducibility, and interpretability purposes. The metadata is cu ...

Dynamic Digital Twin

Diagnosis, Treatment, Prediction, and Prevention of Disease During the Life Course

Journal article (2022) - S.T. Mulder (author) , Amir-Houshang Omidvari (author) , A.J. Rueten-Budde (author) , R. Hai (author) , Ömer Can Akgün (author) , David Tax (author) , Marcel Reinders (author) , Marcel J.T. Reinders (author) , V. T. Visch (author) , G.B. More Authors (author)

A digital twin (DT), originally defined as a virtual representation of a physical asset, system, or process, is a new concept in health care. A DT in health care is not a single technology but a domain-adapted multimodal modeling approach incorporating the acquisition, management ...