R. Hai

34 records found

SiloFuse

Cross-silo Synthetic Data Generation with Latent Tabular Diffusion Models

Synthetic tabular data is crucial for sharing and augmenting data across silos, especially for enterprises with proprietary data. However, existing synthesizers are designed for centrally stored data. Hence, they struggle with real-world scenarios where features are distributed a ...

Amalur

The Convergence of Data Integration and Machine Learning

Machine learning (ML) training data is often scattered across disparate collections of datasets, called data silos. This fragmentation poses a major challenge for data-intensive ML applications: integrating and transforming data residing in different ...

Quantum Data Management

From Theory to Opportunities

Quantum computing has emerged as a transformative tool for future data management. Classical problems in database domains, including query optimization, data integration, and transaction management, have recently been addressed using quantum computing techniques. This tutorial ai ...
Pre-trained deep learning (DL) models are increasingly accessible in public repositories, i.e., model zoos. Given a new prediction task, finding the best model to fine-tune can be computationally intensive and costly, especially when the number of pre-trained models is large. Sel ...

LLM-PQA

LLM-enhanced Prediction Query Answering

The advent of Large Language Models (LLMs) provides an opportunity to change the way queries are processed, moving beyond the constraints of conventional SQL-based database systems. However, using an LLM to answer a prediction query is still challenging, since an external ML mode ...
In recent years, researchers have developed several methods to automate discovering datasets and augmenting features for training Machine Learning (ML) models. Together with feature selection, these efforts have paved the way towards what is termed the feature discovery process. ...
Recent advances in Graphics Processing Units (GPUs) have facilitated a significant performance boost for database operators, in particular, joins. It has been intensively studied how conventional join implementations, such as hash joins, benefit from the massive parallelism of GPU ...
The proliferation of pre-trained ML models in public Web-based model zoos facilitates the engineering of ML pipelines to address complex inference queries over datasets and streams of unstructured content. Constructing an optimal plan for a query is hard, especially when constraints ...
Machine learning (ML) practitioners and organizations are building model repositories of pre-trained models, referred to as model zoos. These model zoos contain metadata describing the properties of the ML models and datasets. The metadata serves crucial roles for reporting, audi ...
The increasing need for data trading has created a high demand for data marketplaces. These marketplaces require a set of value-added services, such as advanced search and discovery, that have been proposed in the database research community for years, but are yet to be put to pra ...
Machine learning (ML) researchers and practitioners are building repositories of pre-trained models, called model zoos. These model zoos contain metadata that detail various properties of the ML models and datasets, which are useful for reporting, auditing, reproducibility, and i ...

Data Lakes

A Survey of Functions and Systems

Data lakes are becoming increasingly prevalent for Big Data management and data analytics. In contrast to traditional 'schema-on-write' approaches such as data warehouses, data lakes are repositories storing raw data in its original formats and providing a common access interface ...
Given a set of pre-trained Machine Learning (ML) models, can we solve complex analytic tasks that make use of those models by formulating ML inference queries? Can we mitigate different tradeoffs, e.g., high accuracy, low execution costs and memory footprint, when optimizing the ...

Amalur

Data Integration Meets Machine Learning

Machine learning (ML) training data is often scattered across disparate collections of datasets, called data silos. This fragmentation poses a major challenge for data-intensive ML applications: integrating and transforming data residing in different sources demand a lot of manua ...
The rapid growth of large-scale machine learning (ML) models has led numerous commercial companies to utilize ML models for generating predictive results to help business decision-making. As two primary components in traditional predictive pipelines, data processing, and model pr ...
Machine learning (ML) practitioners and organizations are building model zoos of pre-trained models, containing metadata describing properties of the ML models and datasets that are useful for reporting, auditing, reproducibility, and interpretability purposes. The metadata is cu ...
Machine Learning (ML) applications require high-quality datasets. Automated data augmentation techniques can help increase the richness of training data, thus increasing the ML model accuracy. Existing solutions focus on efficiency and ML model accuracy but do not exploit the ric ...

Amalur

Next-generation Data Integration in Data Lakes

Data science workflows often require extracting, preparing and integrating data from multiple data sources. This is a cumbersome and slow process: most of the time, data scientists prepare data in a data processing system or a data lake, and export it as a table, in order for it ...

Dynamic Digital Twin

Diagnosis, Treatment, Prediction, and Prevention of Disease During the Life Course

A digital twin (DT), originally defined as a virtual representation of a physical asset, system, or process, is a new concept in health care. A DT in health care is not a single technology but a domain-adapted multimodal modeling approach incorporating the acquisition, management ...
This chapter introduces the most important features of data lake systems, and from there it outlines an architecture for these systems. The vision for a data lake system is based on a generic and extensible architecture with a unified data model, facilitating the ingestion, stora ...