A. Katsifodimos
75 records found
Event Horizon
Asymmetric Dependencies for Fast Geo-Distributed Operations
Low-latency geo-distributed applications currently face the barrier of cross-site coordination for ensuring state consistency. Existing mixed-consistency models leverage the existence of strongly- and weakly-consistent operations in a given application, to avoid coordination whenever possible. However, existing approaches are rather pessimistic, coordinating more than is necessary. In this paper, we introduce Semi-Linearizability (SL): a consistency model that executes application operations with linearizability guarantees only when strictly necessary, avoiding over-coordination. Specifically, we propose novel operation semantics that can encode ordering relationships between application operations and map them to coordination primitives. Our proposed semantics can be used to reason over latent, asymmetric dependencies between different operations and optimize their execution. We show how SL enables a new class of safe, uncoordinated operations that previous models would otherwise execute under globally strict order, while offering substantial performance gains without violating application invariants. To demonstrate the advantages of SL, we implemented DeMon, a system that achieves four orders of magnitude lower latency on the most frequent operation in the widely used RUBiS benchmark compared to state-of-the-art systems.
In this paper, we argue that the principles behind the streaming dataflow execution model and deterministic transactional protocols provide a powerful and suitable substrate for executing transactional cloud applications. To this end, we introduce Styx, a transactional application runtime based on streaming dataflows that enables an object-oriented programming model for scalable, fault-tolerant cloud applications with serializable guarantees.
Cascade
From Imperative Code to Stateful Dataflows
This paper evaluates state-of-the-art control-based autoscaling solutions under diverse, dynamic workloads, applying specific metrics. We investigate different aspects of the autoscaling problem, such as performance and convergence. Our experiments reveal that current control-based autoscaling techniques fail to account for the lag cost generated by rescaling or underprovisioning and cannot efficiently handle practical scenarios of intensely dynamic workloads. Unexpectedly, we discovered that an autoscaling method not tailored for streaming can outperform others in certain scenarios.
LLM-PQA
LLM-enhanced Prediction Query Answering
The advent of Large Language Models (LLMs) provides an opportunity to change the way queries are processed, moving beyond the constraints of conventional SQL-based database systems. However, using an LLM to answer a prediction query is still challenging, since an external ML model has to be employed and inference has to be performed in order to provide an answer. This paper introduces LLM-PQA, a novel tool that addresses prediction queries formulated in natural language. LLM-PQA is the first to combine the capabilities of LLMs and a retrieval-augmented mechanism for the needs of prediction queries by integrating data lakes and model zoos. This integration provides users with access to a vast spectrum of heterogeneous data and diverse ML models, facilitating dynamic prediction query answering. In addition, LLM-PQA can dynamically train models on demand, based on specific query requirements, ensuring reliable and relevant results even when no pre-trained model in the model zoo is available for the task.
Multiple works in data management research focus on automating the processes of data augmentation and feature discovery to save users from having to perform these tasks manually. Yet, this automation often leads to a disconnect with the users, as it fails to consider the specific needs and preferences of the actual end-users of data management systems for machine learning. To explore this issue further, we conducted 19 semi-structured, think-aloud use-case studies based on a scenario in which data specialists were tasked with augmenting a base table with additional features to train a machine learning model. In this paper, we share key insights into the practices of feature discovery on tabular data performed by real-world data specialists derived from our user study. Our research uncovered differences between the user assumptions reported in the literature and the actual practices, as well as some areas where literature and real-world practices align.
In this paper, we introduce the first user-driven human-in-the-loop feature discovery method called HILAutoFeat. We demonstrate the capabilities of HILAutoFeat, which effectively combines automated feature discovery with user-driven insights. Our demonstration is centred around two scenarios: (i) an automated feature discovery scenario -- HILAutoFeat acts as a steward in a large data lake where the user is unaware of the quality and relevance of the data, and (ii) a scenario where HILAutoFeat and the user work together -- the user drives the feature discovery process by adding their domain and business knowledge, while HILAutoFeat performs the intensive computations.
Stream processing has been an active research field for more than 20 years, but it is now witnessing its prime time due to recent successful efforts by the research community and numerous worldwide open-source communities. This survey provides a comprehensive overview of fundamental aspects of stream processing systems and their evolution in the functional areas of out-of-order data management, state management, fault tolerance, high availability, load management, elasticity, and reconfiguration. We review noteworthy past research findings, outline the similarities and differences between the first (’00–’10) and second (’11–’23) generation of stream processing systems, and discuss future trends and open problems.
Amalur
Data Integration Meets Machine Learning
Machine learning (ML) training data is often scattered across disparate collections of datasets, called data silos. This fragmentation poses a major challenge for data-intensive ML applications: integrating and transforming data residing in different sources demands a lot of manual work and computational resources. With data privacy and security constraints, data often cannot leave the premises of data silos, hence model training should proceed in a decentralized manner. In this work, we present a vision of how to bridge the traditional data integration (DI) techniques with the requirements of modern machine learning. We explore the possibilities of utilizing metadata obtained from data integration processes for improving the effectiveness and efficiency of ML models. Towards this direction, we analyze two common use cases over data silos: feature augmentation and federated learning. Bringing data integration and machine learning together, we highlight new research opportunities from the aspects of systems, representations, factorized learning and federated learning.
Machine learning (ML) practitioners and organizations are building model repositories of pre-trained models, referred to as model zoos. These model zoos contain metadata describing the properties of the ML models and datasets. The metadata serves crucial roles for reporting, auditing, ensuring reproducibility, and enhancing interpretability. Despite the growing adoption of descriptive formats like datasheets and model cards, the metadata available in existing model zoos remains notably limited. Moreover, existing formats have limited expressiveness, which constrains the potential of model repositories to extend their purpose beyond mere storage for pre-trained models. This paper proposes a unified metadata representation format for model zoos. We illustrate that comprehensive metadata enables a diverse range of applications, encompassing model search, reuse, comparison, and composition of ML models. We also detail the design and highlight the implementation of an advanced model zoo system built on top of our proposed metadata representation.
In this work, we evaluate autoscaling solutions for stream processing engines. Although autoscaling has become a mainstream subject of research in the last decade, the database research community has yet to evaluate different autoscaling techniques under a proper benchmarking setting and evaluation framework. As a result, every newly proposed autoscaling solution only performs a shallow performance evaluation and comparison against existing solutions. In this paper, we evaluate autoscaling solutions by employing two streaming queries and a dynamic workload that follows a cosine pattern. Our experiments reveal that current autoscaling techniques fail to account for the lag generated by rescaling or underprovisioning and cannot efficiently handle practical scenarios of intensely dynamic workloads.
The rapid growth of large-scale machine learning (ML) models has led numerous commercial companies to utilize ML models for generating predictive results to help business decision-making. The two primary components of traditional predictive pipelines, data processing and model prediction, often operate in separate execution environments, leading to redundant engineering and computations. Additionally, the diverging mathematical foundations of data processing and machine learning hinder cross-optimizations that combine these two components, thereby overlooking potential opportunities to expedite predictive pipelines. In this paper, we propose an operator fusing method based on GPU-accelerated linear algebraic evaluation of relational queries. Our method leverages linear algebra computation properties to merge operators in machine learning predictions and data processing, significantly accelerating predictive pipelines by up to 317x. We perform a complexity analysis to deliver quantitative insights into the advantages of operator fusion, considering various data and model dimensions. Furthermore, we extensively evaluate matrix multiplication query processing utilizing the widely-used Star Schema Benchmark. Through comprehensive evaluations, we demonstrate the effectiveness and potential of our approach in improving the efficiency of data processing and machine learning workloads on modern hardware.
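As a rough illustration of the idea in the abstract above, the following toy sketch expresses a key/foreign-key join as a multiplication by a one-hot matrix and then uses associativity to apply a linear model before the join instead of after it. It is not the paper's method: it runs on the CPU in plain Python, and the table contents, the weights, and the helper names `matmul` and `one_hot_join_matrix` are all hypothetical, chosen only for illustration.

```python
# Toy sketch (assumed data and helper names, not taken from the paper).

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def one_hot_join_matrix(foreign_keys, n_dim_rows):
    """Row i holds a single 1.0 in the column of the dimension row that fact row i joins with."""
    return [[1.0 if key == j else 0.0 for j in range(n_dim_rows)] for key in foreign_keys]

# Hypothetical dimension table: 3 rows, 2 numeric features each.
dim_features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
# Hypothetical linear model: one weight per feature, stored as a column vector.
weights = [[0.5], [-1.0]]
# Hypothetical fact table: each entry is a foreign key into dim_features.
fact_foreign_keys = [2, 0, 0, 1]

J = one_hot_join_matrix(fact_foreign_keys, len(dim_features))

# Unfused: materialize the join result first, then run the model on every joined row.
unfused = matmul(matmul(J, dim_features), weights)

# Fused: score each dimension row once, then let the join matrix select the scores.
fused = matmul(J, matmul(dim_features, weights))

print(unfused == fused)
```

In this sketch the fused order runs the model on three dimension rows rather than four joined rows; the gap grows when the fact table is much larger than the dimension table, which is the kind of saving a complexity analysis over data and model dimensions would quantify.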