A. Katsifodimos

64 records found

While database systems have matured significantly over the past few decades, the rapid growth of real-time analytics to support quick decision making has paved the way for multipurpose, high-performance systems. As stream processing also matures, it is of interest to explore its ful ...

Heuristic Optimization of Amazon Redshift Table Configurations

Focusing on Distribution Style, Sort Keys and Column Encodings in Amazon Redshift

This thesis presents a comprehensive, heuristic cost-driven framework for optimizing database table configuration in Amazon Redshift, focusing on distribution styles, sort keys and column encodings. Unlike existing approaches that treat optimization parameters independently, this ...

Stateful Functions-as-a-Service (SFaaS) platforms, such as Styx, are emerging as powerful abstractions for building distributed, serverless cloud applications. By combining the abilities of FaaS with strong transactional guarantees, they enable complex, stateful workflows without ...

Building scalable and consistent cloud applications is notoriously difficult due to the challenges of state management and execution consistency in distributed environments. Functions-as-a-Service (FaaS) platforms offer flexible scalability, but weak execution guarantees force e ...

Benchmarking Geo-distributed Databases

Evaluating Performance using the Product-Parts-Supplier Workload

Existing evaluations of geo-distributed databases still rely almost exclusively on standard limited workloads such as TPC-C and YCSB+T, which reveal little information about the true cost of wide-area coordination. In this paper, we present a configurable benchmarking framework b ...

Benchmarking geo-distributed databases

Evaluation using the DeathStar hotel reservation benchmark

As modern applications become more global and resource intensive, geo-distributed databases have become critical for fast, reliable data storage. Evaluating the performance of these databases through traditional benchmarks such as TPC-C and YCSB-T is not sufficient to expose all ...

DeathStar Movie for Geo-Distributed Databases

Stressing databases using a movie review site

Geo-distributed databases offer the scalability and low latency that contemporary applications demand, but are challenging to implement. It is therefore crucial that they are tested well. Established benchmarks, such as TPC-C and YCSB-T, are limited and do not cover the entire se ...

Benchmarking geo-distributed databases

Evaluation using the SmallBank benchmark

In recent years, applications have started using geo-distributed databases, even though their behavior under different workloads remains complex. Therefore, this project analyses how several databases handle transactional workloads using the SmallBank benchmark. We implement and ...

MovR as a Benchmark for Geo-Distributed Databases

Performance Evaluation and Insights

Distributed systems are vital for handling large-scale data and rely on geo-distributed databases to ensure low latency and high availability. Traditional benchmarks, such as TPC-C and YCSB-T, are not designed to handle the complexities of geo-distributed environments and do not ...

This thesis investigates the effectiveness and efficiency of embedding-based drift detection in machine learning systems, focusing on synthetic simulations and real-world production data. Through controlled experiments, we compare vector-based and distribution-based metrics regar ...

We are witnessing a paradigm shift in machine learning (ML) and artificial intelligence (AI) from a focus primarily on innovating ML models, the model-centric paradigm, to prioritising high-quality, reliable data for AI/ML applications, the data-centric paradigm. This emphasis on ...

In the digital era, XML data is fundamental for various applications, requiring robust methods to ensure data integrity and security. Traditional digital watermarking techniques face challenges due to XML's hierarchical structure. Zero-watermarking, which derives a watermark from ...

In the digital age, the proliferation of personal data within databases has made them prime targets for cyberattacks. As the volume of data increases, so does the frequency and sophistication of these attacks. This thesis investigates database security threats by deploying open s ...

Security researchers and industry firms employ Internet-wide scanning for information collection, vulnerability detection and security evaluation, while cybercriminals make use of it to find and attack unsecured devices. Internet scanning plays a considerable role in threat ...

The advancement of artificial intelligence (AI) has led to an increased demand for both a greater volume and quality of data. In many companies, data is dispersed across multiple tables, yet AI models typically require data in a single table format. This necessitates the merging ...

This thesis embarks on the quest to efficiently compute similarities between data streams in real-time, a task burgeoning in importance with the advent of big data and real-time analytics. At the heart of this endeavor is the expansion of the Condor framework to accommodate new p ...

Over the last two decades, the machine learning (ML) field has witnessed a dramatic expansion, propelled by burgeoning data volumes and the advancement of computational technologies. Deep learning (DL) in particular has demonstrated remarkable success across a wide range of domai ...

Data processing has heavily evolved in the last two decades, from single-node processing to distributed processing and from the MapReduce paradigm to the stream processing paradigm. At the same time, cloud computing has emerged as the primary means of deploying and operating a da ...

Schema matching is a critical data integration process, which aims at capturing relevance between elements of different datasets; when datasets are tabular, it translates to the process of discovering related columns among them. Accurately discovering column matches is integral f ...

Similarity joins are operations which involve identifying similar pairs of records within one or multiple datasets. These operations are typically time-sensitive, as timely identification of relations can lead to increased profitability. Therefore, it is advantageous to analyze t ...