A. Katsifodimos | TU Delft Repository

Global State Queries in Stream Processing

Master thesis (2025) - M.S. Patil (author) , A. Katsifodimos (mentor) , A. Voulimeneas (graduation committee member)

While database systems have matured significantly over the past few decades, the rapid growth of real-time analytics to feed quick decision making has paved the way for multipurpose, high-performance systems. As stream processing also matures, it is of interest to explore its ful ...

Heuristic Optimization of Amazon Redshift Table Configurations

Focusing on Distribution Style, Sort Keys and Column Encodings in Amazon Redshift

Master thesis (2025) - X.L. Hu (author) , N. Yorke-Smith (mentor) , A. Katsifodimos (mentor) , C. Lofi (graduation committee member) , Derek van den Broek (mentor)

This thesis presents a comprehensive, heuristic cost-driven framework for optimizing database table configuration in Amazon Redshift focusing on distribution styles, sort keys and column encodings. Unlike existing approaches that treat optimization parameters independently, this research develops a sequential optimization methodology that captures complex interdependencies between configuration choices and their performance impacts across different data scales.

The study addresses four research questions examining individual parameter optimization strategies and their integrated effects on system performance. The experimental evaluation employs two datasets: a primary table containing 300 million data records across 23 columns where optimization is performed, and a secondary join table with 117.5 million data records across 12 columns that remains unchanged. Scale-dependent analysis is conducted using subsets of 10 million and 100 million data records selected from the primary dataset to enable controlled comparison across different data volumes.

Key findings demonstrate that table configuration optimization in Amazon Redshift exhibits pronounced scale-dependent performance characteristics across the experimental datasets, with three distinct performance regimes identified: a small-scale regime (10M data records) characterized by query-type dependent optimization effectiveness, a medium-scale regime (100M data records) showing optimization trade-off transitions, and a large-scale regime (300M data records) dominated by I/O and storage optimizations. The research reveals mixed optimization outcomes, with performance improvements ranging from 21 percent CPU reduction at small scales to 62 percent I/O improvement at large scales, while demonstrating that optimization strategies effective at one scale can become counterproductive at another. Overall, the framework shows variable success in parameter selection for distribution style, sort key and encoding selection.

The research identifies fundamental challenges in optimizing Amazon Redshift table configurations where internal algorithms remain opaque and optimization benefits exhibit non-linear scaling patterns across the tested data volumes. While the framework provides valuable insights into scale-dependent optimization patterns, the mixed results highlight the complexity of achieving consistent performance improvements across different scales and query types. These findings challenge assumptions about uniform optimization benefits and emphasize the need for empirical validation approaches in cloud database optimization, providing practical insights for database administrators and theoretical foundations for developing adaptive optimization systems.

Global-State Querying in Stream Processing using Snapshots

Master thesis (2025) - S.S. Kshirsagar (author) , A. Katsifodimos (mentor) , K. Psarakis (mentor) , G.C. Christodoulou (mentor) , George Iosifidis (graduation committee member) , Burcu Kulahcioglu Ozkan (graduation committee member)

Stateful Functions-as-a-Service (SFaaS) platforms, such as Styx, are emerging as powerful abstractions for building distributed, serverless cloud applications. By combining the abilities of FaaS with strong transactional guarantees, they enable complex, stateful workflows without ...

An Intermediate Representation for Stateful Dataflows

Master thesis (2025) - L. Van Mol (author) , M. Schutte (mentor) , G.C. Christodoulou (mentor) , Asterios Katsifodimos (mentor) , Soham Chakraborty (graduation committee member)

Building scalable and consistent cloud applications is notoriously difficult due to the challenges of state management and execution consistency in distributed environments. Functions-as-a-Service (FaaS) platforms offer flexible scalability, but weak execution guarantees force e ...

Benchmarking Geo-distributed Databases

Evaluating Performance using the Product-Parts-Supplier Workload

Bachelor thesis (2025) - E. Mihai (author) , Asterios Katsifodimos (mentor) , O. Mráz (mentor) , Koen Langendoen (graduation committee member)

Existing evaluations of geo-distributed databases still rely almost exclusively on standard limited workloads such as TPC-C and YCSB+T, which reveal little information about the true cost of wide-area coordination. In this paper, we present a configurable benchmarking framework b ...

MovR as a Benchmark for Geo-Distributed Databases

Performance Evaluation and Insights

Bachelor thesis (2025) - W.P.A. Marcu (author) , Asterios Katsifodimos (mentor) , O. Mráz (mentor) , G.C. Christodoulou (mentor) , K. Psarakis (mentor) , Koen Langendoen (graduation committee member)

Distributed systems are vital for handling large-scale data and rely on geo-distributed databases to ensure low latency and high availability. Traditional benchmarks, such as TPC-C and YCSB-T, are not designed to handle the complexities of geo-distributed environments and do not ...

DeathStar Movie for Geo-Distributed Databases

Stressing databases using a movie review site

Bachelor thesis (2025) - S.E. van den Houten (author) , O. Mráz (mentor) , Asterios Katsifodimos (mentor) , Koen Langendoen (graduation committee member)

Geo-distributed databases offer the scalability and low latency that contemporary applications demand, but are challenging to implement. It is therefore crucial that they are tested well. Established benchmarks, such as TPC-C and YCSB-T, are limited and do not cover the entire se ...

Benchmarking geo-distributed databases

Evaluation using the SmallBank benchmark

Bachelor thesis (2025) - F. Cirtog (author) , Asterios Katsifodimos (mentor) , O. Mráz (mentor) , Koen Langendoen (graduation committee member)

In recent years, applications have started using geo-distributed databases, even though their behavior under different workloads remains complex. Therefore, this project analyses how several databases handle transactional workloads using the SmallBank benchmark. We implement and ...

Benchmarking geo-distributed databases

Evaluation using the DeathStar hotel reservation benchmark

Bachelor thesis (2025) - A.J. Eickhoff (author) , O. Mráz (mentor) , Asterios Katsifodimos (mentor) , Koen Langendoen (graduation committee member)

As modern applications become more global and resource intensive, geo-distributed databases have become critical for fast, reliable data storage. Evaluating the performance of these databases through traditional benchmarks such as TPC-C and YCSB-T is not sufficient to expose all ...

Balancing Efficiency and Sensitivity in Embedding-Based Concept Drift Detection for Deep Learning

Master thesis (2025) - J. Bruin (author) , Jan S. Rellermeyer (mentor) , A. Katsifodimos (mentor)

This thesis investigates the effectiveness and efficiency of embedding-based drift detection in machine learning systems, focusing on synthetic simulations and real-world production data. Through controlled experiments, we compare vector-based and distribution-based metrics regar ...

Feature Discovery for Data-Centric AI

Doctoral thesis (2025) - Andra Ionescu (author) , G.J. Houben (promotor) , A. Katsifodimos (copromotor) , R. Hai (copromotor)

We are witnessing a paradigm shift in machine learning (ML) and artificial intelligence (AI) from a focus primarily on innovating ML models, the model-centric paradigm, to prioritising high-quality, reliable data for AI/ML applications, the data-centric paradigm. This emphasis on data has led to the development of an economy around data, creating data marketplace platforms where data is traded as a commodity. However, trading data involves constraints that reflect the specific needs of users, such as enriching or augmenting their datasets or creating datasets with particular properties. These constraints pose challenges the data management community has already addressed independently of the marketplace platform context. As such, in this thesis, as a first act of research, we integrate approaches and practices from the data management community into the context of an open-source data marketplace platform, following a survey of industry professionals who produce, trade, and purchase data assets.

Aligned with the objectives of the data-centric AI paradigm to create high-quality training datasets, our research is focused on developing automated methods to identify relevant and related features (e.g., columns) that can be augmented to a given dataset. This effort has led to the research and design of feature discovery, which sits at the intersection of dataset discovery by discovering related datasets, data integration by joining datasets, and feature selection by selecting high-predictive features for ML models. We have developed an automated approach for feature discovery that improves upon existing automated data augmentation techniques, improving the effectiveness and efficiency of finding the most relevant features.

However, with the adoption of automatic approaches, we discovered that in moving towards data-centric AI, we risk detaching not only from model-centric but also from user-centric AI. To assess the extent to which users (e.g., data scientists, data engineers, ML engineers) rely on and trust automatic approaches and to determine their feature discovery pipeline, we conducted 19 interviews based on a use-case study. The results revealed that users doubt the automated methods and want to be involved in the process instead. Consequently, we decided to incorporate the users into the feature discovery process and to explore whether their involvement (e.g., by adding domain and business knowledge) improves the quality of the resulting dataset and the feature discovery process.

Thus, we created a human-in-the-loop approach for feature discovery, which was evaluated by conducting interviews with a subset of our initial candidate pool. The results confirmed that a human-in-the-loop method is more approachable for users as it provides control over and insights into the process, as well as the opportunity to inject their knowledge, ensuring that the resulting dataset is relevant for their data tasks.

With this thesis, we make scientific contributions to the field of data management by offering novel insights into users' workflows and designing and developing resources that enhance feature discovery. We hope our contributions will serve as a valuable resource for future work in user-centric and data-centric feature discovery.

Enhancing XML Zero-Watermarking Robustness Using Usability Queries and Functional Dependencies

Bachelor thesis (2024) - B. Benedek Székács (author) , Zekeriya Erkin (mentor) , Devris Isler (mentor) , Asterios Katsifodimos (mentor)

In the digital era, XML data is fundamental for various applications, requiring robust methods to ensure data integrity and security. Traditional digital watermarking techniques face challenges due to XML's hierarchical structure. Zero-watermarking, which derives a watermark from ...

Leveraging Database Honeypots to Gather Threat Intelligence

Master thesis (2024) - Y. Song (author) , Harm Griffioen (mentor) , G. Smaragdakis (graduation committee member) , Asterios Katsifodimos (coach) , Jie Yang (coach)

In the digital age, the proliferation of personal data within databases has made them prime targets for cyberattacks. As the volume of data increases, so does the frequency and sophistication of these attacks. This thesis investigates database security threats by deploying open s ...

The Good, the Bad, and the Scanned: An Empirical Study of the Origins of Internet-wide Scanners

Master thesis (2024) - G. Koursiounis (author) , Georgios Smaragdakis (mentor) , Harm Griffioen (mentor) , Asterios Katsifodimos (coach)

Security researchers and industry firms employ Internet-wide scanning for information collection, vulnerability detection and security evaluation, while cybercriminals make use of it to find and attack unsecured devices. Internet scanning plays a considerable role in threat detection & response, and cyber threat intelligence. We adopt a data-driven approach, analyzing a large dataset of network traffic collected through a network telescope, to identify the origins of Internet scanners and their affiliations. We provide a traffic analysis of two monthly snapshots in two different years (2023 & 2024) of approximately 10 billion packets each. We also provide a methodology for data collection and aggregation of known/institutional scanners.

The study reveals that a small number of source IP addresses account for almost the entire portion of traffic volume, with 1% of total addresses contributing 97.38% of total traffic in June 2023 and 96.65% in February 2024. Traffic analysis identifies 40 to 44 known scanners, accounting for 0.36 to 0.62% of source IPs and 50.86 to 51.31% of total telescope traffic in each month. However, seven to ten organizations are responsible for around half of the total telescope traffic each month. The study also identifies 34 commercial bots, with a negligible footprint, accounting for up to 0.25% of total source IPs and less than 0.01% of total traffic per month. Mirai probes contribute 1 to 1.5% of monthly scanning traffic, with a burst in IP addresses in 2023. Similarly, traffic from Tor exit nodes appears small, constituting 0.01% of overall Darknet traffic and 0.04-0.06% of source IPs per month. The study also reports on the current usage of scanning software such as ZMap and Masscan, finding that around 40% of each monthly traffic volume contains the ZMap signature. Lastly, we highlight the further need for mutual exchange of threat intelligence among defenders, as well as the extension of data collection period and the establishment of a pipeline for continuous discovery and integration of known scanners from a research perspective, in order to efficiently differentiate institutional scanners and malicious actors, within an evolving cyber landscape.
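
The concentration figure quoted above (the busiest 1% of source addresses contributing roughly 97% of traffic) is a top-percentile share. The following is a minimal sketch of how such a share could be computed from a list of per-packet source addresses. The function name `top_share`, the integer `percent` parameter and the toy data are illustrative assumptions and are not taken from the thesis.

```python
from collections import Counter


def top_share(packet_sources, percent=1):
    """Share of all packets sent by the busiest `percent`% of source addresses.

    `packet_sources` holds one source address per observed packet.
    This is a hypothetical helper, not the methodology used in the thesis.
    """
    counts = Counter(packet_sources)
    if not counts:
        raise ValueError("no packets to analyse")
    # Number of sources in the top percentile, at least one.
    k = max(1, len(counts) * percent // 100)
    top_packets = sum(n for _, n in counts.most_common(k))
    return top_packets / sum(counts.values())


# Toy data: 100 distinct sources, one of which sends 901 of the 1000 packets.
sources = ["10.0.0.1"] * 901 + [f"10.0.1.{i}" for i in range(1, 100)]
share = top_share(sources, percent=1)
print(f"top 1% of sources send {share:.1%} of packets")
```

On real telescope data the packet list would not fit in memory as a Python list; the same counting logic would typically be applied to pre-aggregated per-source packet counts.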

Human Interaction in Tabular Data Augmentation in Data Science Workflows

Master thesis (2024) - Z.F. Mouw (author) , Asterios Katsifodimos (mentor) , E.A. Aivaloglou (mentor) , Andra Ionescu (mentor) , N.M. Gürel (graduation committee member)

The advancement of artificial intelligence (AI) has led to an increased demand for both a greater volume and quality of data. In many companies, data is dispersed across multiple tables, yet AI models typically require data in a single table format. This necessitates the merging ...

Estimation of Similarity Between Data Streams Using Probabilistic Data Structures

Master thesis (2024) - P. Reppas (author) , A. Katsifodimos (mentor) , G. Siachamis (coach)

This thesis embarks on the quest to efficiently compute similarities between data streams in real-time, a task burgeoning in importance with the advent of big data and real-time analytics. At the heart of this endeavor is the expansion of the Condor framework to accommodate new p ...

Adaptivity for Streaming Dataflow Engines

Doctoral thesis (2024) - George Siachamis (author) , A. van Deursen (promotor) , G.J. Houben (promotor) , A. Katsifodimos (copromotor)

Data processing has heavily evolved in the last two decades, from single-node processing to distributed processing and from the MapReduce paradigm to the stream processing paradigm. At the same time, cloud computing has emerged as the primary means of deploying and operating a data processing system. In the cloud era, flexible resource allocation combined with flexible pricing schemes has brought forward new opportunities and has democratized access to computing resources. However, streaming dataflow or stream processing engines were originally designed for in-house clusters of fixed resources with limited needs for adaptivity. Therefore, they lack the mechanisms to adapt to unexpected changes in the needs of the processing workload. When solutions have been proposed in the literature, their experimental evaluation is limited, hindering the progress of the field. The same applies to the native fault tolerance mechanisms that virtually every stream processing engine employs. In this thesis, we study the problem of adaptivity for streaming dataflow engines, and we focus on three major adaptivity subproblems: adaptivity to i) statistical changes, ii) infrastructure failures, and iii) input rate changes.
In Chapter 2, we study adaptivity to statistical changes through the important task of streaming similarity joins, which is heavily affected by imbalanced loads, a by-product of statistical changes. We propose S3J, the first adaptive distributed streaming similarity joins method in the general metric space that employs a two-layered adaptive partitioning scheme to reduce unnecessary similarity computations and distribute the load to the available workers. Our partitioning scheme is paired with an efficient load balancing scheme that leverages the existing partitioning in order to rebalance any imbalanced load. Our results show that S3J outperforms the employed baseline, inspired by a MapReduce method, in terms of partitioning efficiency. Additionally, our experiments show that the load balancing scheme can gradually defuse the imbalanced load and involve all the available workers in the processing.
The majority of stream processing engines employ a checkpoint-based fault tolerance mechanism. In Chapter 3, we look at the adaptivity to infrastructure failures through the existing checkpointing protocols. We propose CheckMate, a principled experimental framework for evaluating checkpointing protocols for streaming dataflows. First, we summarize all the essential preliminaries required to study checkpoint-based fault tolerance. Then, we discuss in detail, implement, and evaluate in different scenarios the three main checkpointing protocols. Our evaluation shows that when the load is uniformly distributed, the coordinated checkpointing protocol, which most stream processing engines implement, outperforms the alternatives. However, the uncoordinated protocol prevails in the presence of skew, while it shows no domino effect when cyclic queries are employed.
Finally, in Chapter 4, we address the problem of adaptivity to input rate changes. Although multiple solutions have been proposed, their experimental evaluation is shallow and does not include detailed comparisons with other solutions. We propose a principled evaluation framework for stream processing autoscalers. We establish important metrics, queries, and workloads in order to provide guidelines for the evaluation of autoscaling solutions for stream processing. We discuss the state-of-the-art control-based autoscalers, and we evaluate them using the proposed framework. Our results show that, for complex queries, none of the evaluated autoscalers can adapt efficiently, while for simple stateless queries, a simple generic autoscaler outperforms the solutions tailored to stream processing.
We conclude this thesis by summarizing our main findings and discussing the limitations of our work. Based on the valuable insights we gained while designing and implementing the research work included in this thesis, we propose a series of interesting and important future research directions that are not limited to adaptivity problems but address stream processing in general.

Tabular Schema Matching for Modern Settings

Doctoral thesis (2024) - C. Koutras (author) , G.J. Houben (promotor) , Asterios Katsifodimos (copromotor) , Christoph Lofi (copromotor)

Schema matching is a critical data integration process, which aims at capturing relevance between elements of different datasets; when datasets are tabular, it translates to the process of discovering related columns among them. Accurately discovering column matches is integral for several applications, such as entity resolution, data cleaning and data augmentation. While there exists a multitude of schema matching methods in the literature, we identify three major issues: i) there is no comprehensive study comparing them in terms of effectiveness and efficiency, due to unavailable implementations and a lack of evaluation datasets, ii) existing methods might be impractical and even inapplicable in certain modern settings, and iii) the heterogeneity and complexity of data can impede capturing relevance among columns for existing methods, as certain assumptions might not hold for the entirety of the underlying datasets. In this thesis, we tackle these issues by reviewing existing schema matching techniques and proposing novel methods capable of addressing the challenges imposed by modern settings.
Starting with Chapter 2, we present an extensive comparison study on existing schema matching methods, by introducing Valentine. Specifically, Valentine constitutes an open-source experimental suite, which encompasses several state-of-the-art schema matching solutions. To guide the evaluation process towards modern applications, we extract four relatedness scenarios from the dataset discovery literature. To tackle the lack of existing datasets with ground truth, we devise a principled fabrication process. Our findings lead to insights that can help to improve future research in the field of schema matching, while they affect the design choices we make for novel methods we present in the following chapters.
Next, in Chapter 3, we turn our focus to applying schema matching among datasets stored in different data silos, which cannot be collocated and each contains information about column matches. Towards this direction, we introduce SiMa, a matching method that leverages existing matches in each silo to build a column match prediction model, powered by a Graph Neural Network (GNN). To do so, SiMa transforms columns and matches among them in each silo to a graph, while it performs targeted negative edge sampling and incremental training to enhance the learning process. In our experimental evaluation, we show the benefits of using SiMa over state-of-the-art techniques, both in terms of effectiveness and efficiency.
Finally, Chapter 4 discusses the problem of discovering join relationships among datasets in a repository. To ameliorate the shortcomings of previous methods, we propose OmniMatch, a self-supervised method that can effectively capture both equi- and fuzzy-joins among tabular data. At the core of the method is the exploitation of a comprehensive set of similarity signals among columns, which are then transformed into a similarity graph. This graph, in conjunction with automatically generated positive and negative column match examples, enables the employment of a Relational Graph Convolution Network (RGCN) towards training a generalizable join prediction model. We compare the effectiveness of OmniMatch with several other state-of-the-art matching and column representation methods, while we verify the usefulness of utilizing a wide spectrum of similarity signals to capture joins.
We conclude the thesis by reviewing our main findings, reflecting on our contributions and discussing potential limitations of the methods and approaches presented. Moreover, based on the insights we gain from surveying and developing novel matching methods, we discuss challenges and future directions in the field.
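
To make the idea of turning column similarity signals into a similarity graph more concrete, here is a minimal sketch that uses a single signal, the Jaccard overlap of column value sets, and keeps an edge whenever the score reaches a threshold. The choice of Jaccard, the threshold, and the names `jaccard` and `similarity_edges` are assumptions made for illustration; the abstract does not list the signals OmniMatch actually uses, and this sketch involves no learning step.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap of two value collections (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_edges(tables, threshold=0.5):
    """Return weighted edges between columns of different tables.

    `tables` maps table name -> {column name -> list of values}.
    Only one illustrative signal (value overlap) is used here.
    """
    columns = [(t, c, vals) for t, cols in tables.items() for c, vals in cols.items()]
    edges = []
    for (t1, c1, v1), (t2, c2, v2) in combinations(columns, 2):
        if t1 == t2:
            continue  # only cross-table pairs are join candidates
        score = jaccard(v1, v2)
        if score >= threshold:
            edges.append(((t1, c1), (t2, c2), score))
    return edges


tables = {
    "orders": {"customer_id": [1, 2, 3, 4], "amount": [10, 20, 30, 40]},
    "customers": {"id": [2, 3, 4, 5], "name": ["a", "b", "c", "d"]},
}
edges = similarity_edges(tables)
print(edges)
```

Exact value overlap only hints at equi-joins; capturing fuzzy joins would need additional signals (for example, string or embedding similarity), which are outside this sketch.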

On the utility of metadata to optimize machine learning workflows

Doctoral thesis (2024) - Z. Li (author) , G.J. Houben (promotor) , Alessandro Bozzon (promotor) , Asterios Katsifodimos (copromotor)

Over the last two decades, the machine learning (ML) field has witnessed a dramatic expansion, propelled by burgeoning data volumes and the advancement of computational technologies. Deep learning (DL) in particular has demonstrated remarkable success across a wide range of domains, including healthcare, mobility, life sciences, and energy systems. This success has been further accelerated by the availability and efficiency of open-source ML frameworks like TensorFlow and PyTorch, making ML methodologies more accessible than ever.

However, this rapid growth has brought its own set of challenges. The proliferation of ML models and related artifacts, such as datasets, has brought abundant information during the ML lifecycle. The descriptive and property information of these artifacts is referred to as metadata. Yet current practices, such as model cards used in public model zoos and tools to track metadata within scripts, cannot fully capture the metadata of these artifacts, let alone provide a standardized approach for their management and access. In addition, the prevailing practice of managing ML/DL scripts via traditional software repositories, while adequate for software engineering, falls short in addressing the unique needs of ML workflows, such as model reuse and comparative analysis. These practices hinder the effective use of structured and comprehensive metadata representation. This disconnect points to a pressing need for improved methodologies and tools in the ML field.

In response to these challenges, this thesis delves into the development and exploitation of structured metadata representations within ML model zoos. In Chapter 2, we first propose a metamodel that represents different types of metadata, thus transforming the metadata from being merely descriptive to being queryable and machine-readable. The structured nature of our metamodel allows for more efficient querying and retrieval of information, which is a substantial improvement over the traditional, text-based descriptions.

Additionally, the thesis explores the use of metadata to optimize various ML processes, particularly in the selection of appropriate models for specific tasks, i.e., model inference and fine-tuning. In Chapter 3, we investigate the optimization of ML inference queries in heterogeneous model zoos using a Mixed-Integer-Programming-based optimizer. This optimizer, which considers multiple objectives such as accuracy and inference speed, provides a robust framework for model selection and execution planning. In Chapter 4, the research extends to model selection for fine-tuning. We investigate predicting model performance, particularly accuracy, in scenarios where data domains shift, thus negating the need for constant model fine-tuning. By selectively choosing only the most promising candidates, this method substantially lowers the computational burden and associated costs of extensive model fine-tuning.

Overall, this thesis investigates the representation and application of metadata. The insights and methodologies presented not only improve the efficiency and effectiveness of ML workflows but also pave the way for further exploration in the integration of metadata within ML practices, highlighting the continual development and potential for advancements in ML.

Experimental evaluation of distributed similarity joins in stream processing environments

Master thesis (2023) - T. Hernandez Quintanilla (author) , A. Katsifodimos (mentor) , G. Siachamis (graduation committee member)

Similarity joins are operations which involve identifying similar pairs of records within one or multiple datasets. These operations are typically time-sensitive, as timely identification of relations can lead to increased profitability. Therefore, it is advantageous to analyze t ...