J.S. Rellermeyer | TU Delft Repository

Improving the Reliability of Failure Prediction Models through Concept Drift Monitoring

Conference paper (2025) - Lorena Poenaru-Olaru, Luis Cruz, Jan S. Rellermeyer, Arie van Deursen

Failure prediction models can be significantly beneficial for managing large-scale complex software systems, but their trustworthiness is severely affected by changes in the data over time, also known as concept drift. Thus, monitoring these models against concept drift and retraining them when the data changes becomes crucial in designing reliable failure prediction models. In this work, we evaluate the effects of monitoring failure prediction models over time using label-independent (unsupervised) drift detectors. We show that retraining based on unsupervised drift detectors instead of periodically reduces the cost of acquiring true labels without compromising accuracy. Furthermore, we propose a novel feature reduction for unsupervised drift detectors and an evaluation pipeline that practitioners can employ to select the most suitable unsupervised drift detector for their application. ...

Sustainable Machine Learning Retraining

Optimizing Energy Efficiency Without Compromising Accuracy

Conference paper (2025) - Lorena Poenaru-Olaru, June Sallou, Luis Cruz, Jan S. Rellermeyer, Arie Van Deursen

The reliability of machine learning (ML) software systems is heavily influenced by changes in data over time. For that reason, ML systems require regular maintenance, typically based on model retraining. However, retraining requires significant computational demand, which makes it energy-intensive and raises concerns about its environmental impact. To understand which retraining techniques should be considered when designing sustainable ML applications, in this work, we study the energy consumption of common retraining techniques. Since the accuracy of ML systems is also essential, we compare retraining techniques in terms of both energy efficiency and accuracy. We showcase that retraining with only the most recent data compared to all available data reduces energy consumption by up to 25%, being a sustainable alternative to the status quo. Furthermore, our findings show that retraining a model only when there is evidence that updates are necessary, rather than on a fixed schedule, can reduce energy consumption by up to 40%, provided a reliable data change detector is in place. Our findings pave the way for better recommendations for ML practitioners, guiding them toward more energy-efficient retraining techniques when designing sustainable ML software systems. ...

Prepared for the Unknown

Adapting AIOps Capacity Forecasting Models to Data Changes

Conference paper (2025) - Lorena Poenaru-Olaru, Wouter Van't Hof, Adrian Stańdo, Arkadiusz P. Trawiński, Eileen Kapel, Jan S. Rellermeyer, Luis Cruz, Arie Van Deursen

Capacity management is critical for software organizations to allocate resources effectively and meet operational demands. An important step in capacity management is predicting future resource needs often relies on data-driven analytics and machine learning (ML) forecasting models, which require frequent retraining to stay relevant as data evolves. Continuously retraining the forecasting models can be expensive and difficult to scale, posing a challenge for engineering teams tasked with balancing accuracy and efficiency. Retraining only when the data changes appears to be a more computationally efficient alternative, but its impact on accuracy requires further investigation. In this work, we investigate the effects of retraining capacity forecasting models for time series based on detected changes in the data compared to periodic retraining. Our results show that drift-based retraining achieves comparable forecasting accuracy to periodic retraining in most cases, making it a costeffective strategy. However, in cases where data is changing rapidly, periodic retraining is still preferred to maximize the forecasting accuracy. These findings offer actionable insights for software teams to enhance forecasting systems, reducing retraining overhead while maintaining robust performance. ...

Is your anomaly detector ready for change? adapting aiops solutions to the real world

Conference paper (2024) - Lorena Poenaru-Olaru, Natalia Karpova, Luis Cruz, Jan S. Rellermeyer, Arie Van Deursen

Anomaly detection techniques are essential in automating the monitoring of IT systems and operations. These techniques imply that machine learning algorithms are trained on operational data corresponding to a specific period of time and that they are continuously evaluated on newly emerging data. Operational data is constantly changing over time, which affects the performance of deployed anomaly detection models. Therefore, continuous model maintenance is required to preserve the performance of anomaly detectors over time. In this work, we analyze two different anomaly detection model maintenance techniques in terms of the model update frequency, namely blind model retraining and informed model retraining. We further investigate the effects of updating the model by retraining it on all the available data (full-history approach) and only the newest data (sliding window approach). Moreover, we investigate whether a data change monitoring tool is capable of determining when the anomaly detection model needs to be updated through retraining. ...

Maintaining and Monitoring AIOps Models Against Concept Drift

Conference paper (2023) - Lorena Poenaru-Olaru, Luis Cruz, Jan S. Rellermeyer, Arie Van Deursen

AIOps solutions enable faster discovery of failures in operational large-scale systems through machine learning models trained on operation data. These models become outdated during the occurrence of concept drift, a term used to describe shifts in data distributions. In operation data concept drift is inevitable and it impacts the performance of AIOps solutions over time. Therefore, concept drift should be closely monitored and immediate maintenance to prevent erroneous predictions is required. In this work, we propose an automated maintenance pipeline for AIOps models that monitors the occurrence of concept drift and chooses the most appropriate model retraining technique according to the drift type. ...

Retrain AI Systems Responsibly! Use Sustainable Concept Drift Adaptation Techniques

Conference paper (2023) - Lorena Poenaru-Olaru, June Sallou, Luis Cruz, Jan S. Rellermeyer, Arie Van Deursen

Deployed machine learning systems often suffer from accuracy degradation over time generated by constant data shifts, also known as concept drift. Therefore, these systems require regular maintenance, in which the machine learning model needs to be adapted to concept drift. The literature presents plenty of model adaptation techniques. The most common technique is periodically executing the whole training pipeline with all the data gathered until a particular point in time, yielding a massive energy footprint. In this paper, we propose a research path that uses concept drift detection and adaptation to enable sustainable AI systems. ...

Are Concept Drift Detectors Reliable Alarming Systems?

A Comparative Study

Conference paper (2022) - Lorena Poenaru-Olaru, Luis Cruz, Arie van Deursen, Jan Rellermeyer

As machine learning models increasingly replace traditional business logic in the production system, their lifecycle management is becoming a significant concern. Once deployed into production, the machine learning models are constantly evaluated on new streaming data. Given the continuous data flow, shifting data, also known as concept drift, is ubiquitous in such settings. Concept drift usually impacts the performance of machine learning models, thus, identifying the moment when concept drift occurs is required. Concept drift is identified through concept drift detectors. In this work, we assess the reliability of concept drift detectors to identify drift in time by exploring how late are they reporting drifts and how many false alarms are they signaling. We compare the performance of the most popular drift detectors belonging to two different concept drift detector groups, error rate-based detectors and data distribution-based detectors. We assess their performance on both synthetic and real-world data. In the case of synthetic data, we investigate the performance of detectors to identify two types of concept drift, abrupt and gradual. Our findings aim to help practitioners understand which drift detector should be employed in different situations and, to achieve this, we share a list of the most important observations made throughout this study, which can serve as guidelines for practical usage. Furthermore, based on our empirical results, we analyze the suitability of each concept drift detection group to be used as an alarming system. ...

Roadmap for edge AI

A Dagstuhl Perspective

Journal article (2022) - Aaron Yi Ding, Ella Peltonen, Tobias Meuser, Atakan Aral, Christian Becker, Schahram Dustdar, Thomas Hiessl, Nitinder Mohan, Jan S. Rellermeyer, More authors...

Based on the collective input of Dagstuhl Seminar (21342), this paper presents a comprehensive discussion on AI methods and capabilities in the context of edge computing, referred as Edge AI. In a nutshell, we envision Edge AI to provide adaptation for data-driven applications, enhance network and radio access, and allow the creation, optimisation, and deployment of distributed AI/ML pipelines with given quality of experience, trust, security and privacy targets. The Edge AI community investigates novel ML methods for the edge computing environment, spanning multiple sub-fields of computer science, engineering and ICT. The goal is to share an envisioned roadmap that can bring together key actors and enablers to further advance the domain of Edge AI. ...

In-Memory Indexed Caching for Distributed Data Processing

Conference paper (2022) - Alexandru Uta, Bogdan Ghit, Ankur Dave, Jan Rellermeyer, Peter Boncz

Powerful abstractions such as dataframes are only as efficient as their underlying runtime system. The de-facto distributed data processing framework, Apache Spark, is poorly suited for the modern cloud-based data-science workloads due to its outdated assumptions: static datasets analyzed using coarse-grained transformations. In this paper, we introduce the Indexed DataFrame, an in-memory cache that supports a dataframe abstraction which incorporates indexing capabilities to support fast lookup and join operations. Moreover, it supports appends with multi-version concurrency control. We implement the Indexed DataFrame as a lightweight, standalone library which can be integrated with minimum effort in existing Spark programs. We analyze the performance of the Indexed DataFrame in cluster and cloud deployments with real-world datasets and benchmarks using both Apache Spark and Databricks Runtime. In our evaluation, we show that the Indexed DataFrame significantly speeds-up query execution when compared to a non-indexed dataframe, incurring modest memory overhead. ...

Systematic Mapping Study on the Machine Learning Lifecycle

Conference paper (2021) - Yuanhao Xie, Luís Cruz, Petra Heck, Jan S. Rellermeyer

The development of artificial intelligence (AI) has made various industries eager to explore the benefits of AI. There is an increasing amount of research surrounding AI, most of which is centred on the development of new AI algorithms and techniques. However, the advent of AI is bringing an increasing set of practical problems related to AI model lifecycle management that need to be investigated. We address this gap by conducting a systematic mapping study on the lifecycle of AI model. Through quantitative research, we provide an overview of the field, identify research opportunities, and provide suggestions for future research. Our study yields 405 publications published from 2005 to 2020, mapped in 5 different main research topics, and 31 sub-topics. We observe that only a minority of publications focus on data management and model production problems, and that more studies should address the AI lifecycle from a holistic perspective. ...

A Fresh Look at the Architecture and Performance of Contemporary Isolation Platforms

Conference paper (2021) - V.J. van Rijn, Jan S. Rellermeyer

With the ever-increasing pervasiveness of the cloud computing paradigm, strong isolation guarantees and low performance overhead from isolation platforms are paramount. An ideal isolation platform offers both: an impermeable isolation boundary while imposing a negligible performance overhead. In this paper, we examine various isolation platforms (containers, secure containers, hypervisors, unikernels), and conduct a wide array of experiments to measure the performance overhead and degree of isolation offered by the platforms. We find that container platforms have the best, near-native, performance while the newly emerging secure containers suffer from various overheads. The highest degree of isolation is achieved by unikernels, closely followed by traditional containers. ...

SGD_Tucker

A Novel Stochastic Optimization Strategy for Parallel Sparse Tucker Decomposition

Journal article (2021) - Hao Li, Zixuan Li, Kenli Li, Jan S. Rellermeyer, Lydia Chen, Keqin Li

Sparse Tucker Decomposition (STD) algorithms learn a core tensor and a group of factor matrices to obtain an optimal low-rank representation feature for the High-Order, High-Dimension, and Sparse Tensor (HOHDST). However, existing STD algorithms face the problem of intermediate variables explosion which results from the fact that the formation of those variables, i.e., matrices Khatri-Rao product, Kronecker product, and matrix-matrix multiplication, follows the whole elements in sparse tensor. The above problems prevent deep fusion of efficient computation and big data platforms. To overcome the bottleneck, a novel stochastic optimization strategy (SGD__Tucker) is proposed for STD which can automatically divide the high-dimension intermediate variables into small batches of intermediate matrices. Specifically, SGD__Tucker only follows the randomly selected small samples rather than the whole elements, while maintaining the overall accuracy and convergence rate. In practice, SGD__Tucker features the two distinct advancements over the state of the art. First, SGD__Tucker can prune the communication overhead for the core tensor in distributed settings. Second, the low data-dependence of SGD__Tucker enables fine-grained parallelization, which makes SGD__Tucker obtaining lower computational overheads with the same accuracy. Experimental results show that SGD__Tucker runs at least 2XX faster than the state of the art. ...

Is Big Data Performance Reproducible in Modern Cloud Networks?

Conference paper (2020) - Alexandru Uta, Alexandru Custura, Dmitry Duplyakin, Ivo Jimenez, Jan S. Rellermeyer, Carlos Maltzahn, Robert Ricci, Alexandru Iosup

Performance variability has been acknowledged as a problem for over a decade by cloud practitioners and performance engineers. Yet, our survey of top systems conferences reveals that the research community regularly disregards variability when running experiments in the cloud. Focusing on networks, we assess the impact of variability on cloud-based big-data workloads by gathering traces from mainstream commercial clouds and private research clouds. Our dataset consists of millions of datapoints gathered while transferring over 9 petabytes on cloud providers' networks. We characterize the network variability present in our data and show that, even though commercial cloud providers implement mechanisms for quality-of-service enforcement, variability still occurs, and is even exacerbated by such mechanisms and service provider policies. We show how big-data workloads suffer from significant slowdowns and lack predictability and replicability, even when state-of-the-art experimentation techniques are used. We provide guidelines to reduce the volatility of big data performance, making experiments more repeatable. ...

A Survey on Distributed Machine Learning

Review (2020) - Joost Verbraeken, Matthijs Wolting, Jonathan Katzy, Jeroen Kloppenburg, Tim Verbelen, Jan S. Rellermeyer

The demand for artificial intelligence has grown significantly over the past decade, and this growth has been fueled by advances in machine learning techniques and the ability to leverage hardware acceleration. However, to increase the quality of predictions and render machine learning solutions feasible for more complex applications, a substantial amount of training data is required. Although small machine learning models can be trained with modest amounts of data, the input for training larger models such as neural networks grows exponentially with the number of parameters. Since the demand for processing training data has outpaced the increase in computation power of computing machinery, there is a need for distributing the machine learning workload across multiple machines, and turning the centralized into a distributed system. These distributed systems present new challenges: first and foremost, the efficient parallelization of the training process and the creation of a coherent model. This article provides an extensive overview of the current state-of-the-art in the field by outlining the challenges and opportunities of distributed machine learning over conventional (centralized) machine learning, discussing the techniques used for distributed machine learning, and providing an overview of the systems that are available. ...

Self-adaptive Executors for Big Data Processing

Conference paper (2019) - Sobhan Omranian Khorasani, Jan S. Rellermeyer, Dick Epema

The demand for additional performance due to the rapid increase in the size and importance of data-intensive applications has considerably elevated the complexity of computer architecture. In response, systems offer pre-determined behaviors based on heuristics and then expose a large number of configuration parameters for operators to adjust them to their particular infrastructure. Unfortunately, in practice this leads to a substantial manual tuning effort. In this work, we focus on one of the most impactful tuning decisions in big data systems: the number of executor threads. We first show the impact of I/O contention on the runtime of workloads and a simple static solution to reduce the number of threads for I/O-bound phases. We then present a more elaborate solution in the form of self-adaptive executors which are able to continuously monitor the underlying system resources and detect contentions. This enables the executors to tune their thread pool size dynamically at runtime in order to achieve the best performance. Our experimental results show that being adaptive can significantly reduce the execution time especially in I/O intensive applications such as Terasort and PageRank which see a 34% and 54% reduction in runtime. ...

IoT resource-aware orchestration framework for edge computing

Abstract (2019) - Niket Agrawal, Jan Rellermeyer, Aaron Yi Ding

Existing edge computing solutions in the Internet of Things (IoT) domain operate with the control plane residing in the cloud and edge as a slave that executes the workload deployed by the cloud. The growing diversity in the IoT applications requires the edge to be able to run multiple distinct workloads corresponding to the dedicated inputs it receives, each catering to a specific task. Achieving this with the current approach poses a limitation as the cloud lacks the local knowledge at the edge and sharing this knowledge regularly between the edge and the cloud will defeat the very purpose of edge computing, i.e., low latency, less network congestion and data privacy. To solve this problem, we propose an orchestration framework for edge computing that enables the edge to actively initiate and orchestrate the workloads on request by using the local knowledge available in the form of IoT resources at the edge. ...

The coming age of pervasive data processing

Conference paper (2019) - Jan S. Rellermeyer, Sobhan Omranian Khorasani, Dan Graur, Apourva Parthasarathy

Emerging Big Data analytics and machine learning applications require a significant amount of computational power. While there exists a plethora of large-scale data processing frameworks which thrive in handling the various complexities of data-intensive workloads, the ever-increasing demand of applications have made us reconsider the traditional ways of scaling (e.g., scale-out) and seek new opportunities for improving the performance. In order to prepare for an era where data collection and processing occur on a wide range of devices, from powerful HPC machines to small embedded devices, it is crucial to investigate and eliminate the potential sources of inefficiency in the current state of the art platforms. In this paper, we address the current and upcoming challenges of pervasive data processing and present directions for designing the next generation of large-scale data processing systems. ...

Message from international workshop on resource brokering with blockchain (RBchain 2018)

Foreword postscript (2018) - Chunming Rong, Marc X. Makkes, Martin Gilje Jaatun, Oskar Van Deventer, Tom Hacker, Ronald Jabangwe, Chunlei Li, Jan S. Rellermeyer, Torben Worm

Presents the introductory welcome message from the conference proceedings. May include the conference officers' congratulations to all involved with the conference event and publication of the proceedings record. ...

Increasing Memory Density through Dynamic Memory Extension with Memory1 through Flash

Poster (2018) - Jan S. Rellermeyer, Maher Amer, Richard Smutzer, Karthick Rajamani

Container Density Improvements with Dynamic Memory Extension using NAND Flash

Conference paper (2018) - Jan S. Rellermeyer, Maher Amer, Richard Smutzer, Karthick Rajamani

While containers efficiently implement the idea of operating-system-level application virtualization, they are often insufficient to increase the server utilization to a desirable level. The reason is that in practice many containerized applications experience a limited amount of load while there are few containers with a high load. In such a scenario, the virtual memory management system can become the limiting factor to container density even though the working set of active containers would fit into main memory. In this paper, we describe and evaluate a system for transparently moving memory pages in and out of DRAM and to a NAND Flash medium which is attached through the memory bus. This technique, called Diablo Memory Expansion (DMX), operates on a prediction model and is able to relieve the pressure on the memory system. We present a benchmark for container density and show that even under an overall constant workload, adding additional containers adversely affects performance-critical applications in Docker. When using the DMX technology of the Memory1 system, however, the performance of the critical workload remains stable. ...