E. Kapel

5 records found

Foreword postscript (2026) - Sallam Abualhaija, Domenico Bianculli, Eileen Kapel

Adapting AIOps Capacity Forecasting Models to Data Changes

Conference paper (2025) - Lorena Poenaru-Olaru, Wouter Van't Hof, Adrian Stańdo, Arkadiusz P. Trawiński, Eileen Kapel, Jan S. Rellermeyer, Luis Cruz, Arie Van Deursen
Capacity management is critical for software organizations to allocate resources effectively and meet operational demands. An important step in capacity management, predicting future resource needs, often relies on data-driven analytics and machine learning (ML) forecasting models, which require frequent retraining to stay relevant as data evolves. Continuously retraining the forecasting models can be expensive and difficult to scale, posing a challenge for engineering teams tasked with balancing accuracy and efficiency. Retraining only when the data changes appears to be a more computationally efficient alternative, but its impact on accuracy requires further investigation. In this work, we investigate the effects of retraining capacity forecasting models for time series based on detected changes in the data compared to periodic retraining. Our results show that drift-based retraining achieves comparable forecasting accuracy to periodic retraining in most cases, making it a cost-effective strategy. However, in cases where data is changing rapidly, periodic retraining is still preferred to maximize the forecasting accuracy. These findings offer actionable insights for software teams to enhance forecasting systems, reducing retraining overhead while maintaining robust performance. ...
Effective change management is crucial for businesses heavily reliant on software and services to minimise incidents induced by changes. Unfortunately, in practice it is often difficult to effectively use artificial intelligence for IT Operations (AIOps) to enhance service management, primarily due to inadequate data quality. Establishing reliable links between changes and the induced incidents is crucial for identifying patterns, improving change deployment, identifying high-risk changes, and enhancing incident response. In this research, we investigate the enhancement of traceability between changes and incidents through AIOps methods. Our approach involves a close examination of incident-inducing changes, the replication of methods linking incidents to the changes that caused them, introducing an adapted method, and demonstrating its results using historical data and practical evaluations. Our findings reveal that incident-inducing changes exhibit different characteristics dependent on context. Furthermore, a significant disparity exists between assessments based on historical data and real-world observation, with an increased occurrence of false positives when identifying links between unlabeled changes and incidents. This study highlights the complex nature of identifying links between changes and incidents, emphasising the contextual influence on AIOps method effectiveness. While we are actively working on improving the quality of current data through AIOps approaches, it remains apparent that further measures are necessary to address issues like data imbalances and promote a postmortem culture that brings attention to the value of properly administrating tickets. A better overview of change failure rates contributes to improved risk compliance and reliable change management. ...
Conference paper (2023) - Eileen Kapel
Ensuring the reliability of change deployment is essential to prevent incidents in businesses that strongly depend on software and services. Incidents should be avoided since they may lead to customer dissatisfaction, financial losses and reputational damage. Currently, the majority of outages are caused by changes, so we believe there is a need for a stronger focus on risk management before changes are deployed. This paper presents a research plan that proposes a risk management AIOps framework utilising real-world change, CI/CD pipeline and incident data for incident prevention through reliable change deployment. This research will explore 1) obtaining background information on the current state of practice of service management with a case study on a software-defined business; 2) a risk management AIOps framework that utilises the traces of change, incident and CI/CD pipeline code for predicting the risk of change deployment; and 3) testing the generalisability of the framework for reducing the risk of change deployment. ...
Context: An incident management process is necessary in businesses that depend strongly on software and services. A proper process is essential to guarantee that incidents are well-handled, especially in a software-defined financial services company needing to adhere to guidelines and regulations.

Objective: This paper aims to improve the understanding of the current state of practice through a case study of the incident management process at a software-defined company.

Method: We conduct a single-case exploratory case study by interviewing 15 subject matter experts on how tools are used, the challenges experienced and the future opportunities of the process. The findings are triangulated with documentation and data.

Results: We make nine core observations in this paper. Certain tools are prescribed to teams to be used in the incident management process, complemented with flexible support for diverse tooling that aids teams in handling incidents. Core challenges include monitoring data quality, the complexity of the environment, and the balance between minimising incident resolution time and following procedural guidelines. Future opportunities can lessen these challenges by making better use of available tooling and employing machine learning approaches. This requires tight supervision of the use of best practices and good monitoring data quality.

Conclusion: The tools, challenges, and future opportunities for incident management in software-defined businesses identified in this paper call for a strengthened focus on improving the quality of monitoring data, handling environment complexity, issue clustering, and better support for regulatory compliance. ...