What Secondary Issues Contribute to Operational Problems?

An Investigation Based on Public Postmortems

Bachelor Thesis (2025)
Author(s)

A. Muresan (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

E. Kapel – Mentor (TU Delft - Software Engineering)

D. Spinellis – Mentor (TU Delft - Software Engineering)

Benedikt Ahrens – Graduation committee member (TU Delft - Programming Languages)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2025
Language
English
Graduation Date
25-06-2025
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Operational incidents in software-defined systems can cause significant disruptions. Primary faults such as bugs or misconfigurations are well studied, but the secondary issues that exacerbate these failures remain underexplored. This research investigates which secondary issues contribute to operational problems by analyzing 1,500 publicly available incident reports from sources such as GitHub and the Verica Open Incident Database (VOID). Using a large language model (LLM) and a predefined classification schema, the study extracts and categorizes these issues at scale. The results show that communication failures (48.2%), monitoring and transparency deficiencies (46.5%), and documentation issues (41.1%) are the most prevalent secondary issues. They often co-occur: the most common pair, communication failures and monitoring deficiencies, appears together in over 600 reports, which suggests interdependent systemic weaknesses. The secondary issues also show strong associations with particular primary fault types, such as misconfigurations and software bugs, revealing distinct amplification patterns that affect incident severity and resolution time. A reproducible data pipeline was developed to enable large-scale analysis, and manual validation of the model's annotations yielded an accuracy of 81.9%, supporting the reliability of the LLM-based classification approach. The study examines the feasibility of AI-assisted analysis for postmortem diagnostics and provides actionable insights into operational fragility, emphasizing the need to address not only technical faults but also organizational and process-level weaknesses.
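The prevalence and co-occurrence figures reported in the abstract can be illustrated with a minimal sketch. It assumes each report has already been annotated with a set of secondary-issue labels. The label names, the `reports` data, and the helper functions `prevalence` and `pair_counts` are hypothetical and are not taken from the thesis pipeline. Because a report can carry several labels at once, the prevalence shares may sum to more than 100%.

```python
from collections import Counter
from itertools import combinations

# Hypothetical annotations: one set of secondary-issue labels per report.
# These four toy records are invented for illustration; they are not thesis data.
reports = [
    {"communication", "monitoring"},
    {"communication", "monitoring", "documentation"},
    {"documentation"},
    {"communication"},
]


def prevalence(reports):
    """Return the share of reports in which each label appears."""
    counts = Counter(label for labels in reports for label in labels)
    return {label: n / len(reports) for label, n in counts.items()}


def pair_counts(reports):
    """Count the reports in which each unordered pair of labels co-occurs."""
    counts = Counter()
    for labels in reports:
        # Sorting makes (a, b) and (b, a) map to the same key.
        counts.update(combinations(sorted(labels), 2))
    return counts


if __name__ == "__main__":
    for label, share in sorted(prevalence(reports).items()):
        print(f"{label}: {share:.1%}")
    for pair, n in pair_counts(reports).most_common():
        print(pair, n)
```

Representing each report as a set keeps the counting independent of label order and ignores accidental duplicate labels within a single report.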
