Understanding Software Failures Through Incident Report Analysis

An Empirical Study of 348 Incident Reports from the VOID

Bachelor Thesis (2025)

Author(s)

I.M. Aldea (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

D. Spinellis – Mentor (TU Delft - Software Engineering)

E. Kapel – Mentor (TU Delft - Software Engineering)

Benedikt Ahrens – Graduation committee member (TU Delft - Programming Languages)

Faculty

Electrical Engineering, Mathematics and Computer Science

Text mining, AIOps, Incident Characterization, Change-Induced Failures, Incident Archetypes, Software Reliability

To reference this document use:

https://resolver.tudelft.nl/uuid:e8da60fe-3db0-4a02-ac2f-5c7b2859f7ee

More Info

Publication Year

2025

Language

English

Graduation Date

25-06-2025

Awarding Institution

Delft University of Technology

Project

CSE3000 Research Project

Programme

Computer Science and Engineering

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Software changes are a leading cause of operational failures in complex production systems. Despite the increasing use of Artificial Intelligence for IT Operations (AIOps) and the availability of postmortem data, research on software incidents remains fragmented and narrowly scoped. This study aims to provide a generalizable understanding of software and change-induced incidents through structured analysis of 348 real-world incident reports from the Verica Open Incident Database. Using few-shot prompting with the GPT-4.1 Mini model, we extract key incident characteristics (root cause, triggering change, impact, severity, and remediation) and apply clustering to identify recurring incident archetypes. Our method achieves over 80% annotation accuracy on a manually labeled subset. We find that over half of incidents stem from software changes, with deployments and configuration updates disproportionately associated with high severity and manual remediation. Capacity issues and code defects are leading root causes. Clustering uncovers several prominent archetypes, including capacity-driven outages, defect-induced degradations, and hybrid failures involving improper changes. These findings support scalable incident analysis and can inform more context-aware operational strategies.
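
The two headline figures in the abstract (annotation accuracy against a manually labeled subset, and the share of incidents stemming from software changes) could be computed along the following lines. This is a minimal sketch and not the thesis's actual pipeline: the field names, the convention that a `triggering_change` of `"none"` marks an incident with no triggering change, and both function names are assumptions made for illustration. It also assumes accuracy is counted per incident and per field, which the abstract does not specify.

```python
# Hypothetical field names mirroring the five characteristics listed in the abstract.
FIELDS = ("root_cause", "triggering_change", "impact", "severity", "remediation")


def annotation_accuracy(predicted, manual):
    """Fraction of (incident, field) pairs where the model label equals the manual label.

    `predicted` and `manual` are parallel lists of dicts keyed by FIELDS.
    """
    total = 0
    correct = 0
    for pred, gold in zip(predicted, manual):
        for field in FIELDS:
            total += 1
            correct += int(pred[field] == gold[field])
    return correct / total if total else 0.0


def change_induced_share(incidents):
    """Share of incidents whose triggering change is anything other than "none".

    The "none" sentinel is an assumption of this sketch, not a convention
    taken from the thesis.
    """
    if not incidents:
        return 0.0
    changed = sum(1 for incident in incidents if incident["triggering_change"] != "none")
    return changed / len(incidents)
```

Both functions take plain lists of dictionaries, so they can be applied to any export of annotated reports regardless of how the labels were produced.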

Files

Research_Paper.pdf

(pdf | 1.19 MB)

License info not available