Linking Software Changes to Incident Reports

Investigating Correlations Between Root Causes and the Mean Time To Repair of Incidents

Bachelor Thesis (2025)
Author(s)

D.M. Bunschoten (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

D. Spinellis – Mentor (TU Delft - Software Engineering)

E. Kapel – Mentor (TU Delft - Software Engineering)

Benedikt Ahrens – Graduation committee member (TU Delft - Programming Languages)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2025
Language
English
Graduation Date
25-06-2025
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

The availability and reliability of online systems are a cornerstone of modern society. Companies actively try to minimize downtime during incidents, and publishing incident reports afterwards is standard practice. What is missing, however, is an overview of how the categories of causes leading up to incidents are distributed and what characterizes each category. This paper addresses that research gap by investigating whether the category of software change that caused an incident is related to its mean time to repair (MTTR). A taxonomy for classifying the root causes and time to detect (TTD) of incident reports was derived. A total of 258 publicly available incident reports authored by Google were scraped and classified with a zero-shot classification model. The analysis then focused on the time to repair (TTR) for each category. Incidents caused by software version incompatibilities have the highest MTTR at 69.8 hours, followed by code defects at 54.0 hours, while the other categories have values between 13 and 20 hours. Given that the TTR of an incident is primarily impacted by the number of skilled engineers available, an estimate of repair difficulty grounded in empirical data could help improve resource allocation based on early indications of the root cause.
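
The per-category MTTR mentioned in the abstract is the average TTR over all incidents assigned to a category. A minimal sketch of that aggregation follows. It assumes each incident is reduced to a category label plus a start and an end timestamp; the function name, the record layout, and the sample records are hypothetical and are not taken from the thesis or its dataset.

```python
from collections import defaultdict
from datetime import datetime


def mean_time_to_repair(incidents):
    """Return the mean time to repair, in hours, per root-cause category.

    `incidents` is an iterable of (category, start, end) tuples; this
    layout is an assumption made for illustration.
    """
    durations = defaultdict(list)
    for category, start, end in incidents:
        # TTR of a single incident, converted from seconds to hours.
        durations[category].append((end - start).total_seconds() / 3600)
    # MTTR is the arithmetic mean of the TTR values in each category.
    return {category: sum(hours) / len(hours)
            for category, hours in durations.items()}


# Made-up records for illustration only; not data from the thesis.
sample = [
    ("code defect", datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 10, 0)),
    ("code defect", datetime(2024, 1, 2, 0, 0), datetime(2024, 1, 2, 20, 0)),
    ("version incompatibility", datetime(2024, 1, 3, 0, 0), datetime(2024, 1, 3, 4, 0)),
]

mttr = mean_time_to_repair(sample)
```

With these made-up records, the two code-defect incidents last 10 and 20 hours, so their MTTR is 15 hours.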
