The availability and reliability of online systems are a cornerstone of modern civilization. Companies actively try to minimize downtime during incidents, and publishing incident reports afterwards is standard practice. What is missing, however, is an overview of how the categories of causes leading up to incidents are distributed and what characterizes them. This paper addresses that research gap by investigating whether there is a relation between the category of software change that caused an incident and its mean time to repair (MTTR). A taxonomy for classifying the root causes and time to detect (TTD) of incident reports was derived. A total of 258 publicly available incident reports authored by Google were scraped and classified with a zero-shot classification model. The time to repair (TTR) was then analyzed for each category. The analysis shows that incidents caused by software version incompatibilities have the highest MTTR at 69.8 hours, followed by code defects at 54.0 hours, while the other categories have values between 13 and 20 hours. Given that the TTR of an incident is primarily impacted by the number of skilled engineers available, an empirically grounded estimate of incident difficulty could help improve resource distribution based on early indications of the root cause.
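
As a minimal sketch of the per-category aggregation described above, the MTTR of a category can be computed as the arithmetic mean of the TTR values of the incidents assigned to it. The function name `mttr_by_category`, the input layout, the "configuration change" label, and all sample values below are assumptions made for illustration; they are not the paper's code, taxonomy, or data.

```python
from collections import defaultdict
from statistics import mean


def mttr_by_category(incidents):
    """Return {category: mean TTR in hours} for (category, ttr_hours) pairs."""
    ttrs = defaultdict(list)
    for category, ttr_hours in incidents:
        ttrs[category].append(ttr_hours)
    # MTTR of a category = arithmetic mean of the TTRs of its incidents.
    return {category: mean(values) for category, values in ttrs.items()}


# Made-up placeholder values, NOT data from the paper.
sample = [
    ("version incompatibility", 10.0),
    ("version incompatibility", 30.0),
    ("code defect", 8.0),
    ("configuration change", 4.0),
    ("configuration change", 2.0),
]

result = mttr_by_category(sample)
for category, mttr in sorted(result.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{category}: {mttr:.1f} h")
```

Sorting by descending MTTR mirrors how the results are reported, with the slowest-to-repair category first.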