Understanding IT System Failures: Primary Fault Types, Severity Patterns, and Evolution in Modern Operations

An Analysis of Public Incident Reports Using Large Language Models

Bachelor Thesis (2025)

Author(s)

J.A. Rutkowski (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

E. Kapel – Mentor (TU Delft - Software Engineering)

D. Spinellis – Mentor (TU Delft - Software Engineering)

Benedikt Ahrens – Graduation committee member (TU Delft - Programming Languages)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2025
Language
English
Graduation Date
25-06-2025
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Modern businesses increasingly rely on software-driven operations, making system reliability a critical concern. Despite advances in automated operations, gaps remain in understanding how the primary causes of system failures manifest, impact operational severity, and evolve in cloud-native environments. This study analyzes 7,804 publicly available incident reports spanning 2014–2022 to examine trends in operational fault types across modern IT systems. A state-of-the-art large language model was employed to classify incidents into a consolidated fault taxonomy with an overall accuracy of 92 % and a macro-averaged F1-score of 0.89. The results reveal that Misconfigurations and Deployment Failures (32.3 %), External Dependency Failures (30.0 %), and Capacity Issues (16.1 %) are the most frequent fault types. Significant correlations were found between fault types and incident duration, with Security Incidents exhibiting particularly long resolution times. Temporal analysis shows a rising prevalence of Misconfigurations/Deployment Failures and Software Bugs, alongside a decline in Infrastructure Failures, reflecting the growing complexity and automation of modern IT environments. These findings contribute to a deeper understanding of evolving digital fragility, reveal how different fault types impact operational resilience, and offer actionable insights for improving incident management and system reliability.
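
The abstract reports both overall accuracy and a macro-averaged F1-score. The two can diverge when fault types are imbalanced, because macro-averaging weights every class equally regardless of how many incidents it contains. The following is a minimal sketch of how the two metrics are computed from their standard definitions. The label names and the six toy incidents are hypothetical illustrations, not data or code from the thesis.

```python
def accuracy(y_true, y_pred):
    """Fraction of incidents whose predicted fault type matches the true one."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels (assumed for illustration only).
y_true = ["config", "config", "config", "dependency", "dependency", "capacity"]
y_pred = ["config", "config", "dependency", "dependency", "dependency", "config"]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")
```

In this toy case the single misclassified "capacity" incident lowers the macro F1 well below the accuracy, which is why reporting both figures gives a fuller picture of classifier quality across rare fault types.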

Files

Research_paper_final.pdf
(pdf | 1.43 Mb)
License info not available