Understanding IT System Failures: Primary Fault Types, Severity Patterns, and Evolution in Modern Operations
An Analysis of Public Incident Reports Using Large Language Models
J.A. Rutkowski (TU Delft - Electrical Engineering, Mathematics and Computer Science)
E. Kapel – Mentor (TU Delft - Software Engineering)
D. Spinellis – Mentor (TU Delft - Software Engineering)
Benedikt Ahrens – Graduation committee member (TU Delft - Programming Languages)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
Modern businesses increasingly rely on software-driven operations, making system reliability a critical concern. Despite advances in automated operations, gaps remain in understanding how the primary causes of system failures manifest, how they affect operational severity, and how they evolve in cloud-native environments. This study analyzes 7,804 publicly available incident reports spanning 2014–2022 to examine trends in operational fault types across modern IT systems. A state-of-the-art large language model was used to classify incidents into a consolidated fault taxonomy, achieving an overall accuracy of 92% and a macro-averaged F1-score of 0.89. The results show that Misconfigurations and Deployment Failures (32.3%), External Dependency Failures (30.0%), and Capacity Issues (16.1%) are the most frequent fault types. Significant correlations were found between fault type and incident duration, with Security Incidents exhibiting particularly long resolution times. Temporal analysis shows a rising prevalence of Misconfigurations and Deployment Failures and of Software Bugs, alongside a decline in Infrastructure Failures, reflecting the growing complexity and automation of modern IT environments. These findings contribute to a deeper understanding of evolving digital fragility, show how different fault types affect operational resilience, and offer actionable insights for improving incident management and system reliability.
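The abstract reports both overall accuracy and a macro-averaged F1-score. The two can differ when fault types are unevenly represented, because macro-averaging weights every class equally regardless of how many incidents it contains. The following is a minimal sketch of how the two metrics are computed in general, not the study's evaluation code; the label names and the six toy predictions are hypothetical and chosen only for illustration.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of incidents whose predicted fault type matches the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels (not data from the study).
y_true = ["config", "config", "dependency", "capacity", "security", "config"]
y_pred = ["config", "dependency", "dependency", "capacity", "security", "config"]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")
```

In this toy example a single misclassification lowers the F1 of two classes at once (the true class and the wrongly predicted one), which is why the macro-averaged score can diverge from accuracy.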