Modern businesses increasingly rely on software-driven operations, making system reliability a critical concern. Despite advances in automated operations, gaps remain in understanding how the primary causes of system failures manifest, affect operational severity, and evolve in cloud-native environments. This study analyzes 7,804 publicly available incident reports spanning 2014–2022 to examine trends in operational fault types across modern IT systems. A state-of-the-art large language model was used to classify incidents into a consolidated fault taxonomy, with an overall accuracy of 92% and a macro-averaged F1-score of 0.89. The results show that Misconfigurations and Deployment Failures (32.3%), External Dependency Failures (30.0%), and Capacity Issues (16.1%) are the most frequent fault types. Fault types were significantly correlated with incident duration, with Security Incidents exhibiting particularly long resolution times. Temporal analysis shows a rising prevalence of Misconfigurations and Deployment Failures and of Software Bugs, alongside a decline in Infrastructure Failures, reflecting the growing complexity and automation of modern IT environments. These findings deepen the understanding of evolving digital fragility, show how different fault types affect operational resilience, and offer actionable insights for improving incident management and system reliability.