Understanding IT System Failures: Primary Fault Types, Severity Patterns, and Evolution in Modern Operations

An Analysis of Public Incident Reports Using Large Language Models

Bachelor Thesis (2025)

Author(s)

J.A. Rutkowski (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

E. Kapel – Mentor (TU Delft - Software Engineering)

D. Spinellis – Mentor (TU Delft - Software Engineering)

Benedikt Ahrens – Graduation committee member (TU Delft - Programming Languages)

Faculty

Electrical Engineering, Mathematics and Computer Science

Large Language Models (LLMs), DevOps, Incident management, Fault classification, Root Cause Analysis, IT system failures

To reference this document use:

https://resolver.tudelft.nl/uuid:16aeee13-032e-4f40-8801-e02abccbf14c

Publication Year

2025

Language

English

Graduation Date

25-06-2025

Awarding Institution

Delft University of Technology

Project

CSE3000 Research Project

Programme

Computer Science and Engineering

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Modern businesses increasingly rely on software-driven operations, making system reliability a critical concern. Despite advances in automated operations, gaps remain in understanding how the primary causes of system failures manifest, impact operational severity, and evolve in cloud-native environments. This study analyzes 7,804 publicly available incident reports spanning 2014–2022 to examine trends in operational fault types across modern IT systems. A state-of-the-art large language model was employed to classify incidents into a consolidated fault taxonomy with an overall accuracy of 92 % and a macro-averaged F1-score of 0.89. The results reveal that Misconfigurations and Deployment Failures (32.3 %), External Dependency Failures (30.0 %), and Capacity Issues (16.1 %) are the most frequent fault types. Significant correlations were found between fault types and incident duration, with Security Incidents exhibiting particularly long resolution times. Temporal analysis shows a rising prevalence of Misconfigurations/Deployment Failures and Software Bugs, alongside a decline in Infrastructure Failures, reflecting the growing complexity and automation of modern IT environments. These findings contribute to a deeper understanding of evolving digital fragility, reveal how different fault types impact operational resilience, and offer actionable insights for improving incident management and system reliability.
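
The abstract reports two classification metrics, overall accuracy and a macro-averaged F1-score. This record does not say how the thesis computed them, so the following is only a minimal sketch of the standard definitions. The function names and the toy labels in the comments are hypothetical and are not taken from the thesis data.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of incidents whose predicted fault type equals the reference label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (standard macro averaging)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN), equivalent to the harmonic mean
        # of precision and recall for this class.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels, for illustration only (not thesis data).
reference = ["config", "config", "dependency", "capacity"]
predicted = ["config", "dependency", "dependency", "capacity"]
print(accuracy(reference, predicted), macro_f1(reference, predicted))
```

Because macro averaging weights every fault type equally, a rare class that is classified poorly lowers the macro F1 more than it lowers accuracy, which is one common reason for reporting both figures.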

Files

Research_paper_final.pdf

(pdf | 1.43 Mb)

License info not available