Beyond the Noise: Leveraging Machine Learning vs LLMs to Prioritize ASAT Warnings based on Actionability Probability

Master Thesis (2026)
Author(s)

V.Y. Ning (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

A.E. Zaidman – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Miroslav Zivkovic – Mentor (Software Improvement Group)

M.A. Costea – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Z. Erkin – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2026
Language
English
Graduation Date
01-07-2026
Awarding Institution
Delft University of Technology
Sponsors
Software Improvement Group
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Automated Static Analysis Tools (ASATs) generate a massive volume of non-actionable warnings. To address this, this thesis investigates the performance and resource trade-offs between classical Machine Learning (ML) models and Large Language Models (LLMs) for generating actionability probability scores. Utilizing the NASCAR dataset of over 1.2 million Java warnings, we evaluate optimized classical models (Random Forest and Logistic Regression) against the Claude 4.x LLM family using classification metrics (F1-score, AUC) and probabilistic calibration (Brier scores), supplemented by a qualitative user study of 15 industry professionals. Empirical results demonstrate that an optimized Random Forest yields superior predictive performance (F1-score: 76.85%, AUC: 0.87) and reliable uncertainty calibration (Brier score: 0.1549), rendering the massive computational overhead of miscalibrated LLMs unnecessary. However, the user study identifies a human-AI feature disconnect: while the Random Forest relies heavily on historical metadata, developers universally demand source code context and severity indicators. Ultimately, an optimized Random Forest provides a significantly more efficient framework for scoring ASAT warnings, provided the scores are tightly coupled with the structural evidence required to sustain human trust.
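
As a rough illustration of the evaluation terms used in the abstract, the sketch below shows how a Brier score and an F1-score can be computed from actionability probabilities, and how warnings could then be ranked by those probabilities. It is not the thesis's code: the function names, the 0.5 decision threshold, and the toy warning identifiers are all assumptions made for this example, and it uses plain Python rather than any particular ML library.

```python
from typing import List, Tuple


def brier_score(y_true: List[int], y_prob: List[float]) -> float:
    """Mean squared gap between predicted probability and the 0/1 outcome.

    Lower is better; 0.0 means perfectly calibrated and perfectly confident.
    """
    if not y_true or len(y_true) != len(y_prob):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def f1_score(y_true: List[int], y_prob: List[float], threshold: float = 0.5) -> float:
    """F1 of the 'actionable' class after thresholding the probabilities.

    The 0.5 default threshold is an assumption for this sketch.
    """
    pairs = list(zip(y_true, y_prob))
    tp = sum(1 for y, p in pairs if y == 1 and p >= threshold)
    fp = sum(1 for y, p in pairs if y == 0 and p >= threshold)
    fn = sum(1 for y, p in pairs if y == 1 and p < threshold)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def prioritize(warnings: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Order (warning_id, probability) pairs from most to least likely actionable."""
    return sorted(warnings, key=lambda w: w[1], reverse=True)


if __name__ == "__main__":
    # Toy labels (1 = actionable) and hypothetical model probabilities.
    labels = [1, 0, 1, 1]
    probs = [0.9, 0.6, 0.4, 0.8]
    print("Brier:", brier_score(labels, probs))
    print("F1:", f1_score(labels, probs))
    print("Ranked:", prioritize([("W1", 0.2), ("W2", 0.9), ("W3", 0.5)]))
```

The Brier score rewards probabilities that match observed outcomes, which is why it complements threshold-based metrics such as F1 when the scores themselves are shown to developers as a ranking signal.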

Files

VivianNing_MasterThesis.pdf
(pdf | 3.27 MB)
License info not available