Beyond the Noise: Leveraging Machine Learning vs LLMs to Prioritize ASAT Warnings based on Actionability Probability

Master Thesis (2026)

Author(s)

V.Y. Ning (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

A.E. Zaidman – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Miroslav Zivkovic – Mentor (Software Improvement Group)

M.A. Costea – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Z. Erkin – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty

Electrical Engineering, Mathematics and Computer Science

Keywords

Random Forest, Machine Learning, LLM, Automated Static Analysis Tools, Warning Prioritization, Logistic Regression

To reference this document use

https://resolver.tudelft.nl/uuid:bd449323-7b85-4e05-8239-c4e218dc73a8

More Info

Publication Year

2026

Language

English

Graduation Date

01-07-2026

Awarding Institution

Delft University of Technology

Abstract

Automated Static Analysis Tools (ASATs) generate large volumes of non-actionable warnings. To address this, this thesis investigates the performance and resource trade-offs between classical Machine Learning (ML) models and Large Language Models (LLMs) for generating actionability probability scores. Using the NASCAR dataset of over 1.2 million Java warnings, we evaluate optimized classical models (Random Forest and Logistic Regression) against the Claude 4.x LLM family on classification metrics (F1-score, AUC) and probabilistic calibration (Brier score), supplemented by a qualitative user study with 15 industry professionals. The empirical results show that an optimized Random Forest achieves superior predictive performance (F1-score: 76.85%, AUC: 0.87) and reliable uncertainty calibration (Brier score: 0.1549), which makes the large computational overhead of the miscalibrated LLMs unnecessary. However, the user study identifies a human-AI feature disconnect: while the Random Forest relies heavily on historical metadata, developers universally demand source code context and severity indicators. Ultimately, an optimized Random Forest provides a significantly more efficient framework for scoring ASAT warnings, provided the scores are tightly coupled with the structural evidence required to sustain human trust.
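
For readers unfamiliar with the metrics named in the abstract, the following is a minimal sketch of how F1-score, AUC, and Brier score can be computed from actionability probabilities. It is not the thesis's evaluation code: the function names, the 0.5 decision threshold, and the four toy warnings are assumptions made purely for illustration, and none of the values come from the NASCAR dataset.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared gap between predicted probability and the 0/1 outcome.

    Lower is better; 0.0 means perfectly confident and correct.
    """
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def f1_score(probs: Sequence[float], labels: Sequence[int],
             threshold: float = 0.5) -> float:
    """F1 of the 'actionable' class after thresholding the probabilities.

    The 0.5 default threshold is an assumption of this sketch.
    """
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for yhat, y in zip(preds, labels) if yhat == 1 and y == 1)
    fp = sum(1 for yhat, y in zip(preds, labels) if yhat == 1 and y == 0)
    fn = sum(1 for yhat, y in zip(preds, labels) if yhat == 0 and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random actionable warning outranks a random
    non-actionable one (ties count as half). Quadratic in the number of
    warnings, so suitable only for small illustrations."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: four warnings with predicted actionability
# probabilities and ground-truth labels (1 = actionable).
toy_probs = [0.9, 0.7, 0.4, 0.2]
toy_labels = [1, 0, 1, 0]

# Prioritization step: show warnings in descending order of probability.
ranking = sorted(range(len(toy_probs)), key=lambda i: toy_probs[i], reverse=True)
```

The Brier score is the only one of the three that penalizes over- or under-confident probabilities directly, which is why it serves as the calibration measure alongside the two classification metrics.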

Files

VivianNing_MasterThesis.pdf

(pdf | 3.27 MB)

License info not available