V.Y. Ning
Automated Static Analysis Tools (ASATs) generate a large volume of non-actionable warnings. This thesis investigates the performance and resource trade-offs between classical Machine Learning (ML) models and Large Language Models (LLMs) for generating actionability probability scores. Using the NASCAR dataset of over 1.2 million Java warnings, we evaluate optimized classical models (Random Forest and Logistic Regression) against the Claude 4.x LLM family on classification metrics (F1-score, AUC) and probabilistic calibration (Brier score), supplemented by a qualitative user study with 15 industry professionals. The optimized Random Forest achieves the best predictive performance (F1-score: 76.85%, AUC: 0.87) and reliable uncertainty calibration (Brier score: 0.1549), so the much higher computational cost of the LLMs, which were also miscalibrated, is not justified for this task. However, the user study reveals a disconnect between the features the model uses and the features developers want: the Random Forest relies heavily on historical metadata, whereas developers universally ask for source code context and severity indicators. An optimized Random Forest therefore provides a significantly more efficient framework for scoring ASAT warnings, provided its scores are presented together with the structural evidence developers need in order to trust them.
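For readers unfamiliar with the calibration metric named in the abstract, the Brier score is the mean squared difference between each predicted probability and the observed 0/1 outcome, so lower is better and a constant prediction of 0.5 scores 0.25. The sketch below is only an illustration of that definition: the function name and the toy scores and labels are hypothetical, are not drawn from the NASCAR dataset, and do not reproduce the thesis's own evaluation code.

```python
def brier_score(probabilities, outcomes):
    """Mean squared gap between predicted probabilities and binary (0/1) outcomes."""
    if not probabilities or len(probabilities) != len(outcomes):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_gaps = [(p - y) ** 2 for p, y in zip(probabilities, outcomes)]
    return sum(squared_gaps) / len(squared_gaps)


# Hypothetical toy data: predicted actionability scores for four warnings
# and whether each warning turned out to be actionable (1) or not (0).
toy_scores = [0.9, 0.2, 0.7, 0.4]
toy_labels = [1, 0, 1, 0]

toy_brier = brier_score(toy_scores, toy_labels)

# Baseline for comparison: an uninformative scorer that always answers 0.5.
baseline_brier = brier_score([0.5] * len(toy_labels), toy_labels)

print(f"toy model Brier score: {toy_brier:.4f}")
print(f"constant-0.5 baseline: {baseline_brier:.4f}")
```

On this reading, the reported Brier score of 0.1549 sits below the 0.25 that a constant 0.5 prediction would earn, although how it compares with other baselines depends on the class balance of the data.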