A. Asgari


Please Note

This page displays the records of the person named above and is not linked to a unique person identifier. This record may need to be merged to a profile.

Conference paper (1)

Journal article (2)

3 records found

Adaptive Probabilistic Operational Testing for Large Language Models Evaluation

Conference paper (2025) - Ali Asgari, Antonio Guerriero, Roberto Pietrantuono, Stefano Russo

Large Language Models (LLMs) empower many modern software systems, and are required to be highly accurate and reliable. Evaluating LLMs poses challenges due to the high costs of manual labeling and of validation of labeled data. This study investigates the suitability of probabilistic operational testing for effective and efficient evaluation of LLMs, focusing on a case study with DistilBERT. To this aim, we adopt an existing framework (DeepSample) for Deep Neural Network (DNN) testing and adapt it to the LLM domain by introducing auxiliary variables tailored to LLMs and classification tasks. Through a comprehensive evaluation, we demonstrate how sampling-based operational testing can yield reliable LLM accuracy estimates and effectively expose failures, or, under testing budget constraints, find a trade-off between accuracy estimation and failure exposure. The experimental results, using DistilBERT on three sentiment analysis datasets, show that sampling-based methods can provide cost-effective and reliable operational accuracy assessment for LLMs. These findings offer practical insights for testers and help address critical gaps in current LLM evaluation practices. ...
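As a rough illustration of the idea of estimating accuracy from a small labeled sample, the sketch below uses plain simple random sampling with a normal-approximation interval. This is not the DeepSample framework or the auxiliary-variable samplers the abstract refers to; the function name `estimate_accuracy` and the `predict` and `label` callables are hypothetical stand-ins for a model and a manual labeling step.

```python
import math
import random


def estimate_accuracy(pool, predict, label, budget, seed=0):
    """Estimate accuracy from `budget` inputs drawn uniformly without replacement.

    `predict` and `label` are hypothetical callables: the model under test
    and the (costly) ground-truth labeling step, respectively.
    Returns (estimate, lower bound, upper bound) of an approximate 95% interval.
    """
    rng = random.Random(seed)
    sample = rng.sample(pool, budget)  # only these inputs need labeling
    correct = sum(1 for x in sample if predict(x) == label(x))
    acc = correct / budget

    # Finite-population correction: the interval shrinks to zero width
    # when the whole pool has been labeled.
    n_pool = len(pool)
    fpc = (n_pool - budget) / (n_pool - 1) if n_pool > 1 else 0.0
    half_width = 1.96 * math.sqrt(acc * (1 - acc) / budget * fpc)
    return acc, max(0.0, acc - half_width), min(1.0, acc + half_width)
```

A larger `budget` narrows the interval at the price of more labeling effort, which is the cost side of the trade-off the abstract mentions.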

Exploring the black box: Analysing explainable AI challenges and best practices through Stack Exchange discussions

Journal article (2025) - Mohammad Mahdi Sayyadnejad, Ali Asgari, Ashkan Sami, Hooman Tahayori

Explainable Artificial Intelligence (XAI) is a crucial domain within research and industry, aiming to develop AI models that provide human-understandable explanations for their decisions. While the challenges in AI, deep learning, and big data have been extensively explored, the specific concerns of XAI developers have received limited attention. To address this gap, we analysed discussions on Stack Exchange websites to delve into these issues. Through a combination of automated and manual analysis, we identified 6 overarching categories, 10 distinct topics, and 40 sub-topics commonly discussed by developers. Our examination revealed a steady rise in discussions on XAI since late 2015, initially focusing on conceptualisation and practical applications, with a notable surge in activity across all topic categories since 2019. Notably, Concepts and Applications, Tools Troubleshooting, and Neural Networks Interpretation emerged as the most popular topics. Troubleshooting challenges were commonly encountered with tools like SHAP, ELI5, and AIF360, while visualisation issues were prevalent with Yellowbrick and SHAP. Furthermore, our analysis suggests that questions related to XAI are more difficult to answer than other machine-learning questions. ...

Metamorphic Testing of Deep Code Models: A Systematic Literature Review

Journal article (2025) - A. Asgari, M. de Koning, P. Derakhshanfar, A. Panichella

Large language models and deep learning models designed for code intelligence have revolutionized the software engineering field due to their ability to perform various code-related tasks. These models can process source code and software artifacts with high accuracy in tasks such as code completion, defect detection, and code summarization; therefore, they can potentially become an integral part of modern software engineering practices. Despite these capabilities, robustness remains a critical quality attribute for deep code models, as they may produce different results under varied and adversarial conditions (e.g., variable renaming). Metamorphic testing has become a widely used approach to evaluate models’ robustness by applying semantic-preserving transformations to input programs and analyzing the stability of model outputs. While prior research has explored testing deep learning models, this systematic literature review focuses specifically on metamorphic testing for deep code models. By studying 45 primary papers, we analyze the transformations, techniques, and evaluation methods used to assess robustness. Our review summarizes the current landscape, identifying frequently evaluated models, programming tasks, datasets, target languages, and evaluation metrics, and highlights key challenges and future directions for advancing the field. ...
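The variable-renaming example mentioned in the abstract can be sketched as a small metamorphic check. This is a minimal illustration, not a technique taken from any of the reviewed papers: `rename_variable`, `violates_relation`, and the toy `brittle_model` are hypothetical names, and the rename assumes the new identifier does not already occur in the program. It relies on `ast.unparse`, available from Python 3.9.

```python
import ast


class RenameVariable(ast.NodeTransformer):
    """Rename every use of one identifier (variables and function arguments)."""

    def __init__(self, old, new):
        self.old, self.new = old, new

    def visit_Name(self, node):
        if node.id == self.old:
            node.id = self.new
        return node

    def visit_arg(self, node):
        self.generic_visit(node)  # also visit any annotation on the argument
        if node.arg == self.old:
            node.arg = self.new
        return node


def rename_variable(source, old, new):
    # Assumption: `new` is not already used in `source`, so behavior is preserved.
    tree = RenameVariable(old, new).visit(ast.parse(source))
    return ast.unparse(tree)


def violates_relation(model, source, old, new):
    """Metamorphic relation: the model output should not change under renaming."""
    return model(source) != model(rename_variable(source, old, new))


def brittle_model(source):
    # Toy stand-in for a code model: flags code that mentions 'tmp'.
    return "tmp" in source
```

A model whose output flips after such a rename, as the toy `brittle_model` does for any program that uses `tmp`, is the kind of robustness failure metamorphic testing is meant to expose.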