A. Asgari
Large Language Models (LLMs) power many modern software systems and are required to be highly accurate and reliable. Evaluating LLMs is challenging because manually labeling data and validating the labels are both costly. This study investigates the suitability of probabilistic operational testing for effective and efficient evaluation of LLMs, focusing on a case study with DistilBERT. To this end, we adopt DeepSample, an existing framework for Deep Neural Network (DNN) testing, and adapt it to the LLM domain by introducing auxiliary variables tailored to LLMs and classification tasks. Through a comprehensive evaluation, we demonstrate how sampling-based operational testing can yield reliable LLM accuracy estimates and effectively expose failures or, under testing budget constraints, find a trade-off between accuracy estimation and failure exposure. The experimental results, using DistilBERT on three sentiment analysis datasets, show that sampling-based methods can provide cost-effective and reliable operational accuracy assessment for LLMs. These findings offer practical insights for testers and help address critical gaps in current LLM evaluation practices.
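To make the idea of sampling-based accuracy estimation concrete, the following is a minimal sketch of stratified sampling driven by an auxiliary variable. It is not the DeepSample implementation and not the method evaluated in the paper: the choice of model confidence as the auxiliary variable, the equal-width strata, the proportional allocation, and the names `stratified_accuracy_estimate` and `is_correct` are all assumptions made for illustration.

```python
import random


def stratified_accuracy_estimate(confidences, is_correct, budget, n_strata=4, seed=0):
    """Estimate accuracy while checking only about `budget` items.

    confidences: per-item model confidence in [0, 1] (assumed auxiliary variable).
    is_correct:  hypothetical callable, index -> bool, standing in for the
                 costly manual labeling of one item.
    Returns (accuracy_estimate, number_of_items_labeled).
    """
    rng = random.Random(seed)
    n = len(confidences)

    # Partition item indices into equal-width confidence bins (strata).
    strata = [[] for _ in range(n_strata)]
    for i, c in enumerate(confidences):
        k = min(int(c * n_strata), n_strata - 1)
        strata[k].append(i)

    estimate = 0.0
    labeled = 0
    for stratum in strata:
        if not stratum:
            continue
        # Proportional allocation, with at least one item per non-empty stratum.
        # Rounding means the total can differ slightly from `budget`.
        n_h = min(len(stratum), max(1, round(budget * len(stratum) / n)))
        sample = rng.sample(stratum, n_h)
        acc_h = sum(is_correct(i) for i in sample) / n_h
        # Weight each stratum's sample accuracy by its share of the population.
        estimate += (len(stratum) / n) * acc_h
        labeled += n_h
    return estimate, labeled
```

A call such as `stratified_accuracy_estimate(confs, lambda i: preds[i] == labels[i], budget=50)` would label roughly 50 items and return a weighted accuracy estimate. Strategies aimed at exposing failures rather than estimating accuracy would typically oversample low-confidence strata instead of allocating proportionally.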
Exploring the black box: Analysing explainable AI challenges and best practices through Stack Exchange discussions
Explainable Artificial Intelligence (XAI) is a crucial domain within research and industry, aiming to develop AI models that provide human-understandable explanations for their decisions. While the challenges in AI, deep learning, and big data have been extensively explored, the specific concerns of XAI developers have received limited attention. To address this gap, we analysed discussions on Stack Exchange websites. Through a combination of automated and manual analysis, we identified 6 overarching categories, 10 distinct topics, and 40 sub-topics commonly discussed by developers. Our examination revealed a steady rise in discussions on XAI since late 2015, initially focusing on conceptualisation and practical applications, with a notable surge in activity across all topic categories since 2019. Concepts and Applications, Tools Troubleshooting, and Neural Networks Interpretation emerged as the most popular topics. Troubleshooting challenges were commonly encountered with tools like SHAP, ELI5, and AIF360, while visualisation issues were prevalent with Yellowbrick and SHAP. Furthermore, our analysis suggests that questions related to XAI are more difficult to answer than other machine-learning questions.