Investigation and Comparison of Evaluation Methods of Model-Agnostic Explainable AI Models

Abstract

Many artificial intelligence (AI) systems are built using black-box machine learning (ML) algorithms, whose lack of transparency and interpretability reduces their trustworthiness. In recent years, research into explainable AI (XAI) has increased. XAI systems are designed to address common ML concerns such as trust, accountability, and transparency. However, research into the evaluation of XAI remains limited. In this paper, common trends in the evaluation of state-of-the-art model-agnostic XAI models are identified, along with evaluation methods that are missing or undervalued. First, a taxonomy is explored and an overview of evaluation metrics found in the literature is compiled. Using this overview, a thorough analysis and comparison of the evaluation methods of five state-of-the-art model-agnostic XAI models (LIME, SHAP, Anchors, PASTLE, and CASTLE) is then conducted. It is found that only a small subset of the identified evaluation metrics is used in the evaluation of these models. Metrics that are rarely assessed in user studies but deserve more attention are (appropriate) trust, task time, and task performance. For synthetic experiments, only fidelity is commonly assessed. Moreover, the models are evaluated only on proxy tasks; none are evaluated on real-world tasks. Finally, each identified metric was found to have multiple distinct measurement methods and units of measurement, indicating a lack of standardization.