Learning Multimodal Explainable AI Models from Medical Images and Tabular Data

Proof of Concept

Conference Paper (2025)
Author(s)

Mafalda Malafaia (Centrum Wiskunde & Informatica (CWI))

Thalea Schlender (Leiden University Medical Center, TU Delft - Algorithmics)

Peter Bosman (TU Delft - Algorithmics, Centrum Wiskunde & Informatica (CWI))

Tanja Alderliesten (Leiden University Medical Center, TU Delft - Algorithmics)

Research Group
Algorithmics
DOI
https://doi.org/10.1117/12.3040402
Publication Year
2025
Language
English
Bibliographical Note
Green Open Access added to the TU Delft Institutional Repository as part of the Taverne project ‘You share, we take care!’ (https://www.openaccess.nl/en/you-share-we-take-care). Otherwise, as indicated in the copyright section: the publisher is the copyright holder of this work, and the author uses Dutch legislation to make this work public.
ISBN (electronic)
9781510685901
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Medical applications often involve several data modalities, particularly medical images and clinical information, which can be combined to enhance the decision-making process by improving accuracy. Multimodal learning approaches can leverage all available data for increased robustness in the resulting models, consequently outperforming unimodal approaches. Furthermore, AI frameworks must be human-verifiable and interpretable to be deployed in real-world situations, considering legal and privacy aspects. Due to the opaque nature of Deep Learning (DL) methods, interpretability is often limited despite their state-of-the-art performance on many tasks. Genetic Programming (GP) can provide compact and interpretable symbolic expressions for tabular data but is less effective for image analysis. We introduce MultiFIX: a new interpretability-focused pipeline for multimodal learning that leverages the strengths of DL and GP to explicitly engineer features from different data types and combine them to make the final prediction. The MultiFIX pipeline comprises two stages: a training stage, in which a DL (black-box) model is trained using different training procedures to extract relevant features from each modality, and an inference stage, in which the resulting model is transformed to be interpretable. Image features are explained with attention maps generated by Grad-CAM, while inherently interpretable symbolic expressions evolved with GP fully replace both the tabular feature engineering block and the fusion of the extracted features used to predict the target label. To show the application potential of the presented pipeline, we demonstrate MultiFIX on a Melanoma Risk Assessment dataset. Results show that MultiFIX outperforms unimodal models while offering explanations that can be straightforwardly analysed and are consistent with expectations.
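The two-stage structure described in the abstract can be sketched in plain Python. This is a purely illustrative toy, not the paper's implementation: the feature names, coefficients, and threshold below are hypothetical stand-ins for what MultiFIX would learn (a trained DL image branch explained via Grad-CAM, and GP-evolved symbolic expressions for the tabular branch and the fusion block).

```python
# Illustrative sketch of a MultiFIX-style inference pipeline.
# All names, coefficients, and thresholds are hypothetical examples,
# not values from the paper.

def image_feature(pixel_intensities):
    # Stand-in for the DL image branch. In MultiFIX this is a trained
    # network whose extracted feature is explained post hoc with
    # Grad-CAM attention maps. Here: a trivial mean intensity.
    return sum(pixel_intensities) / len(pixel_intensities)

def tabular_feature(age, lesion_diameter_mm):
    # Stand-in for the tabular branch. In MultiFIX this block is fully
    # replaced by an inherently interpretable symbolic expression
    # evolved with GP. Hypothetical example expression:
    return 0.02 * age + 0.1 * lesion_diameter_mm

def fuse(f_img, f_tab, threshold=1.0):
    # Fusion block: in MultiFIX this is also a GP-evolved symbolic
    # expression rather than a learned dense layer, so the final
    # prediction remains human-readable.
    return int(f_img + f_tab > threshold)

# Toy prediction combining one feature per modality.
risk = fuse(image_feature([0.2, 0.8, 0.5]),
            tabular_feature(age=60, lesion_diameter_mm=4))
# risk is 1 here: 0.5 (image) + 1.6 (tabular) exceeds the threshold.
```

Because each block is a readable expression rather than an opaque model, a clinician can inspect exactly which engineered features drove the prediction; this is the interpretability property the pipeline targets.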

Files

1340612.pdf
(PDF, 2.09 MB)
License info not available

File under embargo until 13-10-2025