Enhanced phishing detection using multimodal data

Journal Article (2025)
Author(s)

Lázaro Bustio-Martínez (Universidad Iberoamericana)

Vitali Herrera-Semenets (Advanced Technologies Application Center)

Jorge Ángel González-Ordiano (Universidad Iberoamericana Ciudad de México)

Yamel Pérez-Guadarramas (Universidad Iberoamericana Ciudad de México)

Luis Zúñiga-Morales (Universidad Iberoamericana Ciudad de México)

Daniela Montoya-Godínez (Universidad Iberoamericana Ciudad de México)

Miguel Ángel Álvarez-Carmona (Centro de Investigacion en Matematicas, CIMAT)

Jan van den Berg (TU Delft - Cyber Security)

Research Group
Cyber Security
DOI related publication
https://doi.org/10.1016/j.knosys.2025.115105
Publication Year
2025
Language
English
Volume number
334
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Phishing remains one of the most persistent cybersecurity threats, increasingly exploiting not only technical vulnerabilities but also human cognitive biases. Existing detection systems often rely on single-modality features and black-box models, which restrict both generalization and interpretability. This study presents an explainable multimodal framework that combines textual and technical cues, including message content, URL structure, and Principles of Persuasion, to capture both objective and subjective aspects of phishing. Several classifiers were evaluated using 10-fold stratified cross-validation, with Random Forest achieving the best balance between performance and transparency (ROC-AUC = 0.9840), supported by SHAP explanations that identify the most influential linguistic and structural features. Comparative analysis shows that the proposed framework outperforms unimodal baselines while preserving interpretability, enabling a clear rationale for classification outcomes. These results indicate that integrating multimodal representation with explainable learning strengthens phishing detection accuracy, improves user trust, and supports reliable deployment in real-world environments.
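
The evaluation protocol named in the abstract (stratified k-fold cross-validation scored by ROC-AUC) can be illustrated with a minimal, dependency-free sketch. This is not the authors' code: the function names `stratified_folds` and `roc_auc`, the toy labels, and the toy scores are all assumptions made for illustration, and a real study would more likely rely on an established machine-learning library.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign each sample index to one of k folds, preserving class ratios.

    Hypothetical helper for illustration; not taken from the paper.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so every fold gets a near-equal share per class.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def roc_auc(labels, scores):
    """ROC-AUC as the probability that a positive sample outranks a negative one.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one sample of each class")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data (assumed): 30 phishing (1) and 70 legitimate (0) messages.
toy_labels = [1] * 30 + [0] * 70
toy_folds = stratified_folds(toy_labels, k=10)

# Toy scores (assumed): three of the four positive/negative pairs are ranked correctly.
toy_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])  # 0.75
```

In a full pipeline, each fold would serve once as the held-out test set while a classifier is trained on the remaining nine, and the per-fold ROC-AUC values would then be averaged.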