Phishing remains one of the most persistent cybersecurity threats, increasingly exploiting not only technical vulnerabilities but also human cognitive biases. Existing detection systems often rely on single-modality features and black-box models, which limits both generalization and interpretability. This study presents an explainable multimodal framework that combines textual and technical cues, including message content, URL structure, and Principles of Persuasion, to capture both objective and subjective aspects of phishing. Several classifiers were evaluated using 10-fold stratified cross-validation, with Random Forest achieving the best balance between performance and transparency (ROC-AUC = 0.9840). SHAP explanations identify the linguistic and structural features that most influence its predictions. Comparative analysis shows that the proposed framework outperforms unimodal baselines while preserving interpretability, so that each classification outcome comes with a clear rationale. These results indicate that integrating multimodal representation with explainable learning improves phishing detection accuracy and, by making the basis for each decision visible, can help build user trust and support reliable deployment in real-world environments.
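The evaluation protocol named above (stratified 10-fold cross-validation scored by ROC-AUC) can be illustrated with a minimal, standard-library sketch. It is not the study's pipeline. The data are synthetic, the three feature columns are placeholders for fused text, URL, and persuasion cues, and a simple nearest-centroid scorer stands in for the Random Forest. The helper names `stratified_folds`, `roc_auc`, and `cross_validated_auc` are hypothetical and chosen here for illustration only.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign sample indices to k folds so each fold keeps a similar class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def roc_auc(y_true, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one sample of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def centroid(rows):
    """Column-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cross_validated_auc(X, y, k=10, seed=0):
    """Mean ROC-AUC over k stratified folds, using a nearest-centroid stand-in model."""
    aucs = []
    for test_idx in stratified_folds(y, k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        # "Training": one centroid per class, computed on the training folds only.
        c_pos = centroid([X[i] for i in train_idx if y[i] == 1])
        c_neg = centroid([X[i] for i in train_idx if y[i] == 0])
        # Higher score = closer to the phishing centroid than to the benign one.
        scores = [sq_dist(X[i], c_neg) - sq_dist(X[i], c_pos) for i in test_idx]
        aucs.append(roc_auc([y[i] for i in test_idx], scores))
    return sum(aucs) / len(aucs)


# Synthetic stand-in data: 100 "phishing" (1) and 100 "benign" (0) samples,
# each with three placeholder features whose means differ between classes.
rng = random.Random(42)
y = [1] * 100 + [0] * 100
X = [[rng.gauss(1.5 * label, 1.0) for _ in range(3)] for label in y]

mean_auc = cross_validated_auc(X, y)
print(f"mean ROC-AUC over 10 folds: {mean_auc:.3f}")
```

Stratification matters here because each held-out fold must contain both classes for ROC-AUC to be defined, and preserving the class ratio keeps per-fold scores comparable. In practice the stand-in scorer would be replaced by the classifier under study, with feature attributions computed on the fitted model.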