Spanish phishing and legitimate email dataset with technical and psychological annotations
Lázaro Bustio-Martínez (Iberoamericana University)
Viviana Inés Fuentes-Fuentes (Independent researcher)
Luisa Fernanda Agudelo Fuentes (Iberoamericana University)
Vitali Herrera-Semenets (On the Dime S.r.l.)
Darián Llanes-Guilarte (Reparto Siboney)
Felipe Antonio Trujillo-Fernández (Iberoamericana University)
Antonio Carlos Cardeña-Matamoros (Iberoamericana University)
Carlos Francisco Betancourt-Moreno (Iberoamericana University)
Andrés Guillermo Molano-Jiménez (Iberoamericana University)
Jan van den Berg (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Abstract
The SpaPhish dataset is a curated corpus of 1395 anonymized Spanish-language emails collected from the personal and institutional inboxes of the dataset authors. The collection comprises 731 phishing messages and 664 legitimate messages, spanning 2014–2025. Each record combines raw textual content (subject and body), derived technical metadata, and psychological annotations. Technical variables include extracted URLs (url_count, urls), routing depth (hops_count), and attachment metadata (attachments_count, types, and size-related fields). All personally identifying elements were anonymized through manual redaction and controlled substitution, which preserves readability while preventing re-identification. A central component of SpaPhish is the persuasion-annotation layer, aligned with Ana Ferreira’s Principles of Persuasion framework. Three independent annotators followed a triple-blind protocol, assigning binary presence labels (0/1) for five dimensions (Authority, Social Proof, Liking/Similarity/Deception, Commitment/Integrity/Reciprocation, and Distraction), together with brief justifications written in Spanish and consolidated consensus labels. SpaPhish supports Spanish-language phishing research, hybrid text–metadata modeling, and annotation reliability studies, and it enables explainable analyses grounded in human-provided evidence. The dataset is publicly available in Mendeley Data as “SpaPhish: A Spanish Dataset for Phishing and Psychological Pattern Detection” (version 5, doi:10.17632/hz2d6gz7pc.5; https://data.mendeley.com/datasets/hz2d6gz7pc/5).
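To make the annotation layer concrete, the sketch below shows one way three annotators' binary labels could be combined into a consensus label per dimension, along with a simple full-agreement rate of the kind an annotation reliability study might start from. The abstract does not say how the consensus labels were consolidated, so the majority-vote rule is an assumption. The function names, the dictionary layout, and the example labels are all hypothetical and do not reflect the actual column names or contents of the dataset.

```python
from typing import Dict, List

# The five persuasion dimensions named in the abstract.
DIMENSIONS = [
    "Authority",
    "Social Proof",
    "Liking/Similarity/Deception",
    "Commitment/Integrity/Reciprocation",
    "Distraction",
]


def majority_vote(labels: List[int]) -> int:
    """Return 1 if more than half of the binary labels are 1, else 0.

    Assumption: majority voting stands in for the dataset's own
    (unspecified) consolidation procedure.
    """
    if not labels or any(v not in (0, 1) for v in labels):
        raise ValueError("labels must be a non-empty list of 0/1 values")
    return 1 if sum(labels) * 2 > len(labels) else 0


def consolidate(annotations: List[Dict[str, int]]) -> Dict[str, int]:
    """Combine per-annotator labels for one email into one label per dimension."""
    return {
        dim: majority_vote([ann[dim] for ann in annotations])
        for dim in DIMENSIONS
    }


def full_agreement_rate(records: List[List[Dict[str, int]]], dim: str) -> float:
    """Fraction of emails on which every annotator gave the same label for dim."""
    if not records:
        raise ValueError("records must not be empty")
    agree = sum(1 for anns in records if len({ann[dim] for ann in anns}) == 1)
    return agree / len(records)


# Hypothetical labels from three annotators for a single email.
example_email = [
    dict(zip(DIMENSIONS, [1, 0, 1, 0, 0])),
    dict(zip(DIMENSIONS, [1, 0, 0, 0, 1])),
    dict(zip(DIMENSIONS, [0, 0, 1, 0, 0])),
]

consensus = consolidate(example_email)
```

In this made-up example, two of the three annotators marked Authority as present, so the consensus label for Authority is 1, while Distraction, marked by only one annotator, resolves to 0. A chance-corrected statistic such as Fleiss' kappa would be a more rigorous reliability measure than the raw agreement rate shown here.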