Spanish phishing and legitimate email dataset with technical and psychological annotations

Journal Article (2026)

Author(s)

Lázaro Bustio-Martínez (Iberoamericana University)

Viviana Inés Fuentes-Fuentes (Independent researcher)

Luisa Fernanda Agudelo Fuentes (Iberoamericana University)

Vitali Herrera-Semenets (On the Dime S.r.l.)

Darián Llanes-Guilarte (Reparto Siboney)

Felipe Antonio Trujillo-Fernández (Iberoamericana University)

Antonio Carlos Cardeña-Matamoros (Iberoamericana University)

Carlos Francisco Betancourt-Moreno (Iberoamericana University)

Andrés Guillermo Molano-Jiménez (Iberoamericana University)

Jan van den Berg (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Research Group

Cyber Security

Email dataset; Human annotation; Persuasion annotations; Phishing emails; Spanish language; Technical metadata

DOI related publication

https://doi.org/10.1016/j.dib.2026.112890 Final published version

To reference this document use

https://resolver.tudelft.nl/uuid:d58e8fb0-09ff-4d8e-a0f6-b718bde9887a

Publication Year

2026

Language

English

Journal title

Data in Brief

Volume number

67

Article number

112890

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

The SpaPhish dataset is a curated corpus of 1395 anonymized Spanish-language emails collected from the personal and institutional inboxes of the dataset authors. The collection comprises 731 phishing messages and 664 legitimate communications, spanning 2014–2025. Each record integrates raw textual content (subject and body), derived technical metadata, and psychological annotations. Technical variables include extracted URLs (url_count, urls), routing depth (hops_count), and attachment metadata (attachments_count, types, and size-related fields). All personally identifying elements were anonymized through manual redaction and controlled substitution to preserve readability while preventing re-identification. A central component of SpaPhish is the persuasion-annotation layer aligned with Ana Ferreira’s Principles of Persuasion framework. Three independent annotators performed a triple-blind protocol, assigning binary presence labels (0/1) for five dimensions (Authority, Social Proof, Liking/Similarity/Deception, Commitment/Integrity/Reciprocation, and Distraction), accompanied by brief Spanish justifications and consolidated consensus labels. SpaPhish supports Spanish-language phishing research, hybrid text–metadata modeling, and annotation reliability studies, and enables explainable analyses grounded in human-provided evidence. The dataset is publicly available in Mendeley Data as “SpaPhish: A Spanish Dataset for Phishing and Psychological Pattern Detection” (version 5, doi:10.17632/hz2d6gz7pc.5; https://data.mendeley.com/datasets/hz2d6gz7pc/5).
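
The annotation layer described above (three annotators, binary labels on five dimensions, plus a consolidated label) can be illustrated with a minimal sketch. The abstract does not state how the consensus labels were derived, so the majority vote below is an assumption, not the authors' procedure. The short dimension keys, the function names, and the example labels are likewise hypothetical and do not come from the dataset files.

```python
from typing import Dict, List

# The five dimensions come from the abstract; these short keys are hypothetical
# and may not match the column names used in the published dataset.
DIMENSIONS = ["authority", "social_proof", "liking", "commitment", "distraction"]


def majority_consensus(annotations: List[Dict[str, int]]) -> Dict[str, int]:
    """Label a dimension 1 when more than half of the annotators marked it.

    Majority voting is an assumed rule, used here only for illustration.
    """
    if not annotations:
        raise ValueError("at least one annotation is required")
    threshold = len(annotations) / 2
    return {
        dim: int(sum(a[dim] for a in annotations) > threshold)
        for dim in DIMENSIONS
    }


def full_agreement_rate(annotations: List[Dict[str, int]]) -> float:
    """Fraction of dimensions on which every annotator gave the same label."""
    agreed = sum(
        1 for dim in DIMENSIONS if len({a[dim] for a in annotations}) == 1
    )
    return agreed / len(DIMENSIONS)


# Invented labels for a single email, one dict per annotator.
example = [
    {"authority": 1, "social_proof": 0, "liking": 0, "commitment": 1, "distraction": 1},
    {"authority": 1, "social_proof": 0, "liking": 1, "commitment": 0, "distraction": 1},
    {"authority": 1, "social_proof": 0, "liking": 0, "commitment": 0, "distraction": 1},
]

consensus = majority_consensus(example)
agreement = full_agreement_rate(example)
```

With an odd number of annotators and binary labels a majority always exists, so no tie-breaking rule is needed in this sketch. A reliability study would more likely report a chance-corrected statistic than the raw agreement rate computed here.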
