Lázaro Bustio-Martínez
The SpaPhish dataset is a curated corpus of 1395 anonymized Spanish-language emails collected from the personal and institutional inboxes of the dataset authors. The collection comprises 731 phishing messages and 664 legitimate communications, spanning 2014–2025. Each record integrates raw textual content (subject and body), derived technical metadata, and psychological annotations. Technical variables include extracted URLs (url_count, urls), routing depth (hops_count), and attachment metadata (attachments_count, types, and size-related fields). All personally identifying elements were anonymized through manual redaction and controlled substitution to preserve readability while preventing re-identification. A central component of SpaPhish is the persuasion-annotation layer aligned with Ana Ferreira’s Principles of Persuasion framework. Three independent annotators performed a triple-blind protocol, assigning binary presence labels (0/1) for five dimensions (Authority, Social Proof, Liking/Similarity/Deception, Commitment/Integrity/Reciprocation, and Distraction), accompanied by brief Spanish justifications and consolidated consensus labels. SpaPhish supports Spanish-language phishing research, hybrid text–metadata modeling, and annotation reliability studies, and enables explainable analyses grounded in human-provided evidence. The dataset is publicly available in Mendeley Data as “SpaPhish: A Spanish Dataset for Phishing and Psychological Pattern Detection” (version 5, doi:10.17632/hz2d6gz7pc.5; https://data.mendeley.com/datasets/hz2d6gz7pc/5).
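As an illustration of how three binary annotations per dimension might be consolidated into one consensus label, here is a minimal Python sketch. It assumes a simple majority vote; the abstract does not state which consolidation rule the authors used, and the dictionary keys below are hypothetical shorthand for the five dimensions, not the dataset's actual column names.

```python
# Hypothetical shorthand keys for the five persuasion dimensions;
# the real SpaPhish column names may differ.
DIMENSIONS = ["authority", "social_proof", "liking", "commitment", "distraction"]


def consensus(labels_by_annotator):
    """Return a majority-vote label (0 or 1) for each dimension.

    labels_by_annotator is a list with one dict per annotator,
    mapping each dimension to a binary presence label.
    Majority voting is an assumption made for this sketch.
    """
    result = {}
    for dim in DIMENSIONS:
        votes = [annotator[dim] for annotator in labels_by_annotator]
        # A label is present when more than half of the annotators marked it.
        result[dim] = 1 if sum(votes) * 2 > len(votes) else 0
    return result


# Made-up annotations for a single email (not taken from the dataset).
annotations = [
    {"authority": 1, "social_proof": 0, "liking": 0, "commitment": 1, "distraction": 0},
    {"authority": 1, "social_proof": 0, "liking": 1, "commitment": 0, "distraction": 0},
    {"authority": 1, "social_proof": 1, "liking": 0, "commitment": 1, "distraction": 0},
]
example_consensus = consensus(annotations)
```

With an odd number of annotators and binary labels, a majority vote never ties, which is one reason such a rule is a common default for three-annotator schemes.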
Social security programs aim to protect vulnerable populations; however, accurately identifying individuals with significantly lower incomes than their peers (accounting for age, occupation, and education level) remains an operational challenge. This article proposes an innovative method for detecting economic vulnerability by combining income data enrichment with large language models in a multi-agent architecture, unsupervised clustering techniques, and statistical heuristics. The developed algorithm analyzes demographic and labor-related variables to estimate expected annual income by profile, thereby identifying atypical discrepancies that suggest vulnerability. This approach not only optimizes the prioritization of beneficiaries for targeted assistance but also serves as a preventive mechanism against the inadvertent exclusion of eligible groups. Preliminary results demonstrate the method’s effectiveness in detecting hidden vulnerability, particularly among young adults aged 17–23, whose high underemployment rates (≈40%) in recent national statistics closely align with the concentration of vulnerability detected. These findings underscore its potential as a complementary tool to enhance equity and efficiency in social policy implementation.
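The core idea of comparing each person's income with the income expected for their profile can be sketched in a few lines of Python. This is only a simplified illustration under stated assumptions: it uses the per-profile median as the expected income and a hypothetical cutoff ratio of 0.5, and it does not reproduce the language-model enrichment or clustering steps mentioned in the abstract. The record fields and the function name are invented for the example.

```python
from collections import defaultdict
from statistics import median


def flag_vulnerable(records, ratio=0.5):
    """Return ids of records whose income is far below their profile's median.

    records: list of dicts with hypothetical keys "id", "profile", "income".
    ratio: hypothetical cutoff; an income below ratio * median is flagged.
    """
    incomes_by_profile = defaultdict(list)
    for record in records:
        incomes_by_profile[record["profile"]].append(record["income"])

    # Expected income per profile, approximated here by the median.
    expected = {p: median(v) for p, v in incomes_by_profile.items()}

    return [
        record["id"]
        for record in records
        if record["income"] < ratio * expected[record["profile"]]
    ]


# Made-up records; a profile combines age band, occupation, and education.
young_retail = ("17-23", "retail", "secondary")
senior_engineer = ("40-50", "engineering", "university")
records = [
    {"id": "a", "profile": young_retail, "income": 1000},
    {"id": "b", "profile": young_retail, "income": 1100},
    {"id": "c", "profile": young_retail, "income": 300},
    {"id": "d", "profile": senior_engineer, "income": 5000},
    {"id": "e", "profile": senior_engineer, "income": 5200},
    {"id": "f", "profile": senior_engineer, "income": 4800},
]
flagged = flag_vulnerable(records)
```

The median is used rather than the mean so that a few very low or very high incomes within a profile do not shift the baseline against which everyone else is compared.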
Unmasking Phishing Attempts: A Study on Detection in Spanish Emails
Every day, the number of devices interacting through telecommunications networks grows, resulting in an increase in the volume of data and information generated. At the same time, a growing number of information security incidents is being observed, including unauthorized accesses, also called intrusions. As a consequence of these two developments, providers of information and communications services require automated processes to detect and resolve such intrusions, and this should be done quickly in order to keep the related cybersecurity risks at acceptable levels. However, large volumes of data degrade the performance of the classifiers used in intrusion detection tasks, which limits their applicability in practice. The research reported in this paper proposes a novel feature selection algorithm for intrusion detection scenarios. To this end, an extensive literature review was first carried out to identify issues in previously reported feature selection algorithms. Based on the insights obtained, a new multi-measure feature selection algorithm was designed that uses qualitative information provided by multiple feature selection measures and reduces the dimensionality of the training data set. The proposed algorithm was then extensively tested on various data sets. It provides greater efficacy than other feature selection algorithms used for intrusion detection purposes. We conclude by providing some ideas for future research to further improve the algorithm.
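One generic way to combine several feature selection measures is to convert each measure's scores into ranks and keep the features with the best summed rank. The Python sketch below shows that rank-aggregation idea only; it is not the algorithm proposed in the paper, whose details the abstract does not give. The measure names, feature names, and scores are made up for illustration.

```python
def rank_features(scores):
    """Map each feature to its rank under one measure (1 = best score)."""
    ordered = sorted(scores, key=lambda f: (-scores[f], f))
    return {feature: position + 1 for position, feature in enumerate(ordered)}


def select_features(measures, k):
    """Keep the k features with the lowest summed rank across all measures.

    measures: dict mapping a measure name to a dict of feature -> score,
    where a higher score means a more relevant feature.
    """
    features = list(next(iter(measures.values())))
    total_rank = {feature: 0 for feature in features}
    for scores in measures.values():
        ranks = rank_features(scores)
        for feature in features:
            total_rank[feature] += ranks[feature]
    # Ties in summed rank are broken alphabetically for determinism.
    return sorted(features, key=lambda f: (total_rank[f], f))[:k]


# Made-up scores from three hypothetical measures over four features.
measures = {
    "info_gain": {"duration": 0.10, "src_bytes": 0.80, "flag": 0.55, "protocol": 0.30},
    "chi2": {"duration": 1.2, "src_bytes": 40.0, "flag": 55.0, "protocol": 9.0},
    "correlation": {"duration": 0.05, "src_bytes": 0.60, "flag": 0.40, "protocol": 0.45},
}
selected = select_features(measures, k=2)
```

Working with ranks instead of raw scores sidesteps the fact that different measures live on different scales, at the cost of discarding how large the gaps between features are.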