Lázaro Bustio-Martínez | TU Delft Repository

Spanish phishing and legitimate email dataset with technical and psychological annotations

Journal article (2026) - Lázaro Bustio-Martínez, Viviana Inés Fuentes-Fuentes, Luisa Fernanda Agudelo Fuentes, Vitali Herrera-Semenets, Darián Llanes-Guilarte, Felipe Antonio Trujillo-Fernández, Antonio Carlos Cardeña-Matamoros, Carlos Francisco Betancourt-Moreno, Andrés Guillermo Molano-Jiménez, Jan van den Berg

The SpaPhish dataset is a curated corpus of 1395 anonymized Spanish-language emails collected from the personal and institutional inboxes of the dataset authors. The collection comprises 731 phishing messages and 664 legitimate communications, spanning 2014–2025. Each record integrates raw textual content (subject and body), derived technical metadata, and psychological annotations. Technical variables include extracted URLs (url_count, urls), routing depth (hops_count), and attachment metadata (attachments_count, types, and size-related fields). All personally identifying elements were anonymized through manual redaction and controlled substitution to preserve readability while preventing re-identification. A central component of SpaPhish is the persuasion-annotation layer aligned with Ana Ferreira’s Principles of Persuasion framework. Three independent annotators performed a triple-blind protocol, assigning binary presence labels (0/1) for five dimensions (Authority, Social Proof, Liking/Similarity/Deception, Commitment/Integrity/Reciprocation, and Distraction), accompanied by brief Spanish justifications and consolidated consensus labels. SpaPhish supports Spanish-language phishing research, hybrid text–metadata modeling, and annotation reliability studies, and enables explainable analyses grounded in human-provided evidence. The dataset is publicly available in Mendeley Data as “SpaPhish: A Spanish Dataset for Phishing and Psychological Pattern Detection” (version 5, doi:10.17632/hz2d6gz7pc.5; https://data.mendeley.com/datasets/hz2d6gz7pc/5). ...
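As an illustration of how three binary annotations per dimension could be consolidated, here is a minimal Python sketch. It assumes a simple majority vote; the abstract does not say how the consensus labels were actually produced, and the record layout (one dict per annotator) is hypothetical. Only the five dimension names come from the abstract.

```python
# Dimension names as listed in the abstract; everything else is illustrative.
DIMENSIONS = [
    "Authority",
    "Social Proof",
    "Liking/Similarity/Deception",
    "Commitment/Integrity/Reciprocation",
    "Distraction",
]


def consensus_labels(annotations):
    """Majority vote over binary (0/1) labels from an odd number of annotators.

    `annotations` is a list with one dict per annotator, each mapping a
    dimension name to 0 or 1. Majority voting is an assumption of this sketch.
    """
    if len(annotations) % 2 == 0:
        raise ValueError("majority vote needs an odd number of annotators")
    result = {}
    for dim in DIMENSIONS:
        votes = [a[dim] for a in annotations]
        result[dim] = 1 if sum(votes) > len(votes) / 2 else 0
    return result


# Toy annotations for one email (made-up values).
example = [
    dict(zip(DIMENSIONS, [1, 0, 1, 0, 0])),
    dict(zip(DIMENSIONS, [1, 0, 0, 0, 0])),
    dict(zip(DIMENSIONS, [0, 0, 1, 0, 1])),
]
print(consensus_labels(example))
```

With three annotators a binary majority always exists, so no tie-breaking rule is needed in this sketch.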

Detecting Economic Vulnerability via Multi-Agent LLM Architecture and Context-Aware Cluster Analysis

Conference paper (2026) - Vitali Herrera-Semenets, Lázaro Bustio-Martínez, Jan van den Berg, Miguel Ángel Álvarez-Carmona

Social security programs aim to protect vulnerable populations; however, accurately identifying individuals with significantly lower incomes than their peers (accounting for age, occupation, and education level) remains an operational challenge. This article proposes an innovative method for detecting economic vulnerability by combining income data enrichment with large language models in a multi-agent architecture, unsupervised clustering techniques, and statistical heuristics. The developed algorithm analyzes demographic and labor-related variables to estimate expected annual income by profile, thereby identifying atypical discrepancies that suggest vulnerability. This approach not only optimizes the prioritization of beneficiaries for targeted assistance but also serves as a preventive mechanism against the inadvertent exclusion of eligible groups. Preliminary results demonstrate the method’s effectiveness in detecting hidden vulnerability, particularly among young adults aged 17–23, whose high underemployment rates (≈40%) in recent national statistics closely align with the concentration of vulnerability detected. These findings underscore its potential as a complementary tool to enhance equity and efficiency in social policy implementation. ...
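The idea of comparing each person's income with the income expected for their profile can be sketched in a few lines of Python. This is not the paper's method: the multi-agent LLM enrichment and clustering steps are replaced here by a plain per-profile median, and the field names, the `ratio` threshold, and the toy records are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median


def profile_of(record):
    # A "profile" here is just the tuple of grouping variables (assumption).
    return (record["age_group"], record["occupation"], record["education"])


def flag_vulnerable(records, ratio=0.5):
    """Return ids whose income falls below `ratio` times their profile median."""
    incomes = defaultdict(list)
    for r in records:
        incomes[profile_of(r)].append(r["income"])
    expected = {p: median(v) for p, v in incomes.items()}
    return [r["id"] for r in records if r["income"] < ratio * expected[profile_of(r)]]


# Toy records (made-up values).
records = [
    {"id": 1, "age_group": "17-23", "occupation": "retail", "education": "secondary", "income": 100},
    {"id": 2, "age_group": "17-23", "occupation": "retail", "education": "secondary", "income": 110},
    {"id": 3, "age_group": "17-23", "occupation": "retail", "education": "secondary", "income": 40},
    {"id": 4, "age_group": "17-23", "occupation": "retail", "education": "secondary", "income": 105},
    {"id": 5, "age_group": "24-30", "occupation": "farming", "education": "primary", "income": 30},
    {"id": 6, "age_group": "24-30", "occupation": "farming", "education": "primary", "income": 32},
    {"id": 7, "age_group": "24-30", "occupation": "farming", "education": "primary", "income": 28},
]
print(flag_vulnerable(records))  # [3]
```

The median is used instead of the mean so that a few very low or very high incomes do not shift the expected value for the whole profile.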

Enhanced phishing detection using multimodal data

Journal article (2025) - Lázaro Bustio-Martínez, Vitali Herrera-Semenets, Jorge Ángel González-Ordiano, Yamel Pérez-Guadarramas, Luis Zúñiga-Morales, Daniela Montoya-Godínez, Miguel Ángel Álvarez-Carmona, Jan van den Berg

Phishing remains one of the most persistent cybersecurity threats, increasingly exploiting not only technical vulnerabilities but also human cognitive biases. Existing detection systems often rely on single-modality features and black-box models, which restrict both generalization and interpretability. This study presents an explainable multimodal framework that combines textual and technical cues, including message content, URL structure, and Principles of Persuasion, to capture both objective and subjective aspects of phishing. Several classifiers were evaluated using 10-fold stratified cross-validation, with Random Forest achieving the best balance between performance and transparency (ROC-AUC = 0.9840), supported by SHAP explanations that identify the most influential linguistic and structural features. Comparative analysis shows that the proposed framework outperforms unimodal baselines while preserving interpretability, enabling a clear rationale for classification outcomes. These results indicate that integrating multimodal representation with explainable learning strengthens phishing detection accuracy, improves user trust, and supports reliable deployment in real-world environments. ...
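Stratified k-fold cross-validation, as mentioned above, keeps the class ratio roughly equal in every fold. The following Python sketch shows only the fold-assignment step under simple assumptions: it is deterministic (no shuffling), uses round-robin assignment per class, and runs on made-up labels. It is not the study's evaluation code.

```python
from collections import defaultdict


def stratified_folds(labels, k=10):
    """Assign sample indices to k folds so each fold mirrors the class ratio.

    Indices of each class are dealt out round-robin. When a class size is
    not a multiple of k, the earlier folds receive one extra sample.
    """
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


# Toy labels: 20 phishing (1) and 30 legitimate (0) messages.
labels = [1] * 20 + [0] * 30
folds = stratified_folds(labels, k=10)
for fold in folds:
    # Each fold holds 2 positives and 3 negatives.
    assert sum(labels[i] for i in fold) == 2 and len(fold) == 5
```

In each round one fold serves as the test set and the remaining nine as the training set; the reported score is then averaged over the ten rounds.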

Unmasking Phishing Attempts: A Study on Detection in Spanish Emails

Conference paper (2025) - Vitali Herrera-Semenets, Lázaro Bustio-Martínez, Yamel Pérez-Guadarramas, Jorge Ángel González-Ordiano, Jan van den Berg

Phishing, a pervasive cybersecurity issue, involves fraudulent attempts to obtain sensitive information and to provoke unintentional money transfers or malware downloads, among others, by masquerading as trustworthy entities in electronic communications. This paper presents an innovative approach to phishing detection in Spanish emails using patterns represented as rules. Through a comprehensive yet efficient analysis of emails, we identify interpretable recurring patterns and relevant phrases used in phishing attempts. These phrases and words often aim to persuade victims into revealing personal or financial information. These patterns are translated into a set of rules that are applied to evaluate incoming emails. Additionally, a proof-of-concept is carried out using a phishing data set of Spanish emails created for this study. Our method achieved promising results in identifying phishing attempts, providing an additional layer of security for email users. Moreover, this approach can be adapted to detect phishing in other languages, making it a potentially global solution to this persistent cybersecurity issue. ...
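To make the notion of "patterns represented as rules" concrete, here is a small Python sketch of a weighted rule set applied to Spanish text. The rule names, regular expressions, weights, and threshold are all hypothetical and chosen for illustration; they are not the rules derived in the paper.

```python
import re

# Hypothetical rules: (name, pattern, weight). Illustrative only.
RULES = [
    ("urgency", re.compile(r"\b(urgente|inmediatamente|de inmediato)\b", re.IGNORECASE), 2),
    ("credential_request", re.compile(r"\b(contraseña|clave|pin)\b", re.IGNORECASE), 2),
    ("account_threat", re.compile(r"cuenta\s+(ser[aá]\s+)?(suspendida|bloqueada)", re.IGNORECASE), 3),
    ("click_link", re.compile(r"haga\s+clic", re.IGNORECASE), 1),
]


def evaluate_email(text, threshold=4):
    """Return (is_suspicious, matched_rule_names) for one email body."""
    matched = [(name, weight) for name, pattern, weight in RULES if pattern.search(text)]
    score = sum(weight for _, weight in matched)
    return score >= threshold, [name for name, _ in matched]


phish = "Urgente: su cuenta será suspendida. Haga clic aquí y confirme su contraseña."
legit = "Adjunto el acta de la reunión del martes."
print(evaluate_email(phish))  # (True, ['urgency', 'credential_request', 'account_threat', 'click_link'])
print(evaluate_email(legit))  # (False, [])
```

Returning the names of the matched rules alongside the verdict keeps the decision interpretable, which is the property the abstract emphasizes.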

Making Adequate Use of Generative Language Models in Scientific Research

Preprint (2025) - Lázaro Bustio-Martínez, Vitali Herrera-Semenets, Claudia Feregrino-Uribe, Jan van den Berg

The rapid proliferation of Large Language Models (LLMs) has led to their widespread adoption across scientific disciplines. However, a growing number of academic publications rely solely on off-the-shelf solutions applied to familiar tasks, without well-elaborated methodological innovation, theoretical grounding, or critical reflection. This trend has given rise to a form of superficial research in which generative LLMs are used without sound methodological motivation and critical risk-based reflection on their use. Based on these observations, this paper presents a critical examination of observed applications of LLMs in scientific publications, contrasting performance-driven applications with conceptually rigorous studies that integrate LLMs within structured scientific frameworks. This paper starts by providing a concise description of the technical inner workings of LLMs and, based on that, some of their limitations. Next, through a review of recent literature, the analysis identifies epistemological risks, structural incentives, and reproducibility challenges that compromise the integrity of scientific practice. The study concludes by proposing guidelines for the responsible and meaningful use of LLMs in research, emphasizing the need for theoretical alignment, methodological transparency, and the preservation of human epistemic agency. ...

Tax Underreporting Detection Using an Unsupervised Learning Approach

Conference paper (2024) - Vitali Herrera-Semenets, Lázaro Bustio-Martínez, Jorge Ángel González-Ordiano, Jan van den Berg

Governmental administrative domains can potentially benefit from a wide variety of currently available big data analysis methods. The tax administration is such an area that requires massive data processing to identify hidden patterns and trends of possible tax evasion. The use of supervised methods can be effective in these cases, but the lack of available labeled data limits their practical application in real-world scenarios. An alternative is the use of unsupervised methods, which have potential benefits in certain cases. In this sense, unsupervised methods are considered to be feasible as a decision support tool in tax evasion risk management systems. This paper proposes an unsupervised approach to identify signs of tax evasion by detecting possible tax underreporting. The proposed strategy is evaluated on a data set associated with individual income tax statistics of the United States. The results achieved are considered to be useful in decision-making and preventive actions on cases reported as suspicious. ...

Towards Automatic Principles of Persuasion Detection Using Machine Learning Approach

Conference paper (2024) - Lázaro Bustio-Martínez, Vitali Herrera-Semenets, Juan-Luis García-Mendoza, Jorge Ángel González-Ordiano, Luis Zúñiga-Morales, Rubén Sánchez Rivero, José Emilio Quiróz-Ibarra, Pedro Antonio Santander-Molina, Jan van den Berg, Davide Buscaldi

Persuasion is a human activity of influence. In marketing, persuasion can help customers find solutions to their problems, make informed choices, or convince someone to buy a useful (or useless) product or service. In computer crimes, persuasion can trick users into revealing sensitive information, or even performing actions that benefit attackers. Phishing is one of the most common and dangerous forms of persuasion-based attacks, as it exploits human vulnerabilities rather than technical ones. Therefore, an intelligent system capable of detecting and classifying persuasion attempts might be useful in protecting users. In this work, an approach that uses Machine Learning to analyze messages based on principles of persuasion and different data representations is presented. The aim of this research is to detect which data representation and which classification algorithm obtain the best results in detecting each principle of persuasion as a prior step to detecting phishing attacks. The results obtained indicate that among the combinations tested, there is one combination of data representation and classification algorithm that performs best. The related classification models obtained can detect the principles of persuasion with AUC-ROC values between 0.78 and 0.86. ...
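For readers unfamiliar with the metric, AUC-ROC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The Python sketch below computes it directly from that definition on made-up labels and scores; it is a didactic quadratic-time version, not the evaluation code used in the paper.

```python
def roc_auc(labels, scores):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy data: 3 of the 4 positive/negative pairs are ordered correctly.
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes.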

Uncovering phishing attacks using principles of persuasion analysis

Journal article (2024) - Lázaro Bustio-Martínez, Vitali Herrera-Semenets, Juan-Luis García-Mendoza, Miguel Ángel Álvarez-Carmona, Jorge Ángel González-Ordiano, Luis Zúñiga-Morales, José Emilio Quiróz-Ibarra, Pedro Antonio Santander-Molina, Jan van den Berg

With the rise of the Internet in the early ’90s, many fraudulent activities have migrated from physical to digital: one of them is phishing. Phishing is a deceptive practice focused on exploiting the human factor, which is the most vulnerable aspect of any security process. In this scam, social engineering techniques are extensively utilized, specifically focusing on the principles of persuasion, to deceive individuals into disclosing sensitive information or engaging in malicious actions. This research explores the use of message subjectivity for detecting phishing attacks. It does so by assessing the impact of various data representations and classifiers on automatically identifying principles of persuasion. Furthermore, it investigates how these detected principles of persuasion can be leveraged for identifying phishing attacks. The experiments conducted revealed that there is no universal solution for data representation and classifier selection to effectively detect all principles of persuasion. Instead, a tailored combination of data representation and classifiers is required for detecting each principle. The Machine Learning models created automatically detect principles of persuasion with AUC-ROC scores ranging from 0.7306 to 0.8191. Next, the detected principles of persuasion are used for phishing detection. This study also emphasizes the need for user-friendly and comprehensible models. To validate the proposal presented, several families of classifiers were tested, but among all of them, tree-based models (and Random Forest in particular) stand out as the preferred option. These models achieve a similar level of effectiveness as alternative methods while offering improved clarity and user-friendliness, with an AUC-ROC of 0.859842. ...

A Decision Tree Induction Algorithm for Efficient Rule Evaluation Using Shannon’s Expansion

Conference paper (2023) - Vitali Herrera-Semenets, Lázaro Bustio-Martínez, Raudel Hernández-León, Jan van den Berg

Decision trees are one of the most popular structures for decision-making and the representation of a set of rules. However, when a rule set is represented as a decision tree, some quirks in its structure may negatively affect its performance. For example, duplicate sub-trees and rule filters that need to be evaluated more than once could negatively affect efficiency. This paper presents a novel algorithm based on Shannon’s expansion, which guarantees that the same rule filter is not evaluated more than once, even if repeated in other rules. This increases efficiency during the evaluation process using the induced decision tree. Experiments demonstrated the viability of the proposed algorithm in processing-intensive scenarios, such as in intrusion detection and data stream analysis. ...
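Shannon's expansion splits a Boolean function on one variable x into the case x = 1 and the case x = 0: f = x·f(x=1) + ¬x·f(x=0). The Python sketch below applies that idea to a rule set in which every rule is a conjunction of named filters, so that no filter appears twice on any root-to-leaf path. It is a simplified illustration and not the paper's algorithm: the rule format, the most-frequent-filter heuristic, and the toy network filters are assumptions.

```python
from collections import Counter


def build_tree(rules, matched=frozenset()):
    """rules: dict of rule name -> set of filter names still to be checked."""
    # Rules with no remaining filters are fully satisfied on this path.
    matched = matched | {name for name, fs in rules.items() if not fs}
    pending = {name: fs for name, fs in rules.items() if fs}
    if not pending:
        return ("leaf", frozenset(matched))
    # Heuristic (assumption): split on the filter shared by most rules.
    counts = Counter(f for fs in pending.values() for f in fs)
    var = max(sorted(counts), key=counts.get)
    # Cofactor var = 1: the filter is satisfied, drop it from every rule.
    high = {name: fs - {var} for name, fs in pending.items()}
    # Cofactor var = 0: every rule that needs the filter can no longer fire.
    low = {name: fs for name, fs in pending.items() if var not in fs}
    return ("node", var, build_tree(high, matched), build_tree(low, matched))


def evaluate(tree, filters, item):
    """Walk the tree; return (names of fired rules, number of filter calls)."""
    calls = 0
    while tree[0] == "node":
        _, var, high, low = tree
        calls += 1
        tree = high if filters[var](item) else low
    return tree[1], calls


# Toy rule set: "is_tcp" and "port_22" are each shared by two rules.
rules = {
    "r1": {"is_tcp", "port_22"},
    "r2": {"is_tcp", "large_payload"},
    "r3": {"port_22"},
}
filters = {
    "is_tcp": lambda p: p["proto"] == "tcp",
    "port_22": lambda p: p["port"] == 22,
    "large_payload": lambda p: p["size"] > 1000,
}
fired, calls = evaluate(build_tree(rules), filters, {"proto": "tcp", "port": 22, "size": 100})
print(sorted(fired), calls)  # ['r1', 'r3'] 3
```

Checking the three rules one by one without short-circuiting would call filters five times on this packet; the tree needs three calls because each shared filter is tested once. The trade-off is that the tree can grow large when rules share few filters.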

Red Light/Green Light: A Lightweight Algorithm for, Possibly, Fraudulent Online Behavior Change Detection

Conference paper (2022) - V. Herrera Semenets, Raudel Hernández-León, Lázaro Bustio-Martínez, J. van den Berg

Telecommunications services have become a constant in people’s lives. This has inspired fraudsters to carry out malicious activities causing economic losses to people and companies. Early detection of signs that suggest the possible occurrence of malicious activity would allow analysts to act in time and avoid unintended consequences. Modeling the behavior of users could identify when a significant change takes place. Following this idea, an algorithm for online behavior change detection in telecommunication services is proposed in this paper. The experimental results show that the new algorithm can identify behavioral changes related to unforeseen events. ...

A multi-measure feature selection algorithm for efficacious intrusion detection

Journal article (2021) - Vitali Herrera-Semenets, Lázaro Bustio-Martínez, Raudel Hernández-León, Jan van den Berg

Every day the number of devices interacting through telecommunications networks grows, resulting in an increase in the volume of data and information generated. At the same time, a growing number of information security incidents is being observed, including the occurrence of unauthorized accesses, also named intrusions. As a consequence of these two developments, Information and Communications service providers require automated processes to detect and solve such intrusions, and this should be done quickly in order to keep the related cybersecurity risks at acceptable levels. However, the presence of large volumes of data negatively interferes with the performance of classifiers used in intrusion detection tasks, which limits their applicability in practical cases. The research reported in this paper focuses on proposing a novel feature selection algorithm for intrusion detection scenarios. To this end, an extensive literature review was executed to first discover issues in the feature selection algorithms reported. Based on the insights obtained, a new multi-measure feature selection algorithm was designed that uses qualitative information provided by multiple feature selection measures and reduces the dimensionality of the training data set. The proposed algorithm was then extensively tested using various data sets. It provides greater efficacy than other feature selection algorithms used for intrusion detection purposes. We conclude by providing some ideas on future research in order to further improve the algorithm. ...
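One simple way to combine several feature selection measures is to let each measure vote for its top-ranked features and keep the features that collect enough votes. The Python sketch below shows this voting scheme under stated assumptions: it is not the algorithm proposed in the paper, and the measure names, feature names, scores, `top_k`, and `min_votes` are all made up for illustration.

```python
def select_features(measure_scores, top_k, min_votes):
    """Keep features ranked in the top_k of at least min_votes measures.

    `measure_scores` maps a measure name to a dict of feature -> score,
    where a higher score means a more relevant feature.
    """
    votes = {}
    for scores in measure_scores.values():
        ranked = sorted(scores, key=scores.get, reverse=True)[:top_k]
        for feature in ranked:
            votes[feature] = votes.get(feature, 0) + 1
    return sorted(f for f, v in votes.items() if v >= min_votes)


# Toy scores from three hypothetical measures.
measure_scores = {
    "info_gain": {"duration": 0.9, "src_bytes": 0.7, "flag": 0.2, "urgent": 0.1},
    "chi2": {"duration": 0.8, "src_bytes": 0.3, "flag": 0.6, "urgent": 0.1},
    "gain_ratio": {"duration": 0.5, "src_bytes": 0.6, "flag": 0.4, "urgent": 0.2},
}
print(select_features(measure_scores, top_k=2, min_votes=2))  # ['duration', 'src_bytes']
```

Voting on ranks rather than averaging raw scores avoids having to normalize measures that live on different scales.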