Evaluating the Impact of Explicit Hate Speech Definitions on the Stability of LLM-based Hate Speech Classification

Bachelor Thesis (2026)

Author(s)

R.M. Martins dos Santos (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

U. Khurana – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

P.K. Murukannaiah – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

C.C.S. Liem – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty

Electrical Engineering, Mathematics and Computer Science

NLP, Calibration, Stability analysis, LLM, Consistency, Uncertainty, Hate speech detection

To reference this document use

https://resolver.tudelft.nl/uuid:14446a08-72b0-4c74-92b1-2f5acd6a5e58

More Info

Publication Year

2026

Language

English

Graduation Date

26-06-2026

Awarding Institution

Delft University of Technology

Project

CSE3000 Research Project

Programme

Computer Science and Engineering

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Automated hate speech detection is crucial for keeping up with the high demand for online moderation, yet current models struggle to produce stable and consistent results. Metrics such as accuracy evaluate a model's overall performance but fail to detect instability, in which predictions on identical inputs fluctuate. Other metrics can detect it, such as micro-consistency, which measures consistency at the level of individual test cases. This paper examines the impact of providing explicit definitions of hate speech to LLMs for hate speech classification, using micro-consistency and uncertainty metrics. The experiments used the Llama-3-8B-Instruct model for binary classification on the HateCheck dataset. The results show that providing explicit definitions in zero-shot prompts worsened both micro-consistency and uncertainty, and that the differences are statistically significant. However, more research is required to determine whether this decline in stability is caused by the model's inherent limitations or by a suboptimal setup for this task.
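
As an illustration of what a per-test-case view of stability can look like, the following minimal Python sketch scores repeated predictions on identical inputs. The abstract does not give the thesis's exact metric definitions, so the majority-agreement score and the entropy-based uncertainty used here are assumptions, as are the function names and the sample labels.

```python
from collections import Counter
from math import log2


def case_consistency(labels):
    """Fraction of repeated predictions that agree with the majority label.

    Assumed definition for illustration; not taken from the thesis.
    """
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def case_entropy(labels):
    """Shannon entropy (in bits) of the empirical label distribution.

    Used here as a simple stand-in for an uncertainty metric (assumption).
    """
    n = len(labels)
    probs = [c / n for c in Counter(labels).values()]
    return sum(p * log2(1 / p) for p in probs)


def micro_consistency(runs_per_case):
    """Mean per-case consistency over all test cases."""
    scores = [case_consistency(labels) for labels in runs_per_case.values()]
    return sum(scores) / len(scores)


# Hypothetical data: test case id -> labels from repeated runs on the same input.
predictions = {
    "case_1": ["hateful", "hateful", "hateful", "hateful"],
    "case_2": ["hateful", "non-hateful", "hateful", "non-hateful"],
}

for case_id, labels in predictions.items():
    print(case_id, case_consistency(labels), case_entropy(labels))
print("micro-consistency:", micro_consistency(predictions))
```

In this sketch a model can have the same accuracy under two prompts while scoring differently per case, because accuracy on a single run says nothing about whether repeated runs on the same input agree.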

Files

Hate_Speech_RP_Final_Paper.pdf

(pdf | 0.953 Mb)

License info not available