U. Khurana | TU Delft Repository

What Types of Hate Speech Samples Do LLMs Struggle With?

The Alignment of Large Language Models’ Responses to Subjective Variations in Hate Speech

Bachelor thesis (2026) - M. Dragomir, P.K. Murukannaiah, U. Khurana, C.C.S. Liem

Hate speech detection remains challenging because harmful language is often contextual, indirect, and difficult to distinguish from legitimate discussion, criticism, or reporting. While previous work has highlighted the influence of differing hate speech definitions on annotation and evaluation, less attention has been paid to the specific types of samples that remain difficult for large language models (LLMs), regardless of how hate speech is defined. This paper investigates which hate speech and non-hate speech samples are most challenging for Llama-3-8B-Instruct and Qwen2.5-7B-Instruct using HateCheck Extended and seven hate speech definitions representing platform policies, legal frameworks, and theoretical perspectives.

The results show that overall performance remains relatively stable across definitions, but sample-level analysis reveals substantial differences in error patterns. Explicit hateful cues are generally classified correctly, whereas context-dependent phenomena remain difficult across definitions. Cross-definition analysis further identifies errors that persist regardless of definition, suggesting that these failures stem from model limitations rather than definitional ambiguity alone. These findings demonstrate that sample-level evaluation provides insights not visible through aggregate performance metrics alone and highlight the continuing challenge of contextual reasoning in LLM-based moderation systems. ...
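The cross-definition analysis described above can be illustrated with a minimal sketch: given predictions for the same samples under several definitions, keep only the samples that are misclassified under every one of them. The function name, the definition keys, and the toy labels are hypothetical and are not taken from the thesis.

```python
from typing import Dict, List, Set


def persistent_errors(gold: List[int], preds_by_definition: Dict[str, List[int]]) -> Set[int]:
    """Return indices of samples misclassified under every definition."""
    if not preds_by_definition:
        return set()
    persistent = set(range(len(gold)))
    for preds in preds_by_definition.values():
        # Indices where this definition's prediction disagrees with the gold label.
        wrong = {i for i, (g, p) in enumerate(zip(gold, preds)) if g != p}
        persistent &= wrong
    return persistent


# Toy data (assumed): 1 = hateful, 0 = non-hateful.
gold = [1, 0, 1, 0]
preds_by_definition = {
    "platform_policy": [1, 1, 0, 0],
    "legal_framework": [1, 1, 1, 0],
    "theoretical": [1, 1, 0, 1],
}
errors = persistent_errors(gold, preds_by_definition)
```

Here only the sample at index 1 is wrong under all three definitions, so it is the kind of case the abstract attributes to model limitations rather than to definitional ambiguity.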

Evaluating the Impact of Explicit Hate Speech Definitions on the Stability of LLM-based Hate Speech Classification

Bachelor thesis (2026) - R.M. Martins dos Santos, U. Khurana, P.K. Murukannaiah, C.C.S. Liem

Automated hate speech detection is crucial for keeping up with the high demand for online moderation, yet current models struggle to produce stable and consistent results. Metrics such as accuracy evaluate a model's overall performance but fail to detect instability, where predictions on identical inputs fluctuate. Micro-consistency metrics can detect this because they measure consistency at the level of the individual test case. This paper examines the impact of providing explicit definitions of hate speech to LLMs for hate speech classification, using micro-consistency and uncertainty metrics. The experiments used the Llama-3-8B-Instruct model for binary classification on the HateCheck dataset. The results show that providing explicit definitions under zero-shot prompting worsened both micro-consistency and uncertainty, and that the differences are statistically significant. However, more research is required to conclude with certainty that this decline in stability is caused by the model's inherent limitations, rather than a suboptimal setup for this task. ...
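To make the idea of consistency at the test-case level concrete, the sketch below assumes one plausible reading: for each test case, take the share of repeated runs that agree with the majority label, then average over cases. The thesis may define micro-consistency differently; the function names and toy labels are hypothetical.

```python
from collections import Counter
from typing import List


def case_consistency(run_labels: List[int]) -> float:
    """Share of repeated runs that agree with the majority label for one test case."""
    counts = Counter(run_labels)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(run_labels)


def micro_consistency(runs_per_case: List[List[int]]) -> float:
    """Average per-case consistency over all test cases (assumed definition)."""
    scores = [case_consistency(labels) for labels in runs_per_case]
    return sum(scores) / len(scores)


# Toy data (assumed): each inner list holds the labels from four repeated runs on one input.
runs = [
    [1, 1, 1, 1],  # fully stable
    [0, 1, 0, 0],  # one run flips
    [1, 0, 1, 0],  # maximally unstable for a binary label
]
score = micro_consistency(runs)
```

A score of 1.0 would mean every input receives the same label on every run; aggregate accuracy computed from a single run cannot expose the fluctuation in the second and third cases.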

The Alignment of Large Language Models' Responses to Subjective Variations in Hate Speech

Comparing Alignment to Real-Life-Inspired Definitions in Zero-Shot Hate Speech Classification

Bachelor thesis (2026) - V. Bunovska, P.K. Murukannaiah, U. Khurana, C.C.S. Liem

Detecting hateful content on social media has become an active area of research, with recent approaches focusing on the use of Large Language Models (LLMs). Rather than using datasets to train classifiers, researchers are exploring methods that embed hate speech definitions directly in the model's prompt. However, hate speech is a subjective concept, and its definition varies across contexts. As a result, LLMs must align their classifications with the specific definition provided in the prompt. To make the creation of such definitions more systematic, frameworks for constructing context-specific definitions of hate speech have been proposed. Yet no work has compared how framework-based formulations influence LLM alignment relative to the definitions used in real-life regulation, such as laws and social media policies. This study therefore compares definitions from the Hate Speech Criteria (HSC) framework, legal texts, and platform policies by evaluating how closely two LLMs align with each type under a zero-shot prompting setup. Our results indicate that while the level of alignment is model-dependent, legal and policy definitions generally guide LLM behavior more effectively than framework-based formulations. Nevertheless, definitions created with the framework still steer models in the intended direction, suggesting that further refinement of these frameworks could improve their effectiveness in prompt-based hate speech detection. ...

Tailoring In-Context Learning Techniques for Definition-Based Hate Speech Detection in Large Language Models

Bachelor thesis (2026) - Parham Bateni, Pradeep Kumar Murukannaiah, Urja Khurana, Cynthia Liem

Hate speech lacks a single agreed definition across legal, social, and benchmark contexts, yet instruction-tuned large language models (LLMs) are increasingly used for hate speech detection. While recent work has explored definition-aware prompting, it remains unclear how different definitions interact with few-shot prompting strategies and model capacity. We investigate whether zero-shot and few-shot in-context learning can align LLMs with dataset-specific hate speech definitions without fine-tuning. Using the HateCheck benchmark, we evaluate three models (Gemma-2-2B, Llama-3.2-3B, and Qwen2.5-3B) under three definition settings (no definition, author-provided text, and structured criteria-based definition) and four prompting strategies (zero-shot and three few-shot variants). Results show that explicit definitions do not reliably improve performance and can sometimes reduce it. Furthermore, few-shot prompting is generally more effective, with the strongest performance often achieved by retrieving semantically similar examples for each query and including them in the prompt. In addition, higher-capacity models benefit more from richer prompts, whereas the smallest model frequently degrades as prompt complexity increases. Overall, definition wording, exemplar selection, and model capacity interact strongly and should be tuned jointly rather than considered in isolation. ...
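The retrieval-based few-shot strategy mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the thesis setup: a bag-of-words cosine similarity stands in for whatever semantic embedding the authors used, and the prompt layout, function names, and placeholder examples are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List, Tuple


def bow(text: str) -> Counter:
    """Lowercased bag-of-words vector (a stand-in for a semantic embedding)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, pool: List[Tuple[str, str]], k: int = 2) -> List[Tuple[str, str]]:
    """Return the k labelled examples most similar to the query."""
    q = bow(query)
    return sorted(pool, key=lambda ex: cosine(q, bow(ex[0])), reverse=True)[:k]


def build_prompt(query: str, pool: List[Tuple[str, str]], definition: str = "", k: int = 2) -> str:
    """Assemble an optional definition, the retrieved exemplars, and the query."""
    parts = []
    if definition:
        parts.append(f"Definition: {definition}")
    for text, label in retrieve(query, pool, k):
        parts.append(f"Text: {text}\nLabel: {label}")
    parts.append(f"Text: {query}\nLabel:")
    return "\n\n".join(parts)


# Placeholder pool (assumed); [GROUP] marks a templated target slot.
pool = [
    ("people from [GROUP] are not welcome here", "hateful"),
    ("the weather is lovely today", "non-hateful"),
    ("everyone is welcome here", "non-hateful"),
]
query = "[GROUP] are not welcome"
top = retrieve(query, pool, k=2)
prompt = build_prompt(query, pool, definition="(definition text goes here)", k=2)
```

Because exemplars are chosen per query, the definition text, the exemplar pool, and `k` all change what the model sees, which is consistent with the abstract's point that these factors should be tuned jointly.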

Which definition of hate speech does the default behaviour of large language models align with most closely?

A Zero-Shot Probing Study of Two Open-Weight Models

Bachelor thesis (2026) - Y. Xiong, P.K. Murukannaiah, U. Khurana, C.C.S. Liem

What counts as hate speech varies, which complicates automated detection. Large language models (LLMs) are increasingly used for this task in a zero-shot setting, yet the intrinsic definition of hate speech that such models apply when no definition is supplied remains poorly understood. This paper probes the intrinsic, unguided conception of hate speech that two open-weight instruction-tuned models, Meta Llama 3.1 and Google Gemma 4, apply by default. We combine three complementary measurements: zero-shot binary classification, structured elicitation of Hate Speech Criteria (HSC), and a contamination control that compares both tasks with a set of novel cases. We add two follow-up analyses: a prompt-paraphrase robustness check and a definition-injection probe on the dominance criterion. Both models classify hateful content with high binary accuracy and demonstrate strong target group identification. However, they fail on the dominance criterion, defaulting instead to a misinterpretation in which almost all hostile speech is labelled as dominating. We conclude that while the default definition these LLMs apply is target-aware, its tendency toward over-inclusive criterion application constrains the reliability of unguided models for fine-grained hate speech characterisation. ...