The Alignment of Large Language Models' Responses to Subjective Variations in Hate Speech

Comparing Alignment to Real-Life-Inspired Definitions in Zero-Shot Hate Speech Classification

Bachelor Thesis (2026)
Author(s)

V. Bunovska (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

P.K. Murukannaiah – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

U. Khurana – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

C.C.S. Liem – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty
Electrical Engineering, Mathematics and Computer Science
More Info
expand_more
Publication Year
2026
Language
English
Graduation Date
23-06-2026
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Faculty
Electrical Engineering, Mathematics and Computer Science
Downloads counter
13
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Detecting hateful content on social media has become an active area of research, with recent approaches focusing on the use of Large Language Models (LLMs). Rather than using datasets to train classifiers, researchers are exploring methods that embed hate speech definitions directly in the model's prompt. However, hate speech is a subjective concept, and its definition varies across contexts. As a result, LLMs must align their classifications with the specific definition provided in the prompt. To make the creation process more systematic, frameworks for constructing context-specific definitions of hate speech have been proposed. Yet, no work has compared how framework-based formulations influence LLM alignment relative to the definitions used in real-life regulation, such as laws and social media policies. This study, therefore, compares definitions from the Hate Speech Criteria (HSC) framework, legal texts, and platform policies by evaluating how precisely two LLMs align with each type under a zero-shot prompting setup. Our results indicate that while the level of alignment is model-dependent, legal and policy definitions generally guide LLM behavior more effectively than framework-based formulations. Nevertheless, definitions created with the framework still steer models in the intended direction, suggesting that further refinement of these frameworks could improve their effectiveness in prompt-based hate speech detection.

Files

License info not available