Which definition of hate speech does the default behaviour of large language models align with most closely?

A Zero-Shot Probing Study of Two Open-Weight Models

Bachelor Thesis (2026)

Author(s)

Y. Xiong (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

P.K. Murukannaiah – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

U. Khurana – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

C.C.S. Liem – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty

Electrical Engineering, Mathematics and Computer Science

Keywords

Large language models, Hate speech detection, Definitional analysis

To reference this document use

https://resolver.tudelft.nl/uuid:cd143c92-8cb5-443a-8fc4-282296a663e3

More Info

Publication Year

2026

Language

English

Graduation Date

23-06-2026

Awarding Institution

Delft University of Technology

Project

CSE3000 Research Project

Programme

Computer Science and Engineering

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Definitions of hate speech vary, which complicates automated detection. Large language models (LLMs) are increasingly used for this task in a zero-shot setting, yet the definition of hate speech that such models apply when none is supplied remains poorly understood. This paper probes the intrinsic, unguided conception of hate speech that two open-weight instruction-tuned models, Meta Llama 3.1 and Google Gemma 4, apply by default. We combine three complementary measurements: zero-shot binary classification, structured elicitation of Hate Speech Criteria (HSC), and a contamination control that compares both tasks with a set of novel cases. We add two follow-up analyses: a prompt-paraphrase robustness check and a definition-injection probe on the dominance criterion. Both models classify hateful content with high binary accuracy and identify target groups reliably. However, they fail on the dominance criterion, defaulting to a misinterpretation in which almost all hostile speech is labelled as dominating. We conclude that the default definition these LLMs apply is target-aware, but that its over-inclusive application of criteria limits the reliability of unguided models for fine-grained hate speech characterisation.
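As an illustration of two of the measurements named in the abstract, the sketch below scores zero-shot binary classification and a prompt-paraphrase robustness check. It is a minimal sketch under stated assumptions, not the thesis's implementation: the function names, the yes/no reply format, and the label strings are all hypothetical, and the model call itself is left out.

```python
def parse_label(response: str) -> str:
    """Map a free-text model reply to a label.

    Assumes (hypothetically) that the prompt asked for a yes/no answer.
    This is a crude prefix heuristic, not a robust parser.
    """
    text = response.strip().lower()
    if text.startswith("yes"):
        return "hate"
    if text.startswith("no"):
        return "not_hate"
    return "unparsed"


def accuracy(predicted: list[str], gold: list[str]) -> float:
    """Fraction of items whose predicted label equals the gold label."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def paraphrase_agreement(labels_by_prompt: list[list[str]]) -> float:
    """Fraction of items that receive the same label under every paraphrase.

    labels_by_prompt holds one list of labels per prompt paraphrase,
    aligned so that position i always refers to the same input text.
    """
    if not labels_by_prompt or not labels_by_prompt[0]:
        raise ValueError("need at least one paraphrase and one item")
    per_item = list(zip(*labels_by_prompt))
    return sum(len(set(labels)) == 1 for labels in per_item) / len(per_item)
```

With replies collected from a model, `accuracy` would be computed once per prompt wording, and `paraphrase_agreement` across all wordings; a low agreement value would suggest that the measured default definition depends on phrasing rather than on the model alone.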

Files

BRP_Yuanze_Xiong_LLM_Hate_Spee... (pdf)

(pdf | 1.23 Mb)

License info not available