M. Dragomir

The Alignment of Large Language Models’ Responses to Subjective Variations in Hate Speech

Hate speech detection remains challenging because harmful language is often contextual, indirect, and difficult to distinguish from legitimate discussion, criticism, or reporting. While previous work has highlighted the influence of differing hate speech definitions on annotation and evaluation, less attention has been paid to the specific types of samples that remain difficult for large language models (LLMs), regardless of how hate speech is defined. This paper investigates which hate speech and non-hate speech samples are most challenging for LLaMA 3-8B-Instruct and Qwen 2.5-7B-Instruct using HateCheck Extended and seven hate speech definitions representing platform policies, legal frameworks, and theoretical perspectives.

The results show that overall performance remains relatively stable across definitions, but sample-level analysis reveals substantial differences in error patterns. Explicit hateful cues are generally classified correctly, whereas context-dependent phenomena remain difficult across definitions. Cross-definition analysis further identifies errors that persist regardless of definition, suggesting that these failures stem from model limitations rather than definitional ambiguity alone. These findings demonstrate that sample-level evaluation provides insights not visible through aggregate performance metrics alone and highlight the continuing challenge of contextual reasoning in LLM-based moderation systems. ...
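The cross-definition analysis described above can be illustrated with a minimal sketch. It is not the paper's implementation: the sample ids, the definition names (`policy_a`, `legal_b`), the labels, and both helper functions are hypothetical, chosen only to show how two definitions can yield the same aggregate accuracy while differing in which samples they get wrong, and how errors that persist under every definition can be isolated.

```python
def accuracy_by_definition(gold, predictions):
    """Aggregate accuracy per definition (hypothetical helper).

    gold: dict mapping sample id -> gold label
    predictions: dict mapping definition name -> {sample id -> predicted label}
    """
    return {
        name: sum(preds[sid] == gold[sid] for sid in gold) / len(gold)
        for name, preds in predictions.items()
    }


def persistent_errors(gold, predictions):
    """Sample ids misclassified under every definition (hypothetical helper)."""
    persistent = set(gold)
    for preds in predictions.values():
        wrong = {sid for sid, label in gold.items() if preds[sid] != label}
        persistent &= wrong  # keep only samples that are wrong here as well
    return persistent


# Toy data, invented for illustration only.
gold = {
    "s1": "hateful",
    "s2": "non-hateful",
    "s3": "hateful",
    "s4": "non-hateful",
}
predictions = {
    # Wrong on s2 and s3.
    "policy_a": {"s1": "hateful", "s2": "hateful",
                 "s3": "non-hateful", "s4": "non-hateful"},
    # Wrong on s1 and s2.
    "legal_b": {"s1": "non-hateful", "s2": "hateful",
                "s3": "hateful", "s4": "non-hateful"},
}

print(accuracy_by_definition(gold, predictions))
print(persistent_errors(gold, predictions))
```

In this toy setup both definitions reach the same accuracy, yet only `s2` is misclassified under both. That is the kind of definition-independent error the abstract attributes to model limitations rather than to definitional ambiguity.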