M. Dragomir
What Types of Hate Speech Samples Do LLMs Struggle With?
The Alignment of Large Language Models’ Responses to Subjective Variations in Hate Speech
Hate speech detection remains challenging because harmful language is often contextual, indirect, and difficult to distinguish from legitimate discussion, criticism, or reporting. While previous work has highlighted the influence of differing hate speech definitions on annotation and evaluation, less attention has been paid to the specific types of samples that remain difficult for large language models (LLMs), regardless of how hate speech is defined. This paper investigates which hate speech and non-hate speech samples are most challenging for LLaMA 3-8B-Instruct and Qwen 2.5-7B-Instruct using HateCheck Extended and seven hate speech definitions representing platform policies, legal frameworks, and theoretical perspectives.
The results show that overall performance remains relatively stable across definitions, but sample-level analysis reveals substantial differences in error patterns. Explicit hateful cues are generally classified correctly, whereas context-dependent phenomena remain difficult across definitions. Cross-definition analysis further identifies errors that persist regardless of definition, suggesting that these failures stem from model limitations rather than definitional ambiguity alone. These findings demonstrate that sample-level evaluation provides insights not visible through aggregate performance metrics alone and highlight the continuing challenge of contextual reasoning in LLM-based moderation systems.
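The cross-definition idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: the sample IDs, labels, definition names, and category names below are invented toy data, and the helper names `accuracy` and `persistent_errors` are hypothetical. It shows how aggregate accuracy can be identical across definitions while the set of misclassified samples differs, and how the errors that persist under every definition can be isolated.

```python
from collections import Counter


def accuracy(gold, preds):
    """Fraction of samples whose predicted label matches the gold label."""
    return sum(preds[sid] == label for sid, label in gold.items()) / len(gold)


def persistent_errors(gold, preds_by_definition):
    """Return IDs of samples misclassified under every definition."""
    persistent = set(gold)
    for preds in preds_by_definition.values():
        wrong = {sid for sid, label in gold.items() if preds[sid] != label}
        persistent &= wrong  # keep only samples that are wrong here as well
    return persistent


# Toy data (assumed for illustration only, not taken from the paper).
gold = {"s1": "hateful", "s2": "non-hateful", "s3": "hateful", "s4": "non-hateful"}
category = {"s1": "explicit", "s2": "counter_speech", "s3": "implicit", "s4": "neutral"}
preds_by_definition = {
    "platform_policy": {"s1": "hateful", "s2": "hateful", "s3": "non-hateful", "s4": "non-hateful"},
    "legal_framework": {"s1": "hateful", "s2": "hateful", "s3": "hateful", "s4": "hateful"},
    "theoretical": {"s1": "hateful", "s2": "hateful", "s3": "non-hateful", "s4": "non-hateful"},
}

# Aggregate view: one accuracy score per definition.
scores = {name: accuracy(gold, preds) for name, preds in preds_by_definition.items()}

# Sample-level view: errors that survive every definition, grouped by category.
stuck = persistent_errors(gold, preds_by_definition)
stuck_by_category = Counter(category[sid] for sid in stuck)

print(scores)
print(sorted(stuck), dict(stuck_by_category))
```

In this toy setup every definition scores the same accuracy, yet only one sample is wrong under all three, which is the kind of pattern an aggregate metric alone would hide.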