R.M. Martins dos Santos

Automated hate speech detection is crucial for keeping up with the high demand for online moderation, yet current models struggle to produce stable and consistent results. Aggregate metrics such as accuracy evaluate a model's overall performance but fail to detect instability, that is, predictions that fluctuate on identical inputs. Metrics that can detect it exist, such as micro-consistency, which measures consistency at the level of the individual test case. This paper examines the impact of providing explicit definitions of hate speech to LLMs for hate speech classification, using micro-consistency and uncertainty metrics. The experiments used the Llama-3-8B-Instruct model for binary classification on the HateCheck dataset. The results show that providing explicit definitions under zero-shot prompting worsened both micro-consistency and uncertainty, and that the differences are statistically significant. However, more research is required to determine whether this decline in stability is caused by the model's inherent limitations or by a suboptimal setup for this task. ...
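To illustrate why accuracy alone can hide instability, here is a minimal sketch in Python. It is not the paper's implementation. The abstract does not give the metric definitions, so the sketch assumes that micro-consistency is the fraction of test cases that receive the same label in every repeated run, and that uncertainty is the mean Shannon entropy of the labels each case receives. The function names and the toy labels are hypothetical.

```python
from collections import Counter
from math import log2


def per_run_accuracy(runs, gold):
    """Accuracy of each repeated run against the gold labels."""
    return [sum(p == g for p, g in zip(run, gold)) / len(gold) for run in runs]


def micro_consistency(runs):
    """Assumed definition: share of test cases labeled identically in all runs."""
    per_case = list(zip(*runs))  # transpose: one tuple of predictions per test case
    return sum(len(set(preds)) == 1 for preds in per_case) / len(per_case)


def mean_label_entropy(runs):
    """Assumed uncertainty proxy: mean entropy (bits) of each case's labels."""
    per_case = list(zip(*runs))
    total = 0.0
    for preds in per_case:
        counts = Counter(preds)
        total += -sum((c / len(preds)) * log2(c / len(preds)) for c in counts.values())
    return total / len(per_case)


# Toy data (invented for illustration): 4 test cases, 3 repeated runs.
gold = ["hateful", "non-hateful", "hateful", "non-hateful"]
runs = [
    ["hateful", "non-hateful", "hateful", "hateful"],
    ["hateful", "non-hateful", "non-hateful", "non-hateful"],
    ["hateful", "hateful", "hateful", "non-hateful"],
]

print(per_run_accuracy(runs, gold))  # every run scores 0.75
print(micro_consistency(runs))       # only 1 of 4 cases is stable: 0.25
print(round(mean_label_entropy(runs), 3))
```

In this toy setup every run reaches the same accuracy, yet three of the four cases change label between runs, which a per-case metric exposes and an aggregate one does not.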