Personalised Classifier-Guided Decoding

Steering LLM Toxicity Along User-Specified Directions

Bachelor Thesis (2026)
Author(s)

M. Coroi (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

J. Yang – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

A. Arzberger – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

E. Liscio – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

C.E. Brandt – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2026
Language
English
Graduation Date
18-06-2026
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project, Training-Free Personalisation of Large Language Models Toward Situated Human Values
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Toxic content is not universally defined: what one user finds offensive, another may find acceptable, depending on cultural background, context, and purpose. Current LLM safety systems apply a single global toxicity threshold to every user, and adapting this behaviour after deployment is expensive. This paper asks whether a frozen LLM can instead be steered at inference time to follow individual users’ toxicity preferences across six toxicity dimensions, without retraining. A classifier-guided decoding framework driven by a per-user sensitivity vector is instantiated as three deployable strategies and evaluated on the PRISM preference dataset. All three strategies reduce per-user toxicity error by 15–21% while keeping general-knowledge accuracy within 0.7 percentage points of the unguided baseline. The central finding is directional steerability: the decoder responds to the shape of a user’s preference vector, producing category-specific toxicity reductions that align with the per-user weights (median cosine similarity 0.845, significantly above a permutation baseline, p = 0.0097). These results show that meaningful personalised toxicity control is achievable at deployment time, without retraining the model.
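The abstract does not spell out the decoding rule, so the following is only an illustrative sketch of what classifier-guided decoding with a per-user sensitivity vector could look like, not the thesis's actual method. It assumes a toxicity classifier has already scored each candidate token on each toxicity category with a value in [0, 1]. The function names (`guided_logits`, `softmax`, `cosine_similarity`), the linear penalty, and the `strength` parameter are all hypothetical choices made for illustration.

```python
import math


def guided_logits(logits, toxicity_scores, sensitivity, strength=5.0):
    """Penalise each candidate token by its sensitivity-weighted toxicity.

    ASSUMPTION: a simple linear penalty; the thesis may use a different rule.
    logits: base-model logits, one per candidate token.
    toxicity_scores: per candidate, a list of per-category scores in [0, 1].
    sensitivity: per-user weights, one per toxicity category.
    """
    adjusted = []
    for logit, scores in zip(logits, toxicity_scores):
        penalty = sum(w * s for w, s in zip(sensitivity, scores))
        adjusted.append(logit - strength * penalty)
    return adjusted


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical example: two candidate tokens, six toxicity categories.
# The user is sensitive only to the first category.
sensitivity = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
toxicity_scores = [
    [0.8, 0.0, 0.0, 0.0, 0.0, 0.0],  # candidate 0: toxic in category 0
    [0.0, 0.8, 0.0, 0.0, 0.0, 0.0],  # candidate 1: toxic in category 1
]
base_logits = [2.0, 2.0]

adjusted = guided_logits(base_logits, toxicity_scores, sensitivity)
probs = softmax(adjusted)
```

In this toy setting, only the candidate that is toxic along the dimension the user cares about is penalised, which is the kind of category-specific behaviour the abstract calls directional steerability. A helper such as `cosine_similarity` could then compare a vector of per-category toxicity reductions against the user's weights, under the assumption that both are expressed over the same six categories.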
