Personalised Classifier-Guided Decoding

Steering LLM Toxicity Along User-Specified Directions

Bachelor Thesis (2026)

Author(s)

M. Coroi (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

J. Yang – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

A. Arzberger – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

E. Liscio – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

C.E. Brandt – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty

Electrical Engineering, Mathematics and Computer Science

LLM, Toxicity, Human Values, Training-free, In-decoding, Inference-time alignment

To reference this document use

https://resolver.tudelft.nl/uuid:a8e877cd-5972-4199-a809-fdd15abdec2f

Publication Year

2026

Language

English

Graduation Date

18-06-2026

Awarding Institution

Delft University of Technology

Project

CSE3000 Research Project, Training-Free Personalisation of Large Language Models Toward Situated Human Values

Programme

Computer Science and Engineering

Downloads counter

28

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Toxic content is not universally defined: what one user finds offensive, another may find acceptable depending on cultural background, context, and purpose. Current LLM safety systems apply a single global toxicity threshold to every user, and adapting this behaviour after deployment is expensive. This paper asks whether a frozen LLM can instead be steered at inference time to follow individual users’ toxicity preferences across six toxicity dimensions, without retraining. A classifier-guided decoding framework driven by a per-user sensitivity vector is instantiated as three deployable strategies and evaluated on the PRISM preference dataset. All three strategies reduce per-user toxicity error by 15–21% while preserving general-knowledge accuracy to within 0.7 percentage points of the unguided baseline. The central finding is directional steerability: the decoder responds to the shape of a user’s preference vector, producing category-specific reductions that align with per-user weights (median cosine similarity 0.845, significantly above a permutation baseline, p = 0.0097). These results show that meaningful personalised toxicity control is achievable at deployment time.
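The directional-steerability measure described in the abstract can be illustrated with a minimal sketch. It assumes that one user is represented by a six-element weight vector and a six-element vector of per-category toxicity reductions, and that the permutation baseline shuffles the reductions across categories. The function names, the example numbers, and the exact permutation scheme are hypothetical and are not taken from the thesis.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def directional_alignment(weights, reductions):
    """Return (observed cosine, exact permutation p-value).

    Assumption: the null hypothesis is that reductions are unrelated to the
    category they fall in, so every reordering of `reductions` is equally
    likely. With six categories there are only 720 orderings, so they can
    be enumerated exactly instead of sampled.
    """
    observed = cosine(weights, reductions)
    orderings = list(permutations(reductions))
    at_least_as_aligned = sum(
        1 for shuffled in orderings if cosine(weights, shuffled) >= observed
    )
    return observed, at_least_as_aligned / len(orderings)


# Hypothetical user: six sensitivity weights and six observed reductions.
user_weights = [0.9, 0.1, 0.5, 0.0, 0.3, 0.2]
category_reductions = [0.30, 0.02, 0.15, 0.01, 0.10, 0.05]

observed, p_value = directional_alignment(user_weights, category_reductions)
print(f"cosine = {observed:.3f}, permutation p = {p_value:.4f}")
```

A high cosine with a small permutation p-value would indicate that the reductions follow the shape of the user's weights and not merely their overall magnitude. The abstract reports a median over users, which would correspond to repeating this computation per user and summarising the results.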

Files

Research_Paper_Miruna_Coroi.pd... (pdf)

(pdf | 0.964 Mb)

License info not available