Training-Free Personalisation of LLMs
Representation Engineering for Per-User Toxicity Steering on PRISM
R. Diaconescu (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Jie Yang – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Enrico Liscio – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Anne Arzberger – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)
C.E. Brandt – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
Large language models (LLMs) increasingly mediate value-laden interactions, yet mainstream alignment methods encode a single normative standard in the model weights and require expensive retraining to change it. This thesis investigates whether representation engineering, a family of methods that steer a frozen model by adding learned directions to its hidden states at inference time, can instead deliver training-free personalisation of toxicity moderation for individual users.
We use the Participatory, Representative and Individualised Human Feedback (PRISM) alignment dataset, in which users systematically disagree on what a good response is. From PRISM preference pairs we extract a population-level steering direction with Contrastive Activation Addition (CAA), and we personalise it by composing a small basis of safety directions with weights derived from each user’s dislike-weighted or revealed preferences.
On Llama-3.1-8B, population steering moves generations 24–31% closer to the preferred toxicity profile on the hardest prompt categories (p=0.007, paired Wilcoxon signed-rank test) and leaves zero-shot MMLU (Massive Multitask Language Understanding) accuracy unchanged within noise. The intervention is selective: per-record toxicity Mean Absolute Error (MAE) drops by 30–50% precisely where the unsteered model disagrees with human preferences, while already-correct behaviour is left intact. Steering too strongly, however, collapses generation fluency.
The per-user extension preserves fluency better than the population direction and reaches the highest preference-prediction accuracy of any arm, shifting model likelihoods toward each user’s preferred response on about 60% of records and reversing the population-MAE ordering, though the per-user margin falls within sampling error at N=197.
These results establish representation engineering as a viable, training-free mechanism for personalised LLM alignment on PRISM, bounded by a clear trade-off between alignment strength and generation quality.
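As a rough illustration of the mechanism the abstract describes, the following minimal Python sketch shows the three steps in their simplest form: a CAA-style direction taken as the mean activation difference between preferred and rejected responses, a per-user direction formed as a weighted sum of basis directions, and steering by adding a scaled direction to a hidden state. This is not the thesis implementation. The function names, the scaling parameter `alpha`, and the toy two-dimensional vectors are assumptions made for illustration; in practice the vectors would be hidden states from a chosen layer of the frozen model, and the abstract does not specify how the per-user weights are computed.

```python
from typing import List, Sequence

Vector = List[float]


def caa_direction(preferred: Sequence[Vector], rejected: Sequence[Vector]) -> Vector:
    """Mean of (preferred - rejected) activation differences over all pairs."""
    if not preferred or len(preferred) != len(rejected):
        raise ValueError("need a non-empty, equal number of preferred and rejected activations")
    n = len(preferred)
    dim = len(preferred[0])
    return [
        sum(p[i] - r[i] for p, r in zip(preferred, rejected)) / n
        for i in range(dim)
    ]


def compose_user_direction(basis: Sequence[Vector], weights: Sequence[float]) -> Vector:
    """Weighted sum of basis directions; the weights are assumed to be given per user."""
    if not basis or len(basis) != len(weights):
        raise ValueError("need one weight per basis direction")
    dim = len(basis[0])
    return [sum(w * b[i] for w, b in zip(weights, basis)) for i in range(dim)]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Add the direction, scaled by alpha, to a single hidden-state vector."""
    if len(hidden) != len(direction):
        raise ValueError("hidden state and direction must have the same dimension")
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy data standing in for model activations (hypothetical values).
preferred_acts = [[1.0, 0.0], [3.0, 2.0]]
rejected_acts = [[0.0, 0.0], [1.0, 0.0]]

population_direction = caa_direction(preferred_acts, rejected_acts)
user_direction = compose_user_direction([[1.0, 0.0], [0.0, 1.0]], [0.5, 2.0])
steered_hidden = steer([1.0, 1.0], population_direction, alpha=2.0)
```

In this sketch `alpha` plays the role of the steering strength: the abstract reports that steering too strongly collapses fluency, so in a real setting it would be tuned against a fluency measure rather than fixed in advance.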