Training-Free Personalisation of LLMs

Representation Engineering for Per-User Toxicity Steering on PRISM

Bachelor Thesis (2026)

Author(s)

R. Diaconescu (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Jie Yang – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Enrico Liscio – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Anne Arzberger – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

C.E. Brandt – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty

Electrical Engineering, Mathematics and Computer Science

Representation Inference Llm

To reference this document use

https://resolver.tudelft.nl/uuid:aa0d9291-7481-4c64-a721-271b039f87b5

More Info

Publication Year

2026

Language

English

Graduation Date

07-11-2026

Awarding Institution

Delft University of Technology

Project

CSE3000 Research Project

Programme

Computer Science and Engineering

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Large language models (LLMs) increasingly mediate value-laden interactions, yet mainstream alignment methods encode a single normative standard in the model weights and require expensive retraining to change it. This thesis investigates whether representation engineering, a family of methods that steer a frozen model by adding learned directions to its hidden states at inference time, can instead deliver training-free personalisation of toxicity moderation for individual users.

We use the Participatory, Representative and Individualised Human Feedback (PRISM) alignment dataset, in which users systematically disagree on what a good response is. From PRISM preference pairs we extract a population-level steering direction with Contrastive Activation Addition (CAA), and we personalise it by composing a small basis of safety directions with weights derived from each user’s dislike-weighted or revealed preferences.

On Llama-3.1-8B, population steering moves generations 24–31% closer to the preferred toxicity profile on the hardest prompt categories (p = 0.007, paired Wilcoxon signed-rank test), and leaves zero-shot MMLU (Massive Multitask Language Understanding) accuracy unchanged within noise. The intervention is selective: per-record toxicity Mean Absolute Error (MAE) drops by 30–50% precisely where the unsteered model disagrees with human preferences, while already-correct behaviour is left intact. Steering too strongly, however, collapses generation fluency.

The per-user extension preserves fluency better than the population direction and reaches the highest preference-prediction accuracy of any arm, shifting model likelihoods toward each user’s preferred response on about 60% of records and reversing the population-MAE ordering, though the per-user margin falls within sampling error at N = 197.

These results establish representation engineering as a viable, training-free mechanism for personalised LLM alignment on PRISM, bounded by a clear trade-off between alignment strength and generation quality.
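
As an illustration only, the mechanism described in the abstract can be sketched on toy vectors. The sketch assumes that a CAA direction is the mean difference between activations of preferred and rejected responses, that a per-user direction is a weighted sum of a small basis of directions, and that steering adds a scaled direction to a hidden state. All function names, the scaling parameter `alpha`, and the toy data are hypothetical and not taken from the thesis; a real setup would operate on the hidden states of a chosen layer of the frozen model rather than on Python lists.

```python
from statistics import fmean


def mean_vector(rows: list[list[float]]) -> list[float]:
    """Component-wise mean of a list of equal-length vectors."""
    return [fmean(column) for column in zip(*rows)]


def caa_direction(preferred: list[list[float]],
                  rejected: list[list[float]]) -> list[float]:
    """Assumed CAA form: mean(preferred activations) - mean(rejected activations)."""
    mean_pref = mean_vector(preferred)
    mean_rej = mean_vector(rejected)
    return [p - r for p, r in zip(mean_pref, mean_rej)]


def compose_user_direction(basis: list[list[float]],
                           weights: list[float]) -> list[float]:
    """Hypothetical per-user direction: weighted sum of basis directions."""
    dim = len(basis[0])
    return [sum(w * d[i] for w, d in zip(weights, basis)) for i in range(dim)]


def steer(hidden: list[float], direction: list[float], alpha: float) -> list[float]:
    """Add the direction, scaled by alpha, to one hidden-state vector."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy 2-dimensional "activations" standing in for real hidden states.
preferred_acts = [[1.0, 0.0], [3.0, 2.0]]
rejected_acts = [[0.0, 0.0], [2.0, 0.0]]

population_direction = caa_direction(preferred_acts, rejected_acts)
user_direction = compose_user_direction([[1.0, 0.0], [0.0, 1.0]], [0.5, 2.0])
steered_hidden = steer([0.0, 0.0], population_direction, alpha=2.0)
```

The strength `alpha` is the knob behind the trade-off the abstract reports: a larger value pushes the hidden state further along the direction, and the abstract notes that steering too strongly collapses generation fluency.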

Files

Main-2.pdf

(pdf | 0.343 Mb)

License info not available