STEER-Away

Personalized Safety Alignment via Logit Steering

Bachelor Thesis (2026)

Author(s)

A. Trache (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Jie Yang – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Enrico Liscio – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Carolin Brandt – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Anne Arzberger – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty

Electrical Engineering, Mathematics and Computer Science

LLM Personalization Toxicity In-decoding Training-free Logits

To reference this document use

https://resolver.tudelft.nl/uuid:74ab212c-4924-4d63-8aa3-800b054f0d15

More Info

Publication Year

2026

Language

English

Graduation Date

18-06-2026

Awarding Institution

Delft University of Technology

Project

CSE3000 Research Project, Training-Free Personalization of Large Language Models Toward Situated Human Values

Programme

Computer Science and Engineering

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Large Language Models are usually aligned toward broad preference averages, while users can differ in how they perceive toxic language. This paper studies whether training-free, in-decoding logit-difference steering can support personalized toxicity alignment without changing model weights. The key idea is to use two internal generation behaviours: an expert generation branch that represents careful, respectful language and an anti-expert generation branch that represents language patterns to avoid. The difference between the two branches' next-token scores is added to the base model’s next-token scores during generation, with the toxicity steering category chosen from an inferred user sensitivity profile. Profiles are derived from PRISM, a participatory preference dataset, and Perspective API toxicity scores. On Llama 3.1 8B, I evaluate two methods, Anti-Expert Contrastive Decoding (ACD) and Expert–Anti-Expert Differential Steering (EADS). The results suggest that EADS gives the more balanced trade-off: stronger steering reduces measured toxicity distance while preserving general Massive Multitask Language Understanding (MMLU) utility better than ACD. EADS shows a 12.65% mean reduction in measured toxicity distance and a reduction of less than 1% in both MMLU accuracy and generated-answer perplexity. The findings remain limited by the use of automatic toxicity scores as a proxy and by the coarse user-profile representation. These results indicate that training-free logit steering is a promising alternative for personalized toxicity alignment, but it should be validated with human evaluation in future work.
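The core operation described in the abstract, adding an expert minus anti-expert difference to the base model's next-token scores, can be illustrated with a minimal sketch. This is not the thesis implementation: the function names `steer_logits` and `softmax`, the scaling factor `alpha`, and the toy three-token vocabulary are all assumptions made for illustration, and the abstract does not state the exact formula or how steering strength is parameterized.

```python
import math


def steer_logits(base, expert, anti_expert, alpha=1.0):
    """Add a scaled (expert - anti_expert) difference to the base logits.

    `alpha` is a hypothetical steering-strength knob; alpha = 0 leaves the
    base scores unchanged.
    """
    if not (len(base) == len(expert) == len(anti_expert)):
        raise ValueError("all logit vectors must cover the same vocabulary")
    return [b + alpha * (e - a) for b, e, a in zip(base, expert, anti_expert)]


def softmax(logits):
    """Convert logits to probabilities (max-shifted for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]


# Toy next-token scores over a three-token vocabulary (made-up numbers).
base = [2.0, 1.0, 0.5]         # unsteered model
expert = [1.0, 2.0, 0.0]       # branch prompted toward respectful language
anti_expert = [3.0, 0.5, 0.0]  # branch prompted toward language to avoid

steered = steer_logits(base, expert, anti_expert, alpha=1.0)
probs = softmax(steered)
```

In this toy case the token favoured by the anti-expert branch (index 0) loses its lead and the token favoured by the expert branch (index 1) becomes the most likely next token. In a real decoding loop the same adjustment would be applied at every generation step, with the branch prompts selected from the user's sensitivity profile.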

Files

Finel_research_papaer_Trache_A... (pdf)

(pdf | 0.565 Mb)

License info not available