Personalized Pre-Decoding Alignment for Training-Free Toxicity Reduction

Comparing URIAL and PBPO-Lite on PRISM User Prompts Without Fine-Tuning

Bachelor Thesis (2026)

Author(s)

A. Florea (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

A. Arzberger – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

E. Liscio – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

J. Yang – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

C.E. Brandt – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty

Electrical Engineering, Mathematics and Computer Science

Keywords: LLM, Personalization, Toxicity, Prompt engineering, Training-free, Pre-decoding

To reference this document use

https://resolver.tudelft.nl/uuid:1c5eb43d-e5f7-4c0a-963c-7b6b543894ed

Publication Year

2026

Language

English

Graduation Date

18-06-2026

Awarding Institution

Delft University of Technology

Project

CSE3000 Research Project, Training-Free Personalization of Large Language Models Toward Situated Human Values

Programme

Computer Science and Engineering

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Large Language Models (LLMs) often rely on a single general safety standard, which is limiting because toxicity is subjective: what one user finds offensive, another may not. At the same time, fine-tuning a model for every user to personalize safety is expensive and impractical. To address this, my research studies pre-decoding interventions, which modify the user’s input prompt before the model generates a response. This offers a flexible, low-cost way to personalize alignment without changing the model’s weights. I evaluate two training-free approaches on the PRISM dataset using Qwen and Llama target models: a method inspired by URIAL (Untuned LLMs with Restyled In-context ALignment), which adds personalized safety examples to the prompt, and PBPO-Lite (Personalized Black-Box Prompt Optimization Lite), which uses a secondary model to rewrite the prompt based on a user’s toxicity profile. Both methods adapt to a user’s needs at inference time without permanent model changes. The results show that both interventions bring the outputs closer to the highest-rated PRISM answers, with URIAL achieving the strongest toxicity alignment: approximately 51% on Llama and 31% on Qwen. While the methods improve fluency compared with the base models, they can reduce performance on structured knowledge tasks. Overall, the findings suggest that personalized pre-decoding is a promising low-cost approach for toxicity alignment, provided that safety gains are balanced against possible losses in knowledge-task performance.
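
As an illustration only, the two pre-decoding interventions described in the abstract could be sketched roughly as follows. This is not the thesis implementation: the `UserProfile` fields, the function names, the `# Query:` / `# Answer:` delimiters, and the wording of the rewrite instruction are all assumptions made for this example. The secondary model is represented by a plain callable, so no particular model API is implied.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class UserProfile:
    """Hypothetical per-user toxicity profile (not the schema used in the thesis)."""
    user_id: str
    # Content categories this user considers toxic (assumed representation).
    avoid: List[str] = field(default_factory=list)
    # (query, answer) pairs the user rated highly, used as in-context examples.
    examples: List[Tuple[str, str]] = field(default_factory=list)


def urial_style_prompt(profile: UserProfile, user_prompt: str) -> str:
    """URIAL-inspired: prepend personalized example exchanges to the prompt."""
    parts = ["Below are example conversations that match this user's preferences."]
    for query, answer in profile.examples:
        parts.append(f"# Query:\n{query}\n\n# Answer:\n{answer}")
    # The target model continues the text after the final, empty answer slot.
    parts.append(f"# Query:\n{user_prompt}\n\n# Answer:\n")
    return "\n\n".join(parts)


def pbpo_lite_style_prompt(
    profile: UserProfile,
    user_prompt: str,
    rewriter: Callable[[str], str],
) -> str:
    """PBPO-Lite-style: ask a secondary model to rewrite the prompt for this user."""
    instruction = (
        "Rewrite the prompt below so that the answer avoids: "
        + ", ".join(profile.avoid)
        + ". Keep the user's intent unchanged.\n\nPrompt: "
        + user_prompt
    )
    # `rewriter` stands in for the secondary model call (assumed interface).
    return rewriter(instruction)


def echo_rewriter(instruction: str) -> str:
    """Stub rewriter for testing: returns the original prompt unchanged."""
    return instruction.split("Prompt: ", 1)[1]
```

In both cases the target model's weights are untouched; only the text it receives changes. In practice `echo_rewriter` would be replaced by a call to whichever secondary model is available, and the returned string would be passed to the target model as its prompt.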

Files

Alina_Final_Paper_3_1_.pdf

(pdf | 0.628 Mb)

License info not available