Value-Aware Post-Decoding Reranking for Training-Free Personalisation of LLM Outputs to User-Specific Toxicity Standards

Bachelor Thesis (2026)

Author(s)

I. Slanina (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

A. Arzberger – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

E. Liscio – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

J. Yang – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty

Electrical Engineering, Mathematics and Computer Science

Inference LLM Toxicity

To reference this document use

https://resolver.tudelft.nl/uuid:2d0d3471-135a-4003-8610-430c75bca795

More Info

Publication Year

2026

Language

English

Graduation Date

18-06-2026

Awarding Institution

Delft University of Technology

Project

CSE3000 Research Project

Programme

Computer Science and Engineering

Downloads counter

35

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

People differ in what they consider toxic, yet centralised alignment of large language models (LLMs) imposes a single global standard that cannot accommodate this disagreement. We propose a training-free post-decoding approach: for each prompt we generate N candidates from a fixed, pre-trained LLM and re-rank them against a per-participant toxicity profile built from PRISM ratings. Post-decoding fits the problem because it decouples generation from scoring, so the same candidate pool can be re-ranked under different profiles to separate the effect of the profile from the effect of the candidate pool, something earlier inference-time interventions cannot do. We compare four scoring modules on four matched seeds: two LLM-as-a-Judge rerankers (GPT, Claude) and two Detoxify-based geometric matchers (weighted L1, Ledoit–Wolf Mahalanobis), scored by toxicity-vector distance to each participant’s preferred PRISM response. All four reduce per-record error by 23–28% and tie at the top. The selection is genuinely personalised rather than the same generic shift toward safer text for every user: reductions concentrate on each participant’s most sensitive Perspective dimensions, the toxicity types they most consistently rated down (p < 10⁻³ under a profile-shuffle null on every module), and replacing the per-user weighting with uniform weights significantly worsens fit on both geometric matchers (Wilcoxon p < 10⁻³). Because the effect is per-user, it surfaces on a per-user-sensitive measure (a boundary-violation rate, p < 10⁻³) rather than on aggregate mean error, which averages the per-user differences away. The next step is therefore per-user-sensitive evaluation, not retraining.
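
To illustrate the weighted-L1 matcher described in the abstract, the sketch below re-ranks a small pool of candidates by their weighted L1 distance to a target toxicity vector. It is a minimal illustration, not the thesis implementation: the function names `weighted_l1` and `rerank`, the three unnamed toxicity dimensions, and all scores and weights are hypothetical. In the thesis the vectors come from Detoxify and the profiles from PRISM ratings, neither of which is called here.

```python
from typing import List, Sequence


def weighted_l1(candidate: Sequence[float],
                target: Sequence[float],
                weights: Sequence[float]) -> float:
    """Weighted L1 distance between two toxicity vectors."""
    if not (len(candidate) == len(target) == len(weights)):
        raise ValueError("vectors must share the same length")
    return sum(w * abs(c - t) for c, t, w in zip(candidate, target, weights))


def rerank(candidates: Sequence[Sequence[float]],
           target: Sequence[float],
           weights: Sequence[float]) -> List[int]:
    """Return candidate indices ordered from closest to farthest match."""
    return sorted(range(len(candidates)),
                  key=lambda i: weighted_l1(candidates[i], target, weights))


# Hypothetical toxicity vectors for three candidates (one row per candidate,
# one column per toxicity dimension). These are made-up numbers.
candidates = [
    [0.35, 0.05, 0.25],  # clean on dimension 1, noisier elsewhere
    [0.05, 0.35, 0.05],  # clean overall, but high on dimension 1
    [0.45, 0.45, 0.45],  # high everywhere
]
target = [0.05, 0.05, 0.05]      # stand-in for the preferred response's vector

personal = [0.1, 0.8, 0.1]       # hypothetical user most sensitive to dimension 1
uniform = [1 / 3, 1 / 3, 1 / 3]  # every dimension weighted equally

personal_order = rerank(candidates, target, personal)
uniform_order = rerank(candidates, target, uniform)
print("per-user weights:", personal_order)
print("uniform weights: ", uniform_order)
```

With these made-up numbers the per-user weights pick the first candidate, while uniform weights pick the second. The pool is identical in both runs, so only the profile changes the selection, which is the decoupling the abstract relies on.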

Files

Iulia_Slanina_Research_Paper.p... (pdf)

(pdf | 0.456 Mb)

License info not available