IS

I. Slanina

Bachelor thesis (2026) - I. Slanina, A. Arzberger, E. Liscio, J. Yang
People differ in what they consider toxic, yet centralised alignment of large language models (LLMs) imposes a single global standard that cannot accommodate this disagreement. We propose a training-free post-decoding approach: for each prompt we generate N candidates from a fixed, pre-trained LLM and re-rank them against a per-participant toxicity profile built from PRISM ratings. Post-decoding fits the problem because it decouples generation from scoring, so the same candidate pool can be re-ranked under different profiles to separate the effect of the profile from the effect of the candidate pool, something earlier inference-time interventions cannot do. We compare four scoring modules on four matched seeds: two LLM-as-a-Judge rerankers (GPT, Claude) and two Detoxify-based geometric matchers (weighted L1, Ledoit–Wolf Mahalanobis), scored by toxicity-vector distance to each participant’s preferred PRISM response. All four reduce per-record error by 23–28% and tie at the top. The selection is genuinely personalised rather than the same generic shift toward safer text for every user: reductions concentrate on each participant’s most sensitive Perspective dimensions, the toxicity types they most consistently rated down (p < 10⁻³ under a profile-shuffle null on every module), and replacing the per-user weighting with uniform weights significantly worsens fit on both geometric matchers (Wilcoxon p < 10⁻³). Because the effect is per-user, it surfaces on a per-user-sensitive measure (a boundary-violation rate, p < 10⁻³) rather than on aggregate mean error, which averages the per-user differences away. The next step is therefore per-user-sensitive evaluation, not retraining. ...
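
The re-ranking idea in the abstract can be illustrated with a minimal sketch of a weighted L1 matcher. This is not the thesis implementation: the function names, the three toxicity dimensions, the all-zero target vector, and every number below are hypothetical, and the profile is assumed to be a target toxicity vector plus non-negative per-dimension weights. In practice the candidate vectors would come from a toxicity classifier; here they are hard-coded so the example stays self-contained.

```python
from typing import List, Sequence, Tuple

Candidate = Tuple[str, Sequence[float]]  # (text, toxicity vector)


def weighted_l1(a: Sequence[float], b: Sequence[float],
                weights: Sequence[float]) -> float:
    """Weighted L1 distance between two toxicity vectors."""
    return sum(w * abs(x - y) for x, y, w in zip(a, b, weights))


def rerank(candidates: List[Candidate],
           profile_target: Sequence[float],
           profile_weights: Sequence[float]) -> List[Candidate]:
    """Sort a fixed candidate pool by distance to one user's profile."""
    return sorted(
        candidates,
        key=lambda c: weighted_l1(c[1], profile_target, profile_weights),
    )


# Hypothetical pool with made-up scores on (insult, profanity, threat).
pool: List[Candidate] = [
    ("candidate A", [0.50, 0.10, 0.00]),
    ("candidate B", [0.10, 0.60, 0.00]),
    ("candidate C", [0.30, 0.30, 0.00]),
]

# Assumed profiles: same target, different per-dimension sensitivities.
target = [0.0, 0.0, 0.0]
insult_sensitive = [3.0, 1.0, 1.0]
profanity_sensitive = [1.0, 3.0, 1.0]

# The same pool is re-ranked under two profiles without regenerating text.
best_for_insult = rerank(pool, target, insult_sensitive)[0][0]
best_for_profanity = rerank(pool, target, profanity_sensitive)[0][0]
print(best_for_insult, best_for_profanity)
```

Because generation and scoring are decoupled, the two users receive different top candidates from an identical pool, which is the property the abstract relies on to separate the effect of the profile from the effect of the candidate pool.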