A. Arzberger | TU Delft Repository

STEER-Away

Personalized Safety Alignment via Logit Steering

Bachelor thesis (2026) - A. Trache, Jie Yang, Enrico Liscio, Carolin Brandt, Anne Arzberger

Large Language Models are usually aligned toward broad preference averages, while users can differ in how they perceive toxic language. This paper studies whether training-free, in-decoding logit-difference steering can support such personalized toxicity alignment without changing model weights. The key idea is to use two internal generation behaviours: an expert generation branch that represents careful, respectful language and an anti-expert generation branch that represents language patterns to avoid. The resulting logit difference is added to the base model’s next-token scores during generation, with the toxicity steering category chosen from an inferred user sensitivity profile. Profiles are derived from PRISM, a participatory preference dataset, and Perspective API toxicity scores. On Llama 3.1 8B, I evaluate two methods, Anti-Expert Contrastive Decoding (ACD) and Expert–Anti-Expert Differential Steering (EADS). The results suggest that EADS gives the more balanced trade-off: stronger steering reduces measured toxicity distance while preserving general Massive Multitask Language Understanding (MMLU) utility better than ACD. EADS shows a 12.65% mean reduction in measured toxicity distance and a reduction of less than 1% in both MMLU accuracy and generated-answer perplexity. The findings remain limited by the use of automatic toxicity scores as a proxy and by the coarse user-profile representation. These results show that training-free logit steering is a favorable alternative for personalized toxicity alignment, but it should be validated with human evaluation in future work. ...
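The steering step described in this abstract can be illustrated with a minimal sketch. It assumes the simplest additive form, in which the expert minus anti-expert logit difference is scaled and added to the base logits. The function names, the strength parameter `alpha`, and the toy three-token vocabulary are hypothetical, and the thesis's exact ACD and EADS formulations may differ.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def steer_logits(base, expert, anti_expert, alpha):
    """Add the scaled expert/anti-expert logit difference to the base logits.

    alpha is a hypothetical steering strength; alpha = 0 leaves the
    base distribution unchanged.
    """
    return [b + alpha * (e - a) for b, e, a in zip(base, expert, anti_expert)]


# Toy next-token logits over a three-token vocabulary (illustrative values).
vocab = ["kind", "neutral", "rude"]
base = [1.0, 1.0, 1.0]
expert = [2.0, 1.0, 0.0]       # branch prompted toward respectful language
anti_expert = [0.0, 1.0, 2.0]  # branch prompted toward language to avoid

steered = steer_logits(base, expert, anti_expert, alpha=0.5)
probs = softmax(steered)
for token, p in zip(vocab, probs):
    print(f"{token}: {p:.3f}")
```

Because only the logits are modified at each decoding step, the model weights stay untouched, which is what makes the approach training-free.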

Personalised Classifier-Guided Decoding

Steering LLM Toxicity Along User-Specified Directions

Bachelor thesis (2026) - M. Coroi, J. Yang, A. Arzberger, E. Liscio, C.E. Brandt

Toxic content is not universally defined: what one user finds offensive, another may find acceptable depending on cultural background, context, and purpose. Current LLM safety systems apply a single global toxicity threshold to every user, and adapting this behaviour after deployment is expensive. This paper asks whether a frozen LLM can instead be steered at inference time to follow individual users’ toxicity preferences across six toxicity dimensions, without retraining. A classifier-guided decoding framework driven by a per-user sensitivity vector is instantiated as three deployable strategies and evaluated on the PRISM preference dataset. All three strategies reduce per-user toxicity error by 15–21%, while preserving general-knowledge accuracy to within 0.7 pp of the unguided baseline. The central finding is directional steerability: the decoder responds to the shape of a user’s preference vector, producing category-specific reductions that align with per-user weights (median cosine similarity 0.845, p = 0.0097 above a permutation baseline). These results show that meaningful personalised toxicity control is achievable at deployment time, without retraining the model. ...
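The directional-steerability measure mentioned above compares the shape of a user's preference vector with the shape of the per-category toxicity reductions. The sketch below shows how such a cosine similarity could be computed for one user. The six-dimensional vectors are invented for illustration and do not come from the thesis.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical per-user sensitivity weights over six toxicity dimensions.
weights = [0.9, 0.1, 0.0, 0.5, 0.2, 0.0]

# Hypothetical mean toxicity scores per dimension, without and with guidance.
unguided = [0.30, 0.10, 0.05, 0.20, 0.10, 0.02]
guided = [0.10, 0.09, 0.04, 0.12, 0.08, 0.02]

# Category-specific reductions achieved by the guided decoder.
reductions = [u - g for u, g in zip(unguided, guided)]

alignment = cosine_similarity(weights, reductions)
print(f"cosine similarity between weights and reductions: {alignment:.3f}")
```

A value near 1 indicates that reductions concentrate on the dimensions the user weights most heavily; the abstract reports the median of this quantity across users.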

Value-Aware Post-Decoding Reranking for Training-Free Personalisation of LLM Outputs to User-Specific Toxicity Standards

Bachelor thesis (2026) - I. Slanina, A. Arzberger, E. Liscio, J. Yang

People differ in what they consider toxic, yet centralised alignment of large language models (LLMs) imposes a single global standard that cannot accommodate this disagreement. We propose a training-free post-decoding approach: for each prompt we generate N candidates from a fixed, pre-trained LLM and re-rank them against a per-participant toxicity profile built from PRISM ratings. Post-decoding fits the problem because it decouples generation from scoring, so the same candidate pool can be re-ranked under different profiles to separate the effect of the profile from the effect of the candidate pool, something earlier inference-time interventions cannot do. We compare four scoring modules on four matched seeds: two LLM-as-a-Judge rerankers (GPT, Claude) and two Detoxify-based geometric matchers (weighted L1, Ledoit–Wolf Mahalanobis), scored by toxicity-vector distance to each participant’s preferred PRISM response. All four reduce per-record error by 23–28% and tie at the top. The selection is genuinely personalised rather than the same generic shift toward safer text for every user: reductions concentrate on each participant’s most sensitive Perspective dimensions, the toxicity types they most consistently rated down (p < 10⁻³ under a profile-shuffle null on every module), and replacing the per-user weighting with uniform weights significantly worsens fit on both geometric matchers (Wilcoxon p < 10⁻³). Because the effect is per-user, it surfaces on a per-user-sensitive measure (a boundary-violation rate, p < 10⁻³) rather than on aggregate mean error, which averages the per-user differences away. The next step is therefore per-user-sensitive evaluation, not retraining. ...
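The decoupling of generation from scoring can be sketched with a weighted L1 matcher: one fixed candidate pool is re-ranked under two different profiles. The candidate texts, the two-dimensional toxicity vectors, the target vector, and the profile weights are all hypothetical, and the thesis's actual scoring modules are more involved.

```python
def weighted_l1(vector, target, weights):
    """Weighted L1 distance between a candidate's toxicity vector and a target."""
    return sum(w * abs(x - t) for x, t, w in zip(vector, target, weights))


def rerank(candidates, target, weights):
    """Return candidates sorted from best to worst fit for the given profile."""
    return sorted(
        candidates,
        key=lambda c: weighted_l1(c["toxicity"], target, weights),
    )


# One fixed pool of N = 3 candidates with illustrative toxicity vectors.
pool = [
    {"text": "candidate A", "toxicity": [0.05, 0.40]},
    {"text": "candidate B", "toxicity": [0.30, 0.05]},
    {"text": "candidate C", "toxicity": [0.20, 0.20]},
]

target = [0.0, 0.0]        # hypothetical preferred-response toxicity vector
profile_1 = [1.0, 0.1]     # user most sensitive to the first dimension
profile_2 = [0.1, 1.0]     # user most sensitive to the second dimension

best_1 = rerank(pool, target, profile_1)[0]["text"]
best_2 = rerank(pool, target, profile_2)[0]["text"]
print(best_1, best_2)
```

Since the pool is generated once and only the weights change, any difference between the two selections is attributable to the profile rather than to sampling.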

Personalized Pre-Decoding Alignment for Training-Free Toxicity Reduction

Comparing URIAL and PBPO-Lite on PRISM User Prompts Without Fine-Tuning

Bachelor thesis (2026) - A. Florea, A. Arzberger, E. Liscio, J. Yang, C.E. Brandt

Large Language Models (LLMs) often rely on one general safety standard, but this is limited because toxicity is subjective: what one user finds offensive, another user may not. At the same time, creating personalized safety by fine-tuning a model for every user is expensive and impractical. To address this, my research studies pre-decoding interventions, which modify the user’s input prompt before the model generates a response. This offers a flexible and low-cost way to personalize alignment without changing the model’s weights. I evaluate two training-free approaches on the PRISM dataset using Qwen and Llama target models: an Untuned LLMs with Restyled In-context ALignment (URIAL)-inspired method, which adds personalized safety examples to the prompt, and a Personalized Black-Box Prompt Optimization Lite (PBPO-Lite) method, which uses a secondary model to rewrite the prompt based on a user’s toxicity profile. These methods are useful because they can adapt to a user’s needs at inference time without permanent model changes. The results show that both interventions bring the outputs closer to the highest-rated PRISM answers, with URIAL achieving the strongest toxicity alignment: approximately 51% on Llama and 31% on Qwen. While the methods improve fluency compared with the base models, they can reduce performance on structured knowledge tasks. Overall, the findings suggest that personalized pre-decoding is a promising low-cost approach for toxicity alignment, provided that safety gains are balanced against possible losses in knowledge-task performance. ...
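A pre-decoding intervention of the in-context kind can be sketched as plain prompt assembly: personalized examples are placed before the user's query, and the target model is called unchanged. The template layout, the function name, and the example content below are hypothetical and are not taken from the thesis or from the original URIAL prompt.

```python
def build_personalized_prompt(user_prompt, examples, profile_note):
    """Prepend a profile note and in-context examples to the user's prompt.

    examples is a list of (question, answer) pairs chosen to reflect the
    user's toxicity preferences; the layout is an illustrative assumption.
    """
    parts = ["# Instruction", profile_note, ""]
    for i, (question, answer) in enumerate(examples, start=1):
        parts += [f"# Example {i}", f"User: {question}", f"Assistant: {answer}", ""]
    parts += ["# Query", f"User: {user_prompt}", "Assistant:"]
    return "\n".join(parts)


examples = [
    ("Tell me a joke about my coworker.",
     "Here is a light-hearted joke that avoids insults: ..."),
]
prompt = build_personalized_prompt(
    user_prompt="Write a short roast for my friend's birthday.",
    examples=examples,
    profile_note="This user prefers responses without insults or profanity.",
)
print(prompt)
```

The assembled string is what gets sent to the frozen model, so personalization happens entirely on the input side.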

Unheard and Misunderstood

Reinforcing Hermeneutical Justice in Annotation Design for ADHD Voices

Bachelor thesis (2025) - A. Yotkov, J. Yang, A. Arzberger, M.L. Tielman

The main way large language models (LLMs) learn to represent and interpret various experiences is through the process of supervised fine-tuning (SFT). However, current practices are not designed to be inclusive for people with ADHD, which leads to generative hermeneutical ignorance due to misrepresentation. Several ADHD characteristics clash with modern annotation task structures, so those voices remain underrepresented. We performed a literature-driven gap analysis, derived five design requirements and evaluation criteria, and built an annotation interface that embodied those requirements. Subsequently, a mixed-methods user study with seven self-identified ADHD participants was conducted to measure behavioral metrics and collect post-task reflections. The results indicated that three of five design criteria were met, which is promising. However, the average mislabeling rate remained quite high, meaning that accuracy is still an open issue. Finally, our study demonstrated that small design adjustments accommodate a more diverse annotator pool; thus, we offer a framework that can be used to reinforce hermeneutical epistemic justice in annotation practices. ...

Incorporating User Feedback into Post-Training LLM Improvement to Promote Hermeneutical Justice

An interface to amplify marginalized voices

Bachelor thesis (2025) - A. Turgut, A. Arzberger, J. Yang, M.L. Tielman

Generative AI can contribute to the misunderstanding or erasure of marginalized groups due to insufficient nuanced data on their lived experiences. This limits the shared understanding of their perspectives and contributes to a phenomenon called hermeneutical epistemic injustice. This study seeks to reduce this injustice by enabling real-life users from these groups to provide feedback that corrects the behavior of the model. However, victims of hermeneutical injustice struggle with articulating themselves, and current practices lack sufficient support for user expression. To overcome these challenges, we designed an interface that enables users to give feedback on the accuracy of the model, supported by a data processing workflow to ensure feasibility and scalability. We conducted a user study with 8 individuals with ADHD to evaluate whether the interface facilitates the extraction of accurate data, and found that it enables users to provide more concrete and precise feedback than existing methods, as it includes more guidance and control for the user. ...

Unheard and Misunderstood: Addressing Injustice in LLMs

How are hermeneutical injustices encoded in Reinforcement Learning from Human Feedback (RLHF) in the context of LLMs?

Bachelor thesis (2025) - I. Mockaitytė, A. Arzberger, J. Yang, M.L. Tielman

This study investigates how hermeneutical injustices can become encoded in the Reinforcement Learning from Human Feedback (RLHF) processes used to fine-tune large language models (LLMs). While current research on LLMs has focused on bias and fairness, there remains a significant gap concerning subtler harms such as hermeneutical injustice. Using adults diagnosed with ADHD as a case study, this research explores how their unique communication and cognitive patterns may be misrepresented or excluded from the RLHF pipeline. The research adopts a qualitative literature review methodology, focusing specifically on real-world RLHF implementations by AI companies. The RLHF pipeline was divided into stages of human feedback collection, reward modeling, and policy optimization. Then, these stages of the RLHF pipeline were analyzed through the lens of hermeneutical injustice using interpretive desiderata: representation, flexibility, and authenticity. The findings highlight several conceptual risks. Limited annotator diversity and restrictive feedback formats may exclude neurodivergent voices. Reward models can unintentionally suppress atypical expressions, while policy optimization strategies, especially those prone to mode collapse, can erase some communication styles. Overall, the study shows that without deliberate attention to epistemic inclusion, RLHF processes may perpetuate hermeneutical injustices and undermine the epistemic fairness of LLMs. ...

Prompt Engineering for Hermeneutical Justice in LLMs

An Empirical Study on ADHD-Related Causal Reasoning

Bachelor thesis (2025) - S. Sankara Subramanian Lakshmi, J. Yang, A. Arzberger, M.L. Tielman

Large Language Models are increasingly integrated into everyday applications, but their responses often reflect dominant cultural narratives, which can lead to misrepresentation of marginalized communities. This paper addresses the underexplored issue of hermeneutical epistemic injustice (HEI) in LLM outputs, particularly how these systems fail to accurately represent the lived experiences of people with ADHD when answering causal questions, and whether different prompting techniques can influence and improve the justice reflected in their responses. We introduce a practical framework for measuring HEI based on four proxies: intelligibility, conceptual fit, recognition of structural barriers, and expression style. Through a within-subjects user study with seven adults with ADHD, we evaluated three prompting strategies: Vanilla (baseline), Step-Back, and Human Persona + System 2. Our findings show that Human Persona + System 2 prompting stood out for its empathetic tone, balanced perspectives, and non-judgmental framing, thereby improving fairness across multiple HEI dimensions. Surprisingly, Vanilla prompts performed comparably well overall, while Step-Back responses offered clear practical information and contextual relevance, but were limited by an impassive, matter-of-fact tone. These results suggest that prompt design can meaningfully affect how well LLMs represent marginalized experiences. We conclude that advancing epistemic justice in generative AI requires thoughtful prompt design and may benefit from deeper engagement with affected communities to more accurately and respectfully represent their realities. ...

Unheard and Misunderstood

Tracing Hermeneutical Injustice in ADHD Narratives Generated by Large Language Models

Bachelor thesis (2025) - D. Zhang, J. Yang, A. Arzberger, M.L. Tielman

This study investigates how large language models (LLMs) narrate ADHD-related experiences and whether their narrative forms give rise to hermeneutical injustice. Rather than comparing experience itself, this study analyzes how experiences are narrated. Using a hybrid coding strategy based on Reflexive Thematic Analysis, it compares LLM-generated outputs with first-person narratives from ADHD communities. The analysis identifies several recurring misnarration patterns: Truncated Subjectivity, One-Way Definition, Illocutionary Disablement, and Skewed Style Replacement. Each of these patterns constrains the interpretive space for expressing ADHD experience. Sub-themes are developed to further reveal injustice embedded in LLMs. These patterns are linked to both the training data and the optimization process. In addition, the underlying mechanism of LLMs lacks the différance structure that characterizes human narration. ...