A. Arzberger
9 records found
STEER-Away
Personalized Safety Alignment via Logit Steering
Personalised Classifier-Guided Decoding
Steering LLM Toxicity Along User-Specified Directions
toxicity types they most consistently rated down (p < 10^-3 under a profile-shuffle null on every module), and replacing the per-user weighting with uniform weights significantly worsens fit on both geometric matchers (Wilcoxon p < 10^-3). Because the effect is per-user, it surfaces on a per-user-sensitive measure (a boundary-violation rate, p < 10^-3) rather than on aggregate mean error, which averages the per-user differences away. The next step is therefore per-user-sensitive evaluation, not retraining.
Personalized Pre-Decoding Alignment for Training-Free Toxicity Reduction
Comparing URIAL and PBPO-Lite on PRISM User Prompts Without Fine-Tuning
Unheard and Misunderstood
Reinforcing Hermeneutical Justice in Annotation Design for ADHD Voices
Incorporating User Feedback into Post-Training LLM Improvement to Promote Hermeneutical Justice
An interface to amplify marginalized voices
Unheard and Misunderstood: Addressing Injustice in LLMs
How are hermeneutical injustices encoded in Reinforcement Learning from Human Feedback (RLHF) in the context of LLMs?
Prompt Engineering for Hermeneutical Justice in LLMs
An Empirical Study on ADHD-Related Causal Reasoning
Unheard and Misunderstood
Tracing Hermeneutical Injustice in ADHD Narratives Generated by Large Language Models