E. Liscio

26 records found

AI agents optimized purely for task completion often inadvertently violate human moral expectations. This is especially pronounced in reinforcement learning agents operating in text-based environments, where the richness of language and the vast action space allow for numerous harmful yet reward-maximizing paths. Existing approaches to moral alignment commonly represent morality as a single scalar signal, limiting both interpretability and the ability to model diverse moral preferences. We introduce MFT Patronus, a policy-shaping framework based on Moral Foundations Theory that represents morality across five dimensions: Care, Fairness, Loyalty, Authority, and Sanctity. This multidimensional representation enables configurable moral profiles that capture different foundational priorities. We evaluate our approach on the Jiminy Cricket benchmark and show that it maintains task performance while substantially reducing immoral behavior compared to the standard baselines and maintaining a net positive balance of moral over immoral actions. Our results further demonstrate that different moral profiles produce distinct behavioral patterns, suggesting that multidimensional moral representations are a promising direction for interpretable and configurable value alignment in reinforcement learning agents. ...
Mental health problems among adolescents continue to rise, with growing interest in identifying early signs of psychological distress, including cognitive distortions (CDs). CDs are negatively biased patterns of thinking associated with conditions such as depression and anxiety. As CDs are primarily expressed through language, Natural Language Processing (NLP) methods have increasingly been explored for their automatic detection and classification, potentially enabling earlier intervention. This work provides a dataset of Dutch adolescent forum posts from the Kindertelefoon, annotated by professionals with a background in Cognitive Behavioral Therapy (CBT). The dataset is utilized for sentence-level CD detection and classification, exploring the role of local context (the surrounding post) and longitudinal context (previous posts from the same user) on model performance. Experiments compare zero-shot prompting methods with finetuned transformer-based models, using different approaches to incorporate context. Results show that incorporating context does not consistently improve CD detection or classification performance in naturalistic adolescent forum posts. ...
Bachelor thesis (2026) - I. Slanina, A. Arzberger, E. Liscio, J. Yang
People differ in what they consider toxic, yet centralised alignment of large language models (LLMs) imposes a single global standard that cannot accommodate this disagreement. We propose a training-free post-decoding approach: for each prompt we generate N candidates from a fixed, pre-trained LLM and re-rank them against a per-participant toxicity profile built from PRISM ratings. Post-decoding fits the problem because it decouples generation from scoring, so the same candidate pool can be re-ranked under different profiles to separate the effect of the profile from the effect of the candidate pool, something earlier inference-time interventions cannot do. We compare four scoring modules on four matched seeds: two LLM-as-a-Judge rerankers (GPT, Claude) and two Detoxify-based geometric matchers (weighted L1, Ledoit–Wolf Mahalanobis), scored by toxicity-vector distance to each participant’s preferred PRISM response. All four reduce per-record error by 23–28% and tie at the top. The selection is genuinely personalised rather than the same generic shift toward safer text for every user: reductions concentrate on each participant’s most sensitive Perspective dimensions, the toxicity types they most consistently rated down (p < 10^-3 under a profile-shuffle null on every module), and replacing the per-user weighting with uniform weights significantly worsens fit on both geometric matchers (Wilcoxon p < 10^-3). Because the effect is per-user, it surfaces on a per-user-sensitive measure (a boundary-violation rate, p < 10^-3) rather than on aggregate mean error, which averages the per-user differences away. The next step is therefore per-user-sensitive evaluation, not retraining. ...
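
To make the post-decoding idea concrete, the following is a minimal sketch of re-ranking one candidate pool under two user profiles with a weighted L1 distance. It is an illustration under stated assumptions, not the thesis implementation: the three toxicity dimensions, the score values, the zero target vector, and the function names `weighted_l1` and `rerank` are all hypothetical.

```python
from typing import List, Sequence


def weighted_l1(a: Sequence[float], b: Sequence[float], w: Sequence[float]) -> float:
    """Weighted L1 distance between two toxicity vectors."""
    return sum(wi * abs(ai - bi) for ai, bi, wi in zip(a, b, w))


def rerank(candidates: List[Sequence[float]],
           target: Sequence[float],
           weights: Sequence[float]) -> List[int]:
    """Return candidate indices ordered from closest to farthest from the target."""
    return sorted(range(len(candidates)),
                  key=lambda i: weighted_l1(candidates[i], target, weights))


# Hypothetical toxicity vectors for three candidates generated once for one prompt.
# Assumed dimension order: (toxicity, insult, profanity).
candidates = [
    [0.10, 0.40, 0.05],
    [0.10, 0.05, 0.40],
    [0.30, 0.30, 0.30],
]
target = [0.0, 0.0, 0.0]             # assumed preferred toxicity vector
insult_sensitive = [1.0, 3.0, 1.0]   # hypothetical per-user weights
profanity_sensitive = [1.0, 1.0, 3.0]

# The same pool is re-ranked under each profile; nothing is regenerated.
order_a = rerank(candidates, target, insult_sensitive)
order_b = rerank(candidates, target, profanity_sensitive)
print(order_a[0], order_b[0])
```

Because generation and scoring are decoupled, the two users receive different top candidates from an identical pool, which is the property the abstract relies on to separate the effect of the profile from that of the pool.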

Personalized Safety Alignment via Logit Steering

Large Language Models are usually aligned toward broad preference averages, while users can differ in how they perceive toxic language. This paper studies whether training-free, in-decoding logit-difference steering can support such personalized toxicity alignment without changing model weights. The key idea is to use two internal generation behaviours: an expert generation branch that represents careful, respectful language and an anti-expert generation branch that represents language patterns to avoid. The resulting difference is added to the base model’s next-token scores during generation, with the toxicity steering category chosen from an inferred user sensitivity profile. Profiles are derived from PRISM, a participatory preference dataset, and Perspective API toxicity scores. On Llama 3.1 8B, I evaluate two methods, Anti-Expert Contrastive Decoding (ACD) and Expert–Anti-Expert Differential Steering (EADS). The results suggest that EADS gives the more balanced trade-off, showing that stronger steering reduces measured toxicity distance while preserving general MMLU utility better than ACD. EADS shows a 12.65% mean reduction in measured toxicity distance, and a reduction of less than 1% in both Massive Multitask Language Understanding (MMLU) accuracy and generated answer perplexity. The findings remain limited by the use of automatic toxicity scores as a proxy and by the coarse user-profile representation. These results show that training-free logit steering is a favorable alternative for personalized toxicity alignment, but it should be validated with human evaluation in future work. ...
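
The core operation described above, adding an expert minus anti-expert difference to the base model's next-token scores, can be sketched on toy logits. This is a simplified illustration only: the steering strength `alpha`, the three-token vocabulary, and the function names are assumptions, and the exact ACD and EADS formulations are not given in the abstract.

```python
from typing import List


def steer_logits(base: List[float],
                 expert: List[float],
                 anti_expert: List[float],
                 alpha: float) -> List[float]:
    """Add alpha * (expert - anti_expert) to the base next-token logits."""
    return [b + alpha * (e - a) for b, e, a in zip(base, expert, anti_expert)]


def argmax(values: List[float]) -> int:
    """Index of the largest value (greedy next-token choice)."""
    return max(range(len(values)), key=lambda i: values[i])


# Toy next-token logits over a hypothetical three-token vocabulary.
base = [2.0, 1.0, 0.5]
expert = [1.0, 2.0, 0.5]       # branch prompted toward respectful language
anti_expert = [3.0, 0.5, 0.5]  # branch prompted toward language to avoid

unsteered = steer_logits(base, expert, anti_expert, alpha=0.0)
steered = steer_logits(base, expert, anti_expert, alpha=1.0)
print(argmax(unsteered), argmax(steered))
```

With `alpha=0.0` the base distribution is unchanged; raising `alpha` moves probability mass away from tokens the anti-expert branch favours, without touching any model weights.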

Comparing URIAL and PBPO-Lite on PRISM User Prompts Without Fine-Tuning

Bachelor thesis (2026) - A. Florea, A. Arzberger, E. Liscio, J. Yang, C.E. Brandt
Large Language Models (LLMs) often rely on one general safety standard, but this is limited because toxicity is subjective: what one user finds offensive, another user may not. At the same time, creating personalized safety by fine-tuning a model for every user is expensive and impractical. To address this, my research studies pre-decoding interventions, which means modifying the user’s input prompt before the model generates a response. This offers a flexible and low-cost way to personalize alignment without changing the model’s weights. I evaluate two training-free approaches on the PRISM dataset using Qwen and Llama target models: an Untuned LLMs with Restyled In-context ALignment (URIAL)-inspired method, which adds personalized safety examples to the prompt, and a Personalized Black-Box Prompt Optimization Lite (PBPO-Lite) method, which uses a secondary model to rewrite the prompt based on a user’s toxicity profile. These methods are useful because they can adapt to a user’s needs at inference time without permanent model changes. The results show that both interventions bring the outputs closer to the highest rated PRISM answers, with URIAL achieving the strongest toxicity alignment: approximately 51% on Llama and 31% on Qwen. While the methods improve fluency compared with the base models, they can reduce performance on structured knowledge tasks. Overall, the findings suggest that personalized pre-decoding is a promising low-cost approach for toxicity alignment, provided that safety gains are balanced against possible losses in knowledge-task performance. ...

Steering LLM Toxicity Along User-Specified Directions

Bachelor thesis (2026) - M. Coroi, J. Yang, A. Arzberger, E. Liscio, C.E. Brandt
Toxic content is not universally defined: what one user finds offensive, another may find acceptable depending on cultural background, context, and purpose. Current LLM safety systems apply a single global toxicity threshold to every user, and adapting this behaviour after deployment is expensive. This paper asks whether a frozen LLM can instead be steered at inference time to follow individual users’ toxicity preferences across six toxicity dimensions, without retraining. A classifier-guided decoding framework driven by a per-user sensitivity vector is instantiated as three deployable strategies and evaluated on the PRISM preference dataset. All three strategies reduce per-user toxicity error by 15–21%, while preserving general-knowledge accuracy to within 0.7 pp of the unguided baseline. The central finding is directional steerability: the decoder responds to the shape of a user’s preference vector, producing category-specific reductions that align with per-user weights (median cosine similarity 0.845, p = 0.0097 above a permutation baseline). These results show that meaningful personalised toxicity control is achievable at deployment time, without retraining the model. ...
Suicide is a leading cause of death, yet predicting it remains a significant challenge. Risk factors such as depression or substance use are commonly used for prediction, but their predictive performance is often only slightly better than chance. Additionally, many cases go undetected due to a lack of contact with mental health services. Social media, however, offers a unique opportunity, as people often share their thoughts and struggles online in real time. In this work, we propose a novel task and method to approach it: predicting suicidal ideation and behavior (SIB) before a user ever expresses it on an online forum. This predictive framing, where no self-disclosure is used as input at any stage, remains largely unexplored in the suicide prediction literature. Our model, Early-SIB, achieves a balanced accuracy of 0.73 for predicting future SIB on a Dutch youth forum, demonstrating that such tools can offer a meaningful addition to traditional methods. ...
Rising mental health issues among adolescents have increased interest in automated approaches for detecting early signs of psychological distress in digital text. One important focus is the identification of cognitive distortions – irrational thought patterns – because of their role in aggravating mental distress, and early detection may enable timely, low-cost interventions. While prior work has focused on English data, we present a first in-depth study of cross-lingual and cross-register generalization for cognitive distortion detection, using forum posts written by Dutch adolescents. We frame the task at two levels: (1) detecting whether a post contains a cognitive distortion, and (2) identifying the specific text span that expresses it. Our findings show that domain adaptation methods perform best for post-level detection, while a simpler technique – sentence embeddings with a classifier – outperforms more complex models for span identification. Results show that predicting cognitive distortions in text is challenging, and highlight how changes in language and writing style can significantly impact performance. ...

Transferable & Parameter Efficient LLM Fine Tuning

With the increasing popularity of Large Language Models (LLMs), fine-tuning them has become increasingly computationally expensive. Parameter Efficient Fine-Tuning (PEFT) methods like LoRA and Adapters, introduced by Microsoft and Google, respectively, aim to reduce the number of trainable parameters, with the current state-of-the-art combining both methods as LoRA Adapters. This paper introduces Transformer Modules as a PEFT method. These modules utilize Modular Transformer Blocks (MTBs) inserted into a frozen pre-trained model, achieving competitive performance while significantly reducing computation costs. Compared to the current state-of-the-art using GPT-2, BERT, and T5, Transformer Modules further reduced compute time by 39.7% and training memory by 72.7%, with a performance cost of 4.5±2.51% on the GLUE benchmark. Additionally, the paper presents the Transformer Bridge, a continuous vector transformer designed to transfer Transformer Modules across different models. This could enable cross-model fine-tuning, allowing model-agnostic modules, such as an ethics or medical module, to be used across various LLMs without retraining or access to the original dataset. Although the current implementation of the Transformer Bridge did not fully succeed in mapping embedding spaces, analysis of the results suggests that further refinements using traditional model distillation techniques could lead to success in future iterations. ...
Public deliberations play a crucial role in democratic systems. However, the unstructured nature of deliberations makes it challenging for moderators to analyze the large volume of data produced. This paper aims to address this challenge by leveraging Large Language Models (LLMs) to automatically identify the subjective topics behind public discourse. The study is structured around two core objectives: Identifying Gold Labels and Exploring Subjective Human Labels. The results highlight that fine-tuning the LLaMa-2 model with QLoRA outperforms other methods for Identifying Gold Labels, while the Few-Shot Chain of Thoughts method, enhanced with EmotionPrompt, is particularly effective in capturing subjective variations in human annotations. However, the study also underscores significant limitations, such as the dependency on large, high-quality annotated datasets and the tendency of models to produce hallucinations. These findings highlight the potential of LLMs to identify subjective topics behind public discourse, while also emphasizing the need for further research to address these challenges. ...
To tackle topics such as climate change together with the population, public discourse should be scaled up. This discourse should be mediated, as mediation makes it more likely that people understand each other and change their point of view. Emotion detection can greatly help the mediator with this task. Positive emotions can improve communication, while negative emotions cause people to be irrational and irritated. However, since emotions are highly subjective, both prediction and evaluation become more difficult.

Still, Large Language Models (LLMs) could be used to detect these subjective emotions using different prompting strategies and labels. The experiment included zero-shot, one-shot, few-shot, and Chain of Thought (CoT) strategies. Precision was better for the one-shot and few-shot methods than for zero-shot. The CoT methods also showed an increase in precision, but a decrease in recall. The different labels were hard majority labels, soft labels, and hard per-annotator labels. In conclusion, providing examples improved the performance of the LLM. The CoT strategies were more precise, but gave a worse general prediction. The hard majority labels allow for more general predictions, whereas per-annotator hard labels capture the perspective of different annotators. Soft labels reflect the subjective nature of the labels by providing probabilities instead of binary classification.

The experiment was done on a small data sample, so it is recommended to try the strategies on a larger data sample. Looking into appropriate evaluations for subjective predictions is also recommended in order to reflect the actual performance better. ...
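
The three label types discussed above can be derived from the same set of annotations. The sketch below is a hypothetical illustration: the annotator IDs, the emotion names, and the helper functions are assumptions and are not taken from the thesis.

```python
from collections import Counter
from typing import Dict, List


def hard_majority(labels: List[str]) -> str:
    """Single label chosen by most annotators (ties go to the first label seen)."""
    return Counter(labels).most_common(1)[0][0]


def soft_label(labels: List[str]) -> Dict[str, float]:
    """Distribution over labels: the fraction of annotators choosing each one."""
    total = len(labels)
    return {label: count / total for label, count in Counter(labels).items()}


# Hypothetical annotations of one statement by five annotators.
per_annotator = {
    "ann1": "joy",
    "ann2": "joy",
    "ann3": "anger",
    "ann4": "joy",
    "ann5": "neutral",
}

labels = list(per_annotator.values())
majority = hard_majority(labels)   # one hard label for the statement
soft = soft_label(labels)          # probabilities instead of one class
print(majority, soft, per_annotator["ann3"])
```

The hard majority label discards the two dissenting annotators, the soft label keeps the disagreement as probabilities, and the per-annotator dictionary preserves each individual perspective as its own hard label.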

Comparing Prompting Strategies Across Hard, Soft, and Subjective Label Scenarios

This study evaluates the performance of different sentiment analysis methods in the context of public deliberation, focusing on hard-, soft-, and subjective-label scenarios to answer the research question: "can a Large Language Model detect subjective sentiment of statements within the context of public deliberation?". If the answer to this question is affirmative, that is a strong indicator that, with the help of longitudinal studies, sentiment analysis with large language models (LLMs) may be implemented to scale public deliberations by providing support for moderators in such discussions. To answer this question, four prompting methods were tested: zero-shot, few-shot, chain-of-thought (CoT) zero-shot, and CoT few-shot, using a Frisian dataset of 50 statements annotated by 5 annotators. The findings indicate that the CoT few-shot method significantly outperforms other methods in all scenarios, that soft labels outperform their hard equivalent, that the underlying data must be balanced for high-performing models, and that capturing the perspective of a specific annotator requires further research. Our study suggests that LLMs may perform best under the supervision of, or in collaboration with, a human, due to the multifaceted nature of sentiment. ...
This paper investigates the use of Large Language Models (LLMs) for automatic detection of subjective values in argument statements in public discourse. Understanding the underlying values of argument statements could enhance public discussions and potentially lead to better outcomes. The LLM utilization methods tested were zero- and few-shot prompting, as well as chain-of-thought prompts. In order to compare the predictions made by the LLM, a set of ground truth labels was required as an established baseline. For these labels, either single majority labels or multi-value labels were considered, both derived from a set of aggregated human annotations. Results indicated that LLM performance was suboptimal, achieving a maximum weighted F1 score of 0.594 for single-value chain-of-thought predictions. Additionally, current metrics were found inadequate for assessing LLM performance on a highly subjective task such as value detection, evidenced by poor scores in multi-value predictions despite subjective evaluation suggesting otherwise. Furthermore, a last experiment was aimed at capturing a specific annotator’s subjectivity. This yielded inconsistent results, with F1 scores peaking around 0.4, indicating that LLMs are not well-suited for emulating individual human subjectivity. ...
This study investigates the effectiveness of Large Language Models (LLMs) in identifying and classifying subjective arguments within deliberative discourse. Using data from a Participatory Value Evaluation (PVE) conducted in the Netherlands, this research introduces an annotation strategy for identifying arguments and extracting their premises. Then, the Llama 2 model is used to test three different prompting approaches: zero-shot, one-shot and few-shot. The performance is evaluated using the cosine similarity metric and later enhanced by introducing chain-of-thought prompting. The results show that zero-shot prompting unexpectedly outperforms one-shot and few-shot prompting, due to the LLM overfitting to the examples provided. Chain-of-thought prompting is shown to improve the argument identification task. The subjectivity of the annotation task is reflected by the low averaged pairwise F1 score between annotators, and the considerable variance in the number of data items marked by each annotator as not being arguments. The subjectivity of the task is further highlighted by a pairwise chain-of-thought prompting analysis, which shows that annotators with more similar annotations received more similar LLM responses. ...
Moral values influence humans in decision-making. Pluralist moral philosophers argue that human morality can be represented by a finite number of moral values, respecting the differences in moral views. Recent advancements in NLP show that language models retain a discernible level of knowledge in deontological ethics and moral norms of society. However, a model which can only decide either right or wrong cannot fully understand the diverse moral perspectives of humans.

We propose a moral sentence embedding space, which can encompass moral differences, through the state-of-the-art Contrastive Learning framework. We evaluate the moral embedding space both intrinsically and extrinsically via three tasks: classification, moral similarity, and visual analysis. We show that our moral embedding space understands the characteristics of each moral value. Our results also highlight that moral rhetoric is seldom explicit in the text, emphasizing the necessity of additional information such as moral labels. ...

A pluralist approach in generating and processing morally-aligned text

When making decisions, people are automatically guided by their moral compass. However, AI agents need to be conditioned in order to be steered towards moral behaviour. An environment that can be used to train and test agents is the Jiminy Cricket environment. The Jiminy Cricket environment consists of a set of text-based narrative games, where every possible action is annotated with the morality of that action. However, to create a more morally nuanced agent, we have annotated all of the actions according to the following moral values: care/harm, fairness/cheating, loyalty/betrayal, authority/subversion, and purity/degradation. To morally condition the agent, we calculate the predicted progress of a potential action and combine it with an oracle to retrieve the moral annotation of the potential action. Using both of these components, a score is calculated for each generated action, and the eventual action is chosen based on that score. The score can be calculated differently based on the weights assigned to the overall progress and morality, as well as based on the sub-weights assigned to each moral value. Using this environment, we pose the question: if we focus on only one moral value, what is the optimal configuration that maximizes both progress and morality? From the results we can observe that the lowest relative immorality can be achieved by imposing no moral constraints on the agent. Imposing constraints on the agent decreases the completion percentage relatively more than it decreases immorality. One-hot encoding the moral values reveals which immoral actions are needed to progress in the game, and which immoral actions should be prevented to lower immorality. ...
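
The action scoring described above can be sketched as a weighted combination of predicted progress and a per-value immorality penalty. The exact formula is not given in the abstract, so the linear form below, the weight names, and the example actions and values are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

MORAL_VALUES = ["care", "fairness", "loyalty", "authority", "purity"]


def action_score(progress: float,
                 immorality: Dict[str, float],
                 w_progress: float,
                 w_moral: float,
                 sub_weights: Dict[str, float]) -> float:
    """Assumed linear score: reward progress, penalise weighted immorality."""
    penalty = sum(sub_weights[v] * immorality.get(v, 0.0) for v in MORAL_VALUES)
    return w_progress * progress - w_moral * penalty


def choose(actions: List[Tuple[str, float, Dict[str, float]]],
           w_progress: float,
           w_moral: float,
           sub_weights: Dict[str, float]) -> str:
    """Pick the action name with the highest score."""
    return max(actions,
               key=lambda a: action_score(a[1], a[2], w_progress, w_moral, sub_weights))[0]


# Hypothetical candidates: (name, predicted progress, oracle immorality annotation).
actions = [
    ("steal key", 0.9, {"fairness": 1.0}),
    ("ask guard", 0.6, {}),
]

uniform = {v: 1.0 for v in MORAL_VALUES}
care_only = {v: (1.0 if v == "care" else 0.0) for v in MORAL_VALUES}  # one-hot

print(choose(actions, 1.0, 0.0, uniform))    # progress only
print(choose(actions, 0.5, 0.5, uniform))    # all values penalised
print(choose(actions, 0.5, 0.5, care_only))  # only care violations penalised
```

A one-hot sub-weight vector, as in `care_only`, ignores violations of every other value, which is why the fairness violation goes unpenalised in the last call.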

What is the optimal weight w to win the games while playing morally?

In our everyday life, people interact more and more with agents. However, these agents often lack a moral sense and prioritize the accomplishment of the given task. As a consequence, agents may unknowingly act immorally. Little research has been done to endow agents with human morality and an internal sense of right and wrong. As of today, agents have a primitive representation of morality, often reduced to a single value. In contrast, humans have multiple reasons to judge an action as moral. In the hope of creating agents imbued with a more complex, human-like morality, we build upon the Jiminy Cricket environment. This preexisting environment has multiple games with diverse scenarios and the objective is to do the most moral action to maximize the reward ...

Evaluating the tradeoff for artificial agents playing text-based games

Morality is a fundamental concept that guides humans in the decision-making process. Given the rise of large language models in society, it is necessary to ensure that they adhere to human principles, among which morality is of substantial importance. While research has been done regarding artificial agents behaving morally, current state-of-the-art implementations consider morality to be linear, thus failing to capture its complexity and nuances. To account for this, a multidimensional representation of morality is proposed, each dimension corresponding to a different moral foundation. Then, the performance of three types of artificial agents tasked with choosing actions while playing text-based games is compared and analysed. One type of agent is implemented to only choose the most moral action, without aiming to win the games, another one prioritizes moral actions over game progression, and another strives to win the games while also playing morally. The latter outperforms the others in terms of game progression, while also taking few immoral actions. However, the agent prioritizing morality over progression performs only slightly worse while taking no immoral actions, proving that artificial agents can perform well while also behaving morally. ...

How does explainable models perform compared to black-box models

This paper evaluates the performance of an automated explainable model, MoralStrength, to predict morality, or more precisely Moral Foundations Theory (MFT) traits. MFT is a way to represent and divide morality into precise and detailed traits. This evaluation happens in the Jiminy Cricket environment, an environment composed of 25 text-based games. This evaluation helps us estimate the domain adaptation of MoralStrength, and also its limitations. The explainability of this model helps understand those limitations. We can conclude that MoralStrength is performing overall worse than other optimal models and that the domain adaptation to the Jiminy Cricket domain has some crucial flaws, but it leads us to think about the explainability/accuracy trade-off and where to draw the line, knowing that explainable models are important for ethical decision-making. ...
Large Language Models are becoming more and more prevalent in today's society. However, these models act without a sense of morality; they only prioritize accomplishing their goal. Currently, little research has been done evaluating these models. The current state-of-the-art Reinforcement Learning models represent morality by a single scalar value determining the morality of a statement. This way of representing morality is inaccurate, as there are multiple features determining how moral a statement is. We leverage knowledge from the Moral Foundations Theory to represent morality in a more accurate way, by using a 5-dimensional vector representing morality features. We implement several different agents in an environment where decisions with possible moral implications need to be made. These agents all use alternative approaches in deciding which action to take. Two agents follow pure policies: one always picks the most moral action and the other always picks the most immoral action. Two other agents follow the same policies but still give some weight to game progression. Lastly, we look at an amoral agent which does not look at morality at all. We compare these agents by percent completion of the Infocom game Suspect. We find that the agent which does not take morality into account achieves the highest completion rate. Agents which give morality a very large weight almost instantly get stuck in an infinite loop without progression. ...