
L. Cavalcante Siebert

40 records found

Automatic theme-based playlist generation systems often fail to replicate the quality of expert human curation. While Reinforcement Learning (RL) offers a framework for this sequential task, its effectiveness is limited by the challenge of designing reward functions that capture ...
This thesis investigates whether large language models (LLMs) produce consistent and neutral outputs when the same prompts are given in English and Arabic. It begins by reviewing technological, philosophical, psychological, and linguistic factors that can influence the behavior o ...

Reducing uninteresting anomalies

Designing a framework that retrains anomaly detection to no longer highlight non-relevant cases

Anomaly detection is a cornerstone of data analysis, aimed at identifying patterns that deviate from expected behaviour. However, conventional anomaly detection methods often fail to differentiate between actionable anomalies and those that, while statistically anomalous, are irr ...

Uncovering Sequential Social Dilemmas in Multi-Agent Reinforcement Learning

Challenges and Strategies for Local Energy Communities

This thesis investigates the occurrence and mitigation of Sequential Social Dilemmas (SSDs) in Local Energy Communities (LECs) managed through Multi-agent Reinforcement Learning (MARL). LECs have great potential as pivotal elements in the green energy transition, yet the inherent ...
Designing and implementing effective systems for thermal comfort management in buildings is a complex task due to the need to account for subjective preference parameters influenced by human physiology, biases, and tendencies. This research introduces a novel approach to simulating ...
Reinforcement Learning from Human Feedback (RLHF) is a promising approach to training agents to perform complex tasks by incorporating human feedback. However, the quality and diversity of this feedback can significantly impact the learning process. Humans are highly diverse in t ...
To tackle topics such as climate change together with the population, public discourse should be scaled up. Mediating this discourse makes it more likely that people understand each other and change their point of view. To help the mediator with this task, ...
This paper investigates the use of Large Language Models (LLMs) for automatic detection of subjective values in argument statements in public discourse. Understanding the underlying values of argument statements could enhance public discussions and potentially lead to better outc ...
This study investigates the effectiveness of Large Language Models (LLMs) in identifying and classifying subjective arguments within deliberative discourse. Using data from a Participatory Value Evaluation (PVE) conducted in the Netherlands, this research introduces an annotation ...
Reinforcement Learning is a powerful tool for problems that require sequential decision-making. However, it often faces challenges due to the extensive need for reward engineering. Reinforcement Learning from Human Feedback (RLHF) and Inverse Reinforcement Learning (IRL) hold the ...
Reinforcement Learning from Human Feedback (RLHF) offers a powerful approach to training agents in environments where defining an explicit reward function is challenging by learning from human feedback provided in various forms. This research evaluates three common feedback types ...
The main concept behind reinforcement learning is that an agent takes actions and is rewarded or punished for them. However, the rewards involved in performing a task in real life can be quite complicated, and the contribution of different facto ...
Public deliberations play a crucial role in democratic systems. However, the unstructured nature of deliberations leads to challenges for moderators to analyze the large volume of data produced. This paper aims to solve this challenge by automatically identifying subjective topic ...

Decoding Sentiment with Large Language Models

Comparing Prompting Strategies Across Hard, Soft, and Subjective Label Scenarios

This study evaluates the performance of different sentiment analysis methods in the context of public deliberation, focusing on hard-, soft-, and subjective-label scenarios to answer the research question: "can a Large Language Model detect subjective sentiment of statements wit ...

Conflict in the World of Inverse Reinforcement Learning

Investigating Inverse Reinforcement Learning with Conflicting Demonstrations

Inverse Reinforcement Learning (IRL) algorithms are closely related to Reinforcement Learning (RL) but instead try to model the reward function from a given set of expert demonstrations. In IRL, many algorithms have been proposed, but most assume consistent demonstrations. Consis ...

Detecting Long-term Behavioral Adaptations in Assisted Driving

An Automated Approach Using Neural Networks and Novelty Detection

The autonomous vehicle industry has the potential to revolutionize the future of driving, making the understanding of vehicle-driver interactions crucial as we progress towards fully autonomous systems. Advanced Driver Assistance Systems (ADAS) are integral in this evolution, bri ...
This project explores adaptation to preference shifts in Multi-objective Reinforcement Learning (MORL), with a focus on how Reinforcement Learning (RL) agents can align with the preferences of multiple experts. This alignment can occur across various scenarios featuring distinct ...
Incentive-based demand response (iDR) programs serve as important tools for distributed system operators (DSOs) to achieve a reduction in electricity demand during periods of grid overload. During these programs, participants can decide to curtail their consumption in exchange for ...

Inverse Reinforcement Learning (IRL) in Presence of Risk and Uncertainty Related Cognitive Biases

To what extent can IRL learn rewards from expert demonstrations with loss and risk aversion?

A key issue in Reinforcement Learning (RL) research is the difficulty of defining rewards. Inverse Reinforcement Learning (IRL) is a technique that addresses this challenge by learning the rewards from expert demonstrations. In a realistic setting, expert demonstrations are colle ...

What are the implications of a Curriculum Learning strategy for IRL methods?

Investigating Inverse Reinforcement Learning from Human Behavior

Inverse Reinforcement Learning (IRL) is a subfield of Reinforcement Learning (RL) that focuses on recovering the reward function using expert demonstrations. In the field of IRL, Adversarial IRL (AIRL) is a promising algorithm that is postulated to recover non-linear rewards in e ...