Zoltán Szlávik | TU Delft Repository

Adding the expert touch: Formulating Expert-Driven Reward Functions for RL-Based Playlist Generation

Master thesis (2025) - S. Balaram, Luciano Cavalcante Siebert, M. Mansoury, Antonio Mone, Zoltán Szlávik, Ralvi Isufaj

Automatic theme-based playlist generation systems often fail to replicate the quality of expert human curation. While Reinforcement Learning (RL) offers a framework for this sequential task, its effectiveness is limited by the challenge of designing reward functions that capture the knowledge of professional curators. This thesis introduces and evaluates a methodology to bridge this gap by using Large Language Models (LLMs) to translate curatorial principles, gathered from expert interviews, into dense reward function code. The main aim of this research is to determine if LLMs can effectively interpret the complex strategies of professional curators and, in turn, guide an RL agent to produce playlists that adhere to expert standards.

To investigate this, we interviewed music experts and then used LLMs to create reward functions in two ways: one from a concise summary of the interviews and another from the complete raw transcripts. These reward functions were used to train an RL agent for playlist generation. The agents’ performance was then evaluated for recommendation accuracy and alignment with the experts’ curatorial style, and compared against two baselines: a similarity-based model and an RL agent with a hand-crafted reward function.

The results showed that the effect of adding the interview summarization step on the models’ recommendation accuracy depended on the LLM: the GPT-based model showed a significant increase in accuracy, while the Gemini-based model’s performance remained consistent across both inputs. Furthermore, qualitative analysis of the generated reward functions revealed that the summarized transcripts produced high-level reward factors that were consistent across all LLMs, whereas the raw transcripts produced more varied and granular reward factors. Additionally, the choice of LLM affected both the final reward structure and the agent’s subsequent performance. In the cold-start scenario, RL agents guided by LLM-generated rewards significantly outperformed both baselines: the manually tuned RL baseline and the non-RL similarity-based model. However, in seeded playlist continuation tasks, this performance hierarchy changed, with the simpler similarity-based model achieving higher recommendation accuracy.
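
The abstract describes LLM-generated "dense" reward functions built from curatorial reward factors, but it does not show their code. The sketch below only illustrates what a dense reward means in this setting: a weighted sum of factor scores that is returned at every step (each added track) rather than once per finished playlist. The `Track` fields, the two factors (`theme_match`, `artist_repeat_penalty`) and the weights are all hypothetical stand-ins, not the factors the thesis found.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Track:
    # Hypothetical track record; the thesis does not specify its features.
    track_id: str
    artist: str
    theme_score: float  # assumed fit to the playlist theme, in [0, 1]


# A reward factor scores a candidate track given the playlist built so far.
RewardFactor = Callable[[List[Track], Track], float]


def theme_match(playlist: List[Track], track: Track) -> float:
    # Hypothetical factor: reward tracks that fit the theme.
    return track.theme_score


def artist_repeat_penalty(playlist: List[Track], track: Track) -> float:
    # Hypothetical factor: penalize an artist who already appears in the playlist.
    return -1.0 if any(t.artist == track.artist for t in playlist) else 0.0


def step_reward(playlist: List[Track], track: Track,
                factors: Dict[str, RewardFactor], weights: Dict[str, float]) -> float:
    # Weighted sum of all factors for a single action (adding one track).
    return sum(weights[name] * factor(playlist, track) for name, factor in factors.items())


def episode_rewards(tracks: List[Track], factors: Dict[str, RewardFactor],
                    weights: Dict[str, float]) -> List[float]:
    # Dense reward: one reward per added track, not a single end-of-episode score.
    playlist: List[Track] = []
    rewards: List[float] = []
    for track in tracks:
        rewards.append(step_reward(playlist, track, factors, weights))
        playlist.append(track)
    return rewards


# Hypothetical factor set and weights, standing in for LLM-generated ones.
FACTORS: Dict[str, RewardFactor] = {"theme": theme_match, "repeat_artist": artist_repeat_penalty}
WEIGHTS: Dict[str, float] = {"theme": 1.0, "repeat_artist": 0.5}
```

For example, `episode_rewards([Track("t1", "A", 0.9), Track("t2", "A", 0.8), Track("t3", "B", 0.4)], FACTORS, WEIGHTS)` yields rewards of approximately 0.9, 0.3 and 0.4. The second track is penalized for repeating artist "A". Because each step gets its own signal, an RL agent can credit individual track choices instead of only the final playlist.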

Enabling Targeted Music Exploration with Interactive Recommendations

Master thesis (2024) - A.M. Nonnemaker, C.C.S. Liem, R. Isufaj, Zoltán Szlávik, C.A. Raman

Recommender systems are widely used to help users navigate vast content catalogs, but they often limit users to suggestions that closely match their existing preferences, creating "filter bubbles" that discourage exploration. We focus on solving this problem in the context of music recommendations, helping users discover and develop new musical tastes. We embed a knowledge graph containing expert-curated metadata, user interaction data, and audio similarity features into a representation space where similar songs are mapped close together. This enables the system to gradually guide users from their current preferences toward a new genre through personalized recommendations. Additionally, we apply a Bayesian active learning approach to iteratively update user preference models based on feedback, balancing exploration and exploitation to ensure user satisfaction while gathering information about the user's new preferences. We conducted a user study to evaluate the approach, demonstrating that a gradual, interactive approach outperforms directly introducing users to a new genre, increasing user engagement and their affinity toward the target genre. This research highlights the value of gradual, user-driven exploration in creating better music discovery experiences. Based on our findings, we provide recommendations for industry stakeholders and discuss opportunities for future research on targeted exploration in music recommendation.
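
The idea of gradually guiding a user through an embedding space toward a target genre can be illustrated with a deliberately simplified sketch. It assumes toy 2-D song embeddings and swaps in plain linear interpolation between the centroid of the user's liked songs and the centroid of the target genre. At each waypoint it recommends the nearest song that has not been recommended yet. The function names and data are hypothetical, and neither the thesis's knowledge-graph embedding nor its Bayesian active learning step is reproduced here.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def centroid(vectors: List[Vector]) -> List[float]:
    # Mean of a set of embedding vectors of equal dimension.
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def interpolate(a: Vector, b: Vector, t: float) -> List[float]:
    # Point a fraction t of the way from a to b (t = 0 gives a, t = 1 gives b).
    return [(1 - t) * x + t * y for x, y in zip(a, b)]


def gradual_path(song_emb: Dict[str, Vector], liked: List[str],
                 target_genre: List[str], steps: int) -> List[str]:
    # Hypothetical helper: one recommendation per waypoint on the straight line
    # from the user's current taste to the target genre.
    start = centroid([song_emb[s] for s in liked])
    goal = centroid([song_emb[s] for s in target_genre])
    recommended: List[str] = []
    for k in range(1, steps + 1):
        waypoint = interpolate(start, goal, k / steps)
        # Only songs the user has not liked and that were not yet recommended.
        candidates = [s for s in song_emb if s not in liked and s not in recommended]
        best = min(candidates, key=lambda s: math.dist(song_emb[s], waypoint))
        recommended.append(best)
    return recommended
```

With a user who likes a song at the origin and a target genre clustered further along one axis, the sketch first returns "bridge" songs lying between the two regions before reaching the target genre itself. In the thesis, user feedback additionally updates the preference model between steps. This sketch omits that and assumes `steps` does not exceed the number of candidate songs.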

Improving Search Relevance Feedback through Human Centered Design

Master thesis (2020) - S. Gu, A. Bozzon, J.D. Lomas, Zoltán Szlávik

Artificial intelligence (AI) is expected to play a transformational role in health and wellbeing. Search (i.e. information retrieval) technologies already play a significant role in healthcare research and practice. Relevance feedback in Search is vital for system evaluation and improvement. However, in contexts with a small user base, implicit signals from user behavior may not yield valid relevance judgments. Engaging users to provide such feedback explicitly is therefore essential for improving search performance (i.e. effectiveness). Yet previous research has found that users are generally reluctant to provide explicit feedback in digital environments, and in some experiments their willingness decreased over time. In collaboration with myTomorrows, an Amsterdam-based pharma-tech company, this Master thesis addresses this challenge in the specific context of myTomorrows’ AI-powered treatment Search, which urgently needs to engage healthcare professionals (HCPs) in providing relevance feedback on search results (e.g. Clinical Trials and Expanded Access Programs) for system evaluation and improvement. Through Human Centered Design methods such as interviews, observations, and speed dates, the project produced a future myTomorrows Search design enhanced with three relevance feedback collection concepts. These concepts were tested and evaluated by nine HCPs from three countries (the Netherlands, China, and Brazil). The user study results indicate that using utility as the motivator in relevance feedback collection appeals to HCPs more than motivators such as altruism or enjoyment. Moreover, the best moment for user engagement was identified as the point after users finish examining one piece of information and before they start on the next. Additionally, this study generalized the project process and user study insights into a four-stage guide for designing explicit feedback collection in text-based Search.
Although it remains unvalidated, this guide could apply to other contexts with a small user base, guiding or inspiring user researchers and designers to design explicit user feedback collection in Search.

Learn representations in the presence of segmentation label noises

Master thesis (2017) - Jihong Ju, Jan van Gemert, Marco Loog, Zoltan Szlávik, Alan Hanjalic

Training data for segmentation tasks are often available only on a small scale. Transferring learned representations from pre-trained classification models is therefore a widely adopted practice for convolutional neural networks in semantic segmentation. In domains where the representations from classification models are not directly applicable, we propose to train representations with segmentation datasets that may contain label errors. Our experiments demonstrate that label errors, such as mislabeled segments and missing segmentations, negatively affect the learned representations. To alleviate the negative effects of object mislabeling, we propose to discard the object labels and instead train foreground/background segmentation. Representations learned with binary segmentation achieve fine-tuning performance comparable to representations learned with "gold" standard segmentations. When segmentations are missing, we propose a sigmoid loss for the background class that achieves high recall while keeping precision better than simply weighting the classes. This class-dependent sigmoid loss obtains better segmentation performance, as well as better representations, than class weighting in the presence of missing segmentations. In summary, we propose to learn representations with foreground/background segmentation, and with a sigmoid loss for the background class when segmentations for some objects are missing.
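
Discarding object labels in favor of foreground/background segmentation amounts to collapsing every object class in a label map into a single "foreground" value. As a result, a segment annotated with the wrong object class still receives the correct binary target. A minimal sketch follows, assuming label id 0 marks background and 255 marks "void" pixels to be ignored, as in the PASCAL VOC convention. Both ids are assumptions, and the thesis's sigmoid background loss is not shown.

```python
from typing import List

BACKGROUND = 0  # assumed label id for background
IGNORE = 255    # assumed "void" label id (PASCAL VOC convention); left unchanged


def to_foreground_background(label_map: List[List[int]]) -> List[List[int]]:
    """Collapse multi-class object labels into a binary foreground/background mask.

    Every non-background, non-ignored label becomes 1 (foreground), so an object
    annotated with the wrong class still gets the correct binary target.
    """
    return [
        [lbl if lbl == IGNORE else int(lbl != BACKGROUND) for lbl in row]
        for row in label_map
    ]
```

For instance, a 2x3 label map `[[0, 15, 15], [7, 0, 255]]` becomes `[[0, 1, 1], [1, 0, 255]]`. Note that a missing segmentation (an object left labeled as background) is not fixed by this conversion. The abstract addresses that case separately with its class-dependent sigmoid loss.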