Automatic theme-based playlist generation systems often fail to replicate the quality of expert human curation. While Reinforcement Learning (RL) offers a framework for this sequential task, its effectiveness is limited by the challenge of designing reward functions that capture the knowledge of professional curators. This thesis introduces and evaluates a methodology to bridge this gap by using Large Language Models (LLMs) to translate curatorial principles, gathered from expert interviews, into dense reward function code. The main aim of this research is to determine if LLMs can effectively interpret the complex strategies of professional curators and, in turn, guide an RL agent to produce playlists that adhere to expert standards.
To investigate this, we interviewed music experts and then used LLMs to create reward functions in two ways: one from a concise summary of the interviews and another from the complete raw transcripts. These reward functions were used to train an RL agent for playlist generation. The agents’ performance was then evaluated for recommendation accuracy and alignment with the experts’ curatorial style, and compared against two baselines: a similarity-based model and an RL agent with a hand-crafted reward function.
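To make the notion of a dense, curator-informed reward function concrete, the sketch below illustrates the kind of per-track scoring function an LLM (or, in the baseline, a human engineer) might produce from curatorial principles such as smooth transitions, theme coherence, and artist diversity. All factor names, feature fields, and weights are hypothetical placeholders for illustration and do not correspond to the reward code actually generated in this thesis.

```python
# Hypothetical sketch of a dense playlist reward function; factor names,
# feature fields, and weights are illustrative assumptions, not the
# thesis's actual generated code.
import numpy as np

def playlist_reward(candidate, playlist, theme):
    """Score adding `candidate` to `playlist` under a given `theme`.

    Assumes each track carries a numeric audio-feature vector and an
    artist name, and that the theme carries a target feature profile.
    """
    if not playlist:
        # Cold start: reward closeness to the theme profile alone.
        return -np.linalg.norm(candidate["features"] - theme["profile"])

    last = playlist[-1]
    # Smooth transitions: penalize large feature jumps between adjacent tracks.
    transition = -np.linalg.norm(candidate["features"] - last["features"])
    # Theme coherence: keep tracks close to the theme's target profile.
    coherence = -np.linalg.norm(candidate["features"] - theme["profile"])
    # Diversity: discourage repeating an artist already in the playlist.
    diversity = -1.0 if candidate["artist"] in {t["artist"] for t in playlist} else 0.0

    # Placeholder weights; in the LLM condition these trade-offs are chosen
    # by the model, in the baseline they are hand-tuned.
    return 0.4 * transition + 0.4 * coherence + 0.2 * diversity
```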
The results showed that the impact of the interview summarization step on recommendation accuracy depended on the LLM: the GPT-based model showed a significant increase in accuracy, while the Gemini-based model’s performance remained consistent across both inputs. Furthermore, qualitative analysis of the generated reward functions revealed that the summarized transcripts produced high-level reward factors that were consistent across all the LLMs, whereas the raw transcripts produced more varied and granular reward factors. The choice of LLM also affected the final reward structure and the agent’s subsequent performance. In the cold-start scenario, RL agents guided by LLM-generated rewards significantly outperformed both the manually tuned RL baseline and the non-RL similarity-based model. In seeded playlist continuation tasks, however, this performance hierarchy changed, with the simpler similarity-based model achieving higher recommendation accuracy.