D. Mambelli | TU Delft Repository

Policy Distillation in Offline Multi-task Reinforcement Learning

Master thesis (2024) - J.A.E. van Lith, D. Mambelli, M.T.J. Spaan, N.M. Gürel

In Reinforcement Learning (RL), an agent learns to make decisions by interacting with an environment and receiving feedback in the form of rewards. Multi-Task Reinforcement Learning (MTRL) extends this concept by training a single agent to perform multiple tasks simultaneously, allowing for more efficient use of resources and behavior sharing between tasks. Policy Distillation (PD) is a technique commonly used in MTRL, where policies from multiple single-task agents (teachers) are distilled into a single multi-task agent (student). This is done by merging common structure across tasks, while separating task-specific properties.

However, existing PD approaches require interactions with the environment during training. In this work, we investigate the effectiveness of PD in the offline setting, where the agent has no interaction with the environment before deployment and can only learn from previously collected data. Through a series of experiments, we demonstrate that a straightforward approach yields the highest performance. This approach involves first learning teacher policies using an existing offline RL algorithm, then distilling these policies into a student by sampling states from the offline data and applying a Mean Squared Error (MSE) loss between the teachers’ and student’s best actions. Moreover, we investigate the effect of a state distribution shift—a major challenge in offline RL—on our approach. We find that such shifts impact performance only slightly in cases of relatively small neural networks or substantial distribution shifts.
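The distillation step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the teachers are hypothetical fixed linear policies standing in for policies trained with an offline RL algorithm, the student is a linear model with shared weights plus a task-specific offset, and the names `teacher_action`, `student_action`, `distill`, and `mse` are invented for this example.

```python
import random

# Hypothetical stand-ins for teachers already trained with an offline RL
# algorithm: one fixed linear policy per task (an assumption for illustration).
TEACHER_WEIGHTS = {0: [1.0, -0.5], 1: [0.2, 0.8]}


def teacher_action(task, state):
    return sum(w * s for w, s in zip(TEACHER_WEIGHTS[task], state))


def student_action(params, task, state):
    # Shared weights plus a task-specific offset (illustrative architecture).
    shared, per_task = params
    return sum((ws + wt) * s for ws, wt, s in zip(shared, per_task[task], state))


def distill(dataset, n_tasks, state_dim, steps=5000, lr=0.05, seed=0):
    """SGD on the squared error between teacher and student actions,
    using only (task, state) pairs drawn from the fixed offline dataset."""
    rng = random.Random(seed)
    shared = [0.0] * state_dim
    per_task = [[0.0] * state_dim for _ in range(n_tasks)]
    for _ in range(steps):
        task, state = rng.choice(dataset)  # no environment interaction
        err = student_action((shared, per_task), task, state) - teacher_action(task, state)
        for i, s in enumerate(state):
            grad = 2.0 * err * s  # derivative of err**2 w.r.t. weight i
            shared[i] -= lr * grad
            per_task[task][i] -= lr * grad
    return shared, per_task


def mse(params, dataset):
    return sum((student_action(params, t, s) - teacher_action(t, s)) ** 2
               for t, s in dataset) / len(dataset)


# Synthetic "offline data": (task id, state) pairs collected beforehand.
data_rng = random.Random(1)
offline_data = [(data_rng.randrange(2),
                 [data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)])
                for _ in range(200)]
student = distill(offline_data, n_tasks=2, state_dim=2)
```

Because the student is only ever queried on states from `offline_data`, its behaviour on states outside that data is unconstrained, which is where the state distribution shift mentioned above would come into play.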

We also explore how PD can be enhanced to better capture common structure across related tasks, a key to improving efficiency in MTRL. To this end, we formally define common structure at two levels: the trajectory level and the computational level. To the best of our knowledge, we present the first attempt to quantify the amount of common structure shared across tasks. This measurement reveals that task commonalities are not fully exploited automatically. At the computational level, we attempt to improve sharing of common structure by reducing the network size and adding a regularization term to the loss function. To capture more common structure at the trajectory level, we argue that multi-task exploration is required, meaning that behaviors from one task must be evaluated in the context of another task. We propose two extensions to our approach that introduce multi-task exploration: Data Sharing (DS) and Offline Q-Switch (OQS). While these extensions are capable of improving performance, they also have clear limitations.

Overall, we propose a new, high-performing offline MTRL method and provide valuable insights into the fundamental capabilities and limitations of PD in capturing common structure across tasks, specifically within the offline MTRL setting.

What would Jiminy Cricket do?

A pluralist approach in generating and processing morally-aligned text

Bachelor thesis (2023) - K.N.I. Timmerman, E. Liscio, D. Mambelli, P.K. Murukannaiah, J. Yang

When making decisions, people are automatically guided by their moral compass. AI agents, however, need to be conditioned in order to be steered towards moral behaviour. One environment that can be used to train and test such agents is Jiminy Cricket, a set of text-based narrative games in which every possible action is annotated with its morality. To create a more morally nuanced agent, we have annotated all of the actions according to the following moral values: care/harm, fairness/cheating, loyalty/betrayal, authority/subversion, and purity/degradation. To morally condition the agent, we calculate the predicted progress of a potential action and combine it with an oracle that retrieves the moral annotation of that action. From these two components, a score is calculated for each generated action, and the action is chosen based on this score. The score depends on the weights assigned to overall progress and morality, as well as on the sub-weights assigned to each moral value. Using this environment, we pose the question: if we focus on only one moral value, what is the optimal configuration for maximizing both progress and morality? The results show that the lowest relative immorality is achieved by imposing no moral constraints on the agent. Imposing constraints on the agent decreases the completion percentage relatively more than it decreases immorality. One-hot encoding the moral values reveals which immoral actions are needed to progress in the game and which immoral actions should be prevented to lower immorality. ...
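The weighted scoring described above can be illustrated with a short sketch. The abstract does not give the exact formula, so the rule below (weighted progress minus weighted immorality) is an assumption, as are the function names, the candidate actions, and all numbers.

```python
MORAL_VALUES = ("care", "fairness", "loyalty", "authority", "purity")


def action_score(progress, immorality, w_progress, w_morality, value_weights):
    """Hypothetical scoring rule: weighted progress minus weighted immorality.

    `immorality` maps a moral value to how strongly the action violates it;
    `value_weights` holds the sub-weight assigned to each moral value.
    """
    penalty = sum(value_weights[v] * immorality.get(v, 0.0) for v in MORAL_VALUES)
    return w_progress * progress - w_morality * penalty


def choose_action(candidates, w_progress, w_morality, value_weights):
    best = max(
        candidates,
        key=lambda c: action_score(c["progress"], c["immorality"],
                                   w_progress, w_morality, value_weights),
    )
    return best["name"]


def one_hot(value):
    # Focus on a single moral value, as in the research question above.
    return {v: 1.0 if v == value else 0.0 for v in MORAL_VALUES}


# Invented candidates; in the described setup the progress would come from a
# predictor and the annotation from an oracle.
candidates = [
    {"name": "steal key", "progress": 0.9, "immorality": {"fairness": 1.0}},
    {"name": "attack guard", "progress": 0.7, "immorality": {"care": 1.0}},
    {"name": "wait", "progress": 0.1, "immorality": {}},
]

amoral_choice = choose_action(candidates, 1.0, 0.0, one_hot("care"))
care_choice = choose_action(candidates, 1.0, 1.0, one_hot("care"))
fairness_choice = choose_action(candidates, 1.0, 1.0, one_hot("fairness"))
```

With these invented numbers, a one-hot weight on care still allows the unfair action, while a one-hot weight on fairness shifts the choice to the harmful one, which shows how a single-value configuration leaves the other values unconstrained.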

Natural Language Processing and Reinforcement Learning to Generate Morally Aligned Text

Comparing a moral agent to an optimally playing agent

Bachelor thesis (2023) - R.A.X.M. Lubbers, E. Liscio, D. Mambelli, P.K. Murukannaiah, J. Yang

Large Language Models are becoming more and more prevalent in today's society. However, these models act without a sense of morality; they only prioritize accomplishing their goal. Currently, little research has been done evaluating these models. Current state-of-the-art Reinforcement Learning models represent morality by a single scalar value determining the morality of a statement. This representation is inaccurate, as multiple features determine how moral a statement is. We leverage knowledge from Moral Foundations Theory to represent morality more accurately, using a 5-dimensional vector of morality features. We implement several different agents in an environment where decisions with possible moral implications need to be made. These agents use different approaches to decide which action to take. Two agents follow the policies of always picking the most moral action and always picking the most immoral action. Two other agents follow the same policies but still give some weight to game progression. Lastly, we look at an amoral agent that does not consider morality at all. We compare these agents by percent completion of the Infocom game Suspect. We find that the agent that does not take morality into account achieves the highest completion rate. Agents that give morality a very large weight almost instantly get stuck in an infinite loop without progression. ...

NLP and reinforcement learning to generate morally aligned text

How does explainable models perform compared to black-box models

Bachelor thesis (2023) - N. De Leeuw, E. Liscio, D. Mambelli, P.K. Murukannaiah, J. Yang

This paper evaluates the performance of an automated explainable model, MoralStrength, to predict morality, or more precisely Moral Foundations Theory (MFT) traits. MFT is a way to represent and divide morality into precise and detailed traits. This evaluation happens in the Jiminy Cricket environment, an environment composed of 25 text-based games. This evaluation helps us estimate the domain adaptation of MoralStrength, and also its limitations. The explainability of this model helps understand those limitations. We can conclude that MoralStrength is performing overall worse than other optimal models and that the domain adaptation to the Jiminy Cricket domain has some crucial flaws, but it leads us to think about the explainability/accuracy trade-off and where to draw the line, knowing that explainable models are important for ethical decision-making. ...

Natural Language Processing and Reinforcement Learning to Generate Morally

What is the optimal weight w to win the games while playing morally?

Bachelor thesis (2023) - K.T.C. Boudier, E. Liscio, D. Mambelli, P.K. Murukannaiah, J. Yang

In our everyday lives, people interact more and more with agents. However, these agents often lack a moral sense and prioritize the accomplishment of the given task. As a consequence, agents may unknowingly act immorally. Little research or progress has been made to endow agents with human morality and an internal sense of right and wrong. As of today, agents have a primitive representation of morality, often expressed as a single value. In contrast, humans have multiple reasons to judge an action as moral. In the hope of creating agents that are imbued with a more complex, human-like morality, we build upon the Jiminy Cricket environment. This preexisting environment has multiple games with diverse scenarios and the objective is to do the most moral action to maximize the reward ...

Alleviating the cold-start problem by using demographic data and domain-aware similarity measure

Bachelor thesis (2022) - R.C. Kalaria, F.A. Oliehoek, A.T. Czechowski, O. Azizi, D. Mambelli, D.M.J. Tax

Recommender systems (RSs) are a cornerstone for most online businesses that cater to a large customer base, such as e-commerce and social network platforms. RSs enable these platforms to provide tailor-made experiences to each of their customers by strategically utilizing user/item rating data or any other available data. Collaborative filtering (CF) techniques are some of the most popular and successful RS models created. However, CF techniques often suffer from the cold start (CS) problem. In particular, they struggle with complete cold start (CCS) situations, in which no user/item rating history is available, and incomplete cold start (ICS) situations, in which only a limited amount of user/item rating history is available.

In this paper, we explore two models which utilize novel ideas to combat the CCS and ICS problems. The first model (DCF) focuses on the intelligent use of user demographic data to combat the CCS problem. The second model (PIPCF) focuses on the use of a novel domain-specific similarity measure called Proximity-Impact-Popularity (PIP) to combat the ICS problem. In addition, we propose our own model (DPIP-CF), which combines these two ideas with some of our own modifications to combat the CCS and ICS problems simultaneously.

We utilize the MovieLens dataset, a popular and widely available dataset that is often used to test RSs. Through a series of experiments, we demonstrate the strengths of DCF and PIPCF in dealing with the CCS and ICS problems respectively. Finally, we also show that our DPIP-CF model outperforms all other models discussed in this paper and is a viable solution for dealing with the CCS and ICS problems simultaneously. ...

Minimizing the Long-tail Problem in Collaborative Filtering Based Recommender Systems Using Clustering

Bachelor thesis (2022) - Y. Mundhra, F.A. Oliehoek, A.T. Czechowski, D. Mambelli, O. Azizi, D.M.J. Tax

Recommender systems are an essential part of online businesses today. They provide users with meaningful recommendations for items and products. A frequently occurring problem in recommender systems is known as the long-tail problem. It refers to a situation in which a majority of the items in the data set have few ratings, so that many recommender systems, especially collaborative filtering based methods, are not able to recommend these so-called long-tail items. Although popular items are easier to recommend, it has been noticed that long-tail items often generate a significant fraction of the revenue and therefore should also be recommended to users. This paper proposes a modified version of a collaborative filtering based recommender system that aims to reduce the effects of the long-tail recommendation problem (LTRP). The algorithm first splits the data set into the head H and the tail T and clusters the items from the tail. The average rating avg of each cluster is calculated, and for every user and each of their unrated long-tail items, the rating for that item is set to the avg of the item's cluster with probability p. The standard collaborative filtering algorithm is then run with the newly inserted ratings. The inserted ratings reduce the sparsity of the data set and therefore make it easier to recommend long-tail items. Empirical experiments on the 100K MovieLens data set indicate that the proposed algorithm recommends more long-tail items than the standard collaborative filtering algorithm, thus reducing the effects of the LTRP while maintaining the same or a slightly lower accuracy of the recommender system. ...
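The rating-insertion step described above can be sketched as follows. This is an illustration under assumptions rather than the thesis's code: the head/tail split is taken to be "the `head_size` most-rated items form the head", the tail clustering is assumed to be given as a precomputed `item_cluster` mapping, and the function name `fill_long_tail` is hypothetical.

```python
import random
from collections import defaultdict


def fill_long_tail(ratings, item_cluster, users, head_size, p, seed=0):
    """Insert cluster-average ratings for unrated long-tail items.

    ratings: dict mapping (user, item) -> rating.
    item_cluster: dict mapping each tail item -> cluster id (assumed given).
    Returns (augmented ratings dict, set of tail items).
    """
    rng = random.Random(seed)

    # Split into head H and tail T by number of ratings (assumed criterion).
    counts = defaultdict(int)
    for _user, item in ratings:
        counts[item] += 1
    ranked = sorted(counts, key=lambda i: (-counts[i], i))
    tail = set(ranked[head_size:])

    # Average observed rating per tail cluster.
    sums, sizes = defaultdict(float), defaultdict(int)
    for (_user, item), rating in ratings.items():
        if item in tail:
            sums[item_cluster[item]] += rating
            sizes[item_cluster[item]] += 1
    avg = {c: sums[c] / sizes[c] for c in sums}

    # For each user and unrated tail item, insert avg with probability p.
    filled = dict(ratings)
    for user in users:
        for item in sorted(tail):
            if (user, item) not in ratings and rng.random() < p:
                filled[(user, item)] = avg[item_cluster[item]]
    return filled, tail


ratings = {("u1", "A"): 5, ("u2", "A"): 4, ("u3", "A"): 3,
           ("u1", "B"): 4, ("u2", "C"): 2}
clusters = {"B": "x", "C": "x"}
filled, tail = fill_long_tail(ratings, clusters, ["u1", "u2", "u3"],
                              head_size=1, p=1.0)
```

The augmented `filled` dictionary would then be passed to an unmodified collaborative filtering algorithm. Setting `p` below 1 inserts only a random subset of the candidate ratings.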

Evaluating Design Choices in Tripartite Graph-Based Recommender Systems to Improve Long Tail Recommendations

Bachelor thesis (2022) - Thomas Crul, F.A. Oliehoek, A.T. Czechowski, D. Mambelli, O. Azizi, D.M.J. Tax

Even though the ability to recommend items in the long tail is one of the main strengths of recommendation systems, modern models still show decreased performance when recommending these niche items. Various bipartite and tripartite graph-based models have been proposed that are specifically tailored to solving this long tail issue. This study aims to investigate the effect of the design of the additional layer introduced by tripartite graph-based recommender systems on their performance. All options available in the MovieLens 1M dataset are evaluated on recall and diversity. Experimental results suggest that tripartite graphs based on latent information describing the users perform better than ones utilising item-based latent information, but both these options hardly outperform the baseline bipartite model. Regardless of the graph used, normalising the transition matrix is found to significantly increase performance. It is hypothesised that larger user-focused additional layers show increased diversity over smaller options when normalised. Issues regarding the reproducibility of previous research are identified and addressed, and the development of unified evaluation metrics is advocated to prevent such problems in the future. ...

Adapting to Dynamic User Preferences in Recommendation Systems via Deep Reinforcement Learning

Bachelor thesis (2022) - P.L. Pantea, F.A. Oliehoek, A.T. Czechowski, D. Mambelli, O. Azizi, D.M.J. Tax

Recommender Systems play a significant part in filtering and efficiently prioritizing relevant information to alleviate the information overload problem and maximize user engagement. Traditional recommender systems employ a static approach towards learning the user's preferences, relying on logged previous interactions with the system and disregarding the sequential nature of the recommendation task and, consequently, the user preference shifts occurring across interactions. In this study, we formulate the recommendation task as a slate Markov Decision Process (slate-MDP) and leverage deep reinforcement learning (DRL) to learn recommendation policies through sequential interactions and maximize user engagement over extended horizons in non-stationary environments. We construct the simulated environment with various degrees of preferential dynamics and benchmark two DRL-based algorithms: FullSlateQ, a non-decomposed full slate Q-learning based on a DQN agent, and SlateQ, which implements DQN using slate decomposition. Our findings suggest that SlateQ outperforms FullSlateQ by 10.57% in non-stationary environments and that, with a moderate discount factor, the algorithms behave myopically and fail to make an appropriate tradeoff to maximize long-term user engagement. ...
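The slate decomposition that distinguishes SlateQ from FullSlateQ can be illustrated with a small sketch: the value of a slate is written as the sum of item-wise Q-values weighted by the probability that the user picks each item. The choice model below (probability proportional to a per-item score, with a no-click option) and all names and numbers are assumptions for illustration, not the setup used in the thesis.

```python
from itertools import combinations


def slate_q(slate, item_q, choice_score, no_click_score=1.0):
    """Decomposed slate value: sum over items of P(item chosen | slate) * Q(item).

    Assumed choice model: each item is chosen with probability proportional
    to its score, and the user may also choose nothing.
    """
    total = no_click_score + sum(choice_score[i] for i in slate)
    return sum(choice_score[i] / total * item_q[i] for i in slate)


def best_slate(items, slate_size, item_q, choice_score, no_click_score=1.0):
    # Brute-force search over slates, feasible only for tiny item sets.
    return max(
        combinations(sorted(items), slate_size),
        key=lambda s: slate_q(s, item_q, choice_score, no_click_score),
    )


# Invented item-wise long-term values and user choice scores.
item_q = {"a": 3.0, "b": 1.0, "c": 2.0}
choice_score = {"a": 1.0, "b": 1.0, "c": 2.0}

chosen = best_slate(item_q, 2, item_q, choice_score)
chosen_value = slate_q(chosen, item_q, choice_score)
```

Under this decomposition only one Q-value per item has to be learned, whereas a non-decomposed agent has to learn a value for every possible slate.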