Shimon Whiteson | TU Delft Repository

Facial Feedback for Reinforcement Learning: A Case Study and Offline Analysis Using the TAMER Framework

Conference paper (2021) - Guangliang Li (author), Shimon Whiteson (author), Hamdi Dibeklioğlu (author), HS Hung (author)

Interactive reinforcement learning provides a way for agents to learn to solve tasks from evaluative feedback provided by a human user. Previous research showed that humans give copious feedback early in training but very sparsely thereafter. In this paper, we investigate the pot ...

Randomized Entity-wise Factorization for Multi-Agent Reinforcement Learning

Conference paper (2021) - Shariq Iqbal (author), Christian A. Schroeder de Witt (author), Bei Peng (author), Wendelin Böhmer (author), Shimon Whiteson (author), Fei Sha (author)

Real world multi-agent tasks often involve varying types and quantities of agents and non-agent entities; however, agents within these tasks rarely need to consider all others at all times in order to act effectively. Factored value function approaches have historically leveraged ...

My body is a cage: the role of morphology in graph-based incompatible control

Conference paper (2021) - Vitaly Kurin (author), Maximilian Igl (author), Tim Rocktäschel (author), J.W. Böhmer (author), Shimon Whiteson (author)

Multitask Reinforcement Learning is a promising way to obtain models with better performance, generalisation, data efficiency, and robustness. Most existing work is limited to compatible settings, where the state and action space dimensions are the same across tasks. Graph Neural ...

UneVEn: Universal Value Exploration for Multi-Agent Reinforcement Learning

Conference paper (2021) - Tarun Gupta (author), Anuj Mahajan (author), Bei Peng (author), J.W. Böhmer (author), Shimon Whiteson (author)

VDN and QMIX are two popular value-based algorithms for cooperative MARL that learn a centralized action value function as a monotonic mixing of per-agent utilities. While this enables easy decentralization of the learned policy, the restricted joint action value function can pre ...

Transient non-stationarity and generalisation in deep reinforcement learning

Conference paper (2021) - Maximilian Igl (author), Gregory Farquhar (author), Jelena Luketina (author), J.W. Böhmer (author), Shimon Whiteson (author)

Non-stationarity can arise in Reinforcement Learning (RL) even in stationary environments. For example, most RL algorithms collect new data throughout training, using a non-stationary behaviour policy. Due to the transience of this non-stationarity, it is often not explicitly add ...

Analysing factorizations of action-value networks for cooperative multi-agent reinforcement learning

Journal article (2021) - Jacopo Castellini (author), FA Oliehoek (author), Rahul Savani (author), Shimon Whiteson (author)

Recent years have seen the application of deep reinforcement learning techniques to cooperative multi-agent systems, with great empirical success. However, given the lack of theoretical insight, it remains unclear what the employed neural networks are learning, or how we should e ...

FACMAC: Factored Multi-Agent Centralised Policy Gradients

Conference paper (2021) - Bei Peng (author), Tabish Rashid (author), Christian A. Schroeder de Witt (author), Pierre-Alexandre Kamienny (author), Philip H.S. Torr (author), J.W. Böhmer (author), Shimon Whiteson (author)

We propose FACtored Multi-Agent Centralised policy gradients (FACMAC), a new method for cooperative multi-agent reinforcement learning in both discrete and continuous action spaces. Like MADDPG, a popular multi-agent actor-critic method, our approach uses deep deterministic polic ...

Maximizing Information Gain in Partially Observable Environments via Prediction Rewards

Conference paper (2020) - Yash Satsangi (author), Sungsu Lim (author), Shimon Whiteson (author), F.A. Oliehoek (author), Martha White (author)

Information gathering in a partially observable environment can be formulated as a reinforcement learning (RL) problem, where the reward depends on the agent's uncertainty. For example, the reward can be the negative entropy of the agent's belief over an unknown (or hidden) varia ...

Facial feedback for reinforcement learning: A case study and offline analysis using the TAMER framework

Journal article (2020) - Guangliang Li (author), Hamdi Dibeklioğlu (author), Shimon Whiteson (author), HS Hung (author)

Interactive reinforcement learning provides a way for agents to learn to solve tasks from evaluative feedback provided by a human user. Previous research showed that humans give copious feedback early in training but very sparsely thereafter. In this article, we investigate the p ...

Multitask Soft Option Learning

Conference paper (2020) - Maximilian Igl (author), Andrew Gambardella (author), J. He (author), Nantas Nardelli (author), N Siddharth (author), J.W. Böhmer (author), Shimon Whiteson (author)

We present Multitask Soft Option Learning (MSOL), a hierarchical multitask framework based on Planning as Inference. MSOL extends the concept of options, using separate variational posteriors for each task, regularized by a shared prior. This “soft” version of options avoids seve ...

Optimistic Exploration even with a Pessimistic Initialisation

Conference paper (2020) - Tabish Rashid (author), Bei Peng (author), J.W. Böhmer (author), Shimon Whiteson (author)

Deep coordination graphs

Conference paper (2020) - J.W. Böhmer (author), Vitaly Kurin (author), Shimon Whiteson (author)

This paper introduces the deep coordination graph (DCG) for collaborative multi-agent reinforcement learning. DCG strikes a flexible tradeoff between representational capacity and generalization by factoring the joint value function of all agents according to a coordination graph ...

Deep residual reinforcement learning

Conference paper (2020) - Shangtong Zhang (author), J.W. Böhmer (author), Shimon Whiteson (author)

We revisit residual algorithms in both model-free and model-based reinforcement learning settings. We propose the bidirectional target network technique to stabilize residual algorithms, yielding a residual version of DDPG that significantly outperforms vanilla DDPG in the DeepMi ...

Multi-agent common knowledge reinforcement learning

Journal article (2019) - Christian A. Schroeder de Witt (author), Jakob N. Foerster (author), Gregory Farquhar (author), Philip H.S. Torr (author), J.W. Böhmer (author), Shimon Whiteson (author)

Cooperative multi-agent reinforcement learning often requires decentralised policies, which severely limit the agents' ability to coordinate their behaviour. In this paper, we show that common knowledge between agents allows for complex decentralised coordination. Common knowledg ...

Exploration with Unreliable Intrinsic Reward in Multi-Agent Reinforcement Learning

Conference paper (2019) - J.W. Böhmer (author), Tabish Rashid (author), Shimon Whiteson (author)

The Representational Capacity of Action-Value Networks for Multi-Agent Reinforcement Learning

Conference paper (2019) - Jacopo Castellini (author), FA Oliehoek (author), Rahul Savani (author), Shimon Whiteson (author)

Recent years have seen the application of deep reinforcement learning techniques to cooperative multi-agent systems, with great empirical success. In this work, we empirically investigate the representational power of various network architectures on a series of one-shot games. D ...

Generalized off-policy actor-critic

Journal article (2019) - Shangtong Zhang (author), J.W. Böhmer (author), Shimon Whiteson (author)

We propose a new objective, the counterfactual objective, unifying existing objectives for off-policy policy gradient algorithms in the continuing reinforcement learning (RL) setting. Compared to the commonly used excursion objective, which can be misleading about the performance ...

Social interaction for efficient agent learning from human reward

Journal article (2018) - Guangliang Li (author), Shimon Whiteson (author), W Bradley Knox (author), Hayley Hung (author)

Learning from rewards generated by a human trainer observing an agent in action has been proven to be a powerful method for teaching autonomous agents to perform challenging tasks, especially for non-technical users. Since the efficacy of this approach depends critically on ...

Exploiting submodular value functions for scaling up active perception

Journal article (2018) - Yash Satsangi (author), Shimon Whiteson (author), Frans A. Oliehoek (author), Matthijs Spaan (author)

In active perception tasks, an agent aims to select sensory actions that reduce its uncertainty about one or more hidden variables. For example, a mobile robot takes sensory actions to efficiently navigate in a new environment. While partially observable Markov decision processes (POMDPs) provide a natural model for such problems, reward functions that directly penalize uncertainty in the agent’s belief can remove the piecewise-linear and convex (PWLC) property of the value function required by most POMDP planners. Furthermore, as the number of sensors available to the agent grows, the computational cost of POMDP planning grows exponentially with it, making POMDP planning infeasible with traditional methods. In this article, we address a twofold challenge of modeling and planning for active perception tasks. We analyze rhoPOMDP and POMDP-IR, two frameworks for modeling active perception tasks that restore the PWLC property of the value function. We establish the mathematical equivalence of these two frameworks by showing that a rhoPOMDP and a policy can be reduced to a POMDP-IR and an equivalent policy (and vice versa). We prove that the value functions for the given rhoPOMDP (with the given policy) and the reduced POMDP-IR (with the reduced policy) are the same. To efficiently plan for active perception tasks, we identify and exploit the independence properties of POMDP-IR to reduce the computational cost of solving POMDP-IR (and rhoPOMDP). We propose greedy point-based value iteration (PBVI), a new POMDP planning method that uses greedy maximization to greatly improve scalability in the action space of an active perception POMDP. Furthermore, we show that, under certain conditions, including submodularity, the value function computed using greedy PBVI is guaranteed to have bounded error with respect to the optimal value function. We establish the conditions under which the value function of an active perception POMDP is guaranteed to be submodular. Finally, we present a detailed empirical analysis on a dataset collected from a multi-camera tracking system employed in a shopping mall. Our method achieves similar performance to existing methods at a fraction of the computational cost, leading to better scalability for solving active perception tasks.
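The greedy maximization mentioned in the abstract above can be illustrated with a minimal sketch. This is not the article's greedy PBVI, only the generic greedy selection step applied to a toy set function. The camera names, the COVERAGE table, and the functions greedy_max and covered_cells are hypothetical and chosen for illustration. The toy value function (number of cells covered) is monotone and submodular, which is the setting in which greedy selection has the classic bounded-error guarantee.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List, Set

# Hypothetical data: each camera observes a set of grid cells.
COVERAGE: Dict[str, Set[int]] = {
    "cam_a": {1, 2, 3},
    "cam_b": {3, 4},
    "cam_c": {4, 5, 6, 7},
    "cam_d": {1, 7},
}


def covered_cells(subset: Iterable[str]) -> int:
    """Toy set function: number of distinct cells seen by the chosen cameras."""
    cells: Set[int] = set()
    for cam in subset:
        cells |= COVERAGE[cam]
    return len(cells)


def greedy_max(
    items: Iterable[str],
    value: Callable[[FrozenSet[str]], int],
    k: int,
) -> List[str]:
    """Pick up to k items, each time adding the one with the largest marginal gain.

    This evaluates on the order of k * n subsets instead of all subsets of size k.
    """
    chosen: List[str] = []
    remaining = set(items)
    for _ in range(min(k, len(remaining))):
        base = value(frozenset(chosen))
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(
            sorted(remaining),
            key=lambda s: value(frozenset(chosen + [s])) - base,
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen


selected = greedy_max(COVERAGE, covered_cells, k=2)
print(selected, covered_cells(selected))
```

In this toy instance the greedy choice first takes the camera covering the most cells, then the one adding the most new cells, so the cost grows with the number of candidate sensors rather than with the number of sensor subsets.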

Bounded approximations for linear multi-objective planning under uncertainty

Conference paper (2014) - Diederik M. Roijers (author), J.C.D. Scharpff (author), Matthijs T. J. Spaan (author), FA Oliehoek (author), Mathijs Weerdt (author), Shimon Whiteson (author)

Planning under uncertainty poses a complex problem in which multiple objectives often need to be balanced. When dealing with multiple objectives, it is often assumed that the relative importance of the objectives is known a priori. However, in practice human decision makers often ...