M. Suau | TU Delft Repository

Leveraging Factored State Representations for Enhanced Efficiency in Reinforcement Learning

Doctoral thesis (2024) - M. Suau, F.A. Oliehoek, M.T.J. Spaan

Reinforcement learning techniques have demonstrated great promise in tackling sequential decision-making problems. However, the inherent complexity of real-world scenarios presents significant challenges for their application. This thesis takes a fresh approach that explores the untapped potential of factored state representations as a means to enhance the efficiency of reinforcement learning.

Factored representations involve variables describing various features of the environment. These variables, along with their possible values, define the agent’s states. Unlike standard representations, factored representations provide a unique perspective that enables us to gain deeper insights into the underlying structure of the environment and refine our understanding of the problem at hand.

By analyzing variable dependencies, we can abstract simplified representations of the environment states and construct computationally lightweight models. To do so, we will explore potential factorizations of key functions governing the reinforcement learning problem, such as transitions, rewards, policies, or value functions. These factorizations can be achieved by exploiting variable redundancies and leveraging relations of conditional independence.
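As a minimal sketch of what such a factorization can look like, the Python snippet below writes the transition function of a hypothetical two-variable environment as a product of per-variable conditional distributions. The variables x and y, the probabilities, and all function names are illustrative assumptions and are not taken from the thesis.

```python
import itertools

# Hypothetical environment: a state is a tuple (x, y) of binary variables.
# All probabilities below are made up for illustration.

def p_x_next(x_next, x, action):
    """P(x' | x, a): x flips with probability 0.9 when action == 1, else stays."""
    p_flip = 0.9 if action == 1 else 0.0
    return p_flip if x_next != x else 1.0 - p_flip

def p_y_next(y_next, x, y):
    """P(y' | x, y): y copies x with probability 0.8, otherwise keeps its value."""
    prob = 0.0
    if y_next == x:
        prob += 0.8
    if y_next == y:
        prob += 0.2
    return prob

def transition_prob(state_next, state, action):
    """Factored transition: P(s' | s, a) = P(x' | x, a) * P(y' | x, y)."""
    (x_next, y_next), (x, y) = state_next, state
    return p_x_next(x_next, x, action) * p_y_next(y_next, x, y)

states = list(itertools.product([0, 1], repeat=2))
# The per-variable factors still define a proper distribution over next states.
total = sum(transition_prob(s_next, (0, 1), 1) for s_next in states)
```

In this sketch y' is conditionally independent of the action given x and y, so its factor never needs the action as input. Each factor is a small table over a few variables, rather than one table over all joint states.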

This thesis proposes a set of methods that are shown to improve the efficiency and scalability of reinforcement learning in complex scenarios. We hope that the findings of this research contribute to showcasing the potential of factored representations and serve as inspiration for future research in this direction.
...

Bad Habits: Policy Confounding and Out-of-Trajectory Generalization in RL

Preprint (2023) - M. Suau, M.T.J. Spaan, F.A. Oliehoek

Reinforcement learning agents may sometimes develop habits that are effective only when specific policies are followed. After an initial exploration phase in which agents try out different actions, they eventually converge toward a particular policy. When this occurs, the distribution of state-action trajectories becomes narrower, and agents start experiencing the same transitions again and again. At this point, spurious correlations may arise. Agents may then pick up on these correlations and learn state representations that do not generalize beyond the agent’s trajectory distribution. In this paper, we provide a mathematical characterization of this phenomenon, which we refer to as policy confounding, and show, through a series of examples, when and how it occurs in practice. ...

Online Planning in POMDPs with Self-Improving Simulators

Conference paper (2022) - Jinke He, Miguel Suau, Hendrik Baier, Michael Kaisers, Frans A. Oliehoek

How can we plan efficiently in a large and complex environment when the time budget is limited? Given the original simulator of the environment, which may be computationally very demanding, we propose to learn online an approximate but much faster simulator that improves over time. To plan reliably and efficiently while the approximate simulator is learning, we develop a method that adaptively decides which simulator to use for every simulation, based on a statistic that measures the accuracy of the approximate simulator. This allows us to use the approximate simulator to replace the original simulator for faster simulations when it is accurate enough under the current context, thus trading off simulation speed and accuracy. Experimental results in two large domains show that when integrated with POMCP, our approach makes it possible to plan with improving efficiency over time. ...
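The selection idea can be illustrated with a small Python sketch. It is a hypothetical illustration only: the exponential moving average of an error signal, the fixed threshold, and the class and method names are assumptions made here, and they are not the statistic or the decision rule used in the paper.

```python
class SimulatorSelector:
    """Chooses between a slow original simulator and a fast learned one.

    Hypothetical sketch: the moving-average error statistic and the fixed
    threshold are illustrative assumptions, not the paper's method.
    """

    def __init__(self, threshold=0.1, min_samples=3, alpha=0.5):
        self.threshold = threshold      # maximum tolerated average error
        self.min_samples = min_samples  # observations needed before trusting
        self.alpha = alpha              # weight given to the newest error
        self.avg_error = None
        self.count = 0

    def record_error(self, error):
        """Update the exponential moving average of the approximation error."""
        self.count += 1
        if self.avg_error is None:
            self.avg_error = error
        else:
            self.avg_error = (1 - self.alpha) * self.avg_error + self.alpha * error

    def use_approximate(self):
        """Trust the fast simulator only once it has proven accurate enough."""
        return self.count >= self.min_samples and self.avg_error <= self.threshold


selector = SimulatorSelector()
choices = []
# Errors shrink as the (hypothetical) approximate simulator keeps learning.
for error in [0.5, 0.3, 0.1, 0.05, 0.01, 0.01, 0.01]:
    choices.append("approximate" if selector.use_approximate() else "original")
    selector.record_error(error)
```

Here the error values are supplied by hand. In a real planner they would have to come from comparing the approximate simulator's predictions against the original simulator whenever the latter is run.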

Distributed Influence-Augmented Local Simulators for Parallel MARL in Large Networked Systems

Conference paper (2022) - M. Suau, J. He, Mustafa Mert Çelikok, M.T.J. Spaan, F.A. Oliehoek

Due to the high sample complexity of reinforcement learning, simulation is, as of today, critical for its successful application. Many real-world problems, however, exhibit overly complex dynamics, which makes their full-scale simulation computationally slow. In this paper, we show how to factorize large networked systems of many agents into multiple local regions such that we can build separate simulators that run independently and in parallel. To monitor the influence that the different local regions exert on one another, each of these simulators is equipped with a learned model that is periodically trained on real trajectories. Our empirical results reveal that distributing the simulation among different processes not only makes it possible to train large multi-agent systems in just a few hours but also helps mitigate the negative effects of simultaneous learning ...

Influence-Augmented Local Simulators: A Scalable Solution for Fast Deep RL in Large Networked Systems

Conference paper (2022) - Miguel Suau, Jinke He, Matthijs T.J. Spaan, Frans A. Oliehoek

Learning effective policies for real-world problems is still an open challenge for the field of reinforcement learning (RL). The main limitation is the amount of data needed and the pace at which that data can be obtained. In this paper, we study how to build lightweight simulators of complicated systems that can run sufficiently fast for deep RL to be applicable. We focus on domains where agents interact with a reduced portion of a larger environment while still being affected by the global dynamics. Our method combines the use of local simulators with learned models that mimic the influence of the global system. The experiments reveal that incorporating this idea into the deep RL workflow can considerably accelerate the training process and presents several opportunities for the future. ...

Speeding up Deep Reinforcement Learning through Influence-Augmented Local Simulators

Conference paper (2022) - Miguel Suau, Jinke He, Matthijs T.J. Spaan, Frans A. Oliehoek

Learning effective policies for real-world problems is still an open challenge for the field of reinforcement learning (RL). The main limitation is the amount of data needed and the pace at which that data can be obtained. In this paper, we study how to build lightweight simulators of complicated systems that can run sufficiently fast for deep RL to be applicable. We focus on domains where agents interact with a reduced portion of a larger environment while still being affected by the global dynamics. Our method combines the use of local simulators with learned models that mimic the influence of the global system. The experiments reveal that incorporating this idea into the deep RL workflow can considerably accelerate the training process and presents several opportunities for the future. ...

Influence-aware memory architectures for deep reinforcement learning in POMDPs

Journal article (2022) - Miguel Suau, Jinke He, Elena Congeduti, Rolf Starre, Aleksander Czechowski, Frans A. Oliehoek

Due to its perceptual limitations, an agent may have too little information about the environment to act optimally. In such cases, it is important to keep track of the action-observation history to uncover hidden state information. Recent deep reinforcement learning methods use recurrent neural networks (RNNs) to memorize past observations. However, these models are expensive to train and have convergence difficulties, especially when dealing with high-dimensional data. In this paper, we propose influence-aware memory, a theoretically inspired memory architecture that alleviates the training difficulties by restricting the input of the recurrent layers to those variables that influence the hidden state information. Moreover, as opposed to standard RNNs, in which every piece of information used for estimating Q values is inevitably fed back into the network for the next prediction, our model allows information to flow without being necessarily stored in the RNN’s internal memory. Results indicate that, by letting the recurrent layers focus on a small fraction of the observation variables while processing the rest of the information with a feedforward neural network, we can outperform standard recurrent architectures both in training speed and policy performance. This approach also reduces runtime and obtains better scores than methods that stack multiple observations to remove partial observability. ...

Using Bisimulation Metrics to Analyze and Evaluate Latent State Representations

Conference paper (2021) - N. Albers, M. Suau de Castro, F.A. Oliehoek

Deep Reinforcement Learning (RL) is a promising technique for constructing intelligent agents, but it is not always easy to understand the learning process and the factors that impact it. To shed some light on this, we analyze the Latent State Representations (LSRs) that deep RL agents learn, and compare them to what such agents should ideally learn. We propose a crisp definition of ‘ideal LSR’ based on a bisimulation metric, which measures how behaviorally similar states are. The ideal LSR is that in which the distance between two states is proportional to this bisimulation metric. Intuitively, forming such an ideal representation is highly favorable due to its compactness and generalization properties. Here we investigate if this type of representation is also desirable in practice. Our experiments suggest that learning representations that are close to this ideal LSR may improve generalization to new irrelevant feature values and modified dynamics. Yet, we show empirically that the extent to which such representations are learned depends on both the network capacity and the state encoding, and that with the current techniques the exact ideal LSR is never formed. ...
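To make the notion of behavioral similarity concrete, the Python sketch below computes a bisimulation metric by fixed-point iteration for the special case of a deterministic MDP, where the distance between two states is the largest reward difference plus the discounted distance between their successors. The toy MDP, the discount factor, and the function name are assumptions for illustration. The abstract does not state which variant of the metric the paper uses, so this may differ from it.

```python
import itertools

def bisimulation_metric(rewards, next_state, gamma=0.9, tol=1e-8):
    """Fixed-point iteration for a bisimulation metric on a deterministic MDP.

    rewards[s][a] is the reward and next_state[s][a] the successor of taking
    action a in state s. With deterministic transitions, the Wasserstein term
    of the general definition reduces to the distance between successors.
    """
    n_states, n_actions = len(rewards), len(rewards[0])
    dist = [[0.0] * n_states for _ in range(n_states)]
    while True:
        new = [[0.0] * n_states for _ in range(n_states)]
        for s, t in itertools.product(range(n_states), repeat=2):
            new[s][t] = max(
                abs(rewards[s][a] - rewards[t][a])
                + gamma * dist[next_state[s][a]][next_state[t][a]]
                for a in range(n_actions)
            )
        delta = max(abs(new[s][t] - dist[s][t])
                    for s in range(n_states) for t in range(n_states))
        dist = new
        if delta < tol:
            return dist

# Hypothetical 3-state, 1-action MDP: states 0 and 1 behave identically
# (same reward, both move to state 2), while state 2 is an absorbing goal.
rewards = [[0.0], [0.0], [1.0]]
next_state = [[2], [2], [2]]
d = bisimulation_metric(rewards, next_state)
```

In this toy example states 0 and 1 are at distance zero. A representation whose distances are proportional to the metric would therefore map them to the same point and keep both away from state 2.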

Learning What to Attend to: Using Bisimulation Metrics to Explore and Improve Upon What a Deep Reinforcement Learning Agent Learns

Abstract (2020) - N. Albers, M. Suau de Castro, F.A. Oliehoek

Recent years have seen a surge of algorithms and architectures for deep Reinforcement Learning (RL), many of which have shown remarkable success for various problems. Yet, little work has attempted to relate the performance of these algorithms and architectures to what the resulting deep RL agents actually learn, and whether this corresponds to what they should ideally learn. Such a comparison may allow for both an improved understanding of why certain algorithms or network architectures perform better than others and the development of methods that specifically address discrepancies between what is and what should be learned. ...

Influence-Augmented Online Planning for Complex Environments

Journal article (2020) - J. He, M. Suau de Castro, F.A. Oliehoek

How can we plan efficiently in real time to control an agent in a complex environment that may involve many other agents? While existing sample-based planners have enjoyed empirical success in large POMDPs, their performance heavily relies on a fast simulator. However, real-world scenarios are complex in nature and their simulators are often computationally demanding, which severely limits the performance of online planners. In this work, we propose influence-augmented online planning, a principled method to transform a factored simulator of the entire environment into a local simulator that samples only the state variables that are most relevant to the observation and reward of the planning agent and captures the incoming influence from the rest of the environment using machine learning methods. Our main experimental results show that planning on this less accurate but much faster local simulator with POMCP leads to higher real-time planning performance than planning on the simulator that models the entire environment. ...

Influence-Based Abstraction in Deep Reinforcement Learning

Conference paper (2019) - Miguel Suau de Castro, Elena Congeduti, Rolf Starre, Aleksander Czechowski, Frans Oliehoek

... thousands, or even millions of state variables. Unfortunately, applying reinforcement learning algorithms to handle complex tasks becomes more and more challenging as the number of state variables increases. In this paper, we build on the concept of influence-based abstraction which tries to tackle such scalability issues by decomposing large systems into small regions. We explore this method in the context of deep reinforcement learning, showing that by keeping track of a small set of variables in the history of previous actions and observations we can learn policies that can effectively control a local region in the global system. ...