Thiago D. Simão | TU Delft Repository

Reinforcement Learning by Guided Safe Exploration

Conference paper (2023) - Qisong Yang (author) , Thiago D. Simão (author) , Nils Jansen (author) , Simon Tindemans (author) , Matthijs T. J. Spaan (author)

Safety is critical to broadening the application of reinforcement learning (RL). Often, we train RL agents in a controlled environment, such as a laboratory, before deploying them in the real world. However, the real-world target task might be unknown prior to deployment. Reward- ...

Safe Online and Offline Reinforcement Learning

Doctoral thesis (2023) - Thiago D. Simão (author)

Reinforcement Learning (RL) agents can solve general problems based on little to no knowledge of the underlying environment. These agents learn through experience, using a trial-and-error strategy that can lead to effective innovations, but this randomized process might cause undesirable events. Therefore, to enable the adoption of RL in our daily lives, we must ensure that these agents are reliable and safe. Safety requirements are often incompatible with the naive random exploration usually performed by RL agents. Safe RL studies how to make such agents more reliable and how to ensure they behave appropriately. We investigate these issues in online settings, where the agent interacts directly with the environment, and in offline settings, where the agent only has access to historical data and does not interact directly with the environment. While safety has numerous facets in RL, in this thesis, we focus on two of them. First, the safe policy improvement problem, which considers how to compute a policy offline reliably. Second, the constrained reinforcement learning problem, which investigates how to learn a policy that satisfies a set of safety constraints. Next, we detail these perspectives and how we approach them.

The first perspective is of particular interest in offline settings. In this setting, we can imagine that some decision mechanism has been operating the system; we refer to this mechanism as the behavior policy. Assuming these past decisions were recorded in a database, we would like to use RL to compute a new policy from this database. It would be difficult to convince stakeholders to switch to the policy computed by RL if there were chances that the new policy would cause considerable performance loss compared to the behavior policy. Therefore, developing algorithms that reliably compute policies that outperform the behavior policy is essential, as this gives confidence to decision-makers that the new policy will not degrade the performance of the underlying system. The safe policy improvement problem formalizes these issues.

Considering that real-world data is limited and costly, in Chapter 3, we investigate how to improve the sample complexity of safe policy improvement algorithms by exploiting the factored structure of the underlying problem. In particular, we consider problems where the dynamics of each state variable depend only on a small subset of the state variables. Exploiting this structure, we develop RL algorithms that require orders of magnitude less data to find better policies than their counterparts that ignore such structure. This method also generalizes samples from one state to another, which allows us to compute improved policies even if the data only partially cover the problem.

In many real-world applications such as dialogue systems, pharmaceutical tests, and crop management, data is collected under human supervision, and the behavior policy remains unknown. In Chapter 4, we apply safe policy improvement algorithms with an estimated policy built from data. We formally provide safe policy improvement guarantees over the behavior policy even without direct access to it. Our empirical experiments on tasks with finite and continuous states support the theoretical findings.

The second safety perspective is relevant for online RL agents. Engineering a reward signal that allows the agent to maximize its performance while remaining safe is not trivial. Therefore, it is better to decouple safety from reward using constrained Markov decision processes (CMDPs), where an independent signal models the safety aspects. In this setting, an RL agent can autonomously find trade-offs between performance and safety. Unfortunately, most RL agents designed for the constrained setting only guarantee safety after the learning phase, which prevents their direct deployment.

In Chapter 6, we investigate settings where a concise abstract model of the safety aspects is given, a reasonable assumption since a thorough understanding of safety-related matters is a prerequisite for deploying RL in typical applications. We propose an RL algorithm that uses this abstract model to learn policies safely. During the training process, this algorithm can seamlessly switch from a conservative to a greedy policy without violating the safety constraints. We prove that this algorithm is safe under the given assumptions. Empirically, we show that even if safety and reward signals are contradictory, this algorithm always operates safely, while when they are aligned, this approach also improves the agent's performance. Finally, we study how to reduce the performance regret of this algorithm without sacrificing the safety guarantees.

To summarize, we develop new RL methods exploiting prior knowledge about the structure of the problem. We propose reliable offline algorithms that can improve the policy using less data and online algorithms that comply with safety constraints while learning. Besides safety and reliability, we also touch on other issues preventing the deployment of RL to real-world tasks, such as data efficiency and learning with a fixed batch of data. Nevertheless, we must recall that other challenges, such as partial observability and explainability, still require attention. We hope this thesis serves as a stepping stone toward combining different types of prior knowledge to improve various aspects of RL.
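For readers unfamiliar with the two problems named in the abstract above, they are commonly written along the following lines. This is only a sketch in standard textbook notation, not the thesis's own formulation: the symbols r (reward), c (safety cost), d (cost budget), γ (discount factor), ρ (policy performance), π_b (behavior policy), ζ (tolerated loss), and δ (failure probability) are assumptions introduced here for illustration.

```latex
% Constrained RL (CMDP), assumed discounted form:
% maximize expected return subject to a bound d on expected safety cost.
\begin{align}
  \max_{\pi} \quad & \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\Big] \\
  \text{s.t.} \quad & \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, c(s_t, a_t)\Big] \le d
\end{align}

% Safe policy improvement, assumed probabilistic form:
% with probability at least 1 - \delta, the new policy \pi loses
% at most \zeta in performance relative to the behavior policy \pi_b.
\begin{equation}
  \Pr\big( \rho(\pi) \ge \rho(\pi_b) - \zeta \big) \ge 1 - \delta
\end{equation}
```

Under this reading, the offline chapters concern the second statement (how much data is needed before the guarantee holds), while the online chapter concerns keeping the constraint in the first statement satisfied during training as well as after it.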

Training and Transferring Safe Policies in Reinforcement Learning

Conference paper (2022) - Qisong Yang (author) , Thiago D. Simão (author) , Nils Jansen (author) , Simon Tindemans (author) , Matthijs T. J. Spaan (author)

Safety is critical to broadening the application of reinforcement learning (RL). Often, RL agents are trained in a controlled environment, such as a laboratory, before being deployed in the real world. However, the target reward might be unknown prior to deployment. Reward-free R ...

Back to the Future

Solving Hidden Parameter MDPs with Hindsight

Conference paper (2022) - C.T. Ponnambalam (author) , Danial Kamran (author) , Thiago D. Simão (author) , FA Oliehoek (author) , Matthijs Spaan (author)

A Modern Perspective on Safe Automated Driving for Different Traffic Dynamics using Constrained Reinforcement Learning

Conference paper (2022) - Danial Kamran (author) , Thiago D. Simão (author) , Q. Yang (author) , Canmanie T. Ponnambalam (author) , Johannes Fischer (author) , Matthijs T.J. Spaan (author) , Martin Lauer (author)

The use of reinforcement learning (RL) in real-world domains often requires extensive effort to ensure safe behavior. While this compromises the autonomy of the system, it might still be too risky to allow a learning agent to freely explore its environment. These strict impositio ...

Refined Risk Management in Safe Reinforcement Learning with a Distributional Safety Critic

Conference paper (2022) - Q. Yang (author) , Thiago D. Simão (author) , Simon H. Tindemans (author) , Matthijs TJ Spaan (author)

Safety is critical to broadening the real-world use of reinforcement learning (RL). Modeling the safety aspects using a safety-cost signal separate from the reward is becoming standard practice, since it avoids the problem of finding a good balance between safety and performance. ...

Safety-constrained reinforcement learning with a distributional safety critic

Journal article (2022) - Q. Yang (author) , Thiago D. Simão (author) , Simon H. Tindemans (author) , Matthijs TJ Spaan (author)

Safety is critical to broadening the real-world use of reinforcement learning. Modeling the safety aspects using a safety-cost signal separate from the reward and bounding the expected safety-cost is becoming standard practice, since it avoids the problem of finding a good balanc ...

WCSAC: Worst-Case Soft Actor Critic for Safety-Constrained Reinforcement Learning

Conference paper (2021) - Q. Yang (author) , Thiago D. Simão (author) , Simon H. Tindemans (author) , Matthijs TJ Spaan (author)

Safe exploration is regarded as a key priority area for reinforcement learning research. With separate reward and safety signals, it is natural to cast it as constrained reinforcement learning, where expected long-term costs of policies are constrained. However, it can be hazardo ...

AlwaysSafe: Reinforcement Learning without Safety Constraint Violations during Training

Conference paper (2021) - Thiago D. Simão (author) , Nils Jansen (author) , Matthijs T.J. Spaan (author)

Deploying reinforcement learning (RL) involves major concerns around safety. Engineering a reward signal that allows the agent to maximize its performance while remaining safe is not trivial. Safe RL studies how to mitigate such problems. For instance, we can decouple safety fro ...

Safe Policy Improvement with an Estimated Baseline Policy

Conference paper (2020) - Thiago D. Simão (author) , Romain Laroche (author) , Rémi Tachet des Combes (author)

Previous work has shown the unreliability of existing algorithms in the batch Reinforcement Learning setting, and proposed the theoretically-grounded Safe Policy Improvement with Baseline Bootstrapping (SPIBB) fix: reproduce the baseline policy in the uncertain state-action pairs ...

Structure Learning for Safe Policy Improvement

Conference paper (2019) - Thiago D. Simão (author) , MTJ Spaan (author)

We investigate how Safe Policy Improvement (SPI) algorithms can exploit the structure of factored Markov decision processes when such structure is unknown a priori. To facilitate the application of reinforcement learning in the real world, SPI provides probabilistic guarantees th ...

Safe and Sample-Efficient Reinforcement Learning Algorithms for Factored Environments

Conference paper (2019) - Thiago D. Simão (author)

Reinforcement Learning (RL) deals with problems that can be modeled as a Markov Decision Process (MDP) where the transition function is unknown. In situations where an arbitrary policy π is already in execution and the experiences with the environment were recorded in a batch D, ...

Safe Policy Improvement with Baseline Bootstrapping in Factored Environments

Conference paper (2019) - Thiago D. Simão (author) , MTJ Spaan (author)

We present a novel safe reinforcement learning algorithm that exploits the factored dynamics of the environment to become less conservative. We focus on problem settings in which a policy is already running and the interaction with the environment is limited. In order to safely d ...