Circular Image

Thiago D. Simão

info

Please Note

14 records found

Conference paper (2025) - Maris F.L. Galesloot, Marnix Suilen, Thiago D. Simão, Steven Carr, Matthijs T.J. Spaan, Ufuk Topcu, Nils Jansen
Robust POMDPs extend classical POMDPs to incorporate model uncertainty using so-called uncertainty sets on the transition and observation functions, effectively defining ranges of probabilities. Policies for robust POMDPs must be (1) memory-based to account for partial observability and (2) robust against model uncertainty to account for the worst-case probability instances from the uncertainty sets. To compute such robust memory-based policies, we propose the pessimistic iterative planning (PIP) framework, which alternates between (1) selecting pessimistic POMDPs via worst-case probability instances from the uncertainty sets, and (2) computing finite-state controllers (FSCs) for these pessimistic POMDPs. Within PIP, we propose the RFSCNET algorithm, which optimizes a recurrent neural network to compute the FSCs. The empirical evaluation shows that RFSCNET can compute better-performing robust policies than several baselines and a state-of-the-art robust POMDP solver. ...
Conference paper (2023) - Qisong Yang, Thiago D. Simão, Nils Jansen, Simon H. Tindemans, Matthijs T.J. Spaan
Safety is critical to broadening the application of reinforcement learning (RL). Often, we train RL agents in a controlled environment, such as a laboratory, before deploying them in the real world. However, the real-world target task might be unknown prior to deployment. Reward-free RL trains an agent without the reward to adapt quickly once the reward is revealed. We consider the constrained reward-free setting, where an agent (the guide) learns to explore safely without the reward signal. This agent is trained in a controlled environment, which allows unsafe interactions and still provides the safety signal. After the target task is revealed, safety violations are not allowed anymore. Thus, the guide is leveraged to compose a safe behaviour policy. Drawing from transfer learning, we also regularize a target policy (the student) towards the guide while the student is unreliable and gradually eliminate the influence of the guide as training progresses. The empirical analysis shows that this method can achieve safe transfer learning and helps the student solve the target task faster. ...
Doctoral thesis (2023) - T. D. Simão
Reinforcement Learning (RL) agents can solve general problems based on little to no knowledge of the underlying environment. These agents learn through experience, using a trial-and-error strategy that can lead to effective innovations, but this randomized process might cause undesirable events. Therefore, to enable the adoption of RL in our daily lives, we must ensure their reliability and safety. Safety requirements are often incompatible with the naive random exploration usually performed by RL agents. Safe RL studies how to make such agents more reliable and how to ensure they behave appropriately. We investigate these issues in online settings, where the agent interacts directly with the environment, and in offline settings, where the agent only has access to historical data and does not interact directly with the environment. While safety has numerous facets in RL, in this thesis, we focus on two of them. First, the safe policy improvement problem, which considers how to compute a policy offline reliably. Second, the constrained reinforcement learning problem, which investigates how to learn a policy that satisfies a set of safety constraints. Next, we detail these perspectives and how we approach them. The first perspective is of particular interest in offline settings. In this setting, we can imagine some decision mechanism has been operating the system, we refer to this mechanism as the behavior policy. Assuming these past decisions were recorded in a database, we would like to use RL to compute a new policy using such database. It would be difficult to convince stakeholders to switch to the policy computed by RL if there were chances that the new policy would cause considerable performance loss compared to the behavior policy. Therefore, developing algorithms that reliably compute policies that outperform the behavior policy is essential as this gives confidence to decision-makers that the new policy will not degrade the performance of the underlying system. The safe policy improvement problem formalizes these issues. Considering that real-world data is limited and costly, in Chapter 3, we investigate how to improve the sample complexity of safe policy improvement algorithms by exploiting the factored structure of the underlying problem. In particular, we consider problems where the dynamics of each state variable depend only on a small subset of the state variables. Exploiting this structure, we develop RL algorithms that require orders of magnitude fewer data to find better policies than their counterparts that ignore such structure. This method also generalizes samples from one state to another, which allows us to compute improved policies if the data only partially cover the problem. In many real-world applications such as dialogue systems, pharmaceutical tests, and crop management, data is collected under human supervision, and the behavior policy remains unknown. In Chapter 4, we apply safe policy improvement algorithms with an estimated policy built from data. We formally provide safe policy improvement guarantees over the behavior policy even without direct access to it. Our empirical experiments on tasks with finite and continuous states support the theoretical findings. The second safety perspective is relevant for online RL agents. Engineering a reward signal that allows the agent to maximize its performance while remaining safe is not trivial. Therefore, it is better to decouple safety from reward using constrained Markov decision processes (CMDPs), where an independent signal models the safety aspects. In this setting, an RL agent can autonomously find trade-offs between performance and safety. Unfortunately, most RL agents designed for the constrained setting only guarantee safety after the learning phase, which prevents their direct deployment. In Chapter 6, we investigate settings where a concise abstract model of the safety aspects is given, a reasonable assumption since a thorough understanding of safety-related matters is a prerequisite for deploying RL in typical applications. We propose an RL algorithm that uses this abstract model to learn policies safely. During the training process, this algorithm can seamlessly switch from a conservative to a greedy policy without violating the safety constraints. We prove that this algorithm is safe under the given assumptions. Empirically, we show that even if safety and reward signals are contradictory, this algorithm always operates safely, while when they are aligned, this approach also improves the agent's performance. Finally, we study how to reduce the performance regret of this algorithm without sacrificing the safety guarantees. To summarize, we develop new RL methods exploiting prior knowledge about the structure of the problem. We propose reliable offline algorithms that can improve the policy using fewer data and online algorithms that comply with safety constraints while learning. Besides safety and reliability, we also touch on other issues preventing the deployment of RL to real-world tasks, such as data efficiency and learning with a fixed batch of data. Nevertheless, we must recall that other challenges, such as partial-observability and explainability, still require attention. We hope this thesis serves as a stepping stone toward combining different types of prior knowledge to improve various aspects of RL. ...
Safety is critical to broadening the real-world use of reinforcement learning. Modeling the safety aspects using a safety-cost signal separate from the reward and bounding the expected safety-cost is becoming standard practice, since it avoids the problem of finding a good balance between safety and performance. However, it can be risky to set constraints only on the expectation neglecting the tail of the distribution, which might have prohibitively large values. In this paper, we propose a method called Worst-Case Soft Actor Critic for safe RL that approximates the distribution of accumulated safety-costs to achieve risk control. More specifically, a certain level of conditional Value-at-Risk from the distribution is regarded as a safety constraint, which guides the change of adaptive safety weights to achieve a trade-off between reward and safety. As a result, we can compute policies whose worst-case performance satisfies the constraints. We investigate two ways to estimate the safety-cost distribution, namely a Gaussian approximation and a quantile regression algorithm. On the one hand, the Gaussian approximation is simple and easy to implement, but may underestimate the safety cost, on the other hand, the quantile regression leads to a more conservative behavior. The empirical analysis shows that the quantile regression method achieves excellent results in complex safety-constrained environments, showing good risk control. ...
Conference paper (2022) - Danial Kamran, Thiago D. Simão, Qisong Yang, Canmanie T. Ponnambalam, Johannes Fischer, Matthijs T.J. Spaan, Martin Lauer
The use of reinforcement learning (RL) in real-world domains often requires extensive effort to ensure safe behavior. While this compromises the autonomy of the system, it might still be too risky to allow a learning agent to freely explore its environment. These strict impositions come at the cost of flexibility and applying them often relies on complex parameters and hard-coded knowledge modelled by the reward function. Autonomous driving is one such domain that could greatly benefit from more efficient and verifiable methods for safe automation. We propose to approach the automated driving problem using constrained RL, a method that automates the trade off between risk and utility, thereby significantly reducing the burden on the designer. We first show that an engineered reward function for ensuring safety and utility in one specific environment might not result in the optimal behavior when traffic dynamics changes in the exact environment. Next we show how algorithms based on constrained RL which are more robust to the environmental disturbances can address this challenge. These algorithms use a simple and easy to interpret reward and cost function, and are able to maintain both, efficiency and safety without requiring reward parameter tuning. We demonstrate our approach in the automated merging scenario with different traffic configurations such as low or high chance of cooperative drivers and different cooperative driving strategies. ...
Conference paper (2022) - Q. Yang, T. D. Simão, Simon H. Tindemans, M.T.J. Spaan
Safety is critical to broadening the real-world use of reinforcement learning (RL). Modeling the safety aspects using a safety-cost signal separate from the reward is becoming standard practice, since it avoids the problem of finding a good balance between safety and performance. However, the total safety-cost distribution of different trajectories is still largely unexplored. In this paper, we propose an actor critic method for safe RL that uses an implicit quantile network to approximate the distribution of accumulated safety-costs. Using an accurate estimate of the distribution of accumulated safetycosts, in particular of the upper tail of the distribution, greatly improves the performance of riskaverse RL agents. The empirical analysis shows that our method achieves good risk control in complex safety-constrained environments. ...
Conference paper (2022) - Q. Yang, T. D. Simão, Nils Jansen, Simon H. Tindemans, M.T.J. Spaan
Safety is critical to broadening the a lication of reinforcement learning (RL). Often, RL agents are trained in a controlled environment, such as a laboratory, before being de loyed in the real world. However, the target reward might be unknown rior to de loyment. Reward-free RL addresses this roblem by training an agent without the reward to ada t quickly once the reward is revealed. We consider the constrained reward-free setting, where an agent (the guide) learns to ex lore safely without the reward signal. This agent is trained in a controlled environment, which allows unsafe interactions and still rovides the safety signal. After the target task is revealed, safety violations are not allowed anymore. Thus, the guide is leveraged to com ose a safe sam ling olicy. Drawing from transfer learning, we also regularize a target olicy (the student) towards the guide while the student is unreliable and gradually eliminate the influence from the guide as training rogresses. The em irical analysis shows that this method can achieve safe transfer learning and hel s the student solve the target task faster. ...

Solving Hidden Parameter MDPs with Hindsight

Conference paper (2022) - Canmanie Ponnambalam, Danial Kamran, Thiago D. Simão, Frans A. Oliehoek, Matthijs T.J. Spaan
Conference paper (2021) - T. D. Simão, Nils Jansen, M.T.J. Spaan
Deploying reinforcement learning (RL) involves major concerns around safety. Engineering a reward signal that allows the agent to maximize its performance while remaining safe is not trivial. Safe RL studies how to mitigate such problems. For instance, we can decouple safety from reward using constrained Markov decision processes (CMDPs), where an independent signal models the safety aspects. In this setting, an RL agent can autonomously find tradeoffs between performance and safety. Unfortunately, most RL agents designed for CMDPs only guarantee safety after the learning phase, which might prevent their direct deployment. In this work, we investigate settings where a concise abstract model of the safety aspects is given, a reasonable assumption since a thorough understanding of safety-related matters is a prerequisite for deploying RL in typical applications. Factored CMDPs provide such compact models when a small subset of features describe the dynamics relevant for the safety constraints. We propose an RL algorithm that uses this abstract model to learn policies for CMDPs safely, that is without violating the constraints. During the training process, this algorithm can seamlessly switch from a conservative policy to a greedy policy without violating the safety constraints. We prove that this algorithm is safe under the given assumptions. Empirically, we show that even if safety and reward signals are contradictory, this algorithm always operates safely and, when they are aligned, this approach also improves the agent's performance. ...
Conference paper (2021) - Q. Yang, T. D. Simão, S.H. Tindemans, M.T.J. Spaan
Safe exploration is regarded as a key priority area for reinforcement learning research. With separate reward and safety signals, it is natural to cast it as constrained reinforcement learning, where expected long-term costs of policies are constrained. However, it can be hazardous to set constraints on the expected safety signal without considering the tail of the distribution. For instance, in safety-critical domains, worst-case analysis is required to avoid disastrous results. We present a novel reinforcement learning algorithm called Worst-Case Soft Actor Critic, which extends the Soft Actor Critic algorithm with a safety critic to achieve risk control. More specifically, a certain level of conditional Value-at- Risk from the distribution is regarded as a safety measure to judge the constraint satisfaction, which guides the change of adaptive safety weights to achieve a trade-off between reward and safety. As a result, we can optimize policies under the premise that their worst-case performance satisfies the constraints. The empirical analysis shows that our algorithm attains better risk control compared to expectation-based methods. ...
Conference paper (2020) - Thiago D. Simão, Romain Laroche, Rémi Tachet des Combes
Previous work has shown the unreliability of existing algorithms in the batch Reinforcement Learning setting, and proposed the theoretically-grounded Safe Policy Improvement with Baseline Bootstrapping (SPIBB) fix: reproduce the baseline policy in the uncertain state-action pairs, in order to control the variance on the trained policy performance. However, in many real-world applications such as dialogue systems, pharmaceutical tests or crop management, data is collected under human supervision and the baseline remains unknown. In this paper, we apply SPIBB algorithms with a baseline estimate built from the data. We formally show safe policy improvement guarantees over the true baseline even without direct access to it. Our empirical experiments on finite and continuous states tasks support the theoretical findings. It shows little loss of performance in comparison with SPIBB when the baseline policy is given, and more importantly, drastically and significantly outperforms competing algorithms both in safe policy improvement, and in average performance. ...
Conference paper (2019) - Thiago D. Simão, Matthijs T.J. Spaan
We investigate how Safe Policy Improvement (SPI) algorithms can exploit the structure of factored Markov decision processes when such structure is unknown a priori. To facilitate the application of reinforcement learning in the real world, SPI provides probabilistic guarantees that policy changes in a running process will improve the performance of this process. However, current SPI algorithms have requirements that might be impractical, such as: (i) availability of a large amount of historical data, or (ii) prior knowledge of the underlying structure. To overcome these limitations we enhance a Factored SPI (FSPI) algorithm with different structure learning methods. The resulting algorithms need fewer samples to improve the policy and require weaker prior knowledge assumptions. In well-factorized domains, the proposed algorithms improve performance significantly compared to a flat SPI algorithm, demonstrating a sample complexity closer to an FSPI algorithm that knows the structure. This indicates that the combination of FSPI and structure learning algorithms is a promising solution to real-world problems involving many variables. ...
Conference paper (2019) - Thiago D. Simão, Matthijs T.J. Spaan
We present a novel safe reinforcement learning algorithm that exploits the factored dynamics of the environment to become less conservative. We focus on problem settings in which a policy is already running and the interaction with the environment is limited. In order to safely deploy an updated policy, it is necessary to provide a confidence level regarding its expected performance. However, algorithms for safe policy improvement might require a large number of past experiences to become confident enough to change the agent’s behavior. Factored reinforcement learning, on the other hand, is known to make good use of the data provided. It can achieve a better sample complexity by exploiting independence between features of the environment, but it lacks a confidence level. We study how to improve the sample efficiency of the safe policy improvement with baseline bootstrapping algorithm by exploiting the factored structure of the environment. Our main result is a theoretical bound that is linear in the number of parameters of the factored representation instead of the number of states. The empirical analysis shows that our method can improve the policy using a number of samples potentially one order of magnitude smaller than the flat algorithm. ...
Conference paper (2019) - Thiago D. Simão
Reinforcement Learning (RL) deals with problems that can be modeled as a Markov Decision Process (MDP) where the transition function is unknown. In situations where an arbitrary policy π is already in execution and the experiences with the environment were recorded in a batch D, an RL algorithm can use D to compute a new policy π 0. However, the policy computed by traditional RL algorithms might have worse performance compared to π. Our goal is to develop safe RL algorithms, where the agent has a high confidence that the performance of π 0 is better than the performance of π given D. To develop sample-efficient and safe RL algorithms we combine ideas from exploration strategies in RL with a safe policy improvement method. ...