Interpretable Reinforcement Learning for Continuous Action Environments
Extending DTPO for Continuous Action Spaces and Evaluating Competitiveness with RPO
M.Z. Kaptein (TU Delft - Electrical Engineering, Mathematics and Computer Science)
A. Lukina – Mentor (TU Delft - Algorithmics)
D.A. Vos – Mentor (TU Delft - Algorithmics)
L. Cavalcante Siebert – Graduation committee member (TU Delft - Interactive Intelligence)
Abstract
This research addresses the challenge of interpretability in Reinforcement Learning (RL) for environments with continuous action spaces by extending the Decision Tree Policy Optimization (DTPO) algorithm, which was originally developed for discrete action spaces.
Unlike deep RL methods such as Proximal Policy Optimization (PPO), which are effective but difficult to interpret, DTPO offers transparent rule-based policies. We propose a continuous-action variant of the DTPO algorithm, DTPO-c, in which the decision tree outputs the parameters of a Gaussian distribution over actions while remaining interpretable. Our experiments on the Pendulum-v1 environment show that DTPO-c can achieve performance comparable to Robust Policy Optimization (RPO), although it requires more computational effort. Additionally, we investigate the impact of discretizing continuous actions and find that increasing action resolution does not always improve performance, likely due to limited model capacity. These results demonstrate the feasibility of interpretable RL in continuous environments, making such methods suitable for applications where understanding and trusting the model's behavior is important.
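To make the policy representation described above concrete, the following is a minimal Python sketch, not the thesis's actual DTPO-c implementation: a hand-built decision tree whose leaves store Gaussian parameters (mean, log standard deviation) over Pendulum-v1's one-dimensional torque, followed by the kind of evenly spaced action grid a discretization baseline would use. The class and function names (Leaf, Node, leaf_for, sample_action), the example split, and all numeric values are hypothetical illustrations; only the Pendulum-v1 observation layout and torque bounds are taken from the environment.

import numpy as np

# Illustrative sketch only (not the thesis code): a tiny decision-tree policy for
# Pendulum-v1, whose 3-dim observation is [cos(theta), sin(theta), theta_dot].
# Each leaf stores the parameters (mean, log_std) of a Gaussian over the torque.

class Leaf:
    def __init__(self, mean, log_std):
        self.mean = mean          # mean torque proposed by this leaf
        self.log_std = log_std    # log of the Gaussian standard deviation

class Node:
    def __init__(self, feature, threshold, left, right):
        self.feature = feature      # index into the observation vector
        self.threshold = threshold  # split value
        self.left = left            # subtree if obs[feature] <= threshold
        self.right = right          # subtree otherwise

def leaf_for(node, obs):
    """Walk the tree until a leaf is reached; the root-to-leaf path is the rule."""
    while isinstance(node, Node):
        node = node.left if obs[node.feature] <= node.threshold else node.right
    return node

def sample_action(tree, obs, rng, low=-2.0, high=2.0):
    """Sample a torque from the Gaussian stored in the selected leaf."""
    leaf = leaf_for(tree, obs)
    a = rng.normal(leaf.mean, np.exp(leaf.log_std))
    return np.clip(a, low, high)    # Pendulum-v1 torque is bounded to [-2, 2]

# Hypothetical example tree: split on theta_dot (feature 2) and push against it.
tree = Node(feature=2, threshold=0.0,
            left=Leaf(mean=+1.0, log_std=-1.0),
            right=Leaf(mean=-1.0, log_std=-1.0))

rng = np.random.default_rng(0)
obs = np.array([0.9, 0.44, -1.3])        # a made-up observation
print(sample_action(tree, obs, rng))     # a torque near +1

# Discretization baseline: "action resolution" K evenly spaced torques; a
# discrete-action tree would select an index k and apply torques[k].
K = 9
torques = np.linspace(-2.0, 2.0, K)

The root-to-leaf path that selects each Gaussian is a short, human-readable rule, which is what makes such tree policies interpretable compared to a neural network policy.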