Interpretable Reinforcement Learning for Continuous Action Environments

Extending DTPO for Continuous Action Spaces and Evaluating Competitiveness with RPO

Bachelor Thesis (2025)
Author(s)

M.Z. Kaptein (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

A. Lukina – Mentor (TU Delft - Algorithmics)

D.A. Vos – Mentor (TU Delft - Algorithmics)

L. Cavalcante Siebert – Graduation committee member (TU Delft - Interactive Intelligence)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2025
Language
English
Graduation Date
24-06-2025
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

This research addresses the challenge of interpretability in Reinforcement Learning (RL) for environments with continuous action spaces by extending the Decision Tree Policy Optimization (DTPO) algorithm, originally developed for discrete action spaces. Unlike deep RL methods such as Proximal Policy Optimization (PPO), which are effective but difficult to interpret, DTPO produces transparent, rule-based policies. We propose a continuous-action variant of DTPO, DTPO-c, in which decision trees output the parameters of a Gaussian action distribution while remaining interpretable. Our experiments on the Pendulum-v1 environment show that DTPO-c can achieve performance comparable to Robust Policy Optimization (RPO), although it requires more computational effort. Additionally, we investigate the impact of discretizing continuous actions and find that increasing the action resolution does not always improve performance, likely due to limited model capacity. These results confirm the feasibility of interpretable RL in continuous environments, making it suitable for applications where understanding and trusting the model's behavior is important.
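To make the mechanism concrete, below is a minimal Python sketch (not code from the thesis) of the core idea behind DTPO-c: a decision-tree policy whose leaves store the parameters of a Gaussian action distribution, so every sampled action traces back to a human-readable rule. The names GaussianLeaf, SplitNode, and act, as well as the example split on Pendulum-v1's angular velocity, are illustrative assumptions rather than the paper's implementation.

    # Minimal sketch of a decision-tree policy with Gaussian leaves,
    # assuming Python 3.10+. Illustrative only; not the DTPO-c code.
    import math
    import random
    from dataclasses import dataclass

    @dataclass
    class GaussianLeaf:
        mean: float      # mean of the continuous action at this leaf
        log_std: float   # log standard deviation; exp() keeps std positive

        def sample(self) -> float:
            # Draw one action ~ N(mean, exp(log_std)^2)
            return random.gauss(self.mean, math.exp(self.log_std))

    @dataclass
    class SplitNode:
        feature: int     # index into the observation vector
        threshold: float # axis-aligned split, readable as an if-then rule
        left: "Node"
        right: "Node"

    Node = SplitNode | GaussianLeaf

    def act(tree: Node, obs: list[float]) -> float:
        # Route the observation down the tree, then sample from the
        # Gaussian stored at the leaf it lands in.
        node = tree
        while isinstance(node, SplitNode):
            node = node.left if obs[node.feature] <= node.threshold else node.right
        return node.sample()

    # Hypothetical two-leaf policy for Pendulum-v1 (obs = [cos t, sin t, t_dot]):
    # "if angular velocity <= 0, apply positive torque; else negative torque."
    policy = SplitNode(
        feature=2, threshold=0.0,
        left=GaussianLeaf(mean=1.0, log_std=-0.5),
        right=GaussianLeaf(mean=-1.0, log_std=-0.5),
    )
    print(act(policy, [1.0, 0.0, -0.3]))

The discretization baseline mentioned above can be sketched the same way: map k evenly spaced bins onto Pendulum-v1's torque range [-2, 2], where k is the action resolution. The helper name bin_to_torque is hypothetical.

    def bin_to_torque(index: int, k: int, low: float = -2.0, high: float = 2.0) -> float:
        # Convert a discrete action index in {0, ..., k-1} (k >= 2)
        # to a continuous torque in [low, high].
        return low + (high - low) * index / (k - 1)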

Files

Final-Research-Paper.pdf
(pdf | 2.59 MB)
License info not available