Interpretable Reinforcement Learning for Continuous Action Environments

Extending DTPO for Continuous Action Spaces and Evaluating Competitiveness with RPO

Bachelor Thesis (2025)
Author(s)

M.Z. Kaptein (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

A. Lukina – Mentor (TU Delft - Algorithmics)

D.A. Vos – Mentor (TU Delft - Algorithmics)

L. Cavalcante Siebert – Graduation committee member (TU Delft - Interactive Intelligence)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2025
Language
English
Graduation Date
24-06-2025
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

This research addresses the challenge of interpretability in Reinforcement Learning (RL) for environments with continuous action spaces by extending the Decision Tree Policy Optimization (DTPO) algorithm, originally developed for discrete action spaces. Unlike deep RL methods such as Proximal Policy Optimization (PPO), which are effective but difficult to interpret, DTPO produces transparent, rule-based policies. We propose a continuous-action variant of DTPO, DTPO-c, in which decision trees output the parameters of a Gaussian action distribution while remaining interpretable. Our experiments on the Pendulum-v1 environment show that DTPO-c can achieve performance comparable to Robust Policy Optimization (RPO), although it requires more computational effort. Additionally, we investigate the impact of discretizing continuous actions and find that increasing the action resolution does not always improve performance, likely due to limited model capacity. These results confirm the feasibility of interpretable RL in continuous environments, making it suitable for applications where understanding and trusting the model's behavior is important.
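To make the mechanism concrete, below is a minimal Python sketch (not code from the thesis) of the core idea behind DTPO-c: a decision-tree policy whose leaves store the parameters of a Gaussian action distribution, so every sampled action traces back to a human-readable rule. The names GaussianLeaf, SplitNode, and act, as well as the example split on Pendulum-v1's angular velocity, are illustrative assumptions rather than the paper's implementation.

    # Minimal sketch of a decision-tree policy with Gaussian leaves,
    # assuming Python 3.10+. Illustrative only; not the DTPO-c code.
    import math
    import random
    from dataclasses import dataclass

    @dataclass
    class GaussianLeaf:
        mean: float      # mean of the continuous action at this leaf
        log_std: float   # log standard deviation; exp() keeps std positive

        def sample(self) -> float:
            # Draw one action ~ N(mean, exp(log_std)^2)
            return random.gauss(self.mean, math.exp(self.log_std))

    @dataclass
    class SplitNode:
        feature: int     # index into the observation vector
        threshold: float # axis-aligned split, readable as an if-then rule
        left: "Node"
        right: "Node"

    Node = SplitNode | GaussianLeaf

    def act(tree: Node, obs: list[float]) -> float:
        # Route the observation down the tree, then sample from the
        # Gaussian stored at the leaf it lands in.
        node = tree
        while isinstance(node, SplitNode):
            node = node.left if obs[node.feature] <= node.threshold else node.right
        return node.sample()

    # Hypothetical two-leaf policy for Pendulum-v1 (obs = [cos t, sin t, t_dot]):
    # "if angular velocity <= 0, apply positive torque; else negative torque."
    policy = SplitNode(
        feature=2, threshold=0.0,
        left=GaussianLeaf(mean=1.0, log_std=-0.5),
        right=GaussianLeaf(mean=-1.0, log_std=-0.5),
    )
    print(act(policy, [1.0, 0.0, -0.3]))

The discretization baseline mentioned above can be sketched the same way: map k evenly spaced bins onto Pendulum-v1's torque range [-2, 2], where k is the action resolution. The helper name bin_to_torque is hypothetical.

    def bin_to_torque(index: int, k: int, low: float = -2.0, high: float = 2.0) -> float:
        # Convert a discrete action index in {0, ..., k-1} (k >= 2)
        # to a continuous torque in [low, high].
        return low + (high - low) * index / (k - 1)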

Files

Final-Research-Paper.pdf
(pdf | 2.59 MB)
License info not available