Structure Learning for Safe Policy Improvement

None, None; None, None

Structure Learning for Safe Policy Improvement

Conference Paper (2019)

Author(s)

Thiago D. Simão (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Matthijs T.J. Spaan (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Research Group

Algorithmics

DOI related publication

https://doi.org/10.24963/ijcai.2019/479 Final published version

To reference this document use

https://resolver.tudelft.nl/uuid:fc2201e3-01f7-4623-85a4-e2634c0c172c

More Info

expand_more

Publication Year

2019

Language

English

Abstract

We investigate how Safe Policy Improvement (SPI) algorithms can exploit the structure of factored Markov decision processes when such structure is unknown a priori. To facilitate the application of reinforcement learning in the real world, SPI provides probabilistic guarantees that policy changes in a running process will improve the performance of this process. However, current SPI algorithms have requirements that might be impractical, such as: (i) availability of a large amount of historical data, or (ii) prior knowledge of the underlying structure. To overcome these limitations we enhance a Factored SPI (FSPI) algorithm with different structure learning methods. The resulting algorithms need fewer samples to improve the policy and require weaker prior knowledge assumptions. In well-factorized domains, the proposed algorithms improve performance significantly compared to a flat SPI algorithm, demonstrating a sample complexity closer to an FSPI algorithm that knows the structure. This indicates that the combination of FSPI and structure learning algorithms is a promising solution to real-world problems involving many variables.