Value Improved Actor Critic Algorithms

Preprint (2024)

Author(s)

Y. Oren (TU Delft - Electrical Engineering, Mathematics and Computer Science)

M.A. Zanger (TU Delft - Electrical Engineering, Mathematics and Computer Science)

P.R. van der Vaart (TU Delft - Electrical Engineering, Mathematics and Computer Science)

M.T.J. Spaan (TU Delft - Electrical Engineering, Mathematics and Computer Science)

J.W. Böhmer (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Research Group

Sequential Decision Making

DOI related publication

https://doi.org/10.48550/arXiv.2406.01423 Submitted manuscript

To reference this document use

https://resolver.tudelft.nl/uuid:a6d2d7a2-54b7-4b73-a5c8-c3ca71a0a6a0

Publication Year

2024

Language

English

Abstract

Many modern reinforcement learning algorithms build on the actor-critic (AC) framework: iterative improvement of a policy (the actor) using policy improvement operators and iterative approximation of the policy's value (the critic). In contrast, the popular value-based algorithm family employs improvement operators in the value update to iteratively improve the value function directly. In this work, we propose a general extension to the AC framework that employs two separate improvement operators: one applied to the policy in the spirit of policy-based algorithms and one applied to the value in the spirit of value-based algorithms, which we dub Value-Improved AC (VI-AC). We design two practical VI-AC algorithms based on the popular online off-policy AC algorithms TD3 and DDPG. We evaluate VI-TD3 and VI-DDPG on the MuJoCo benchmark and find that both improve upon or match the performance of their respective baselines in all environments tested.
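To make the two-operator idea concrete, the following is a minimal tabular sketch in Python. It is not the VI-TD3 or VI-DDPG implementation: the improvement operator used here (reweighting the policy by exp(Q / temperature)) is an assumption chosen for illustration, and all function names and temperature parameters are hypothetical. It contrasts a standard AC bootstrap target, which evaluates the current policy, with a target that first applies a second improvement operator inside the value update.

```python
import math

def tilt(pi, q, temperature):
    """Illustrative improvement operator (an assumption, not the paper's):
    reweight pi by exp(q / temperature) and renormalise."""
    m = max(q)  # subtract the max for numerical stability
    weights = [p * math.exp((v - m) / temperature) for p, v in zip(pi, q)]
    total = sum(weights)
    return [w / total for w in weights]

def expected_q(pi, q):
    """Expected action value of a state under policy pi."""
    return sum(p * v for p, v in zip(pi, q))

def ac_target(reward, gamma, pi_next, q_next):
    """Standard AC critic target: evaluates the current policy."""
    return reward + gamma * expected_q(pi_next, q_next)

def vi_ac_target(reward, gamma, pi_next, q_next, value_temperature):
    """Value-improved target: a second improvement operator is applied
    to the policy inside the value update only."""
    improved = tilt(pi_next, q_next, value_temperature)
    return reward + gamma * expected_q(improved, q_next)

def actor_step(pi, q, policy_temperature):
    """Actor update: the first improvement operator, applied to the policy."""
    return tilt(pi, q, policy_temperature)

# Hypothetical next-state policy and action values.
pi_next = [0.5, 0.5]
q_next = [1.0, 3.0]

standard = ac_target(0.0, 0.9, pi_next, q_next)
improved = vi_ac_target(0.0, 0.9, pi_next, q_next, value_temperature=1.0)
new_policy = actor_step(pi_next, q_next, policy_temperature=5.0)
```

In this sketch the two operators are decoupled through separate temperatures, so the value update can be greedier than the actor update. As the value temperature shrinks, the target approaches the max over actions used by value-based methods; as it grows, the target approaches the standard AC target.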