Towards unbiased action value estimation in reinforcement learning

Journal Article (2026)
Author(s)

Yuan Xue (L3S Research Centre)

Daniel Kudenko (L3S Research Centre)

Megha Khosla (TU Delft - Multimedia Computing)

DOI
https://doi.org/10.1016/j.neucom.2025.131581
Publication Year
2026
Language
English
Journal title
Neurocomputing
Volume number
683
Article number
131581
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Q-learning, a well-known reinforcement learning algorithm, is prone to overestimation of action values. This overestimation stems mainly from the use of the maximization operator when updating the Q function. Although existing approaches attempt to reduce overestimation bias, they typically retain the maximization or minimization operator in the update process. Recognizing that these operators are the root cause of biased value estimation, we aim to eliminate them altogether. An existing tabular RL algorithm, QV-learning, jointly learns a state-value function and an action-value function without using the maximization or minimization operator; however, its overestimation bias has not been analyzed in prior work. We fill this gap by conducting a targeted evaluation of QV-learning with experience replay, demonstrating that it substantially reduces overestimation bias and achieves superior sample efficiency. Notably, we provide a theoretical analysis of the convergence of QV-learning to the optimal solution, which is absent from prior studies. Moreover, we propose a novel deep RL extension of QV-learning, called Deep VQ-Networks (DVQN). Given the noisy learning environment in the deep RL setting, DVQN accounts for the exploration policy's bias towards overestimated actions, thereby reducing the collection of poor data caused by overestimation and improving training efficiency. We evaluate DVQN across ten Atari game domains and demonstrate that it achieves performance superior to or comparable with baselines including Deep Q Networks, Deep SARSA, Deep Double Q Networks, Clipped Deep Double Q Networks, Averaged DQN, Dueling DQN and DQV-learning.
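
For illustration, below is a minimal tabular sketch of the QV-learning update the abstract describes: both the state-value table V and the action-value table Q regress toward the same TD target r + gamma * V(s'), so no maximization or minimization over actions appears in the update. The function name, learning rates alpha and beta, and the terminal-state handling are illustrative assumptions, not details taken from the paper.

import numpy as np

def qv_learning_step(Q, V, s, a, r, s_next, done,
                     alpha=0.1, beta=0.1, gamma=0.99):
    # Shared TD target built from the state-value function only;
    # no max over next actions (the source of overestimation bias).
    target = r if done else r + gamma * V[s_next]
    V[s] += beta * (target - V[s])         # state-value update
    Q[s, a] += alpha * (target - Q[s, a])  # action-value update

# Illustrative usage on a toy 5-state, 2-action problem:
V = np.zeros(5)
Q = np.zeros((5, 2))
qv_learning_step(Q, V, s=0, a=1, r=1.0, s_next=2, done=False)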