Towards unbiased action value estimation in reinforcement learning

Journal Article (2026)
Author(s)

Yuan Xue (L3S Research Centre)

Daniel Kudenko (L3S Research Centre)

Megha Khosla (TU Delft - Multimedia Computing)

DOI
https://doi.org/10.1016/j.neucom.2025.131581
Publication Year
2026
Language
English
Journal title
Neurocomputing
Volume number
683
Article number
131581
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Q-learning, a well-known reinforcement learning algorithm, is prone to overestimation of action values. This overestimation stems mainly from the use of the maximization operator when updating the Q function. Although existing approaches attempt to reduce overestimation bias, they typically retain the maximization or minimization operator in the update process. Recognizing that these operators are the root cause of biased value estimation, we aim to eliminate them altogether. An existing tabular RL algorithm, QV-learning, jointly learns a state-value function and an action-value function without using the maximization or minimization operator; however, its overestimation bias has not been analyzed in prior work. We fill this gap by conducting a targeted evaluation of QV-learning with experience replay, demonstrating that it substantially reduces overestimation bias and achieves superior sample efficiency. Notably, we provide a theoretical analysis of the convergence of QV-learning to the optimal solution, which is absent from prior studies. Moreover, we propose a novel deep RL extension of QV-learning, called Deep VQ-Networks (DVQN). Given the noisy learning environment in the deep RL setting, DVQN accounts for the exploration policy's bias towards overestimated actions, thereby reducing the collection of poor data caused by overestimation and improving training efficiency. We evaluate DVQN across ten Atari game domains and demonstrate that it achieves performance superior to or comparable with baselines including Deep Q Networks, Deep SARSA, Deep Double Q Networks, Clipped Deep Double Q Networks, Averaged DQN, Dueling DQN and DQV-learning.
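
For illustration, below is a minimal tabular sketch of the QV-learning update the abstract describes: both the state-value table V and the action-value table Q regress toward the same TD target r + gamma * V(s'), so no maximization or minimization over actions appears in the update. The function name, learning rates alpha and beta, and the terminal-state handling are illustrative assumptions, not details taken from the paper.

import numpy as np

def qv_learning_step(Q, V, s, a, r, s_next, done,
                     alpha=0.1, beta=0.1, gamma=0.99):
    # Shared TD target built from the state-value function only;
    # no max over next actions (the source of overestimation bias).
    target = r if done else r + gamma * V[s_next]
    V[s] += beta * (target - V[s])         # state-value update
    Q[s, a] += alpha * (target - Q[s, a])  # action-value update

# Illustrative usage on a toy 5-state, 2-action problem:
V = np.zeros(5)
Q = np.zeros((5, 2))
qv_learning_step(Q, V, s=0, a=1, r=1.0, s_next=2, done=False)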