Learning Complex Policy Distribution with CEM Guided Adversarial Hypernetwork

Conference Paper (2021)
Author(s)

Shi Yuan Tang (Nanyang Technological University)

Frans Oliehoek (TU Delft - Interactive Intelligence)

Athirai A. Irissappane (University of Washington)

Jie Zhang (Nanyang Technological University)

Publication Year
2021
Language
English
Copyright
© 2021 Shi Yuan Tang, F.A. Oliehoek, Athirai A. Irissappane, Jie Zhang
Research Group
Interactive Intelligence
Pages (from-to)
1308-1316
ISBN (electronic)
9781450383073
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward, or distribute the text or any part of it without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

The Cross-Entropy Method (CEM) is a gradient-free direct policy search method, offering greater stability and less sensitivity to hyperparameter tuning than gradient-based methods. CEM is similar to population-based evolutionary methods but, rather than maintaining an explicit population, it maintains a distribution over candidate solutions (policies, in our case). Usually, a natural exponential family distribution such as the multivariate Gaussian is used to parameterize the policy distribution. Using a multivariate Gaussian limits the quality of CEM policies, as the search becomes confined to a less representative subspace. We address this drawback by using an adversarially-trained hypernetwork, enabling a richer and more complex representation of the policy distribution. To achieve better training stability and faster convergence, we use a multivariate Gaussian CEM policy to guide our adversarial training process. Experiments demonstrate that our approach outperforms state-of-the-art CEM-based methods by 15.8% in terms of rewards while achieving faster convergence. Results also show that our approach is less sensitive to hyperparameters than other deep-RL methods such as REINFORCE, DDPG, and DQN.
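For context, the sketch below shows the vanilla multivariate-Gaussian CEM loop that the paper improves upon. It is a minimal illustration, not the authors' implementation; the `evaluate_policy` callback, parameter names, and default values are all assumptions.

```python
import numpy as np

def cem_gaussian(evaluate_policy, dim, n_iters=100, pop_size=64, elite_frac=0.2):
    """Vanilla CEM with a diagonal multivariate-Gaussian policy distribution.

    evaluate_policy: maps a flat policy-parameter vector to a scalar return.
    Returns the final mean parameter vector.
    """
    mean = np.zeros(dim)   # mean of the distribution over policy weights
    std = np.ones(dim)     # per-dimension standard deviation
    n_elite = max(1, int(pop_size * elite_frac))

    for _ in range(n_iters):
        # Sample candidate policies from the current Gaussian.
        candidates = mean + std * np.random.randn(pop_size, dim)
        returns = np.array([evaluate_policy(c) for c in candidates])

        # Keep the top-performing (elite) candidates ...
        elites = candidates[np.argsort(returns)[-n_elite:]]

        # ... and refit the Gaussian to them: a gradient-free update.
        mean = elites.mean(axis=0)
        std = elites.std(axis=0) + 1e-6  # small floor avoids premature collapse

    return mean
```

The paper's contribution replaces this fixed Gaussian parameterization: candidate policies are instead sampled from an adversarially-trained hypernetwork, while a Gaussian CEM policy of the kind sketched above guides the adversarial training for stability and faster convergence.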

Files

P1308.pdf
(pdf | 2.57 MB)
License info not available