Learning Complex Policy Distribution with CEM Guided Adversarial Hypernetwork

Conference Paper (2021)
Author(s)

Shi Yuan Tang (Nanyang Technological University)

Frans Oliehoek (TU Delft - Interactive Intelligence)

Athirai A. Irissappane (University of Washington)

Jie Zhang (Nanyang Technological University)

Publication Year
2021
Language
English
Copyright
© 2021 Shi Yuan Tang, F.A. Oliehoek, Athirai A. Irissappane, Jie Zhang
Research Group
Interactive Intelligence
Pages (from-to)
1308-1316
ISBN (electronic)
9781450383073
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward, or distribute the text or any part of it without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

The Cross-Entropy Method (CEM) is a gradient-free direct policy search method, offering greater stability and less sensitivity to hyperparameter tuning than gradient-based methods. CEM is similar to population-based evolutionary methods but, rather than maintaining an explicit population, it maintains a distribution over candidate solutions (policies, in our case). Usually, a natural exponential family distribution such as the multivariate Gaussian is used to parameterize the policy distribution. Using a multivariate Gaussian limits the quality of CEM policies, as the search becomes confined to a less representative subspace. We address this drawback by using an adversarially-trained hypernetwork, enabling a richer and more complex representation of the policy distribution. To achieve better training stability and faster convergence, we use a multivariate Gaussian CEM policy to guide our adversarial training process. Experiments demonstrate that our approach outperforms state-of-the-art CEM-based methods by 15.8% in terms of rewards while achieving faster convergence. Results also show that our approach is less sensitive to hyperparameters than other deep-RL methods such as REINFORCE, DDPG, and DQN.
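For context, the sketch below shows the vanilla multivariate-Gaussian CEM loop that the paper improves upon. It is a minimal illustration, not the authors' implementation; the `evaluate_policy` callback, parameter names, and default values are all assumptions.

```python
import numpy as np

def cem_gaussian(evaluate_policy, dim, n_iters=100, pop_size=64, elite_frac=0.2):
    """Vanilla CEM with a diagonal multivariate-Gaussian policy distribution.

    evaluate_policy: maps a flat policy-parameter vector to a scalar return.
    Returns the final mean parameter vector.
    """
    mean = np.zeros(dim)   # mean of the distribution over policy weights
    std = np.ones(dim)     # per-dimension standard deviation
    n_elite = max(1, int(pop_size * elite_frac))

    for _ in range(n_iters):
        # Sample candidate policies from the current Gaussian.
        candidates = mean + std * np.random.randn(pop_size, dim)
        returns = np.array([evaluate_policy(c) for c in candidates])

        # Keep the top-performing (elite) candidates ...
        elites = candidates[np.argsort(returns)[-n_elite:]]

        # ... and refit the Gaussian to them: a gradient-free update.
        mean = elites.mean(axis=0)
        std = elites.std(axis=0) + 1e-6  # small floor avoids premature collapse

    return mean
```

The paper's contribution replaces this fixed Gaussian parameterization: candidate policies are instead sampled from an adversarially-trained hypernetwork, while a Gaussian CEM policy of the kind sketched above guides the adversarial training for stability and faster convergence.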

Files

P1308.pdf
(pdf | 2.57 MB)
License info not available