Best-Response Bayesian Reinforcement Learning with Bayes-adaptive POMDPs for Centaurs

Conference Paper (2022)

Author(s)

M.M. Celikok (Aalto University)

F.A. Oliehoek (TU Delft - Interactive Intelligence)

Samuel Kaski (The University of Manchester, Aalto University)

Research Group

Interactive Intelligence

Keywords

Hybrid Intelligence, Bayesian Reinforcement Learning, Computational Rationality, Multiagent Learning

To reference this document use:

https://resolver.tudelft.nl/uuid:75bcb8d7-ed75-40a2-92bd-2dc660f21b42

Publication Year

2022

Language

English

Pages (from-to)

235-243

ISBN (electronic)

978-171385433-3

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Centaurs are half-human, half-AI decision-makers where the AI's goal is to complement the human. To do so, the AI must be able to recognize the goals and constraints of the human and have the means to help them. We present a novel formulation of the interaction between the human and the AI as a sequential game where the agents are modelled using Bayesian best-response models. We show that in this case the AI's problem of helping bounded-rational humans make better decisions reduces to a Bayes-adaptive POMDP. In our simulated experiments, we consider an instantiation of our framework for humans who are subjectively optimistic about the AI's future behaviour. Our results show that when equipped with a model of the human, the AI can infer the human's bounds and nudge them towards better decisions. We also discuss ways in which the machine can learn, with the help of the human, to overcome its own limitations. We identify a novel trade-off for centaurs in partially observable tasks: for the AI's actions to be acceptable to the human, the machine must make sure their beliefs are sufficiently aligned, but aligning beliefs might be costly. We present a preliminary theoretical analysis of this trade-off and its dependence on task structure.

Files

3535850.3535878.pdf

(pdf | 1.36 Mb)

- Embargo expired on 05-12-2022

License info not available