Resource-constrained Multi-agent Markov Decision Processes
Abstract
Intelligent autonomous agents, designed to automate and simplify many aspects of our society, will increasingly be required to also interact with other agents autonomously. Where agents interact, they are likely to encounter resource constraints. For example, agents managing household appliances to optimize electricity usage might need to share the limited capacity of the distribution grid.
This thesis describes research into new algorithms for optimizing the behavior of agents operating in constrained environments, when these agents have significant uncertainty about the effects of their actions on their state. Such systems are effectively modeled in the framework of constrained multi-agent Markov decision processes (MDPs). A single-agent MDP captures the uncertainty in the outcomes of the actions chosen by a specific agent: it provides a probabilistic model of state transitions, describing the likelihood of arriving in a future state conditional on the current state and action. Agents collect rewards or penalties depending on the current state and chosen action, giving them the objective of maximizing their expected reward. To include constraints, resource consumption functions are added to the actions, and the agents' (shared) objective is modified with a condition restricting their (cumulative) resource consumption.

We propose novel algorithms that advance the state of the art in three challenging settings: computing static preallocations off-line, computing dynamic (re)allocations on-line, and optimally learning the model dynamics through safe reinforcement learning under the constraints. Taken together, these algorithms show how agents can coordinate their actions under uncertainty and shared resource constraints in a broad range of conditions. Furthermore, the proposed solutions are complementary: static preallocations can serve as a back-up strategy when a communication disruption prevents the use of dynamic allocations.
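To make the constrained objective above concrete, one possible formalization (with illustrative notation, not taken from the thesis itself) is the following. Suppose agents i = 1, ..., n each follow a policy \pi_i in their own MDP with transition model P_i, reward function R_i, and resource consumption function C_i; over a horizon h with resource limits L_t, the shared problem can be written as

\[
\max_{\pi_1, \ldots, \pi_n} \; \sum_{i=1}^{n} \mathbb{E}\!\left[ \sum_{t=1}^{h} R_i\big(s_{i,t}, \pi_i(s_{i,t})\big) \right]
\quad \text{subject to} \quad
\sum_{i=1}^{n} \mathbb{E}\!\left[ C_i\big(s_{i,t}, \pi_i(s_{i,t})\big) \right] \le L_t \quad \text{for all } t,
\]

where each agent's state evolves according to s_{i,t+1} \sim P_i(\,\cdot \mid s_{i,t}, \pi_i(s_{i,t})). This sketch states the constraint in expectation; stricter variants instead bound the cumulative consumption in every realization.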