Formal Verification in Uncertain POMDPs
More Info
expand_more
Abstract
Recurrent neural networks (RNNs) have emerged as an effective method for policy learning in partially observable Markov Decision Processes (POMDPs). However, a major drawback of RNN-based policies is the difficulty to formally verify behavioural specifications, e.g. with regard to reachability and expected cost. In accordance with previous work, we insert a quantized bottleneck network (QBN) to the RNN that learns a mapping from the latent memory states to quantized vectors, which enables the extraction of a finite-state controller (FSC) representation of the RNN. This FSC, together with a POMDP model description, constitutes a policy-induced Discrete-Time Markov Chain (DTMC) that allows us to use efficient formal verification methods. For the scenarios in which the FSC fails to satisfy the behavioural specification, the verification method generates diagnostic information in the form of critical examples. These critical examples can be used to re-train the RNN and extract an updated FSC. In particular, we consider the synthesis of policies with formally verified satisfaction of behavioural specifications in uncertain POMDPs (uPOMDPs), where the uncertainty is expressed by polynomial parameterizations of the transition and/or observation probabilities. The uPOMDP describes a set of possible POMDPs that can be used to express imperfect knowledge of the environment, e.g. because the POMDP is an abstraction of real-world dynamics. Our goal is to learn and extract an FSC that has the best worst-case performance among all possible instantiations of the uPOMDP. To this end, our approach iteratively determines the POMDP among the uPOMDP set that is most critical for the extracted FSCs with respect to the specification under consideration. This POMDP reflects the model for which the FSC requires most improvement, and is thus used to retrain the RNN. Although applicable to any linear-temporal logic (LTL) specification, we test our implementation on benchmark problems with expected cost specifications. Our results suggest that our approach is able to extract well-performing FSCs for uPOMDPs where long-term memory is required.