Sequential decision-making under uncertainty is an important branch of artificial intelligence research with a plethora of real-life applications. In this thesis, we generalize two fundamental properties of the decision-making process. First, we show that the theory on planning methods for finite spaces can be extended to infinite but countable spaces. Second, we propose a unified model of reinforcement learning algorithms that employ the principle of optimism in the face of uncertainty. This model is used to explain why these methods are efficient. We use the developed theory to design novel algorithms. Depending on the user's needs, these algorithms can either automate the decision-making process completely, or provide advice in decision-support systems.
We start with presenting the basic concepts from the theory of decision-making and discuss the two approaches to it: planning and reinforcement learning. We look at a few typical sequential decision-making problems of increasing difficulty. In particular, we present a game that involves grid navigation and the problems of warehouse management and wind farm operation. Next, we survey the state-of-the-art methods for solving such problems.
Based on this analysis, we identify the following research opportunities. In planning, models with non-stationary and countably-infinite data remain relatively untreated because they are equivalent to infinitely-dimensional optimization problems, which are notoriously difficult to solve even approximately. In reinforcement learning, optimistic approaches lead to computational efficiency, yet the theory of optimism remains undeveloped. Moreover, while reinforcement learning shines at playing games, such as chess, shōgi, Go, and StarCraft II, its practical applications remain few.
Next, we overview a mathematical framework of sequential decision-making under uncertainty known as the Markov decision process. We explain how the goal of the decision-maker can be expressed as an optimization problem and present two approaches to achieving this goal. The first—more common—approach assigns so-called values to different actions. The other approach uses so-called occupancies that tell how often the agent should choose the actions instead of evaluating how good these actions are. In fact, the two approaches are known to be dual to each other. While this duality is well studied in the finite case, the infinite case is less explored. To address this knowledge gap, we present a new dual formulation for countable problems, both finite and infinite.
Afterwards, we use the dual formulation to design a new planning algorithm for infinite-horizon problems with non-stationary data. These problems are essentially infinite-dimensional optimization problems and as such are impossible to solve exactly using the standard approaches. We show that they can be solved by changing what is defined as optimal behavior: instead of seeking universally optimal policies, we consider initial-decision-optimal ones. Instead of planning all of the actions beforehand, these policies can be used to plan given the currently observed data. When the next decision is required, the process can be repeated in the same manner, leading to an optimal decision-making strategy. Our approach uses the occupancy-value duality to rule out suboptimal actions based on so-called truncations: finite-time approximations of the infinite-horizon decision-making problem.
We extend the truncation approach to a more general setting of decision-making problems with countably-infinite state spaces. Instead of time-based truncations, we consider state-based ones. This allows us to limit the amount of data required to make the decisions and to design an algorithm for a class of problems that are otherwise unsolvable to optimality. This approach belongs to a family of methods called policy iteration: starting from an initial policy, it constructs a series of improvements in the decisions while ruling out choices that are provably suboptimal.
After that, we turn to reinforcement learning. For a long time, the only provably efficient reinforcement-learning methods were model-based ones; recently, a family of model-free optimistic methods emerged, each of them accompanied by an analysis of how sample-efficient the method is. We, too, study optimistic reinforcement learning, but in contrast to the existing research, we seek to understand not how efficient it is, but why it is efficient. Our analysis results in a formula that explains the three factors that cause regret—the efficiency loss—in optimistic reinforcement learning: the problem size, the measure of exploration, and the estimation error caused by the mismatch between the realized transitions and their true distribution. It can be applied to all of the existing algorithms as well as new ones. We design one such new algorithm and show how our theoretical framework can facilitate the proof of its efficiency.
Finally, we consider a high-impact real-world sequential decision-making problem known as active wake control. Wind turbines can negatively impact each other with their wakes. These wake-induced losses can be reduced by changing the turbine orientations. Unfortunately, the optimal control strategy is non-trivial. To address this, existing approaches use simplified wake models in combination with numerical optimization methods; instead we propose to use model-free reinforcement learning. As a first step towards this goal, we present a wind farm simulator that is suitable for reinforcement learning and better reflects the realities of wind farm operation than other existing tools. Using this simulator, we show that previous research used a suboptimal action representation in this problem; we identify two alternatives, both of which improve the learning efficiency. Additionally, we demonstrate that reinforcement learning is robust to errors in the observations, providing further evidence that it is a fitting approach to active wake control.
Our contributions advance the state of the art in the theory of sequential decision-making under uncertainty and its applications. These advances hint at unexplored connections between countably-infinite planning and optimistic learning, which may lead to even more efficient algorithms for sequential decision-making under uncertainty in the future.