GN

G. Neustroev

info

Please Note

6 records found

Journal article (2025) - Grigory Neustroev, Mirco Giacobbe, Anna Lukina
We introduce for the first time a neural-certificate framework for continuous-time stochastic dynamical systems. Autonomous learning systems in the physical world demand continuous-time reasoning, yet existing learnable certificates for probabilistic verification assume discretization of the time continuum. Inspired by the success of training neural Lyapunov certificates for deterministic continuous-time systems and neural supermartingale certificates for stochastic discrete-time systems, we propose a framework that bridges the gap between continuous-time and probabilistic neural certification for dynamical systems under complex requirements. Our method combines machine learning and symbolic reasoning to produce formally certified bounds on the probabilities that a nonlinear system satisfies specifications of reachability, avoidance, and persistence. We present both the theoretical justification and the algorithmic implementation of our framework and showcase its efficacy on popular benchmarks. ...
Wind farms suffer from so-called wake effects: when turbines are located in the wind shadows of other turbines, their power output is substantially reduced. These losses can be partially mitigated via actively changing the yaw from the individually optimal direction. Most existing wake control techniques have two major limitations: they use simplified wake models to optimize the control strategy, and they assume that the atmospheric conditions remain stable. In this paper, we address these limitations by applying reinforcement learning (RL). RL forgoes the wake model entirely and learns an optimal control strategy based on the observed atmospheric conditions and a reward signal, in this case the power output of the farm. It also accounts for random transitions in the observations, such as turbulent fluctuations in the wind. To evaluate RL for active wake control, we provide a simulator based on the state-of-the-art FLORIS model in the OpenAI gym format. Next, we propose three different state-action representations of the active wake control problem and investigate their effect on the performance of RL-based wake control. Finally, we compare RL to a state-of-the-art wake control strategy based on FLORIS and show that RL is less sensitive to changes in unobservable data. ...
Doctoral thesis (2022) - G. Neustroev, M.M. de Weerdt, R.A. Verzijlbergh
Sequential decision-making under uncertainty is an important branch of artificial intelligence research with a plethora of real-life applications. In this thesis, we generalize two fundamental properties of the decision-making process. First, we show that the theory on planning methods for finite spaces can be extended to infinite but countable spaces. Second, we propose a unified model of reinforcement learning algorithms that employ the principle of optimism in the face of uncertainty. This model is used to explain why these methods are efficient. We use the developed theory to design novel algorithms. Depending on the user's needs, these algorithms can either automate the decision-making process completely, or provide advice in decision-support systems.

We start with presenting the basic concepts from the theory of decision-making and discuss the two approaches to it: planning and reinforcement learning. We look at a few typical sequential decision-making problems of increasing difficulty. In particular, we present a game that involves grid navigation and the problems of warehouse management and wind farm operation. Next, we survey the state-of-the-art methods for solving such problems.

Based on this analysis, we identify the following research opportunities. In planning, models with non-stationary and countably-infinite data remain relatively untreated because they are equivalent to infinitely-dimensional optimization problems, which are notoriously difficult to solve even approximately. In reinforcement learning, optimistic approaches lead to computational efficiency, yet the theory of optimism remains undeveloped. Moreover, while reinforcement learning shines at playing games, such as chess, shōgi, Go, and StarCraft II, its practical applications remain few.

Next, we overview a mathematical framework of sequential decision-making under uncertainty known as the Markov decision process. We explain how the goal of the decision-maker can be expressed as an optimization problem and present two approaches to achieving this goal. The first—more common—approach assigns so-called values to different actions. The other approach uses so-called occupancies that tell how often the agent should choose the actions instead of evaluating how good these actions are. In fact, the two approaches are known to be dual to each other. While this duality is well studied in the finite case, the infinite case is less explored. To address this knowledge gap, we present a new dual formulation for countable problems, both finite and infinite.

Afterwards, we use the dual formulation to design a new planning algorithm for infinite-horizon problems with non-stationary data. These problems are essentially infinite-dimensional optimization problems and as such are impossible to solve exactly using the standard approaches. We show that they can be solved by changing what is defined as optimal behavior: instead of seeking universally optimal policies, we consider initial-decision-optimal ones. Instead of planning all of the actions beforehand, these policies can be used to plan given the currently observed data. When the next decision is required, the process can be repeated in the same manner, leading to an optimal decision-making strategy. Our approach uses the occupancy-value duality to rule out suboptimal actions based on so-called truncations: finite-time approximations of the infinite-horizon decision-making problem.

We extend the truncation approach to a more general setting of decision-making problems with countably-infinite state spaces. Instead of time-based truncations, we consider state-based ones. This allows us to limit the amount of data required to make the decisions and to design an algorithm for a class of problems that are otherwise unsolvable to optimality. This approach belongs to a family of methods called policy iteration: starting from an initial policy, it constructs a series of improvements in the decisions while ruling out choices that are provably suboptimal.

After that, we turn to reinforcement learning. For a long time, the only provably efficient reinforcement-learning methods were model-based ones; recently, a family of model-free optimistic methods emerged, each of them accompanied by an analysis of how sample-efficient the method is. We, too, study optimistic reinforcement learning, but in contrast to the existing research, we seek to understand not how efficient it is, but why it is efficient. Our analysis results in a formula that explains the three factors that cause regret—the efficiency loss—in optimistic reinforcement learning: the problem size, the measure of exploration, and the estimation error caused by the mismatch between the realized transitions and their true distribution. It can be applied to all of the existing algorithms as well as new ones. We design one such new algorithm and show how our theoretical framework can facilitate the proof of its efficiency.

Finally, we consider a high-impact real-world sequential decision-making problem known as active wake control. Wind turbines can negatively impact each other with their wakes. These wake-induced losses can be reduced by changing the turbine orientations. Unfortunately, the optimal control strategy is non-trivial. To address this, existing approaches use simplified wake models in combination with numerical optimization methods; instead we propose to use model-free reinforcement learning. As a first step towards this goal, we present a wind farm simulator that is suitable for reinforcement learning and better reflects the realities of wind farm operation than other existing tools. Using this simulator, we show that previous research used a suboptimal action representation in this problem; we identify two alternatives, both of which improve the learning efficiency. Additionally, we demonstrate that reinforcement learning is robust to errors in the observations, providing further evidence that it is a fitting approach to active wake control.

Our contributions advance the state of the art in the theory of sequential decision-making under uncertainty and its applications. These advances hint at unexplored connections between countably-infinite planning and optimistic learning, which may lead to even more efficient algorithms for sequential decision-making under uncertainty in the future. ...
Conference paper (2020) - Greg Neustroev, Mathijs de Weerdt
Reinforcement learning (RL), like any on-line learning method, inevitably faces the exploration-exploitation dilemma. When a learning algorithm requires as few data samples as possible, it is called sample efficient. The design of sample-efficient algorithms is an important area of research. Interestingly, all currently known provably efficient model-free RL algorithms utilize the same well-known principle of optimism in the face of uncertainty. We unite these existing algorithms into a single general model-free optimistic RL framework. We show how this facilitates the design of new optimistic model-free RL algorithms by simplifying the analysis of their efficiency. Finally, we propose one such new algorithm and demonstrate its performance in an experimental study. ...
Reinforcement learning requires exploration, leading to repeated execution of sub-optimal actions. Naive exploration techniques address this problem by changing gradually from exploration to exploitation. This approach employs a wide search resulting in exhaustive exploration and low sample-efficiency. More advanced search methods explore optimistically based on an upper bound estimate of expected rewards. These methods employ deep search, aiming to reach states not previously visited. Another deep search strategy is found in action-elimination methods, which aim to discover and eliminate sub-optimal actions. Despite the effectiveness of advanced deep search strategies, some problems are better suited to naive exploration. We devise a new method, called Interval Q-Learning, that finds a balance between wide and deep search. It assigns a small probability to taking sub-optimal actions and combines both greedy and optimistic exploration. This allows for fast convergence to a near-optimal policy, and then exploration around it. We demonstrate the performance of tabular and deep Q-network versions of Interval Q-Learning, showing that it offers convergence speed-up both in problems that favor wide exploration methods and those that favor deep search strategies. ...
Infinite-horizon non-stationary Markov decision processes provide a general framework to model many real-life decision-making problems, e.g., planning equipment maintenance. Unfortunately, these problems are notoriously difficult to solve, due to their infinite dimensionality. Often, only the optimality of the initial action is of importance to the decision-maker: once it has been identified, the procedure can be repeated to generate a plan of arbitrary length. The optimal initial action can be identified by finding a time horizon so long that data beyond it has no effect on the initial decision. This horizon is known as a solution horizon and can be discovered by considering a series of truncations of the problem until a stopping rule guaranteeing initial decision optimality is satisfied. We present such a stopping rule for problems with unbounded rewards. Given a candidate policy, the rule uses a mathematical program that searches for other possibly optimal initial actions within the space of feasible truncations. If no better action can be found, the candidate action is deemed optimal. Our rule runs faster than comparable rules and discovers shorter, more efficient solution horizons. ...