J. Liu | TU Delft Repository

Interpreting reinforcement learning for post-capture control in space debris removal

Doctoral thesis (2026) - J. Liu, E.K.A. Gill, J. Guo

Space debris, consisting of defunct artificial objects in orbit, poses a significant threat to on-orbit safety. Active Debris Removal (ADR) using robotic arms offers a potential solution to capture and remove debris, but the combined spacecraft that results from capture operates under substantial uncertainty. Such uncertainty is not limited to post-capture scenarios; it is a common feature of spacecraft attitude control, where parameter variations, inertia uncertainty, and external disturbances such as solar radiation pressure can significantly affect system dynamics. Learning-based control approaches, particularly reinforcement learning (RL), have emerged as promising frameworks for handling such uncertain environments, as they allow control behaviors to be learned without explicit dynamic models. However, the performance of RL algorithms varies across tasks, and understanding their internal learning dynamics remains challenging, which limits interpretability and reliable application to space systems.

This research addresses these challenges by developing visualization-based methods to interpret and analyze the learning dynamics of actor–critic RL algorithms applied to spacecraft attitude control. Specifically, it introduces a critic match loss landscape visualization method for online actor–critic algorithms, allowing the evolution of the critic network during training to be examined systematically. Network parameters are recorded at the end of each episode and projected onto a low-dimensional subspace using Principal Component Analysis (PCA). A fixed-target critic match loss is then defined using reference state samples and corresponding temporal-difference (TD) targets from a selected policy. Evaluating the loss over the principal component plane generates both three-dimensional landscapes and two-dimensional contour plots with overlaid training trajectories, thereby illustrating how the critic optimizes its parameters over time. Quantitative indices and random-direction projections are used to systematically compare learning behavior across different training runs and reduce reliance on a single PCA projection.
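The projection-and-evaluation steps described above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the thesis implementation: the linear critic, the mean-squared form of the match loss, and all names (`pca_plane`, `critic_match_loss`, `loss_landscape`, `margin`) are assumptions introduced here for the example.

```python
import numpy as np

def pca_plane(theta_history):
    """Return the mean and the top-2 principal directions of recorded parameters."""
    center = theta_history.mean(axis=0)
    # SVD of the centered trajectory; rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(theta_history - center, full_matrices=False)
    return center, vt[:2]

def critic_match_loss(theta, features, td_targets):
    """Assumed loss: mean squared gap between a linear critic and fixed TD targets."""
    return float(np.mean((features @ theta - td_targets) ** 2))

def loss_landscape(theta_history, features, td_targets, n_grid=25, margin=0.2):
    """Evaluate the loss on a grid spanning the principal component plane."""
    center, directions = pca_plane(theta_history)
    coords = (theta_history - center) @ directions.T  # 2-D training path
    lo, hi = coords.min(axis=0), coords.max(axis=0)
    pad = margin * (hi - lo) + 1e-12                  # keep the path inside the grid
    a = np.linspace(lo[0] - pad[0], hi[0] + pad[0], n_grid)
    b = np.linspace(lo[1] - pad[1], hi[1] + pad[1], n_grid)
    surface = np.empty((n_grid, n_grid))
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            theta = center + ai * directions[0] + bj * directions[1]
            surface[i, j] = critic_match_loss(theta, features, td_targets)
    return a, b, surface, coords

# Synthetic stand-ins for reference samples and per-episode parameter snapshots.
rng = np.random.default_rng(0)
features = rng.normal(size=(64, 6))        # hypothetical reference state features
theta_star = rng.normal(size=6)
td_targets = features @ theta_star         # fixed targets from a "selected policy"
theta0 = rng.normal(size=6)
steps = np.linspace(0.0, 1.0, 30)[:, None]
theta_history = theta0 + steps * (theta_star - theta0) + 0.05 * rng.normal(size=(30, 6))

a, b, surface, coords = loss_landscape(theta_history, features, td_targets)
```

Here `surface` would feed a three-dimensional surface or contour plot, and `coords` is the training trajectory to overlay on it. A real critic network would replace the linear model by flattening its weights into a single vector per episode.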

The method is demonstrated on the Action-Dependent Heuristic Dynamic Programming (ADHDP) algorithm, applied to cart-pole and spacecraft attitude control tasks. Comparative analysis shows how the loss landscape geometry corresponds to training stability and control performance. Moreover, the visualization framework is extended to off-policy RL by adapting it to the Soft Actor–Critic (SAC) algorithm, which can achieve convergent control performance in spacecraft tasks. The adaptation accounts for SAC’s twin-critic structure, target computations, and replay-based training while maintaining interpretive consistency with the online learning results.
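For orientation, the standard SAC target that a twin-critic match loss would have to account for can be sketched as below. The abstract does not say how the thesis fixes targets for the landscape, so the function names and the per-critic mean-squared loss are assumptions; only the target formula itself follows the usual SAC definition.

```python
import numpy as np

def sac_td_target(reward, done, q1_next, q2_next, log_prob_next,
                  gamma=0.99, alpha=0.2):
    """Standard SAC target: clipped double-Q value plus an entropy bonus."""
    min_q = np.minimum(q1_next, q2_next)   # pessimistic estimate from the twin critics
    return reward + gamma * (1.0 - done) * (min_q - alpha * log_prob_next)

def twin_critic_match_loss(q1_pred, q2_pred, targets):
    """Assumed match loss: one mean-squared error per critic against shared targets."""
    return (float(np.mean((q1_pred - targets) ** 2)),
            float(np.mean((q2_pred - targets) ** 2)))

# Two hypothetical transitions drawn from a replay buffer; the second is terminal.
reward = np.array([1.0, 0.0])
done = np.array([0.0, 1.0])
targets = sac_td_target(reward, done,
                        q1_next=np.array([2.0, 5.0]),
                        q2_next=np.array([3.0, 4.0]),
                        log_prob_next=np.array([-1.0, -1.0]),
                        gamma=0.9, alpha=0.5)
losses = twin_critic_match_loss(np.array([3.0, 0.5]), np.array([3.25, 0.0]), targets)
```

Because both critics share one target, each critic can be given its own landscape while the reference samples and targets stay fixed across the grid.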

Finally, the framework is expanded to capture both actor and critic dynamics, integrating four complementary components: a three-dimensional critic match loss landscape; an actor loss landscape; trajectories combining time, TD error, and actor weight evolution; and state–TD plots highlighting areas of large TD fluctuations. Applied across multiple ADHDP variants with deep learning, this multi-perspective framework provides interpretable insights into the interactions between value estimation, policy optimization, and TD signals during training. The framework allows for diagnosis of instability mechanisms, improves understanding of actor–critic learning behavior, and supports the design and analysis of RL algorithms for dynamic and uncertain control systems.

Altogether, this research establishes a comprehensive visualization and interpretation framework for RL in spacecraft attitude control and other dynamic environments, enhancing the transparency, interpretability, and reliability of actor–critic algorithms under uncertainty. It also provides a foundation for future work in dynamic visualization, theoretical stability analysis, and experimental validation on integrated physical platforms.

Visualizing critic match loss landscapes for interpretation of online reinforcement learning control algorithms

Journal article (2026) - Jingyi Liu, Jian Guo, Eberhard Gill

Reinforcement learning has proven effective in a range of applications. However, its performance is not guaranteed when system dynamics change, and successful use largely relies on users’ empirical experience. For reinforcement learning algorithms with an actor–critic structure, the critic neural network reflects the approximation and optimization process in the RL algorithm. Analyzing the performance of the critic neural network helps to understand the mechanism of the algorithm. To support systematic interpretation of such algorithms in dynamic control problems, this work proposes a critic match loss landscape visualization method for online reinforcement learning. The method constructs a loss landscape by projecting recorded critic parameter trajectories onto a low-dimensional linear subspace. The critic match loss is evaluated over the projected parameter grid using fixed reference state samples and temporal-difference targets. This yields a three-dimensional loss surface together with a two-dimensional optimization path that characterizes critic learning behavior. To extend analysis beyond visual inspection, quantitative landscape indices and a normalized system performance index are introduced, enabling structured comparison across different training outcomes. The approach is demonstrated using the Action-Dependent Heuristic Dynamic Programming algorithm on cart–pole and spacecraft attitude control tasks. Comparative analyses across projection methods and training stages reveal distinct landscape characteristics associated with stable convergence and unstable learning. The proposed framework enables both qualitative and quantitative interpretation of critic optimization behavior in online reinforcement learning. ...
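One way to compare projection methods is to measure how much of the recorded parameter trajectory each plane captures. The sketch below contrasts a PCA plane with a random orthonormal plane on a synthetic random walk. The abstract does not name the projections or indices it uses, so `random_plane` and `captured_variance` are illustrative assumptions rather than the article's definitions.

```python
import numpy as np

def pca_plane(theta_history):
    """Top-2 principal directions of the centered parameter trajectory."""
    centered = theta_history - theta_history.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:2]

def random_plane(dim, seed=0):
    """Two random orthonormal directions, obtained from a reduced QR factorization."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.normal(size=(dim, 2)))
    return q.T

def captured_variance(theta_history, directions):
    """Fraction of trajectory variance that lies inside the given plane."""
    centered = theta_history - theta_history.mean(axis=0)
    projected = centered @ directions.T
    return float((projected ** 2).sum() / (centered ** 2).sum())

# Synthetic stand-in for recorded critic parameters: a 10-D random walk.
rng = np.random.default_rng(1)
theta_history = np.cumsum(rng.normal(size=(40, 10)), axis=0)

pca_share = captured_variance(theta_history, pca_plane(theta_history))
rand_share = captured_variance(theta_history, random_plane(10))
```

The PCA plane captures at least as much variance as any other two-dimensional plane by construction, so a landscape feature that also appears under random planes is less likely to be an artifact of the projection.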

Online policy iteration ADP-based control of post-capture combined spacecraft without inertia identifications

Journal article (2022) - Jingyi Liu, Jian Guo, Eberhard Gill

Using robotic arms to capture space debris is a promising method in active debris removal. Since most space debris consists of uncooperative targets, the inertial parameters of the combined spacecraft after target capture are uncertain. In this paper, an attitude takeover control method based on adaptive dynamic programming is investigated for the post-capture combined spacecraft. The controller requires no inertial information. First, the dynamic and kinematic model of the combined spacecraft is established. Then, the learning-based controller, action-dependent heuristic dynamic programming with an actor-critic structure, is introduced. Finally, the algorithm is tested on the benchmark cart-pole system and the combined spacecraft. Simulation results show that the inertia-free control method is sensitive to the learning parameters and to the initial weights of the actor and the critic, which poses problems for practical use. ...
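To give a sense of the actor-critic structure mentioned above, the sketch below performs one critic update in a commonly used form of ADHDP, where the critic error is e_c = gamma * J(t) - [J(t-1) - r(t)] and the actor is driven by the gap between J(t) and a desired objective. The abstract does not give the article's equations, so the linear critic, the learning rate, and all names here are assumptions.

```python
import numpy as np

def adhdp_critic_step(w_c, x, u, j_prev, reward, gamma=0.95, lr=0.1):
    """One gradient step on E_c = 0.5 * e_c**2 for a linear critic J = w_c . [x, u]."""
    z = np.append(x, u)                          # the critic input is state plus action
    e_c = gamma * float(w_c @ z) - (j_prev - reward)
    w_new = w_c - lr * e_c * gamma * z           # dE_c/dw_c = e_c * gamma * z
    return w_new, e_c

def adhdp_actor_error(w_c, x, u, u_c=0.0):
    """Actor error: distance of the critic output from the desired objective u_c."""
    return float(w_c @ np.append(x, u)) - u_c

# Hypothetical state, action, and weights for a single time step.
x = np.array([1.0, 0.5, -0.2])
u = 0.3
w_c = np.array([0.2, -0.1, 0.4, 0.3])

w_new, e_before = adhdp_critic_step(w_c, x, u, j_prev=0.5, reward=-1.0)
_, e_after = adhdp_critic_step(w_new, x, u, j_prev=0.5, reward=-1.0)
e_actor = adhdp_actor_error(w_new, x, u)
```

Neither error depends on the spacecraft inertia, which is consistent with the inertia-free claim. The step size `lr` and the initial `w_c` directly scale how fast `e_c` shrinks, which hints at why such a controller can be sensitive to learning parameters and initial weights.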