Interpreting Reinforcement Learning for Post-Capture Control in Space Debris Removal
J. Liu (TU Delft - Aerospace Engineering)
E.K.A. Gill – Promotor (TU Delft - Aerospace Engineering)
J. Guo – Promotor (TU Delft - Aerospace Engineering)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
Space debris, consisting of defunct artificial objects in orbit, poses a significant threat to on-orbit safety. Active Debris Removal (ADR) using robotic arms offers a potential solution to capture and remove debris, but the resulting post-capture spacecraft operates under substantial uncertainty. Such uncertainty is not limited to post-capture scenarios; it is a common feature of spacecraft attitude control, where factors such as parameter variations, inertia uncertainty, and external disturbances like solar radiation pressure can significantly affect system dynamics. Learning-based control approaches, particularly reinforcement learning (RL), have emerged as promising frameworks for handling such uncertain environments, as they allow control behaviors to be learned without explicit dynamic models. However, the performance of RL algorithms varies across tasks, and understanding their internal learning dynamics remains challenging, which limits interpretability and reliable application to space systems.
This research addresses these challenges by developing visualization-based methods to interpret and analyze the learning dynamics of actor–critic RL algorithms applied to spacecraft attitude control. Specifically, it introduces a critic match loss landscape visualization method for online actor–critic algorithms, allowing the evolution of the critic network during training to be examined systematically. Network parameters are recorded at the end of each episode and projected onto a low-dimensional subspace using Principal Component Analysis (PCA). A fixed-target critic match loss is then defined using reference state samples and corresponding temporal-difference (TD) targets from a selected policy. Evaluating the loss over the principal component plane generates both three-dimensional landscapes and two-dimensional contour plots with overlaid training trajectories, thereby illustrating how the critic optimizes its parameters over time. Quantitative indices and random-direction projections are used to systematically compare learning behavior across different training runs and reduce reliance on a single PCA projection.
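The pipeline described above (record parameters per episode, project with PCA, evaluate a fixed-target loss over the principal component plane) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the tiny critic, the synthetic random-walk "training" snapshots, and the names `critic_value`, `match_loss`, `pca_plane`, and `loss_landscape` are all hypothetical, and the fixed targets here are generated from a reference parameter vector as a stand-in for TD targets computed from a selected policy.

```python
import numpy as np

N_IN, N_HID = 3, 4  # hypothetical critic size: 3 inputs (state + action), 4 hidden units

def critic_value(theta, inputs):
    """Toy one-hidden-layer critic evaluated from a flat parameter vector."""
    w1 = theta[:N_IN * N_HID].reshape(N_IN, N_HID)
    w2 = theta[N_IN * N_HID:]
    return np.tanh(inputs @ w1) @ w2

def match_loss(theta, inputs, targets):
    """Mean squared mismatch between critic outputs and fixed targets."""
    return 0.5 * np.mean((critic_value(theta, inputs) - targets) ** 2)

def pca_plane(snapshots):
    """Return the mean snapshot, the first two principal directions, and 2-D coordinates."""
    center = snapshots.mean(axis=0)
    # Rows of vt are the principal directions of the centered snapshots.
    _, _, vt = np.linalg.svd(snapshots - center, full_matrices=False)
    d1, d2 = vt[0], vt[1]
    coords = (snapshots - center) @ np.stack([d1, d2], axis=1)
    return center, d1, d2, coords

def loss_landscape(center, d1, d2, inputs, targets, alphas, betas):
    """Evaluate the fixed-target loss at center + a*d1 + b*d2 over a grid."""
    grid = np.empty((len(alphas), len(betas)))
    for i, a in enumerate(alphas):
        for j, b in enumerate(betas):
            grid[i, j] = match_loss(center + a * d1 + b * d2, inputs, targets)
    return grid

# Synthetic stand-ins (assumptions): a random walk replaces real per-episode snapshots.
rng = np.random.default_rng(0)
n_params = N_IN * N_HID + N_HID
theta_ref = rng.normal(size=n_params)
snapshots = rng.normal(size=n_params) + np.cumsum(
    rng.normal(scale=0.1, size=(30, n_params)), axis=0
)
inputs = rng.uniform(-1.0, 1.0, size=(64, N_IN))  # reference state-action samples
targets = critic_value(theta_ref, inputs)         # stand-in for fixed TD targets

center, d1, d2, coords = pca_plane(snapshots)
alphas = np.linspace(coords[:, 0].min() - 0.5, coords[:, 0].max() + 0.5, 25)
betas = np.linspace(coords[:, 1].min() - 0.5, coords[:, 1].max() + 0.5, 25)
grid = loss_landscape(center, d1, d2, inputs, targets, alphas, betas)
```

With a plotting library such as Matplotlib, `grid.T` could be drawn as a contour plot over `alphas` and `betas`, with `coords` overlaid as the training trajectory. Replacing `d1` and `d2` with random orthonormal directions would give the kind of random-direction projection mentioned above.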
The method is demonstrated on the Action-Dependent Heuristic Dynamic Programming (ADHDP) algorithm, applied to cart-pole and spacecraft attitude control tasks. Comparative analysis shows how the loss landscape geometry corresponds to training stability and control performance. Moreover, the visualization framework is extended to off-policy RL by adapting it to the Soft Actor–Critic (SAC) algorithm, which can achieve convergent control performance in spacecraft tasks. Adjustments account for SAC’s twin-critic structure, target computations, and replay-based training while maintaining interpretive consistency with online learning results.
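For orientation, the standard SAC target that a twin-critic match loss would need to account for can be sketched as below. This shows the textbook form of the soft TD target (minimum of the two target critics minus an entropy term); how the thesis fixes or samples these targets is not stated in the abstract, so the function name `sac_td_target` and the default values of `gamma` and `alpha` are assumptions for illustration only.

```python
import numpy as np

def sac_td_target(reward, done, q1_next, q2_next, logp_next, gamma=0.99, alpha=0.2):
    """Standard SAC target: r + gamma * (1 - done) * (min(Q1', Q2') - alpha * log pi)."""
    # The pessimistic minimum over the twin target critics limits overestimation.
    soft_value = np.minimum(q1_next, q2_next) - alpha * logp_next
    return reward + gamma * (1.0 - done) * soft_value

# Small worked example with hypothetical numbers.
example_target = sac_td_target(
    reward=np.array([1.0, 1.0]),
    done=np.array([0.0, 1.0]),       # second transition is terminal
    q1_next=np.array([2.0, 2.0]),
    q2_next=np.array([3.0, 3.0]),
    logp_next=np.array([-1.0, -1.0]),
    gamma=0.9,
    alpha=0.5,
)
```

Each of the two critics could then be compared against the same target on a fixed batch drawn from the replay buffer, which would give one match loss surface per critic.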
Finally, the framework is expanded to capture both actor and critic dynamics, integrating four complementary components: a three-dimensional critic match loss landscape, an actor loss landscape, trajectories combining time, TD error, and actor weight evolution, and state–TD plots highlighting areas of large TD fluctuations. Applied across multiple ADHDP variants with deep learning, this multi-perspective framework provides interpretable insights into the interactions between value estimation, policy optimization, and TD signals during training. The framework allows for diagnosis of instability mechanisms, improves understanding of actor–critic learning behavior, and supports the design and analysis of RL algorithms for dynamic and uncertain control systems.
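One simple way to locate regions of large TD fluctuation, in the spirit of the state–TD plots mentioned above, is to bin a state variable and measure the spread of the TD error in each bin. The sketch below is only an assumption about how such a summary might be computed: it uses a generic reward-based TD error, whereas ADHDP formulations often use a cost signal instead, and the names `td_errors` and `td_spread_by_state` are hypothetical.

```python
import numpy as np

def td_errors(q, q_next, reward, gamma=0.95):
    """Generic one-step TD error: r + gamma * Q(next) - Q(current)."""
    return reward + gamma * q_next - q

def td_spread_by_state(states, td, edges):
    """Standard deviation of the TD error within each state bin (NaN for empty bins)."""
    n_bins = len(edges) - 1
    # digitize returns 1-based bin indices; clip so the right edge falls in the last bin.
    idx = np.clip(np.digitize(states, edges) - 1, 0, n_bins - 1)
    spread = np.full(n_bins, np.nan)
    for b in range(n_bins):
        in_bin = td[idx == b]
        if in_bin.size:
            spread[b] = in_bin.std()
    return spread

# Hypothetical data: one scalar state (e.g. an attitude error) per time step.
states = np.array([0.1, 0.2, 0.8, 0.9])
td = td_errors(
    q=np.array([0.0, 2.0, 1.0, 1.0]),
    q_next=np.zeros(4),
    reward=np.ones(4),
)
spread = td_spread_by_state(states, td, edges=np.array([0.0, 0.5, 1.0]))
```

Bins with a large spread mark the parts of the state space where the critic's TD signal is least settled.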
Altogether, this research establishes a comprehensive visualization and interpretation framework for RL in spacecraft attitude control and other dynamic environments, enhancing transparency, interpretability, and reliability of actor–critic algorithms under uncertainty, and providing a foundation for future work in dynamic visualization, theoretical stability analysis, and experimental validation on integrated physical platforms.