Adaptive Control for Spacecraft Rendezvous

A reinforcement meta-learning approach

Abstract

The continued increase in the number of satellites in low Earth orbit has led to a growing threat of collisions between space objects. On-orbit servicing and active debris removal missions can alleviate this threat by extending the lifetime of active satellites and deorbiting inactive ones, but they require advanced guidance and control algorithms for the rendezvous phase. Recently, various control policies based on machine learning have been proposed to leverage the advantages of neural networks. One notable technique that has shown considerable potential in asteroid and planetary landing scenarios is reinforcement meta-learning, which trains recurrent neural networks in uncertain scenarios to develop highly robust control policies that can adapt to unknown conditions in real time. The goal of this thesis was to apply this meta-learning technique to a rendezvous scenario.
To this end, a recurrent neural network was trained via reinforcement meta-learning to generate a control policy that performs the final approach maneuver of a chaser spacecraft towards a rotating target; a feedforward network was also trained for comparison. The policy was trained with the Proximal Policy Optimization (PPO) algorithm, a modern actor-critic method that has shown strong performance in a range of continuous control settings. A virtual environment was developed in Python to simulate the rendezvous scenario and collect the data used to train the policy.
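To illustrate the core idea, the sketch below shows a minimal recurrent policy of the kind described above. The use of PyTorch, the GRU cell, and all dimensions are assumptions for illustration, not the thesis's actual implementation; the point is that the hidden state carried across timesteps is what allows the policy to adapt to unobserved conditions such as the target's rotation.

    import torch
    import torch.nn as nn

    class RecurrentPolicy(nn.Module):
        """Minimal recurrent actor (illustrative only). The GRU hidden state
        lets the policy accumulate information about unobserved parameters,
        e.g. the target's rotation rate, and adapt its actions over time."""

        def __init__(self, obs_dim, act_dim, hidden_dim=64):
            super().__init__()
            self.gru = nn.GRU(obs_dim, hidden_dim, batch_first=True)
            self.mean = nn.Linear(hidden_dim, act_dim)          # action mean
            self.log_std = nn.Parameter(torch.zeros(act_dim))   # state-independent std

        def forward(self, obs_seq, hidden=None):
            # obs_seq: (batch, time, obs_dim); hidden carries memory between calls
            out, hidden = self.gru(obs_seq, hidden)
            return self.mean(out), self.log_std.exp(), hidden

    # One environment step: sample a thrust command and keep the hidden state.
    policy = RecurrentPolicy(obs_dim=6, act_dim=3)
    obs = torch.zeros(1, 1, 6)                  # single (hypothetical) observation
    mean, std, h = policy(obs)                  # h is fed back in at the next step
    action = torch.distributions.Normal(mean, std).sample()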
Before training began, the hyperparameters of the model were tuned to ensure a smooth and efficient learning process. Three components required tuning: the learning algorithm, the architecture of the neural networks, and the reward function. Each component was tuned in turn, primarily through trial and error, which required executing the learning algorithm many times with a different combination of hyperparameters on each iteration. By repeating this process over a large search space, suitable hyperparameters were found for the learning algorithm and the neural networks; they were chosen to maximize the reward achieved by the policy while keeping the training runtime reasonable. The reward function was split into several components to guide the policy towards its objective, thereby speeding up learning. Each component represented a partial goal that the controller had to accomplish, and tuning the relative weights of these components was challenging because it often led to trade-offs between different policy behaviors. Once tuning was complete, a sensitivity study was performed to verify that the model could be used for different kinds of rendezvous trajectories. The model was trained on scenarios with different orbit altitudes, initial distances from the target, target sizes, and target rotation speeds. The results of this study showed that the model could be applied to most of these scenarios without major changes.
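As an illustration of this kind of multi-component reward shaping, the sketch below combines weighted terms for a rendezvous-like task. The component names, weights, and signs are hypothetical stand-ins; the thesis tuned its own components and weights by trial and error.

    # Hypothetical weights -- illustrative values, not the thesis's tuned ones.
    W_DISTANCE, W_VELOCITY, W_FUEL, W_COLLISION = 1.0, 0.5, 0.1, 100.0

    def shaped_reward(dist_to_goal, prev_dist, rel_speed, thrust_mag, collided):
        """Each weighted term encodes a partial goal: make progress toward
        the docking point, match the target's velocity, save fuel, and
        avoid collisions."""
        r = W_DISTANCE * (prev_dist - dist_to_goal)   # reward progress toward target
        r -= W_VELOCITY * rel_speed                   # penalize relative speed
        r -= W_FUEL * thrust_mag                      # penalize fuel consumption
        if collided:
            r -= W_COLLISION                          # large penalty for a collision
        return r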
After the tuning and the sensitivity analysis were completed, the recurrent and feedforward policies were each trained on a partially observable environment, and their performance was evaluated with a Monte Carlo simulation of one thousand trajectories. The results showed that the recurrent policy learned to infer hidden information from the environment, which led to considerably better performance than the feedforward policy. The recurrent policy was not without limitations, however: it could not always generate collision-free trajectories, especially when the target rotated at higher rates. Overall, this thesis showed that reinforcement meta-learning can be a valuable tool for executing complex rendezvous maneuvers, which may prove useful now that active debris removal missions are becoming a reality. Furthermore, the thesis describes how the model was designed and tuned, so that other machine learning practitioners can apply the technique to different scenarios.
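A Monte Carlo evaluation of this kind can be sketched as follows. The make_env factory, the Gym-style environment interface, and the policy.act method are assumed placeholders for the thesis's own Python code; the sketch only shows the overall structure of running many randomized episodes and aggregating success and collision rates.

    import numpy as np

    def monte_carlo_eval(policy, make_env, n_episodes=1000, seed=0):
        """Run the trained policy over many episodes with randomized initial
        conditions and report success and collision rates. `make_env` and
        `policy.act` are hypothetical interfaces, not the thesis's code."""
        rng = np.random.default_rng(seed)
        successes, collisions = 0, 0
        for _ in range(n_episodes):
            env = make_env(rng)             # randomized target state, altitude, etc.
            obs, hidden, done = env.reset(), None, False
            while not done:
                action, hidden = policy.act(obs, hidden)  # recurrent state reused
                obs, reward, done, info = env.step(action)
            successes += info.get("docked", False)
            collisions += info.get("collided", False)
        return successes / n_episodes, collisions / n_episodes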