On Non-Stationarity in Reinforced Deep Markov Models with Applications in Portfolio Optimization

Abstract

In this thesis, we aim to improve the application of deep reinforcement learning to portfolio optimization. Reinforcement learning has in recent years been applied to a wide range of problems, from games to control systems in the physical world to finance. While reinforcement learning has shown success in simulated environments (e.g. matching or exceeding human performance in games), its adoption in practical, non-simulated applications has lagged. Dulac-Arnold et al. [2019] suggest this is caused by a discrepancy between the experimental set-up used in research and the conditions encountered in practice. Specifically, they present a list of challenges that make the application of reinforcement learning in real-world settings more difficult. One of these challenges is non-stationary environments, which are common in finance. Non-stationarity is a challenge because, for a given observed state, the optimal action may change over time. The goal of this thesis is therefore to overcome the challenge of non-stationarity in the application of reinforcement learning to portfolio optimization.

We build on the reinforced deep Markov model (RDMM) introduced by Ferreira [2020], who applied it to an optimal execution problem; it was later used by Cartea et al. [2021] for statistical arbitrage on simulated price movements of an FX triplet. We choose the RDMM for its data efficiency and its ability to handle complex environments. The RDMM is built around a partially observable Markov decision process (POMDP), which is also the setting used by Xie et al. [2021] to model non-stationarity in reinforcement learning. We extend the RDMM to incorporate non-stationarity, following the framework suggested by Xie et al. [2021], and apply it to portfolio optimization. Our implementation is sample efficient, which allows for quick learning; in doing so, we also address another challenge of reinforcement learning, namely sample inefficiency [Dulac-Arnold et al., 2019]. Moreover, our implementation handles continuous state and action spaces.
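To fix notation, the sketch below shows one way to cast such a non-stationary control problem as a POMDP in the spirit of Xie et al. [2021]; the latent variable z_t and the specific factorization are illustrative assumptions, not the exact formulation used in the thesis.

\begin{align*}
  z_{t+1} &\sim p(z_{t+1} \mid z_t)                  && \text{latent, slowly drifting regime (source of non-stationarity)} \\
  s_{t+1} &\sim p(s_{t+1} \mid s_t, a_t, z_t)        && \text{state transition} \\
  x_t     &\sim p(x_t \mid s_t)                      && \text{observation (e.g. prices, returns)} \\
  r_t     &=    r(s_t, a_t, z_t)                     && \text{reward (e.g. portfolio return net of costs)} \\
  a_t     &\sim \pi(\,\cdot \mid x_{1:t}, a_{1:t-1}) && \text{policy conditioned on the observable history}
\end{align*}

Because z_t is never observed directly, the agent has to act on its observable history alone, which is why the POMDP setting is a natural fit for non-stationary environments.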
We compare the performance of our algorithms to classical portfolio optimization techniques such as Mean-Variance (MV) and Equal Risk Contribution (ERC), and to popular reinforcement learning techniques such as Deep Deterministic Policy Gradient (DDPG) and Soft Actor-Critic (SAC). We observe that our implementation has higher sample efficiency than DDPG and SAC, and higher cumulative returns on the test set than MV, ERC, DDPG, and SAC.
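For context, the snippet below is a minimal sketch of one of the classical baselines, Equal Risk Contribution, which sizes positions so that every asset contributes equally to portfolio risk; the covariance matrix and the helper name erc_weights are hypothetical and not taken from the thesis.

# Minimal ERC baseline sketch (hypothetical helper and data; not thesis code).
import numpy as np
from scipy.optimize import minimize

def erc_weights(cov: np.ndarray) -> np.ndarray:
    """Long-only weights whose risk contributions w_i * (cov @ w)_i are all equal."""
    n = cov.shape[0]

    def objective(w):
        rc = w * (cov @ w)                                # per-asset risk contribution
        return np.sum((rc[:, None] - rc[None, :]) ** 2)   # pairwise mismatch penalty

    result = minimize(
        objective,
        np.full(n, 1.0 / n),                              # start from equal weights
        bounds=[(0.0, 1.0)] * n,                          # long-only
        constraints=({"type": "eq", "fun": lambda w: w.sum() - 1.0},),  # fully invested
    )
    return result.x

# Example with a hypothetical 3-asset covariance matrix.
cov = np.array([[0.04, 0.01, 0.00],
                [0.01, 0.09, 0.02],
                [0.00, 0.02, 0.16]])
print(erc_weights(cov).round(3))

Under this objective, lower-volatility assets receive larger weights; the Mean-Variance baseline can be sketched analogously by trading off expected return against the portfolio variance w' cov w.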