Reinforcement Learning for Regime-Aware Pairs Trading
Regime-Switching Reinforcement Learning for Portfolio Allocation in Pairs Trading
T.B. Ilieva (TU Delft - Electrical Engineering, Mathematics and Computer Science)
F. Yu – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
N. Yorke-Smith – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)
F.A. Oliehoek – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
Pairs trading is a well-studied statistical arbitrage strategy. It selects pairs of assets whose historical prices move together and profits from temporary divergences in their price relationship, on the assumption that the relationship will revert to its long-term equilibrium. The dynamics of this relationship may vary over time, however: the spread, which measures the deviation between the prices of the paired assets, can exhibit different levels of volatility and mean reversion under different market conditions. In this paper, we propose a regime-aware reinforcement learning framework for portfolio optimization in pairs trading. We model the spread between assets and characterize its behavior with statistical features that capture its position relative to its historical equilibrium, its volatility, and the strength of its mean reversion. These features are fed to a Hidden Markov Model that infers latent market regimes, each representing a distinct state of the spread dynamics. The inferred regimes are incorporated into the state representation of a reinforcement learning agent, which learns to allocate capital dynamically across pairs. We evaluate the proposed approach against a regime-agnostic reinforcement learning benchmark and a classical z-score threshold strategy. In a controlled simulation study, the regime-aware agent achieves a mean Sharpe ratio of 1.354 versus 0.738 for the baseline on V/MA (ΔSharpe = +0.616) and 1.183 versus 0.564 on V/JKHY (ΔSharpe = +0.619), with the improvement consistent across 10 training seeds. On real out-of-sample data from 2023 to 2026, the regime-aware agent achieves Sharpe ratios of 0.567 on V/MA and 0.609 on V/JKHY, outperforming the baseline in both cases.
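To make the ingredients named in the abstract concrete, the following is a minimal Python sketch of the three spread features, a classical z-score threshold rule, and an annualized Sharpe ratio. It is not the authors' implementation. Everything specific in it is an assumption chosen for illustration: the 60-step rolling window, the entry and exit thresholds of 2.0 and 0.5, the use of a rolling AR(1) coefficient as a proxy for mean-reversion strength, the 252-period annualization, and the synthetic spread. The Hidden Markov Model and the reinforcement learning agent are not shown.

```python
import numpy as np


def rolling_features(spread, window=60):
    """Rolling z-score, volatility, and AR(1) coefficient of a spread series.

    The window length and the AR(1) proxy are illustrative assumptions.
    """
    n = len(spread)
    z = np.full(n, np.nan)
    vol = np.full(n, np.nan)
    phi = np.full(n, np.nan)
    for t in range(window, n):
        w = spread[t - window:t]              # trailing window, excludes time t
        mu, sigma = w.mean(), w.std(ddof=1)
        z[t] = (spread[t] - mu) / sigma       # position relative to recent equilibrium
        vol[t] = np.diff(w).std(ddof=1)       # volatility of spread changes
        x, y = w[:-1] - mu, w[1:] - mu
        phi[t] = (x @ y) / (x @ x)            # AR(1) slope; lower means faster reversion
    return z, vol, phi


def zscore_positions(z, entry=2.0, exit_=0.5):
    """Threshold rule: short the spread above +entry, long below -entry,
    and close the position once |z| falls below exit_ (assumed thresholds)."""
    pos = np.zeros(len(z))
    current = 0.0
    for t, zt in enumerate(z):
        if np.isnan(zt):
            current = 0.0                     # no signal until the window is full
        elif current == 0.0:
            if zt > entry:
                current = -1.0
            elif zt < -entry:
                current = 1.0
        elif abs(zt) < exit_:
            current = 0.0
        pos[t] = current
    return pos


def sharpe_ratio(returns, periods_per_year=252):
    """Annualized Sharpe ratio with a zero risk-free rate (assumed)."""
    returns = np.asarray(returns, dtype=float)
    sd = returns.std(ddof=1)
    return 0.0 if sd == 0 else float(np.sqrt(periods_per_year) * returns.mean() / sd)


# Synthetic mean-reverting spread (discrete AR(1) process), for illustration only.
rng = np.random.default_rng(0)
n = 1000
spread = np.zeros(n)
for t in range(1, n):
    spread[t] = 0.95 * spread[t - 1] + rng.normal(scale=0.1)

z, vol, phi = rolling_features(spread)
pos = zscore_positions(z)
pnl = pos[:-1] * np.diff(spread)              # position at t earns the change from t to t+1
print("Sharpe of threshold rule on synthetic spread:", sharpe_ratio(pnl))
```

In a regime-aware setup of the kind the abstract describes, the arrays `z`, `vol`, and `phi` would be the inputs to the regime model, and the inferred regime would be appended to the agent's state; the threshold rule above corresponds only to the classical comparison strategy.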