Reinforcement learning for regime-aware pairs trading

Reinforcement Learning for Regime-Dependent Optimal Stopping in Pairs Trading

Bachelor Thesis (2026)

Author(s)

M.O. Bankov (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

F. Yu – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

F.A. Oliehoek – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

N. Yorke-Smith – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty

Electrical Engineering, Mathematics and Computer Science

Optimal Stopping, Double Deep Q-Network, Markov models, Pairs trading

To reference this document use

https://resolver.tudelft.nl/uuid:e7437b3c-6059-428c-922a-305690206992

More Info

Publication Year

2026

Language

English

Graduation Date

24-06-2026

Awarding Institution

Delft University of Technology

Project

CSE3000 Research Project

Programme

Computer Science and Engineering

Downloads counter

8

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Pairs trading is a strategy that exploits the mean-reverting spread between two correlated assets, for example stocks. An important factor in such strategies is the market regime, which captures characteristics of the data such as trend and volatility and can shift over time. This paper investigates whether incorporating regime awareness improves the performance of Reinforcement Learning agents that make entry and exit decisions in pairs trading. Three Double Deep Q-Network (DQN) variants are implemented and compared: a baseline DQN, a Recurrent DQN, and a Hidden Markov Model-based DQN (Markov DQN) that maintains a separate agent per inferred regime. The agents are evaluated on intraday Corn and Wheat futures data, as well as on generated single-regime daily data. The Recurrent DQN does not significantly improve over the baseline, which suggests that it does not implicitly capture regime information. The Markov DQN outperforms the baseline on real data (p = 0.033) but performs worse on generated data, although between-run variance is high in both cases. These results support the hypothesis that explicit regime modelling can benefit pairs trading on real-world data.
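As an illustration of two ideas named in the abstract, the following minimal Python sketch shows the standard Double DQN bootstrap target (the online network selects the next action, the target network evaluates it) and one possible way to route a decision to a per-regime agent. It is not taken from the thesis: the function names, the `gamma` default, and the choice to pick the most probable regime are all assumptions made for this example.

```python
def double_dqn_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    """Standard Double DQN target for a single transition.

    q_online_next / q_target_next are the next-state Q-values (one per action)
    from the online and target networks. All names here are hypothetical.
    """
    if done:
        # Terminal transition: no bootstrapping from the next state.
        return reward
    # Action selection uses the online network...
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    # ...while the value of that action comes from the target network.
    return reward + gamma * q_target_next[best_action]


def pick_regime_agent(agents, regime_probs):
    """Return the agent for the most probable regime.

    Assumption: regime_probs are posterior probabilities from some regime
    model (e.g. an HMM); the thesis may select regimes differently.
    """
    most_likely = max(range(len(regime_probs)), key=lambda k: regime_probs[k])
    return agents[most_likely]
```

For instance, `double_dqn_target(1.0, False, [1.0, 2.0], [10.0, 20.0], gamma=0.5)` selects action 1 via the online values and bootstraps from the target value 20.0 for that action.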

Files

Mihail_Bankov_RP.pdf

(pdf | 0.592 Mb)

License info not available