Deep Reinforcement Learning for Battery Arbitrage in the Continuous Intraday Market

Scaling to Minute-Level Trading with TD3+BC under Realistic Market Constraints

Master Thesis (2025)
Author(s)

T.J. van Os (TU Delft - Mechanical Engineering)

Contributor(s)

S. Grammatico – Mentor (TU Delft - Team Sergio Grammatico)

A. Mallick – Mentor (TU Delft - Team Peyman Mohajerin Esfahani)

Faculty
Mechanical Engineering
Publication Year
2025
Language
English
Coordinates
52.380324788885986, 4.888964689481552
Graduation Date
01-11-2025
Awarding Institution
Delft University of Technology
Programme
Mechanical Engineering | Systems and Control
Sponsors
None
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

This thesis investigates the minute-level operation of a 40 MWh battery in the Continuous Intraday Market. The study focuses on the Dutch market within its European cross-border setting and formulates dispatch as a finite-horizon Markov Decision Process that captures both battery physics and order-book depth. A continuous action-to-power mapping guarantees feasibility by enforcing state-of-charge, efficiency, and market liquidity constraints. Two deep reinforcement learning methods, TD3 and TD3 with behaviour cloning (TD3+BC), are implemented and benchmarked against a rolling-intrinsic (RI) optimiser, which serves as both baseline and source of expert trajectories. Minute-level resolution proves empirically justified: the RI benchmark at one-minute granularity consistently outperforms its fifteen-minute counterpart. In comparative experiments, TD3+BC outperforms plain TD3 and narrows the gap to RI, reaching within about 4% of its profit on average, though not consistently surpassing it. The learned policy exhibits a distinct cycle-efficient trading style with fewer equivalent full cycles and lower throughput than RI but higher value extracted per cycle, which translates over the project lifetime into a stronger business case, yielding an internal rate of return approximately twice that of the RI baseline. Training TD3+BC for five million steps is computationally tractable on a standard laptop (≈11 hours), and inference runs in milliseconds per step, confirming real-time deployability. The overall framework thus demonstrates that deep reinforcement learning can scale to realistic battery sizes and minute-level trading, yielding stable and interpretable policies. At the same time, the persistent strength of the RI optimiser highlights the value of structural priors, suggesting that hybrid approaches combining reinforcement learning with optimisation principles may offer the most promising path toward robust, market-ready battery trading systems.
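
As an illustration of the feasibility mechanism described in the abstract, the sketch below maps a raw policy action in [-1, 1] to a battery power setpoint that respects state-of-charge, efficiency, and liquidity limits. It is a minimal sketch under stated assumptions: the numeric parameters (power rating, charge/discharge efficiencies, available order-book depth), the sign convention, and the function name are placeholders for illustration, not values or identifiers from the thesis.

```python
import numpy as np

def action_to_power(a, soc, capacity_mwh=40.0, p_max_mw=10.0,
                    eta_c=0.95, eta_d=0.95, liquidity_mw=5.0,
                    dt_h=1.0 / 60.0):
    """Map a raw policy action a in [-1, 1] to a feasible power setpoint.

    Convention (assumed): positive power = charging, negative = discharging.
    The nominal rating is tightened so that the next state of charge stays
    in [0, capacity] and the traded volume never exceeds the order-book
    depth available in the current minute.
    """
    # Nominal request from the policy.
    p_req = a * p_max_mw

    # Headroom limits: maximum charge/discharge power over one step that
    # keeps the state of charge in [0, capacity], given efficiencies.
    p_charge_max = (capacity_mwh - soc) / (eta_c * dt_h)
    p_discharge_max = soc * eta_d / dt_h

    # Market liquidity caps the executable volume in both directions.
    upper = min(p_max_mw, p_charge_max, liquidity_mw)
    lower = -min(p_max_mw, p_discharge_max, liquidity_mw)

    # Clip the request into the feasible interval.
    p = float(np.clip(p_req, lower, upper))

    # Resulting next state of charge for the environment transition:
    # charging stores eta_c * p * dt, discharging draws |p| * dt / eta_d.
    soc_next = soc + (eta_c * p if p >= 0 else p / eta_d) * dt_h
    return p, soc_next
```

Because the clipping happens inside the environment interface, the actor network can output unconstrained actions and every executed trade remains feasible by construction, which is one common way such guarantees are implemented.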

Files

License info not available

File under embargo until 05-11-2027