Deep Reinforcement Learning for Battery Arbitrage in the Continuous Intraday Market

Scaling to Minute-Level Trading with TD3+BC under Realistic Market Constraints

Master Thesis (2025)
Author(s)

T.J. van Os (TU Delft - Mechanical Engineering)

Contributor(s)

S. Grammatico – Mentor (TU Delft - Team Sergio Grammatico)

A. Mallick – Mentor (TU Delft - Team Peyman Mohajerin Esfahani)

Faculty
Mechanical Engineering
Publication Year
2025
Language
English
Coordinates
52.380324788885986, 4.888964689481552
Graduation Date
01-11-2025
Awarding Institution
Delft University of Technology
Programme
Mechanical Engineering | Systems and Control
Sponsors
None
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

This thesis investigates the minute-level operation of a 40 MWh battery in the Continuous Intraday Market. The study focuses on the Dutch market within its European cross-border setting and formulates dispatch as a finite-horizon Markov Decision Process that captures both battery physics and order-book depth. A continuous action-to-power mapping guarantees feasibility by enforcing state-of-charge, efficiency, and market liquidity constraints. Two deep reinforcement learning methods, TD3 and TD3 with behaviour cloning (TD3+BC), are implemented and benchmarked against a rolling-intrinsic (RI) optimiser, which serves as both baseline and source of expert trajectories. Minute-level resolution proves empirically justified: the RI benchmark at one-minute granularity consistently outperforms its fifteen-minute counterpart. In comparative experiments, TD3+BC outperforms plain TD3 and narrows the gap to RI, reaching within about 4% of its profit on average, though not consistently surpassing it. The learned policy exhibits a distinct cycle-efficient trading style with fewer equivalent full cycles and lower throughput than RI but higher value extracted per cycle, which translates over the project lifetime into a stronger business case, yielding an internal rate of return approximately twice that of the RI baseline. Training TD3+BC for five million steps is computationally tractable on a standard laptop (≈11 hours), and inference runs in milliseconds per step, confirming real-time deployability. The overall framework thus demonstrates that deep reinforcement learning can scale to realistic battery sizes and minute-level trading, yielding stable and interpretable policies. At the same time, the persistent strength of the RI optimiser highlights the value of structural priors, suggesting that hybrid approaches combining reinforcement learning with optimisation principles may offer the most promising path toward robust, market-ready battery trading systems.
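
As an illustration of the feasibility mechanism described in the abstract, the sketch below maps a raw policy action in [-1, 1] to a battery power setpoint that respects state-of-charge, efficiency, and liquidity limits. It is a minimal sketch under stated assumptions: the numeric parameters (power rating, charge/discharge efficiencies, available order-book depth), the sign convention, and the function name are placeholders for illustration, not values or identifiers from the thesis.

```python
import numpy as np

def action_to_power(a, soc, capacity_mwh=40.0, p_max_mw=10.0,
                    eta_c=0.95, eta_d=0.95, liquidity_mw=5.0,
                    dt_h=1.0 / 60.0):
    """Map a raw policy action a in [-1, 1] to a feasible power setpoint.

    Convention (assumed): positive power = charging, negative = discharging.
    The nominal rating is tightened so that the next state of charge stays
    in [0, capacity] and the traded volume never exceeds the order-book
    depth available in the current minute.
    """
    # Nominal request from the policy.
    p_req = a * p_max_mw

    # Headroom limits: maximum charge/discharge power over one step that
    # keeps the state of charge in [0, capacity], given efficiencies.
    p_charge_max = (capacity_mwh - soc) / (eta_c * dt_h)
    p_discharge_max = soc * eta_d / dt_h

    # Market liquidity caps the executable volume in both directions.
    upper = min(p_max_mw, p_charge_max, liquidity_mw)
    lower = -min(p_max_mw, p_discharge_max, liquidity_mw)

    # Clip the request into the feasible interval.
    p = float(np.clip(p_req, lower, upper))

    # Resulting next state of charge for the environment transition:
    # charging stores eta_c * p * dt, discharging draws |p| * dt / eta_d.
    soc_next = soc + (eta_c * p if p >= 0 else p / eta_d) * dt_h
    return p, soc_next
```

Because the clipping happens inside the environment interface, the actor network can output unconstrained actions and every executed trade remains feasible by construction, which is one common way such guarantees are implemented.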

Files

License info not available

File under embargo until 05-11-2027