Reinforcement Learning with Self-Play and Domain Randomisation for Robust Market Making

Master Thesis (2025)
Author(s)

J. Teurlings (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

M.M. Celikok – Mentor (TU Delft - Sequential Decision Making)

F. Yu – Mentor (TU Delft - Applied Probability)

Frans A. Oliehoek – Mentor (TU Delft - Sequential Decision Making)

A. Papapantoleon – Graduation committee member (TU Delft - Applied Probability)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2025
Language
English
Graduation Date
26-08-2025
Awarding Institution
Delft University of Technology
Programme
Computer Science
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Modern electronic markets reward liquidity providers that can continuously quote competitive bid–ask spreads while dynamically controlling risk. This thesis investigates whether Reinforcement Learning (RL) can produce robust market-making strategies when an agent is trained in two settings that have so far received little attention in the literature: Domain Randomisation (DR) and Self-Play (SP).
We first design MMakr, an extension of the ABIDES-Gym Limit Order Book (LOB) simulator that (i)
exposes a dimensionless, continuous control interface, (ii) supports per-component reward shaping,
(iii) gives the option to randomise liquidity, volatility and order-flow regimes on every episode, and (iv) allows earlier policy snapshots to be injected as adaptive opponents, thus natively enabling DR and SP training.
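As an illustration only, the following is a minimal Python sketch of the kind of interface described above; all names (RandomisedLOBEnv, Regime, the regime ranges) are hypothetical assumptions and do not reflect the actual MMakr or ABIDES-Gym code.

import random
from dataclasses import dataclass


@dataclass
class Regime:
    """Per-episode market regime, resampled when domain randomisation is on."""
    liquidity: float        # background order arrival intensity
    volatility: float       # mid-price volatility of the fundamental
    flow_imbalance: float   # directional bias of background order flow


class RandomisedLOBEnv:
    """Gym-style LOB market-making environment with per-episode regime
    randomisation (DR) and optional earlier-policy opponents (SP)."""

    def __init__(self, randomise=True, opponent_pool=None, seed=None):
        self.randomise = randomise                      # (iii) DR switch
        self.opponent_pool = list(opponent_pool or [])  # (iv) policy snapshots
        self.rng = random.Random(seed)
        self.regime = None
        self.opponent = None

    def reset(self):
        # (iii) resample liquidity / volatility / order-flow regime each episode
        if self.randomise:
            self.regime = Regime(
                liquidity=self.rng.uniform(0.5, 2.0),
                volatility=self.rng.uniform(0.01, 0.05),
                flow_imbalance=self.rng.uniform(-0.3, 0.3),
            )
        else:
            self.regime = Regime(1.0, 0.02, 0.0)
        # (iv) draw an earlier policy snapshot as an adaptive opponent
        self.opponent = self.rng.choice(self.opponent_pool) if self.opponent_pool else None
        return self._observe()

    def step(self, action):
        # (i) dimensionless continuous action in [-1, 1]^2, interpreted as
        # bid/ask offsets relative to the prevailing half-spread
        bid_offset, ask_offset = (max(-1.0, min(1.0, a)) for a in action)
        # ... quoting at these offsets against background flow and the opponent
        # is elided in this sketch ...
        # (ii) return each reward term separately so it can be shaped/weighted
        reward_components = {
            "pnl": 0.0, "spread_capture": 0.0, "fill_ratio": 0.0,
            "inventory_cost": 0.0, "quote_cliff": 0.0, "terminal_inventory": 0.0,
        }
        return self._observe(), reward_components, False, {}

    def _observe(self):
        return {"regime": self.regime, "inventory": 0.0}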
Using MMakr, we train Soft Actor-Critic (SAC) and Proximal Policy Optimization (PPO) agents
under three curricula: a fixed single configuration, DR, and SP. A carefully tuned six-term reward—
combining directional profit, spread capture, fill ratio, inventory cost, quote-cliff and terminal inventory penalties—guides learning towards realistic quoting behaviour while keeping risk in check.
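As a sketch of how such per-component shaping could be combined, assuming placeholder weights and component names rather than the tuned values from the thesis:

# Illustrative weights for the six reward terms; the actual tuned values and
# exact term definitions in the thesis may differ.
REWARD_WEIGHTS = {
    "pnl": 1.0,                  # directional profit
    "spread_capture": 0.5,       # reward for earning the quoted spread
    "fill_ratio": 0.2,           # encourage quotes that actually trade
    "inventory_cost": -0.3,      # penalise running inventory
    "quote_cliff": -0.2,         # penalise quoting far from the touch
    "terminal_inventory": -1.0,  # penalise inventory left at episode end
}


def shaped_reward(components: dict) -> float:
    """Combine per-component signals (as returned by the environment sketch
    above) into a single scalar training reward."""
    return sum(REWARD_WEIGHTS[name] * value for name, value in components.items())

Keeping the weighted combination outside the environment means the weights can be varied without changing the simulator, which is what makes a reward-component weight search (see the sketch after the contributions below) straightforward.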
Experiments on six previously unseen market scenarios show that SAC learns profitable policies in
the single-configuration setting but over-fits and degrades out-of-sample. DR substantially improves
PPO’s stability and generalisation, while forcing SAC to adopt more conservative quoting, thereby
degrading its performance. SP introduces non-stationarity that the SAC agent was unable to overcome, whereas on-policy PPO shows promise in coping with it but fails to learn a meaningful policy within the allocated training time.
The thesis contributes (a) the open-source MMakr environment, (b) an optimisation framework for
reward-component weight search, and (c) a systematic comparison of DR and SP in a realistic multi-
agent LOB simulator. While results reveal clear benefits of environmental diversity, they also highlight the brittleness of current RL algorithms under adversarial liquidity conditions, pointing to the need for curriculum-based randomisation, risk-aware objectives and more carefully configured and realistic simulators in future work.
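A minimal sketch of what a reward-weight search in the spirit of contribution (b) could look like is given below, assuming a plain random search over signed weights and a user-supplied evaluation function; the thesis's actual optimisation framework is not reproduced here.

import random


def random_weight_search(evaluate, component_names, n_trials=50, seed=0):
    """Sample candidate reward-weight vectors and keep the best-scoring one.

    `evaluate` is assumed to train and evaluate an agent under the given
    weights and return a scalar score (e.g. out-of-sample PnL or a
    risk-adjusted metric).
    """
    rng = random.Random(seed)
    best_weights, best_score = None, float("-inf")
    for _ in range(n_trials):
        # search over signed weights; in practice, penalty terms would
        # typically be constrained to be negative
        weights = {name: rng.uniform(-1.0, 1.0) for name in component_names}
        score = evaluate(weights)
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score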

Files

Thesis_Final_report.pdf
(pdf | 6.91 MB)
- Embargo expired on 19-08-2025
License info not available