M.M. Celikok | TU Delft Repository

Reinforcement Learning with Self-Play and Domain Randomisation for Robust Market Making

Master thesis (2025) - J. Teurlings (author), M.M. Celikok (mentor), F. Yu (mentor), FA Oliehoek (mentor), A. Papapantoleon (graduation committee member)

Modern electronic markets reward liquidity providers that can continuously quote competitive bid–ask spreads while dynamically controlling risk. This thesis investigates whether Reinforcement Learning (RL) can produce robust market-making strategies when an agent is trained in two settings that have so far received little attention in the literature: Domain Randomisation (DR) and Self-Play (SP).
We first design MMakr, an extension of the ABIDES-Gym Limit Order Book (LOB) simulator that (i) exposes a dimensionless, continuous control interface, (ii) supports per-component reward shaping, (iii) can randomise liquidity, volatility and order-flow regimes on every episode, and (iv) allows earlier policy snapshots to be injected as adaptive opponents, thus natively enabling DR and SP training.
Using MMakr, we train Soft Actor-Critic (SAC) and Proximal Policy Optimization (PPO) agents under three curricula: a fixed single configuration, DR, and SP. A carefully tuned six-term reward, combining directional profit, spread capture, fill ratio, inventory cost, quote-cliff and terminal inventory penalties, guides learning towards realistic quoting behaviour while keeping risk in check.
Experiments on six previously unseen market scenarios show that SAC learns profitable policies in the single-configuration setting but overfits and degrades out-of-sample. DR substantially improves PPO's stability and generalisation, while forcing SAC to adopt more conservative quoting, thereby degrading its performance. SP introduces non-stationarity that SAC was unable to overcome, whereas on-policy PPO shows promise in solving the problem but fails to find meaningful information in the time allocated.
The thesis contributes (a) the open-source MMakr environment, (b) an optimisation framework for reward-component weight search, and (c) a systematic comparison of DR and SP in a realistic multi-agent LOB simulator. While results reveal clear benefits of environmental diversity, they also highlight the brittleness of current RL algorithms under adversarial liquidity conditions, pointing to the need for curriculum-based randomisation, risk-aware objectives and more carefully configured and realistic simulators in future work.
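
As a rough illustration of two ideas in the abstract above (re-sampling the market regime on every episode, and combining six reward components with tunable weights), the following Python sketch may help. Every name in it (`MarketRegime`, `sample_regime`, `shaped_reward`, the term names, the ranges and the weights) is hypothetical and chosen for illustration only; none of it is taken from MMakr or ABIDES-Gym, and the thesis may define these quantities differently.

```python
import random
from dataclasses import dataclass

# Hypothetical term names mirroring the six components listed in the abstract.
REWARD_TERMS = (
    "directional_profit",
    "spread_capture",
    "fill_ratio",
    "inventory_cost",
    "quote_cliff_penalty",
    "terminal_inventory_penalty",
)


@dataclass(frozen=True)
class MarketRegime:
    """Hypothetical per-episode market parameters (not MMakr's real config)."""
    liquidity: float
    volatility: float
    order_flow_rate: float


def sample_regime(rng: random.Random, ranges: dict) -> MarketRegime:
    """Domain randomisation: draw each parameter uniformly from its range."""
    return MarketRegime(
        **{name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}
    )


def shaped_reward(components: dict, weights: dict) -> float:
    """Weighted sum of the six components; costs and penalties get negative weights."""
    missing = set(REWARD_TERMS) - set(components)
    if missing:
        raise KeyError(f"missing reward components: {sorted(missing)}")
    return sum(weights[term] * components[term] for term in REWARD_TERMS)
```

In a training loop under these assumptions, `sample_regime` would be called once at the start of each episode, and `shaped_reward` once per step with the component values reported by the environment. A weight search like the one the thesis mentions would then vary the `weights` dictionary while leaving the component definitions fixed.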

Detecting Environment Changes via Quantile Spread in Quantile Regression Deep-Q Networks

Bachelor thesis (2025) - P. Stan (author), M.M. Celikok (mentor), Frans A Oliehoek (mentor), Annibale Panichella (graduation committee member)

Reinforcement learning agents are trained in well-defined environments and evaluated under the assumption that test-time conditions match those encountered during training. However, even small changes in the environment’s dynamics can degrade the policy’s performance, even mo ...

Evaluating and Enhancing the Robustness of Proximal Policy Optimization to Test-Time Corruptions in Sequential Domains

Bachelor thesis (2025) - M.R. Rodić (author), M.M. Celikok (mentor), Frans A Oliehoek (mentor), Annibale Panichella (graduation committee member)

Reinforcement learning (RL) agents often achieve impressive results in simulation but can fail catastrophically when facing small deviations at deployment time. In this work, we examine the brittleness of Proximal Policy Optimization (PPO) agents when subjected to test-time obser ...

Evaluating the Robustness of SAC under Distributional Shifts in Driving Domain

Bachelor thesis (2025) - L. Polovina (author), F.A. Oliehoek (mentor), M.M. Celikok (mentor)

Reinforcement Learning (RL) has shown strong potential in complex decision-making domains, but its susceptibility to distributional shifts between training and deployment environments remains a significant barrier to real-world reliability, particularly in safety-critical contexts su ...

Evaluating the Robustness of DQN and QR-DQN in Traffic Simulation

Analyzing the Effect of Quantile Manipulation in Environmental Variability

Bachelor thesis (2025) - C. Toadere (author), M.M. Celikok (mentor), Frans A Oliehoek (graduation committee member), Annibale Panichella (graduation committee member)

As autonomous driving systems advance, ensuring the robustness of underlying decision-making algorithms becomes increasingly critical. This study assesses the performance and reliability of two reinforcement learning models, Deep Q-Network (DQN) and Quantile Regression DQN (QR-DQ ...

Evaluating the robustness of DQN and QR-DQN under domain randomization

Analyzing the effects of domain variation on value-based methods

Bachelor thesis (2025) - Y. Zwetsloot (author), M.M. Celikok (mentor), Frans A Oliehoek (mentor), Annibale Panichella (graduation committee member)

Domain randomization (or DR) is a widely used technique in reinforcement learning to improve robustness and enable sim-to-real transfer. While prior work has focused extensively on DR in combination with algorithms such as PPO and SAC, its effects on value-based methods like DQN ...

Multi-Agent Reinforcement Learning for Portfolio Management

Master thesis (2024) - M. Choi (author), Jenny Della Santina (mentor), M.M. Celikok (mentor), Mike Chen (mentor), Clint Howard (mentor), Danny Huang (mentor), Javier Alonso-Mora (graduation committee member)

Reinforcement learning (RL) is a powerful tool in which agents (or “robots”) learn from the environment based on their actions. Reinforcement learning approaches have been found successful at combining stock-return prediction with portfolio allocation. Diversification is a critic ...