Options markets offer traders powerful leverage and sophisticated hedging tools, yet their nonlinear
payoffs and high volatility can cause catastrophic losses when not traded carefully. On the other
hand, when managed properly, options can yield substantial profits, underscoring their high-risk, high-reward nature. Despite the high stakes and growing automation in finance, applying Reinforcement
Learning (RL) to intraday options trading remains underexplored.
This thesis aims to determine whether RL can learn robust and profitable trading strategies for S&P
500 options and to identify which reward designs best balance return and risk. To do so, we extend
the FinRL framework to simulate a multi-option trading environment rich in state information, including prices, Greeks, and implied volatility. FinRL is an open-source Deep Reinforcement Learning
library specifically designed for financial applications, providing standardized data pipelines, market
simulators, and agent interfaces. We chose FinRL because it offers backtesting capabilities, solid environment designs, and seamless integration with popular RL algorithms, which allowed us to focus
on customizing the option-specific state and reward structures rather than building infrastructure from
scratch.
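To make the environment design concrete, the sketch below illustrates what a per-contract observation of prices, Greeks, and implied volatility could look like in a gym-style interface. It is a minimal, illustrative sketch only: the class and field names (OptionsTradingEnv, n_options, the snapshot dictionary keys) are hypothetical, and the thesis's actual FinRL extension may organize its state and action spaces differently.

import numpy as np
import gymnasium as gym
from gymnasium import spaces

class OptionsTradingEnv(gym.Env):
    """Illustrative multi-option environment skeleton (not the thesis code)."""

    def __init__(self, n_options: int = 10):
        super().__init__()
        self.n_options = n_options
        # 6 features per contract: mid price, delta, gamma, theta, vega, implied vol
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(n_options * 6,), dtype=np.float32
        )
        # One signed position adjustment per contract, scaled to lot size elsewhere
        self.action_space = spaces.Box(
            low=-1.0, high=1.0, shape=(n_options,), dtype=np.float32
        )

    def _get_obs(self, snapshot):
        # snapshot: list of per-contract dicts holding the six features above
        feats = [
            [c["price"], c["delta"], c["gamma"], c["theta"], c["vega"], c["iv"]]
            for c in snapshot
        ]
        return np.asarray(feats, dtype=np.float32).ravel()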
We trained Proximal Policy Optimization (PPO) agents under eight distinct reward formulations,
including realized and unrealized PnL, margin penalties, and normalized profit-to-margin ratios. Evaluation
on out-of-sample data shows that many agents fail to generalize the returns they achieved during training.
Margin penalties enforce safety at the expense of profitability, and normalized rewards improve agent
behaviour slightly but suffer from unstable learning. A combined reward that integrates realized profit
and margin penalties achieves the best balance, producing small positive test returns while maintaining
the intended margin control.
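As an illustration only, such a combined reward could take a form like the sketch below, where the penalty coefficient, the margin limit, and the exact functional form are assumptions rather than the thesis's precise formulation.

def combined_reward(realized_pnl: float, margin_used: float,
                    margin_limit: float, penalty_coef: float = 1.0) -> float:
    """Illustrative combined reward: realized profit minus a penalty that
    grows when margin usage exceeds a limit (assumed form, not the thesis's
    exact definition)."""
    excess = max(0.0, margin_used - margin_limit)
    return realized_pnl - penalty_coef * excess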
Shortcomings include sensitivity to hyperparameter choices, challenges in scaling reward terms, the
choice of data granularity in the simulated environment, limited state-action representation, and the
inherently noisy nature of financial market data. These findings underscore the role of reward engineering in
learning viable options trading policies. Future work should focus on hyperparameter tuning, state
and action space engineering, and alternative RL algorithms with diverse reward functions to further
enhance robustness and real-world applicability.
Overall, this thesis contributes to bridging the gap between RL and options trading by demonstrating
the role of reward engineering in shaping agent behavior.