Are base Large Language Models good human driver models?

The behavioural differences between a 1D merging agent controlled by a Large Language Model and human driving data

Master Thesis (2026)
Author(s)

W. Mooi (TU Delft - Mechanical Engineering)

Contributor(s)

A. Zgonnikov – Mentor (TU Delft - Human-Robot Interaction)

S.H.A. Mohammad – Mentor (TU Delft - Traffic Systems Engineering)

J.C.F. de Winter – Mentor (TU Delft - Human-Robot Interaction)

Publication Year
2026
Language
English
Graduation Date
14-01-2026
Awarding Institution
Delft University of Technology
Programme
Mechanical Engineering, Vehicle Engineering, Cognitive Robotics
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Human driver models are essential for the development and testing of Automated Driving Systems (ADS), yet current approaches often struggle to capture the complex, stochastic nature of human tactical decision-making. Large Language Models (LLMs) have emerged as potential reasoning agents capable of emulating human-like social behaviour, but their application as direct vehicle control agents remains largely underexplored.

This thesis investigates the extent to which a base LLM, guided by systematic prompt engineering, can replicate the tactical decisions and control of human drivers in a 1-D highway merging scenario. Using the OpenAI o3 model, an LLM-driven agent was developed and systematically benchmarked against a dataset of human driver behaviour recorded in a simulator experiment.
The study utilised Linear Mixed-Effects Regression (LMER) to analyse decision-making mechanisms and performed a sensitivity analysis using the Google Gemini-2.5-pro model to assess generalisability.
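An LMER analysis of this kind can be sketched as follows. This is a minimal, hypothetical example using `statsmodels`; the variable names (`yield_margin`, `relative_velocity`, `headway_advantage`, `participant`) are illustrative assumptions, not the thesis's actual dataset schema or model specification.

```python
# Hypothetical LMER sketch: predictors of a merge outcome, with a random
# intercept per participant to capture individual driving style.
# Synthetic data only; column names are assumptions, not the thesis schema.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "yield_margin": rng.normal(4.0, 1.5, n),      # e.g. safety margin [m]
    "relative_velocity": rng.normal(0.0, 1.0, n),  # ego minus opponent [m/s]
    "headway_advantage": rng.normal(0.0, 1.0, n),  # opponent headway lead [m]
    "participant": rng.integers(0, 20, n),         # grouping factor
})

# Fixed effects for the kinematic predictors, random intercept per participant.
model = smf.mixedlm(
    "yield_margin ~ relative_velocity + headway_advantage",
    df,
    groups=df["participant"],
)
result = model.fit()
print(result.summary())
```

The fitted fixed-effect coefficients (and their p-values) indicate which kinematic cues drive the outcome, which is the kind of comparison the thesis draws between human drivers and the LLM agent.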

The results demonstrate that the LLM agent successfully replicated high-level tactical behaviours, satisfying qualitative criteria such as symmetrical yielding in neutral conditions and increased yield rates when the opposing vehicle held a headway advantage. However, a fundamental disparity was observed in operational control. While human drivers relied significantly on relative velocity to negotiate merges (p = 1.88 × 10⁻²⁶), the LLM adopted a conservative, calculation-heavy gap-based strategy driven by absolute distance, resulting in average safety margins more than double the human benchmark (9.18 m vs. 3.85 m). Furthermore, a sensitivity analysis revealed severe model dependency. While the optimised prompt achieved a 0.0% collision rate with the o3 model, it resulted in a 25.5% collision rate with Gemini-2.5-pro.

This research concludes that while base LLMs possess the emergent reasoning capabilities to function as high-level strategic agents, their lack of continuous perceptual flow limits their validity as direct operational controllers. The findings suggest that future implementations should adopt hierarchical architectures, leveraging LLMs for tactical reasoning while relying on physics-based controllers for dynamic execution.
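The recommended hierarchical split can be sketched in a few lines. This is an illustrative toy, not the thesis's implementation: the decision stub stands in for an LLM call, and all thresholds and gains are invented for demonstration.

```python
# Hedged sketch of a hierarchical driving architecture: a (stubbed) LLM
# issues tactical decisions at low frequency, while a simple physics-based
# proportional controller handles continuous execution. All names, gains,
# and thresholds are illustrative assumptions.

def llm_tactical_decision(gap_m: float, relative_velocity_mps: float) -> str:
    """Stand-in for a low-frequency LLM query; returns 'yield' or 'merge'."""
    return "yield" if gap_m < 10.0 else "merge"

def physics_controller(decision: str, gap_m: float) -> float:
    """Proportional gap controller tracking a tactic-dependent target gap."""
    target_gap_m = 15.0 if decision == "yield" else 5.0
    k_p = 0.5  # illustrative gain
    return k_p * (gap_m - target_gap_m)  # acceleration command [m/s^2]

# One step: tactical layer decides, operational layer executes continuously.
decision = llm_tactical_decision(gap_m=8.0, relative_velocity_mps=-1.2)
accel_cmd = physics_controller(decision, gap_m=8.0)
```

The design point is the separation of time scales: the LLM only needs to reason occasionally about what to do, while the controller, which does have continuous perceptual flow, decides how to do it at every time step.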
