Reproducing the “Flip a Coin or Vote” experiment with GPT-3.5 and GPT-4o
A simulation study on the suitability of LLMs as participants in economic and behavioural experiments
Abstract
The latest state-of-the-art large language models (LLMs) are implicit computational models of humans due to how they are trained and designed. This implies that LLMs can be used as participants in economic and behavioural experiments. However, their technical and ethical limitations have sparked ongoing debate about the usefulness of LLMs in experimental research. This study contributes to the research area by reproducing the “Flip a Coin or Vote” experiment conducted by Hoffmann & Renes (2021), with GPT-3.5 and GPT-4o as participants.
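To illustrate the reproduction setup, the sketch below shows how a model can be queried as a simulated participant through the OpenAI Python SDK. The instruction text, the valuation, and the response format are illustrative placeholders, not the verbatim materials of Hoffmann & Renes (2021) or the exact prompts used in this study.

```python
# Minimal sketch of querying OpenAI models as simulated participants.
# The instructions and valuation below are placeholders, not the actual
# experimental materials from Hoffmann & Renes (2021).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

INSTRUCTIONS = (
    "You are a participant in a group decision experiment. "
    "Your group must choose a decision rule: flip a coin or vote. "
    "Your private valuation of the proposal is {valuation}. "
    "Answer with exactly one word: COIN or VOTE."
)

def ask_participant(model: str, valuation: int) -> str:
    """Present the (placeholder) instructions to one model and
    return its one-word choice."""
    response = client.chat.completions.create(
        model=model,
        temperature=1.0,  # keep the model's default response variability
        messages=[
            {"role": "user", "content": INSTRUCTIONS.format(valuation=valuation)},
        ],
    )
    return response.choices[0].message.content.strip().upper()

if __name__ == "__main__":
    # Query both models with the same negative valuation, since the study
    # reports that interpreting negative valuations is a failure mode.
    for model in ("gpt-3.5-turbo", "gpt-4o"):
        print(model, ask_participant(model, valuation=-5))
```

In practice, such a script would be run many times per treatment to build a sample of simulated choices comparable to the lab data.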
The findings indicate that GPT-3.5 (1) struggles to fully understand the rules of the experiment, especially in calculating payoffs; (2) has difficulty interpreting negative and positive valuations; (3) fails to match the percentage of rational choices observed in the lab results; and (4) produces results that deviate too far from the lab results to be considered human-like.
In contrast, GPT-4o shows greater promise as a ‘good’ participant. GPT-4o understands the rules of the game correctly and demonstrates a much stronger ability to make rational choices, even matching, and sometimes surpassing, the percentage of rational choices made by human participants. However, in the second part of the experiment, GPT-4o's decision-making becomes "inhumanly" accurate, and its results likewise deviate too far from the lab results to be considered human-like.
Future research should focus on steering the results of LLMs towards more human-like outcomes, possibly through enhanced prompting techniques or further model training and fine-tuning. Future work should also establish best practices for a more standardised approach to conducting simulation research with LLMs.