Reproducing the “Flip a Coin or Vote” experiment with GPT-3.5 and GPT-4o

A simulation study of the suitability of LLMs as participants in economic and behavioural experiments

Master Thesis (2024)
Author(s)

M.J. Hutschemaekers (TU Delft - Technology, Policy and Management)

Contributor(s)

Pieter Bots – Mentor (TU Delft - Policy Analysis)

Rutger van Bergem – Graduation committee member (TU Delft - Economics of Technology and Innovation)

Sander Renes – Graduation committee member (TU Delft - Economics of Technology and Innovation)

Faculty
Technology, Policy and Management
Publication Year
2024
Language
English
Graduation Date
28-08-2024
Awarding Institution
Delft University of Technology
Programme
Complex Systems Engineering and Management (CoSEM)
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

The latest state-of-the-art large language models (LLMs) are implicit computational models of humans because of how they are trained and designed. This implies that LLMs can be used as participants in economic and behavioural experiments. However, their technical and ethical limitations have sparked ongoing debate about the usefulness of LLMs in experimental research. This study contributes to that debate by reproducing the “Flip a Coin or Vote” experiment conducted by Hoffmann & Renes (2021), with GPT-3.5 and GPT-4o as participants.
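
The abstract does not detail the prompting setup, but the basic pattern of such a simulation can be illustrated. Below is a minimal sketch, assuming the OpenAI chat completions API; the instruction text, the valuation value, and the one-word response format are hypothetical placeholders, not the protocol actually used in the thesis.

```python
# Minimal sketch: querying an LLM as a participant in a vote-vs-coin-flip
# decision round. Prompt wording and valuation are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

INSTRUCTIONS = (
    "You are a participant in an economic experiment. Your group must "
    "decide an outcome either by majority vote or by a coin flip. "
    "Your private valuation of the outcome is {valuation}. "
    "Answer with exactly one word: VOTE or COIN."
)

def ask_participant(model: str, valuation: int) -> str:
    """Send one decision round to the model and return its raw choice."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": INSTRUCTIONS.format(valuation=valuation)}],
        temperature=1.0,  # keep sampling variability, as human subjects vary
    )
    return response.choices[0].message.content.strip()

# Example: pose the same (hypothetical) round to both models under study.
for model in ("gpt-3.5-turbo", "gpt-4o"):
    print(model, ask_participant(model, valuation=-5))
```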

The findings indicate that GPT-3.5 (1) struggles to fully understand the rules of the experiment, especially when calculating payoffs; (2) has difficulty interpreting negative and positive valuations; (3) fails to match the percentage of rational choices observed in the lab results; and (4) produces results that deviate too far from the lab results to be considered human-like.

In contrast, GPT-4o shows greater promise as a ‘good’ participant. GPT-4o understands the rules of the game correctly and demonstrates a much stronger ability to make rational choices, even matching, and sometimes surpassing, the percentage of rational choices made by human participants. However, in the second part of the experiment, GPT-4o's decision-making becomes "inhumanly" accurate. Moreover, its results still deviate too far from the lab results to be considered human-like.

Future research should focus on steering the results of LLMs towards more human-like behaviour, possibly through enhanced prompting techniques or further model training and fine-tuning. It should also establish best practices for a more standardised approach to conducting simulation research with LLMs.
