Reproducing the “Flip a Coin or Vote” experiment with GPT-3.5 and GPT-4o

A simulation study of the suitability of LLMs as participants in economic and behavioural experiments

Master Thesis (2024)
Author(s)

M.J. Hutschemaekers (TU Delft - Technology, Policy and Management)

Contributor(s)

Pieter Bots – Mentor (TU Delft - Policy Analysis)

Rutger van Bergem – Graduation committee member (TU Delft - Economics of Technology and Innovation)

Sander Renes – Graduation committee member (TU Delft - Economics of Technology and Innovation)

Faculty
Technology, Policy and Management
Publication Year
2024
Language
English
Graduation Date
28-08-2024
Awarding Institution
Delft University of Technology
Programme
Complex Systems Engineering and Management (CoSEM)
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

The latest state-of-the-art large language models (LLMs) are implicit computational models of humans because of how they are trained and designed. This implies that LLMs can be used as participants in economic and behavioural experiments. However, their technical and ethical limitations have sparked ongoing debate about the usefulness of LLMs in experimental research. This study contributes to that debate by reproducing the “Flip a Coin or Vote” experiment conducted by Hoffmann & Renes (2021), with GPT-3.5 and GPT-4o as participants.
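
The abstract does not detail the prompting setup, but the basic pattern of such a simulation can be illustrated. Below is a minimal sketch, assuming the OpenAI chat completions API; the instruction text, the valuation value, and the one-word response format are hypothetical placeholders, not the protocol actually used in the thesis.

```python
# Minimal sketch: querying an LLM as a participant in a vote-vs-coin-flip
# decision round. Prompt wording and valuation are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

INSTRUCTIONS = (
    "You are a participant in an economic experiment. Your group must "
    "decide an outcome either by majority vote or by a coin flip. "
    "Your private valuation of the outcome is {valuation}. "
    "Answer with exactly one word: VOTE or COIN."
)

def ask_participant(model: str, valuation: int) -> str:
    """Send one decision round to the model and return its raw choice."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": INSTRUCTIONS.format(valuation=valuation)}],
        temperature=1.0,  # keep sampling variability, as human subjects vary
    )
    return response.choices[0].message.content.strip()

# Example: pose the same (hypothetical) round to both models under study.
for model in ("gpt-3.5-turbo", "gpt-4o"):
    print(model, ask_participant(model, valuation=-5))
```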

The findings indicate that GPT-3.5 (1) struggles to fully understand the rules of the experiment, especially when calculating payoffs; (2) has difficulty interpreting negative and positive valuations; (3) fails to match the percentage of rational choices observed in the lab results; and (4) produces results that deviate too far from the lab results to be considered human-like.

In contrast, GPT-4o shows greater promise as a ‘good’ participant. GPT-4o understands the rules of the game correctly and demonstrates a much stronger ability to make rational choices, even matching, and sometimes surpassing, the percentage of rational choices made by human participants. However, in the second part of the experiment, GPT-4o's decision-making becomes "inhumanly" accurate. Moreover, its results still deviate too far from the lab results to be considered human-like.

Future research should focus on steering the results of LLMs towards more human-like behaviour, possibly through enhanced prompting techniques or further model training and fine-tuning. It should also establish best practices for a more standardised approach to conducting simulation research with LLMs.
