Performance of Large Language Models in Prediction Markets
K.H. Halldórsson (TU Delft - Technology, Policy and Management)
A.Y. Ding – Mentor
S. Renes – Mentor (TU Delft - Economics of Technology and Innovation)
R. van Bergem – Mentor (TU Delft - Economics of Technology and Innovation)
Abstract
Decision-making under uncertainty often relies on access to accurate probabilistic forecasts, yet in many contexts such forecasts are scarce or difficult to obtain. Decentralised prediction markets are widely regarded as effective tools for aggregating dispersed information into probabilities that decision-makers can use to plan for an uncertain future. Recent advances in large language models (LLMs) have prompted claims that they could complement or replace market-based forecasting by synthesising information without the need for incentive-driven markets. However, there is limited empirical evidence comparing LLM forecasts to market-based aggregation under real-world conditions. This thesis puts these claims to the test by examining the extent to which LLMs can replicate or complement human forecasting as reflected in decentralised prediction markets. Using Polymarket as a benchmark for collective human forecasting, probability forecasts generated by LLMs are compared to live market probabilities. Forecasting accuracy is evaluated across different market conditions, and the decision-making relevance of LLM forecasts is assessed through trading simulations. The results show that market probabilities are consistently more accurate than LLM forecasts, and this finding holds across all evaluated models, model combinations, prompting strategies, market stages, and liquidity levels. A regression-based aggregation model that mixes market probabilities with LLM forecasts achieves predictive performance comparable to that of the market in some cases, but it fails to generalise under realistic conditions. These findings suggest that LLMs, at their current stage, cannot substitute for prediction markets as information aggregation mechanisms; they challenge claims that LLMs can match markets in producing accurate probabilistic forecasts, and they highlight the need for caution when deploying LLMs in high-stakes decision-making.
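The abstract describes two evaluation steps: scoring LLM forecasts against live market probabilities, and fitting a regression-based blend of the two. The sketch below illustrates one plausible way to do this in Python; it is not the thesis code, and the file name, column names, and the choice of a logistic regression on log-odds features are assumptions for illustration only.

```python
# Minimal sketch (not the thesis implementation): score market and LLM
# probability forecasts with the Brier score, then fit a simple
# regression-based blend of the two. Data layout below is hypothetical.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("forecasts.csv")        # hypothetical: one row per market snapshot
y = df["outcome"].to_numpy()             # realised outcome: 1 = Yes, 0 = No
p_market = df["market_prob"].to_numpy()  # live market (e.g. Polymarket) probability
p_llm = df["llm_prob"].to_numpy()        # probability elicited from the LLM

def brier(p, y):
    """Mean squared error between probability forecasts and realised outcomes."""
    return float(np.mean((p - y) ** 2))

print("Brier (market):", brier(p_market, y))
print("Brier (LLM):   ", brier(p_llm, y))

# Regression-based aggregation: learn weights that mix the two forecasts.
# Logistic regression on log-odds features is one common choice; the thesis
# may use a different specification.
eps = 1e-6
logit = lambda p: np.log(np.clip(p, eps, 1 - eps) / np.clip(1 - p, eps, 1 - eps))
X = np.column_stack([logit(p_market), logit(p_llm)])
blend = LogisticRegression().fit(X, y)
p_blend = blend.predict_proba(X)[:, 1]
print("Brier (blend): ", brier(p_blend, y))
```

In practice such a blend would need to be fit and evaluated on separate time periods or markets; in-sample scores like the one above tend to overstate how well the aggregation generalises, which is consistent with the abstract's observation that the blended model fails under realistic conditions.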
https://github.com/KetillHafdal/llm-vs-prediction-markets