Investigating Theory of Mind Capabilities in Multimodal Large Language Models
A.M. van Groenestijn (TU Delft - Mechanical Engineering)
Jens Kober – Mentor (TU Delft - Learning & Autonomous Control)
Chirag Raman – Mentor (TU Delft - Pattern Recognition and Bioinformatics)
Martijn Wisse – Graduation committee member (TU Delft - Robot Dynamics)
Abstract
Human Theory of Mind (ToM), the ability to infer others' mental states, is essential for effective social interaction: it allows us to predict behavior and make decisions accordingly. In Human-Robot Interaction (HRI), however, such inference remains a significant challenge, especially in dynamic, real-world scenarios. Enabling robots to possess ToM-like capabilities has the potential to greatly improve their interaction with humans. Recent advances have introduced Large Language Models (LLMs) as robot controllers, leveraging their strengths in generalization, reasoning, and code comprehension. Some have claimed that LLMs exhibit emergent ToM capabilities, but these claims have yet to be substantiated with rigorous evidence. This study investigates the ToM-like abilities of Multimodal Large Language Models (MLLMs) using a benchmark dataset built from humans performing object rearrangement tasks in a simulated environment. The dataset captures the participants' behavior visually and their internal monologues textually. Given this dataset in text, video, or hybrid form, three state-of-the-art models predicted the participants' belief updates. While the results do not conclusively establish ToM capabilities in MLLMs, they offer promising insights into mental model inference and suggest future directions for research in this domain.
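The abstract does not specify how the models were queried, so the sketch below is only a rough illustration of the kind of hybrid (monologue text plus video frames) belief-update probe it describes. The OpenAI client, the "gpt-4o" model name, the frame file names, and the prompt wording are all assumptions made for illustration; they are not the thesis' actual pipeline or models.

```python
# Hypothetical sketch: probing an MLLM for a belief-update prediction
# from a hybrid (monologue text + video frames) dataset sample.
# Model name, file paths, and prompt wording are illustrative assumptions.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def encode_frame(path: str) -> str:
    """Base64-encode a sampled video frame for the multimodal prompt."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


# Example internal monologue and sampled frames (placeholder data).
monologue = (
    "I'll put the mug back on the shelf... wait, the plate was moved "
    "while I was away, so it must be in the drawer now."
)
frames = [encode_frame(p) for p in ["frame_010.jpg", "frame_050.jpg"]]

# Assemble one multimodal prompt: task framing + monologue + frames.
content = [
    {
        "type": "text",
        "text": (
            "A participant rearranges objects in a simulated environment. "
            "Their internal monologue and sampled video frames follow. "
            "Which belief about object locations did the participant "
            f"update, and why?\n\nMonologue: {monologue}"
        ),
    },
] + [
    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{f}"}}
    for f in frames
]

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; the thesis evaluates three unnamed SOTA models
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)
```

Text-only and video-only conditions would follow the same pattern, dropping either the image entries or the monologue from the prompt.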