Investigating Theory of Mind Capabilities in Multimodal Large Language Models

Master Thesis (2024)
Author(s)

A.M. van Groenestijn (TU Delft - Mechanical Engineering)

Contributor(s)

Jens Kober – Mentor (TU Delft - Learning & Autonomous Control)

Chirag Raman – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

Martijn Wisse – Graduation committee member (TU Delft - Robot Dynamics)

Faculty
Mechanical Engineering
Publication Year
2024
Language
English
Coordinates
51.9962559, 4.3758659
Graduation Date
23-10-2024
Awarding Institution
Delft University of Technology
Programme
Mechanical Engineering | Vehicle Engineering | Cognitive Robotics
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Human Theory of Mind (ToM), the ability to infer others’ mental states, is essential for effective social interaction: it allows us to predict behavior and make decisions accordingly. In Human-Robot Interaction (HRI), however, this remains a significant challenge, especially in dynamic, real-world scenarios. Enabling robots to possess ToM-like capabilities has the potential to greatly improve their interaction with humans. Recent advancements have introduced Large Language Models (LLMs) as robot controllers, leveraging their strengths in generalization, reasoning, and code comprehension. Some have claimed that LLMs may exhibit emergent ToM capabilities, but these claims have yet to be substantiated with rigorous evidence. This study investigates the ToM-like abilities of Multimodal Large Language Models (MLLMs) by creating a benchmark dataset of humans performing object rearrangement tasks in a simulated environment. The dataset visually captures the participants’ behavior and textually captures their internal monologues. Based on this dataset (presented as text, video, or a hybrid of the two), three state-of-the-art models made predictions about the participants’ belief updates. While the results do not conclusively establish ToM capabilities in MLLMs, they offer promising insights into mental model inference and suggest future directions for research in this domain.

Files

License info not available