Investigating Theory of Mind Capabilities in Multimodal Large Language Models
A.M. van Groenestijn (TU Delft - Mechanical Engineering)
Jens Kober – Mentor (TU Delft - Learning & Autonomous Control)
Chirag Raman – Mentor (TU Delft - Pattern Recognition and Bioinformatics)
Martijn Wisse – Graduation committee member (TU Delft - Robot Dynamics)
Abstract
Human Theory of Mind (ToM), the ability to infer others' mental states, is essential for effective social interaction: it allows us to predict behavior and make decisions accordingly. In Human-Robot Interaction (HRI), however, such inference remains a significant challenge, especially in dynamic, real-world scenarios. Enabling robots to possess ToM-like capabilities has the potential to greatly improve their interaction with humans. Recent advances have introduced Large Language Models (LLMs) as robot controllers, leveraging their strengths in generalization, reasoning, and code comprehension. Some have claimed that LLMs exhibit emergent ToM capabilities, but these claims have yet to be substantiated with rigorous evidence. This study investigates the ToM-like abilities of Multimodal Large Language Models (MLLMs) using a benchmark dataset built from humans performing object rearrangement tasks in a simulated environment. The dataset captures the participants' behavior visually and their internal monologues textually. Given this dataset in text, video, or hybrid form, three state-of-the-art models predicted the participants' belief updates. While the results do not conclusively establish ToM capabilities in MLLMs, they offer promising insights into mental model inference and suggest future directions for research in this domain.
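The abstract does not specify how the models were queried, so the sketch below is only a rough illustration of the kind of hybrid (monologue text plus video frames) belief-update probe it describes. The OpenAI client, the "gpt-4o" model name, the frame file names, and the prompt wording are all assumptions made for illustration; they are not the thesis' actual pipeline or models.

```python
# Hypothetical sketch: probing an MLLM for a belief-update prediction
# from a hybrid (monologue text + video frames) dataset sample.
# Model name, file paths, and prompt wording are illustrative assumptions.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def encode_frame(path: str) -> str:
    """Base64-encode a sampled video frame for the multimodal prompt."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


# Example internal monologue and sampled frames (placeholder data).
monologue = (
    "I'll put the mug back on the shelf... wait, the plate was moved "
    "while I was away, so it must be in the drawer now."
)
frames = [encode_frame(p) for p in ["frame_010.jpg", "frame_050.jpg"]]

# Assemble one multimodal prompt: task framing + monologue + frames.
content = [
    {
        "type": "text",
        "text": (
            "A participant rearranges objects in a simulated environment. "
            "Their internal monologue and sampled video frames follow. "
            "Which belief about object locations did the participant "
            f"update, and why?\n\nMonologue: {monologue}"
        ),
    },
] + [
    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{f}"}}
    for f in frames
]

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; the thesis evaluates three unnamed SOTA models
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)
```

Text-only and video-only conditions would follow the same pattern, dropping either the image entries or the monologue from the prompt.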