Large Multimodal Models Evaluation: A Survey

Review (2025)

Author(s)

Zicheng Zhang (Shanghai Artificial Intelligence Laboratory)

Junying Wang (Shanghai Artificial Intelligence Laboratory, Fudan University)

Farong Wen (Shanghai Artificial Intelligence Laboratory, Shanghai Jiao Tong University)

Yijin Guo (Shanghai Jiao Tong University, Shanghai Artificial Intelligence Laboratory)

Xiangyu Zhao (Shanghai Artificial Intelligence Laboratory, Shanghai Jiao Tong University)

Xinyu Fang (Shanghai Artificial Intelligence Laboratory, Zhejiang University - Hangzhou)

Shengyuan Ding (Fudan University, Shanghai Artificial Intelligence Laboratory)

Xuemei Zhou (TU Delft - Multimedia Computing)

Guangtao Zhai (Shanghai Artificial Intelligence Laboratory)

et al. (additional authors not listed)

Research Group
Multimedia Computing
DOI (related publication)
https://doi.org/10.1007/s11432-025-4676-4
Publication Year
2025
Language
English
Bibliographical Note
Green Open Access added to the TU Delft Institutional Repository as part of the Taverne amendment. More information about this copyright law amendment can be found at https://www.openaccess.nl. Otherwise, as indicated in the copyright section, the publisher is the copyright holder of this work, and the author uses Dutch legislation to make this work public.
Journal title
Science China Information Sciences
Issue number
12
Volume number
68
Article number
221301
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward, or distribute the text or any part of it without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

As large multimodal models (LMMs) advance rapidly across diverse multimodal understanding and generation tasks, the need for systematic and reliable evaluation frameworks becomes increasingly critical. To address this need, this survey provides a structured overview of LMM evaluation, centered on two main axes: multimodal evaluation for understanding and for generation. (1) For understanding, a dual-perspective framework is introduced that distinguishes benchmarks of general capabilities, which emphasize common tasks, from benchmarks of specialized capabilities, which reflect expert-level competence in domain-specific fields. (2) For generation, evaluation is organized by output modality, covering image, video, audio, and 3D content. (3) From a community perspective, the survey further highlights authoritative leaderboards and foundational tools that have been instrumental in establishing a comprehensive evaluation ecosystem for LMMs. By unifying general-specialized understanding evaluation with modality-specific generation evaluation, this survey clarifies the current landscape and provides guidance for future research in LMM evaluation.

Files

S11432-025-4676-4.pdf
(PDF | 5 MB)
License info not available

File under embargo until 18-05-2026