Multimodal Quantitative Measures for Multiparty Behavior Evaluation

Conference Paper (2025)
Author(s)

O.K. Shirekar (TU Delft - Pattern Recognition and Bioinformatics)

Wim Pouw (Tilburg University)

C. Hao (TU Delft - Pattern Recognition and Bioinformatics)

Vrushank Phadnis (Google LLC)

Thabo Beeler (Google Switzerland GmbH)

C.A. Raman (TU Delft - Pattern Recognition and Bioinformatics)

Research Group
Pattern Recognition and Bioinformatics
DOI
https://doi.org/10.1145/3716553.3750752
Publication Year
2025
Language
English
Pages (from-to)
249-264
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Digital humans are emerging as autonomous agents in multiparty interactions, yet existing evaluation metrics largely ignore contextual coordination dynamics. We introduce a unified, intervention-driven framework for objective assessment of multiparty social behaviour in skeletal motion data, spanning three complementary dimensions: (1) synchrony via Cross-Recurrence Quantification Analysis (CRQA), (2) temporal alignment via Multiscale Empirical Mode Decomposition-based Beat Consistency, and (3) structural similarity via Soft Dynamic Time Warping (Soft-DTW). We validate metric sensitivity through three theory-driven perturbations (gesture kinematic dampening, uniform speech-gesture delays, and prosodic pitch-variance reduction) applied to ≈145 thirty-second thin slices of group interactions from the DnD dataset. Mixed-effects analyses reveal predictable, joint-independent shifts: dampening increases CRQA determinism and reduces beat consistency, delays weaken cross-participant coupling, and pitch flattening elevates F0 Soft-DTW costs. A complementary perception study (N = 27) compares judgments of full-video and skeleton-only renderings to quantify representation effects. Our three measures deliver orthogonal insights into spatial structure, timing alignment, and behavioural variability, thereby forming a robust toolkit for evaluating and refining socially intelligent agents. Code is available on GitHub.
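
To make the structural-similarity dimension concrete, the following is a minimal, self-contained sketch of the Soft-DTW cost (the soft-minimum recursion of Cuturi and Blondel, 2017) applied to a hypothetical F0 contour and a pitch-variance-reduced copy of it. The toy signal, the gamma value, and all names here are illustrative assumptions, not the authors' implementation; the actual pipeline is in the linked GitHub code.

import numpy as np

def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW alignment cost between two 1-D sequences (Cuturi & Blondel, 2017)."""
    n, m = len(x), len(y)
    d = (x[:, None] - y[None, :]) ** 2  # pairwise squared distances
    r = np.full((n + 1, m + 1), np.inf)  # accumulated soft alignment costs
    r[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            vals = np.array([r[i - 1, j], r[i, j - 1], r[i - 1, j - 1]])
            lo = vals.min()  # log-sum-exp shift for numerical stability
            softmin = lo - gamma * np.log(np.exp(-(vals - lo) / gamma).sum())
            r[i, j] = d[i - 1, j - 1] + softmin
    return r[n, m]

# Toy example: a hypothetical F0 contour (Hz) vs. a variance-reduced copy,
# mimicking the prosodic pitch-flattening perturbation described above.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 2.0 * np.pi, 100)
f0 = 120.0 + 20.0 * np.sin(t) + rng.normal(0.0, 1.0, t.size)
f0_flat = 120.0 + 0.3 * (f0 - 120.0)  # shrink pitch variance toward the mean
print(soft_dtw(f0, f0, gamma=0.1))       # small (can be slightly negative: Soft-DTW is not a true metric)
print(soft_dtw(f0, f0_flat, gamma=0.1))  # markedly higher cost after flattening

Flattening the contour raises the alignment cost, matching the direction of the F0 Soft-DTW effect reported in the abstract.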