C.A. Raman | TU Delft Repository

Systematic review of machine learning applications using nonoptical motion tracking in surgery

Journal article (2025) - Teona Z. Carciumaru, Cadey M. Tang, Mohsen Farsi, Wichor M. Bramer, Jenny Dankelman, Chirag Raman, Clemens M.F. Dirven, Maryam Gholinejad, Dalibor Vasilic

This systematic review explores machine learning (ML) applications in surgical motion analysis using non-optical motion tracking systems (NOMTS), alone or with optical methods. It investigates objectives, experimental designs, model effectiveness, and future research directions. From 3632 records, 84 studies were included, with Artificial Neural Networks (38%) and Support Vector Machines (11%) being the most common ML models. Skill assessment was the primary objective (38%). NOMTS used included internal device kinematics (56%), electromagnetic (17%), inertial (15%), mechanical (11%), and electromyography (1%) sensors. Surgical settings were robotic (60%), laparoscopic (18%), open (16%), and others (6%). Procedures focused on bench-top tasks (67%), clinical models (17%), clinical simulations (9%), and non-clinical simulations (7%). Over 90% accuracy was achieved in 36% of studies. Literature shows NOMTS and ML can enhance surgical precision, assessment, and training. Future research should advance ML in surgical environments, ensure model interpretability and reproducibility, and use larger datasets for accurate evaluation. ...

Multimodal Quantitative Measures for Multiparty Behavior Evaluation

Conference paper (2025) - Ojas Shirekar, Wim Pouw, Chenxu Hao, Vrushank Phadnis, Thabo Beeler, Chirag Raman

Digital humans are emerging as autonomous agents in multiparty interactions, yet existing evaluation metrics largely ignore contextual coordination dynamics. We introduce a unified, intervention-driven framework for objective assessment of multiparty social behaviour in skeletal motion data, spanning three complementary dimensions: (1) synchrony via Cross-Recurrence Quantification Analysis, (2) temporal alignment via Multiscale Empirical Mode Decomposition-based Beat Consistency, and (3) structural similarity via Soft Dynamic Time Warping. We validate metric sensitivity through three theory-driven perturbations - gesture kinematic dampening, uniform speech-gesture delays, and prosodic pitch-variance reduction - applied to ≈ 145 30-second thin slices of group interactions from the DnD dataset. Mixed-effects analyses reveal predictable, joint-independent shifts: dampening increases CRQA determinism and reduces beat consistency, delays weaken cross-participant coupling, and pitch flattening elevates F0 Soft-DTW costs. A complementary perception study (N = 27) compares judgments of full-video and skeleton-only renderings to quantify representation effects. Our three measures deliver orthogonal insights into spatial structure, timing alignment, and behavioural variability, thereby forming a robust toolkit for evaluating and refining socially intelligent agents. Code available on GitHub. ...
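
As an illustration of the third measure named in the abstract above, the following is a minimal sketch of a Soft Dynamic Time Warping cost between two one-dimensional series (for example, two pitch contours). It is not the authors' implementation: the function names, the squared-difference local cost, and the smoothing parameter `gamma` are assumptions chosen for illustration, following the standard soft-minimum recurrence.

```python
import math


def soft_min(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    m = min(values)
    total = sum(math.exp(-(v - m) / gamma) for v in values)
    return m - gamma * math.log(total)


def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW cost between two 1-D sequences (illustrative sketch).

    Assumptions: squared difference as the local cost, gamma > 0.
    As gamma approaches 0 the value approaches the classic DTW cost.
    """
    n, m = len(x), len(y)
    # R[i][j] holds the soft-minimal accumulated cost of aligning x[:i] with y[:j].
    R = [[math.inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            R[i][j] = cost + soft_min(
                [R[i - 1][j], R[i][j - 1], R[i - 1][j - 1]], gamma
            )
    return R[n][m]


# Hypothetical contours: a series compared with itself and with a shifted copy.
contour = [0.0, 1.0, 2.0, 3.0]
shifted = [5.0, 6.0, 7.0, 8.0]
same_cost = soft_dtw(contour, contour, gamma=0.01)
diff_cost = soft_dtw(contour, shifted, gamma=0.01)
```

With a small `gamma`, a series compared with itself costs close to zero, while dissimilar series yield larger costs. The sketch runs in time proportional to the product of the two sequence lengths.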

Big team science reveals promises and limitations of machine learning efforts to model physiological markers of affective experience

Journal article (2025) - Nicholas A. Coles, Bartosz Perz, Maciej Behnke, Johannes C. Eichstaedt, Soo Hyung Kim, Tu N. Vu, Chirag Raman, Julian Tejada, Van-Thong Huynh, More authors...

Researchers are increasingly using machine learning to study physiological markers of emotion. We evaluated the promises and limitations of this approach via a big team science competition. Twelve teams competed to predict self-reported affective experiences using a multi-modal set of peripheral nervous system measures. Models were trained and tested in multiple ways: with data divided by participants, targeted emotion, inductions, and time. In 100% of tests, teams outperformed baseline models that made random predictions. In 46% of tests, teams also outperformed baseline models that relied on the simple average of ratings from training datasets. More notably, results uncovered a methodological challenge: multiplicative constraints on generalizability. Inferences about the accuracy and theoretical implications of machine learning efforts depended not only on their architecture, but also on how they were trained, tested, and evaluated. For example, some teams performed better when tested on observations from the same (vs. different) subjects seen during training. Such results could be interpreted as evidence against claims of universality. However, such conclusions would be premature because other teams exhibited the opposite pattern. Taken together, results illustrate how big team science can be leveraged to understand the promises and limitations of machine learning methods in affective science and beyond. ...
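
The comparison against a "simple average of ratings" baseline described above can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not name the error metric, so root-mean-square error is assumed here, and all function names and data values are hypothetical.

```python
import math


def rmse(predictions, targets):
    """Root-mean-square error between two equal-length lists."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


def mean_baseline(train_ratings, n_test):
    """Predict the training-set mean rating for every test observation."""
    mean_rating = sum(train_ratings) / len(train_ratings)
    return [mean_rating] * n_test


def beats_mean_baseline(model_predictions, train_ratings, test_ratings):
    """True if the model's RMSE is lower than the mean baseline's RMSE."""
    baseline = mean_baseline(train_ratings, len(test_ratings))
    return rmse(model_predictions, test_ratings) < rmse(baseline, test_ratings)


# Hypothetical ratings, for illustration only.
train = [1.0, 2.0, 3.0]
test = [2.0, 4.0]
good_model = beats_mean_baseline([2.1, 3.9], train, test)
poor_model = beats_mean_baseline([0.0, 0.0], train, test)
```

A model only counts as outperforming the baseline when its error on the held-out ratings is strictly lower than that of the constant training-mean prediction.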

Mesh-Tension Driven Expression-Based Wrinkles for Synthetic Faces

Conference paper (2023) - Chirag Raman, Charlie Hewitt, Erroll Wood, Tadas Baltrusaitis

Recent advances in synthesizing realistic faces have shown that synthetic training data can replace real data for various face-related computer vision tasks. A question arises: how important is realism? Is the pursuit of photorealism excessive? In this work, we show otherwise. We boost the realism of our synthetic faces by introducing dynamic skin wrinkles in response to facial expressions, and observe significant performance improvements in downstream computer vision tasks. Previous approaches for producing such wrinkles either required prohibitive artist effort to scale across identities and expressions, or were not capable of reconstructing high-frequency skin details with sufficient fidelity. Our key contribution is an approach that produces realistic wrinkles across a large and diverse population of digital humans. Concretely, we formalize the concept of mesh-tension and use it to aggregate possible wrinkles from high-quality expression scans into albedo and displacement texture maps. At synthesis, we use these maps to produce wrinkles even for expressions not represented in the source scans. Additionally, to provide a more nuanced indicator of model performance under deformations resulting from compressed expressions, we introduce the 300W-winks evaluation subset and the Pexels dataset of closed eyes and winks. ...

Towards Artificial Social Intelligence in the Wild

Sensing, Synthesizing, Modeling, and Perceiving Nonverbal Social Human Behavior

Doctoral thesis (2023) - C.A. Raman

Over the last three decades, the social roots of human intelligence have come to influence the development of artificial intelligence (AI). Researchers in AI have moved beyond agents operating in isolation towards developing socially situated agents that can operate in the real world. Meanwhile, researchers in the social sciences have been leveraging AI techniques to analyze and theorize about social phenomena. Both these research endeavors came to be independently termed Artificial Social Intelligence (ASI), leading to the emergence of a field spanning several subdisciplines of the social and computational sciences. This Thesis takes a holistic view of ASI and makes contributions toward both its historical goals. Moreover, the work presented here focuses on taking ASI research into natural real-world settings in the wild. The research is organized under three themes: acquiring, modeling, and perceiving social human behavior. The Thesis begins by addressing the challenge of data acquisition. We propose a replicable data collection concept for curating datasets of real-world social human behavior, incorporating technical innovations and ethical considerations required for the noninvasive sensing of multimodal behavioral streams. To overcome the limited availability of real-world data, we also explore the potential of synthetic training data for downstream tasks. Next, we tackle the challenge of modeling real-world social behavioral cues. Evidence from social psychology suggests that individuals uniquely adapt their behaviors to different conversation partners to sustain interactions. How can we jointly forecast these mutually dependent future cues of conversation partners? We propose a stochastic meta-learning method that adapts its forecasts to the unique dynamics of a conversation group given example behavior sequences. Thereby, it generalizes to unseen groups in a data-efficient manner by avoiding the need for group-specific models. 
Further, to facilitate the integration of data-driven and hypothesis-driven research, we propose a post hoc explanation framework for identifying timesteps that are salient to a forecasting model's predictions. Finally, we contribute to a nuanced perception of social interactions by establishing evidence of multiple conversation floors within a single conversing group, in contrast to the prevailing implicit assumption in the automatic detection of conversation groups. We also develop an instrument for measuring the perceived quality of conversations at the individual and group levels. Through these research themes, we provide novel contributions to the field of ASI, taking important steps toward the development of socially intelligent machines that can operate effectively in complex real-world settings. ...


Perceived Conversation Quality in Spontaneous Interactions

Journal article (2023) - Chirag Raman, Navin Raj Prabhu, Hayley Hung

The quality of daily spontaneous conversations is important both to our well-being and to the development of interactive social agents. Prior research directly studying the quality of social conversations has operationalized it in narrow terms, associating greater quality with less small talk. Other works taking a broader perspective of interaction experience have studied quality indirectly, and in isolation, through one of several overlapping constructs such as rapport or engagement. In this work, we bridge this gap by proposing a holistic conceptualization of conversation quality, building upon the collaborative attributes of cooperative conversation floors. Taking a multilevel perspective of conversation, we develop and validate two instruments for perceived conversation quality (PCQ) at the individual and group levels. Specifically, we motivate capturing external raters' gestalt impressions of participant experiences from thin slices of behavior, and collect annotations of PCQ on the publicly available MatchNMingle dataset of in-the-wild mingling conversations. Finally, we present an analysis of behavioral features that are predictive of PCQ. We find that for the conversations in MatchNMingle, raters tend to associate smaller group sizes, equitable speaking turns with fewer interruptions, and time taken for synchronous bodily coordination with higher PCQ. ...

Social Processes

Self-supervised Meta-learning Over Conversational Groups for Forecasting Nonverbal Social Cues

Conference paper (2023) - Chirag Raman, Hayley Hung, Marco Loog

Free-standing social conversations constitute a yet underexplored setting for human behavior forecasting. While the task of predicting pedestrian trajectories has received much recent attention, an intrinsic difference between these settings is how groups form and disband. Evidence from social psychology suggests that group members in a conversation explicitly self-organize to sustain the interaction by adapting to one another’s behaviors. Crucially, the same individual is unlikely to adapt similarly across different groups; contextual factors such as perceived relationships, attraction, rapport, etc., influence the entire spectrum of participants’ behaviors. A question arises: how can we jointly forecast the mutually dependent futures of conversation partners by modeling the dynamics unique to every group? In this paper, we propose the Social Process (SP) models, taking a novel meta-learning and stochastic perspective of group dynamics. Training group-specific forecasting models hinders generalization to unseen groups and is challenging given limited conversation data. In contrast, our SP models treat interaction sequences from a single group as a meta-dataset: we condition forecasts for a sequence from a given group on other observed-future sequence pairs from the same group. In this way, an SP model learns to adapt its forecasts to the unique dynamics of the interacting partners, generalizing to unseen groups in a data-efficient manner. Additionally, we first rethink the task formulation itself, motivating task requirements from social science literature that prior formulations have overlooked. For our formulation of Social Cue Forecasting, we evaluate the empirical performance of our SP models against both non-meta-learning and meta-learning approaches with similar assumptions. The SP models yield improved performance on synthetic and real-world behavior datasets. ...

Towards a Real-time Measure of the Perception of Anthropomorphism in Human-robot Interaction

Conference paper (2021) - Maria Tsfasman, Avinash Saravanan, Dekel Viner, Daan Goslinga, Sarah De Wolf, Chirag Raman, Catholijn M. Jonker, Catharine Oertel

How human-like do conversational robots need to look to enable long-term human-robot conversation? One essential aspect of long-term interaction is a human's ability to adapt to the varying degrees of a conversational partner's engagement and emotions. Prosodically, this can be achieved through (dis)entrainment. While speech synthesis has been a limiting factor for many years, restrictions in this regard are increasingly mitigated. These advancements now emphasise the importance of studying the effect of robot embodiment on human entrainment. In this study, we conducted a between-subjects online human-robot interaction experiment in an educational use-case scenario where a tutor was either embodied through a human or a robot face. Forty-three English-speaking participants took part in the study, and we analysed their degree of acoustic-prosodic entrainment to the human or robot face, respectively. We found that the degree of subjective and objective perception of anthropomorphism positively correlates with acoustic-prosodic entrainment. ...

Defining and quantifying conversation quality in spontaneous interactions

Conference paper (2020) - Navin Raj Prabhu, Chirag Raman, Hayley Hung

Social interactions in general are multifaceted, and a wide set of factors and events influence them. In this paper, we quantify social interactions with a holistic viewpoint on individual experiences, particularly focusing on non-task-directed spontaneous interactions. To achieve this, we design a novel perceived measure, the perceived Conversation Quality, which aims to quantify spontaneous interactions by accounting for several socio-dimensional aspects of individual experiences. To further quantitatively study spontaneous interactions, we devise a questionnaire which measures the perceived Conversation Quality at both the individual and the group level. Using the questionnaire, we collected perceived annotations for conversation quality in a publicly available dataset using naive annotators. The results of the analysis performed on the distribution and the inter-annotator agreement show that naive annotators tend to agree less in cases of low conversation quality samples, especially while annotating for group-level conversation quality. ...

A Modular Approach for Synchronized Wireless Multimodal Multisensor Data Acquisition in Highly Dynamic Social Settings

Conference paper (2020) - C.A. Raman, S. Tan, H.S. Hung

Existing data acquisition literature for human behavior research provides wired solutions, mainly for controlled laboratory setups. In uncontrolled free-standing conversation settings, where participants are free to walk around, these solutions are unsuitable. While wireless solutions are employed in the broadcasting industry, they can be prohibitively expensive. In this work, we propose a modular and cost-effective wireless approach for synchronized multisensor data acquisition of social human behavior. Our core idea involves a cost-accuracy trade-off by using Network Time Protocol (NTP) as a source reference for all sensors. While commonly used as a reference in ubiquitous computing, NTP is widely considered to be insufficiently accurate as a reference for video applications, where Precision Time Protocol (PTP) or Global Positioning System (GPS) based references are preferred. We argue and show, however, that the latency introduced by using NTP as a source reference is adequate for human behavior research, and the subsequent cost and modularity benefits are a desirable trade-off for applications in this domain. We also describe one instantiation of the approach deployed in a real-world experiment to demonstrate the practicality of our setup in-the-wild. ...
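
To make the idea of NTP as a shared time reference concrete, the sketch below shows the standard NTP clock-offset and round-trip-delay calculation from four timestamps, plus a check of whether an offset stays within one video frame. It is not the authors' system: the function names and the frame rate are hypothetical, and the one-frame threshold is an illustrative assumption rather than a criterion taken from the paper.

```python
def ntp_offset_and_delay(t0, t1, t2, t3):
    """Standard NTP estimate from one request/response exchange.

    t0: client send time, t1: server receive time,
    t2: server send time, t3: client receive time (all in seconds).
    Returns (offset of the server clock relative to the client, round-trip delay).
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2.0
    delay = (t3 - t0) - (t2 - t1)
    return offset, delay


def within_one_frame(offset_seconds, fps):
    """True if the absolute offset is smaller than one frame period.

    Illustrative assumption: one frame is used as the tolerance.
    """
    return abs(offset_seconds) < 1.0 / fps


# Hypothetical exchange: the client clock is 50 ms behind the server,
# with 10 ms of network delay in each direction and 1 ms of server processing.
offset, delay = ntp_offset_and_delay(100.000, 100.060, 100.061, 100.021)
fits_30fps = within_one_frame(0.010, fps=30)
```

The offset estimate assumes symmetric network delay in both directions. Asymmetric paths bias the estimate by up to half the round-trip delay.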

Multimodal data collection for social interaction analysis in-the-wild

Conference paper (2019) - Hayley Hung, Chirag Raman, Ekin Gedik, Stephanie Tan, Jose Vargas Quiros

The benefits of exploiting multi-modality in the analysis of human-human social behaviour have been demonstrated widely in the community. An important aspect of this problem is the collection of data-sets that provide a rich and realistic representation of how people actually socialize with each other in real life. These subtle coordination patterns are influenced by individual beliefs, goals, and desires related to what an individual stands to lose or gain in the activities they perform in their everyday life. These conditions cannot be easily replicated in a lab setting and require a radical re-thinking of both how and what to collect. This tutorial provides a guide on how to create such multi-modal multi-sensor data sets when holistically considering the entire experimental design and data collection process. ...

Towards automatic estimation of conversation floors within F-formations

Conference paper (2019) - Chirag Raman, Hayley Hung

The detection of free-standing conversing groups has received significant attention in recent years. In the absence of a formal definition, most studies operationalize the notion of a conversation group either through a spatial or a temporal lens. Spatially, the most commonly used representation is the F-formation, defined by social scientists as the configuration in which people arrange themselves to sustain an interaction. However, the use of this representation is often accompanied by the simplifying assumption that a single conversation occurs within an F-formation. Temporally, various categories have been used to organize conversational units; these include, among others, turn, topic, and floor. Some of these concepts are hard to define objectively by themselves. The present work constitutes an initial exploration into unifying these perspectives by primarily posing the question: can we use the observation of simultaneous speaker turns to infer whether multiple conversation floors exist within an F-formation? We motivate a metric for the existence of distinct conversation floors based on simultaneous speaker turns, and provide an analysis using this metric to characterize conversations across F-formations of varying cardinality. We contribute two key findings: firstly, at the average speaking turn duration of about two seconds for humans, there is evidence for the existence of multiple floors within an F-formation; and secondly, an increase in the cardinality of an F-formation correlates with a decrease in duration of simultaneous speaking turns. ...