E. Gedik
Space exploration is evolving with the recent increase in interest and investment. For the success of planned long-duration crewed missions, good interpersonal interactions between crew members are crucial. In this study, we evaluate the use of wearables for detecting and estimating the quality of each social interaction participants have throughout a long mission, rather than aggregate measures of interactions. Our proposed method uses Temporal Convolutional Networks (TCNs) to extract individual representations from acceleration and audio streams, and learnable pooling layers (NetVLAD) to aggregate these into fixed-size representations. NetVLAD layers provide an intelligent alternative to simple aggregation for handling variable-sized interactions and interactions with missing data. We evaluate our method on a 4-month simulated space mission in which 5 participants wore Sociometric Badges and reported on their interactions in terms of effectiveness, frustration, and satisfaction. Our method achieves an average ROC-AUC score of 0.64. Since we are not aware of any comparable baselines, we compare our method to hand-crafted features previously used for cohesion estimation in similar scenarios and show that it significantly outperforms them. We also present ablation studies in which we replace the components of our approach with well-known alternatives and show that our components outperform their respective counterparts.
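To illustrate how a learnable pooling layer can turn interactions of different lengths into fixed-size representations, the following is a minimal NumPy sketch of NetVLAD-style soft-assignment pooling. It is not the authors' implementation: the function name `netvlad_pool`, the parameter names, the cluster count and the feature dimension are all assumptions chosen for illustration, and the parameters are random here rather than learned.

```python
import numpy as np

def netvlad_pool(features, centers, assign_w, assign_b):
    """Aggregate a (T, D) feature sequence into a fixed (K * D,) descriptor.

    features: (T, D) per-time-step representations (T may vary per interaction)
    centers:  (K, D) cluster centres (learned in a real model, random here)
    assign_w: (D, K) and assign_b: (K,) soft-assignment parameters
    """
    # Softmax over clusters gives a soft assignment for every time step.
    logits = features @ assign_w + assign_b              # (T, K)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    a = np.exp(logits)
    a = a / a.sum(axis=1, keepdims=True)                 # (T, K)

    # Assignment-weighted residuals: V[k] = sum_t a[t, k] * (x_t - c_k).
    vlad = a.T @ features - a.sum(axis=0)[:, None] * centers  # (K, D)

    # Normalise each cluster's residual, then the flattened descriptor.
    vlad = vlad / (np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12)
    flat = vlad.ravel()
    return flat / (np.linalg.norm(flat) + 1e-12)

# Hypothetical sizes: K = 4 clusters, D = 8 feature dimensions.
rng = np.random.default_rng(0)
K, D = 4, 8
centers = rng.normal(size=(K, D))
assign_w = rng.normal(size=(D, K))
assign_b = np.zeros(K)

short_desc = netvlad_pool(rng.normal(size=(5, D)), centers, assign_w, assign_b)
long_desc = netvlad_pool(rng.normal(size=(50, D)), centers, assign_w, assign_b)
```

Both the 5-step and the 50-step sequence map to a descriptor of length K * D, which is what lets a downstream classifier consume interactions of any duration.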
This paper focuses on the automatic classification of self-assessed personality traits from the HEXACO inventory during crowded mingle scenarios. These scenarios provide rich study cases for social behavior analysis but are also challenging to analyze automatically, as people in them interact dynamically and freely in an in-the-wild face-to-face setting. To do so, we use wearable sensors recording acceleration and proximity, and video from overhead cameras. We use 3 different behavioral modality types (movement, speech and proximity) coming from 2 sensors (wearable and camera). Unlike other works, we extract an individual's speaking status from a single body-worn triaxial accelerometer instead of audio, which scales easily to large populations. Additionally, we study the effect of different combinations of modality types on personality estimation, and how this relates to the nature of each trait. We also include an analysis of feature complementarity and an evaluation of feature importance for the classification, showing that combining complementary modality types further improves classification performance. We estimate the self-assessed personality traits both as a binary classification (the community standard) and as a regression over the trait scores. Finally, we analyze the impact of speech detection accuracy on the overall performance of the personality estimation.
Detecting F-formations and Roles in Crowded Social Scenes with Wearables
Combining Proxemics and Dynamics using LSTMs
In this paper, we investigate the use of proxemics and dynamics for automatically identifying conversing groups, or so-called F-formations. More formally, we aim to automatically identify whether wearable sensor data coming from 2 people is indicative of F-formation membership. We also explore the problem of jointly detecting membership and more descriptive information about the pair relating to the role each takes in the conversation (i.e. speaker or listener). We jointly model the concepts of proxemics and dynamics using binary proximity and acceleration obtained through a single wearable sensor per person. We test our approaches on the publicly available MatchNMingle dataset, which was collected during real-life mingling events. We find that the fusion of these two modalities performs significantly better than either modality alone, providing an AUC of 0.975 when data from 30-second windows are used. Furthermore, our investigation into role detection shows that each role pair requires a different time resolution for accurate detection.
We present an approach to interpret the response of audiences to live performances by processing mobile sensor data. We apply our method to three different datasets obtained from three live performances, where each audience member wore a single tri-axial accelerometer and proximity sensor embedded inside a smart sensor pack. Using these sensor data, we developed a novel approach to predict audience members' self-reported experience of the performances in terms of enjoyment, immersion, willingness to recommend the event to others and change in mood. The proposed method identifies informative intervals of the event in an unsupervised manner, using the linkage of the audience members' bodily movements, and uses data from these intervals only to estimate the audience members' experience. We also analyze how the relative location of members of the audience can affect their experience and present an automatic way of recovering neighborhood information based on proximity sensors. We further show that the linkage of the audience members' bodily movements is informative of memorable moments which were later reported by the audience.
The MatchNMingle dataset
A novel multi-sensor resource for the analysis of social interactions and group dynamics in-the-wild during free-standing conversations and speed dates
We present MatchNMingle, a novel multimodal/multisensor dataset for the analysis of free-standing conversational groups and speed-dates in-the-wild. MatchNMingle uses wearable devices and overhead cameras to record social interactions of 92 people during real-life speed-dates, followed by a cocktail party. To our knowledge, MatchNMingle has the largest number of participants, longest recording time and largest set of manual annotations for social actions available in this context in a real-life scenario. It consists of 2 hours of data from wearable acceleration, binary proximity, video, audio, personality surveys, frontal pictures and speed-date responses. Participants' positions and group formations were manually annotated, as were social actions (e.g. speaking, hand gesture) for 30 minutes at 20fps, making it the first dataset to incorporate the annotation of such cues in this context. We present an empirical analysis of the performance of crowdsourcing workers against trained annotators in simple and complex annotation tasks, finding that although efficient for simple tasks, using crowdsourcing workers for more complex tasks like social action annotation led to additional overhead and poor inter-annotator agreement compared to trained annotators (differences up to 0.4 in Fleiss' Kappa coefficients). We also provide example experiments of how MatchNMingle can be used.
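For readers unfamiliar with the agreement measure mentioned above, the following is a minimal sketch of the standard Fleiss' Kappa computation. The function name `fleiss_kappa` and the toy count tables are illustrative assumptions and are not taken from the dataset's annotation pipeline. Each row is an annotated item, each column a category, and each cell holds the number of annotators who chose that category; every item is assumed to have the same number of annotators.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' Kappa for an (items x categories) table of rating counts."""
    counts = np.asarray(counts, dtype=float)
    n_items = counts.shape[0]
    n_raters = counts[0].sum()  # assumed identical for every item

    # Overall proportion of ratings falling into each category.
    p_j = counts.sum(axis=0) / (n_items * n_raters)

    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))

    p_bar = p_i.mean()            # observed agreement
    p_e = np.square(p_j).sum()    # agreement expected by chance
    return (p_bar - p_e) / (1.0 - p_e)

# Toy tables (hypothetical): 3 annotators, 2 categories.
perfect = fleiss_kappa([[3, 0], [0, 3], [3, 0], [0, 3]])
mixed = fleiss_kappa([[2, 1], [1, 2]])
```

A value of 1 indicates perfect agreement, while values at or below 0 indicate agreement no better than chance.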
We investigate the task of detecting speakers in crowded environments using a single body-worn triaxial accelerometer. Detection of such behaviour is very challenging to model as people’s body movements during speech vary greatly. Similar to previous studies, by assuming that body movements are indicative of speech, we show experimentally, on a real-world dataset of 3 h including 18 people, that transductive parameter transfer learning (Zen et al. in Proceedings of the 16th international conference on multimodal interaction. ACM, 2014) can better model individual differences in speaking behaviour, significantly improving on the state-of-the-art performance. We also discuss the challenges introduced by the in-the-wild nature of our dataset and experimentally show how they affect detection performance. We reinforce the case for an adaptive approach by comparing the speech detection problem to a more traditional activity (i.e. walking). We provide an analysis of the transfer by considering different source sets, which provides a deeper investigation of the nature of both speech and body movements in the context of transfer learning.
...
We investigate the task of detecting speakers in crowded environments using a single triaxial accelerometer worn around the neck. Similar to previous studies, by assuming that body movements are indicative of speech, we show experimentally that transductive transfer learning can better model individual differences in speaking behaviour compared to a traditional person-independent setup. Such behaviour is very challenging to model as people’s body movements during speech vary greatly. To our knowledge, this is the first time that a transfer learning approach has been considered in the context of speaking status detection using a single body-worn accelerometer. We show that by transferring knowledge across subjects, performance competitive with person-dependent training can be obtained.