S. Tan | TU Delft Repository

Towards Hybrid Intelligence in Learning Organizations

Conference paper (2025) - Stephanie Tan, Wendy M. Aartsen, Dicky van Hamersveld, Catholijn M. Jonker

The roles of humans and AI as the labor force of organizations need continuous re-evaluation as AI advances. While automation has replaced some tasks, knowledge-intensive work environments rely on human intelligence, as those work practices transcend canonical procedures. We propose a hybrid intelligence methodology for organizations to address knowledge erosion. We contextualize this methodology in an example case study from the Legal Desk in the Netherlands, following the six principles of designing intelligent organizations [1], i.e., addition, relevance, substitution, diversity, collaboration, and explanation. We found that adhering to these six basic principles appeared to be a balancing act on two axes: contribution of AI versus human intelligence towards the tasks, and the way of working of human and artificial agents over time. We propose two additional principles. The first is human oversight, which highlights the importance of human control in organizational decision-making. The second is collaborative reflection, which emphasizes the need to actively manage organizational intelligence. We also discuss the challenges of enabling our methodology in the organizational context. This paper aims to inspire researchers and practitioners to pursue new initiatives towards achieving hybrid intelligence for learning organizations. ...

Sensing and Modeling Human Behaviors In Complex Conversational Scenes

Doctoral thesis (2023) - S. Tan

Understanding human behavior has been an intriguing topic studied by many disciplines, including social science and neuroscience. Humans exhibit social behaviors through, for example, interacting, conversing, and empathizing with each other. Systematically and scientifically studying these behaviors often requires granular observations and measurements. With increasing digital sensor and computer sensing and processing capability, accurately measuring and recording large amounts of real-life human social behavior has become possible. Computational methods, such as machine learning, can be developed to analyze these data in unprecedented ways by detecting and learning patterns in the signals. However, even with the available data and advanced machine learning methods, understanding human social behavior is still challenging, as it is contextual and could result in variations. This thesis focuses on analyzing human behaviors in complex conversational scenes. It proposes novel computational methods that incorporate the context, which is the conversation group and the interaction scene. Prominent behavioral cues in social interaction include head and body orientations, as they are proxy indicators for visual attention and conversation group membership. This thesis first covers methods for head and body orientation estimation (under data-scarce and data-rich settings), and conversation group detection. These methods have an emphasis on learning from multimodal data and context modeling, and their efficacy is shown empirically. Then, the thesis addresses an open challenge in acquiring human social data in real life by proposing an accurate and scalable method for data synchronization. Lastly, this thesis introduces a new dataset collected by the aforementioned synchronization method, capturing real-life interaction in a conference setting. Therein, results of tasks such as keypoint detection, action recognition, and conversation group detection are reported, which also motivate future research in this area. Combining these contributions in both computational method development and data collection, this thesis takes a step forward in understanding human behaviors in conversation scenes. ...

Conversation Group Detection With Spatio-Temporal Context

Conference paper (2022) - Stephanie Tan, David M.J. Tax, Hayley Hung

In this work, we propose an approach for detecting conversation groups in social scenarios like cocktail parties and networking events, from overhead camera recordings. We posit the detection of conversation groups as a learning problem that could benefit from leveraging the spatial context of the surroundings, and the inherent temporal context in interpersonal dynamics which is reflected in the temporal dynamics in human behavior signals, an aspect that has not been addressed in recent prior works. This motivates our approach, which consists of a dynamic LSTM-based deep learning model that predicts continuous pairwise affinity values indicating how likely two people are to be in the same conversation group. These affinity values are also continuous in time, since relationships and group membership do not occur instantaneously, even though the ground truths of group membership are binary. Using the predicted affinity values, we apply a graph clustering method based on Dominant Set extraction to identify the conversation groups. We benchmark the proposed method against established methods on multiple social interaction datasets. Our results show that the proposed method improves group detection performance in data that has more temporal granularity in conversation group labels. Additionally, we provide an analysis of the predicted affinity values in relation to the conversation group detection. Finally, we demonstrate the usability of the predicted affinity values in a forecasting framework to predict group membership for a given forecast horizon. ...
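
As a rough illustration of the clustering step named in the abstract above, the sketch below extracts groups from a pairwise affinity matrix with Dominant Set extraction. It is not the authors' implementation: the use of discrete replicator dynamics, the peeling strategy, the `cutoff` threshold, and the function names `dominant_set` and `detect_groups` are all assumptions made for illustration, and the affinity values are hand-written rather than predicted by an LSTM.

```python
def dominant_set(affinity, members, iters=1000, tol=1e-9, cutoff=1e-4):
    """Extract one dominant set from the sub-graph induced by `members`.

    Uses discrete replicator dynamics (an assumed solver choice):
    x_i <- x_i * (A x)_i / (x^T A x), started from the uniform vector.
    """
    n = len(members)
    if n == 1:
        return list(members)
    x = [1.0 / n] * n
    for _ in range(iters):
        ax = [sum(affinity[members[i]][members[j]] * x[j] for j in range(n))
              for i in range(n)]
        denom = sum(x[i] * ax[i] for i in range(n))
        if denom <= 0:
            # No positive affinity left: treat the first person as a singleton.
            return members[:1]
        new_x = [x[i] * ax[i] / denom for i in range(n)]
        delta = sum(abs(a - b) for a, b in zip(new_x, x))
        x = new_x
        if delta < tol:
            break
    # People whose weight survived the dynamics form the group.
    return [members[i] for i in range(n) if x[i] > cutoff]


def detect_groups(affinity):
    """Peel off dominant sets one at a time until everyone is assigned."""
    remaining = list(range(len(affinity)))
    groups = []
    while remaining:
        group = dominant_set(affinity, remaining)
        groups.append(group)
        remaining = [m for m in remaining if m not in group]
    return groups


# Hypothetical symmetric affinities for five people (zero diagonal):
# persons 0-2 are strongly tied, as are persons 3-4.
HI, LO = 0.9, 0.1
example_affinity = [
    [0, HI, HI, LO, LO],
    [HI, 0, HI, LO, LO],
    [HI, HI, 0, LO, LO],
    [LO, LO, LO, 0, HI],
    [LO, LO, LO, HI, 0],
]
groups = detect_groups(example_affinity)  # [[0, 1, 2], [3, 4]]
```

In this toy case the first pass concentrates all weight on persons 0 to 2, and the second pass groups the remaining two people.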

Multimodal Joint Head Orientation Estimation in Interacting Groups via Proxemics and Interaction Dynamics

Journal article (2021) - Stephanie Tan, David M.J. Tax, Hayley Hung

Human head orientation estimation has been of interest because head orientation serves as a cue to directed social attention. Most existing approaches rely on visual and high-fidelity sensor inputs and deep learning strategies that do not consider the social context of unstructured and crowded mingling scenarios. We show that alternative inputs, like speaking status, body location, orientation, and acceleration contribute towards head orientation estimation. These are especially useful in crowded and in-the-wild settings where visual features are either uninformative due to occlusions or prohibitive to acquire due to physical space limitations and concerns of ecological validity. We argue that head orientation estimation in such social settings needs to account for the physically evolving interaction space formed by all the individuals in the group. To this end, we propose an LSTM-based head orientation estimation method that combines the hidden representations of the group members. Our framework jointly predicts head orientations of all group members and is applicable to groups of different sizes. We explain the contribution of different modalities to model performance in head orientation estimation. The proposed model outperforms baseline methods that do not explicitly consider the group context, and generalizes to an unseen dataset from a different social event. ...

A Modular Approach for Synchronized Wireless Multimodal Multisensor Data Acquisition in Highly Dynamic Social Settings

Conference paper (2020) - C.A. Raman, S. Tan, H.S. Hung

Existing data acquisition literature for human behavior research provides wired solutions, mainly for controlled laboratory setups. In uncontrolled free-standing conversation settings, where participants are free to walk around, these solutions are unsuitable. While wireless solutions are employed in the broadcasting industry, they can be prohibitively expensive. In this work, we propose a modular and cost-effective wireless approach for synchronized multisensor data acquisition of social human behavior. Our core idea involves a cost-accuracy trade-off by using Network Time Protocol (NTP) as a source reference for all sensors. While commonly used as a reference in ubiquitous computing, NTP is widely considered to be insufficiently accurate as a reference for video applications, where Precision Time Protocol (PTP) or Global Positioning System (GPS) based references are preferred. We argue and show, however, that the latency introduced by using NTP as a source reference is adequate for human behavior research, and the subsequent cost and modularity benefits are a desirable trade-off for applications in this domain. We also describe one instantiation of the approach deployed in a real-world experiment to demonstrate the practicality of our setup in-the-wild. ...
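
For readers unfamiliar with how an NTP reference yields a shared timeline, the sketch below shows the standard NTP offset and round-trip delay calculation from one request/response exchange, followed by a shift of local sensor timestamps onto the reference clock. The abstract does not describe the authors' implementation at this level; the function names and the example timestamps are hypothetical.

```python
def ntp_offset_and_delay(t0, t1, t2, t3):
    """Standard NTP on-wire calculation for one exchange.

    t0: client send time (client clock)
    t1: server receive time (server clock)
    t2: server send time (server clock)
    t3: client receive time (client clock)
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2.0  # estimated server-minus-client clock offset
    delay = (t3 - t0) - (t2 - t1)           # round-trip network delay
    return offset, delay


def to_reference_time(local_timestamps, offset):
    """Shift a sensor's local timestamps onto the reference clock."""
    return [t + offset for t in local_timestamps]


# Hypothetical exchange: the client clock runs 0.1 s behind the server,
# with a symmetric one-way network delay of 0.02 s.
offset, delay = ntp_offset_and_delay(10.000, 10.120, 10.121, 10.041)
aligned = to_reference_time([10.5, 11.0], offset)
```

The offset estimate is exact only when the network delay is symmetric; the abstract's argument is that the resulting error is acceptable for human behavior research.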

Inverse-designed spinodoid metamaterials

Journal article (2020) - Siddhant Kumar, Stephanie Tan, Li Zheng, Dennis M. Kochmann

After a decade of periodic truss-, plate-, and shell-based architectures having dominated the design of metamaterials, we introduce the non-periodic class of spinodoid topologies. Inspired by natural self-assembly processes, spinodoid metamaterials are a close approximation of microstructures observed during spinodal phase separation. Their theoretical parametrization is so intriguingly simple that one can bypass costly phase-field simulations and obtain a rich and seamlessly tunable property space. Counter-intuitively, breaking with the periodicity of classical metamaterials is the enabling factor to the large property space and the ability to introduce seamless functional grading. We introduce an efficient and robust machine learning technique for the inverse design of (meta-)materials which, when applied to spinodoid topologies, enables us to generate uniform and functionally graded cellular mechanical metamaterials with tailored direction-dependent (anisotropic) stiffness and density. We specifically present biomimetic artificial bone architectures that not only reproduce the properties of trabecular bone accurately but also even geometrically resemble natural bone. ...

Multimodal data collection for social interaction analysis in-the-wild

Conference paper (2019) - Hayley Hung, Chirag Raman, Ekin Gedik, Stephanie Tan, Jose Vargas Quiros

The benefits of exploiting multi-modality in the analysis of human-human social behaviour have been demonstrated widely in the community. An important aspect of this problem is the collection of data sets that provide a rich and realistic representation of how people actually socialize with each other in real life. These subtle coordination patterns are influenced by individual beliefs, goals, and desires related to what an individual stands to lose or gain in the activities they perform in their everyday life. These conditions cannot be easily replicated in a lab setting and require a radical re-thinking of both how and what to collect. This tutorial provides a guide on how to create such multi-modal multi-sensor data sets when holistically considering the entire experimental design and data collection process. ...

Improving temporal interpolation of head and body pose using Gaussian process regression in a matrix completion setting

Conference paper (2018) - Stephanie Tan, David M.J. Tax, Hayley Hung

This paper presents a model for head and body pose estimation (HBPE) when labelled samples are highly sparse. The current state-of-the-art multimodal approach to HBPE utilizes the matrix completion method in a transductive setting to predict pose labels for unobserved samples. Based on this approach, the proposed method tackles HBPE when manually annotated ground truth labels are temporally sparse. We posit that the current state-of-the-art approach oversimplifies the temporal sparsity assumption by using Laplacian smoothing. Our final solution uses: i) Gaussian process regression in place of Laplacian smoothing, ii) head and body coupling, and iii) nuclear norm minimization in the matrix completion setting. The model is applied to the challenging SALSA dataset for benchmarking against the state-of-the-art method. Our presented formulation outperforms the state-of-the-art significantly in this particular setting, e.g. at 5% ground truth labels as training data, head pose accuracy and body pose accuracy are approximately 62% and 70%, respectively. As well as fitting a more flexible model to missing labels in time, we posit that our approach also loosens the head and body coupling constraint, allowing for a more expressive model of the head and body pose typically seen during conversational interaction in groups. This provides a new baseline to improve upon for future integration of multimodal sensor data for the purpose of HBPE. ...
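
To make the idea of Gaussian process regression over temporally sparse labels concrete, the sketch below interpolates pose angles between a few labelled frames using the GP posterior mean. It covers only that one ingredient and is not the paper's formulation: the matrix completion, nuclear norm minimization, and head and body coupling are omitted, and the RBF kernel, the `length_scale` and `noise` values, the sine/cosine encoding of angles, and all function names are assumptions.

```python
import math


def rbf(t1, t2, length_scale):
    """Squared-exponential kernel between two time stamps."""
    return math.exp(-((t1 - t2) ** 2) / (2.0 * length_scale ** 2))


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def gp_mean(train_t, train_y, query_t, length_scale=5.0, noise=1e-6):
    """Posterior mean of a zero-mean GP: k(q, T) (K + noise * I)^-1 y."""
    n = len(train_t)
    gram = [[rbf(train_t[i], train_t[j], length_scale) + (noise if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    weights = solve(gram, train_y)
    return [sum(rbf(q, train_t[i], length_scale) * weights[i] for i in range(n))
            for q in query_t]


def interpolate_angles(train_t, train_deg, query_t, **kwargs):
    """Regress sin and cos separately so angles wrap around correctly (assumed encoding)."""
    sin_y = [math.sin(math.radians(a)) for a in train_deg]
    cos_y = [math.cos(math.radians(a)) for a in train_deg]
    sin_hat = gp_mean(train_t, sin_y, query_t, **kwargs)
    cos_hat = gp_mean(train_t, cos_y, query_t, **kwargs)
    return [math.degrees(math.atan2(s, c)) % 360.0 for s, c in zip(sin_hat, cos_hat)]


# Hypothetical labels: three annotated frames, two query frames.
labelled_frames = [0.0, 10.0, 20.0]
labelled_angles = [0.0, 30.0, 60.0]
estimates = interpolate_angles(labelled_frames, labelled_angles, [5.0, 10.0])
```

With a near-zero noise term the estimate at a labelled frame reproduces its label, while estimates at unlabelled frames are smooth blends governed by the kernel length scale.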