Inferring Segments of Speaking Intention Using a Body-worn Accelerometer

Enhancing social interaction with AI-powered systems

Bachelor Thesis (2023)
Author(s)

N.C. Achy (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

H.S. Hung – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

J. Molhoek – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

L. Li – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

S. Tan – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)

A.W.F.A.M. Elnouty – Graduation committee member (TU Delft - Computer Science & Engineering-Teaching Team)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2023
Language
English
Copyright
© 2023 Nils Achy
Graduation Date
28-06-2023
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

This research proposes a deep learning model to infer segments of speaking intention from body language captured by a body-worn accelerometer. The objective of the study is to detect instances where individuals exhibit a desire to speak based on their body language cues. The labeling scheme is a binary string, with “0” indicating no intention to speak and “1” indicating the presence of an intention to speak, over a fixed window of 40 points (a 2-second segment recorded at 100 Hz and downsampled to 20 binary labels per second). The experiment used a dataset from a real-life social event in which intentions to speak were manually annotated. A 10-minute segment from the dataset was selected and annotated using the ELAN software. The annotations covered two categories: realized intentions, where individuals intended to speak and actually did so, and unrealized intentions, where individuals displayed an intention to speak but did not take their turn. The dataset consisted of 255 segments with realized intentions and 31 segments with unrealized intentions. Additionally, 255 negative samples were included, representing segments in which no intention to speak was observed. To address the class imbalance inherent in the dataset, the model was evaluated with the Area Under the ROC Curve (AUC) metric using 5-fold cross-validation. The model was tested on realized intentions, unrealized intentions, and a combination of both, and its performance was compared against a baseline model that always predicts the onset of the intention at the middle of the segment. In addition, a second model was built that accepts varying window sizes as input for classification. This classification task provides a comparison with another study [1] in which the training segments were predetermined windows rather than precisely annotated segments. The results indicate that the deep learning models consistently outperform the baseline on the segmentation task and surpass the model trained exclusively on fixed window sizes on the classification task. This not only demonstrates the potential of body language as an informative cue for inferring speaking intentions, but also suggests that supervised learning with more precisely identified intentions can lead to superior outcomes.
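As a minimal illustration of the labeling scheme described above (a sketch under stated assumptions: the helper `label_vector` and its onset-based construction are hypothetical, not the thesis code), the following Python snippet builds the 40-point binary vector for one annotated segment:

```python
import numpy as np

# Hypothetical sketch: build the 40-point binary label vector described in the
# abstract (2 s window; labels at 20 Hz after downsampling from the 100 Hz
# accelerometer signal).
LABEL_HZ = 20                             # binary label rate in labels/second
WINDOW_SEC = 2.0                          # segment length in seconds
WINDOW_LEN = int(LABEL_HZ * WINDOW_SEC)   # 40 label points per segment

def label_vector(intent_onset_sec: float) -> np.ndarray:
    """Return a 0/1 vector of length 40: 0 before the annotated onset of the
    speaking intention, 1 from the onset to the end of the window."""
    labels = np.zeros(WINDOW_LEN, dtype=np.int8)
    onset_idx = int(round(intent_onset_sec * LABEL_HZ))
    labels[onset_idx:] = 1
    return labels

# Example: an intention annotated to begin 0.75 s into the 2 s segment
# yields 15 leading zeros followed by 25 ones.
print(label_vector(0.75))
```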
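Likewise, a rough sketch of the evaluation setup, assuming the baseline fixes the onset at the segment midpoint (zeros for the first 20 label points, ones for the last 20) and that AUC is computed per fold of a 5-fold split. The randomly generated onsets here are illustrative stand-ins for the ELAN annotations, not study data:

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

# Illustrative stand-in labels: (n_segments, 40) binary matrix, one row per
# segment. In the study these come from the ELAN annotations.
n_segments, window_len = 100, 40
onsets = rng.integers(0, window_len, size=n_segments)
y_true = (np.arange(window_len)[None, :] >= onsets[:, None]).astype(int)

# Baseline: always place the intention onset at the middle of the segment.
baseline = np.tile(np.r_[np.zeros(20), np.ones(20)], (n_segments, 1))

# AUC over all label points in each test fold of a 5-fold split.
for fold, (_, test_idx) in enumerate(KFold(n_splits=5).split(y_true)):
    auc = roc_auc_score(y_true[test_idx].ravel(), baseline[test_idx].ravel())
    print(f"fold {fold}: baseline AUC = {auc:.3f}")
```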
