Inferring Segments of Speaking Intention Using a Body-worn Accelerometer

Enhancing social interaction with AI-powered systems


This research paper proposes a deep learning model to infer segments of speaking intention from body language captured by a body-worn accelerometer. The objective of the study is to detect instances where individuals exhibit a desire to speak based on their body-language cues. The labeling scheme is a binary string, with “0” indicating no intention to speak and “1” indicating an intention to speak, over a fixed window size of 40 points (a 2-second segment recorded at a frequency of 100 Hz and scaled down to 20 binary points per second).

A real-life social event dataset was employed, and intentions to speak were manually annotated. A 10-minute segment of the dataset was selected and annotated using the ELAN software. The annotations cover two categories: realized intentions, where individuals intended to speak and actually did so, and unrealized intentions, where individuals displayed an intention to speak but did not take their turn. The dataset consists of 255 segments with realized intentions and 31 segments with unrealized intentions. Additionally, 255 negative samples were included, representing instances where no intention to speak was observed throughout the entire segment.

To address the class imbalance inherent in the dataset, the model was evaluated with the Area Under the ROC Curve (AUC) metric using 5-fold cross-validation. The model was tested on realized intentions, unrealized intentions, and a combination of both, and its performance was compared against a baseline model that always predicts the start of the intention at the middle of the segment. In addition, a second model was built that accepts varying window sizes as input for a classification task, providing a comparison with an earlier study [1] in which the training segments were predetermined window sizes rather than precisely annotated segments.
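The windowing arithmetic above (100 Hz recordings reduced to 20 binary label points per second, so a 2-second segment yields a 40-point label string) can be sketched as follows. This is a minimal illustration, not the paper's actual preprocessing code; the majority-vote downsampling rule and the function name are assumptions.

```python
import numpy as np

def downsample_labels(labels_100hz, factor=5):
    """Reduce a 100 Hz binary annotation to 20 Hz by majority vote
    over non-overlapping blocks of `factor` samples.

    NOTE: illustrative sketch; the paper does not specify how the
    100 Hz annotation is reduced to 20 points per second.
    `labels_100hz` must have a length that is a multiple of `factor`.
    """
    blocks = np.asarray(labels_100hz).reshape(-1, factor)
    return (blocks.mean(axis=1) >= 0.5).astype(int)

# A 2-second segment at 100 Hz (200 samples) becomes 40 binary points:
segment = np.zeros(200, dtype=int)
segment[120:] = 1  # hypothetical intention onset 1.2 s into the segment
window = downsample_labels(segment)  # 40-point binary label string
```

Any reasonable reduction rule (subsampling, majority vote, logical OR per block) yields the same 40-point window length; only the treatment of block boundaries at the onset differs.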
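The midpoint baseline and the AUC metric can likewise be made concrete. The sketch below, under the assumption that the baseline is scored per label point against the annotated 40-point window, assigns score 0 to the first half of the window and 1 to the second half (i.e., it always places the intention onset at the segment midpoint) and computes the AUC for one segment; function names are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def midpoint_baseline_scores(n_points=40):
    """Baseline that always predicts the intention onset at the
    segment midpoint: score 0 for the first half of the window,
    1 for the second half. (Illustrative reading of the baseline.)"""
    scores = np.zeros(n_points)
    scores[n_points // 2:] = 1.0
    return scores

def evaluate_segment_auc(true_labels):
    """AUC of the midpoint baseline against one annotated segment's
    40-point binary label string."""
    true_labels = np.asarray(true_labels)
    scores = midpoint_baseline_scores(len(true_labels))
    return roc_auc_score(true_labels, scores)
```

In the study this evaluation is repeated under 5-fold cross-validation and averaged; AUC is used because it is insensitive to the strong class imbalance (31 unrealized vs. 255 realized intention segments).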
The results of the study indicate that the deep learning models perform consistently better than the baseline on the segmentation task and surpass the model trained exclusively on fixed window sizes on the classification task. This not only demonstrates the potential of body language as an informative cue for inferring speaking intentions, but also suggests that supervised learning in which intentions are identified with greater precision can lead to superior outcomes.