T. Uno

2 records found

Understanding how users retrospectively evaluate their interactions with adaptive intelligent systems is crucial to improving how these systems behave during interactions. Prior work has shown the potential to predict retrospective evaluations from different real-time aspects of conversations, such as verbal cues and non-verbal behaviours. However, the relationship between a retrospective evaluation and the real-time evaluations made during the conversation remains unclear. This study investigates the relationship between real-time evaluations of a situation, using the Situational Interdependence Scale (SIS) framework, and its retrospective evaluations. For each SIS dimension, using the PACO dataset, we test for the presence of the peak-end rule and of a more complex relationship that could be modelled with a Long Short-Term Memory (LSTM) network. Because no ground truth exists for real-time SIS evaluations, we also present a methodologically sound technical approach that uses a Large Language Model (LLM) to estimate a value for each SIS dimension for each spoken utterance in a conversation. Analysis of the experiments revealed the absence of both the peak-end rule and an LSTM-modelled relationship across all SIS dimensions. However, both types of models at least predict better than the average of the estimated real-time evaluations. These results may be largely due to the inaccuracy of the estimated real-time SIS evaluations and the LLM's limited capability to label real-time SIS in conversational data. Future work may focus on improving the annotation of real-time SIS evaluations through human annotation or human-supervised few-shot learning of the LLM, on using other modalities in combination with verbal content, and on exploring other predictive models. ...
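As a rough illustration of the two predictors compared in the abstract above, the sketch below computes a peak-end estimate and the plain average for one sequence of per-utterance scores. It is not the thesis's implementation: the function names are hypothetical, the score values are made up, and defining the "peak" as the value with the largest magnitude is an assumption.

```python
from statistics import mean


def peak_end_estimate(scores):
    """Average of the peak and the final score.

    Assumption: the 'peak' is the score with the largest absolute value.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    peak = max(scores, key=abs)   # most extreme moment
    end = scores[-1]              # last moment of the conversation
    return (peak + end) / 2


def mean_baseline(scores):
    """Baseline: the average of all real-time scores."""
    if not scores:
        raise ValueError("scores must be non-empty")
    return mean(scores)


# Hypothetical per-utterance estimates for one SIS dimension.
utterance_scores = [0.1, -0.8, 0.3, 0.5]
print(peak_end_estimate(utterance_scores))
print(mean_baseline(utterance_scores))
```

Comparing the error of each predictor against the retrospective score, per dimension, is one way to check whether the peak-end rule adds anything over the average.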
Bachelor thesis (2022) - T. Uno, H.S. Hung, J.D. Vargas Quiros, J.A. Baaijens
Interactions between humans and machines are now common in daily life. Audio of human communication is a rich source of information, but having machines listen to it is considered privacy-invasive. Reducing the sampling frequency can preserve privacy by making the conversation unintelligible while still allowing detection of whether someone is speaking. This paper investigates how much low-sample-rate audio hinders the detection of speech. To detect speaking, voice activity detection was applied, a signal processing technique that identifies which short segments of audio contain speech. Three state-of-the-art voice activity detectors (VADs) were used in the experiment: one supervised (pyannote) and two unsupervised (rVAD in pitch mode and in flatness mode). The unsupervised methods outperformed the supervised model, with rVAD pitch mode performing best of the three. More specifically, the performance of the unsupervised VADs dropped as the sample rate decreased, while the supervised VAD did not work well at higher sample frequencies. At sample rates of 8000 Hz or higher, rVAD pitch mode performed at almost the same level as a state-of-the-art supervised VAD trained on a similar data set. Furthermore, it performed as well as a modern unsupervised VAD at sample frequencies of 2000 Hz or higher. At sample rates of 1250 Hz or lower, no VAD performed at the level of a state-of-the-art VAD. Regarding privacy, it was observed that human ears detect speaking better than computers, and that humans can understand part or all of the speech content at 2000 Hz or higher, which implies that current technology is not sufficient to detect speech from downsampled, privacy-preserving audio.
However, further research is still needed to verify the effects of the training set and its sample frequencies on the supervised method, and proper scientific social experiments are needed to test how well humans detect speech in audio with a reduced sample rate. ...
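The downsampling step described in the abstract above can be sketched as follows. This is a minimal illustration and not the thesis's pipeline: the function name is hypothetical, the 16000 Hz source rate is an assumption, and block averaging stands in for a proper anti-aliasing filter.

```python
def downsample(samples, src_rate, dst_rate):
    """Reduce the sample rate by an integer factor.

    Each output sample is the mean of `factor` consecutive input samples,
    which acts as a crude low-pass filter followed by decimation.
    """
    if dst_rate <= 0 or src_rate % dst_rate != 0:
        raise ValueError("src_rate must be a positive integer multiple of dst_rate")
    factor = src_rate // dst_rate
    n_out = len(samples) // factor  # trailing samples that do not fill a block are dropped
    return [
        sum(samples[i * factor:(i + 1) * factor]) / factor
        for i in range(n_out)
    ]


# Assumed example: 16 samples at 16000 Hz reduced to 2000 Hz (factor 8).
audio = [float(x) for x in range(16)]
low_rate = downsample(audio, src_rate=16000, dst_rate=2000)
print(low_rate)
```

A rate such as 1250 Hz is not an integer divisor of 16000 Hz, so it would need a rational resampler rather than this integer-factor sketch.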