Salient moment detection for depression prediction
E. Papadopoulou (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Catherine Oertel – Mentor (TU Delft - Interactive Intelligence)
C.A. Raman – Mentor (TU Delft - Pattern Recognition and Bioinformatics)
A. Axelsson – Mentor (TU Delft - Interactive Intelligence)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
Early detection of depression is crucial in mental healthcare. Augmenting depression diagnosis with AI is promising for detecting subtle non-verbal cues and early signs that can be missed by domain experts. For this to be achieved, AI procedures and decision processes need to be interpretable to humans. In this thesis, we use and evaluate a saliency-based explainability framework for a multimodal depression-prediction model and validate its outputs through human judgment. The multimodal input combines high-level facial features, extracted from Action Units via a 1D CNN, with high-level vocal features, extracted from log-mel spectrograms via a modified AlexNet. A simple feed-forward network then classifies each 3.5-second segment.
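The fusion-and-classification step described above can be sketched as follows. This is an illustrative outline only: the embedding dimensions, weights, and layer sizes are placeholders, and the real facial and vocal embeddings would come from the trained 1D CNN and modified AlexNet rather than random vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical per-segment embeddings; in the thesis these come from the
# modality-specific CNNs (1D CNN over Action Units, modified AlexNet over
# log-mel spectrograms). Dimensions here are assumptions for illustration.
facial_emb = rng.standard_normal(64)
vocal_emb = rng.standard_normal(128)

# Fusion: concatenate the high-level features from both modalities.
fused = np.concatenate([facial_emb, vocal_emb])  # shape (192,)

# Simple feed-forward classifier: one hidden layer, sigmoid output giving
# a depression probability for one 3.5-second segment.
W1 = rng.standard_normal((32, fused.size)) * 0.1
b1 = np.zeros(32)
W2 = rng.standard_normal(32) * 0.1
b2 = 0.0

hidden = relu(W1 @ fused + b1)
p_depressed = float(sigmoid(W2 @ hidden + b2))
```

In this sketch, fusion is plain concatenation of the two embeddings before the classifier; the thesis may use a different fusion scheme, but the segment-level feed-forward prediction follows the structure described in the abstract.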
To assess whether these AI-flagged moments align with human reasoning, 17 lay participants viewed thirty 8.5-second clips (half from depressed and half from non-depressed speakers). For each clip they (1) rated their confidence that the speaker was depressed on a 1–10 scale, (2) selected the single frame they found most influential, and (3) described the facial or vocal cues that informed their choice. The aim was for participants to give us insight into what the model may be 'seeing', so we asked them which facial and vocal features they observed in their influential moments. These experiments yielded useful insights into the model. The results show that participants' observations of non-verbal cues are valuable and align with findings in the literature. Moreover, participants' observations of their own influential moments align with the model's salient moments when the model classifies those moments correctly, and diverge when it classifies them incorrectly. These findings suggest that humans and the model rely on similar cues to correctly classify depression, and they help enhance the interpretability of AI models.
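One common way to flag a salient moment, consistent with the saliency-based framing above, is occlusion-style attribution: score each segment by how much the model's confidence drops when that segment is masked out. The sketch below is a generic illustration under that assumption, not the thesis's exact framework; `model_confidence` is a stand-in for the trained predictor.

```python
import numpy as np

rng = np.random.default_rng(1)

def model_confidence(segment_scores, mask=None):
    # Stand-in for the trained predictor: aggregate per-segment scores
    # into a recording-level confidence, optionally with one segment
    # occluded (removed) from the aggregation.
    if mask is not None:
        kept = np.delete(segment_scores, mask)
        return float(kept.mean())
    return float(segment_scores.mean())

# Hypothetical per-segment model scores for one recording.
segment_scores = rng.random(10)
baseline = model_confidence(segment_scores)

# Saliency of segment i = baseline confidence minus confidence with
# segment i occluded; larger drop means a more influential segment.
saliency = np.array([baseline - model_confidence(segment_scores, mask=i)
                     for i in range(len(segment_scores))])

# The model's "salient moment" is the most influential segment.
salient_idx = int(np.argmax(saliency))
```

With this mean-aggregation stand-in, the salient segment is simply the one with the highest individual score; with a real non-linear predictor, occlusion can surface less obvious interactions between segments.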