Forensic speaker recognition

Based on text analysis of transcribed speech fragments


Abstract

Currently, speaker recognition research is mainly based on phonetics and speech signal processing. This research addresses speaker recognition from a new perspective, analysing the transcription of a fragment of speech with text analysis methods. Since text analysis is based on the transcription text only, it can be assumed to be independent of current automatic speaker recognition software; hence, it would contribute significantly to the overall evidential value. The analysis is based on the frequencies of non-content, highly frequent words. We study whether information about the identity of the speaker is contained in the transcription of spoken text.

The value of evidence is quantified using a score-based likelihood ratio. The score-based approach is chosen because, in most forensic cases, not enough data from the suspect or from the disputed speech fragment is available to model a robust feature-based likelihood ratio. Different methods to model the system, from feature vector via score to likelihood ratio, are compared. As a baseline, a distance-based method is used, where the score is the distance between the feature vectors. To improve upon this baseline, machine learning algorithms are implemented and the results of SVM and XGBoost are explored. As a third method, a feature-based likelihood ratio is calculated and used as a score rather than as a direct likelihood ratio; with this method, both similarity and typicality are taken into account.

The model is trained and tested on the FRIDA data set from the Netherlands Forensic Institute, consisting of Dutch conversations from a homogeneous group of 250 individuals. The performance of the likelihood ratio system is evaluated by computing the log-likelihood-ratio cost (Cllr), a measure of the accuracy and quality of the likelihood ratios, and the accuracy (A) of the likelihood ratios alone. The performance is also evaluated by inspecting the Tippett, empirical cross-entropy and pool-adjacent-violators plots. Different values of the parameters used in the calculation of the likelihood ratios are investigated: the length of the sample, the number of frequent words (the number of features) and the number of samples needed to train the model.

The distance method provided a strong baseline, with good performance for large sample lengths. The SVM method outperformed the distance method for all parameter settings, with a peak performance of A=0.94 and Cllr=0.24. The XGBoost method showed promising results for smaller sample lengths, but requires too much data to obtain good performance for larger sample lengths. The LR score method showed moderate results, but no improvement, owing to the need to estimate high-dimensional distributions. This thesis shows that information about the identity of the speaker is contained in transcriptions of speech. The complete process from data to likelihood ratio is constructed, where the likelihood ratio quantifies the evidential value of a transcribed speech fragment.
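
To make the feature and baseline score concrete, the following is a minimal Python sketch: each transcribed fragment is reduced to a vector of relative frequencies of highly frequent function words, and the score for a pair of fragments is the distance between their vectors. The word list, tokenisation and choice of Euclidean distance are illustrative assumptions, not the exact configuration used in the thesis.

    from collections import Counter
    import numpy as np

    def frequency_vector(tokens, vocabulary):
        """Relative frequencies of the chosen frequent words in one sample."""
        counts = Counter(tokens)
        total = max(len(tokens), 1)
        return np.array([counts[w] / total for w in vocabulary])

    def distance_score(tokens_a, tokens_b, vocabulary):
        """Baseline score: Euclidean distance between the two feature vectors."""
        return float(np.linalg.norm(
            frequency_vector(tokens_a, vocabulary)
            - frequency_vector(tokens_b, vocabulary)))

    # Tiny illustrative list of Dutch function words (a real system uses many more).
    vocab = ["de", "het", "een", "en", "van", "ik", "je", "dat", "niet", "maar"]
    sample_1 = "ik denk dat het niet van mij is maar van jou".split()
    sample_2 = "ja maar ik weet niet of dat een goed idee is".split()
    print(distance_score(sample_1, sample_2, vocab))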
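
The step from score to likelihood ratio requires a calibration model fitted on scores from known same-speaker and known different-speaker pairs. The sketch below uses logistic-regression calibration, a common choice in forensic likelihood ratio work; since the thesis compares several score-to-LR methods, this should be read as one plausible instantiation rather than the method actually used.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fit_calibrator(same_scores, diff_scores):
        """Fit a logistic-regression calibrator on labelled training scores."""
        X = np.concatenate([same_scores, diff_scores]).reshape(-1, 1)
        y = np.concatenate([np.ones(len(same_scores)), np.zeros(len(diff_scores))])
        model = LogisticRegression().fit(X, y)
        # Remove the effect of the same/different proportion in the training set,
        # so the output is interpretable as a likelihood ratio, not a posterior.
        prior_log_odds = np.log(len(same_scores) / len(diff_scores))
        return model, prior_log_odds

    def score_to_log10_lr(model, prior_log_odds, score):
        """Map one comparison score to a base-10 log likelihood ratio."""
        log_odds = model.decision_function(np.array([[score]]))[0]
        return (log_odds - prior_log_odds) / np.log(10)

    # Toy example: small distance scores for same-speaker pairs, larger scores
    # for different-speaker pairs (all numbers made up for illustration).
    same = np.array([0.10, 0.12, 0.08, 0.15, 0.11])
    diff = np.array([0.30, 0.28, 0.35, 0.25, 0.32])
    model, prior = fit_calibrator(same, diff)
    print(score_to_log10_lr(model, prior, 0.12))  # > 0: supports same speaker
    print(score_to_log10_lr(model, prior, 0.31))  # < 0: supports different speakers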
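
The log-likelihood-ratio cost (Cllr) used for evaluation penalises likelihood ratios that are both misleading and poorly calibrated. Below is a short sketch of its standard definition, with illustrative variable names.

    import numpy as np

    def cllr(lrs_same, lrs_diff):
        """Log-likelihood-ratio cost: mean log2 penalty over both kinds of pairs."""
        lrs_same = np.asarray(lrs_same, dtype=float)
        lrs_diff = np.asarray(lrs_diff, dtype=float)
        penalty_same = np.mean(np.log2(1.0 + 1.0 / lrs_same))
        penalty_diff = np.mean(np.log2(1.0 + lrs_diff))
        return 0.5 * (penalty_same + penalty_diff)

    # A completely uninformative system (every LR equal to 1) scores Cllr = 1;
    # well-calibrated, discriminating systems score closer to 0.
    print(cllr([10.0, 5.0, 1.0], [0.1, 0.2, 1.0]))
    print(cllr([1.0, 1.0], [1.0, 1.0]))  # exactly 1.0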