O.E. Scharenborg
71 records found
Multilingual phone recognition models can learn language-independent pronunciation patterns from large volumes of spoken data and recognize them across languages. This potential can be harnessed to improve speech technologies for under-resourced languages. However, these models ar
...
Introduction: This work aims to understand the contextual factors affecting speech emotion recognition (SER); more specifically, the current research investigates whether the identification of vocal emotional expressions of anger, fear, sadness, joy, and neutrality is affected by
...
State-of-the-art ASRs show suboptimal performance for child speech. The scarcity of child speech limits the development of child speech recognition (CSR). Therefore, we studied child-to-child voice conversion (VC) from existing child speakers in the dataset and additional (new) c
...
Children’s Speech Recognition (CSR) is a challenging task due to the high variability in children’s speech patterns and the limited amount of available annotated children’s speech data. We aim to improve CSR in the often-occurring scenario in which no children’s speech data is available
...
Finding Spoken Identifications: Using GPT-4 Annotation for an Efficient and Fast Dataset Creation Pipeline
The growing emphasis on fairness in speech-processing tasks requires datasets with speakers from diverse subgroups that allow training and evaluating fair speech technology systems. However, creating such datasets through manual annotation can be costly. To address this challenge
...
Brain-Computer Interfaces (BCIs) open avenues for communication among individuals unable to use voice or gestures. Silent speech interfaces are one such approach for BCIs that could offer a transformative means of connecting with the external world. Performance on imagined speech
...
Research has shown that automatic speech recognition (ASR) systems exhibit biases against different speaker groups, e.g., based on age or gender. This paper presents an investigation into bias in recent Flemish ASR. Since Belgian Dutch, which is also known as Flemish, is ofte
...
In this paper, we build and compare multiple speech systems for the automatic evaluation of the severity of a speech impairment due to oral cancer, based on spontaneous speech. To be able to build and evaluate such systems, we collected a new spontaneous oral cancer speech corpus
...
AnyoneNet: Synchronized Speech and Talking Head Generation for Arbitrary Persons
Automatically generating videos in which synthesized speech is synchronized with lip movements in a talking head has great potential in many human-computer interaction scenarios. In this paper, we present an automatic method to generate synchronized speech and talking-head videos
...
Practice and recent evidence show that state-of-the-art (SotA) automatic speech recognition (ASR) systems do not perform equally well for all speaker groups. Many factors can cause this bias against different speaker groups. This paper, for the first time, systematically quantifi
...
Automatic speech recognition (ASR) should serve every speaker, not only the majority “standard” speakers of a language. In order to build inclusive ASR, mitigating the bias against speaker groups who speak in a “non-standard” or “diverse” way is crucial. We aim to mitigate the bi
...
Learning to process speech in a foreign language involves learning new representations for mapping the auditory signal to linguistic structure. Behavioral experiments suggest that even listeners who are highly proficient in a non-native language experience interference from repr
...
Silent speech interfaces could enable people who lost the ability to use their voice or gestures to communicate with the external world, e.g., through decoding the person’s brain signals when imagining speech. Only a few and small databases exist that allow for the development an
...
Whispering is a distinct form of speech known for its soft, breathy, and hushed characteristics, often used for private communication. The acoustic characteristics of whispered speech differ substantially from normally phonated speech and the scarcity of adequate training data le
...
Cognitive models of memory retrieval aim to describe human learning and forgetting over time. Such models have been successfully applied in digital systems that aid in memorizing information by adapting to the needs of individual learners. The memory models used in these systems
...
The Multimodal Information Based Speech Processing (MISP) 2022 Challenge: Audio-Visual Diarization and Recognition
The Multi-modal Information based Speech Processing (MISP) challenge aims to extend the application of signal processing technology in specific scenarios by promoting the research into wake-up words, speaker diarization, speech recognition, and other technologies. The MISP2022 ch
...
In this paper, we investigate several existing voice conversion methods and a new state-of-the-art generative adversarial network (GAN)-based voice conversion method for enhancing dysarthric speech to improve dysarthric speech recognition. We compare key components of existing methods as part of a rigo
...
The high cost of data acquisition makes Automatic Speech Recognition (ASR) model training problematic for most existing languages, including languages that do not even have a written script, or for which the phone inventories remain unknown. Past works explored multilingual train
...
In the diverse and multilingual land of India, Hindi is spoken as a first language by a majority of its population. Efforts are made to obtain data in terms of audio, transcriptions, dictionary, etc. to develop speech-technology applications in Hindi. Similarly, the Gram-Vaani AS
...
Listeners frequently recognize spoken words in the presence of background noise. Previous research has shown that noise reduces phoneme intelligibility and hampers spoken-word recognition – especially for non-native listeners. In the present study, we investigated how noise influ
...