Search results | TU Delft Repositories

Improving End-to-End Models for Children’s Speech Recognition

Patel, T.B. (author), Scharenborg, O.E. (author)

Children’s Speech Recognition (CSR) is a challenging task due to the high variability in children’s speech patterns and the limited amount of available annotated children’s speech data. We aim to improve CSR in the often-occurring scenario that no children’s speech data is available for training the Automatic Speech Recognition (ASR) systems....

journal article 2024

Towards inclusive automatic speech recognition

Feng, S. (author), Halpern, B.M. (author), Kudina, O. (author), Scharenborg, O.E. (author)

Practice and recent evidence show that state-of-the-art (SotA) automatic speech recognition (ASR) systems do not perform equally well for all speaker groups. Many factors can cause this bias against different speaker groups. This paper, for the first time, systematically quantifies and finds speech recognition bias against gender, age, regional...

journal article 2023

Improving Adaptive Learning Models Using Prosodic Speech Features

Wilschut, Thomas (author), Sense, Florian (author), Scharenborg, O.E. (author), van Rijn, Hedderik (author)

Cognitive models of memory retrieval aim to describe human learning and forgetting over time. Such models have been successfully applied in digital systems that aid in memorizing information by adapting to the needs of individual learners. The memory models used in these systems typically measure the accuracy and latency of typed retrieval...

conference paper 2023

Improving Whispered Speech Recognition Performance Using Pseudo-Whispered Based Data Augmentation

Lin, Zhaofeng (author), Patel, T.B. (author), Scharenborg, O.E. (author)

Whispering is a distinct form of speech known for its soft, breathy, and hushed characteristics, often used for private communication. The acoustic characteristics of whispered speech differ substantially from normally phonated speech and the scarcity of adequate training data leads to low automatic speech recognition (ASR) performance. To...

conference paper 2023

The Multimodal Information Based Speech Processing (MISP) 2022 Challenge: Audio-Visual Diarization and Recognition

Wang, Zhe (author), Wu, Shilong (author), Chen, Hang (author), He, Mao-Kui (author), Du, Jun (author), Lee, Chin-Hui (author), Chen, Jingdong (author), Watanabe, Shinji (author), Siniscalchi, Sabato Marco (author), Scharenborg, O.E. (author), Liu, Diyuan (author)

The Multi-modal Information based Speech Processing (MISP) challenge aims to extend the application of signal processing technology in specific scenarios by promoting the research into wake-up words, speaker diarization, speech recognition, and other technologies. The MISP2022 challenge has two tracks: 1) audio-visual speaker diarization (AVSD),...

conference paper 2023

Exploring Data Augmentation in Bias Mitigation Against Non-Native-Accented Speech

Zhang, Y. (author), Herygers, Aaricia (author), Patel, T.B. (author), Yue, Z. (author), Scharenborg, O.E. (author)

Automatic speech recognition (ASR) should serve every speaker, not only the majority “standard” speakers of a language. In order to build inclusive ASR, mitigating the bias against speaker groups who speak in a “non-standard” or “diverse” way is crucial. We aim to mitigate the bias against non-native-accented Flemish in a Flemish ASR system....

conference paper 2023

Mitigating bias against non-native accents

Zhang, Y. (author), Zhang, Yixuan (author), Halpern, B.M. (author), Patel, T.B. (author), Scharenborg, O.E. (author)

Automatic speech recognition (ASR) systems have seen substantial improvements in the past decade; however, not for all speaker groups. Recent research shows that bias exists against different types of speech, including non-native accents, in state-of-the-art (SOTA) ASR systems. To attain inclusive speech recognition, i.e., ASR for everyone...

journal article 2022

Low-resource automatic speech recognition and error analyses of oral cancer speech

Halpern, B.M. (author), Feng, S. (author), van Son, Rob (author), van den Brekel, Michiel (author), Scharenborg, O.E. (author)

In this paper, we introduce a new corpus of oral cancer speech and present our study on the automatic recognition and analysis of oral cancer speech. A two-hour English oral cancer speech dataset is collected from YouTube. Formulated as a low-resource oral cancer ASR task, we investigate three acoustic modelling approaches that previously...

journal article 2022

Modelling Human Word Learning and Recognition Using Visually Grounded Speech

Merkx, D.G.M. (author), Scholten, Sebastiaan (author), Frank, Stefan L. (author), Ernestus, Mirjam (author), Scharenborg, O.E. (author)

Many computational models of speech recognition assume that the set of target words is already given. This implies that these models learn to recognise speech in a biologically unrealistic manner, i.e. with prior lexical knowledge and explicit supervision. In contrast, visually grounded speech models learn to recognise speech without prior...

journal article 2022

The differential roles of lexical and sublexical processing during spoken-word recognition in clear and in noise

Strauß, Antje (author), Wu, Tongyu (author), McQueen, James M. (author), Scharenborg, O.E. (author), Hintz, Florian (author)

Successful spoken-word recognition relies on interplay between lexical and sublexical processing. Previous research demonstrated that listeners readily shift between more lexically-biased and more sublexically-biased modes of processing in response to the situational context in which language comprehension takes place. Recognizing words in...

journal article 2022

The First Multimodal Information Based Speech Processing (MISP) Challenge: Data, Tasks, Baselines and Results

Chen, Hang (author), Zhou, Hengshun (author), Du, Jun (author), Lee, Chin-Hui (author), Chen, Jingdong (author), Watanabe, Shinji (author), Siniscalchi, Sabato Marco (author), Scharenborg, O.E. (author), Liu, Di-Yuan (author)

In this paper, we discuss the rationale behind the Multi-modal Information based Speech Processing (MISP) Challenge, and provide a detailed description of the data recorded, the two evaluation tasks and the corresponding baselines, followed by a summary of submitted systems and evaluation results. The MISP Challenge aims at tackling speech processing...

conference paper 2022

The Presence of Background Noise Extends the Competitor Space in Native and Non-Native Spoken-Word Recognition: Insights from Computational Modeling

Karaminis, Themis (author), Hintz, Florian (author), Scharenborg, O.E. (author)

Oral communication often takes place in noisy environments, which challenge spoken-word recognition. Previous research has suggested that the presence of background noise extends the number of candidate words competing with the target word for recognition and that this extension affects the time course and accuracy of spoken-word recognition....

journal article 2022

The Effectiveness of Time Stretching for Enhancing Dysarthric Speech for Improved Dysarthric Speech Recognition

Prananta, Luke (author), Halpern, B.M. (author), Feng, S. (author), Scharenborg, O.E. (author)

In this paper, we investigate several existing generative adversarial network (GAN)-based voice conversion methods and a new state-of-the-art one for enhancing dysarthric speech to improve dysarthric speech recognition. We compare key components of existing methods as part of a rigorous ablation study to find the most effective solution to...

journal article 2022

Audio-Visual Speech Recognition in MISP2021 Challenge: Dataset Release and Deep Analysis

Chen, Hang (author), Du, Jun (author), Dai, Yusheng (author), Lee, Chin Hui (author), Siniscalchi, Sabato Marco (author), Watanabe, Shinji (author), Scharenborg, O.E. (author), Chen, Jingdong (author), Yin, Bao Cai (author), Pan, Jia (author)

In this paper, we present the updated Audio-Visual Speech Recognition (AVSR) corpus of the MISP2021 challenge, a large-scale audio-visual Chinese conversational corpus consisting of 141 h of audio and video data collected by far/middle/near microphones and far/middle cameras in 34 real-home TV rooms. To the best of our knowledge, our corpus is the first...

journal article 2022

Discovering phonetic inventories with crosslingual automatic speech recognition

Żelasko, Piotr (author), Feng, S. (author), Moro Velázquez, Laureano (author), Abavisani, Ali (author), Bhati, Saurabhchand (author), Scharenborg, O.E. (author), Hasegawa-Johnson, Mark (author), Dehak, Najim (author)

The high cost of data acquisition makes Automatic Speech Recognition (ASR) model training problematic for most existing languages, including languages that do not even have a written script, or for which the phone inventories remain unknown. Past works explored multilingual training, transfer learning, as well as zero-shot learning in order...

journal article 2022

How phonotactics affect multilingual and zero-shot ASR performance

Feng, S. (author), Żelasko, Piotr (author), Moro-Velázquez, Laureano (author), Abavisani, Ali (author), Hasegawa-Johnson, Mark (author), Scharenborg, O.E. (author), Dehak, Najim (author)

The idea of combining multiple languages’ recordings to train a single automatic speech recognition (ASR) model brings the promise of the emergence of universal speech representation. Recently, a Transformer encoder-decoder model has been shown to leverage multilingual data well in IPA transcriptions of languages presented during training....

conference paper 2021

Speech technology for unwritten languages

Scharenborg, O.E. (author), Besacier, Laurent (author), Black, Alan W. (author), Hasegawa-Johnson, Mark (author), Metze, Florian (author), Neubig, Graham (author), Stueker, Sebastian (author), Godard, Pierre (author), Mueller, M (author)

Speech technology plays an important role in our everyday life. Among others, speech is used for human-computer interaction, for instance for information retrieval and on-line shopping. In the case of an unwritten language, however, speech technology is unfortunately difficult to create, because it cannot be created by the standard...

journal article 2020

That Sounds Familiar: an Analysis of Phonetic Representations Transfer Across Languages

Żelasko, Piotr (author), Moro-Velázquez, Laureano (author), Hasegawa-Johnson, Mark (author), Scharenborg, O.E. (author), Dehak, Najim (author)

Only a handful of the world’s languages are abundant with the resources that enable practical applications of speech processing technologies. One of the methods to overcome this problem is to use the resources existing in other languages to train a multilingual automatic speech recognition (ASR) model, which, intuitively, should learn some...

conference paper 2020

Why listening in background noise is harder in a non-native language than in a native language: A review

Scharenborg, O.E. (author), van Os, Marjolein (author)

There is ample evidence that recognising words in a non-native language is more difficult than in a native language, even for those with a high proficiency in the non-native language involved, and particularly in the presence of background noise. Why is this the case? To answer this question, this paper provides a systematic review of the...

review 2019

Study of the performance of automatic speech recognition systems in speakers with Parkinson’s Disease

Moro-Velazquez, Laureano (author), Cho, JaeJin (author), Watanabe, Shinji (author), Hasegawa-Johnson, Mark A. (author), Scharenborg, O.E. (author), Kim, Heejin (author), Dehak, Najim (author)

Parkinson’s Disease (PD) affects motor capabilities of patients, who in some cases need to use human-computer assistive technologies to regain independence. The objective of this work is to study in detail the differences in error patterns from state-of-the-art Automatic Speech Recognition (ASR) systems on speech from people with and without PD....

conference paper 2019
