S. Feng | TU Delft Repository

Automatic evaluation of spontaneous oral cancer speech using ratings from naive listeners

Journal article (2023) - Bence Mark Halpern (author) , Siyuan Feng (author) , Rob J.J.H. van Son (author) , Michiel van den Brekel (author) , Odette Scharenborg (author)

In this paper, we build and compare multiple speech systems for the automatic evaluation of the severity of a speech impairment due to oral cancer, based on spontaneous speech. To be able to build and evaluate such systems, we collected a new spontaneous oral cancer speech corpus ...

Towards inclusive automatic speech recognition

Journal article (2023) - S. Feng (author) , Bence Mark Halpern (author) , O. Kudina (author) , Odette Scharenborg (author)

Practice and recent evidence show that state-of-the-art (SotA) automatic speech recognition (ASR) systems do not perform equally well for all speaker groups. Many factors can cause this bias against different speaker groups. This paper, for the first time, systematically quantifi ...

The Effectiveness of Time Stretching for Enhancing Dysarthric Speech for Improved Dysarthric Speech Recognition

Journal article (2022) - Luke Prananta (author) , B.M. Halpern (author) , Siyuan Feng (author) , O.E. Scharenborg (author)

In this paper, we investigate several existing and a new state-of-the-art generative adversarial network-based (GAN) voice conversion method for enhancing dysarthric speech for improved dysarthric speech recognition. We compare key components of existing methods as part of a rigo ...

Discovering phonetic inventories with crosslingual automatic speech recognition

Journal article (2022) - Piotr Żelasko (author) , Siyuan Feng (author) , Laureano Moro Velázquez (author) , Ali Abavisani (author) , Saurabhchand Bhati (author) , O.E. Scharenborg (author) , Mark Hasegawa-Johnson (author) , Najim Dehak (author)

The high cost of data acquisition makes Automatic Speech Recognition (ASR) model training problematic for most existing languages, including languages that do not even have a written script, or for which the phone inventories remain unknown. Past works explored multilingual train ...

Self-supervised Semantic-driven Phoneme Discovery for Zero-resource Speech Recognition

Conference paper (2022) - Liming Wang (author) , Siyuan Feng (author) , Mark Hasegawa-Johnson (author) , Chang D. Yoo (author)

Phonemes are defined by their relationship to words: changing a phoneme changes the word. Learning a phoneme inventory with little supervision has been a longstanding challenge with important applications to under-resourced speech technology. In this paper, we bridge the gap betw ...

Low-resource automatic speech recognition and error analyses of oral cancer speech

Journal article (2022) - Bence Mark Halpern (author) , Siyuan Feng (author) , Rob J.J.H. van Son (author) , Michiel van den Brekel (author) , Odette Scharenborg (author)

In this paper, we introduce a new corpus of oral cancer speech and present our study on the automatic recognition and analysis of oral cancer speech. A two-hour English oral cancer speech dataset is collected from YouTube. Formulated as a low-resource oral cancer ASR task, we inv ...

Show and speak

Directly synthesize spoken description of images

Conference paper (2021) - Xinsheng Wang (author) , Siyuan Feng (author) , Jihua Zhu (author) , Mark Hasegawa-Johnson (author) , O.E. Scharenborg (author)

This paper proposes a new model, referred to as the show and speak (SAS) model that, for the first time, is able to directly synthesize spoken descriptions of images, bypassing the need for any text or phonemes. The basic structure of SAS is an encoder-decoder architecture that t ...

How phonotactics affect multilingual and zero-shot asr performance

Conference paper (2021) - Siyuan Feng (author) , Piotr Żelasko (author) , Laureano Moro-Velazquez (author) , Ali Abavisani (author) , Mark Hasegawa-Johnson (author) , O.E. Scharenborg (author) , Najim Dehak (author)

The idea of combining multiple languages’ recordings to train a single automatic speech recognition (ASR) model brings the promise of the emergence of universal speech representation. Recently, a Transformer encoder-decoder model has been shown to leverage multilingual data well ...

Unsupervised acoustic unit discovery by leveraging a language-independent subword discriminative feature representation

Conference paper (2021) - S. Feng (author) , Piotr Zelasko (author) , Laureano Moro-Velázquez (author) , O.E. Scharenborg (author)

This paper tackles automatically discovering phone-like acoustic units (AUD) from unlabeled speech data. Past studies usually proposed single-step approaches. We propose a twostage approach: the first stage learns a subword-discriminative feature representation, and the second st ...

The effectiveness of self-supervised representation learning in zero-resource subword modeling

Conference paper (2021) - S. Feng (author) , O.E. Scharenborg (author)

For a language with no transcribed speech available (the zero-resource scenario), conventional acoustic modeling algorithms are not applicable. Recently, zero-resource acoustic modeling has gained much interest. One research problem is unsupervised subword modeling (USM), i.e., l ...

The effectiveness of unsupervised subword modeling with autoregressive and cross-lingual phone-aware networks

Journal article (2021) - S. Feng (author) , O.E. Scharenborg (author)

This study addresses unsupervised subword modeling, i.e., learning acoustic feature representations that can distinguish between subword units of a language. We propose a two-stage learning framework that combines self-supervised learning and cross-lingual knowledge transfer. The ...

Unsupervised Subword Modeling Using Autoregressive Pretraining and Cross-Lingual Phone-Aware Modeling

Conference paper (2020) - S. Feng (author) , O.E. Scharenborg (author)

This study addresses unsupervised subword modeling, i.e.,
learning feature representations that can distinguish subword
units of a language. The proposed approach adopts a two-stage
bottleneck feature (BNF) learning framework, consisting of autoregressive
predicti ...