Mark Hasegawa-Johnson | TU Delft Repository

Finding Spoken Identifications

Using GPT-4 Annotation For An Efficient And Fast Dataset Creation Pipeline

Conference paper (2024) - Maliha Jahan (author) , Helin Wang (author) , Thomas Thebaud (author) , Yinglun Sun (author) , Giang Le (author) , Zsuzsanna Fagyal (author) , Odette Scharenborg (author) , Mark Hasegawa-Johnson (author) , Laureano Moro-Velázquez (author) , Najim Dehak (author)

The growing emphasis on fairness in speech-processing tasks requires datasets with speakers from diverse subgroups that allow training and evaluating fair speech technology systems. However, creating such datasets through manual annotation can be costly. To address this challenge ...

Discovering phonetic inventories with crosslingual automatic speech recognition

Journal article (2022) - Piotr Żelasko (author) , Siyuan Feng (author) , Laureano Moro Velázquez (author) , Ali Abavisani (author) , Saurabhchand Bhati (author) , O.E. Scharenborg (author) , Mark Hasegawa-Johnson (author) , Najim Dehak (author)

The high cost of data acquisition makes Automatic Speech Recognition (ASR) model training problematic for most existing languages, including languages that do not even have a written script, or for which the phone inventories remain unknown. Past works explored multilingual train ...

Self-supervised Semantic-driven Phoneme Discovery for Zero-resource Speech Recognition

Conference paper (2022) - Liming Wang (author) , Siyuan Feng (author) , Mark Hasegawa-Johnson (author) , Chang D. Yoo (author)

Phonemes are defined by their relationship to words: changing a phoneme changes the word. Learning a phoneme inventory with little supervision has been a longstanding challenge with important applications to under-resourced speech technology. In this paper, we bridge the gap betw ...

Synthesizing Spoken Descriptions of Images

Journal article (2021) - X. Wang (author) , Justin van der Hout (author) , Jihua Zhu (author) , Mark Hasegawa-Johnson (author) , O.E. Scharenborg (author)

Image captioning technology has great potential in many scenarios. However, current text-based image captioning methods cannot be applied to approximately half of the world's languages due to these languages’ lack of a written form. To solve this problem, recently the image-to-sp ...

How phonotactics affect multilingual and zero-shot ASR performance

Conference paper (2021) - Siyuan Feng (author) , Piotr Żelasko (author) , Laureano Moro-Velazquez (author) , Ali Abavisani (author) , Mark Hasegawa-Johnson (author) , O.E. Scharenborg (author) , Najim Dehak (author)

The idea of combining multiple languages’ recordings to train a single automatic speech recognition (ASR) model brings the promise of the emergence of universal speech representation. Recently, a Transformer encoder-decoder model has been shown to leverage multilingual data well ...

Position Paper

Brain Signal-Based Dialogue Systems

Book chapter (2021) - O.E. Scharenborg (author) , Mark Hasegawa-Johnson (author)

This position paper focuses on the problem of building dialogue systems for people who have lost the ability to communicate via speech, e.g., patients of locked-in syndrome or severely disabled people. In order for such people to communicate to other people and computers, dialogu ...

Align or attend?

Toward More Efficient and Accurate Spoken Word Discovery Using Speech-to-Image Retrieval

Conference paper (2021) - Liming Wang (author) , Xinsheng Wang (author) , Mark Hasegawa-Johnson (author) , Odette Scharenborg (author) , Najim Dehak (author)

Multimodal word discovery (MWD) is often treated as a byproduct of the speech-to-image retrieval problem. However, our theoretical analysis shows that some kind of alignment/attention mechanism is crucial for a MWD system to learn meaningful word-level representation. We verify o ...

Show and speak

Directly synthesize spoken description of images

Conference paper (2021) - Xinsheng Wang (author) , S. Feng (author) , Jihua Zhu (author) , Mark Hasegawa-Johnson (author) , O.E. Scharenborg (author)

This paper proposes a new model, referred to as the show and speak (SAS) model that, for the first time, is able to directly synthesize spoken descriptions of images, bypassing the need for any text or phonemes. The basic structure of SAS is an encoder-decoder architecture that t ...

Evaluating automatically generated phoneme captions for images

Conference paper (2020) - Justin van der Hout (author) , Zoltán D’Haese (author) , Mark Hasegawa-Johnson (author) , O.E. Scharenborg (author)

Image2Speech is the relatively new task of generating a spoken description of an image. This paper presents an investigation into the evaluation of this task. For this, first an Image2Speech system was implemented which generates image captions consisting of phoneme sequences. Th ...

Speech technology for unwritten languages

Journal article (2020) - O.E. Scharenborg (author) , Laurent Besacier (author) , Alan W. Black (author) , Mark Hasegawa-Johnson (author) , Florian Metze (author) , Graham Neubig (author) , Sebastian Stueker (author) , Pierre Godard (author) , M. Mueller (author) , More Authors...

Speech technology plays an important role in our everyday life. Among others, speech is used for human-computer interaction, for instance for information retrieval and on-line shopping. In the case of an unwritten language, however, speech technology is unfortunately difficult to ...

That Sounds Familiar

An Analysis of Phonetic Representations Transfer Across Languages

Conference paper (2020) - Piotr Żelasko (author) , Laureano Moro-Velazquez (author) , Mark Hasegawa-Johnson (author) , O.E. Scharenborg (author) , Najim Dehak (author)

Only a handful of the world’s languages are abundant with the resources that enable practical applications of speech processing technologies. One of the methods to overcome this problem is to use the resources existing in other languages to train a multilingual automatic speech r ...

The Time-Course of Phoneme Category Adaptation in Deep Neural Networks

Conference paper (2019) - Junrui Ni (author) , Mark Hasegawa-Johnson (author) , O.E. Scharenborg (author)

Both human listeners and machines need to adapt their sound categories whenever a new speaker is encountered. This perceptual learning is driven by lexical information. In previous work, we have shown that deep neural network-based (DNN) ASR systems can learn to adapt their phone ...

Study of the performance of automatic speech recognition systems in speakers with Parkinson’s Disease

Conference paper (2019) - Laureano Moro-Velázquez (author) , JaeJin Cho (author) , S Watanabe (author) , Mark Hasegawa-Johnson (author) , O.E. Scharenborg (author) , H Kim (author) , Najim Dehak (author)

Parkinson’s Disease (PD) affects motor capabilities of patients, who in some cases need to use human-computer assistive technologies to regain independence. The objective of this work is to study in detail the differences in error patterns from state-of-the-art Automatic Speech R ...

The neural correlates underlying lexically-guided perceptual learning

Conference paper (2019) - O.E. Scharenborg (author) , Jiska Koemans (author) , Cybelle Smith (author) , Mark Hasegawa-Johnson (author) , Kara D. Federmeier (author)

There is ample evidence showing that listeners are able to quickly adapt their phoneme classes to ambiguous sounds using a process called lexically-guided perceptual learning. This paper presents the first attempt to examine the neural correlates underlying this process. Specific ...

Visualizing Phoneme Category Adaptation in Deep Neural Networks

Conference paper (2018) - Odette Scharenborg (author) , Sebastian Tiesmeyer (author) , Mark Hasegawa-Johnson (author) , Najim Dehak (author)

Both human listeners and machines need to adapt their sound categories whenever a new speaker is encountered. This perceptual learning is driven by lexical information. The aim of this paper is two-fold: investigate whether a deep neural network-based (DNN) ASR system can adapt t ...

Methods for Inferring the Phone Set of an Unwritten Language

Abstract (2018) - Mark Hasegawa-Johnson (author) , Wenda Chen (author) , O.E. Scharenborg (author)

In engineering applications, phones are the representation intermediate between text and speech in many text-to-speech (TTS) and speech-to-text (STT) systems. When a language has no written form, TTS and STT are no longer meaningful acronyms as there is no text; we consider instead XTS and STX, where X is some other representation that can be easily interpreted by a human user, for example, image, translation, or chat. This paper presents experimental results from two speech applications for unwritten languages: image-to-speech and speech-to-chat. Experimental evidence from these two applications suggests that the performance of an XTS or STX application can be significantly improved by defining or inferring a phone set for the unwritten language. Image-to-speech (ITS) is the task of generating a spoken description of an image, in a language that has no written form. ITS can be trained and tested as a neural sequence-to-sequence transduction problem, in which an input sequence of sub-images is encoded, attended, and converted into a sequence of phone symbols, from which an output audio signal can be generated. The quality of ITS output varies dramatically depending on the quality of the phone set. Cheating experiments using a known correct phone set resulted in intelligible and meaningful spoken descriptions, but experiments using a cross-language phone set, or one automatically created using unsupervised methods, did not. Extrapolating beyond current experimental results, a simulated annealing algorithm will be presented that may be capable of finding the globally optimal phone set for matching a given ITS training database.
Speech-to-chat (STC) is the task of converting speech into a variably spelled transcription in the Latin alphabet, similar to the Latin-alphabet transcriptions used in online chat forums to represent colloquial dialects of multi-register languages such as Arabic and Hindi. Such chat transcripts can be easily collected, even from non-speakers of the language. When a non-speaker of the language writes down what she hears using a chat alphabet, she tends to map every phoneme in the utterance language to the most similar phoneme in her own language, where similarity can be defined by a weighted L1 distance between articulatory feature vectors. For this reason, the speech-to-chat paradigm allows us to infer a phone set that approximates the unknown phoneme set of the unwritten language. Experiments were performed in which pseudo-under-resourced languages (Cantonese and Vietnamese, neither of which is truly "unwritten," though few people know how to write Cantonese) were transcribed by native speakers, and phonemic transcripts were generated from their transcriptions. Chat-alphabet transcriptions by non-speakers of Cantonese were then clustered in order to estimate the phonemic transcript. Extra information about the Cantonese phonemes (e.g., elicited from non-native transcribers with more than one native language) improves the quality of transcription. We interpret these two results to mean that defining a better phone set for an unwritten language improves the quality of both image-to-speech and speech-to-chat applications.
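
As an illustration only, the nearest-phoneme mapping described in the abstract above (a listener maps each heard phoneme to the most similar phoneme of her own language under a weighted L1 distance between articulatory feature vectors) can be sketched in a few lines of Python. The feature set, the feature weights, and the toy listener inventory below are hypothetical and are not taken from the paper; the sketch shows the distance and the argmin step, not the authors' implementation.

```python
from typing import Dict, Sequence

# Hypothetical binary articulatory features: [voiced, nasal, labial, aspirated].
# Hypothetical weights: aspiration is assumed to matter less to this listener.
WEIGHTS = [1.0, 1.0, 1.0, 0.5]

# Toy phoneme inventory of the listener's own language (assumption).
LISTENER_INVENTORY: Dict[str, Sequence[float]] = {
    "p": [0, 0, 1, 0],
    "b": [1, 0, 1, 0],
    "m": [1, 1, 1, 0],
}


def weighted_l1(a: Sequence[float], b: Sequence[float],
                weights: Sequence[float]) -> float:
    """Weighted L1 distance: sum over features of w_i * |a_i - b_i|."""
    return sum(w * abs(x - y) for x, y, w in zip(a, b, weights))


def nearest_phoneme(heard: Sequence[float],
                    inventory: Dict[str, Sequence[float]],
                    weights: Sequence[float]) -> str:
    """Return the inventory phoneme closest to the heard feature vector."""
    return min(inventory,
               key=lambda p: weighted_l1(heard, inventory[p], weights))


# An aspirated voiceless labial stop, absent from the listener's inventory.
heard_ph = [0, 0, 1, 1]
print(nearest_phoneme(heard_ph, LISTENER_INVENTORY, WEIGHTS))
```

With these toy values the aspirated stop differs from "p" only in the low-weight aspiration feature, so it is mapped to "p"; changing the weights changes which native phoneme wins.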

Building an ASR System for Mboshi Using A Cross-language Definition of Acoustic Units Approach

Conference paper (2018) - O.E. Scharenborg (author) , Patrick Ebel (author) , Francesco Ciannella (author) , Mark Hasegawa-Johnson (author) , Najim Dehak (author)

For many languages in the world, not enough (annotated) speech data is available to train an ASR system. Recently, we proposed a cross-language method for training an ASR system using linguistic knowledge and semi-supervised training. Here, we apply this approach to the low-resou ...