Mark Hasegawa-Johnson

17 records found

Using GPT-4 Annotation For An Efficient And Fast Dataset Creation Pipeline

Conference paper (2024) - Maliha Jahan, Helin Wang, Thomas Thebaud, Yinglun Sun, Giang Le, Zsuzsanna Fagyal, Odette Scharenborg, Mark Hasegawa-Johnson, Laureano Moro-Velazquez, Najim Dehak
The growing emphasis on fairness in speech-processing tasks requires datasets with speakers from diverse subgroups that allow training and evaluating fair speech technology systems. However, creating such datasets through manual annotation can be costly. To address this challenge, we present a semi-automated dataset creation pipeline that leverages large language models. We use this pipeline to generate a dataset of speakers identifying themselves or another speaker as belonging to a particular race, ethnicity, or national origin group. We use OpenAI's GPT-4 to perform two complex annotation tasks: separating files relevant to our intended dataset from the irrelevant ones (filtering) and finding and extracting information on identifications within a transcript (tagging). By evaluating GPT-4's performance using human annotations as ground truths, we show that it can reduce resources required by dataset annotation while barely losing any important information. For the filtering task, GPT-4 had a very low miss rate of 6.93%. GPT-4's tagging performance showed a trade-off between precision and recall, where the latter got as high as 97%, but precision never exceeded 45%. Our approach reduces the time required for the filtering and tagging tasks by 95% and 80%, respectively. We also present an in-depth error analysis of GPT-4's performance. ...
Conference paper (2022) - Liming Wang, Siyuan Feng, Mark Hasegawa-Johnson, Chang D. Yoo
Phonemes are defined by their relationship to words: changing a phoneme changes the word. Learning a phoneme inventory with little supervision has been a longstanding challenge with important applications to under-resourced speech technology. In this paper, we bridge the gap between the linguistic and statistical definition of phonemes and propose a novel neural discrete representation learning model for self-supervised learning of phoneme inventory with raw speech and word labels. Given the availability of phoneme segmentation and some mild conditions, we prove that the phoneme inventory learned by our approach converges to the true one with an exponentially low error rate. Moreover, in experiments on TIMIT and Mboshi benchmarks, our approach consistently learns a better phoneme-level representation and achieves a lower error rate in a zero-resource phoneme recognition task than previous state-of-the-art self-supervised representation learning algorithms. ...
Journal article (2022) - Piotr Żelasko, Siyuan Feng, Laureano Moro Velázquez, Ali Abavisani, Saurabhchand Bhati, Odette Scharenborg, Mark Hasegawa-Johnson, Najim Dehak
The high cost of data acquisition makes Automatic Speech Recognition (ASR) model training problematic for most existing languages, including languages that do not even have a written script, or for which the phone inventories remain unknown. Past works explored multilingual training, transfer learning, as well as zero-shot learning in order to build ASR systems for these low-resource languages. While it has been shown that the pooling of resources from multiple languages is helpful, we have not yet seen a successful application of an ASR model to a language unseen during training. A crucial step in the adaptation of ASR from seen to unseen languages is the creation of the phone inventory of the unseen language. The ultimate goal of our work is to build the phone inventory of a language unseen during training in an unsupervised way without any knowledge about the language. In this paper, we (1) investigate the influence of different factors (i.e., model architecture, phonotactic model, type of speech representation) on phone recognition in an unknown language; (2) provide an analysis of which phones transfer well across languages and which do not in order to understand the limitations of and areas for further improvement for automatic phone inventory creation; and (3) present different methods to build a phone inventory of an unseen language in an unsupervised way. To that end, we conducted mono-, multi-, and crosslingual experiments on a set of 13 phonetically diverse languages and several in-depth analyses. We found a number of universal phone tokens (IPA symbols) that are well-recognized cross-linguistically. Through a detailed analysis of results, we conclude that unique sounds, similar sounds, and tone languages remain a major challenge for phonetic inventory discovery. ...

Brain Signal-Based Dialogue Systems

Book chapter (2021) - Odette Scharenborg, Mark Hasegawa-Johnson
This position paper focuses on the problem of building dialogue systems for people who have lost the ability to communicate via speech, e.g., patients of locked-in syndrome or severely disabled people. In order for such people to communicate to other people and computers, dialogue systems that are based on brain responses to (imagined) speech are needed. A speech-based dialogue system typically consists of an automatic speech recognition module and a speech synthesis module. In order to build a dialogue system that is able to work on the basis of brain signals, a system needs to be developed that is able to recognize speech imagined by a person and can synthesize speech from imagined speech. This paper proposes combining new and emerging technology on neural speech recognition and auditory stimulus construction from brain signals to build brain signal-based dialogue systems. Such systems have a potentially large impact on society. ...

Toward More Efficient and Accurate Spoken Word Discovery Using Speech-to-Image Retrieval

Conference paper (2021) - Liming Wang, Xinsheng Wang, Mark Hasegawa-Johnson, Odette Scharenborg, Najim Dehak
Multimodal word discovery (MWD) is often treated as a byproduct of the speech-to-image retrieval problem. However, our theoretical analysis shows that some kind of alignment/attention mechanism is crucial for an MWD system to learn meaningful word-level representation. We verify our theory by conducting retrieval and word discovery experiments on MSCOCO and Flickr8k, and empirically demonstrate that both neural MT with self-attention and statistical MT achieve word discovery scores that are superior to those of a state-of-the-art neural retrieval system, outperforming it by 2% and 5% alignment F1 scores respectively. ...
Conference paper (2021) - Siyuan Feng, Piotr Żelasko, Laureano Moro-Velázquez, Ali Abavisani, Mark Hasegawa-Johnson, Odette Scharenborg, Najim Dehak
The idea of combining multiple languages’ recordings to train a single automatic speech recognition (ASR) model brings the promise of the emergence of universal speech representation. Recently, a Transformer encoder-decoder model has been shown to leverage multilingual data well in IPA transcriptions of languages presented during training. However, the representations it learned were not successful in zero-shot transfer to unseen languages. Because that model lacks an explicit factorization of the acoustic model (AM) and language model (LM), it is unclear to what degree the performance suffered from differences in pronunciation or the mismatch in phonotactics. To gain more insight into the factors limiting zero-shot ASR transfer, we replace the encoder-decoder with a hybrid ASR system consisting of a separate AM and LM. Then, we perform an extensive evaluation of monolingual, multilingual, and crosslingual (zero-shot) acoustic and language models on a set of 13 phonetically diverse languages. We show that the gain from modeling crosslingual phonotactics is limited, and imposing too strong a model can hurt the zero-shot transfer. Furthermore, we find that a multilingual LM hurts a multilingual ASR system’s performance, and retaining only the target language’s phonotactic data in LM training is preferable. ...
Journal article (2021) - Xinsheng Wang, Justin van der Hout, Jihua Zhu, Mark Hasegawa-Johnson, Odette Scharenborg
Image captioning technology has great potential in many scenarios. However, current text-based image captioning methods cannot be applied to approximately half of the world's languages due to these languages’ lack of a written form. To solve this problem, recently the image-to-speech task was proposed, which generates spoken descriptions of images bypassing any text via an intermediate representation consisting of phonemes (image-to-phoneme). Here, we present a comprehensive study on the image-to-speech task in which 1) several representative image-to-text generation methods are implemented for the image-to-phoneme task, 2) objective metrics are sought to evaluate the image-to-phoneme task, and 3) an end-to-end image-to-speech model that is able to synthesize spoken descriptions of images bypassing both text and phonemes is proposed. Extensive experiments are conducted on the public benchmark database Flickr8k. Results of our experiments demonstrate that 1) state-of-the-art image-to-text models can perform well on the image-to-phoneme task, and 2) several evaluation metrics, including BLEU3, BLEU4, BLEU5, and ROUGE-L can be used to evaluate image-to-phoneme performance. Finally, 3) end-to-end image-to-speech bypassing text and phonemes is feasible. ...

Directly synthesize spoken description of images

Conference paper (2021) - Xinsheng Wang, Siyuan Feng, Jihua Zhu, Mark Hasegawa-Johnson, Odette Scharenborg
This paper proposes a new model, referred to as the show and speak (SAS) model that, for the first time, is able to directly synthesize spoken descriptions of images, bypassing the need for any text or phonemes. The basic structure of SAS is an encoder-decoder architecture that takes an image as input and predicts the spectrogram of speech that describes this image. The final speech audio is obtained from the predicted spectrogram via WaveNet. Extensive experiments on the public benchmark database Flickr8k demonstrate that the proposed SAS is able to synthesize natural spoken descriptions for images, indicating that synthesizing spoken descriptions for images while bypassing text and phonemes is feasible. ...
Conference paper (2020) - Justin van der Hout, Zoltán D’Haese, Mark Hasegawa-Johnson, Odette Scharenborg
Image2Speech is the relatively new task of generating a spoken description of an image. This paper presents an investigation into the evaluation of this task. For this, first an Image2Speech system was implemented which generates image captions consisting of phoneme sequences. This system outperformed the original Image2Speech system on the Flickr8k corpus. Subsequently, these phoneme captions were converted into sentences of words. The captions were rated by human evaluators for their goodness of describing the image. Finally, several objective metric scores of the results were correlated with these human ratings. Although BLEU4 does not perfectly correlate with human ratings, it obtained the highest correlation among the investigated metrics, and is the best currently existing metric for the Image2Speech task. Current metrics are limited by the fact that they assume their input to be words. A more appropriate metric for the Image2Speech task should assume its input to be parts of words, i.e. phonemes, instead. ...
Journal article (2020) - Odette Scharenborg, Laurent Besacier, Alan W. Black, Mark Hasegawa-Johnson, Florian Metze, Graham Neubig, Sebastian Stueker, Pierre Godard, M Mueller, More Authors...
Speech technology plays an important role in our everyday life. Among others, speech is used for human-computer interaction, for instance for information retrieval and on-line shopping. In the case of an unwritten language, however, speech technology is unfortunately difficult to create, because it cannot be created by the standard combination of pre-trained speech-to-text and text-to-speech subsystems. The research presented in this article takes the first steps towards speech technology for unwritten languages. Specifically, the aim of this work was 1) to learn speech-to-meaning representations without using text as an intermediate representation, and 2) to test the sufficiency of the learned representations to regenerate speech or translated text, or to retrieve images that depict the meaning of an utterance in an unwritten language. The results suggest that building systems that go directly from speech-to-meaning and from meaning-to-speech, bypassing the need for text, is possible. ...

An Analysis of Phonetic Representations Transfer Across Languages

Conference paper (2020) - Piotr Żelasko, Laureano Moro-Velázquez, Mark Hasegawa-Johnson, Odette Scharenborg, Najim Dehak
Only a handful of the world’s languages are abundant with the resources that enable practical applications of speech processing technologies. One of the methods to overcome this problem is to use the resources existing in other languages to train a multilingual automatic speech recognition (ASR) model, which, intuitively, should learn some universal phonetic representations. In this work, we focus on gaining a deeper understanding of how general these representations might be, and how individual phones are getting improved in a multilingual setting. To that end, we select a phonetically diverse set of languages, and perform a series of monolingual, multilingual and crosslingual (zero-shot) experiments. The ASR is trained to recognize the International Phonetic Alphabet (IPA) token sequences. We observe significant improvements across all languages in the multilingual setting, and stark degradation in the crosslingual setting, where the model, among other errors, considers Javanese as a tone language. Notably, as little as 10 hours of the target language training data tremendously reduces ASR error rates. Our analysis uncovered that even the phones that are unique to a single language can benefit greatly from adding training data from other languages — an encouraging result for the low-resource speech community. ...
Conference paper (2019) - Odette Scharenborg, Jiska Koemans, Cybelle Smith, Mark Hasegawa-Johnson, Kara D. Federmeier
There is ample evidence showing that listeners are able to quickly adapt their phoneme classes to ambiguous sounds using a process called lexically-guided perceptual learning. This paper presents the first attempt to examine the neural correlates underlying this process. Specifically, we compared the brain’s responses to ambiguous [f/s] sounds in Dutch non-native listeners of English (N=36) before and after exposure to the ambiguous sound to induce learning, using Event-Related Potentials (ERPs). We identified a group of participants who showed lexically-guided perceptual learning in their phonetic categorization behavior as observed by a significant difference in /s/ responses between pretest and posttest and a group who did not. Moreover, we observed differences in mean ERP amplitude to ambiguous phonemes at pretest and posttest, shown by a reliable reduction in amplitude of a positivity over medial central channels from 250 to 550 ms. However, we observed no significant correlation between the size of behavioral and neural pre/posttest effects. Possibly, the observed behavioral and ERP differences between pretest and posttest link to different aspects of the sound classification task. In follow-up research, these differences will be further investigated by assessing their relationship to neural responses to the ambiguous sounds in the exposure phase. ...
Conference paper (2019) - Junrui Ni, Mark Hasegawa-Johnson, Odette Scharenborg
Both human listeners and machines need to adapt their sound categories whenever a new speaker is encountered. This perceptual learning is driven by lexical information. In previous work, we have shown that deep neural network-based (DNN) ASR systems can learn to adapt their phoneme category boundaries from a few labeled examples after exposure (i.e., training) to ambiguous sounds, as humans have been found to do. Here, we investigate the time-course of phoneme category adaptation in a DNN in more detail, with the ultimate aim to investigate the DNN’s ability to serve as a model of human perceptual learning. We do so by providing the DNN with an increasing number of ambiguous retraining tokens (in 10 bins of 4 ambiguous items), and comparing classification accuracy on the ambiguous items in a held-out test set for the different bins. Results showed that DNNs, similar to human listeners, show a step-like function: The DNNs show perceptual learning already after the first bin (only 4 tokens of the ambiguous phone), with little further adaptation for subsequent bins. In follow-up research, we plan to test specific predictions made by the DNN about human speech processing. ...
Conference paper (2019) - Laureano Moro-Velazquez, JaeJin Cho, Shinji Watanabe, Mark A. Hasegawa-Johnson, Odette Scharenborg, Heejin Kim, Najim Dehak
Parkinson’s Disease (PD) affects motor capabilities of patients, who in some cases need to use human-computer assistive technologies to regain independence. The objective of this work is to study in detail the differences in error patterns from state-of-the-art Automatic Speech Recognition (ASR) systems on speech from people with and without PD. Two different speech recognizers (attention-based end-to-end and Deep Neural Network - Hidden Markov Models hybrid systems) were trained on a Spanish language corpus and subsequently tested on speech from 43 speakers with PD and 46 without PD. The differences related to error rates, substitutions, insertions and deletions of characters and phonetic units between the two groups were analyzed, showing that the word error rate is 27% higher in speakers with PD than in control speakers, with a moderate correlation between that rate and the developmental stage of the disease. The errors were related to all manner classes, and were more pronounced in the vowel /u/. This study is the first to evaluate ASR systems’ responses to speech from patients at different stages of PD in Spanish. The analyses showed general trends but individual speech deficits must be studied in the future when designing new ASR systems for this population. ...
Conference paper (2018) - Odette Scharenborg, Sebastian Tiesmeyer, Mark Hasegawa-Johnson, Najim Dehak
Both human listeners and machines need to adapt their sound categories whenever a new speaker is encountered. This perceptual learning is driven by lexical information. The aim of this paper is two-fold: investigate whether a deep neural network-based (DNN) ASR system can adapt to only a few examples of ambiguous speech as humans have been found to do; investigate a DNN’s ability to serve as a model of human perceptual learning. Crucially, we do so by looking at intermediate levels of phoneme category adaptation rather than at the output level. We visualize the activations in the hidden layers of the DNN during perceptual learning. The results show that, similar to humans, DNN systems learn speaker-adapted phone category boundaries from a few labeled examples. The DNN adapts its category boundaries not only by adapting the weights of the output layer, but also by adapting the implicit feature maps computed by the hidden layers, suggesting the possibility that human perceptual learning might involve a similar nonlinear distortion of a perceptual space that is intermediate between the acoustic input and the phonological categories. Comparisons between DNNs and humans can thus provide valuable insights into the way humans process speech and improve ASR technology. ...
Abstract (2018) - Mark Hasegawa-Johnson, Wenda Chen, Odette Scharenborg
In engineering applications, phones are the representation intermediate between text and speech in many text-to-speech (TTS) and speech-to-text (STT) systems. When a language has no written form, TTS and STT are no longer meaningful acronyms as there is no text; we consider instead XTS and STX, where X is some other representation that can be easily interpreted by a human user, for example, image, translation, or chat. This paper presents experimental results from two speech applications for unwritten languages: image-to-speech, and speech-to-chat. Experimental evidence from these two applications suggests that the performance of an XTS or STX application can be significantly improved by defining or inferring a phone set for the unwritten language.
Image-to-speech (ITS) is the task of generating a spoken description of an image, in a language that has no written form. ITS can be trained and tested as a neural sequence-to-sequence transduction problem, in which an input sequence of sub-images is encoded, attended, and converted into a sequence of phone symbols, from which an output audio signal can be generated. The quality of ITS output varies dramatically depending on the quality of the phone set. Cheating experiments using a known correct phone set resulted in intelligible and meaningful spoken descriptions, but experiments using a cross-language phone set, or one automatically created using unsupervised methods, did not. Extrapolating beyond current experimental results, a simulated annealing algorithm will be presented that may be capable of finding the globally optimal phone set for matching a given ITS training database.
Speech-to-chat (STC) is the task of converting speech into a variably spelled transcription in the Latin alphabet, similar to the Latin-alphabet transcriptions used in online chat forums to represent colloquial dialects of multi-register languages such as Arabic and Hindi. Such chat transcripts can be easily collected, even from non-speakers of the language. When a non-speaker of the language writes down what she hears using a chat alphabet, she tends to map every phoneme in the utterance language to the most similar phoneme in her own language, where similarity can be defined by a weighted L1 distance between articulatory feature vectors. For this reason, the speech-to-chat paradigm allows us to infer a phone set that closely approximates the unknown phoneme set of the unwritten language. Experiments were performed in which pseudo-under-resourced languages (Cantonese and Vietnamese, neither of which is truly "unwritten," though few people know how to write Cantonese) were transcribed by native speakers, and phonemic transcripts were generated from their transcriptions. Chat-alphabet transcriptions by non-speakers of Cantonese were then clustered in order to estimate the phonemic transcript. Extra information about the Cantonese phonemes (e.g., elicited from non-native transcribers with more than one native language) improves the quality of transcription.
We interpret these two results to mean that defining a better phone set for an unwritten language improves the quality of both image-to-speech and speech-to-chat applications.
...
Conference paper (2018) - Odette Scharenborg, Patrick Ebel, Francesco Ciannella, Mark Hasegawa-Johnson, Najim Dehak
For many languages in the world, not enough (annotated) speech data is available to train an ASR system. Recently, we proposed a cross-language method for training an ASR system using linguistic knowledge and semi-supervised training. Here, we apply this approach to the low-resource language Mboshi. Using an ASR system trained on Dutch, Mboshi acoustic units were first created using cross-language initialization of the phoneme vectors in the output layer. Subsequently, this adapted system was retrained using Mboshi self-labels. Two training methods were investigated: retraining of only the output layer and retraining the full deep neural network (DNN). The resulting Mboshi system was analyzed by investigating per-phoneme accuracies, phoneme confusions, and by visualizing the hidden layers of the DNNs prior to and following retraining with the self-labels. Results showed a fairly similar performance for the two training methods but a better phoneme representation for the fully retrained DNN. ...