Mark Hasegawa-Johnson
Finding Spoken Identifications
Using GPT-4 Annotation For An Efficient And Fast Dataset Creation Pipeline
The growing emphasis on fairness in speech-processing tasks requires datasets with speakers from diverse subgroups that allow training and evaluating fair speech technology systems. However, creating such datasets through manual annotation can be costly. To address this challenge, we present a semi-automated dataset creation pipeline that leverages large language models. We use this pipeline to generate a dataset of speakers identifying themselves or another speaker as belonging to a particular race, ethnicity, or national origin group. We use OpenAI's GPT-4 to perform two complex annotation tasks: separating files relevant to our intended dataset from the irrelevant ones (filtering), and finding and extracting information on identifications within a transcript (tagging). By evaluating GPT-4's performance using human annotations as ground truth, we show that it can reduce the resources required for dataset annotation while losing little important information. For the filtering task, GPT-4 had a low miss rate of 6.93%. GPT-4's tagging performance showed a trade-off between precision and recall: recall reached as high as 97%, but precision never exceeded 45%. Our approach reduces the time required for the filtering and tagging tasks by 95% and 80%, respectively. We also present an in-depth error analysis of GPT-4's performance.
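For readers less familiar with the metrics quoted in this abstract, the following minimal Python sketch shows how miss rate, precision, and recall are conventionally computed from counts of true positives, false positives, and false negatives. The function names are hypothetical, and the counts in the usage note are made up for illustration; they are not taken from the paper.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of items the annotator flagged that are truly relevant."""
    flagged = tp + fp
    return tp / flagged if flagged else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of truly relevant items that the annotator flagged."""
    relevant = tp + fn
    return tp / relevant if relevant else 0.0


def miss_rate(tp: int, fn: int) -> float:
    """Fraction of truly relevant items that the annotator missed (1 - recall)."""
    relevant = tp + fn
    return fn / relevant if relevant else 0.0
```

With illustrative counts of 90 true positives, 110 false positives, and 10 false negatives, `precision(90, 110)` is 0.45 and `recall(90, 10)` is 0.9, which is the kind of high-recall, low-precision operating point the abstract describes for tagging.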
Phonemes are defined by their relationship to words: changing a phoneme changes the word. Learning a phoneme inventory with little supervision has been a longstanding challenge with important applications to under-resourced speech technology. In this paper, we bridge the gap between the linguistic and statistical definition of phonemes and propose a novel neural discrete representation learning model for self-supervised learning of phoneme inventory with raw speech and word labels. Given the availability of phoneme segmentation and some mild conditions, we prove that the phoneme inventory learned by our approach converges to the true one with an exponentially low error rate. Moreover, in experiments on TIMIT and Mboshi benchmarks, our approach consistently learns a better phoneme-level representation and achieves a lower error rate in a zero-resource phoneme recognition task than previous state-of-the-art self-supervised representation learning algorithms.
The high cost of data acquisition makes Automatic Speech Recognition (ASR) model training problematic for most existing languages, including languages that do not even have a written script, or for which the phone inventories remain unknown. Past works explored multilingual training, transfer learning, as well as zero-shot learning in order to build ASR systems for these low-resource languages. While it has been shown that the pooling of resources from multiple languages is helpful, we have not yet seen a successful application of an ASR model to a language unseen during training. A crucial step in the adaptation of ASR from seen to unseen languages is the creation of the phone inventory of the unseen language. The ultimate goal of our work is to build the phone inventory of a language unseen during training in an unsupervised way without any knowledge about the language. In this paper, we (1) investigate the influence of different factors (i.e., model architecture, phonotactic model, type of speech representation) on phone recognition in an unknown language; (2) provide an analysis of which phones transfer well across languages and which do not in order to understand the limitations of and areas for further improvement for automatic phone inventory creation; and (3) present different methods to build a phone inventory of an unseen language in an unsupervised way. To that end, we conducted mono-, multi-, and crosslingual experiments on a set of 13 phonetically diverse languages and several in-depth analyses. We found a number of universal phone tokens (IPA symbols) that are well-recognized cross-linguistically. Through a detailed analysis of results, we conclude that unique sounds, similar sounds, and tone languages remain a major challenge for phonetic inventory discovery.
Position Paper
Brain Signal-Based Dialogue Systems
This position paper focuses on the problem of building dialogue systems for people who have lost the ability to communicate via speech, e.g., patients with locked-in syndrome or severely disabled people. In order for such people to communicate with other people and computers, dialogue systems that are based on brain responses to (imagined) speech are needed. A speech-based dialogue system typically consists of an automatic speech recognition module and a speech synthesis module. In order to build a dialogue system that is able to work on the basis of brain signals, a system needs to be developed that is able to recognize speech imagined by a person and can synthesize speech from imagined speech. This paper proposes combining new and emerging technology on neural speech recognition and auditory stimulus construction from brain signals to build brain signal-based dialogue systems. Such systems have a potentially large impact on society.
Align or attend?
Toward More Efficient and Accurate Spoken Word Discovery Using Speech-to-Image Retrieval
Show and speak
Directly synthesize spoken description of images
Speech technology plays an important role in our everyday life. Among others, speech is used for human-computer interaction, for instance for information retrieval and on-line shopping. In the case of an unwritten language, however, speech technology is unfortunately difficult to create, because it cannot be created by the standard combination of pre-trained speech-to-text and text-to-speech subsystems. The research presented in this article takes the first steps towards speech technology for unwritten languages. Specifically, the aim of this work was 1) to learn speech-to-meaning representations without using text as an intermediate representation, and 2) to test the sufficiency of the learned representations to regenerate speech or translated text, or to retrieve images that depict the meaning of an utterance in an unwritten language. The results suggest that building systems that go directly from speech-to-meaning and from meaning-to-speech, bypassing the need for text, is possible.
That Sounds Familiar
An Analysis of Phonetic Representations Transfer Across Languages
Image-to-speech (ITS) is the task of generating a spoken description of an image, in a language that has no written form. ITS can be trained and tested as a neural sequence-to-sequence transduction problem, in which an input sequence of sub-images is encoded, attended, and converted into a sequence of phone symbols, from which an output audio signal can be generated. The quality of ITS output varies dramatically depending on the quality of the phone set. Cheating experiments using a known correct phone set resulted in intelligible and meaningful spoken descriptions, but experiments using a cross-language phone set, or one automatically created using unsupervised methods, did not. Extrapolating beyond current experimental results, a simulated annealing algorithm will be presented that may be capable of finding the globally optimal phone set for matching a given ITS training database.
Speech-to-chat (STC) is the task of converting speech into a variably spelled transcription in the Latin alphabet, similar to the Latin-alphabet transcriptions used in online chat forums to represent colloquial dialects of multi-register languages such as Arabic and Hindi. Such chat transcripts can be easily collected, even from non-speakers of the language. When a non-speaker of the language writes down what she hears using a chat alphabet, she tends to map every phoneme in the utterance language to the most similar phoneme in her own language, where similarity can be defined by a weighted L1 distance between articulatory feature vectors. For this reason, the speech-to-chat paradigm allows us to infer a phone set that is close to the unknown phoneme set of the unwritten language. Experiments were performed in which pseudo-under-resourced languages (Cantonese and Vietnamese, neither of which is truly "unwritten," though few people know how to write Cantonese) were transcribed by native speakers, and phonemic transcripts were generated from their transcriptions. Chat-alphabet transcriptions by non-speakers of Cantonese were then clustered in order to estimate the phonemic transcript. Extra information about the Cantonese phonemes (e.g., elicited from non-native transcribers with more than one native language) improves the quality of transcription.
We interpret these two results to mean that defining a better phone set for an unwritten language improves the quality of both image-to-speech and speech-to-chat applications.
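The phoneme-mapping idea in the speech-to-chat description (a listener maps each heard phoneme to the most similar phoneme in her own language, with similarity defined by a weighted L1 distance between articulatory feature vectors) can be illustrated with a short Python sketch. Everything concrete here is an assumption made for illustration: the four binary features, the weights, and the toy listener inventory are hypothetical and are not the feature set or weighting used in the paper.

```python
from typing import Dict, Sequence

# Hypothetical feature order: (voiced, nasal, labial, aspirated).
# Toy listener inventory with no aspiration contrast (illustrative values only).
INVENTORY: Dict[str, Sequence[float]] = {
    "p": (0, 0, 1, 0),
    "b": (1, 0, 1, 0),
    "m": (1, 1, 1, 0),
    "t": (0, 0, 0, 0),
}

# Hypothetical weights: aspiration counts half as much as the other features.
WEIGHTS: Sequence[float] = (1.0, 1.0, 1.0, 0.5)


def weighted_l1(a: Sequence[float], b: Sequence[float], w: Sequence[float]) -> float:
    """Weighted L1 distance: sum over features of w_i * |a_i - b_i|."""
    return sum(wi * abs(x - y) for x, y, wi in zip(a, b, w))


def nearest_phoneme(
    target: Sequence[float],
    inventory: Dict[str, Sequence[float]],
    weights: Sequence[float],
) -> str:
    """Return the inventory phoneme whose features are closest to the target."""
    return min(inventory, key=lambda p: weighted_l1(target, inventory[p], weights))
```

Under these toy assumptions, a heard aspirated voiceless labial stop with features `(0, 0, 1, 1)` lies at distance 0.5 from "p" and at least 1.5 from every other entry, so `nearest_phoneme((0, 0, 1, 1), INVENTORY, WEIGHTS)` selects "p".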