Repository hosted by TU Delft Library


The benefit obtained from visually displayed text from an automatic speech recognizer during listening to speech presented in noise

Publication files not available online.

Author: Zekveld, A.A. · Kramer, S.E. · Kessens, J.M. · Vlaming, M.S.M.G. · Houtgast, T.
Institution: TNO Kwaliteit van Leven
Source: Ear and Hearing, 29(6), 838-852
Identifier: 241200
DOI: 10.1097/AUD.0b013e31818005bd
Keywords: Acoustics and Audiology · adolescent · adult · article · auditory stimulation · automatic speech recognition · communication aid · female · hearing impairment · human · male · middle aged · noise · phonetics · photostimulation · reading · speech · speech audiometry · speech perception · Acoustic Stimulation · Communication Aids for Disabled · Deafness · Humans · Photic Stimulation · Speech Reception Threshold Test · Speech Recognition Software · Young Adult


OBJECTIVES: The aim of this study was to evaluate the benefit that listeners obtain from visually presented output of an automatic speech recognition (ASR) system while listening to speech in noise.

DESIGN: Auditory-alone and audiovisual speech reception thresholds (SRTs) were measured. The SRT is defined as the speech-to-noise ratio at which 50% of the test sentences are reproduced correctly. In the auditory-alone SRT tests, the test sentences were presented only auditorily; in the audiovisual SRT tests, the ASR output of each test sentence was also presented as text. The ASR system was used in two recognition modes: recognition of spoken words (word output) or recognition of speech sounds, or phones (phone output). The benefit obtained from the ASR output was defined as the difference between the auditory-alone and the audiovisual SRT. We also examined the readability of unimodally displayed ASR output, i.e., the percentage of sentences in which ASR errors were identified and accurately corrected. In experiment 1, the readability of and benefit obtained from ASR word output (n = 14) were compared with those obtained from ASR phone output (n = 10). In experiment 2, the effect of presenting an indication of the ASR confidence level was examined (n = 14). In experiment 3, the effect of delaying the presentation of the text relative to the speech (by up to 6 sec) was examined (n = 24). The ASR accuracy level was varied systematically in each experiment.

RESULTS: Mean readability scores ranged from 0 to 46%, depending on ASR accuracy. Speech comprehension improved when the ASR output was displayed. For example, when the ASR output corresponded to readability scores of only about 20% correct, the text improved the SRT by about 3 dB SNR in the audiovisual SRT test. This improvement corresponds to an increase in speech comprehension of about 35% in critical conditions. Equally readable phone and word output provided similar benefit in speech comprehension. For equal ASR accuracies, however, both the readability of and the benefit from the word output generally exceeded those of the phone output. Presenting information about the ASR confidence level influenced neither the readability of nor the benefit obtained from the word output. Delaying the text relative to the speech moderately decreased the benefit.

CONCLUSIONS: The present study indicates that textual ASR output of moderate accuracy considerably improves speech comprehension, and that this improvement depends on the readability of the ASR output. Word output has better accuracy and readability than phone output; listeners are therefore better able to use ASR word output than phone output to improve speech comprehension. The ability of older listeners and listeners with hearing impairment to use ASR output for speech comprehension requires further study. © 2008 Lippincott Williams & Wilkins, Inc.