Speech-based automatic closed caption alignment

Master Thesis (2010)

Author(s)

J.A. Boogaard

Contributor(s)

P. Wiggers – Mentor

H. Geers – Mentor

H. Jongebloed – Mentor

L.J.M. Rothkrantz – Mentor

Keywords

Speech recognition, Natural language processing, Closed captioning, Subtitle alignment

To reference this document use:

https://resolver.tudelft.nl/uuid:282b4760-a2a4-42e4-9040-1158ef8327fa

Publication Year

2010

Copyright

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

In the Netherlands, four million people watch television programs with closed captions because they are hearing impaired or non-native speakers. Closed captions contain Dutch speech transcriptions and non-speech sound descriptions and are displayed as subtitles. Under a government mandate, the share of television programs that must be closed-captioned will rise to at least 95% in 2011. Closed caption alignment is the task of timing the subtitles as closely as possible to the corresponding moments in the video signal. Because alignment is costly, labor-intensive, and held to high quality standards, an automated solution is desirable. This thesis addresses the application of automatic speech recognition to the task of on-line closed-captioning of television programs. It focuses on the development of an automatic closed caption alignment system for TT888, a company that produces subtitles for Dutch-language television programs. A review of related research, consultation with professional editors, and analyses of a variety of captioned television programs contributed to the development of an automatic closed-captioning system named SETH (Speech Estimating Title Heuristics). The core of the system is an algorithm that matches manually produced captions with speech transcriptions produced by a large-vocabulary speech recognizer. The architecture of SETH combines the benefits of modular programming and the pipes-and-filters architecture. The best results are achieved when the speech is fairly formal and non-spontaneous, is pure Dutch pronounced by a native speaker, and contains neither crosstalk nor background noise. Dissimilarities between the speech and the captions are not a major problem as long as the captions include the most important words. The alignment algorithm is also robust to most of the insertions caused by music.
Non-standard language use, songs, spontaneous speech, and strong regional accents remain difficult for the speech recognizer and are hence a major problem in automatic closed caption alignment. Since there will always be broadcasts with poor speech quality, manual verification or adaptation of the subtitles remains necessary.
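
The matching idea at the core of the abstract can be illustrated with a small sketch. This is not SETH's algorithm, which the abstract does not specify; it is a generic word-level alignment that uses Python's standard `difflib.SequenceMatcher` as a stand-in. The function names `normalize` and `align_captions`, the input format (recognizer output as `(word, start, end)` tuples in seconds), and the sample data are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(word):
    """Lowercase a word and strip punctuation so captions and recognizer output compare equal."""
    return re.sub(r"[^\w]", "", word.lower())


def align_captions(captions, recognized):
    """Assign a (start, end) time to each caption, or None if no word matched.

    captions:   list of caption strings, in broadcast order (hypothetical format)
    recognized: list of (word, start_sec, end_sec) tuples from a recognizer (hypothetical format)
    """
    # Flatten the captions into one word sequence, remembering each word's caption index.
    cap_words = []
    for idx, text in enumerate(captions):
        for raw in text.split():
            word = normalize(raw)
            if word:
                cap_words.append((word, idx))

    rec_words = [normalize(word) for word, _, _ in recognized]

    # Find in-order runs of words shared by the captions and the transcript.
    matcher = SequenceMatcher(None, [w for w, _ in cap_words], rec_words, autojunk=False)

    times = [None] * len(captions)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            cap_idx = cap_words[block.a + k][1]
            _, start, end = recognized[block.b + k]
            if times[cap_idx] is None:
                times[cap_idx] = (start, end)            # first matched word sets the start
            else:
                times[cap_idx] = (times[cap_idx][0], end)  # later matches extend the end
    return times


captions = ["Goedenavond dames en heren", "(muziek)", "Welkom bij het journaal"]
recognized = [
    ("goedenavond", 0.5, 1.1), ("dames", 1.2, 1.6), ("uh", 1.6, 1.8),
    ("heren", 2.0, 2.4), ("welkom", 3.0, 3.4), ("bij", 3.4, 3.6),
    ("het", 3.6, 3.8), ("journaal", 3.8, 4.5),
]
result = align_captions(captions, recognized)
print(result)
```

In this toy input the recognizer misses "en" and inserts "uh", yet the first caption is still timed from its matched words, which mirrors the abstract's remark that differences between speech and captions matter little when the key words are present. The sound description "(muziek)" has no counterpart in the transcript and is left untimed, the kind of case that would still need manual attention.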

Files

Report.pdf

(pdf | 6.64 Mb)

License info not available