Video Captioning for the Visually Impaired
F. Xu (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Julián Urbano – Mentor (TU Delft - Multimedia Computing)
Odette Scharenborg – Graduation committee member (TU Delft - Multimedia Computing)
J.C. van Gemert – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)
Benjamin Timmermans – Mentor (IBM Center for Advanced Studies Benelux)
Roger Zhe Li – Mentor
Abstract
Visual impairment affects over 2.2 billion people worldwide, underscoring the critical need for effective assistive technologies. This work develops a video captioning model tailored specifically to visually impaired users, building on recent advances in deep learning. Video captioning converts video frames into textual descriptions, bridging computer vision (CV) and natural language processing (NLP). To ground the design, we surveyed young visually impaired individuals from the Visio organization, whose responses provided key insights into the requirements for our model.
We enhance the existing S2VT model by modifying its temporal attention mechanism to better recognize visual surroundings, addressing the unique challenges that visually impaired individuals face.
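To make the mechanism concrete, the sketch below shows a standard additive temporal attention module in PyTorch: at each decoding step, the decoder's hidden state scores the encoded frames, and the resulting weights pool them into a context vector for the next word. This is a minimal illustration of the general technique, assuming additive (Bahdanau-style) scoring; the module name, dimensions, and exact formulation are our own assumptions, not the implementation from the thesis.

import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    # Additive temporal attention over encoded video frames.
    # frames: (batch, n_frames, feat_dim); hidden: (batch, hidden_dim).
    def __init__(self, feat_dim, hidden_dim, attn_dim=256):
        super().__init__()
        self.frame_proj = nn.Linear(feat_dim, attn_dim)    # project frame features
        self.state_proj = nn.Linear(hidden_dim, attn_dim)  # project decoder state
        self.score = nn.Linear(attn_dim, 1)                # scalar relevance score

    def forward(self, frames, hidden):
        # Combine every frame with the current decoder state and score it.
        energy = torch.tanh(self.frame_proj(frames) + self.state_proj(hidden).unsqueeze(1))
        weights = torch.softmax(self.score(energy).squeeze(-1), dim=1)  # (batch, n_frames)
        # Weighted sum of frame features: the context used to generate the next word.
        context = torch.bmm(weights.unsqueeze(1), frames).squeeze(1)    # (batch, feat_dim)
        return context, weights

# Illustrative usage: 28 frames of 500-d features, a typical S2VT-style setup.
attn = TemporalAttention(feat_dim=500, hidden_dim=512)
context, weights = attn(torch.randn(2, 28, 500), torch.randn(2, 512))

At each decoding step, the context vector is typically concatenated with the previous word embedding before being fed to the LSTM decoder, so that words such as verbs can attend to the frames where the corresponding action occurs.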
This research examines the model's sensitivity to actions, the readability of the generated captions, and methods for reducing latency. To evaluate the model's effectiveness, we implement readability metrics, an approach not previously applied in video captioning evaluation. Our findings contribute to greater accessibility and independence for visually impaired individuals through improved video captioning.
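As an illustration of such an evaluation, off-the-shelf readability scores can be computed directly on generated captions. The snippet below uses the textstat package to compute Flesch Reading Ease and Flesch-Kincaid Grade Level; the choice of these two metrics and of textstat is an assumption made for this sketch, and the metrics used in the thesis may differ.

import textstat

caption = "a man is slicing a cucumber in a kitchen"

# Flesch Reading Ease: higher scores indicate easier-to-read text (roughly 0-100).
print(textstat.flesch_reading_ease(caption))

# Flesch-Kincaid Grade Level: approximate US school grade needed to read the text.
print(textstat.flesch_kincaid_grade(caption))

Averaging such scores over a test set yields a single readability figure that can be reported alongside standard captioning metrics such as BLEU or METEOR.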