Video Captioning for the Visually Impaired
F. Xu (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Julián Urbano – Mentor (TU Delft - Multimedia Computing)
Odette Scharenborg – Graduation committee member (TU Delft - Multimedia Computing)
J.C. van Gemert – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)
Benjamin Timmermans – Mentor (IBM Center for Advanced Studies Benelux)
Roger Zhe Li – Mentor
Abstract
Visual impairment affects over 2.2 billion people worldwide, underscoring the critical need for effective assistive technologies. This work develops a video captioning model tailored specifically to visually impaired users, building on recent advances in deep learning. Video captioning converts video frames into textual descriptions, bridging computer vision (CV) and natural language processing (NLP). To ground the design, we surveyed young visually impaired individuals from the Visio organization, whose responses provided key insights into the requirements for our model.
We enhance the existing S2VT model by modifying its temporal attention mechanism to better recognize visual surroundings, addressing the unique challenges that visually impaired individuals face.
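To make the mechanism concrete, the sketch below shows a standard additive temporal attention module in PyTorch: at each decoding step, the decoder's hidden state scores the encoded frames, and the resulting weights pool them into a context vector for the next word. This is a minimal illustration of the general technique, assuming additive (Bahdanau-style) scoring; the module name, dimensions, and exact formulation are our own assumptions, not the implementation from the thesis.

import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    # Additive temporal attention over encoded video frames.
    # frames: (batch, n_frames, feat_dim); hidden: (batch, hidden_dim).
    def __init__(self, feat_dim, hidden_dim, attn_dim=256):
        super().__init__()
        self.frame_proj = nn.Linear(feat_dim, attn_dim)    # project frame features
        self.state_proj = nn.Linear(hidden_dim, attn_dim)  # project decoder state
        self.score = nn.Linear(attn_dim, 1)                # scalar relevance score

    def forward(self, frames, hidden):
        # Combine every frame with the current decoder state and score it.
        energy = torch.tanh(self.frame_proj(frames) + self.state_proj(hidden).unsqueeze(1))
        weights = torch.softmax(self.score(energy).squeeze(-1), dim=1)  # (batch, n_frames)
        # Weighted sum of frame features: the context used to generate the next word.
        context = torch.bmm(weights.unsqueeze(1), frames).squeeze(1)    # (batch, feat_dim)
        return context, weights

# Illustrative usage: 28 frames of 500-d features, a typical S2VT-style setup.
attn = TemporalAttention(feat_dim=500, hidden_dim=512)
context, weights = attn(torch.randn(2, 28, 500), torch.randn(2, 512))

At each decoding step, the context vector is typically concatenated with the previous word embedding before being fed to the LSTM decoder, so that words such as verbs can attend to the frames where the corresponding action occurs.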
This research examines the model's sensitivity to actions, the readability of the generated captions, and methods for reducing latency. To evaluate the model's effectiveness, we implement readability metrics, an approach not previously applied in video captioning evaluation. Our findings contribute to greater accessibility and independence for visually impaired individuals through improved video captioning.
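As an illustration of such an evaluation, off-the-shelf readability scores can be computed directly on generated captions. The snippet below uses the textstat package to compute Flesch Reading Ease and Flesch-Kincaid Grade Level; the choice of these two metrics and of textstat is an assumption made for this sketch, and the metrics used in the thesis may differ.

import textstat

caption = "a man is slicing a cucumber in a kitchen"

# Flesch Reading Ease: higher scores indicate easier-to-read text (roughly 0-100).
print(textstat.flesch_reading_ease(caption))

# Flesch-Kincaid Grade Level: approximate US school grade needed to read the text.
print(textstat.flesch_kincaid_grade(caption))

Averaging such scores over a test set yields a single readability figure that can be reported alongside standard captioning metrics such as BLEU or METEOR.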