From recognition to understanding: enriching visual models through multi-modal semantic integration

Doctoral Thesis (2025)
Author(s)

S. Sharifi Noorian (TU Delft - Web Information Systems)

Contributor(s)

G.J. Houben – Promotor (TU Delft - Web Information Systems)

A. Bozzon – Promotor (TU Delft - Sustainable Design Engineering)

Jie Yang – Copromotor (TU Delft - Web Information Systems)

Research Group
Web Information Systems
Publication Year
2025
Language
English
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

This thesis addresses the semantic gap in visual understanding by enriching visual models with semantic reasoning capabilities, enabling them to handle tasks such as image captioning, visual question answering, and scene understanding. The main focus is on integrating visual and textual data, leveraging insights from human cognition, and developing a robust multi-modal foundation model. The research begins by exploring multi-modal data integration to enhance semantic and contextual reasoning in fine-grained scene recognition. The proposed multi-modal models, which combine visual and textual inputs, outperform traditional models that rely solely on visual features, particularly in complex urban environments where visual ambiguities are common. These findings underscore the value of semantic enrichment through multi-modal integration, which helps resolve visual ambiguities and improve scene understanding.
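To make the idea of combining visual and textual inputs concrete, the following is a minimal late-fusion sketch in PyTorch. It is purely illustrative and not the architecture proposed in the thesis: the class name LateFusionSceneClassifier, the embedding dimensions, and the use of simple concatenation are assumptions standing in for whatever encoders and fusion mechanism the thesis actually employs.

```python
# Illustrative late-fusion classifier (hypothetical; not the thesis's model).
# Assumes precomputed image and text embeddings, e.g. from pretrained
# vision and language encoders.
import torch
import torch.nn as nn

class LateFusionSceneClassifier(nn.Module):
    """Concatenates visual and textual embeddings, then classifies the scene."""
    def __init__(self, img_dim: int, txt_dim: int, num_classes: int):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(img_dim + txt_dim, 512),
            nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        # Late fusion: join the two modalities along the feature axis.
        fused = torch.cat([img_emb, txt_emb], dim=-1)
        return self.fusion(fused)

# Usage with random tensors standing in for encoder outputs.
model = LateFusionSceneClassifier(img_dim=768, txt_dim=768, num_classes=10)
logits = model(torch.randn(4, 768), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 10])
```

The intuition behind such a design matches the abstract's claim: when the textual input carries disambiguating context (for instance, a description accompanying a street-level image), its embedding can supply cues that the visual features alone lack, which is why multi-modal models can resolve ambiguities in complex urban scenes.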