Does text matter?

Extending CLIP with OCR and NLP for image classification and retrieval

Bachelor Thesis (2023)
Author(s)

J. Sassoon (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Zilong Zhao – Mentor (Université Grenoble Alpes)

Y. Chen – Mentor (TU Delft - Data-Intensive Systems)

Anna Lukina – Graduation committee member (TU Delft - Algorithmics)

Faculty
Electrical Engineering, Mathematics and Computer Science
Copyright
© 2023 Jordan Sassoon
Publication Year
2023
Language
English
Graduation Date
27-06-2023
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Contrastive Language-Image Pretraining (CLIP) has attracted broad interest due to its impressive performance on a variety of computer vision tasks: image classification, image retrieval, action recognition, feature extraction, and more. The model learns to associate images with their textual descriptions, a powerful approach that allows it to generalize to unseen domains. However, these descriptions often fail to capture text contained within the image itself, a source of information that could prove useful for several computer vision tasks. This limitation requires finetuning in domains where such embedded text is important; indeed, CLIP shows mixed performance on Optical Character Recognition (OCR). This paper proposes a novel architecture, OSBC (OCR Sentence BERT CLIP), which combines CLIP with a custom text extraction pipeline composed of an OCR model and a Natural Language Processing (NLP) model. OSBC uses the text contained within images as an additional feature when performing image classification and retrieval. We tested the model on multiple datasets for each task; OSBC occasionally outperformed CLIP when images contained text, while remaining finetunable and improving robustness. OSBC was also designed to be generalizable, i.e., expected to perform well on unseen domains without finetuning, though this was not achieved in practice.
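The abstract does not specify how OSBC fuses the two branches, so the following is only a minimal sketch of the general idea under stated assumptions: pytesseract stands in for the thesis's OCR model, all-MiniLM-L6-v2 stands in for its Sentence-BERT model, and the two scores are blended with a simple weighted sum whose weight alpha is a hypothetical choice, not the thesis's actual fusion.

```python
# Sketch of an OSBC-style zero-shot classifier: CLIP image-label scores blended
# with the similarity between OCR-extracted text and the candidate labels.
# pytesseract, the checkpoints, and the alpha-weighted fusion are illustrative
# assumptions, not the components used in the thesis.
import torch
import pytesseract
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from sentence_transformers import SentenceTransformer, util

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
sbert = SentenceTransformer("all-MiniLM-L6-v2")

def classify(image_path: str, labels: list[str], alpha: float = 0.5) -> str:
    image = Image.open(image_path).convert("RGB")

    # CLIP branch: standard zero-shot image-label similarity.
    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        clip_probs = clip(**inputs).logits_per_image.softmax(dim=-1).squeeze(0)

    # OCR + NLP branch: similarity between in-image text and the labels.
    ocr_text = pytesseract.image_to_string(image).strip()
    if ocr_text:
        text_emb = sbert.encode(ocr_text, convert_to_tensor=True)
        label_embs = sbert.encode(labels, convert_to_tensor=True)
        ocr_probs = util.cos_sim(text_emb, label_embs).squeeze(0).softmax(dim=-1)
        scores = alpha * clip_probs + (1 - alpha) * ocr_probs
    else:
        scores = clip_probs  # no text found: fall back to CLIP alone

    return labels[int(scores.argmax())]

print(classify("storefront.jpg", ["bakery", "pharmacy", "bookstore"]))
```

The fallback to CLIP alone when OCR finds no text mirrors the abstract's observation that embedded text only helps when images actually contain it.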
