Self-Supervised Representation Learning for Relational Multimodal Data

Should we combine multiple pretext tasks?

Bachelor Thesis (2024)

Author(s)

I. Mc Auliffe (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Kubilay Atasu – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

T.A. Akyıldız – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

B. Özkan – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty

Electrical Engineering, Mathematics and Computer Science

Representation Learning, Self-supervised learning, Multi-task learning

To reference this document use

https://resolver.tudelft.nl/uuid:13ca8b99-c261-4388-9030-27197c8930f9

More Info

Publication Year

2024

Language

English

Graduation Date

27-06-2024

Awarding Institution

Delft University of Technology

Project

CSE3000 Research Project

Programme

Computer Science and Engineering

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Deep learning models can use pretext tasks to learn representations from unlabelled datasets. Although there have been several works on representation learning and pre-training, to the best of our knowledge, combining pretext tasks in a multi-task setting for relational multimodal data has not been done before. In this work, we implemented four pretext tasks on top of a framework for handling relational multimodal data and evaluated them on two datasets. We first identified the best-performing masking strategy for the pretext tasks that require masking. Then, we compared different combinations of the pretext tasks using self-supervised metrics as a proxy for the quality of the learned representation. The results reveal that masking values by replacing them with samples from the column's empirical distribution yields 4.6% and 4% higher accuracy on the two datasets, respectively, than replacing them with a fixed value. We also found that different combinations of pretext tasks, even with different numbers of tasks, converge to only marginally different values, and that MoCo further reduces this difference. Our findings imply that the number of pretext tasks can scale efficiently, allowing a more diverse representation to be learned.
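
The two masking strategies compared in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: the function name `mask_column`, its parameters, and the example column are hypothetical, and the sketch assumes that sampling from a column's empirical distribution means drawing uniformly, with replacement, from the values observed in that column.

```python
import random


def mask_column(values, mask_rate, strategy, rng, fill_value=None):
    """Corrupt a random subset of entries in one column (illustrative only).

    Returns the corrupted column and a list of booleans marking which
    positions were masked.
    """
    values = list(values)
    corrupted = list(values)
    # Each entry is selected for masking independently with probability mask_rate.
    masked = [rng.random() < mask_rate for _ in values]
    for i, is_masked in enumerate(masked):
        if not is_masked:
            continue
        if strategy == "fixed":
            # Replace the entry with a constant placeholder value.
            corrupted[i] = fill_value
        elif strategy == "empirical":
            # Replace the entry with a value drawn from the observed column,
            # so frequent values are proportionally more likely to be drawn.
            corrupted[i] = rng.choice(values)
        else:
            raise ValueError(f"unknown strategy: {strategy!r}")
    return corrupted, masked


rng = random.Random(0)
amounts = [12.5, 80.0, 12.5, 430.0, 55.0, 80.0]  # hypothetical numeric column
empirical, empirical_mask = mask_column(amounts, 0.5, "empirical", rng)
fixed, fixed_mask = mask_column(amounts, 0.5, "fixed", rng, fill_value=0.0)
```

With the empirical strategy, a corrupted entry is always a plausible value for its column and may even coincide with the original, whereas a fixed placeholder is trivially recognisable as masked.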

Files

Ilias_RP_paper.pdf

(pdf | 0.384 Mb)

License info not available