The effectiveness of subspace mapping techniques adapted to unlabeled samples from a global domain in mitigating sample selection bias

Bachelor Thesis (2023)
Author(s)

T.F.R. van Hoorn (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Joana P. Goncalves – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

Yasin Tepeli – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

J. Urbano – Graduation committee member (TU Delft - Multimedia Computing)

Faculty
Electrical Engineering, Mathematics and Computer Science
Copyright
© 2023 Timo van Hoorn
Publication Year
2023
Language
English
Graduation Date
28-06-2023
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Sample selection bias occurs when the samples selected into a subset of a data set follow a different distribution than the data set they were drawn from. When the training set is biased in this way, a classifier trained on it may predict samples from a testing set suboptimally. Domain adaptation techniques try to adapt classifiers to such a bias in the training or testing set. Subspace mapping techniques do this by searching for common subspaces between the source and target domains, where the source domain contains all samples used for training and the target domain contains the samples that must be predicted. This project evaluates the effectiveness of two subspace mapping techniques in mitigating sample selection bias: subspace alignment (SA) and transfer component analysis (TCA). The research assumes that no samples from the target domain are available, only unlabeled samples from an underlying global domain. The results show that SA is more effective on data sets with fewer features and where the source and target domains are further apart. TCA is more effective when more training samples are available, on data sets with fewer features, and where the distance between the source and target domains is not too large. The effectiveness of both methods also depends on the type and form of the data sets they are applied to.
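
To make the idea of a common subspace concrete, the following is a minimal sketch of subspace alignment in its commonly described form: compute a PCA basis for each domain, then map the source basis onto the other basis with a linear alignment matrix. It is not the implementation used in the thesis. The function names `pca_basis` and `subspace_alignment`, the parameter `d` (subspace dimension), and the choice to use the unlabeled global samples in the role normally played by target samples are assumptions made for illustration. The sketch only centers the data; any further normalization is left out.

```python
import numpy as np


def pca_basis(X, d):
    """Return the top-d principal directions of X as an (n_features, d) matrix."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:d].T


def subspace_alignment(X_source, X_global, d):
    """Align the source PCA subspace to the PCA subspace of the global samples.

    Assumption for this sketch: the unlabeled global samples stand in for
    the target samples that are not available.
    """
    P_source = pca_basis(X_source, d)
    P_global = pca_basis(X_global, d)

    # Alignment matrix that maps source basis coordinates to global basis coordinates.
    M = P_source.T @ P_global

    # Source samples expressed in the aligned subspace.
    Z_source = (X_source - X_source.mean(axis=0)) @ P_source @ M
    # Global samples expressed in their own subspace.
    Z_global = (X_global - X_global.mean(axis=0)) @ P_global
    return Z_source, Z_global, M
```

Under these assumptions, a classifier would be trained on `Z_source` with the source labels, and samples to be predicted would be projected onto the global basis before classification. When both inputs are the same data, the two bases coincide and `M` reduces to the identity, so no adaptation takes place.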

Files

Final_Paper.pdf
(pdf | 0.612 Mb)
License info not available