A.C. TOCIU

2 records found

Master thesis (2026) - A.C. TOCIU, Joana Gonçalves, Y.I. Tepeli
Sample selection bias is a widespread cause of distribution shift between the train and test sets, which can significantly degrade the generalizability and performance of machine learning models. To mitigate distribution shifts, numerous domain adaptation techniques have been developed, which adapt the train set to the test set. However, adapting to a specific test set under sample selection bias might prevent the model from generalizing properly across the entire problem domain, and it requires re-adaptation whenever the test data changes. Therefore, we propose a novel adaptation strategy, called global domain adaptation, in which we instead adapt to a larger (global) domain representative of the distribution from which both the train and test sets originate. We introduce a comprehensive benchmark, consisting of synthetic datasets and selection biases as well as complex bioinformatics datasets with intrinsic biases, to investigate the behavior and limitations of domain adaptation techniques when adapting to the global domain. Our benchmark reveals interesting performance patterns across categories of domain adaptation techniques: minimax estimators are very fragile in practice, while deep domain adaptation has lower stability in spite of increased architectural complexity. Lastly, we find that global domain adaptation is a viable approach for certain techniques such as importance weighting, while semi-supervised techniques tend to perform best for existing test set adaptation. ...
Importance weighting is a class of domain adaptation techniques for machine learning that aims to correct the discrepancy in distribution between the train and test datasets, often caused by sample selection bias. In doing so, it frequently uses unlabeled data from the test set. However, this approach has certain drawbacks: it requires retraining for each new test set and fails when the number of test samples is very small. Therefore, we study the performance of importance weighting techniques when the unlabeled data comes from an underlying domain instead of one specific test set. We propose an evaluation framework inspired by scenarios traditionally known to pose difficulties for importance weighting and apply it to two popular algorithms, KMM and KLIEP. Our results reveal that both algorithms produce statistically significant classification improvements in most experiments. However, their performance is highly dependent on the characteristics of the dataset and the sampling bias. In particular, class overlap seems to influence adaptation ability in the case of unequal conditional probabilities of the source and target domains, while the "intensity" of the sampling bias is an important confounding factor when the train set size is small. ...
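
As a rough illustration of the idea described in the abstract above, the sketch below reweights a biased train sample using unlabeled data drawn from a broader pool standing in for the underlying domain. It is not KMM or KLIEP and is not taken from the thesis: the density ratio is estimated with a simple shared histogram, chosen only because it fits in a few lines. The function name `histogram_weights`, the one-dimensional uniform data, and the selection rule (keep a point with probability equal to its value) are all assumptions made for the example.

```python
import random
from collections import Counter


def histogram_weights(source, target, n_bins=10, lo=0.0, hi=1.0):
    """Estimate w(x) = p_target(x) / p_source(x) for each source sample.

    Hypothetical helper: both samples are binned on a shared grid and the
    ratio of bin frequencies serves as a crude density-ratio estimate.
    """
    width = (hi - lo) / n_bins

    def bin_of(x):
        # Clamp so that values at the edges fall into the first/last bin.
        return max(0, min(int((x - lo) / width), n_bins - 1))

    src = Counter(bin_of(x) for x in source)
    tgt = Counter(bin_of(x) for x in target)
    # Every source sample sits in a bin with at least one source point,
    # so the denominator is never zero.
    return [
        (tgt[bin_of(x)] / len(target)) / (src[bin_of(x)] / len(source))
        for x in source
    ]


rng = random.Random(0)

# Unlabeled pool standing in for the underlying (global) domain: uniform on [0, 1).
global_pool = [rng.random() for _ in range(20000)]

# Biased train sample: a point x is kept with probability x (assumed selection rule),
# so large values are over-represented.
train = [x for x in (rng.random() for _ in range(20000)) if rng.random() < x]

weights = histogram_weights(train, global_pool)

# Estimate the domain-wide mean of x from the biased sample,
# without and with importance weights (self-normalized).
naive = sum(train) / len(train)
weighted = sum(w * x for w, x in zip(weights, train)) / sum(weights)
print(f"naive={naive:.3f} weighted={weighted:.3f}")
```

Under these assumptions the true mean over the pool's distribution is 0.5; the unweighted train mean drifts towards the over-sampled region, while the weighted estimate should land closer to 0.5. A histogram ratio scales poorly beyond a few dimensions, which is one reason kernel-based estimators such as the KMM and KLIEP algorithms named above are used in practice.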