Leveraging Related Datasets to Improve Model Performance on an Underrepresented Target Population

Master Thesis (2023)
Author(s)

M.V. Ries (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

D.M.J. Tax – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

Marcel J. T. Reinders – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)

O.E. Scharenborg – Graduation committee member (TU Delft - Multimedia Computing)

Faculty
Electrical Engineering, Mathematics and Computer Science
Copyright
© 2023 Maxmillan Ries
Publication Year
2023
Language
English
Graduation Date
06-07-2023
Awarding Institution
Delft University of Technology
Programme
Computer Science
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Training deep learning models for time series prediction on a target population often requires a substantial amount of training data, which may not be readily available. This work addresses the challenge of leveraging multiple related sources of time series data that share the same feature space to improve the prediction performance of a deep learning model on a target population. Specifically, we focus on a scenario in which the target dataset, representing the desired target population, is underrepresented, while the source datasets come from mismatched populations but are large enough to train a deep learning model. We explore state-of-the-art techniques, including transfer learning, ensemble learning, and domain adaptation, to leverage source datasets towards a target population using real-world medical data. Additionally, we investigate the use of baselines derived from model performance as a heuristic to quantify the magnitude of the distribution mismatch between one or more sources and a target. Our results demonstrate that a set of well-defined baselines can effectively quantify the distribution mismatch and provide insight into which leveraging technique to choose for a given mismatch scenario. Furthermore, our results show that all of the examined techniques can be employed to leverage related source datasets towards the target, though their performance varies with the characteristics of the distribution mismatch. Finally, we discuss the applicability of this research to new scenarios, along with avenues for future research.
