Leveraging Related Datasets to Improve Model Performance on an Underrepresented Target Population

Master Thesis (2023)

Author(s)

M.V. Ries (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

D.M.J. Tax – Mentor (TU Delft - Pattern Recognition and Bioinformatics)

M.J.T. Reinders – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)

O.E. Scharenborg – Graduation committee member (TU Delft - Multimedia Computing)

Machine Learning, Data Science, Deep Learning, Time series data, Medical data

To reference this document use

https://resolver.tudelft.nl/uuid:0822c597-d3fc-4047-93f3-3bb850d14c13

Publication Year

2023

Language

English

Graduation Date

06-07-2023

Awarding Institution

Programme

Computer Science

Collections

thesis

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Training deep learning models for time-series prediction on a target population often requires a substantial amount of training data, which may not be readily available. This work addresses the challenge of leveraging multiple related sources of time-series data that share the same feature space to improve the prediction performance of a deep learning model for a target population. Specifically, we focus on a scenario where the target dataset, representing the desired target population, is underrepresented, while the source datasets consist of mismatched populations that are sufficiently representative for training a deep learning model. We explore state-of-the-art techniques, including transfer learning, ensemble learning, and domain adaptation, to leverage source datasets towards a target population using real-world medical data. Additionally, we investigate the use of baselines derived from model performance as a heuristic to quantify the magnitude of the distribution mismatch between one or more sources and a target. Our results demonstrate that a set of well-defined baselines can effectively quantify the distribution mismatch and provide insight into which leveraging technique to choose for a given mismatch scenario. Furthermore, our results show that all of the explored techniques can be employed to leverage related source datasets towards the target, though their performance varies depending on the characteristics of the distribution mismatch. Finally, we discuss the applicability of this research to new scenarios, along with avenues for future research.
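The idea of performance-derived baselines can be illustrated with a toy sketch. It is not the thesis's method: the abstract does not define the baselines, so the three shown here (source-only, target-only, pooled) are assumptions, as are the synthetic linear populations, the least-squares model standing in for a deep learning model, and every function name. The sketch only shows how the target-test error of a source-trained model, compared with a target-trained one, might serve as a rough indicator of distribution mismatch.

```python
import random


def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def mse(model, xs, ys):
    """Mean squared error of a fitted line on (xs, ys)."""
    a, b = model
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def make_population(n, slope, intercept, rng, noise=0.1):
    """Hypothetical synthetic population: a noisy linear relationship."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [slope * x + intercept + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


def baselines(source, target_train, target_test):
    """Assumed baselines: each model is scored on the same target test set."""
    pooled_xs = source[0] + target_train[0]
    pooled_ys = source[1] + target_train[1]
    return {
        "source_only": mse(fit_line(*source), *target_test),
        "target_only": mse(fit_line(*target_train), *target_test),
        "pooled": mse(fit_line(pooled_xs, pooled_ys), *target_test),
    }


rng = random.Random(0)
target_train = make_population(8, 2.0, 0.0, rng)     # small target sample
target_test = make_population(500, 2.0, 0.0, rng)
close_source = make_population(500, 2.1, 0.0, rng)   # mild mismatch
far_source = make_population(500, -1.0, 1.0, rng)    # strong mismatch

close = baselines(close_source, target_train, target_test)
far = baselines(far_source, target_train, target_test)

for name, result in (("close source", close), ("far source", far)):
    gap = result["source_only"] - result["target_only"]
    print(name, {k: round(v, 4) for k, v in result.items()}, "gap:", round(gap, 4))
```

In this toy setting a larger gap between the source-only and target-only errors suggests a larger mismatch, which hints at whether pooling the data is likely to help or whether an adaptation technique is the safer choice. Real time-series medical data would need a proper train/validation protocol and a model suited to the task.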

Files

Maxmillan_Ries_MSc_Thesis_Repo... (pdf)

(pdf | 1.09 Mb)

License info not available