The shift to precision medicine in cancer focuses on providing therapies that target the vulnerabilities of each individual patient's tumor. This approach involves identifying cancer subtypes and discovering targets, such as genetic interactions, to treat patients who lack effective therapy. While computational tools, especially machine learning methods, are essential to analyze complex high-dimensional molecular data and suggest new candidate treatment strategies, their effectiveness is often questioned due to data-related challenges. Specifically, limitations in data collection result in sparse or biased biological data, hindering accurate decision-making and the identification of correct patterns. This thesis proposes state-of-the-art solutions to learn improved prediction models for precision medicine and beyond by leveraging relevant data that was previously ignored and by addressing issues of data sparsity and bias.
Prediction of gene synthetic lethalities to identify novel therapeutic targets has overlooked sequence similarity, which is both a notable indicator of functional relation and available for every gene pair, unlike the sparser data sources often used for this prediction task. Existing models also struggle to generalize beyond known synthetic lethalities due to an overreliance on data affected by prominent biases. Similarly, the stratification of cancer cohorts without effective treatments is challenging due to the small sample sizes of cancer (sub)cohorts, such as oncogene-driven cohorts. In addition, stratification might not directly uncover an actionable treatment opportunity. Integrating dense protein sequence similarity into synthetic lethality prediction and comprehensive drug response data into cohort stratification, each together with methodological advances, led to significant improvements and revealed promising therapeutic opportunities.
Although these integrations improved the performance of computational methods, selection bias, the nonrandom sampling of training data, remained a significant issue affecting fair evaluation and generalizability. Thus, this thesis also introduces strategies to evaluate and mitigate the impact of selection bias on model generalizability and fairness when the selected training data is not representative of the underlying population. We first artificially induce multivariate selection bias by favoring the selection of specific clusters of samples, which allows us to study the fair evaluation of model generalizability. Then, to mitigate selection bias, we advance semi-supervised learning methods that use unlabeled data to gain insight into the population distribution beyond the labeled training data, and that promote sample diversity to counter the confirmation bias typical of existing approaches. Our approaches include bias mitigation designed for specific machine learning models, such as forest ensembles and neural networks, and model-agnostic methods that operate under fewer assumptions. We show that diversity-guided semi-supervised learning strategies outperform existing domain adaptation techniques in the presence of various selection biases.
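As a rough illustration of the bias-induction idea, the sketch below draws a training subset in which samples from chosen clusters are more likely to be selected. It is a minimal sketch under stated assumptions, not the procedure used in the thesis: cluster assignments are assumed to be given, and the function name `biased_sample`, the `favor_weight` parameter, and the weighted sampling scheme (Efraimidis-Spirakis keys) are hypothetical choices made for illustration only.

```python
import random
from collections import Counter


def biased_sample(cluster_ids, favored, n_select, favor_weight=10.0, seed=0):
    """Pick n_select sample indices without replacement, favoring some clusters.

    Hypothetical helper for illustration; not the thesis's own procedure.
    cluster_ids : list with one cluster label per sample (assumed precomputed)
    favored     : set of cluster labels that should be over-represented
    favor_weight: relative selection weight of favored vs. other samples
    """
    rng = random.Random(seed)
    keyed = []
    for idx, cluster in enumerate(cluster_ids):
        weight = favor_weight if cluster in favored else 1.0
        # Efraimidis-Spirakis key: larger weights tend to yield larger keys.
        key = rng.random() ** (1.0 / weight)
        keyed.append((key, idx))
    # Keeping the n_select largest keys gives a weighted sample
    # without replacement.
    keyed.sort(reverse=True)
    return sorted(idx for _, idx in keyed[:n_select])


# Toy data: 400 samples spread evenly over 4 clusters.
cluster_ids = [i % 4 for i in range(400)]
selected = biased_sample(cluster_ids, favored={0}, n_select=100)

# Cluster 0 makes up 25% of the population but dominates the biased subset.
counts = Counter(cluster_ids[i] for i in selected)
print(counts)
```

A model trained on `selected` and tested on the remaining samples would then be evaluated on a population that its training data does not represent, which is the setting the paragraph above describes.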
The computational methods proposed in this thesis enhance therapeutic target discovery and address selection bias in machine learning, thereby advancing precision medicine in cancer and improving the generalizability and fairness of bioinformatics models.