Explainable Survival Analysis for Urothelial Cancer

Abstract

Survival analysis is a statistical method for predicting when an event will occur. Machine learning survival models have been applied in many cancer studies, but they are not always interpretable. This study was prompted by the lack of research on explainable survival analysis for urothelial cancer. It offers insight into the generalizability and explainability of machine learning models for urothelial cancer, and examines how the models can be made interpretable in the presence of collinearity. We compared the performance of five models: Rank Linear Support Vector Machine (SVM), Rank Kernel SVM, Coxnet, Random Survival Forest (RSF), and Gradient Boosting (Gboost). The models were trained on gene expression variables and clinical variables from the Memorial Sloan Kettering (MSK) and The Cancer Genome Atlas (TCGA) datasets, and were evaluated using the concordance index (C-index). We used Permutation Feature Importance (PFI), a model-agnostic method, to explain the models, and Principal Component Analysis (PCA) to deal with collinearity. The best linear model was Rank Linear SVM (C-index = 0.58) and the best non-linear model was RSF (C-index = 0.63). PFI showed that some of the most important genes are expressed in urothelial cancer, and one of them is a prognostic marker. PCA allowed us to deal with collinearity, and models using PCA performed comparably to models not using it. PFI combined with PCA showed that the processes associated with the top genes are prevalent in cancer.
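
As a rough illustration of how the C-index and PFI fit together, the following Python sketch scores a risk model by concordance and then measures how much that score drops when each feature is shuffled. It is not the study's pipeline: the toy data, the one-feature risk score, and the helper names `concordance_index` and `permutation_importance` are all assumptions made for illustration, and ties in survival time are ignored for simplicity.

```python
import random


def concordance_index(times, events, risk_scores):
    """Harrell's C-index: share of comparable pairs ranked in the right order.

    A pair (i, j) is comparable when subject i has an observed event and
    times[i] < times[j]. It is concordant when i has the higher risk score.
    Tied risk scores count as half. Ties in time are ignored in this sketch.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored subjects cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")


def permutation_importance(predict, X, times, events, n_repeats=20, seed=0):
    """Mean drop in C-index after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = concordance_index(times, events, [predict(row) for row in X])
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this one column permuted.
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            score = concordance_index(times, events, [predict(row) for row in X_perm])
            drops.append(baseline - score)
        importances.append(sum(drops) / n_repeats)
    return baseline, importances


# Hypothetical toy cohort: feature 0 drives risk, feature 1 is unrelated noise.
X = [[0.1, 5.0], [0.4, 1.0], [0.9, 3.0], [0.3, 2.0], [0.7, 4.0], [0.6, 0.0]]
times = [10, 7, 1, 8, 3, 5]                      # follow-up time per subject
events = [False, True, True, True, True, False]  # False means censored


def predict(row):
    """Stand-in for a fitted survival model: risk depends on feature 0 only."""
    return row[0]


baseline, importances = permutation_importance(predict, X, times, events)
print("baseline C-index:", baseline)
print("mean C-index drop per feature:", importances)
```

Because the stand-in model ignores feature 1, shuffling that column leaves the C-index unchanged and its importance is zero, while shuffling feature 0 lowers the score. With strongly collinear features this attribution becomes unreliable, since a shuffled feature's information can persist in its correlated partners, which is the motivation for applying PCA before PFI.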