Cytomegalovirus (CMV) serostatus of both donor and patient plays an important role in allogeneic hematopoietic stem cell transplantation, yet it is known for only 19% of donors in the global database. In this research, the other data available in the global database of the World Marrow Donor Association are used to predict CMV serostatus for donors whose status is unknown. In a statistical analysis, features such as sex, registry, ethnicity, age and height were found to be informative. A particular focus was put on investigating the relation between Human Leukocyte Antigen (HLA) and CMV. Previous literature on this relation consists of studies on small cohorts: Machida et al., 1998 (N=125) and Hassan et al., 2016 (N=1 849). The large global database (N=8 707 407) allows us to evaluate this relation for many more HLA groups and to validate the findings using a meta-analysis across registries. In this analysis, the majority of HLA groups showed small effects on CMV serostatus. However, a few groups showed consistently large increases in the likelihood of being CMV seropositive. This indicates that there is a biological relation between specific HLA and CMV serostatus. To predict CMV serostatus, a simple logistic regression classifier was improved iteratively by adding features and using more complex models. The best-performing classifier uses XGBoost on all features in the database and achieves an AUC of 0.70. Although this AUC is too low for the classifier to be relied on in patient-donor matching, it does show non-trivial predictive power. The classifier could be further improved by using a better embedding for the HLA that captures the similarities between different HLA alleles.
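As an illustration of the kind of meta-analysis across registries mentioned above, the sketch below pools per-registry odds ratios for one HLA group using inverse-variance (fixed-effect) weighting. This is a minimal sketch under stated assumptions, not necessarily the method used in this work: the abstract does not specify the pooling model, the function names `log_odds_ratio` and `pooled_odds_ratio` are hypothetical, and the counts are invented purely for demonstration.

```python
import math


def log_odds_ratio(a, b, c, d):
    """Log odds ratio and its variance for one registry's 2x2 table.

    a: HLA carriers, CMV seropositive      b: HLA carriers, CMV seronegative
    c: non-carriers, CMV seropositive      d: non-carriers, CMV seronegative
    0.5 is added to every cell (Haldane-Anscombe correction) so that
    empty cells do not cause a division by zero.
    """
    a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    log_or = math.log((a * d) / (b * c))
    variance = 1 / a + 1 / b + 1 / c + 1 / d
    return log_or, variance


def pooled_odds_ratio(tables):
    """Fixed-effect pooled odds ratio with an approximate 95% interval.

    Assumption: a fixed-effect model; a random-effects model may be more
    appropriate if the effect differs between registries.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for table in tables:
        log_or, variance = log_odds_ratio(*table)
        weight = 1 / variance  # more precise registries count more
        weighted_sum += weight * log_or
        weight_total += weight
    pooled = weighted_sum / weight_total
    std_err = math.sqrt(1 / weight_total)
    return (
        math.exp(pooled),
        math.exp(pooled - 1.96 * std_err),
        math.exp(pooled + 1.96 * std_err),
    )


# Invented counts for three hypothetical registries (not real data).
tables = [
    (120, 80, 900, 1100),
    (60, 45, 500, 640),
    (200, 150, 1500, 1900),
]
odds_ratio, ci_low, ci_high = pooled_odds_ratio(tables)
print(f"pooled OR = {odds_ratio:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```

A pooled interval that excludes 1 and per-registry estimates that point in the same direction would correspond to the "consistent" effects described above.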