Interpretable classification of tumours through multiple instance learning and somatic mutations

Abstract

Next-generation sequencing is being brought into the clinic. Screening of disease-associated genes will aid the diagnosis of disorders with a genetic component. The diagnosis of cancer is of particular interest because of its variety and prevalence. The mutations obtained by sequencing provide clues about underlying biological properties that could be used for classification. Classifying a tumour as a particular type of cancer is an important step towards treatment, but currently no method exists that can directly classify cancers using mutations derived from next-generation sequencing. We have developed a classification method in which tumours are modelled as bags of annotated somatic mutations. Our method uses a machine learning approach to identify and select the relevant mutations and subsequently trains a classifier for each type of cancer. The selected mutations result in an interpretable model that sheds light on which biological properties are important for separating one cancer type from the others. We compare the proposed method to two other approaches: first, a gene-based approach in which the mutations are reduced to a mutation count per gene; second, a distance-based approach that uses all the available mutations but returns a model that is hard to interpret. We show that the proposed method performs as well as the first approach and achieves performance close to that of the second, while yielding a model that allows for biological interpretation.
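
To make the two data representations concrete, the following minimal Python sketch shows a tumour as a bag of annotated mutations and its reduction to a mutation count per gene, as in the gene-based comparison approach. It is an illustration only, not the authors' implementation: the `Mutation` type, its `effect` annotation field, the `gene_count_vector` helper, and the example genes are all assumptions made for this sketch.

```python
from collections import Counter
from typing import NamedTuple, Sequence


class Mutation(NamedTuple):
    """One annotated somatic mutation (hypothetical, simplified annotation)."""
    gene: str
    effect: str  # assumed annotation field, e.g. "missense"


def gene_count_vector(bag: Sequence[Mutation], genes: Sequence[str]) -> list[int]:
    """Reduce a bag of mutations to one mutation count per gene.

    The output order follows `genes`; genes without mutations get 0.
    """
    counts = Counter(m.gene for m in bag)
    return [counts[g] for g in genes]


# A tumour modelled as a bag (unordered collection) of annotated mutations.
tumour = [
    Mutation("TP53", "missense"),
    Mutation("KRAS", "missense"),
    Mutation("TP53", "nonsense"),
]
genes = ["KRAS", "PIK3CA", "TP53"]  # illustrative gene panel

features = gene_count_vector(tumour, genes)
print(features)
```

The count vector discards the per-mutation annotations, whereas the bag representation keeps them available for selecting relevant mutations.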