Interpretable classification of tumours through multiple instance learning and somatic mutations

Master thesis (2013)

Authors

S.C. Dentro

Contributors

J. De Ridder (mentor)

D.M.J. Tax (mentor)

D.J. Adams (mentor)

Programme

M.Sc. Computer Science: Bioinformatics (TU Delft)

Machine learning, Classification, Cancer, Multiple instance learning, Somatic mutations

To reference this document use:

http://resolver.tudelft.nl/uuid:e23175cb-d1a8-4cb6-af7b-9909f6412cf5

Published Date

13-12-2013

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Next-generation sequencing is being brought into the clinic. Screening of disease-associated genes will aid the diagnosis of disorders with a genetic component. The diagnosis of cancer is of particular interest because of its variety and prevalence. The mutations obtained by sequencing provide clues about underlying biological properties that could be used for classification. Classifying a tumour as a particular type of cancer is an important step towards treatment, but currently no method exists that can directly classify cancers using mutations derived from next-generation sequencing. We have developed a classification method in which tumours are modelled as bags of annotated somatic mutations. Our method uses a machine learning approach to identify and select the relevant mutations and subsequently train a classifier for each type of cancer. The selected mutations result in an interpretable model that sheds light on which biological properties are important for separating one cancer type from the others. We compare the proposed method with two other approaches: first, a gene-based approach in which the mutations are reduced to a mutation count per gene; second, a distance-based approach that uses all the available mutations but returns a model that is hard to interpret. We show that the proposed method performs as well as the first approach and achieves performance close to that of the second, while yielding a model that allows for biological interpretation.
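As a rough illustration of the representations mentioned in the abstract, the sketch below stores each tumour as a bag (a list) of annotated mutations and then reduces each bag to a mutation count per gene, as in the gene-based comparison approach. The tumour names, the annotation fields `gene` and `consequence`, and the helper `gene_count_vector` are hypothetical and chosen only for illustration; the thesis's actual annotations, feature selection, and classifiers are not described on this page and are not reproduced here.

```python
from collections import Counter

# Hypothetical toy data: each tumour is a "bag" of annotated somatic mutations.
# The annotation fields are illustrative assumptions, not the thesis's schema.
tumours = {
    "tumour_A": [
        {"gene": "TP53", "consequence": "missense"},
        {"gene": "TP53", "consequence": "nonsense"},
        {"gene": "KRAS", "consequence": "missense"},
    ],
    "tumour_B": [
        {"gene": "BRAF", "consequence": "missense"},
    ],
}


def gene_count_vector(bag, genes):
    """Reduce a bag of mutations to one mutation count per gene."""
    counts = Counter(mutation["gene"] for mutation in bag)
    return [counts[gene] for gene in genes]


# Fixed, sorted gene order so every tumour maps to a vector of the same length.
genes = sorted({m["gene"] for bag in tumours.values() for m in bag})
features = {name: gene_count_vector(bag, genes) for name, bag in tumours.items()}

print(genes)     # ['BRAF', 'KRAS', 'TP53']
print(features)  # {'tumour_A': [0, 1, 2], 'tumour_B': [1, 0, 0]}
```

The count vectors discard the per-mutation annotations (here, `consequence`), which is the information a bag-level method can still draw on.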

Files

Sdentro_master_thesis_final.pd... (pdf)

(pdf | 4.93 Mb)