Using the Multiple Instance Learning framework to address differential regulation

Abstract

Cell differentiation is a natural process that occurs in all higher organisms from the early fetal stage of life onward. It also plays a part in disease, such as cancer, where the cell cycle becomes deregulated and cells behave differently from healthy ones. Differentiation occurs even though the genome is identical across all cell types of the same organism. The motivation behind the current work is to understand why this happens. Cells differentiate because of different gene expression patterns. The genomic features in or around a gene determine its expression. One of these genomic features is the binding of Transcription Factors (TFs), which are proteins that bind in the promoter region of genes and determine whether those genes are expressed. Other genomic features influence the binding of TFs close to genes, such as the accessibility of DNA, the levels of DNA methylation, or the modification of histones. The purpose of this study is to identify the genomic features that influence the binding of the TFs that are responsible for gene expression. Standard classification cannot express that multiple TFs need to bind in a gene's promoter region for it to be expressed, or that the number of TFs varies among genes. The TF labels are also unknown: it is not known which TF, or TFs, are responsible for the expression of a given gene. For these reasons, this problem, and the data, fit the Multiple Instance Learning (MIL) framework. A method is formulated in which a gene is treated as a bag and all of its TF binding sites are instances. The results are promising, as TFs that were selected as important for gene expression were found to be so in a biological example.
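
To make the bag-and-instance formulation concrete, the following is a minimal sketch of how such data might be represented. It is not the method developed in this work. The names `Bag`, `instance_score`, `bag_score` and `predict`, the three example features, the weights and the threshold are all hypothetical, and the max-pooling rule (a bag is positive if its highest-scoring instance is) is the standard MIL assumption, used here only for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Bag:
    """One gene: a bag of TF binding-site instances with a single bag-level label."""
    gene_id: str
    instances: List[List[float]]  # one feature vector per TF binding site
    expressed: bool               # bag label; instance labels are unknown


def instance_score(features: Sequence[float], weights: Sequence[float]) -> float:
    """Linear score of a single binding site (hypothetical scoring function)."""
    return sum(f * w for f, w in zip(features, weights))


def bag_score(bag: Bag, weights: Sequence[float]) -> float:
    """Standard MIL assumption: the bag is scored by its best instance.

    Assumes the bag holds at least one instance.
    """
    return max(instance_score(x, weights) for x in bag.instances)


def predict(bag: Bag, weights: Sequence[float], threshold: float = 0.0) -> bool:
    """Predict the gene as expressed if its bag score exceeds the threshold."""
    return bag_score(bag, weights) > threshold


# Hypothetical features per binding site:
# [DNA accessibility, DNA methylation level, histone modification signal]
weights = [1.0, -1.0, 0.5]  # illustrative values, not learned from data

gene_a = Bag("gene_a", [[0.9, 0.1, 0.8], [0.2, 0.7, 0.1]], expressed=True)
gene_b = Bag("gene_b", [[0.1, 0.9, 0.0]], expressed=False)

predictions = {bag.gene_id: predict(bag, weights) for bag in (gene_a, gene_b)}
```

Bags may hold different numbers of instances, which reflects the varying number of TF binding sites per gene, and only the bag carries a label. A rule requiring several instances to score highly at once, closer to the requirement that multiple TFs bind, would replace the `max` in `bag_score`.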