A Critical Perspective On Microarray Breast Cancer Gene Expression Profiling

Abstract

Microarrays offer biologists an exciting tool for assessing the expression levels of thousands of genes simultaneously. At their inception, microarrays were hailed as a new dawn in cancer biology and oncology practice, with the hope that diseases like breast cancer would be solved within a decade. Various high-profile publications showed the immense potential of this technique for breast cancer event prediction and breast cancer subtyping. From these studies it became clear that breast cancer is not a single disease at the molecular level, but comprises a heterogeneous set of subtypes associated with clear differences in gene expression patterns and clinical outcomes. However, as microarrays became more popular, it became apparent that the accurate analysis and interpretation of microarray data posed a plethora of unique challenges. Microarray data is complex from both a biological and a technical perspective, and the high feature-to-sample ratio of microarray studies rendered many classic statistical procedures useless. To make matters worse, various publications emerged that revealed severe stability problems in the model fits of early pilot studies and showed that these studies were often overly optimistic. As a result, the reliability of microarray-based experiments in general was openly questioned. Given the multitude of factors that may influence the results, it is clear that a proper evaluation of microarray breast cancer profiling is both crucial and challenging. This dissertation provides a number of carefully devised protocols by which the influence of important sources of variation can be isolated, controlled and/or explicitly quantified, even in the absence of a gold standard. Instead of being applied to data from small spike-in or dilution studies, these protocols were applied to a large collection of real-life breast cancer datasets of considerable size.
Furthermore, we extensively studied breast cancer subtyping and the evaluation of subtype-specific predictors built on the resulting subtypes, from both a practical and a theoretical perspective. This work shows that evaluating subtype-specific event prediction based on divide-and-conquer schemes brings various new statistical challenges. For a variety of frequently encountered performance measures from machine learning, several decompositions of the overall performance into subtype-specific performances are provided, which show that the relation between subtype-specific and overall performance can be highly complex and counterintuitive. In addition, the experiments in this dissertation show that with modern processing techniques and a standardized approach it is possible to construct extremely stable subtyping schemes. However, the selected approach has a strong impact on the obtained results, suggesting that a stringent standardization of the methodologies used for subtyping is not sufficient for the consistent assignment of subtypes to individual patient samples. From these findings we conclude that the molecular subtypes of breast cancer are not yet sufficiently well understood and need further refinement.
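
The point about subtype-specific versus overall performance can be illustrated with a minimal sketch. It does not reproduce the decompositions derived in the dissertation; it only uses the standard definitions of accuracy and of the area under the ROC curve (AUC). The toy cohort, the subtype labels "A" and "B", the 0.5 decision threshold and all function names are hypothetical choices made for illustration.

```python
from itertools import product

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auc(y_true, scores):
    """Chance that a random event case outscores a random non-event case (ties count 1/2)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

# Hypothetical toy cohort: two subtypes whose predictors score on different scales.
subtype = ["A"] * 4 + ["B"] * 4
event = [1, 1, 0, 0, 1, 1, 0, 0]
score = [0.4, 0.3, 0.2, 0.1, 0.9, 0.8, 0.7, 0.6]
pred = [int(s >= 0.5) for s in score]  # assumed common decision threshold

def subset(values, name):
    """Select the entries of `values` that belong to subtype `name`."""
    return [v for v, s in zip(values, subtype) if s == name]

names = sorted(set(subtype))
weights = {k: subtype.count(k) / len(subtype) for k in names}
acc_by_subtype = {k: accuracy(subset(event, k), subset(pred, k)) for k in names}
auc_by_subtype = {k: auc(subset(event, k), subset(score, k)) for k in names}

overall_acc = accuracy(event, pred)
weighted_acc = sum(weights[k] * acc_by_subtype[k] for k in names)
overall_auc = auc(event, score)

print("accuracy: overall", overall_acc, "size-weighted subtype mean", weighted_acc)
print("AUC: per subtype", auc_by_subtype, "pooled", overall_auc)
```

Overall accuracy always equals the size-weighted mean of the subtype accuracies, because every sample contributes to exactly one subtype. The pooled AUC has no such simple form: it also depends on how event cases of one subtype rank against non-event cases of the other. In this toy cohort each subtype is ranked perfectly (AUC 1.0), yet the pooled AUC is 0.75.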

Files

PhD.pdf
(pdf | 20.1 Mb)