Exploiting noisy and incomplete biological data for prediction and knowledge discovery

Abstract

In modern molecular biology, the vast amount of experimental data enables us to obtain a more comprehensive understanding of cellular activities, from transcription to metabolism. However, due to the inherent complexity of the cell and the various limitations of the measuring techniques, these data are often noisy and incomplete. As a result, conclusions and hypotheses generated from these data are unreliable and remain partial. This poses a major challenge in molecular biology. This thesis addresses the challenge by proposing several approaches to handling noisy and incomplete biological data, in order to improve prediction accuracy and ease knowledge discovery. It is divided into two parts, which address different problems.

Part I is dedicated to the theoretical study of building noise-tolerant classifiers in the presence of class noise and measurement noise, i.e. when the class labels or the measured attribute values of biological instances are erroneous. For the class noise problem, we present three classifiers that use probabilistic models to recover the true distribution of each class. In particular, our novel incorporation of the noise model into the Kernel Fisher discriminant offers improved prediction performance, especially on non-Gaussian data sets and on data sets with a large number of features relative to their sample size. For measurement noise, we propose integrating prior knowledge of the noise into kernel density based classifiers, using distinct kernels for individual samples, features, and feature values. The inclusion of prior knowledge is also shown to be especially beneficial in relatively under-sampled data sets.

In Part II, we exploit incomplete metabolic reaction and transcriptional regulation data, using both a network-centric and an evolution-based approach. That is, we integrate metabolic networks and regulatory networks within species, and compare the integrated networks across different species. This integrated evolutionary network method not only provides a more comprehensive view of the cellular system, but also helps to generate more reliable information and hypotheses. Our alignment framework automatically aligns the full metabolic networks of two species, taking into account all possible reaction arrangements and allowing small differences between otherwise similar reactions. We present a scoring function that measures pathway similarity in a comprehensive and flexible manner, hierarchically integrating all relevant and uncorrelated information sources. Using this method, we have identified fully conserved pathways and their variations at the regulatory and metabolic levels, discovered new pathway possibilities that are not represented in conventional databases, and generated hypotheses about missing information using its counterpart at another level and/or in another species.
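To make the measurement-noise idea from Part I concrete, the following is a minimal sketch of a kernel density based classifier in which every training sample carries its own kernel width per feature. It is an illustration under stated assumptions, not the method developed in the thesis: the Gaussian kernel, the function names `class_log_density` and `predict`, and the toy data are all hypothetical, and the per-sample widths stand in for whatever prior knowledge of the noise is available.

```python
import numpy as np


def class_log_density(x, samples, widths):
    """Log of a Gaussian kernel density estimate at point x.

    samples: (n, d) training points of one class.
    widths:  (n, d) kernel standard deviations, one per sample and feature
             (assumed here to encode prior knowledge of measurement noise).
    """
    z = (x - samples) / widths
    # Log of each sample's axis-aligned Gaussian kernel evaluated at x.
    log_kernels = -0.5 * np.sum(z**2 + np.log(2.0 * np.pi * widths**2), axis=1)
    # Numerically stable log of the mean of the kernels.
    m = log_kernels.max()
    return m + np.log(np.mean(np.exp(log_kernels - m)))


def predict(x, classes):
    """Return the label with the highest log prior plus log density.

    classes: dict mapping label -> (samples, widths, prior).
    """
    scores = {
        label: np.log(prior) + class_log_density(x, samples, widths)
        for label, (samples, widths, prior) in classes.items()
    }
    return max(scores, key=scores.get)


# Toy data (hypothetical): the second sample of class "A" is treated as a
# noisier measurement, so it gets a wider kernel than the others.
a = np.array([[0.0, 0.0], [0.2, -0.1], [-0.1, 0.1]])
widths_a = np.full_like(a, 0.5)
widths_a[1] = 2.0
b = np.array([[3.0, 3.0], [3.2, 2.9], [2.8, 3.1]])
widths_b = np.full_like(b, 0.5)

classes = {"A": (a, widths_a, 0.5), "B": (b, widths_b, 0.5)}
label_near_a = predict(np.array([0.1, 0.0]), classes)
label_near_b = predict(np.array([3.0, 3.0]), classes)
```

Giving a noisy sample a wider kernel spreads its probability mass over a larger region, so it has less influence on the density at any single point than a sample measured more precisely.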