DNA comparisons in genomics

a reference-based perspective


Abstract

Genomics is a field devoted to understanding genetic differences between populations, between individuals, and even within individuals. By constantly comparing and contrasting data from diverse sources, genomics can refine our understanding of life and identify new ways to improve our lives. However, these comparisons often present technical and biological challenges that require careful consideration of what is compared, in what context, and what might be present. In this thesis, I contribute to resolving these challenges in three different domains:

In genomic data analysis, analysts often compare new genomic data to an established reference to reduce costs. However, this approach biases comparisons in favor of population-specific genetics, since such references encode only a fraction of the genetics of a given population. To address this bias, I propose a method that integrates population variability directly into the comparison process. This integration reduces the contrast between sample and reference and brings the comparison closer to a personalized one, so that samples are treated the same way regardless of their underlying population. The method improves genome characterization and simplifies downstream analyses that rely on these comparisons. As a result, it yields a more accurate portrayal of the genetics of a given population as a whole.

In non-invasive sequencing-based prenatal testing, we rely on circulating cell-free DNA from maternal plasma to detect pathogenic variants that may affect the fetus. Determining the presence of such variants generally requires a healthy baseline that describes the normative state. However, this DNA is a mixture of maternal DNA and a much smaller proportion of fetal DNA, and disentangling the two remains difficult, primarily because of biological and technical biases. While these biases can be partially mitigated by changing the baseline, and thus contrasting within the individual's DNA mixture rather than against a divergent population of mixtures, further improvements are still needed. I present a generalized framework in which the signal-to-noise ratio can be further improved by fully exploiting the information in sequencing data, allowing for more robust predictions at even earlier stages of pregnancy.

The composition of the gut ecosystem can have short- and long-term effects on our health. It is therefore important to understand how it is formed and how a healthy balance can be maintained for as long as possible to preserve our health. To do this, ecosystems must be stratified and compared based on health indices. I show in extremely contrasting Dutch subpopulations that comparing the gut ecosystems of centenarians with those of Alzheimer's patients yields valuable characteristics of divergent health states. However, enabling these comparisons requires significant effort, because the many organisms present and the technological limitations in measuring them introduce bias at all levels.