Reliable estimation of intra-host viral diversity is essential for understanding viral evolution,
treatment resistance, and outbreak dynamics. However, technical artefacts introduced during
sample preparation and sequencing can distort variant frequencies and lead to incorrect conclusions. One such class of artefacts is ligated chimeric reads, also referred to as ligation chimeras, which form when full-length DNA molecules are erroneously joined during library preparation. Ligation chimeras are currently poorly characterized, and their impact on downstream analyses is largely unknown. In this thesis, we developed a modular and reproducible computational pipeline to detect, quantify, and analyze ligated chimeras in amplicon-based viral sequencing datasets. We applied this pipeline to both public and internal datasets, evaluating the prevalence and structural patterns of chimeras and their impact on viral diversity estimates. Our results show that ligated chimeras are widespread, disproportionately affect specific amplicons, and can introduce substantial allele frequency shifts and spurious variants. Consequently, common filtering strategies in current pipelines risk discarding true low-frequency variants or failing to remove artefactual ones. These findings highlight the importance of chimera-aware preprocessing for accurate viral diversity estimation from long-read sequencing data.