Minimum Flow Decomposition for Haplotype-Aware Genome Assembly
R.R. Huizenga (TU Delft - Electrical Engineering, Mathematics and Computer Science)
J. van Bemmelen – Mentor (TU Delft - Pattern Recognition and Bioinformatics)
Jasmijn A. Baaijens – Mentor (TU Delft - Pattern Recognition and Bioinformatics)
Thomas Abeel – Graduation committee member (TU Delft - Pattern Recognition and Bioinformatics)
Sebastijan Dumančić – Graduation committee member (TU Delft - Algorithmics)
More Info
expand_more
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
Accurately reconstructing viral haplotypes from mixed sequencing samples is crucial for tracking viral evolution, detecting new clinically relevant variants, and guiding effective treatments. Existing de novo haplotype-aware genome assembly methods typically rely on heuristic path extraction strategies, which limit exploration of the full solution space and may fail to recover low-abundance haplotypes. In this work, we investigate the use of an Integer Linear Programming (ILP) formulation for the Minimum Flow Decomposition (MFD) based problem to reconstruct haplotypes and estimate their abundances through a contig variation graph. Our approach integrates three main components: (1) a new pipeline for constructing contig variation graphs from reads and contigs build from these reads, (2) a Minimum Path Cover (MPC) step to estimate the number of haplotypes, and (3) an MFD-based ILP to infer haplotypes and their abundances, with additional strategies to restrict the set of allowed path weights for improved tractability. We evaluate the method on simulated hepatitis C virus (HCV) and HIV datasets, comparing its assembly quality, abundance estimation accuracy, and tractability against Virus-VG and VG-flow. Results show that for low-haplotype, well-structured graphs, the MFD approach matches or exceeds the performance of existing haplotype-aware methods, with notable advantages on HIV data. However, the runtime grows exponentially with the number of haplotypes, which limits its applicability to low haplotype-count samples. Weight-restriction strategies and improving graph construction can mitigate this runtime challenge. Our findings demonstrate both the potential and current scalability limits of MFD-based assembly.