Accurately reconstructing viral haplotypes from mixed sequencing samples is crucial for tracking viral evolution, detecting new clinically relevant variants, and guiding effective treatments. Existing de novo haplotype-aware genome assembly methods typically rely on heuristic pat
...
Accurately reconstructing viral haplotypes from mixed sequencing samples is crucial for tracking viral evolution, detecting new clinically relevant variants, and guiding effective treatments. Existing de novo haplotype-aware genome assembly methods typically rely on heuristic path extraction strategies, which limit exploration of the full solution space and may fail to recover low-abundance haplotypes. In this work, we investigate the use of an Integer Linear Programming (ILP) formulation for the Minimum Flow Decomposition (MFD) problem to reconstruct haplotypes and estimate their abundances from a contig variation graph. Our approach integrates three main components: (1) a new pipeline for constructing contig variation graphs from pre-assembled contigs and reads, (2) a Minimum Path Cover (MPC) step to estimate the number of haplotypes, and (3) an MFD-based ILP to infer haplotypes and their abundances, with additional strategies to restrict the set of allowed path weights for improved tractability. We evaluate the method on simulated hepatitis C virus (HCV) and HIV datasets, comparing its assembly quality, abundance estimation accuracy, and tractability against Virus-VG and VG-flow. Results show that for low-haplotype, well-structured graphs, the MFD approach matches or exceeds the performance of existing haplotype-aware methods, with notable advantages on HIV data. However, the runtime grows exponentially with the number of haplotypes, limiting applicability to high-complexity samples. Weight-restriction strategies and improving graph construction can mitigate this runtime challenge. Our findings demonstrate both the potential and current scalability limits of MFD-based assembly.