Multisample motif discovery and visualization for tandem repeats
Yaran Zhang (Vrije Universiteit Amsterdam)
Marc Hulsman (TU Delft - Pattern Recognition and Bioinformatics, Vrije Universiteit Amsterdam)
Alex Salazar (Vrije Universiteit Amsterdam)
Niccolò Tesi (TU Delft - Pattern Recognition and Bioinformatics, Vrije Universiteit Amsterdam)
Lydian Knoop (Vrije Universiteit Amsterdam)
Sven van der Lee (Amsterdam UMC, Vrije Universiteit Amsterdam, TU Delft - Pattern Recognition and Bioinformatics)
Sanduni Wijesekera (Vrije Universiteit Amsterdam)
Jana Krizova (Vrije Universiteit Amsterdam)
Erik Jan Kamsteeg (Radboud University Medical Center)
Henne Holstege (Amsterdam UMC, Vrije Universiteit Amsterdam, TU Delft - Intelligent Systems)
Abstract
Tandem repeats (TRs) occupy a significant portion of the human genome and are a source of polymorphism because they vary in both size and motif composition. Some of these variations have been associated with neuropathological disorders, which highlights the clinical importance of assessing the motif structure of TRs. Assessing TR motif variation can also offer valuable insights into evolutionary dynamics and population structure. Previously, the characterization of TRs was limited by short-read sequencing technology, which cannot accurately capture full TR sequences. As long-read sequencing becomes more accessible and can capture the full complexity of TRs, there is a need for tools that characterize and analyze TRs from long-read data across multiple samples. In this study, we present MotifScope, a novel algorithm for the characterization and visualization of TRs based on a de novo k-mer approach for motif discovery. Comparative analysis against established tools reveals that MotifScope can identify a greater number of motifs and represent the underlying repeat sequences more accurately. Moreover, MotifScope has been specifically designed to enable comparison of motif composition across assemblies of different individuals, as well as across long reads within an individual, through combined motif discovery and sequence alignment. We showcase potential applications of MotifScope in diverse fields, including population genetics, clinical settings, and forensic analyses.
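
To give a rough sense of what a k-mer view of a repeat looks like, the following is a minimal illustrative sketch and not MotifScope's algorithm, whose details are not described in this abstract. It counts k-mers that are immediately followed by an identical copy and groups rotations of the same motif (for example CAG, AGC, and GCA) under one label. The function names, the toy sequence, and the choice of the lexicographically smallest rotation as the label are all assumptions made for illustration.

```python
from collections import Counter


def canonical_rotation(kmer: str) -> str:
    """Return the lexicographically smallest rotation of a k-mer.

    Hypothetical helper: rotations of one motif (CAG, AGC, GCA) describe
    the same repeat, so they are grouped under a single label.
    """
    return min(kmer[i:] + kmer[:i] for i in range(len(kmer)))


def count_tandem_kmers(seq: str, k: int) -> Counter:
    """Count k-mers that are directly followed by an identical k-mer.

    This is a simplified stand-in for motif discovery, assumed here for
    illustration only; it ignores interruptions and sequencing errors.
    """
    counts = Counter()
    for i in range(len(seq) - 2 * k + 1):
        kmer = seq[i:i + k]
        if seq[i + k:i + 2 * k] == kmer:
            counts[canonical_rotation(kmer)] += 1
    return counts


# Toy repeat with two motifs: five copies of CAG followed by three of CCG.
toy_sequence = "CAG" * 5 + "CCG" * 3
motif_counts = count_tandem_kmers(toy_sequence, k=3)

# List candidate motifs from most to least frequent.
for motif, count in motif_counts.most_common():
    print(motif, count)
```

In this toy input the CAG family (labelled AGC, its smallest rotation) is counted more often than the CCG family, reflecting its higher copy number. A real tool would additionally need to handle variable motif lengths, imperfect copies, and comparison across samples.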