Unraveling tandem repeat variation in personal genomes with long reads

Davide Bolognini (University of Florence, Italy)

03/06/2021 10:30 - 12:00

Tandem repeats are repeated sequences that occur adjacent to each other in genomes. Due to their prevalence and their association with a number of genetic diseases in humans, there is rising interest in developing tools for tandem repeat profiling. Genome-wide discovery approaches are needed to fully understand their roles in health and disease, but resolving tandem repeat variation accurately remains a challenging task. Traditional mapping-based and assembly-based approaches using short-read data are severely limited in the size and type of tandem repeats they can resolve. Recent third-generation sequencing technologies provide the long reads required to broaden the scope of detectable tandem repeats, but their substantially higher sequencing error rates complicate repeat resolution.
To this end, we developed TRiCoLOR, a freely available tool for tandem repeat profiling using error-prone long reads from third-generation sequencing technologies. The method can identify repetitive regions in long-read sequencing data de novo and resolve their motif and multiplicity in a haplotype-specific manner. The tool further includes methods to interactively visualize the identified repeats and to trace their Mendelian consistency in pedigrees. Tested on synthetic data harboring tandem repeat contractions and expansions, TRiCoLOR demonstrates excellent performance, with improved precision and recall compared to alternative tools. For real human whole-genome sequencing data, TRiCoLOR achieves high validation rates, suggesting its suitability for identifying tandem repeat variation in personal genomes. Compared to assembly-based approaches for structural variant detection, TRiCoLOR proves capable of resolving tandem repeats in difficult-to-assemble regions that are prone to mis-assemblies or incorrect repeat assignments.
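
To make the terms "motif" and "multiplicity" concrete, the following is a minimal toy sketch in Python. It is not TRiCoLOR's algorithm: it assumes an error-free sequence that consists entirely of exact copies of one motif, whereas real long reads contain sequencing errors and flanking sequence. The function name motif_and_multiplicity is hypothetical and used only for illustration.

```python
def motif_and_multiplicity(seq):
    """Return (motif, copies) for the shortest motif whose exact
    repetition reproduces seq.

    Toy illustration only: assumes an error-free, perfectly periodic
    sequence. A sequence with no shorter period is reported as a single
    copy of itself.
    """
    n = len(seq)
    # Try candidate motif lengths from shortest to longest.
    for k in range(1, n + 1):
        # The motif length must divide the sequence length evenly,
        # and repeating the prefix must rebuild the whole sequence.
        if n % k == 0 and seq[:k] * (n // k) == seq:
            return seq[:k], n // k
    raise ValueError("empty sequence")


if __name__ == "__main__":
    # A CAG repeat with three copies, then an "expanded" allele with five.
    print(motif_and_multiplicity("CAGCAGCAG"))
    print(motif_and_multiplicity("CAG" * 5))
```

In this simplified picture, a repeat expansion or contraction between two haplotypes shows up as the same motif with a different multiplicity.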