Ragnar Groot Koerkamp (ETH Zürich)
26/06/2025 10:30 - 12:00
Emplacement: Aurigny Room
Given a lot of genomes, one common problem is to reconstruct their phylogeny. Unfortunately, pairwise comparisons of large genomes are slow. Instead, sketches are often used. The classic bottom sketch hashes all the k-mers in each genome, and then keeps the e.g. s=10000 hashes with the smallest values. Then, one can (roughly speaking) estimate the similarity between genomes by the fraction of shared hashes between their sketches.
In this talk I will present new algorithms for computing bottom (and other) sketches. First, this new method improves the running time from O(n + s log(s) log(n)) to O(n + s log s) in expectation, and, second, we provide a highly efficient implementation based on SIMD instructions.
A remaining bottleneck of the bottom sketch is that computing the distance between two sketches is slow. We’ll discuss some existing schemes that significantly improve the speed of genome comparisons.
Code: https://github.com/ragnargrootkoerkamp/simd-sketch
Blog post this talk is based on: https://curiouscoding.nl/posts/simd-sketch/