Set-Min sketch: a probabilistic map for power-law distributions with application to k-mer annotation
Abstract
k-mer counts are important features used by many bioinformatics pipelines. Existing k-mer counting methods focus on optimizing either time or memory usage, producing in output very large count tables explicitly representing k-mers together with their counts. Storing k-mers is not needed if the set of k-mers is known, making it possible to only keep counters and their association to k-mers. Solutions avoiding explicit representation of k-mers include Minimal Perfect Hash Functions (MPHFs) and Count-Min sketches. We introduce Set-Min sketch – a sketching technique for representing associative maps inspired from Count-Min – and apply it to the problem of representing k-mer count tables. Set-Min is provably more accurate than both Count-Min and Max-Min – an improved variant of Count-Min for static datasets that we define here. We show that Set-Min sketch provides a very low error rate, both in terms of the probability and the size of errors, at the expense of a very moderate memory increase. On the other hand, Set-Min sketches are shown to take up to an order of magnitude less space than MPHF-based solutions, for fully assembled genomes and large k. Space-efficiency of Set-Min in this case takes advantage of the power-law distribution of k-mer counts in genomic datasets.
Availability: https://github.com/yhhshb/fress
Origin | Files produced by the author(s) |
---|