Set-Min sketch: a probabilistic map for power-law distributions with application to k-mer annotation - Models and Algorithms
Article Dans Une Revue Journal of Computational Biology Année : 2022

Set-Min sketch: a probabilistic map for power-law distributions with application to k-mer annotation

Résumé

k-mer counts are important features used by many bioinformatics pipelines. Existing k-mer counting methods focus on optimizing either time or memory usage, producing in output very large count tables explicitly representing k-mers together with their counts. Storing k-mers is not needed if the set of k-mers is known, making it possible to only keep counters and their association to k-mers. Solutions avoiding explicit representation of k-mers include Minimal Perfect Hash Functions (MPHFs) and Count-Min sketches. We introduce Set-Min sketch – a sketching technique for representing associative maps inspired from Count-Min – and apply it to the problem of representing k-mer count tables. Set-Min is provably more accurate than both Count-Min and Max-Min – an improved variant of Count-Min for static datasets that we define here. We show that Set-Min sketch provides a very low error rate, both in terms of the probability and the size of errors, at the expense of a very moderate memory increase. On the other hand, Set-Min sketches are shown to take up to an order of magnitude less space than MPHF-based solutions, for fully assembled genomes and large k. Space-efficiency of Set-Min in this case takes advantage of the power-law distribution of k-mer counts in genomic datasets. Availability: https://github.com/yhhshb/fress
Fichier principal
Vignette du fichier
jcb_version.pdf (2 Mo) Télécharger le fichier
Origine Fichiers produits par l'(les) auteur(s)

Dates et versions

hal-04431772 , version 1 (01-02-2024)

Identifiants

Citer

Yoshihiro Shibuya, Djamal Belazzougui, Gregory Kucherov. Set-Min sketch: a probabilistic map for power-law distributions with application to k-mer annotation. Journal of Computational Biology, 2022, 29 (2), pp.140-154. ⟨10.1089/cmb.2021.0429⟩. ⟨hal-04431772⟩
42 Consultations
35 Téléchargements

Altmetric

Partager

More