Skip to content

jermp/kmer_sets_benchmark

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

68 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

DOI

Kmer sets benchmark

This is a benchmark of data structures for kmer dictionaries, i.e., data structures that represent a set of kmers and support, at least, exact membership queries. The goal of the benchmark is to test the data structures using the same benchmarking code and methodology (as well as, obviously, the same datasets and queries).

The scripts to run the benchmark are in the script folder.

Clone and compile as follows.

git clone --recursive https://github.com/jermp/kmer_sets_benchmark.git
cd kmer_sets_benchmark/
mkdir build
cd build/
cmake .. -DUSE_MAX_KMER_LENGTH_63=Off
make -j

Optionally, compile with cmake .. -DUSE_MAX_KMER_LENGTH_63=On to use k=63 in the benchmark.

Tested dictionaries

The dictionaries benchmarked here are:

All C++ implementations are by the respective authors.

Datasets

For these benchmarks we used the datasets available here https://zenodo.org/records/17582116: the files *.eulertigs.fa.gz were used as input, the *.fastq.gz files were used as queries instead.

Collection Num. distinct 31-mers Num. distinct 63-mers
Cod 502,465,200 556,585,658
Kestrel 1,150,399,205 1,155,250,667
Human 2,505,678,680 2,771,316,093
NCBI-virus 376,205,185 412,515,880
SE 894,310,084 1,524,904,156
HPRC 3,718,120,949 5,926,785,469

Methodology

The dictionaries were built with a max RAM usage of 16 GB and 64 threads. All queries were run using 1 thread, instead.

For SSHash and SBWT, the building time reported in the tables refers to the time it takes to index the eulertigs.fa.gz files. FMSI first requires the computation of masked super strings from the eulertigs.fa.gz files. We excluded this time from the building time and report only the time FMSI takes to index its computed super strings.

The space reported is the space taken by the dictionaries on disk.

Positive random lookup time was measured by querying 1 million kmers that appear in the dictionaries, half of which were reverse complemented to test the dictionaries in the most general case.

For negative random lookups, random kmers were generated (i.e., each nucleotide was uniformly sampled from {A,C,G,T}) and used as queries instead.

For random access, we uniformly generated 1 million ranks and retrieved the corresponding kmers.

Lastly, for streaming queries, we queried the dictionaries using FASTQ reads. For each dictionary, a FASTQ readset was chosen to have a high-hit workload (i.e., most kmers are found in the dictionaries). See the folder script for details.

Results

These are the results obtained on Nov 2025 (see logs here) on a machine equipped with an AMD Ryzen Threadripper PRO 7985WX processor clocked at 5.40GHz. The code was compiled with gcc 13.3.0.

SSHash indexes reported here were built with option --canonical, using the indicated value for the m parameter (minimizer length). All results are available here https://github.com/jermp/sshash/tree/bench/benchmarks.

k Collection m Space (bits/kmer) Space (total GB) Building time (m:ss) Positive random lookup (µs/kmer) Negative random lookup (µs/kmer) Random Access (µs/kmer) Streaming Lookup high-hit (ns/kmer)
31 Cod 20 9.01 0.57 0:26 0.44 0.37 0.28 26
Kestrel 20 8.67 1.25 1:06 0.44 0.40 0.28 46
Human 21 10.01 3.14 3:10 0.61 0.42 0.35 74
NCBI-virus 19 8.48 0.40 0:16 0.41 0.36 0.26 29
SE 21 11.51 1.29 1:06 0.63 0.40 0.36 186
HPRC 21 11.93 5.54 4:45 0.71 0.46 0.54 93
63 Cod 24 4.9 0.35 0:15 0.56 0.45 0.29 60
Kestrel 24 4.22 0.61 0:19 0.54 0.48 0.33 66
Human 25 5.31 1.84 1:09 0.69 0.52 0.36 146
NCBI-virus 23 4.46 0.23 0:07 0.52 0.44 0.28 72
SE 31 7.77 1.48 0:58 1.00 0.51 0.41 400
HPRC 31 8.14 6.03 4:13 1.00 0.58 0.64 181

Tab. 1 SSHash results

SBWT indexes were all built using the "plain-matrix" variant, with option --add-reverse-complements so that queries return the same results as for the other indexes. The indexes make use of the LCP array to speed up streaming queries.

k Collection Space (bits/kmer) Space (total GB) Building time (m:ss) Positive random Lookup (µs/kmer) Negative random Lookup (µs/kmer) Random Access (µs/kmer) Streaming Lookup high-hit (ns/kmer)
31 Cod 10.52 0.66 03:34 2.72 0.91 7.71 62
Kestrel 10.52 1.51 07:57 2.87 0.96 9.48 287
Human 10.50 3.29 17:56 2.97 1.07 10.81 266
NCBI-virus 10.53 0.50 02:44 2.71 0.89 6.96 139
SE 10.72 1.20 06:54 2.83 0.97 8.97 189
HPRC 10.50 4.88 29:24 3.12 1.16 11.45 263
63 Cod 10.52 0.73 06:13 6.59 0.92 16.05 118
Kestrel 10.55 1.52 14:38 6.91 0.97 19.87 435
Human 10.50 3.64 55:38 7.23 1.09 22.73 768
NCBI-virus 10.55 0.54 04:42 6.63 0.90 14.93 187
SE 10.93 2.08 21:37 6.95 1.00 20.87 290
HPRC 10.50 7.78 180:02 8.07 1.24 25.20 835

Tab. 2 SBWT results

FMSI indexes make use of the LCP array to speed up streaming queries.

k Collection Space (bits/kmer) Space (total GB) Building time (m:ss) Positive random Lookup (µs/kmer) Negative random Lookup (µs/kmer) Random Access (µs/kmer) Streaming Lookup high-hit (ns/kmer)
31 Cod 3.37 0.21 02:12 5.70 1.70 14.84 275
Kestrel 3.16 0.45 05:18 6.20 1.90 17.83 983
Human 3.31 1.04 14:33 6.60 2.16 18.62 1176
NCBI-virus 3.50 0.16 01:39 5.51 1.69 13.61 736
SE 4.39 0.49 05:08 6.50 2.04 17.80 1018
HPRC 4.26 1.98 27:47 6.99 2.37 18.04 1370
63 Cod 3.33 0.23 02:38 12.66 1.77 31.18 375
Kestrel 3.17 0.46 05:41 13.74 1.91 37.45 1143
Human 3.22 1.11 17:35 14.64 2.16 40.06 1642
NCBI-virus 3.52 0.18 02:09 12.34 1.74 29.98 870
SE 4.82 0.92 12:29 15.09 2.20 39.65 1419
HPRC 4.95 3.67 61:41 15.99 2.48 37.09 2078

Tab. 3 FMSI results

About

Benchmark data structures for kmer sets.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors