Releases: jermp/sshash
Version 5.1.0
Version 5.0.0
This release contributes a major speed improvement for both random and streaming Lookup queries.
The optimizations are described here: https://www.biorxiv.org/content/10.64898/2026.01.21.700884v1.
Version 4.0.0
What's Changed
- General renaming of functions/methods and project structure (e.g., now main tools are in a folder named
tools). - Added CI workflows.
- Refactored CMakeLists.txt: added option to support larger kmer sizes, decide nucleotide encoding map, and gate
-march=nativefor better portability. - New
benchmarksfolder with performance tests and scripts to download datasets. New benchmarking scripts in folderscript. - Updated
external/pthash: now using partitioned MPHFs with 128-bit hashes by default (and with avg. partition size = 3M; see filesinclude/hash_utils.hppandinclude/constants.hpp), so that keys collide with negligible probability (i.e., never in practice). - Added logos.
- Removed class
bit_vector_iterator.hppandef_sequence.hpp(succinct bits utilities are delegated to the librarybits). - New classes
bitpack.hppandkmer.hppto handle arbitrary kmer lengths and data representations. - Minimizer tuples are sorted twice: first on minimizer value, then (after the MPHF has been built) on minimizer hashes so that the order in which their offsets are written in the data structure is known and the task can be parallelized. Also, this simplifies greatly the building of the sparse index (we do not need to build bucket pairs and merge them).
- Added a parallel sort algorithm (in
parallel_sort.hpp) to sort chunks of minimizer tuples in parallel. - Re-written the file parsing step. Now, we first bitpack the input (using a SIMD-based approach, see the function
pack2bits_shift1inutil.hpp), then loop through the encoded bitstream and compute minimizer tuples using multiple threads (minimizer tuples themselves are computed in streaming fashion using the folklore re-scan method; see the filesinclude/kmer_iterator.hppandinclude/minimizer_iterator.hpp). - Added support for Cuttlefish output files (
.cf_seg) as used in the Piscem index. - Warn the user is a the used value for
m(minimizer length) is belowceil(log_s(N))+1, wheresis the alphabet size (typically 4) andNis the total number of bases in the input SPSS. - For each distinct minimizer in the input we store the list of its occurrences, rather than the starting position of its corresponding super-kmer. This has some advantages (as explained next) and simplifies the description of the indexes. Advantages includes: a slightly improved space efficiency for canonical indexes (option
--canonical) because we have generally less minimizer occurrences than super-kmers; simplified lookup logic (ininclude/buckets.hpp, we do not use loops anymore but the lookup algorithm directly compares one kmer at a given offset); generally faster random lookups and streaming queries; much faster streaming queries for k=63. - Added semantic versioning number (now, 4.0.0) that is checked upon loading an index from disk.
- Removed
dictionary::dump()method (used to write super-kmers to a file). - Added the ability to start iteration at a given contig boundary.
- We use a simple approach to hash m-mers when computing a minimizer (just a multiplication and a xor) as coded in the
mixer_64class (seeinclude/hash_utils.hpp). - Avoid aborting when an "empty" partition is detected during the building of the skew index.
- Refactored streaming query logic: now faster and simpler, with much less code as well.
- Added a function to approximate the space of the indexes (under somer approximations) in
src/info.cpp. - Added a program
test/check.cppthat ensures the correctness of the indexes by testing the indexes' lookup queries against astd::unordered_map. - Added a
test/test_alphabet.cppto guarantee correctness of nuc encodings and kmer methods. - Added support for deployment on Bioconda.
PRs
- Dev by @jermp in #14
- Merge dev into master by @jermp in #17
- Merge dev into master by @jermp in #18
- fixed warning on linux by @jermp in #19
- bring dev up to date with master branch by @jermp in #23
- detect Apple in CMakeLists.txt by @jermp in #24
- Dev by @jermp in #26
- provide access to underlying strings by @rob-p in #30
- Submodule integration for piscem by @jermp in #38
- Include cstdint in gz by @adamant-pwn in #40
- Allow for ignoring reverse complements when checking neighbours by @hmusta in #43
- Partitioned phf by @jermp in #44
- Use universal visitors for const-ref saving by @adamant-pwn in #46
- Support larger alphabets and k via generic kmer_t by @adamant-pwn in #39
- make static function inline in header by @rob-p in #51
- Workflows by @jermp in #53
- status badge by @jermp in #54
- Merge
devintomasterbranch by @jermp in #55 - Cf seg by @jermp in #56
- Semver by @jermp in #58
- Stream minimizers by @jermp in #59
- Stream minimizers by @jermp in #60
- Dev by @jermp in #61
- minor cleanup by @jermp in #62
- Single streaming query by @jermp in #63
- Dev by @jermp in #64
- make query_report ostream operator inline by @rob-p in #65
- Dev by @jermp in #66
- removed useless computation from streaming_query by @jermp in #67
- Dev by @jermp in #69
- be more conservative with memory; simpler and less code by @jermp in #70
- benchmarks updated: faster streaming queries for k=63 by @jermp in #71
- Dev by @jermp in #72
- Dev by @jermp in #73
- Dev by @jermp in #74
- Dev by @jermp in #75
- Dev by @jermp in #76
- Dev by @jermp in #78
- Dev by @jermp in #79
New Contributors
- @rob-p made their first contribution in #30
- @adamant-pwn made their first contribution in #40
- @hmusta made their first contribution in #43
Full Changelog: v3.0.0...v4.0.0
Version 3.0.0
This release of the library features a restructured public API for the dictionary and its supported queries.
- "Advanced" lookup queries now include, besides the usual absolute
kmer_id: contig information (contig_idandcontig_sizeof the contig where the k-mer lies in), the relative (within the contig) identifier of the k-mer (namedkmer_id_in_contig), and the orientation of the k-mer in the contig. For any positive query,0 <= kmer_id_in_contig < contig_sizeholds true. - Streaming queries are now general, not just streaming membership queries as they were in the previous releases, and return advanced lookup information by default.
- Support for Navigational queries has been added. Given a k-mer g[1..k], a navigational query determines if g[2..k]+x is present (forward neighbourhood) and if x+g[1..k-1] is present (backward neighbourhood) in the dictionary, for x = A, C, G, T ('+' here means string concatenation).
If a contig identifier is specified for a navigational query (rather than a k-mer), then the backward neighbourhood of the first k-mer and the forward neighbourhood of the last k-mer in the contig are returned.
Version 2.1.0
With this release the dictionary construction uses external memory to save RAM usage.
Version 2.0.0
No major changes compared to previous version (rather than renaming of variables for consistency with papers), but we removed a (useless) serialised 4-byte integer from skew_index and so previous index binary files are not compatible with this library release.
Version 1.2.0
This release adds a new tool called permute that re-orders (and possibly reverse-complement) the strings in an input (weighted) collection to minimize the number of runs in the abundances and, hence, optimize the encoding of the abundances.
The abundances are encoded in O(r) space on top of the space for a SSHash dictionary, where r is the number of runs (i.e., maximal substrings formed by a single abundance value) in the abundances.
The i-th abundance in the sequence, corresponding to the k-mer of identifier i, is retrieved in O(log r) time.
Version 1.1.0
This release adds a new feature: compressed abundances.
The SSHash dictionary now can also store the abundances in highly compressed space.
Version 1.0.0
First release.