Releases · jermp/sshash

Enable memory mapping an index from disk to RAM. Subtools bench, check, and query have now an option --mmap to do this.
Minor bug fixing.
Consistent whitespacing.

This release contributes a major speed improvement for both random and streaming Lookup queries.
The optimizations are described here: https://www.biorxiv.org/content/10.64898/2026.01.21.700884v1.

@jermp

What's Changed

General renaming of functions/methods and project structure (e.g., now main tools are in a folder named tools).
Added CI workflows.
Refactored CMakeLists.txt: added option to support larger kmer sizes, decide nucleotide encoding map, and gate -march=native for better portability.
New benchmarks folder with performance tests and scripts to download datasets. New benchmarking scripts in folder script.
Updated external/pthash: now using partitioned MPHFs with 128-bit hashes by default (and with avg. partition size = 3M; see files include/hash_utils.hpp and include/constants.hpp), so that keys collide with negligible probability (i.e., never in practice).
Added logos.
Removed class bit_vector_iterator.hpp and ef_sequence.hpp (succinct bits utilities are delegated to the library bits).
New classes bitpack.hpp and kmer.hpp to handle arbitrary kmer lengths and data representations.
Minimizer tuples are sorted twice: first on minimizer value, then (after the MPHF has been built) on minimizer hashes so that the order in which their offsets are written in the data structure is known and the task can be parallelized. Also, this simplifies greatly the building of the sparse index (we do not need to build bucket pairs and merge them).
Added a parallel sort algorithm (in parallel_sort.hpp) to sort chunks of minimizer tuples in parallel.
Re-written the file parsing step. Now, we first bitpack the input (using a SIMD-based approach, see the function pack2bits_shift1 in util.hpp), then loop through the encoded bitstream and compute minimizer tuples using multiple threads (minimizer tuples themselves are computed in streaming fashion using the folklore re-scan method; see the files include/kmer_iterator.hpp and include/minimizer_iterator.hpp).
Added support for Cuttlefish output files (.cf_seg) as used in the Piscem index.
Warn the user is a the used value for m (minimizer length) is below ceil(log_s(N))+1, where s is the alphabet size (typically 4) and N is the total number of bases in the input SPSS.
For each distinct minimizer in the input we store the list of its occurrences, rather than the starting position of its corresponding super-kmer. This has some advantages (as explained next) and simplifies the description of the indexes. Advantages includes: a slightly improved space efficiency for canonical indexes (option --canonical) because we have generally less minimizer occurrences than super-kmers; simplified lookup logic (in include/buckets.hpp, we do not use loops anymore but the lookup algorithm directly compares one kmer at a given offset); generally faster random lookups and streaming queries; much faster streaming queries for k=63.
Added semantic versioning number (now, 4.0.0) that is checked upon loading an index from disk.
Removed dictionary::dump() method (used to write super-kmers to a file).
Added the ability to start iteration at a given contig boundary.
We use a simple approach to hash m-mers when computing a minimizer (just a multiplication and a xor) as coded in the mixer_64 class (see include/hash_utils.hpp).
Avoid aborting when an "empty" partition is detected during the building of the skew index.
Refactored streaming query logic: now faster and simpler, with much less code as well.
Added a function to approximate the space of the indexes (under somer approximations) in src/info.cpp.
Added a program test/check.cpp that ensures the correctness of the indexes by testing the indexes' lookup queries against a std::unordered_map.
Added a test/test_alphabet.cpp to guarantee correctness of nuc encodings and kmer methods.
Added support for deployment on Bioconda.

PRs

Dev by @jermp in #14
Merge dev into master by @jermp in #17
Merge dev into master by @jermp in #18
fixed warning on linux by @jermp in #19
bring dev up to date with master branch by @jermp in #23
detect Apple in CMakeLists.txt by @jermp in #24
Dev by @jermp in #26
provide access to underlying strings by @rob-p in #30
Submodule integration for piscem by @jermp in #38
Include cstdint in gz by @adamant-pwn in #40
Allow for ignoring reverse complements when checking neighbours by @hmusta in #43
Partitioned phf by @jermp in #44
Use universal visitors for const-ref saving by @adamant-pwn in #46
Support larger alphabets and k via generic kmer_t by @adamant-pwn in #39
make static function inline in header by @rob-p in #51
Workflows by @jermp in #53
status badge by @jermp in #54
Merge dev into master branch by @jermp in #55
Cf seg by @jermp in #56
Semver by @jermp in #58
Stream minimizers by @jermp in #59
Stream minimizers by @jermp in #60
Dev by @jermp in #61
minor cleanup by @jermp in #62
Single streaming query by @jermp in #63
Dev by @jermp in #64
make query_report ostream operator inline by @rob-p in #65
Dev by @jermp in #66
removed useless computation from streaming_query by @jermp in #67
Dev by @jermp in #69
be more conservative with memory; simpler and less code by @jermp in #70
benchmarks updated: faster streaming queries for k=63 by @jermp in #71
Dev by @jermp in #72
Dev by @jermp in #73
Dev by @jermp in #74
Dev by @jermp in #75
Dev by @jermp in #76
Dev by @jermp in #78
Dev by @jermp in #79

New Contributors

@rob-p made their first contribution in #30
@adamant-pwn made their first contribution in #40
@hmusta made their first contribution in #43

Full Changelog: v3.0.0...v4.0.0

This release of the library features a restructured public API for the dictionary and its supported queries.

"Advanced" lookup queries now include, besides the usual absolute kmer_id: contig information (contig_id and contig_size of the contig where the k-mer lies in), the relative (within the contig) identifier of the k-mer (named kmer_id_in_contig), and the orientation of the k-mer in the contig. For any positive query, 0 <= kmer_id_in_contig < contig_size holds true.
Streaming queries are now general, not just streaming membership queries as they were in the previous releases, and return advanced lookup information by default.
Support for Navigational queries has been added. Given a k-mer g[1..k], a navigational query determines if g[2..k]+x is present (forward neighbourhood) and if x+g[1..k-1] is present (backward neighbourhood) in the dictionary, for x = A, C, G, T ('+' here means string concatenation).
If a contig identifier is specified for a navigational query (rather than a k-mer), then the backward neighbourhood of the first k-mer and the forward neighbourhood of the last k-mer in the contig are returned.

With this release the dictionary construction uses external memory to save RAM usage.

No major changes compared to previous version (rather than renaming of variables for consistency with papers), but we removed a (useless) serialised 4-byte integer from skew_index and so previous index binary files are not compatible with this library release.

This release adds a new tool called permute that re-orders (and possibly reverse-complement) the strings in an input (weighted) collection to minimize the number of runs in the abundances and, hence, optimize the encoding of the abundances.

The abundances are encoded in O(r) space on top of the space for a SSHash dictionary, where r is the number of runs (i.e., maximal substrings formed by a single abundance value) in the abundances.
The i-th abundance in the sequence, corresponding to the k-mer of identifier i, is retrieved in O(log r) time.

This release adds a new feature: compressed abundances.
The SSHash dictionary now can also store the abundances in highly compressed space.

First release.

Releases: jermp/sshash

Version 5.1.0

Uh oh!

Version 5.0.0

Uh oh!

Version 4.0.0

What's Changed

PRs

New Contributors

Contributors

Uh oh!

Version 3.0.0

Uh oh!

Version 2.1.0

Uh oh!

Version 2.0.0

Uh oh!

Version 1.2.0

Uh oh!

Version 1.1.0

Uh oh!

Version 1.0.0

Uh oh!