Releases: jermp/sshash

Version 5.1.0

11 Mar 17:50

  • Enable memory mapping an index from disk to RAM. The subtools bench, check, and query now have an --mmap option to do this.
  • Minor bug fixes.
  • Consistent whitespace formatting.

Version 5.0.0

23 Jan 15:00

This release brings a major speed improvement for both random and streaming lookup queries.
The optimizations are described here: https://www.biorxiv.org/content/10.64898/2026.01.21.700884v1.

Version 4.0.0

02 Sep 08:48
54edfa0

What's Changed

  • General renaming of functions/methods and restructuring of the project (e.g., the main tools are now in a folder named tools).
  • Added CI workflows.
  • Refactored CMakeLists.txt: added options to support larger k-mer sizes, select the nucleotide encoding map, and gate -march=native for better portability.
  • New benchmarks folder with performance tests and scripts to download datasets; new benchmarking scripts in the script folder.
  • Updated external/pthash: now using partitioned MPHFs with 128-bit hashes by default (and with avg. partition size = 3M; see files include/hash_utils.hpp and include/constants.hpp), so that keys collide with negligible probability (i.e., never in practice).
  • Added logos.
  • Removed class bit_vector_iterator.hpp and ef_sequence.hpp (succinct bits utilities are delegated to the library bits).
  • New classes bitpack.hpp and kmer.hpp to handle arbitrary kmer lengths and data representations.
  • Minimizer tuples are sorted twice: first on minimizer value, then (after the MPHF has been built) on minimizer hashes, so that the order in which their offsets are written in the data structure is known and the task can be parallelized. This also greatly simplifies the building of the sparse index (we no longer need to build bucket pairs and merge them).
  • Added a parallel sort algorithm (in parallel_sort.hpp) to sort chunks of minimizer tuples in parallel.
  • Rewrote the file parsing step. Now, we first bitpack the input (using a SIMD-based approach; see the function pack2bits_shift1 in util.hpp), then loop through the encoded bitstream and compute minimizer tuples using multiple threads (minimizer tuples themselves are computed in streaming fashion using the folklore re-scan method; see the files include/kmer_iterator.hpp and include/minimizer_iterator.hpp).
  • Added support for Cuttlefish output files (.cf_seg) as used in the Piscem index.
  • Warn the user if the value used for m (minimizer length) is below ceil(log_s(N))+1, where s is the alphabet size (typically 4) and N is the total number of bases in the input SPSS.
  • For each distinct minimizer in the input, we store the list of its occurrences rather than the starting position of its corresponding super-kmer. This simplifies the description of the indexes and has several advantages: slightly improved space efficiency for canonical indexes (option --canonical), because there are generally fewer minimizer occurrences than super-kmers; simplified lookup logic (in include/buckets.hpp we no longer use loops: the lookup algorithm directly compares one k-mer at a given offset); generally faster random lookups and streaming queries; much faster streaming queries for k=63.
  • Added semantic versioning number (now, 4.0.0) that is checked upon loading an index from disk.
  • Removed dictionary::dump() method (used to write super-kmers to a file).
  • Added the ability to start iteration at a given contig boundary.
  • We use a simple approach to hash m-mers when computing a minimizer (just a multiplication and a xor) as coded in the mixer_64 class (see include/hash_utils.hpp).
  • Avoid aborting when an "empty" partition is detected during the building of the skew index.
  • Refactored streaming query logic: now faster and simpler, with much less code as well.
  • Added a function to estimate the space of the indexes (under some simplifying assumptions) in src/info.cpp.
  • Added a program test/check.cpp that ensures the correctness of the indexes by testing their lookup queries against a std::unordered_map.
  • Added test/test_alphabet.cpp to guarantee the correctness of nucleotide encodings and k-mer methods.
  • Added support for deployment on Bioconda.
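
The 2-bit packing step mentioned above can be sketched in scalar form. The real pack2bits_shift1 in util.hpp is SIMD-based, and the encoding map is configurable at build time, so this standalone version only illustrates the idea; the `(c >> 1) & 3` map below is a common convention, not necessarily the one SSHash uses.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Scalar sketch of 2-bit nucleotide packing into 64-bit words.
// (c >> 1) & 3 maps the ASCII codes of A, C, G, T to the distinct
// 2-bit codes A->0, C->1, T->2, G->3.
std::vector<uint64_t> pack2bits(const std::string& seq) {
    std::vector<uint64_t> words((seq.size() + 31) / 32, 0);
    for (size_t i = 0; i < seq.size(); ++i) {
        uint64_t code = (static_cast<uint64_t>(seq[i]) >> 1) & 3;
        words[i / 32] |= code << (2 * (i % 32));  // 32 bases per word
    }
    return words;
}
```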
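
The warning threshold on m can be computed as follows; the helper name is illustrative, not a function from the library. The intuition is that with m below ceil(log_s(N))+1 some minimizer is expected to repeat very often, making buckets too heavy.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper: minimum recommended minimizer length for an input
// of num_bases total bases over an alphabet of size sigma (typically 4),
// i.e., ceil(log_sigma(num_bases)) + 1.
uint64_t min_minimizer_length(uint64_t num_bases, uint64_t sigma = 4) {
    double log_sigma_n = std::log2(static_cast<double>(num_bases)) /
                         std::log2(static_cast<double>(sigma));
    return static_cast<uint64_t>(std::ceil(log_sigma_n)) + 1;
}
```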
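
A multiply-and-xor m-mer mixer in the spirit of the mixer_64 class can be sketched as below; the multiplier and shift amount are illustrative assumptions (borrowed from MurmurHash3's finalizer), not necessarily the constants used in include/hash_utils.hpp.

```cpp
#include <cstdint>

// Sketch of a one-multiplication, one-xor mixer for m-mers.
// Multiplying by an odd constant and xor-shifting are both bijections
// on 64-bit words, so distinct inputs map to distinct outputs.
uint64_t mix64(uint64_t x) {
    x *= 0xff51afd7ed558ccdULL;  // odd multiplier (assumed constant)
    x ^= x >> 33;                // fold high bits into low bits
    return x;
}
```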

Full Changelog: v3.0.0...v4.0.0

Version 3.0.0

01 Jul 14:52
61061b8

This release of the library features a restructured public API for the dictionary and its supported queries.

  • "Advanced" lookup queries now include, besides the usual absolute kmer_id: contig information (the contig_id and contig_size of the contig where the k-mer lies), the relative (within the contig) identifier of the k-mer (named kmer_id_in_contig), and the orientation of the k-mer in the contig. For any positive query, 0 <= kmer_id_in_contig < contig_size holds.
  • Streaming queries are now general, not just streaming membership queries as they were in the previous releases, and return advanced lookup information by default.
  • Support for Navigational queries has been added. Given a k-mer g[1..k], a navigational query determines if g[2..k]+x is present (forward neighbourhood) and if x+g[1..k-1] is present (backward neighbourhood) in the dictionary, for x = A, C, G, T ('+' here means string concatenation).
    If a contig identifier is specified for a navigational query (rather than a k-mer), then the backward neighbourhood of the first k-mer and the forward neighbourhood of the last k-mer in the contig are returned.
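
The neighbour k-mers probed by a navigational query can be sketched with plain strings (the library works on packed k-mers and reports presence flags through its own API; this shows only the string-level construction of g[2..k]+x and x+g[1..k-1]):

```cpp
#include <string>
#include <vector>

// Candidate forward neighbours of k-mer g: drop g[1], append x,
// for x in {A, C, G, T}.
std::vector<std::string> forward_neighbours(const std::string& g) {
    std::vector<std::string> out;
    for (char x : {'A', 'C', 'G', 'T'}) out.push_back(g.substr(1) + x);
    return out;
}

// Candidate backward neighbours of k-mer g: prepend x, drop g[k].
std::vector<std::string> backward_neighbours(const std::string& g) {
    std::vector<std::string> out;
    for (char x : {'A', 'C', 'G', 'T'})
        out.push_back(x + g.substr(0, g.size() - 1));
    return out;
}
```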

Version 2.1.0

25 May 20:56
aeb2411

With this release, the dictionary construction uses external memory to reduce RAM usage.

Version 2.0.0

24 May 18:46

No major changes compared to the previous version (other than renaming variables for consistency with the papers), but we removed a (useless) serialized 4-byte integer from skew_index, so index binary files from previous releases are not compatible with this release.

Version 1.2.0

02 Apr 14:03
562ff63

This release adds a new tool called permute that re-orders (and possibly reverse-complements) the strings in an input (weighted) collection to minimize the number of runs in the abundances and, hence, optimize their encoding.

The abundances are encoded in O(r) space on top of the space for a SSHash dictionary, where r is the number of runs (i.e., maximal substrings formed by a single abundance value) in the abundances.
The i-th abundance in the sequence, corresponding to the k-mer of identifier i, is retrieved in O(log r) time.
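
A minimal sketch of the O(r)-space / O(log r)-time scheme, assuming a plain prefix-sum plus binary-search layout (SSHash's actual run encoding is more refined, e.g., it also compresses the run values themselves):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Run-length encoded abundances: one (value, exclusive end position)
// pair per run, so space is O(r) and lookup is O(log r) via binary
// search on the prefix sums of run lengths.
struct rle_abundances {
    std::vector<uint64_t> ends;    // exclusive end position of each run
    std::vector<uint32_t> values;  // abundance value of each run

    void push_run(uint64_t length, uint32_t value) {
        uint64_t prev = ends.empty() ? 0 : ends.back();
        ends.push_back(prev + length);
        values.push_back(value);
    }

    // Abundance of the k-mer with identifier i (i must be a valid id).
    uint32_t at(uint64_t i) const {
        size_t run = std::upper_bound(ends.begin(), ends.end(), i) - ends.begin();
        return values[run];
    }
};
```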

Version 1.1.0

01 Mar 17:00

This release adds a new feature: compressed abundances.
The SSHash dictionary can now also store the abundances in highly compressed space.

Version 1.0.0

01 Mar 16:42

First release.