Dev#69

Merged: jermp merged 21 commits into master from dev on Aug 22, 2025
jermp (Owner) commented on Aug 21, 2025

Indexes now store positions of minimizers rather than of super-kmers. For canonical indexes, this results in slightly smaller indexes. Random lookup time is consistently ~100ns faster.
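To make the distinction concrete, here is a minimal sketch of what "the position of a minimizer" means for a single k-mer. It is not sshash code: the function name `minimizer_position_of_kmer` is hypothetical, and m-mers are compared lexicographically instead of by hash so that the result is easy to check by hand (a real index would typically order m-mers by a hash value).

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

// Hypothetical helper, not part of sshash: returns the absolute position
// (offset into `sequence`) of the minimizer of the k-mer that starts at
// `kmer_begin`. Assumption: the minimizer is the lexicographically smallest
// m-mer of the k-mer, leftmost on ties.
inline uint64_t minimizer_position_of_kmer(std::string_view sequence,
                                           uint64_t kmer_begin,
                                           uint64_t k, uint64_t m) {
    assert(m >= 1 && m <= k);
    assert(kmer_begin + k <= sequence.size());
    uint64_t best = kmer_begin;
    for (uint64_t i = 1; i + m <= k; ++i) {
        uint64_t candidate = kmer_begin + i;
        // Strict comparison keeps the leftmost m-mer when two are equal.
        if (sequence.substr(candidate, m) < sequence.substr(best, m)) {
            best = candidate;
        }
    }
    return best;
}
```

With the sequence GGTACGGTT, k = 5 and m = 2, the k-mers starting at offsets 0 through 3 all share the m-mer AC at position 3, so they form one super-kmer that starts at offset 0 while its minimizer sits at offset 3. The k-mer at offset 4 no longer contains that AC and gets a different minimizer.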

Minimizer tuples are now computed in parallel using a producer-consumer model.
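As a rough illustration of the producer-consumer pattern, the sketch below has the calling thread push sequences into a small blocking queue while worker threads pop them and do some per-sequence work. All names (`blocking_queue`, `count_kmers_parallel`) are hypothetical, and counting k-mers stands in for the actual minimizer tuple computation, whose details are not shown in this PR description. It assumes C++17.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <optional>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical thread-safe queue: pop() blocks until an item is available
// or the queue has been closed and drained.
template <typename T>
class blocking_queue {
public:
    void push(T item) {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_items.push(std::move(item));
        }
        m_ready.notify_one();
    }

    // Signals that no more items will be pushed.
    void close() {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_closed = true;
        }
        m_ready.notify_all();
    }

    // Returns std::nullopt once the queue is closed and empty.
    std::optional<T> pop() {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_ready.wait(lock, [this] { return !m_items.empty() || m_closed; });
        if (m_items.empty()) return std::nullopt;
        T item = std::move(m_items.front());
        m_items.pop();
        return item;
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_ready;
    std::queue<T> m_items;
    bool m_closed = false;
};

// The caller acts as the producer; `num_consumers` worker threads consume.
// Placeholder work: count the k-mers of each sequence.
inline uint64_t count_kmers_parallel(std::vector<std::string> const& sequences,
                                     uint64_t k, unsigned num_consumers) {
    assert(num_consumers >= 1);
    blocking_queue<std::string> queue;
    std::vector<uint64_t> partial(num_consumers, 0);  // one slot per consumer
    std::vector<std::thread> consumers;
    for (unsigned c = 0; c != num_consumers; ++c) {
        consumers.emplace_back([&queue, &partial, c, k] {
            while (auto seq = queue.pop()) {
                if (seq->size() >= k) partial[c] += seq->size() - k + 1;
            }
        });
    }
    for (auto const& s : sequences) queue.push(s);  // produce
    queue.close();                                  // let consumers finish
    for (auto& t : consumers) t.join();
    uint64_t total = 0;
    for (uint64_t x : partial) total += x;
    return total;
}
```

Each consumer writes only to its own slot of `partial`, so the only shared mutable state is the queue itself, which is guarded by the mutex.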

jermp requested a review from Copilot on Aug 21, 2025, 18:27

jermp requested a review from Copilot on Aug 22, 2025, 12:37

Copilot AI left a comment

Pull Request Overview

This PR refactors the sshash library so that indexes store the positions of minimizers rather than the positions of super-kmers. For canonical indexes this yields slightly smaller indexes, and random lookups become roughly 100 ns faster. The PR also introduces parallel computation of minimizer tuples using a producer-consumer model.

  • Refactored index storage to use minimizer positions instead of super-kmer positions
  • Improved performance with ~100ns faster lookups and smaller canonical indexes
  • Added parallel minimizer tuple computation with producer-consumer threading model

Reviewed Changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| tools/sshash.cpp | Removed dump tool functionality and associated code |
| src/statistics.cpp | Updated statistics computation to work with new minimizer position-based storage |
| src/info.cpp | Updated space breakdown reporting to reflect new data structure naming |
| src/dump.cpp | Complete removal of dump functionality |
| src/dictionary.cpp | Refactored lookup methods to use minimizer_info instead of raw minimizer values |
| src/build.cpp | Updated build process to work with new minimizer position storage and parallel processing |
| include/util.hpp | Added minimizer_info struct and updated compute_minimizer to return position information |
| include/streaming_query.hpp | Updated streaming query to use new minimizer_iterator instead of minimizer_enumerator |
| include/skew_index.hpp | Renamed variables for consistency with new bucket size terminology |
| include/minimizer_iterator.hpp | New iterator implementation replacing minimizer_enumerator |
| include/minimizer_enumerator.hpp | Removed old enumerator implementation |
| include/dictionary.hpp | Updated method signatures to use minimizer_info |
| include/builder/util.hpp | Major refactoring of minimizer tuple handling and added thread-safe queue for parallel processing |
| include/builder/parse_file.hpp | Complete rewrite to use parallel producer-consumer model for minimizer computation |
| include/builder/build_sparse_index.hpp | Updated to work with new minimizer position-based data structures |
| include/builder/build_skew_index.hpp | Updated variable naming and logic for new bucket size handling |
| include/buckets_statistics.hpp | Updated statistics to use bucket sizes instead of super-kmer counts |
| include/buckets.hpp | Major refactoring of lookup methods to use minimizer positions and minimizer_info |
| benchmarks/README.md | Fixed typo in script filename |
| README.md | Updated documentation to reflect removal of dump tool |
| CMakeLists.txt | Removed dump.cpp from build sources |
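The summary mentions an iterator that yields minimizers for streaming queries. The sketch below shows one common way to compute the minimizer position of every consecutive k-mer in a single pass, using a monotone deque over m-mer positions. It is only an illustration of the general sliding-window technique, not the contents of include/minimizer_iterator.hpp: the function name `sliding_minimizer_positions` is hypothetical, and m-mers are ordered lexicographically (leftmost on ties) rather than by hash, purely to keep the example deterministic.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <string_view>
#include <vector>

// Hypothetical helper, not part of sshash: for each k-mer of `sequence`
// (in order), returns the absolute position of its lexicographically
// smallest m-mer, leftmost on ties.
inline std::vector<uint64_t> sliding_minimizer_positions(
    std::string_view sequence, uint64_t k, uint64_t m) {
    assert(m >= 1 && m <= k);
    std::vector<uint64_t> result;
    if (sequence.size() < k) return result;
    uint64_t const w = k - m + 1;  // number of m-mers in one k-mer
    // Start positions of candidate m-mers; positions increase and the
    // corresponding m-mers are non-decreasing from front to back.
    std::deque<uint64_t> candidates;
    for (uint64_t p = 0; p + m <= sequence.size(); ++p) {
        // Discard candidates that the new m-mer strictly beats:
        // they can never be the minimum of a later window.
        while (!candidates.empty() &&
               sequence.substr(p, m) < sequence.substr(candidates.back(), m)) {
            candidates.pop_back();
        }
        candidates.push_back(p);
        // Discard the front if it fell out of the window [p - w + 1, p].
        if (candidates.front() + w <= p) candidates.pop_front();
        // Once a full window of w m-mers has been seen, emit its minimum.
        if (p + 1 >= w) result.push_back(candidates.front());
    }
    return result;
}
```

Each m-mer position enters and leaves the deque at most once, so the number of deque operations is linear in the sequence length. Consecutive equal entries in the output correspond to k-mers that share a minimizer, that is, to one super-kmer.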

jermp merged commit ee50f75 into master on Aug 22, 2025
16 checks passed