Conversation
…l kmer as an alternative to taking the min between the minimizer of the kmer and its rc: this schemes has a much higher density than the other but it simplifies the code
There was a problem hiding this comment.
Pull Request Overview
This PR refactors the sshash library to store positions of minimizers rather than super-kmers in indexes. The change improves canonical index efficiency by reducing memory usage and provides ~100ns faster random lookup times. Additionally, the PR introduces parallel computation of minimizer tuples using a producer-consumer model.
- Refactored index storage to use minimizer positions instead of super-kmer positions
- Improved performance with ~100ns faster lookups and smaller canonical indexes
- Added parallel minimizer tuple computation with producer-consumer threading model
Reviewed Changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| tools/sshash.cpp | Removed dump tool functionality and associated code |
| src/statistics.cpp | Updated statistics computation to work with new minimizer position-based storage |
| src/info.cpp | Updated space breakdown reporting to reflect new data structure naming |
| src/dump.cpp | Complete removal of dump functionality |
| src/dictionary.cpp | Refactored lookup methods to use minimizer_info instead of raw minimizer values |
| src/build.cpp | Updated build process to work with new minimizer position storage and parallel processing |
| include/util.hpp | Added minimizer_info struct and updated compute_minimizer to return position information |
| include/streaming_query.hpp | Updated streaming query to use new minimizer_iterator instead of minimizer_enumerator |
| include/skew_index.hpp | Renamed variables for consistency with new bucket size terminology |
| include/minimizer_iterator.hpp | New iterator implementation replacing minimizer_enumerator |
| include/minimizer_enumerator.hpp | Removed old enumerator implementation |
| include/dictionary.hpp | Updated method signatures to use minimizer_info |
| include/builder/util.hpp | Major refactoring of minimizer tuple handling and added thread-safe queue for parallel processing |
| include/builder/parse_file.hpp | Complete rewrite to use parallel producer-consumer model for minimizer computation |
| include/builder/build_sparse_index.hpp | Updated to work with new minimizer position-based data structures |
| include/builder/build_skew_index.hpp | Updated variable naming and logic for new bucket size handling |
| include/buckets_statistics.hpp | Updated statistics to use bucket sizes instead of super-kmer counts |
| include/buckets.hpp | Major refactoring of lookup methods to use minimizer positions and minimizer_info |
| benchmarks/README.md | Fixed typo in script filename |
| README.md | Updated documentation to reflect removal of dump tool |
| CMakeLists.txt | Removed dump.cpp from build sources |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Indexes now store positions of minimizers rather than of super-kmers. For canonical indexes, this results in slightly smaller indexes. Random lookup time is consistently ~100ns faster.
Parallel computation of minimizer tuples using a producer-consumer model.