Piscem-cpp by NPSDC · Pull Request #47 · jermp/sshash

NPSDC · 2024-07-26T03:00:48Z

The above PR is w.r.t efforts towards resolving the GitHub issue COMBINE-lab/piscem-cpp#2 aka using SSHash as a submodule in piscem-cpp.

Piscem-cpp takes the reduced cf format from cuttlefish as input. This PR now supports taking the above format as input for building the index. In addition, checker functions have also been updated to enable parsing the cfseg file.

We would also like to make feature requests in SSHash to enable its integration as a submodule in piscem.

Allow the build command to accept a thread parameter - currently, the default behaviour (in the Master branch of SShash) is to use all available threads.
Add another flag quiet, that turns on/off logging when building the index.

jermp · 2024-07-26T08:00:44Z

Hi Noor,
fantastic! Thank you very much for this PR.
I have set up some time ago a branch called "submodule-integration-for-piscem" and I think all the piscem related stuff should go there. (Th current master branch is waiting for another PR to be merged.)

I've just brought the branch "submodule-integration-for-piscem" up to date wrt to the master branch.

Would you close this PR for the master branch and open it for the branch "submodule-integration-for-piscem"?

thank you!
-Giulio

NPSDC · 2024-07-26T13:25:46Z

Hi Giulio,
Thank you so much for the prompt response. I have modified the PR for merging into "submodule-integration-for-piscem"

jermp · 2024-07-26T14:14:22Z

Merged! Thanks Noor. All future PRs should be reflected in that branch. I think that's the best way to proceed.
-Giulio

rob-p · 2024-07-26T14:24:47Z

Perfect; thanks so much for the quick merge @jermp! As we discussed in the piscem repo and as Noor points out here, the other minimal changes I think we will need are the ability to control the construction thread count and the ability to disable logging during construction (since we decided not to pull the heavy spdlog dependency into sshash).

One other question I had is that I noticed that many of the util functions for e.g. extracting k-mers have been updated from the (now long outdated) sshash version we vendor with piscem-cpp. When k is small (<=31) what are the differences between the new kmer type and the previous representation which was just a literal uint64_t?

Thanks!
Rob

jermp · 2024-07-26T14:32:35Z

Hi @rob-p,
good that we are not going to ship with piscem the spdlog. For the quiet build, it should be pretty straightforward: I'd even alter the current role of --verbose to make it control this aspect. Currently, --verbose prints some extra info but we can directly make it work like it outputs the current logging. (So, if not provided, it makes everything quiet.)

Regarding your question about the kmer type, the kmer.hpp just defines kmer_t as uint64_t for "small" k. Hence, there is no difference. Or are you referring to something else / have a specific function in mind?

-Giulio

rob-p · 2024-07-26T15:27:45Z

@NPSDC, can you explain the difference we tracked down when implementing cuttlefish parsing in the latest sshash where we got the same number of kmers but different number of super-kmers when using the old utility function?

jermp · 2024-07-26T15:40:41Z

Mmmh that's strange. The number of super-kmers should only change if a different minimizer scheme is used or if canonical parsing is on. In the updated versions there are also a couple of compilation flags to define, like SSHASH_USE_MAX_KMER_LENGTH_63 and SSHASH_USE_TRADITIONAL_NUCLEOTIDE_ENCODING. But under the same config, the utility functions should work exactly the same.

NPSDC · 2024-07-26T15:41:24Z

So, if you look at this commit under parse_file.hpp e802c97, only two functions were changed and we saw that the total number of super-kmers and corresponding downstream changes in the size of the partitions were observed. Using the uint_kmer instead of uint64 seemed to have resolved the differences.

NPSDC · 2024-07-26T15:43:23Z

Mmmh that's strange. The number of super-kmers should only change if a different minimizer scheme is used or if canonical parsing is on. In the updated versions there are also a couple of compilation flags to define, like SSHASH_USE_MAX_KMER_LENGTH_63 and SSHASH_USE_TRADITIONAL_NUCLEOTIDE_ENCODING. But under the same config, the utility functions should work exactly the same.

if canonical parsing is on, then the number of super-kmers should not change, right?

jermp · 2024-07-26T15:46:40Z

So, if you look at this commit under parse_file.hpp e802c97, only two functions were changed and we saw that the total number of super-kmers and corresponding downstream changes in the size of the partitions were observed. Using the uint_kmer instead of uint64 seemed to have resolved the differences.

Ok, so now: does it work correctly?

if canonical parsing is on, then the number of super-kmers should not change, right?

It does change the number of super-kmers because also the minimizers change and the sequences are parsed differently.

NPSDC · 2024-07-26T15:59:08Z

Ok, so now: does it work correctly?

Yes, it does work correctly now (like no change in output index, when fasta or the cfseg file is used.)

It does change the number of super-kmers because also the minimizers change and the sequences are parsed differently.

If both files are parsed with canonical-parsing on, then it is the canonical-kmer that would be considered and so the minimizer should not change. Please correct me if I am wrong in my understanding.

jermp · 2024-07-26T16:04:05Z

Yes, it does work correctly now (like no change in output index, when fasta or the cfseg file is used.

Ok, perfect!

If both files are parsed with canonical-parsing on, then it is the canonical-kmer that would be considered and so the minimizer should not change. Please correct me if I am wrong in my understanding.

Oh sure, if both builds use the flag --canonical-parsing, then the number of super-kmers must be the same.
Before, I was referring to "regular" parsing vs. "canonical" parsing. (The canonical one introduces more minimizers and hence, more super-kmers.)

rob-p · 2024-07-26T16:37:33Z

Right so the only changes were:

uint64_t uint64_kmer = util::string_to_uint64_no_reverse(kmer, k);

to

kmer_t uint_kmer = util::string_to_uint_kmer(kmer, k);

and

uint64_kmer_rc = util::compute_reverse_complement(uint64_kmer, k);
uint64_t minimizer_rc = util::compute_minimizer(uint64_kmer_rc, k, m, seed);

to

kmer_t uint_kmer_rc = util::compute_reverse_complement(uint_kmer, k);
uint64_t minimizer_rc = util::compute_minimizer(uint_kmer_rc, k, m, seed);

This changed the number of observed super k-mers in the parse.

jermp · 2024-07-26T17:16:11Z

Strange. I think that should not happen...
I bet the difference is small at least, right?

NPSDC · 2024-07-26T18:53:14Z

It was small but not that small (few 100s) on chr13 genome.

jermp · 2024-07-26T19:07:27Z

So more super-kmers in the new version compared to the old one used in piscem-cpp repo? 🤔

NPSDC · 2024-07-26T20:04:36Z

Pasting relevant pieces of log comparing the two in the initial version before the fix as mentioned by @rob-p

reading file '../../../DNA_datasets/chr13_f1.fa'...
read 740305 sequences, 113120928 bases, 90911778 kmers
num_kmers 90911778
num_super_kmers 13544993
num_pieces 740306

k = 31, m = 15, seed = 1, l = 6, c = 3, canonical_parsing = true, weighted = false, file type = cfseg
reading file '../../../DNA_datasets/chr13.cf_seg'...
m_buffer_size 29411764
sorting buffer...
saving to file './sshash.tmp.run_1722023687779820410.minimizers.0.bin'...
read 0 sequences, 113120928 bases, 90911778 kmers
num_kmers 90911778
num_super_kmers 13509768
num_pieces 740306 (+0.488588 [bits/kmer])

jermp · 2024-07-26T21:37:42Z

Sorry but I'm confused. Are you both referring to different numbers between different file formats (.fa vs. .cf_seg) or, given a file, both implementation return the same numbers?

NPSDC · 2024-07-26T21:52:27Z

We are referring to different numbers for the different file formats. The numbers become consistent after the changes mentioned by @rob-p.

rob-p · 2024-07-26T22:26:15Z

Yes, the string sets in the 2 files are the same. The only difference is the parsing function.

NPSDC and others added 6 commits July 23, 2024 23:15

adding cuttlefish parsing

df0a7ae

added seg parser

8c0c865

fix cf_seg build

e802c97

check for cfseg added

61d04a8

generalized support for both fasta and cfseg

691e26a

sshash with checking for cfseg supported

a5b662a

NPSDC changed the base branch from master to submodule-integration-for-piscem July 26, 2024 13:23

jermp merged commit 51169e0 into jermp:submodule-integration-for-piscem Jul 26, 2024

NPSDC mentioned this pull request Sep 2, 2024

Submodule integration for piscem #50

Closed

Conversation

NPSDC commented Jul 26, 2024

Uh oh!

jermp commented Jul 26, 2024

Uh oh!

NPSDC commented Jul 26, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jermp commented Jul 26, 2024

Uh oh!

rob-p commented Jul 26, 2024

Uh oh!

jermp commented Jul 26, 2024

Uh oh!

rob-p commented Jul 26, 2024

Uh oh!

jermp commented Jul 26, 2024

Uh oh!

NPSDC commented Jul 26, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

NPSDC commented Jul 26, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jermp commented Jul 26, 2024

Uh oh!

NPSDC commented Jul 26, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jermp commented Jul 26, 2024

Uh oh!

rob-p commented Jul 26, 2024

Uh oh!

jermp commented Jul 26, 2024

Uh oh!

NPSDC commented Jul 26, 2024

Uh oh!

jermp commented Jul 26, 2024

Uh oh!

NPSDC commented Jul 26, 2024

Uh oh!

jermp commented Jul 26, 2024

Uh oh!

NPSDC commented Jul 26, 2024

Uh oh!

rob-p commented Jul 26, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

NPSDC commented Jul 26, 2024 •

edited

Loading

NPSDC commented Jul 26, 2024 •

edited

Loading

NPSDC commented Jul 26, 2024 •

edited

Loading

NPSDC commented Jul 26, 2024 •

edited

Loading