Piscem-cpp#47
Conversation
|
Hi Noor, I've just brought the branch "submodule-integration-for-piscem" up to date wrt to the master branch. Would you close this PR for the master branch and open it for the branch "submodule-integration-for-piscem"? thank you! |
|
Hi Giulio, |
|
Merged! Thanks Noor. All future PRs should be reflected in that branch. I think that's the best way to proceed. |
|
Perfect; thanks so much for the quick merge @jermp! As we discussed in the piscem repo and as Noor points out here, the other minimal changes I think we will need are the ability to control the construction thread count and the ability to disable logging during construction (since we decided not to pull the heavy spdlog dependency into sshash). One other question I had is that I noticed that many of the Thanks! |
|
Hi @rob-p, Regarding your question about the kmer type, the -Giulio |
|
@NPSDC, can you explain the difference we tracked down when implementing cuttlefish parsing in the latest sshash where we got the same number of kmers but different number of super-kmers when using the old utility function? |
|
Mmmh that's strange. The number of super-kmers should only change if a different minimizer scheme is used or if canonical parsing is on. In the updated versions there are also a couple of compilation flags to define, like |
|
So, if you look at this commit under |
if canonical parsing is on, then the number of super-kmers should not change, right? |
Ok, so now: does it work correctly?
It does change the number of super-kmers because also the minimizers change and the sequences are parsed differently. |
Yes, it does work correctly now (like no change in output index, when
If both files are parsed with |
Ok, perfect!
Oh sure, if both builds use the flag |
|
Right so the only changes were: to and to This changed the number of observed super k-mers in the parse. |
|
Strange. I think that should not happen... |
|
It was small but not that small (few 100s) on chr13 genome. |
|
So more super-kmers in the new version compared to the old one used in piscem-cpp repo? 🤔 |
|
Pasting relevant pieces of log comparing the two in the initial version before the fix as mentioned by @rob-p reading file '../../../DNA_datasets/chr13_f1.fa'...
read 740305 sequences, 113120928 bases, 90911778 kmers
num_kmers 90911778
num_super_kmers 13544993
num_pieces 740306 |
|
Sorry but I'm confused. Are you both referring to different numbers between different file formats ( |
|
We are referring to different numbers for the different file formats. The numbers become consistent after the changes mentioned by @rob-p. |
|
Yes, the string sets in the 2 files are the same. The only difference is the parsing function. |
The above PR is w.r.t efforts towards resolving the GitHub issue COMBINE-lab/piscem-cpp#2 aka using SSHash as a submodule in piscem-cpp.
Piscem-cpp takes the reduced
cfformat from cuttlefish as input. This PR now supports taking the above format as input for building the index. In addition, checker functions have also been updated to enable parsing thecfsegfile.We would also like to make feature requests in SSHash to enable its integration as a submodule in piscem.