Hi @jermp!
I'm currently working on adopting SSHash in MetaGraph, as we want to see if we can improve our performance using it as an internal structure for indexing and traversing DBGs.
As a part of the process, we want to generalize SSHash in a way that allows larger values of $k$ and potentially different alphabets. My general idea is to rewrite the code base so that the methods that currently rely on properties of k-mers (the alphabet used, conversions between k-mers or individual characters and their bit representation, etc.) are implemented as members of the `kmer_t` class rather than as `util::` functions, as they mostly are now.
So, I plan to write a simple base `kmer_t` class with virtual functions that provides contracts for what such methods should look like, and then we will inherit from this base class in MetaGraph to provide implementations for our k-mer types. Then, all the methods that currently rely on `kmer_t` would take a template argument instead, which should be a concrete k-mer class derived from the base class described above. I think this way we can ensure that MetaGraph and SSHash stay reasonably separated.
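To make the idea a bit more concrete, here is a rough sketch of the shape I have in mind. Everything in it is a placeholder for illustration rather than existing SSHash code: the names `kmer_base`, `dna_kmer`, `string_to_kmer` and `kmer_to_string`, the single 64-bit word of storage, and the A=0, C=1, G=2, T=3 encoding are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <type_traits>

// Hypothetical contract that every k-mer type would have to fulfil.
struct kmer_base {
    uint64_t bits = 0;  // packed characters; a single word only for brevity
    virtual ~kmer_base() = default;
    virtual uint64_t bits_per_char() const = 0;
    virtual uint64_t char_to_uint(char c) const = 0;
    virtual char uint_to_char(uint64_t x) const = 0;
};

// Example implementation for the DNA alphabet (illustrative encoding).
struct dna_kmer : kmer_base {
    uint64_t bits_per_char() const override { return 2; }
    uint64_t char_to_uint(char c) const override {
        switch (c) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
        }
        assert(false && "character not in alphabet");
        return 0;
    }
    char uint_to_char(uint64_t x) const override { return "ACGT"[x & 3]; }
};

// Alphabet-agnostic code takes the k-mer type as a template argument.
template <typename Kmer>
Kmer string_to_kmer(const std::string& s) {
    static_assert(std::is_base_of<kmer_base, Kmer>::value,
                  "Kmer must derive from kmer_base");
    Kmer kmer;
    const uint64_t b = kmer.bits_per_char();
    assert(s.size() * b <= 64);  // must fit in the single storage word
    for (std::size_t i = 0; i != s.size(); ++i) {
        // the i-th character occupies bits [i*b, (i+1)*b)
        kmer.bits |= kmer.char_to_uint(s[i]) << (i * b);
    }
    return kmer;
}

template <typename Kmer>
std::string kmer_to_string(const Kmer& kmer, uint64_t k) {
    static_assert(std::is_base_of<kmer_base, Kmer>::value,
                  "Kmer must derive from kmer_base");
    const uint64_t b = kmer.bits_per_char();
    assert(k * b <= 64);
    const uint64_t mask = (uint64_t(1) << b) - 1;
    std::string s(k, '\0');
    for (uint64_t i = 0; i != k; ++i) {
        s[i] = kmer.uint_to_char((kmer.bits >> (i * b)) & mask);
    }
    return s;
}
```

With something like this, MetaGraph would only need to supply its own class derived from the base, and a call such as `string_to_kmer<dna_kmer>("ACGT")` would work without touching the alphabet-agnostic code. Since the concrete type is known at compile time through the template argument, the virtual calls mainly serve as a documented contract here.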
I started working in this direction in our fork of SSHash (9624612), but before going further I wanted to get in touch and collect some feedback from your side. With that in mind, I have a couple of questions:
- Would it be of interest for SSHash to merge these changes upstream in the future?
- Do you have any advice on how this should be approached? Does the outlined approach make sense to you?
- It seemed to me that most of the code that heavily relies on the alphabet being ACGT is currently encapsulated in `util.hpp`, but there are also some methods in e.g. `dictionary.cpp`. Are there any other places where you rely on the alphabet being ACGT directly, rather than implicitly via calls to `util::` or `dictionary` methods?
If merging this upstream in the future is of interest to you, I will try to take your opinion into consideration while working in the fork, so that it is easier to merge afterwards. If it's not a priority for you, no problem: we can keep working on the fork, but we would still highly appreciate any insight and advice you might have on these changes.