feat: add incremental sequence alignment #237
Conversation
|
This is great - thanks for starting this @ivan-aksamentov. Let me know if I can help to convert to Snakemake. This should allow us to get rid of the preprocessing step.
Finally, on today's call we also discussed indexing this alignment here, but that may also be best left for a subsequent PR.
Agreed, if we're doing other things in Nextclade here, we might as well gather up more information!
2374b4b to
3088af4
Compare
|
I just pushed all the necessary steps to produce *nucleotide* alignment and made a full run on AWS + daily ingest locally (for GISAID only). See the updated first message. However, I just realized that Nextalign in ncov also produces and uses:
So these folks also need to be packaged into the cache, I guess... @jameshadfield @rneher Can you tell if some of these steps related to peptides and insertions can be moved? Alternatively, we could declare caching of the alignment in ingest a mistake and instead do the caching in ncov, whichever is simpler. Or what is a good solution here? Either way, the diff in this PR is very small, basically:
Porting to Snakemake (either new ingest or ncov) should be straightforward.
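The incremental pattern described above can be sketched as a toy bash version. This is a hedged sketch only: the `--output-fasta` flag comes from the PR description, the file names and data are illustrative, and the S3 download/upload steps are elided.

```shell
# Toy sketch of the two small additions: (1) have Nextclade emit the
# alignment for new sequences, (2) append it to the cached full alignment.
tmp="$(mktemp -d)"
# Cached alignment previously downloaded from S3 (simulated here)
printf '>cached_seq\nAAAA\n' > "$tmp/nextclade.aligned.fasta"
# Step 1 in the real pipeline would be:
#   nextclade ... --output-fasta "$tmp/nextclade.aligned.new.fasta"
printf '>fresh_seq\nCCCC\n' > "$tmp/nextclade.aligned.new.fasta"  # simulated
# Step 2: append this run's alignment to the accumulator
cat "$tmp/nextclade.aligned.new.fasta" >> "$tmp/nextclade.aligned.fasta"
total="$(grep -c '^>' "$tmp/nextclade.aligned.fasta")"
echo "$total"   # 2 sequences in the updated cache
```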
3088af4 to
cb55fef
Compare
|
I added similar logic for peptide alignment to daily ingests (it's on a branch). The tricky part with peptides is the full runs. Because of batching, joining many fasta files gets very hairy in bash. I haven't figured it out yet. My current hope is that someone shows up here and says that we don't need to cache peptides, and can just calculate and append the required summarized data to metadata instead. Pretty please? :) Just realized that we might not need to cache insertions, because they are already in the metadata.
The insertions, translations etc. for the entire alignment are only needed to produce the mutation summary. This is currently part of the ncov preprocessing workflow (note that this is different from …). The mutation summary can be shifted over to ncov-ingest as we are the only consumers of it, and this is what I recommend. This allows ncov-ingest to upload the alignment & mutation summary without needing to upload insertions / peptides etc. P.S. I improved the efficiency of the mutation summary script recently, so it shouldn't be the end of the world to simply recompute this each time for the entire alignment; at the very least it shouldn't block this PR.
The indexing of the alignment (often done behind-the-scenes by …)
|
Thanks, James, @jameshadfield
There are still a few questions remaining:
|
Long term you're right, we have a lot of this info (if not exactly the same thing? I haven't had time to check!) in metadata.tsv now. The plea is mostly a personal one from me - I use …
James can give more detail here, but my naive assumption would be to try to keep the outputs the same as much as possible unless we're absolutely sure they're redundant now. We could perhaps do this by incrementally updating not just the alignment, but these files as well?
At the moment we actually align twice - once for the whole of GISAID in order to get a master file that we can do priorities etc. on. Then once you've picked your set to actually do a build on - because this may contain spiked-in sequences - we align again. Because these are often only 10,000 sequences or fewer, this is actually pretty fast. We could keep doing this. Two other options:
Another approach is to stick with our current method (aligning twice) - though somewhat redundant - for the purposes of this PR, and then figure out a better approach in the future. We'll still be cutting hours off preprocessing time.
I agree with Emma - while the mutation summary may not be needed long term, it's straightforward to keep it going in the short-to-medium term.
No extra files needed :) I think the only files from ncov-ingest which this PR should add to the upload are
The current
I don't think it does go beyond the nuc alignment [2], and I don't think ncov needs to change here [1]. There is no merging necessary beyond what we currently have for multiple inputs, which happens after each has been aligned. [1] There is still the matter of the … [2] Things are different after we subsample, where we do use the peptides etc. But that's currently handled by a subsequent Nextalign run after subsampling.
|
I added the mutation summary to both the full run and daily ingest. Checked locally on a small dataset. I had to remove the auspice dependency, because we don't have it in the environment and adding it seems to require … Now running the full run on AWS Batch to produce the full alignment, nextclade.tsv and mutation_summary.tsv. If successful, I will then wait for GISAID to update and run the daily ingest to test the full cycle. In theory, the additions to … The bulk of changes are in the full-run script, which is tricky due to batching and bash.
|
Alright, the full jobs succeeded. The job ids were: … GISAID outputs are under … After GISAID updates tomorrow I will try to run the daily ingest script with these files locally to see if the fasta and summary update incrementally correctly.
|
The mutation_summary results of the local ingest seem to be okay:

```
$ wc -l data/gisaid/nextclade.mutation_summary.new.tsv | cut -f1 -d' '
32575
$ wc -l data/gisaid/nextclade.mutation_summary.old.tsv | cut -f1 -d' '
5718627
$ wc -l data/gisaid/nextclade.mutation_summary.tsv | cut -f1 -d' '
5751201
```

There is … I was not able to run GenBank ingest with fetch locally. After a few hours it fails with … Maybe we have better luck running this on the usual infra. At this point, @jameshadfield, I can use your help in pasting that into the Snakefile somehow. Summary of changes with relation to the Snakemake port:
Porting: the fasta step needs to be incorporated similarly to how "join metadata" step is already there. The mutation summary is a new step. The downloads (in A) and uploads (in B) are the new steps. Note that downloads are only needed when there are new sequences.
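As a quick sanity check on the `wc -l` numbers reported above (assuming, and this is my reading rather than something stated in the thread, that the only expected difference is a single duplicated header line dropped during the merge):

```shell
# Line counts reported above for the mutation summary files
new=32575        # nextclade.mutation_summary.new.tsv
old=5718627      # nextclade.mutation_summary.old.tsv
combined=5751201 # nextclade.mutation_summary.tsv
# Difference between naive concatenation and the merged file
echo $(( new + old - combined ))   # 1 line, consistent with one deduped header
```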
|
|
I merged master into this branch, which resolved (2), (4) and (1.1). Remaining items:
With this merge I kept the …
|
I just realized that this merge removed the diff between the old and new ingest-gisaid and ingest-genbank scripts from the "Changed files" tab here on GitHub. They are now all green, because they no longer exist on the merged master. Here is what was in the diff before the merge: e29ab01...c501a89
e06b768 to
d16065f
Compare
d16065f to
249a2da
Compare
|
Alright, I sketched all of the changes that need to be ported in the Snakemake file, but haven't actually run it yet. There are probably mistakes there. One deviation from my bash versions is that I upload the new alignment and mutation summary right after they are ready. This is probably not how it should be done in good workflows, but it is convenient right now, because I simply don't know (yet) how to extend the "if there are new sequences" check until the end of the workflow. Alternatively we may simply check for the presence of these files at the end of the workflow and upload them if they exist. But this solution is less reliable, as it may hide mistakes (cases where files should have been generated but were not will silently succeed). TODO:
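The "upload only if there are new sequences" check could, at the bash level, look something like the following sketch (the path is illustrative and the real upload command is elided):

```shell
# Guard the upload on the new-sequences file being non-empty.
new_fasta="data/gisaid/nextclade.aligned.new.fasta"
if [ -s "$new_fasta" ]; then
  msg="uploading updated alignment"
else
  msg="no new sequences; skipping upload"
fi
echo "$msg"
```

The `-s` test is true only for an existing, non-empty file, which covers both the "file missing" and "file empty" cases in one check.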
|
|
@ivan-aksamentov I've refactored this to use snakemake checkpoints. Let's watch the actions (genbank and gisaid) to see if it works as expected. I have not reviewed this fully, but it looks generally good. Some things I noticed:

**shellcheck**

**Consistent filenames**

It would help (me) a lot were we to systematise the filenames we use. My understanding, based on the PR, is that:
However this isn't always the case, e.g., sometimes the middle one is called "new".
|
Hey thanks @jameshadfield,
Yeah, we need globbing in that case, so I think we just need to silence the warning on that line. But maybe there's some sort of a syntax for making it proper. It's bash, so you never know :)
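Silencing a single shellcheck warning where the expansion is intentionally unquoted can be done with a one-line directive. A minimal example (SC2086 is shellcheck's "double quote to prevent globbing and word splitting" check; the pattern here is illustrative):

```shell
pattern="data/gisaid/*.new.fasta"
# Unquoted on purpose so the glob expands:
# shellcheck disable=SC2086
out="$(cat $pattern 2>/dev/null || echo "no matching files")"
echo "$out"
```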
Yeah, I wanted to do this as well. The reason why it's different for now is that it's different in the full-run script and on S3, so changing things might be a bit awkward. We need to go through these one by one and see what is affected. Oh no! Looks like it failed. I will try to read and understand what's going on.
|
The "quotation mark" jobs also failed because, as I mentioned above, Nextclade accepts a comma-delimited gene list, but the mutation summary only accepts a space-delimited list. And if not, then this happens: … I made it space-delimited. Actions: … We could also modify the mutation-summary script, but this is a bit out of scope of this PR and will turn an exact copy of this script from ncov into an "almost exact" copy, which feels awkward to do. What is the best way to "commonize" this functionality across repos? (make it shared)
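The delimiter mismatch itself is a one-line conversion in bash. A toy example (the gene names are illustrative, not the exact ncov gene list):

```shell
# Nextclade takes the gene list comma-delimited; the mutation-summary
# script wants it space-delimited.
genes_csv="E,M,N,ORF1a,S"
genes_spaced="${genes_csv//,/ }"   # replace every comma with a space
echo "$genes_spaced"
```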
|
In the latest news, the runs with the … succeeded for GenBank, but the GISAID download keeps failing today (Slack thread), so an inconclusive result again.
|
Alright, GISAID has now run successfully as well. I think this is ready to be reviewed. I suggest we run this version in parallel to the master flow, and at the same time we could try to use the new results on a dev version of ncov, or perhaps on staging, to see how the new flow works end-to-end. Then, if it's good, we can make it the default. Before merge, a full run should be performed to update the caches.
nextclade.aligned.fasta -> aligned.fasta
nextclade.mutation_summary.tsv -> mutation_summary.tsv
|
Last changes before merging: Rename
i.e. no …
|
The plan on bringing this to production: (Slack thread: https://bedfordlab.slack.com/archives/C01LCTT7JNN/p1641427855047000)
|
* move `aligned.fasta.xz` to the list of `ncov-ingest`-produced files since this is now output from Nextclade in the daily ingest pipeline¹
* remove `mutation-summary.tsv.xz` since this is no longer updated in the daily ingest pipeline²

¹ nextstrain/ncov-ingest#237
² nextstrain/ncov-ingest@05cd82f
Removed checkpoint because it only slows down the pipeline without any of the past benefits. It was initially added because Nextclade v1 used to error on an empty FASTA input and we needed the checkpoint to check if we should generate a new mutation summary¹. Now, running Nextclade v2 on a no-op file takes ~1 minute and we no longer use the mutation summary. ¹ #237 (comment)
Description of proposed changes
During daily ingest, the extracted full fasta file is filtered by whether the sequence name is already present in the nextclade.tsv file from S3, which is an accumulation of all Nextclade outputs produced during previous ingests. The sequences that are not present in nextclade.tsv are considered new and are passed to Nextclade for processing.
Nextclade aligns these sequences on every run. However, currently the alignment is ignored (not emitted).
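The filtering step described above can be sketched on toy data (two-line fasta records and hypothetical sequence names; the real pipeline operates on the full extracted fasta and the accumulated nextclade.tsv):

```shell
tmp="$(mktemp -d)"
# Full extracted fasta: seq1 was processed in a previous ingest, seq2 is new
printf '>seq1\nACGT\n>seq2\nGGCC\n' > "$tmp/all.fasta"
printf 'seqName\tclade\nseq1\t20A\n' > "$tmp/nextclade.tsv"
# Names already present in the accumulated nextclade.tsv
tail -n +2 "$tmp/nextclade.tsv" | cut -f1 > "$tmp/seen.txt"
# Keep only records whose names are not in seen.txt
new_records="$(awk 'NR==FNR {seen[">"$0]=1; next} /^>/ {keep=!seen[$0]} keep' \
  "$tmp/seen.txt" "$tmp/all.fasta")"
echo "$new_records"
```

Only `seq2` survives the filter, so only genuinely new sequences reach Nextclade.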
In this PR:
I add the `--output-fasta` flag to the Nextclade invocation, so that the nucleotide alignment is emitted for the new sequences into `data/gisaid/nextclade.aligned.new.fasta`. The old "cache" of aligned sequences is downloaded from S3 as … Then I concatenate the 2 fasta files to get the new …, which is uploaded back to S3.
That is, this file will then accumulate aligned sequences after each daily ingest.
This behavior is similar to concatenating the nextclade.tsv for new sequences onto the accumulator nextclade.tsv from S3.
For the very first run, I generated the `nextclade.aligned.fasta.xz` by running the Nextclade full run.

TODO:
- `bin/run-nextclade-full` script to also emit and accumulate the alignment. Run it once to generate the initial alignment of the entire database.
- `bin/run-nextclade-full` script to generate the mutation summary. Run it once to generate the initial mutation summary of the full database.

Related issue(s)
Testing
This is what I did to test alignment for GISAID (Upd: this was before the mutation summary was added):
- … `nextclade.aligned.fasta.xz` in a special directory on S3)
- … `nextclade.aligned.fasta.xz`) to the root of the bucket, where the modified daily ingest can find it
- … `seqkit stat` to count the number of sequences.
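The sequence counting from the last step, sketched on toy data (`seqkit stat` is what the thread mentions; `grep` is shown as a dependency-free stand-in):

```shell
tmp="$(mktemp -d)"
printf '>a\nACGT\n>b\nGGGG\n>c\nTTTT\n' > "$tmp/aligned.fasta"
# Each fasta record starts with '>'; counting headers counts sequences
n="$(grep -c '^>' "$tmp/aligned.fasta")"
echo "$n"
# with seqkit installed: seqkit stat "$tmp/aligned.fasta"
```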