feat: add incremental sequence alignment#237

Merged
ivan-aksamentov merged 25 commits into master from feat/incremental-alignment
Jan 7, 2022

Conversation

@ivan-aksamentov
Member

@ivan-aksamentov ivan-aksamentov commented Nov 29, 2021

Description of proposed changes

During daily ingest, the extracted full fasta file is filtered by whether each sequence name is already present in the nextclade.tsv file from S3, which is an accumulation of all Nextclade outputs produced during previous ingests. Sequences not present in nextclade.tsv are considered new and are passed to Nextclade for processing.
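For illustration, the filtering idea can be sketched as an inline awk pass (a hypothetical stand-in for the actual ./bin/filter-fasta script, using toy data):

```shell
# Toy stand-in for the filtering done by ./bin/filter-fasta (hypothetical
# inline version; the real script is a separate tool with its own flags).
tmp=$(mktemp -d)

# Accumulator nextclade.tsv: names already processed in previous ingests.
printf 'seqName\tclade\nhCoV-19/A\t20A\n' > "$tmp/nextclade.tsv"

# Full daily fasta: one already-seen sequence, one genuinely new one.
printf '>hCoV-19/A\nACGT\n>hCoV-19/B\nACGG\n' > "$tmp/sequences.fasta"

# Keep only records whose name is absent from the TSV's first column.
awk -F'\t' 'NR==FNR { if (FNR > 1) seen[$1] = 1; next }
            /^>/    { keep = !(substr($0, 2) in seen) }
            keep' "$tmp/nextclade.tsv" "$tmp/sequences.fasta" \
  > "$tmp/nextclade.sequences.fasta"

cat "$tmp/nextclade.sequences.fasta"   # only the new record, >hCoV-19/B
```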

Nextclade aligns these sequences on every run. However, the alignment is currently ignored (not emitted).

In this PR:

I add the --output-fasta flag to the Nextclade invocation, so that the nucleotide alignment of the new sequences is emitted into data/gisaid/nextclade.aligned.new.fasta.
The old "cache" of aligned sequences, nextclade.aligned.fasta.xz, is downloaded from S3 as data/gisaid/nextclade.aligned.old.fasta.
Then I concatenate the two fasta files to produce the new data/gisaid/nextclade.aligned.fasta, which is uploaded back to S3.

That is, this file will accumulate aligned sequences after each daily ingest.
This behavior is similar to concatenating the nextclade.tsv for new sequences onto the accumulator nextclade.tsv from S3.
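The accumulation itself is plain concatenation. As a runnable toy sketch (filenames mirror the PR; file contents are made up):

```shell
tmp=$(mktemp -d)

# Old "cache" as downloaded from S3, and today's freshly aligned sequences.
printf '>old1\nAAAA\n' > "$tmp/nextclade.aligned.old.fasta"
printf '>new1\nCCCC\n' > "$tmp/nextclade.aligned.new.fasta"

# The updated cache is just the concatenation of the two fasta files.
cat "$tmp/nextclade.aligned.old.fasta" "$tmp/nextclade.aligned.new.fasta" \
  > "$tmp/nextclade.aligned.fasta"

grep -c '^>' "$tmp/nextclade.aligned.fasta"   # prints 2
```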

For the very first run, I generated nextclade.aligned.fasta.xz by doing a full Nextclade run.

TODO:

  • Implement downloading the accumulator fasta from S3 at the beginning of the run
  • Emit alignment
  • Concatenate new fasta to the accumulator fasta
  • Implement uploading the accumulator fasta to S3 at the end of the run
  • Modify bin/run-nextclade-full script to also emit and accumulate the alignment. Run it once to generate the initial alignment of the entire database.
  • Repeat the same steps for GenBank
  • Calculate mutation summary in daily ingest
  • Implement uploading mutation summary to S3
  • Modify bin/run-nextclade-full script to generate mutation summary. Run it once to generate the initial mutation summary of the full database.
  • Run daily ingest to verify that incremental updates of the full alignment and mutation summary work
  • Repeat the same steps for GenBank
  • Port to Snakemake (in a separate PR)

Related issue(s)

Testing

This is what I did to test alignment for GISAID (Upd: this was before mutation summary was added):

  • 2021-12-02: ran "Nextclade full run" on AWS Batch to generate alignment for the entire GISAID (appeared as nextclade.aligned.fasta.xz in a special directory on S3)
  • copied the alignment (nextclade.aligned.fasta.xz) to the root of the bucket, where the modified daily ingest can find it
  • 2021-12-03: waited until GISAID data updated
  • 2021-12-03: ran the daily GISAID ingest locally and checked that the resulting concatenated fasta file had grown. I used seqkit stat to count the number of sequences.

@jameshadfield
Member

This is great - thanks for starting this @ivan-aksamentov. Let me know if I can help to convert to Snakemake.

This should allow us to get rid of the preprocessing step (from nextstrain/ncov) and simply start phylogenetic builds from the alignment produced here. This will be a huge improvement! The remaining steps which would be needed to achieve this (not necessarily for this PR):

  • There's a mutation summary computed from the alignment and other files produced by nextalign/nextclade. I think we'd want to calculate this here in ncov-ingest. cc @emmahodcroft
  • Filtering (mainly QC)

Finally, on today's call we also discussed indexing this alignment here, but that may also be best left for a subsequent PR.

@emmahodcroft
Member

There's a mutation summary computed from the alignment and other files produced by nextalign/nextclade. I think we'd want to calculate this here in ncov-ingest.

Agreed, if we're doing other things in Nextclade here, we might as well gather up more information!
One thing I'm a little less clear about is whether there's any difference between the mutation summary that ends up back in the metadata & the mutation_summary.tsv file - which is what I currently use, at least. I could make the switch to using the one in metadata (if someone can let me know any key differences) - but it would be super nice if this could wait until things die down a bit from Omicron 😅

@ivan-aksamentov ivan-aksamentov force-pushed the feat/incremental-alignment branch from 2374b4b to 3088af4 Compare December 3, 2021 04:34
@ivan-aksamentov
Member Author

ivan-aksamentov commented Dec 3, 2021

I just pushed all the necessary steps to produce *nucleotide* alignment and made a full run on AWS + daily ingest locally (for GISAID only). See the updated first message.

However, I just realized that Nextalign in ncov also produces and uses:

  • peptides: controlled by --output-basename --output-dir and emitted as <output-dir>/<output-basename>.gene.<gene_name>.fasta
  • insertions: emitted into --output-insertions as <base-name>.insertions.csv

See:
https://github.com/nextstrain/ncov/blob/cb6361a6da154d88c9f7ea3ac2aff4601f679a46/workflow/snakemake_rules/main_workflow.smk#L99-L102

So these folks also need to be packaged into the cache, I guess...


@jameshadfield @rneher Can you tell if some of these steps related to peptides and insertions can be moved from ncov into ingest, to reduce the amount of things to be cached? I don't think peptides are used raw; they are probably summarized into something? The mutation_summary.tsv Emma mentioned? Why does it live separately? Maybe we can just join this stuff into the metadata? And what is the story with these indexes you are talking about?

Alternatively, we could declare caching of alignment in ingest a mistake and instead do the caching in ncov, whatever is simpler. Or what is a good solution here?


Either way, the diff in this PR is very small, basically:

  • 1 flag added to Nextclade
  • cat the fastas
  • download/upload to S3

Porting to Snakemake (either new ingest or ncov) should be straightforward.

@ivan-aksamentov ivan-aksamentov force-pushed the feat/incremental-alignment branch from 3088af4 to cb55fef Compare December 3, 2021 09:57
@ivan-aksamentov
Member Author

ivan-aksamentov commented Dec 3, 2021

I added similar logic for peptide alignments to the daily ingests. It's on branch feat/incremental-alignment-with-peptides. This is basically the same as what was added for the nuc alignment, but with a for loop over the peptide fastas (one per gene).

The tricky part with peptides is the full runs. Because of batching, joining many fasta files gets very hairy in bash. I haven't figured it out yet. My current hope is that someone shows up here and says that we don't need to cache peptides, and can just calculate and append the required summarized data to the metadata instead. Pretty please? :)
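The per-gene loop mentioned above might look roughly like this (gene names and the filename pattern are assumptions for illustration; real file layouts may differ):

```shell
tmp=$(mktemp -d)
genes="E M N S"   # assumed subset of gene names, for illustration

for gene in $genes; do
  # Stand-ins for the old (cached) and new per-gene peptide alignments.
  printf '>old1\nMA\n' > "$tmp/nextclade.gene.$gene.old.fasta"
  printf '>new1\nMV\n' > "$tmp/nextclade.gene.$gene.new.fasta"

  # Same accumulation as for the nucleotide alignment, repeated per gene.
  cat "$tmp/nextclade.gene.$gene.old.fasta" \
      "$tmp/nextclade.gene.$gene.new.fasta" \
    > "$tmp/nextclade.gene.$gene.fasta"
done

grep -c '^>' "$tmp/nextclade.gene.S.fasta"   # prints 2
```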

Just realized that we might not need to cache insertions, because they are already in the metadata and are exactly the same thing as in insertions.csv. I hope ncov can use Nextclade's insertions from metadata instead of from insertions.csv.

@jameshadfield
Member

Can you tell if some of these steps related to peptides and insertions can be moved from ncov into ingest to reduce the amount of things to be cached?

The insertions, translations etc for the entire alignment are only needed to produce the mutation summary. This is currently part of the ncov preprocessing workflow (note that this is different from rule build_mutation_summary which is done on the subsampled alignments, but both rules use the same python script).

The mutation summary can be shifted over to ncov-ingest as we are the only consumers of it, and this is what I recommend. This allows ncov-ingest to upload the alignment & mutation summary without needing to upload insertions / peptides etc.

P.S. I improved the efficiency of the mutation summary script recently so it shouldn't be the end of the world to simply recompute this each time for the entire alignment; at the very least it shouldn't block this PR.

And what is the story with these indexes you are talking about?

The indexing of the alignment (often done behind-the-scenes by augur filter) is taking a long time. Ideally this could be computed here (via augur index) and uploaded alongside the alignment. However, there was discussion about whether the index is actually needed, since we have most of this information in the metadata, so I think we can ignore this for the purposes of this PR.

@ivan-aksamentov
Member Author

ivan-aksamentov commented Dec 6, 2021

Thanks, James, @jameshadfield

The mutation summary can be shifted over to ncov-ingest as we are the only consumers of it

mutation_summary.tsv seems to contain nuc and aa mutations per gene. metadata.tsv already contains this in a slightly different format. I'd say that we might not need mutation_summary.tsv at all. But again, I don't know how it's used. I could produce it from metadata.tsv.

There are still a few questions remaining:

  • The whole idea is to avoid running Nextalign in ncov for GISAID sequences. Let's say we precompute nuc alignment and mutation_summary.tsv in ingest, and we don't need peptides then. But are there other remaining outputs of Nextalign in ncov that are used and need to be precomputed? For example insertions.tsv (these are also in the metadata by the way).
  • GISAID alignment is handled by ingest fully, okay, but how about the spike-in sequences and other external non-GISAID inputs? Here is how I imagine it: currently the external unaligned fastas and metadata files are merged into the GISAID ones and passed to Nextalign. With the cache we instead need to pass the externals to Nextalign and then merge the results. This merge might be tricky. I don't have a clear idea of how ncov should be changed to accommodate the GISAID cache (because, as we discovered, it goes beyond just the nuc alignment).

@emmahodcroft
Member

emmahodcroft commented Dec 6, 2021

@ivan-aksamentov

I'd say that we might not need mutation_summary.tsv at all. But again I don't know how it's used. I could produce it from metadata.tsv.

Long term you're right, we have a lot of this info (if not the same thing exactly? I haven't had time to check!) in metadata.tsv now. The plea is mostly a personal one from me - I use mutation_summary.tsv as the key part of CoVariants, and though I can make a switch longer term, it would be an enormous help if I don't have to do this right now - I've got so many balls in the air. However, if the information is the same, happy for this to just be copied out of metadata into mutation_summary.tsv. And my apologies!! 🙏 🙏

But are there other remaining outputs of Nextalign in ncov that are used and need to be precomputed? For example insertions.tsv (these are also in the metadata by the way).

James can give more detail here, but my naive assumption would be to try and keep the outputs the same as much as possible unless we're absolutely sure they're redundant now. We could perhaps do this by incrementally updating more than just the alignment, but these files as well?

GISAID alignment is handled by ingest fully, okay, but how about these spike-in sequences and otherwise external non-GISAID inputs.

At the moment we actually align twice - once on the whole of GISAID in order to get a master file that we can do priorities etc. on. Then again once you've picked your set to actually do a build on - because this may contain spiked-in sequences. Because these sets are often only 10,000 sequences or fewer, this is actually pretty fast. We could keep doing this.

Two other options:

  • We set a rule so that anything that isn't set as a GISAID or GenBank input gets aligned before being included into the set to be 'actually built' (we already have rules to dedup & clean up such files, though fool-proof deduping obviously lies with the user - if you change all the seq names... well) & stop aligning after the combining
  • We simply expect that users use a simple command to align their own spiked sets before they include them in the workflow, either manually or via a localrule that they write

Another approach is to stick with our current method (aligning twice) - though somewhat redundant - for the purposes of this PR - and then figure out a better approach in the future. We'll still be cutting hours off preprocess time.

@jameshadfield
Member

I'd say that we might not need mutation_summary.tsv at all.

I agree with Emma - while the mutation summary may not be needed long term, it's straightforward to keep it going in the short-to-medium term.

Let's say we precompute nuc alignment and mutation_summary.tsv in ingest, and we don't need peptides then. But are there other remaining outputs of Nextalign in ncov that are used and need to be precomputed? For example insertions.tsv (these are also in the metadata by the way).

No extra files needed :) I think the only files from ncov-ingest which this PR should add to the upload are

  1. aligned.fasta.xz
  2. mutation-summary.tsv.xz

GISAID alignment is handled by ingest fully, okay, but how about these spike-in sequences and otherwise external non-GISAID inputs ...

The current ncov workflow allows such behavior via a config file with multiple inputs. Each input is aligned (via nextalign) independently, and only if needed.

I don't have a clear idea of how ncov should be changed to accommodate the GISAID cache (because as we discovered it goes beyond just nuc alignment).

I don't think it does go beyond nuc alignment [2] and I don't think ncov needs to change here [1]. There is no merging necessary beyond what we currently have for multiple inputs, which happens after each has been aligned.


[1] There is still the matter of the rule filter (in nextstrain/ncov), which will be the last remaining part of the "preprocessing" workflow. This can be dealt with separately from this PR; there are some decisions we need to make about what to do with that rule.

[2] Things are different after we subsample, where we do use the peptides etc. But that's currently handled by a subsequent nextalign run after subsampling.

@ivan-aksamentov
Member Author

ivan-aksamentov commented Dec 8, 2021

I added mutation summary to both full run and daily. Checked locally on a small dataset.

I had to remove the auspice dependency, because we don't have it in the environment, and adding it seems to require gcc, which is not in the container. Auspice was only used for the open_file() call; I just replaced it with the stock open().

Now running the full run on AWS Batch to produce the full alignment, nextclade.tsv, and mutation_summary.tsv. If successful, I will then wait for GISAID to update and run the daily ingest to test the full cycle.

In theory, the additions to ingest-gisaid and ingest-genbank scripts can be pasted as steps to the Snakemake and should hopefully work.

The bulk of changes are in the full-run script, which is tricky due to batching and bash.

@ivan-aksamentov
Member Author

Alright, the full jobs succeeded.

The job ids were:

8cf3776a-ad98-4c53-8d62-0b61ec3894af
64878851-e332-4970-a417-2a0125ebe03e

GISAID outputs are the

nextclade.aligned.fasta.xz
nextclade.mutation_summary.tsv.xz
nextclade.tsv.gz

under

s3://nextstrain-ncov-private/nextclade-full-run-2021-12-08--10-24-50--UTC/

After gisaid updates tomorrow I will try to run daily ingest script with these files locally to see if fasta and summary incrementally updates correctly.

@ivan-aksamentov
Member Author

ivan-aksamentov commented Dec 10, 2021

The mutation_summary results of the local ingest seem to be okay:

$ wc -l data/gisaid/nextclade.mutation_summary.new.tsv | cut -f1 -d' '
32575

$ wc -l data/gisaid/nextclade.mutation_summary.old.tsv | cut -f1 -d' '
5718627

$ wc -l data/gisaid/nextclade.mutation_summary.tsv | cut -f1 -d' '
5751201

There are old + new - 1 lines in the result: the -1 is because we throw away the second header row.
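The header handling can be sketched as follows (a hypothetical equivalent of ./bin/join-rows on toy data; the real script may differ):

```shell
tmp=$(mktemp -d)

# Two TSVs with the same header: the accumulated "old" and the daily "new".
printf 'seqName\tsubs\nA\tC1T\n' > "$tmp/old.tsv"
printf 'seqName\tsubs\nB\tG2A\n' > "$tmp/new.tsv"

# Keep the first file's header, drop the header row of the appended file.
{ cat "$tmp/old.tsv"; tail -n +2 "$tmp/new.tsv"; } > "$tmp/joined.tsv"

wc -l < "$tmp/joined.tsv"   # 3 lines: 2 + 2 - 1
```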

I was not able to run GenBank ingest with fetch locally. After a few hours it fails with

curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to www.ncbi.nlm.nih.gov:44

Maybe we'll have better luck running this on the usual infra.


At this point, @jameshadfield I can use your help in pasting that into the Snakefile somehow.

Summary of changes with relation to Snakemake port:

  • 1. bin/run-nextclade gained 3 new params (see the diff comment below). Porting: merge conflict should be resolved by

    • 1.1. taking all the new params for both branches in the correct order
    • 1.2. changing the call sites accordingly
  • 2. bin/mutation-summary - new script copied literally from ncov's mutation-summary.py. Renamed and added hashbang for consistency with other scripts here. Porting: take as is. Upd: done.

  • 3. The heart of changes, the "incremental update", is in the caller scripts, ./bin/ingest-gisaid and ./bin/ingest-genbank:

    • A. Incrementally update the "cache" locally. Inside the conditional "if there are new sequences", after nextclade invocation after "join metadata" step:

      • A.1 Alignment (note: the "increment" step has already been computed by Nextclade at this point)

        • A.1.1. Retrieve the "cache": Download old alignment from s3
        • A.1.2. Update the "cache": Join new and old alignment with cat
      • A.2 Mutation summary

        • A.2.1. Compute the "increment": run ./bin/mutation-summary
        • A.2.2. Retrieve the "cache": Download old mutation summary from s3
        • A.2.3. Update the "cache": Join new and old mutation summary with ./bin/join-rows
    • B. "Save the cache": at the end of the script, if the mutation summary and alignment files exist, upload them to S3. They will not exist if the "if there are new sequences" check is false, in which case Nextclade and the mutation-summary script did not run. So there is no cache update and nothing to upload in that case.

Porting: the fasta step needs to be incorporated similarly to how the "join metadata" step already is. The mutation summary is a new step. The downloads (in A) and uploads (in B) are new steps. Note that the downloads are only needed when there are new sequences.

  • 4. bin/nextclade-full-run was adjusted to produce the alignment and mutation summary. This is symmetrical to daily increments, but a bit hairy due to the batching. Porting: take as is. Upd: done.
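Steps A.1 and B above can be sketched end-to-end, with a local stub standing in for the S3 transfer (`s3_cp` and the fake bucket directory are illustration only, not the real scripts):

```shell
tmp=$(mktemp -d)
s3="$tmp/fake-s3"; mkdir -p "$s3"   # stand-in for the S3 bucket
s3_cp() { cp "$1" "$2"; }           # stand-in for `aws s3 cp`

# Cache from previous ingests, plus today's freshly aligned sequences.
printf '>old1\nAAAA\n' > "$s3/aligned.fasta"
printf '>new1\nCCCC\n' > "$tmp/nextclade.aligned.new.fasta"

# A.1.1 retrieve the cache
s3_cp "$s3/aligned.fasta" "$tmp/nextclade.aligned.old.fasta"
# A.1.2 update the cache
cat "$tmp/nextclade.aligned.old.fasta" "$tmp/nextclade.aligned.new.fasta" \
  > "$tmp/nextclade.aligned.fasta"
# B. save the cache back
s3_cp "$tmp/nextclade.aligned.fasta" "$s3/aligned.fasta"

grep -c '^>' "$s3/aligned.fasta"   # prints 2
```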

@ivan-aksamentov
Member Author

ivan-aksamentov commented Dec 10, 2021

I merged master into this branch, which resolved (2), (4) and (1.1).

Remaining items:

  • The call sites for bin/run-nextclade need to be changed in (1.2). Upd: done
  • And (3) - convert bash changes to snakemake things.

With this merge I kept the ingest-gisaid and ingest-genbank scripts, so that they can be used as a reference for the new snakefile things. We should delete them after porting is done. Note: these scripts don't have the adjustments required in (1.2) (the threads params).

@ivan-aksamentov
Member Author

ivan-aksamentov commented Dec 10, 2021

I just realized that this merge removed the diff between the old and new ingest-gisaid and ingest-genbank scripts from the "Changed files" tab here on GitHub. They are now all green, because they do not exist on the merged master.

Here is what was in the diff before the merge: e29ab01...c501a89

@ivan-aksamentov ivan-aksamentov force-pushed the feat/incremental-alignment branch 2 times, most recently from e06b768 to d16065f Compare December 10, 2021 17:27
@ivan-aksamentov ivan-aksamentov force-pushed the feat/incremental-alignment branch from d16065f to 249a2da Compare December 10, 2021 17:27
@ivan-aksamentov
Member Author

ivan-aksamentov commented Dec 10, 2021

Alright, I sketched all of the changes that need to be ported in the Snakemake file, but haven't actually run it yet. There are probably mistakes there.

One deviation from my bash versions is that I upload the new alignment and mutation summary right after they are ready. This is probably not how it should be done in well-designed workflows, but it is convenient right now, because I simply don't know (yet) how to extend the "if there are new sequences" check until the end of the workflow. Alternatively, we could simply check for the presence of these files at the end of the workflow and upload them if they exist. But that solution is less reliable, as it may hide mistakes (cases where the files should have been generated but were not will silently succeed).

TODO:

  • run the Snakemake thing and see if it even runs with my changes
  • if not, eliminate the mistakes, make it run locally reliably
  • remove the bash versions of ingest scripts, resolve/silence shellcheck warnings
  • try to run on AWS Batch

@jameshadfield
Member

@ivan-aksamentov I've refactored this to use snakemake checkpoints. Let's watch the actions (genbank and gisaid) to see if it works as expected.

I have not reviewed this fully, but it looks generally good. Some things I noticed:

shellcheck is failing

Consistent filenames

It would help (me) a lot were we to systematise the filenames we use. My understanding, based on the PR, is that:

  • “old” means the existing files / data on S3.
  • “upd” refers to sequences / metadata present in this run but not present in the “old” version of that file
  • “new” means “old” + “upd”

However this isn't always the case, e.g., sometimes the middle one is called "new".

@ivan-aksamentov
Member Author

ivan-aksamentov commented Dec 15, 2021

Hey thanks @jameshadfield,

shellcheck is failing

Yeah, we need globbing in that case, so I think we just need to silence the warning on that line. But maybe there's some sort of syntax for making it proper. It's bash, so you never know :)

It would help (me) a lot were we to systematise the filenames we use.

Yeah, I wanted to do this as well. The reason why it's different for now is that it's different in the full-run script and on S3, so changing things might be a bit awkward. We need to go through these one by one and see what is affected.

Oh no! Looks like it failed
https://github.com/nextstrain/ncov-ingest/runs/4529141570?check_suite_focus=true#step:4:358

[batch] [Wed Dec 15 05:57:02 2021]
[batch] checkpoint get_sequences_without_nextclade_annotations:
[batch]     input: data/gisaid/sequences.fasta, data/gisaid/nextclade_old.tsv
[batch]     output: data/gisaid/nextclade.sequences.fasta
[batch]     jobid: 3
[batch] Downstream jobs will be updated after completion.
[batch]         ./bin/filter-fasta             --input_fasta=data/gisaid/sequences.fasta             --input_tsv=data/gisaid/nextclade_old.tsv             --output_fasta=data/gisaid/nextclade.sequences.fasta         
[batch] Updating job 1 (upload).
[batch] WorkflowError in line 423 of /nextstrain/build/Snakefile:
[batch] Can only use unpack() on list and dict

I will try to read and understand what's going on.

@ivan-aksamentov
Member Author

The "quotation mark" jobs also failed because, as I mentioned above, Nextclade accepts a comma-delimited gene list, but the mutation summary script only accepts a space-delimited list. If it gets the wrong format, this happens:

[batch]   File "./bin/mutation-summary", line 107, in <module>
[batch]     gene = (set(fname.split('.')) & genes).pop()
[batch] KeyError: 'pop from an empty set'

I made it space delimited. Actions:
https://github.com/nextstrain/ncov-ingest/actions/runs/1587080622
https://github.com/nextstrain/ncov-ingest/actions/runs/1587080625

We could also modify the mutation-summary script, but this is a bit out of scope for this PR and would turn an exact copy of this script from ncov into an "almost exact" copy, which feels awkward. What is the best way to "commonize" this functionality across repos (make it shared)?
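To illustrate the failure mode above: the script intersects the dot-separated filename components with the gene set, so a comma-joined gene list (a single token) never matches any component. A sketch of the same logic in shell, with assumed names:

```shell
fname="nextclade.gene.N.fasta"

match_genes() {
  # Emulates the script's `set(fname.split('.')) & genes`: succeed only if
  # some gene name appears as a dot-separated component of the filename.
  for g in $1; do
    case ".$fname." in *".$g."*) echo "$g"; return 0;; esac
  done
  return 1
}

match_genes "E M N"                      # space-delimited: prints N
match_genes "E,M,N" || echo "no match"   # one comma-joined token: fails
```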

@ivan-aksamentov
Member Author

ivan-aksamentov commented Dec 16, 2021

In the latest news, the runs with the

fix: use space-delimited list of genes for mutation-summary script
fix: use xz compression, not gz for mutation summary

succeeded for GenBank, but the GISAID download keeps failing today (Slack thread), so the result is inconclusive again.

@ivan-aksamentov
Member Author

ivan-aksamentov commented Dec 20, 2021

Alright, GISAID has now run successfully as well. I think this is ready to be reviewed.

I suggest we run this version in parallel to the master flow, and at the same time we could try to use the new results on a dev version of ncov, or perhaps on staging, to see how the new flow works end-to-end. Then, if it's good, we can make it the default.

Before merge, a full run should be performed to update the caches.

@ivan-aksamentov ivan-aksamentov requested a review from a team December 20, 2021 13:54
@ivan-aksamentov ivan-aksamentov marked this pull request as ready for review December 20, 2021 13:54
@ivan-aksamentov
Member Author

Last changes before merging:

Rename

  • nextclade.aligned.fasta -> aligned.fasta
  • nextclade.mutation_summary.tsv -> mutation-summary.tsv

i.e., no nextclade. prefix, and the mutation summary with a dash rather than an underscore. This is to align with the filenames on S3.

@ivan-aksamentov ivan-aksamentov merged commit 94332fd into master Jan 7, 2022
@ivan-aksamentov ivan-aksamentov deleted the feat/incremental-alignment branch January 7, 2022 06:20
@ivan-aksamentov
Member Author

ivan-aksamentov commented Jan 7, 2022

The plan for bringing this to production:

(Slack thread: https://bedfordlab.slack.com/archives/C01LCTT7JNN/p1641427855047000)

  • Make a Nextclade full run

    Done. AWS Batch Job IDs:

    b0d9fa0d-fdae-432a-8656-01435d1364f3
    ffcc6adc-6b6d-4f3c-bdc4-05a0a0d7e857
    

    Wait for result in

    s3://nextstrain-ncov-private/nextclade-full-run-2022-01-06--08-39-23--UTC/
    s3://nextstrain-data/files/ncov/open/nextclade-full-run-2022-01-06--08-39-22--UTC/
    

    Note that the full run was started before the renaming. So when copying, I will also need to rename (see below).

  • Copy results of a full run to the location of daily ingest, adjusting filenames to what's already on S3, and overwriting

    s3://nextstrain-ncov-private/nextclade-full-run-2022-01-06--08-39-23--UTC/nextclade.aligned.fasta.xz -> s3://nextstrain-ncov-private/aligned.fasta.xz
    
    s3://nextstrain-ncov-private/nextclade-full-run-2022-01-06--08-39-23--UTC/nextclade.mutation_summary.tsv.xz -> s3://nextstrain-ncov-private/mutation-summary.tsv.xz
    
    s3://nextstrain-ncov-private/nextclade-full-run-2022-01-06--08-39-23--UTC/nextclade.tsv.gz -> s3://nextstrain-ncov-private/nextclade.tsv.gz
    
    s3://nextstrain-data/files/ncov/open/nextclade-full-run-2022-01-06--08-39-22--UTC/nextclade.aligned.fasta.xz -> aligned.fasta.xz
    
    s3://nextstrain-data/files/ncov/open/nextclade-full-run-2022-01-06--08-39-22--UTC/nextclade.mutation_summary.tsv.xz -> s3://nextstrain-data/files/ncov/open/mutation-summary.tsv.xz
    
    s3://nextstrain-data/files/ncov/open/nextclade-full-run-2022-01-06--08-39-22--UTC/nextclade.tsv.gz -> s3://nextstrain-data/files/ncov/open/nextclade.tsv.gz
    
    
  • Merge this PR

  • Run "GISAID fetch and ingest" and "GenBank fetch and ingest" GitHub actions to perform daily ingest with the new code and fresh cache.

    Launched at 06:24 UTC

    https://github.com/nextstrain/ncov-ingest/actions/runs/1666402303
    https://github.com/nextstrain/ncov-ingest/actions/runs/1666402782

    AWS Batch Job IDs:

    045eaf99-4a4b-4186-a7bc-f8548a00638c
    7138b2aa-9702-4f1c-892c-425f2c19b00d
    

joverlee521 added a commit to nextstrain/ncov that referenced this pull request Oct 21, 2022
* move `aligned.fasta.xz` to list of `ncov-ingest` produced files since
this is now output from Nextclade in the daily ingest pipeline¹
* remove `mutation-summary.tsv.xz` since this is no longer updated in
the daily ingest pipeline²

¹ nextstrain/ncov-ingest#237
² nextstrain/ncov-ingest@05cd82f
joverlee521 pushed a commit that referenced this pull request Nov 3, 2022
Removed checkpoint because it only slows down the pipeline without
any of the past benefits. It was initially added because Nextclade v1
used to error on an empty FASTA input and we needed the checkpoint to
check if we should generate a new mutation summary¹. Now, running
Nextclade v2 on a no-op file takes ~1 minute and we no longer use the
mutation summary.

¹ #237 (comment)