WIP - snakemake v8+ remote storage support#89
jameshadfield wants to merge 7 commits into refactor-additional-inputs
Conversation
https://snakemake.readthedocs.io/en/stable/snakefiles/storage.html
https://snakemake.github.io/snakemake-plugin-catalog/plugins/storage/http.html
https://snakemake.github.io/snakemake-plugin-catalog/plugins/storage/s3.html
https://github.com/snakemake/snakemake-storage-plugin-s3/blob/main/snakemake_storage_plugin_s3/__init__.py
https://github.com/snakemake/snakemake-storage-plugin-http/blob/main/snakemake_storage_plugin_http/__init__.py
phylogenetic/rules/remote_files.smk
```python
# If the bucket is public then we make an unsigned request (i.e. no AWS credentials
# need to be set). Note that this won't work for uploads.
```
Maybe we should drop this special-treatment of known public buckets? It would simplify the code and let us avoid the force_signed argument. For dev purposes I find it frustrating to have to load credentials to fetch data in s3://nextstrain-data, but we could instead avoid this by using https://data.nextstrain.org/... URLs.
I'm open to public buckets being referred to by the https: URLs.
And then keeping s3 handling for less public (private? restricted use?) datasets in case that usecase shows up if it hasn't already.
+1 for removing the special treatment for public buckets.
I say follow our own guidance from ncov/remote_inputs docs:
If you’re running workflows on AWS or GCP compute that fetch this data, please use the S3 or GS URLs, respectively, for cheaper (for us) and faster (for you) data transfers. Otherwise, please use the https://data.nextstrain.org/ URLs.
Note that even though the s3://nextstrain-data/ and gs://nextstrain-data/ buckets are public, the defaults for most S3 and GS clients require some user to be authenticated, though the specific user/account doesn’t matter. In the rare case you need to access the S3 or GS buckets anonymously, the easiest way is to configure your inputs using https://nextstrain-data.s3.amazonaws.com/files/ncov/open/ or https://storage.googleapis.com/nextstrain-data/files/ncov/open/ URLs instead.
Hmm, I'd say that advice is conditioned on the situation where we have to be authenticated for public bucket requests. But this PR shows we don't. So if we pay a small code-complexity price we can get cheaper & faster transfers (S3) without the frustrating UX of needing to provide credentials. If it were only data downloading I'd keep this PR as is, it was only when testing/considering uploads that I started to question it. I'll think about this some more.
With the misleading errors around credentials, should we enforce credential requirements here, e.g.:
```python
assert "AWS_ACCESS_KEY_ID" in os.environ and "AWS_SECRET_ACCESS_KEY" in os.environ, \
    "Must set `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables for S3 files"
```
I've updated the code and I'm happy with the UX now. Here's the logic of when we use signed vs unsigned:
If the bucket is public then we may use an unsigned request, which has the nice UX of not needing credentials to be present. If we've made other signed requests or credentials are present then we just sign everything. This has implications for upload: if you attempt to upload to a public bucket without credentials then we allow that here, and you'll get a subsequent AccessDenied error when the upload is attempted.
and a table of situations:
| | S3 buckets | credentials present | credentials missing |
|---|---|---|---|
| download | private / private + public | signed | Error 1 |
| download | public | signed | unsigned |
| upload | private / private + public | signed | Error 1 |
| upload | public | signed | Error 2 |
Error 1:
AWS credentials are required to access <S3 URI>
Error 2:
WorkflowError:
Failed to store output in storage <S3 URI>
ClientError: An error occurred (AccessDenied) when calling the CreateBucket operation: Anonymous users cannot invoke this API. Please authenticate.
WorkflowError:
Failed to get mtime of <S3 URI>
ClientError: An error occurred (AccessDenied) when calling the ListObjects operation: Access Denied
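For illustration, the decision logic in the table above might be sketched like this. This is a hypothetical helper, not the PR's actual code: `PUBLIC_BUCKETS`, the function name, and the error string just mirror the behaviour described in the comment.

```python
import os

# Assumption: the set of buckets we treat as public (the PR mentions nextstrain-data).
PUBLIC_BUCKETS = {"nextstrain-data"}

def use_unsigned(bucket, env=os.environ):
    """Return True if we should make an unsigned (anonymous) S3 request.

    Credentials present -> always sign. Credentials missing -> unsigned is
    only allowed for known-public buckets; otherwise raise Error 1. Uploads
    attempted unsigned will still fail later with AccessDenied (Error 2).
    """
    have_creds = "AWS_ACCESS_KEY_ID" in env and "AWS_SECRET_ACCESS_KEY" in env
    if have_creds:
        return False
    if bucket not in PUBLIC_BUCKETS:
        raise RuntimeError(f"AWS credentials are required to access s3://{bucket}")
    return True
```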
Thanks for thinking this through @jameshadfield, I really like the outlined behavior!
Providing bad credentials produces slightly different errors, but they should be informative for the user:
$ nextstrain build --image nextstrain/base:branch-snakemake-v9 --env AWS_ACCESS_KEY_ID=foo --env AWS_SECRET_ACCESS_KEY=bar phylogenetic/ --forcerun merge_metadata
Assuming unrestricted shared filesystem usage.
host: 0f11191f17e0
Building DAG of jobs...
WorkflowError:
Failed to retrieve input from storage.
WorkflowError:
Failed to get mtime of s3://nextstrain-data/files/workflows/zika/metadata.tsv.zst
ClientError: An error occurred (InvalidAccessKeyId) when calling the ListObjects operation: The AWS Access Key Id you provided does not exist in our records.
$ nextstrain build --image nextstrain/base:branch-snakemake-v9 --env AWS_ACCESS_KEY_ID=foo --env AWS_SECRET_ACCESS_KEY=bar phylogenetic/ --forcerun filter
Assuming unrestricted shared filesystem usage.
host: 2d666b298294
Building DAG of jobs...
WorkflowError:
Failed to check existence of s3://nextstrain-data/files/workflows/zika/metadata.tsv.zst
ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
phylogenetic/rules/remote_files.smk
```python
if info.scheme in ['http', 'https']:
    return _storage_http(keep_local=keep_local)(uri)

# TODO XXX - Google? We allowed this in ncov <https://github.com/nextstrain/ncov/blob/41cf6470d3140963ff3e02c29241f80ae8ed9c33/workflow/snakemake_rules/remote_files.smk#L62>
```
+1 for supporting GS urls if possible (e.g. Terra users, etc)
+1, I've included the GS plugin (snakemake-storage-plugin-gcs) in nextstrain/docker-base#257
Sounds good - can one of you add it to this PR?
added in 70e16f8, but not sure how to test here...
@j23414 are you able to test the GS stuff? I presume Broad is mirroring the zika files?
Ah, seems like Broad is mirroring everything in nextstrain-data to GS, so I found
https://storage.googleapis.com/nextstrain-data/files/workflows/zika/metadata.tsv.zst
(gs://nextstrain-data/files/workflows/zika/metadata.tsv.zst) which we can use to test.
@j23414 and I discussed GS in person yesterday.
We ran into authentication errors that indicate you need to set up Application Default Credentials in order to use the GS plugin. I have not found any way of passing in credentials via envvars for this plugin.
Even the old Snakemake v7 GS provider required you to login via gcloud auth before running Snakemake. I'm not sure how well this ever worked with the Docker runtime. The gcloud auth command creates a JSON file that is stored at ~/.config/gcloud/application_default_credentials.json, which is not accessible in the docker runtime. You can use GOOGLE_APPLICATION_CREDENTIALS to define a separate location for the credentials file, so you could create the credentials JSON within the pathogen repo and pass the --env GOOGLE_APPLICATION_CREDENTIALS=... to use it in the Docker runtime, but I'm not sure how safe that is...
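The workaround mentioned above could be sketched roughly like this. To be clear, this is speculative: `GCP_CREDENTIALS_JSON` is an envvar name I'm inventing for illustration (not a Google or Snakemake convention), and as noted above, materialising credentials into a file has safety trade-offs.

```python
import os
import tempfile

def materialize_gcp_credentials(env):
    """Hypothetical workaround: if service-account JSON is supplied via an
    envvar (GCP_CREDENTIALS_JSON, an assumed name), write it to a temp file
    and point GOOGLE_APPLICATION_CREDENTIALS at it, so Application Default
    Credentials discovery can find it inside the Docker runtime."""
    raw = env.get("GCP_CREDENTIALS_JSON")
    if raw is None:
        return None
    fd, path = tempfile.mkstemp(suffix=".json")
    with os.fdopen(fd, "w") as f:
        f.write(raw)
    env["GOOGLE_APPLICATION_CREDENTIALS"] = path
    return path
```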
All that is to say, I'm walking back on my support for GS and think it's not worth the trouble.
Thanks for testing! I'm going to take this code over to ncov now and will remove GS. I'll add a specific conditional to catch the GS scheme (and HTTP) and raise an error telling users to get in touch with us, because we can add that functionality if pushed to do so.
phylogenetic/defaults/config.yaml
```yaml
metadata: "data/metadata_usvi.tsv"
sequences: "data/sequences_usvi.fasta"
# metadata: https://raw.githubusercontent.com/nextstrain/zika/refs/heads/main/phylogenetic/data/metadata_usvi.tsv
# sequences: https://raw.githubusercontent.com/nextstrain/zika/refs/heads/main/phylogenetic/data/sequences.fasta
```
However using raw.githubusercontent.com URLs didn't work (curling them was fine):
I'm surprised this didn't work, but in this case we can work around it by moving USVI data to a public s3 bucket (refer to it by http)
In case it helps, uploaded usvi to a public s3:
https://data.nextstrain.org/files/workflows/zika/sequences_usvi.fasta.zst
https://data.nextstrain.org/files/workflows/zika/metadata_usvi.tsv.zst
Happy to drop the ./data directory subsequently
Seems like this was a case of a typo resulting in an unexpected error? (I would have expected a 404 error.)
Works fine with
```diff
- # sequences: https://raw.githubusercontent.com/nextstrain/zika/refs/heads/main/phylogenetic/data/sequences.fasta
+ sequences: https://raw.githubusercontent.com/nextstrain/zika/refs/heads/main/phylogenetic/data/sequences_usvi.fasta
```
Typo indeed! Ok that fixes it, and I've pushed up a WIP commit which we can fix up later on: cc86bd7.
The error messaging around remote files (see main PR comment) doesn't seem ideal.
In case it helps, uploaded usvi to a public s3:
@j23414 can you test out those files and see if you can reproduce the error I'm getting:
$ curl -o metadata_usvi.tsv.zst https://data.nextstrain.org/files/workflows/zika/metadata_usvi.tsv.zst
$ zstd --decompress --stdout metadata_usvi.tsv.zst
[expected TSV output]
$ augur filter --metadata metadata_usvi.tsv.zst --output-metadata x
Traceback (most recent call last):
File "/Users/naboo/github/nextstrain/augur/augur/__init__.py", line 70, in run
return args.__command__.run(args)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/naboo/github/nextstrain/augur/augur/filter/__init__.py", line 161, in run
return _run(args)
^^^^^^^^^^
File "/Users/naboo/github/nextstrain/augur/augur/filter/_run.py", line 198, in run
metadata_reader = read_metadata(
^^^^^^^^^^^^^^
File "/Users/naboo/github/nextstrain/augur/augur/io/metadata.py", line 96, in read_metadata
metadata = pd.read_csv(
^^^^^^^^^^^^
File "/Users/naboo/miniconda3/envs/augur-dev-snakemake-v9/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
return _read(filepath_or_buffer, kwds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/naboo/miniconda3/envs/augur-dev-snakemake-v9/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 620, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/naboo/miniconda3/envs/augur-dev-snakemake-v9/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1620, in __init__
self._engine = self._make_engine(f, self.engine)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/naboo/miniconda3/envs/augur-dev-snakemake-v9/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1898, in _make_engine
return mapping[engine](f, **self.options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/naboo/miniconda3/envs/augur-dev-snakemake-v9/lib/python3.12/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 93, in __init__
self._reader = parsers.TextReader(src, **kwds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "parsers.pyx", line 574, in pandas._libs.parsers.TextReader.__cinit__
File "parsers.pyx", line 663, in pandas._libs.parsers.TextReader._get_header
File "parsers.pyx", line 874, in pandas._libs.parsers.TextReader._tokenize_rows
File "parsers.pyx", line 891, in pandas._libs.parsers.TextReader._check_tokenize_status
File "parsers.pyx", line 2053, in pandas._libs.parsers.raise_parser_error
zstd.ZstdError: zstd decompress error: Unknown frame descriptor
An error occurred (see above) that has not been properly handled by Augur.
To report this, please open a new issue including the original command and the error above:
<https://github.com/nextstrain/augur/issues/new/choose>

(I'm getting the same errors using the Snakemake HTTP storage plugin, but the above curl seemed simpler to detail.)
Huh, I'm getting the same error
...
zstd.ZstdError: zstd decompress error: Unknown frame descriptor
Versions below; I'm using the isolated conda runtime:
nextstrain shell .
augur --version
#> augur 31.2.0
exit
nextstrain --version
#> Nextstrain CLI 10.2.0 (standalone)
Okay, I have an idea what happened. I've uploaded new fixed metadata_usvi.tsv.zst and sequences_usvi.fasta.zst files and the error disappeared.
Originally I had tried uploading via
# upload uncompressed TSV
envdir ~/nextstrain/env.d/seasonal-flu/ \
nextstrain remote upload s3://nextstrain-data/files/workflows/zika/ metadata_usvi.tsv
# downloaded to verify
mkdir local; cd local
wget https://data.nextstrain.org/files/workflows/zika/metadata_usvi.tsv
less metadata_usvi.tsv
Realized that metadata_usvi.tsv had been zstd compressed during upload... geh
# renamed downloaded file to metadata_usvi.tsv.zst
mv metadata_usvi.tsv metadata_usvi.tsv.zst
# Uploaded the zst file
envdir ~/nextstrain/env.d/seasonal-flu/ \
nextstrain remote upload s3://nextstrain-data/files/workflows/zika/ metadata_usvi.tsv.zst
I should have locally compressed it and uploaded the compressed file directly:
zstd metadata_usvi.tsv
envdir ~/nextstrain/env.d/seasonal-flu/ \
nextstrain remote upload s3://nextstrain-data/files/workflows/zika/ metadata_usvi.tsv.zst
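When debugging this kind of "Unknown frame descriptor" error, a quick first check is whether the bytes on disk actually begin a Zstandard frame. Zstd frames start with the 4-byte magic number `28 B5 2F FD`, so a small sketch like this (not part of the workflow, just a diagnostic aid) can distinguish a genuine `.zst` file from, say, a plain-text or otherwise mangled file under the `.zst` extension:

```python
# Zstandard frames begin with the magic number 0xFD2FB528 (little-endian
# on disk: 28 B5 2F FD). Checking the first four bytes is a cheap way to
# tell whether a ".zst" file is really zstd-compressed.
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"

def looks_like_zstd(path):
    with open(path, "rb") as f:
        return f.read(4) == ZSTD_MAGIC
```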
Realized that metadata_usvi.tsv had been zstd compressed during upload... geh
Yeah, nextstrain upload does that automagically, which is often what we want (e.g. for auspice JSONs), but when it's not, it's always frustrating to debug! Thanks for fixing.
phylogenetic/rules/remote_files.smk
```python
if info.scheme=='s3':
    return _storage_s3(bucket=info.netloc, keep_local=keep_local, force_signed=force_signed)(uri)

if info.scheme in ['http', 'https']:
```
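The scheme dispatch in the snippet above can be sketched with `urllib.parse`. To keep the sketch self-contained it returns a tuple rather than calling the real `_storage_s3`/`_storage_http` helpers from `remote_files.smk`:

```python
from urllib.parse import urlparse

def dispatch(uri):
    """Sketch of path_or_url's scheme dispatch (illustrative only).

    Returns (kind, netloc, key) instead of invoking a storage helper."""
    info = urlparse(uri)
    if info.scheme == "s3":
        # bucket=info.netloc, object key = path without the leading slash
        return ("s3", info.netloc, info.path.lstrip("/"))
    if info.scheme in ("http", "https"):
        return ("http", info.netloc, info.path.lstrip("/"))
    # e.g. gs:// -- unsupported, raise rather than silently treating as local
    raise ValueError(f"Unsupported scheme {info.scheme!r} in {uri!r}")
```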
Do we need to support plain HTTP? Requiring HTTPS doesn't feel onerous and is kind of a security best-practice…
It's currently supported in ncov so I think we're trying to keep the functionality the same here.
...ok, but -- is it actually used in ncov? because this feels like a great time for a deprecation…
True! Especially since the Snakemake upgrade is a breaking change anyways...
The plugin requires a Google Cloud Project, and I'm unsure how that is usually set. Currently it can be set with the snakemake envvar `SNAKEMAKE_STORAGE_GCS_PROJECT`. https://snakemake.github.io/snakemake-plugin-catalog/plugins/storage/gcs.html#settings
I dug into the confusing
I've pushed this to a point where I am happy with the implementation and have done a fair bit of further testing today, including the S3 uploads functionality. I think a little extra testing from others would be great, especially of Google Storage. When we're happy with the functionality I'll squash all the commits and remove any commented-out code etc. ncov's adoption of Snakemake v9 is going to need this code, so I wonder if nextstrain/ncov#1180 should be the first place we implement this code (instead of in Zika), as it's a drop-in replacement of ncov's
Probably easier for development to add/implement in ncov first then move it out to shared and propagate as needed.
The previous version of `path_or_url` was not compatible with Snakemake v8, which implemented a storage plugin architecture [1]. This is a complete rewrite of the approach and was originally written for zika [2]. See [2] for more commentary as well as the docs added to `remote_files.smk`.

Google Storage URIs are not supported due to authn complexity, as discussed in [3], and HTTP support is dropped as requested in [4].

The default of `keep_local` is now True. Since all (n=3) existing usages of `path_or_url` specify `keep_local=True` this is consistent with previous behaviour.

[1] <https://snakemake.readthedocs.io/en/stable/snakefiles/storage.html>
[2] <nextstrain/zika#89>
[3] <nextstrain/zika#89 (comment)>
[4] <nextstrain/zika#89 (comment)>
I'm going to close this PR as the code (as of 5319043) is now part of nextstrain/ncov#1180. I think there are two paths we can take:
Or if we'd like to work on #80 right now:
I prefer the second one, so I've added the remote-storage code to @j23414's #80 in order to unblock progress there.
The code here is lifted directly from <nextstrain/ncov#1180> which itself was lifted from <#89>. See those PRs for in-depth discussions. The `remote_files.smk` file added here will be moved to the github.com/nextstrain/shared/ repo and vendored across pathogen repos, including this one. Assuming this doesn't change cf. the ncov implementation I think it's fine to have the file copy-pasted around a few repos before it arrives in the shared repo.
I originally implemented this in <nextstrain/zika#89> and that's where the bulk of the discussion and review happened. That PR wasn't merged and instead it was added to ncov via <nextstrain/ncov#1180>. The code in this commit is copied directly from <https://github.com/nextstrain/ncov/blob/137ffc479a1bfe4e2c6ae1bbcd1b22e3afc5f8dd/workflow/snakemake_rules/remote_files.smk> Closes #56
I originally implemented this in <nextstrain/zika#89> and that's where the bulk of the discussion and review happened. That PR wasn't merged and instead it was added to ncov via <nextstrain/ncov#1180>. The code in this commit is copied directly from <https://github.com/nextstrain/ncov/blob/137ffc479a1bfe4e2c6ae1bbcd1b22e3afc5f8dd/workflow/snakemake_rules/remote_files.smk> with some changes to comments in the code as applicable. Closes #56
Initial work to support remote files (S3, HTTP) using Snakemake storage support. The intention is for this to eventually live in `shared/` and end up in many repos. All contributions, testing etc welcome. Please feel free to add commits directly to this PR. We should probably add Google Storage, since it's in ncov?
Dependencies needed:
Where's data stored?
Within `.snakemake/storage/<tag-name>/`. This is a nicer solution than the current ncov version, which stores files in top-level directories that require git-ignoring. Here's the canonical curated data from our S3 bucket:

Goodbye ./data?

Since the fetched data is stored within `.snakemake` (see above), our workflows won't use `./data` directories any more. (Zika currently uses it for USVI data, but we'll change that as part of this PR or the parent PR.)

External analysis directories
These work out of the box, which is nice! The files are stored within the analysis directories' `.snakemake` folder, as expected.

S3

I've set up S3 to use different storage providers depending on whether we think credentials are needed or not, which mirrors our previous `aws s3 cp --no-sign-request` approach.

Missing credentials can be confusing
If credentials are required (an easy way to test this is to set `PUBLIC_BUCKETS` to an empty set so that we attempt to use credentials for the canonical nextstrain-data) then depending on how we invoke snakemake we get different errors. I believe this depends on whether the inputs of the target rule are fetched from remote storage, or if they TKTK

Case 1: The target rule doesn't itself use a remote storage file, but somewhere in the DAG we do fetch files from S3. E.g. define multiple inputs such that we fetch one set of data from S3 and another set from local USVI data. The behaviour here is what I would expect:

Case 2: The target rule uses `path_or_url` directly as an input. For instance, we run the `merge_metadata` rule from the default config, which sources a file directly from S3. Without credentials we get a very confusing "Missing input files" error:

If you have credentials then this works as expected (see case 1 above).
I presume the same confusion occurs with credentials which don't grant authz, but I haven't checked.
My hunch is that this behaviour happens because we aren't directly using `storage.<tag>` wrappers in the inputs/outputs of rules, but I haven't done much digging.

Is data re-fetched as needed?
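For context, the direct-wrapper form that hunch refers to would look something like this in Snakemake v8+ syntax. This is a sketch only: the `s3_signed` tag and the rule shown are illustrative, not the PR's code.

```python
# Snakefile sketch (Snakemake v8+): a tagged storage provider used directly
# in a rule's input, rather than via a path_or_url helper function.
storage s3_signed:
    provider="s3"

rule merge_metadata:
    input:
        metadata=storage.s3_signed("s3://nextstrain-data/files/workflows/zika/metadata.tsv.zst")
    output:
        "results/metadata_merged.tsv"
```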
We use `keep_local=True` so that we keep the downloaded files and thus don't have to re-fetch them upon each snakemake invocation. The presence of a mtime method in the S3 plugin implies that a cache check is made. This would mirror the behaviour in our ncov implementation. We should document how we can avoid this, because it can trigger a workflow re-run when not desired (common for dev work).

More credential strangeness

There's some subtlety I don't fully understand here. E.g. if you have successfully completed a credentialed fetch of the files (e.g. S3 sequences & metadata), and thus they're stored on disk at `.snakemake/storage/s3_signed`, and we then run another invocation without credentials:

- `snakemake --cores 1 -pf results/filtered.fasta` (no AWS credentials in env) works just fine, i.e. `rule filter` runs successfully.
- `snakemake --cores 1 -npf results/metadata_merged.tsv` (no AWS credentials in env) gives the same "Missing input files" error.

Uploading data
I don't know if we used this in ncov, but it could be helpful. Tested using the following rule:

This is the reason for the `force_signed` keyword argument, as uploading to (e.g.) nextstrain-data needs credentials but downloading doesn't.

HTTP[S]
Running with a single set of inputs from data.nextstrain.org (and no additional inputs) worked:

However using raw.githubusercontent.com URLs didn't work (curling them was fine). This was a typo, now fixed in this PR.

Uploading data

Didn't test! Seems possible via a POST, but ???
Google Storage
Todo?
File Compression
This isn't really about snakemake storage solutions at all, but we might as well sort it out here.
From some basic testing:
- `augur merge` does the right thing.
- `augur filter` does the right thing (for zst).

So assuming we switch from `seqkit` to `augur merge` for sequences (which will use... `seqkit`) then I think we support compression of remote files if `augur filter` and `augur merge` support it.
https://snakemake.readthedocs.io/en/stable/snakefiles/storage.html
https://snakemake.github.io/snakemake-plugin-catalog/plugins/storage/http.html
https://snakemake.github.io/snakemake-plugin-catalog/plugins/storage/s3.html
https://github.com/snakemake/snakemake-storage-plugin-s3/blob/main/snakemake_storage_plugin_s3/__init__.py
https://github.com/snakemake/snakemake-storage-plugin-http/blob/main/snakemake_storage_plugin_http/__init__.py