WIP - snakemake v8+ remote storage support #89

Closed
jameshadfield wants to merge 7 commits into refactor-additional-inputs from james/storage

Conversation

@jameshadfield (Member) commented Jul 4, 2025

Initial work to support remote files (S3, HTTP) using Snakemake storage support. The intention is for:

All contributions, testing, etc. are welcome. Please feel free to add commits directly to this PR. We should probably add Google Storage, since it's in ncov?

Dependencies needed:

  • snakemake-storage-plugin-s3
  • snakemake-storage-plugin-http
  • ? Google storage plugin
conda install -c conda-forge -c bioconda snakemake-storage-plugin-s3 snakemake-storage-plugin-http

Where's data stored?

Within .snakemake/storage/<tag-name>/. This is a nicer solution than the current ncov version, which stores files in top-level directories that require git-ignoring. Here's the canonical curated data from our S3 bucket:

$ tree .snakemake/storage
.snakemake/storage
├── s3
└── s3_unsigned
    └── nextstrain-data
        └── files
            └── workflows
                └── zika
                    ├── metadata.tsv.zst
                    └── sequences.fasta.zst

Goodbye ./data?

Since the fetched data is stored within .snakemake (see above), our workflows won't use ./data directories any more. (Zika currently uses it for USVI data, but we'll change that as part of this PR or the parent PR.)

External analysis directories

These work out of the box, which is nice! The files are stored within the analysis directory's .snakemake folder, as expected.

To test these you mustn't use the local-to-pathogen-repo USVI data (see the parent PR for details).

S3

I've set up S3 to use different storage providers depending on whether we think credentials are needed or not, which mirrors our previous aws s3 cp --no-sign-request approach.

Missing credentials can be confusing

If credentials are required (an easy way to test this is to set PUBLIC_BUCKETS to an empty set so that we attempt to use credentials for the canonical nextstrain-data) then, depending on how we invoke snakemake, we get different errors. I believe this depends on whether the inputs of the target rule are themselves fetched from remote storage, or if they TKTK

Case 1: The target rule doesn't itself use a remote storage file, but somewhere in the DAG we do fetch files from S3. E.g. define multiple inputs such that we fetch one set of data from S3 and another set from local USVI data. The behaviour here is what I would expect:

$ snakemake --cores 1 -pf results/filtered.fasta # no AWS credentials in env
...
WorkflowError:
Failed to check existence of s3://nextstrain-data/files/workflows/zika/sequences.fasta.zst
NoCredentialsError: Unable to locate credentials
$ envdir ... snakemake --cores 1 -pf results/filtered.fasta # AWS credentials provided
...
Retrieving from storage: s3://nextstrain-data/files/workflows/zika/metadata.tsv.zst
Finished retrieval.
Retrieving from storage: s3://nextstrain-data/files/workflows/zika/sequences.fasta.zst
Finished retrieval.
...
[Mon Jul  7 15:09:24 2025]
localrule merge_metadata:
    input: s3://nextstrain-data/files/workflows/zika/metadata.tsv.zst (retrieve from storage), data/metadata_usvi.tsv

Case 2: The target rule uses path_or_url directly as an input. For instance, we run the merge_metadata rule from the default config, which sources a file directly from S3.

Without credentials we get a very confusing Missing input files error:

$ snakemake --cores 1 -npf results/metadata_merged.tsv # no AWS credentials in env
...
MissingInputException in rule merge_metadata in file "/Users/naboo/github/nextstrain/zika/phylogenetic/rules/merge_inputs.smk", line 56:
Missing input files for rule merge_metadata:
    output: results/metadata_merged.tsv
    affected files:
        .snakemake/storage/s3_signed/nextstrain-data/files/workflows/zika/metadata.tsv.zst

If you have credentials then this works as expected (see case 1 above).

I presume the same confusion occurs with credentials which don't grant authz, but I haven't checked.
My hunch is that this behaviour happens because we aren't directly using storage.<tag> wrappers in inputs/outputs of rules, but I haven't done much digging.
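
For comparison, the explicit-wrapper form looks roughly like this under Snakemake v8's tagged storage syntax (a sketch only; the tag name and rule shown here are illustrative, not this PR's actual code, and keep_local mirrors this PR's usage):

```
# Snakefile sketch; tag name and rule are illustrative
storage s3_unsigned:
    provider="s3"

rule merge_metadata:
    input:
        metadata=storage.s3_unsigned(
            "s3://nextstrain-data/files/workflows/zika/metadata.tsv.zst",
            keep_local=True,
        )
```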

Is data re-fetched as needed?

We use keep_local=True so that we keep the downloaded files and thus don't have to re-fetch them upon each snakemake invocation. The presence of an mtime method in the S3 plugin implies that a cache check is made. This would mirror the behaviour in our ncov implementation. We should document how to avoid this check, because it can trigger a workflow re-run when not desired (common for dev work).

More credential strangeness

There's some subtlety I don't fully understand here. E.g. if you have successfully completed a credentialed fetch of the files (e.g. S3 sequences & metadata), so that they're stored on disk at .snakemake/storage/s3_signed, and then run another invocation without credentials:

  • Case 1 above - snakemake --cores 1 -pf results/filtered.fasta # no AWS credentials in env - works just fine, i.e. the rule filter runs successfully.
  • Case 2 above - snakemake --cores 1 -npf results/metadata_merged.tsv # no AWS credentials in env - we get the same Missing input files error.

Uploading data

I don't know if we used this in ncov, but it could be helpful. Tested using the following rule:

rule upload_filter:
    """TESTING ONLY TODO XXX REMOVE"""
    input: "results/filtered.fasta"
    output: path_or_url("s3://nextstrain-scratch/testing-zika-filtered.fasta")
    shell:
        r"""
        cp {input[0]} {output[0]}
        """
$ envdir ... snakemake --cores 1 -pf upload_filter # AWS credentials needed 
[Mon Jul  7 16:05:13 2025]
localrule upload_filter:
    input: results/filtered.fasta
    output: s3://nextstrain-scratch/testing-zika-filtered.fasta (send to storage)
    jobid: 0
    reason: Forced execution
    resources: tmpdir=/var/folders/j5/0z9xm99d1pz9nqlpgg9c6xqr0000gn/T
Shell command: 
        cp results/filtered.fasta .snakemake/storage/s3_signed/nextstrain-scratch/testing-zika-filtered.fasta
        
Storing in storage: s3://nextstrain-scratch/testing-zika-filtered.fasta

This is the reason for the force_signed keyword argument, as uploading to (e.g.) nextstrain-data needs credentials but downloading doesn't.

HTTP[S]

Running with a single set of inputs from data.nextstrain.org (and no additional inputs) worked:

# config.yaml
inputs:
  - name: ncbi
    metadata: "https://data.nextstrain.org/files/workflows/zika/metadata.tsv.zst"
    sequences: "https://data.nextstrain.org/files/workflows/zika/sequences.fasta.zst"

However, using raw.githubusercontent.com URLs didn't work (curling them was fine). This was a typo, now fixed in this PR.

Uploading data

Didn't test! Seems possible via a POST, but ???

Google Storage

Todo?

File Compression

This isn't really about snakemake storage solutions at all, but we might as well sort it out here.

From some basic testing:

  • Metadata:
    • If there are multiple metadata inputs, then they can be zst compressed and augur merge does the right thing.
    • If there's only one input then augur filter does the right thing (for zst)
  • Sequences:
    • Multiple sequences go through seqkit (soon to be augur merge?) which does the right thing for zst
    • One sequence input: augur filter does the right thing for zst

So assuming we switch from seqkit to augur merge for sequences (which will use... seqkit) then I think we support compression of remote files if augur filter and augur merge support it.

Relevant docs pages

https://snakemake.readthedocs.io/en/stable/snakefiles/storage.html
https://snakemake.github.io/snakemake-plugin-catalog/plugins/storage/http.html
https://snakemake.github.io/snakemake-plugin-catalog/plugins/storage/s3.html
https://github.com/snakemake/snakemake-storage-plugin-s3/blob/main/snakemake_storage_plugin_s3/__init__.py
https://github.com/snakemake/snakemake-storage-plugin-http/blob/main/snakemake_storage_plugin_http/__init__.py

Comment on lines +20 to +21
# If the bucket is public then we make an unsigned request (i.e. no AWS credentials
# need to be set). Note that this won't work for uploads.
Member Author

Maybe we should drop this special-treatment of known public buckets? It would simplify the code and let us avoid the force_signed argument. For dev purposes I find it frustrating to have to load credentials to fetch data in s3://nextstrain-data, but we could instead avoid this by using https://data.nextstrain.org/... URLs.

Contributor

I'm open to public buckets being referred to by the https: URLs.

And then keeping s3 handling for less public (private? restricted use?) datasets in case that usecase shows up if it hasn't already.

Contributor

+1 for removing the special treatment for public buckets.

I say follow our own guidance from ncov/remote_inputs docs:

If you’re running workflows on AWS or GCP compute that fetch this data, please use the S3 or GS URLs, respectively, for cheaper (for us) and faster (for you) data transfers. Otherwise, please use the https://data.nextstrain.org/ URLs.

Note that even though the s3://nextstrain-data/ and gs://nextstrain-data/ buckets are public, the defaults for most S3 and GS clients require some user to be authenticated, though the specific user/account doesn’t matter. In the rare case you need to access the S3 or GS buckets anonymously, the easiest way is to configure your inputs using https://nextstrain-data.s3.amazonaws.com/files/ncov/open/ or https://storage.googleapis.com/nextstrain-data/files/ncov/open/ URLs instead.

Member Author

Hmm, I'd say that advice is conditioned on the situation where we have to be authenticated for public bucket requests. But this PR shows we don't. So if we pay a small code-complexity price we can get cheaper & faster transfers (S3) without the frustrating UX of needing to provide credentials. If it were only data downloading I'd keep this PR as is, it was only when testing/considering uploads that I started to question it. I'll think about this some more.

Contributor

With the misleading errors around credentials, should we enforce credential requirements here, e.g.:

assert "AWS_ACCESS_KEY_ID" in os.environ and "AWS_SECRET_ACCESS_KEY" in os.environ, \
        "Must set `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables for S3 files"

Member Author

I've updated the code and I'm happy with the UX now. Here's the logic of when we use signed vs unsigned:

If the bucket is public then we may use an unsigned request which has the nice UX of not needing credentials to be present. If we've made other signed requests or credentials are present then we just sign everything. This has implications for upload: if you attempt to upload to a public bucket without credentials then we allow that here and you'll get a subsequent AccessDenied error when the upload is attempted.

and a table of situations:

              S3 buckets                    credentials present    credentials missing
  download    private / private + public    signed                 Error 1
              public                        signed                 unsigned
  upload      private / private + public    signed                 Error 1
              public                        signed                 Error 2

Error 1:

AWS credentials are required to access <S3 URI>

Error 2:

WorkflowError:
Failed to store output in storage <S3 URI>
ClientError: An error occurred (AccessDenied) when calling the CreateBucket operation: Anonymous users cannot invoke this API. Please authenticate.
WorkflowError:
Failed to get mtime of <S3 URI>
ClientError: An error occurred (AccessDenied) when calling the ListObjects operation: Access Denied
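
The logic above can be sketched as a small decision function (names like PUBLIC_BUCKETS and s3_request_mode are illustrative, not the PR's actual code, and the "other signed requests" condition is omitted for brevity):

```python
import os

PUBLIC_BUCKETS = {"nextstrain-data"}  # example contents

def s3_request_mode(bucket, env=None):
    """Return "signed" or "unsigned", raising Error 1 from the table above.
    Error 2 (an unsigned upload to a public bucket) is allowed through here
    and only fails later, at AWS, with AccessDenied."""
    env = os.environ if env is None else env
    have_creds = "AWS_ACCESS_KEY_ID" in env and "AWS_SECRET_ACCESS_KEY" in env
    if have_creds:
        return "signed"    # credentials present: sign everything
    if bucket in PUBLIC_BUCKETS:
        return "unsigned"  # public bucket, no credentials: unsigned request
    raise RuntimeError(f"AWS credentials are required to access s3://{bucket}")
```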

Contributor

Thanks for thinking this through @jameshadfield, I really like the outlined behavior!

Providing bad credentials produces slightly different errors, but they should be informative for the user:

$ nextstrain build --image nextstrain/base:branch-snakemake-v9 --env AWS_ACCESS_KEY_ID=foo --env AWS_SECRET_ACCESS_KEY=bar phylogenetic/ --forcerun merge_metadata
Assuming unrestricted shared filesystem usage.
host: 0f11191f17e0
Building DAG of jobs...
WorkflowError:
Failed to retrieve input from storage.
WorkflowError:
    Failed to get mtime of s3://nextstrain-data/files/workflows/zika/metadata.tsv.zst
    ClientError: An error occurred (InvalidAccessKeyId) when calling the ListObjects operation: The AWS Access Key Id you provided does not exist in our records.
$ nextstrain build --image nextstrain/base:branch-snakemake-v9 --env AWS_ACCESS_KEY_ID=foo --env AWS_SECRET_ACCESS_KEY=bar phylogenetic/ --forcerun filter
Assuming unrestricted shared filesystem usage.
host: 2d666b298294
Building DAG of jobs...
WorkflowError:
Failed to check existence of s3://nextstrain-data/files/workflows/zika/metadata.tsv.zst
ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden

if info.scheme in ['http', 'https']:
    return _storage_http(keep_local=keep_local)(uri)

# TODO XXX - Google? We allowed this in ncov <https://github.com/nextstrain/ncov/blob/41cf6470d3140963ff3e02c29241f80ae8ed9c33/workflow/snakemake_rules/remote_files.smk#L62>
Contributor

+1 for supporting GS urls if possible (e.g. Terra users, etc)

Contributor

+1, I've included the GS plugin (snakemake-storage-plugin-gcs) in nextstrain/docker-base#257

Member Author

Sounds good - can one of you add it to this PR?

Contributor

Added in 70e16f8, but I'm not sure how to test here...

Member Author

@j23414 are you able to test the GS stuff? I presume Broad is mirroring the zika files?

Contributor

Ah, seems like Broad is mirroring everything in nextstrain-data to GS, so I found
https://storage.googleapis.com/nextstrain-data/files/workflows/zika/metadata.tsv.zst
(gs://nextstrain-data/files/workflows/zika/metadata.tsv.zst) which we can use to test.

Contributor

@j23414 and I discussed GS in person yesterday.

We ran into authentication errors indicating that you need to set up Application Default Credentials in order to use the GS plugin. I have not found any way of passing in credentials via envvars for this plugin.

Even the old Snakemake v7 GS provider required you to log in via gcloud auth before running Snakemake. I'm not sure how well this ever worked with the Docker runtime. The gcloud auth command creates a JSON file stored at ~/.config/gcloud/application_default_credentials.json, which is not accessible in the Docker runtime. You can use GOOGLE_APPLICATION_CREDENTIALS to define a separate location for the credentials file, so you could create the credentials JSON within the pathogen repo and pass --env GOOGLE_APPLICATION_CREDENTIALS=... to use it in the Docker runtime, but I'm not sure how safe that is...

All that is to say, I'm walking back on my support for GS and think it's not worth the trouble.

Member Author

Thanks for testing! I'm going to take this code over to ncov now and will remove GS. I'll add a specific conditional to catch the GS scheme (and HTTP) and raise an error telling users to get in touch with us, because we can add that functionality if pushed to do so.

Member Author

metadata: "data/metadata_usvi.tsv"
sequences: "data/sequences_usvi.fasta"
# metadata: https://raw.githubusercontent.com/nextstrain/zika/refs/heads/main/phylogenetic/data/metadata_usvi.tsv
# sequences: https://raw.githubusercontent.com/nextstrain/zika/refs/heads/main/phylogenetic/data/sequences.fasta
Contributor

However using raw.githubusercontent.com URLs didn't work (curling them was fine):

I'm surprised this didn't work, but in this case we can work around it by moving USVI data to a public s3 bucket (refer to it by http)

Contributor

In case it helps, I've uploaded the USVI data to a public S3 bucket:

https://data.nextstrain.org/files/workflows/zika/sequences_usvi.fasta.zst
https://data.nextstrain.org/files/workflows/zika/metadata_usvi.tsv.zst

Happy to drop the ./data directory subsequently.

Contributor

Seems like this was a case of a typo resulting in an unexpected error? (I would have expected a 404 error.)

Works fine with

Suggested change
# sequences: https://raw.githubusercontent.com/nextstrain/zika/refs/heads/main/phylogenetic/data/sequences.fasta
sequences: https://raw.githubusercontent.com/nextstrain/zika/refs/heads/main/phylogenetic/data/sequences_usvi.fasta

Member Author

Typo indeed! Ok that fixes it, and I've pushed up a WIP commit which we can fix up later on: cc86bd7.

The error messaging around remote files (see main PR comment) doesn't seem ideal.

@jameshadfield (Member Author) commented Jul 9, 2025

In case it helps, uploaded usvi to a public s3:

@j23414 can you test out those files and see if you can reproduce the error I'm getting:

$ curl  -o metadata_usvi.tsv.zst https://data.nextstrain.org/files/workflows/zika/metadata_usvi.tsv.zst

$ zstd --decompress --stdout metadata_usvi.tsv.zst
[expected TSV output]

$ augur filter --metadata metadata_usvi.tsv.zst --output-metadata x
Traceback (most recent call last):
  File "/Users/naboo/github/nextstrain/augur/augur/__init__.py", line 70, in run
    return args.__command__.run(args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/naboo/github/nextstrain/augur/augur/filter/__init__.py", line 161, in run
    return _run(args)
           ^^^^^^^^^^
  File "/Users/naboo/github/nextstrain/augur/augur/filter/_run.py", line 198, in run
    metadata_reader = read_metadata(
                      ^^^^^^^^^^^^^^
  File "/Users/naboo/github/nextstrain/augur/augur/io/metadata.py", line 96, in read_metadata
    metadata = pd.read_csv(
               ^^^^^^^^^^^^
  File "/Users/naboo/miniconda3/envs/augur-dev-snakemake-v9/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
    return _read(filepath_or_buffer, kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/naboo/miniconda3/envs/augur-dev-snakemake-v9/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 620, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/naboo/miniconda3/envs/augur-dev-snakemake-v9/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1620, in __init__
    self._engine = self._make_engine(f, self.engine)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/naboo/miniconda3/envs/augur-dev-snakemake-v9/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1898, in _make_engine
    return mapping[engine](f, **self.options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/naboo/miniconda3/envs/augur-dev-snakemake-v9/lib/python3.12/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 93, in __init__
    self._reader = parsers.TextReader(src, **kwds)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "parsers.pyx", line 574, in pandas._libs.parsers.TextReader.__cinit__
  File "parsers.pyx", line 663, in pandas._libs.parsers.TextReader._get_header
  File "parsers.pyx", line 874, in pandas._libs.parsers.TextReader._tokenize_rows
  File "parsers.pyx", line 891, in pandas._libs.parsers.TextReader._check_tokenize_status
  File "parsers.pyx", line 2053, in pandas._libs.parsers.raise_parser_error
zstd.ZstdError: zstd decompress error: Unknown frame descriptor


An error occurred (see above) that has not been properly handled by Augur.
To report this, please open a new issue including the original command and the error above:
    <https://github.com/nextstrain/augur/issues/new/choose>

(I'm getting the same errors using Snakemake HTTP storage plugin, but the above curl seemed simpler to detail)

Contributor

Huh, I'm getting the same error

...
zstd.ZstdError: zstd decompress error: Unknown frame descriptor

Versions below, I'm using isolated conda runtime:

nextstrain shell .
augur --version
#> augur 31.2.0

exit

nextstrain --version
#> Nextstrain CLI 10.2.0 (standalone)

@j23414 (Contributor) commented Jul 9, 2025

Okay, I have an idea what happened. I've uploaded new fixed metadata_usvi.tsv.zst and sequences_usvi.fasta.zst files and the error disappeared.

Originally I had tried uploading via

# upload uncompressed TSV
envdir ~/nextstrain/env.d/seasonal-flu/ \
  nextstrain remote upload s3://nextstrain-data/files/workflows/zika/ metadata_usvi.tsv

# downloaded to verify
mkdir local; cd local
wget https://data.nextstrain.org/files/workflows/zika/metadata_usvi.tsv
less metadata_usvi.tsv

Realized that metadata_usvi.tsv had been zstd compressed during upload... geh

# renamed downloaded file to metadata_usvi.tsv.zst
mv metadata_usvi.tsv metadata_usvi.tsv.zst

# Uploaded the zst file
envdir ~/nextstrain/env.d/seasonal-flu/ \
  nextstrain remote upload s3://nextstrain-data/files/workflows/zika/ metadata_usvi.tsv.zst

I should have compressed locally and uploaded it directly:

zstd metadata_usvi.tsv
envdir ~/nextstrain/env.d/seasonal-flu/ \
  nextstrain remote upload s3://nextstrain-data/files/workflows/zika/ metadata_usvi.tsv.zst

Member Author

Realized that metadata_usvi.tsv had been zstd compressed during upload... geh

Yeah, nextstrain upload does that automagically, which is often what we want (e.g. for auspice JSONs), but when it's not, it's always frustrating to debug! Thanks for fixing.

if info.scheme == 's3':
    return _storage_s3(bucket=info.netloc, keep_local=keep_local, force_signed=force_signed)(uri)

if info.scheme in ['http', 'https']:
Contributor

Do we need to support plain HTTP? Requiring HTTPS doesn't feel onerous and is kind of a security best-practice.

Contributor

It's currently supported in ncov so I think we're trying to keep the functionality the same here.

Contributor

...ok, but -- is it actually used in ncov? because this feels like a great time for a deprecation…

Contributor

True! Especially since the Snakemake upgrade is a breaking change anyways...

Member Author

The plugin requires a Google Cloud Project, and I'm unsure how that's usually set. Currently it can be set with the snakemake envvar `SNAKEMAKE_STORAGE_GCS_PROJECT`.

https://snakemake.github.io/snakemake-plugin-catalog/plugins/storage/gcs.html#settings
@jameshadfield (Member Author)

I dug into the confusing MissingInputException error in snakemake/snakemake#3663. Hopefully we (or someone else) can fix it in Snakemake. I think it's something we can live with until such time as it's fixed upstream.

@jameshadfield commented Jul 9, 2025

I've pushed this to a point where I'm happy with the implementation, and I've done a fair bit of further testing today, including the S3 upload functionality. I think a little extra testing from others would be great, especially of Google Storage. When we're happy with the functionality I'll squash all the commits and remove any commented-out code etc.

nCoV's adoption of Snakemake v9 is going to need this code, so I wonder if nextstrain/ncov#1180 should be the first place we implement it (instead of in Zika), as it's a drop-in replacement for nCoV's remote_files.smk; we'd subsequently move it to shared and then Zika can use it from there. Alternatively we could put it in shared first and then have nCoV use that. @joverlee521 thoughts?

@joverlee521 (Contributor)

nCoV's adoption of Snakemake v9 is going to need this code, so I wonder if nextstrain/ncov#1180 should be the first place we implement it (instead of in Zika), as it's a drop-in replacement for nCoV's remote_files.smk; we'd subsequently move it to shared and then Zika can use it from there. Alternatively we could put it in shared first and then have nCoV use that. @joverlee521 thoughts?

Probably easier for development to add/implement in ncov first then move it out to shared and propagate as needed.

jameshadfield added a commit to nextstrain/ncov that referenced this pull request Jul 11, 2025
The previous version of `path_or_url` was not compatible with Snakemake
v8, which implemented a storage plugin architecture [1]. This is a
complete rewrite of the approach and was originally written for zika
[2]. See [2] for more commentary as well as the docs added to
`remote_files.smk`

Google Storage URIs are not supported due to authn complexity, as
discussed in [3] and HTTP support is dropped as requested in [4].

The default of `keep_local` is now True. Since all (n=3) existing usages
of `path_or_url` specify `keep_local=True` this is consistent with
previous behaviour.

[1] <https://snakemake.readthedocs.io/en/stable/snakefiles/storage.html>
[2] <nextstrain/zika#89>
[3] <nextstrain/zika#89 (comment)>
[4] <nextstrain/zika#89 (comment)>
@jameshadfield commented Jul 11, 2025

I'm going to close this PR as the code (as of 5319043) is now part of nextstrain/ncov#1180. I think there are two paths we can take:

  1. release updated runtimes (using snakemake v9)
  2. merge Snakemake v9 support ncov#1180
  3. copy ncov's (new!) remote_files.smk into nextstrain/shared
  4. in parallel:

Or if we'd like to work on #80 right now:

  1. copy the remote_files.smk into Allow for multiple inputs from the config file #80 right now and continue working on that PR
  2. release updated runtimes (using snakemake v9)
  3. merge Snakemake v9 support ncov#1180 and Allow for multiple inputs from the config file #80
  4. copy ncov's (new!) remote_files.smk into nextstrain/shared
  5. in parallel, update {zika,ncov} to use the new shared code

I prefer the second one, so I've added the remote-storage code to @j23414's #80 in order to unblock progress there.

jameshadfield added a commit that referenced this pull request Jul 11, 2025
The code here is lifted directly from <nextstrain/ncov#1180>
which itself was lifted from <#89>.
See those PRs for in-depth discussions.

The `remote_files.smk` file added here will be moved to the
github.com/nextstrain/shared/ repo and vendored across pathogen repos,
including this one. Assuming this doesn't change cf. the ncov
implementation I think it's fine to have the file copy-pasted around a
few repos before it arrives in the shared repo.
j23414 pushed a commit that referenced this pull request Jul 14, 2025
jameshadfield added a commit to nextstrain/shared that referenced this pull request Jul 20, 2025
I originally implemented this in <nextstrain/zika#89>
and that's where the bulk of the discussion and review happened. That
PR wasn't merged and instead it was added to ncov via
<nextstrain/ncov#1180>. The code in this commit
is copied directly from <https://github.com/nextstrain/ncov/blob/137ffc479a1bfe4e2c6ae1bbcd1b22e3afc5f8dd/workflow/snakemake_rules/remote_files.smk>

Closes #56
jameshadfield added a commit to nextstrain/shared that referenced this pull request Jul 20, 2025
jameshadfield added a commit to nextstrain/shared that referenced this pull request Jul 21, 2025
I originally implemented this in <nextstrain/zika#89>
and that's where the bulk of the discussion and review happened. That
PR wasn't merged and instead it was added to ncov via
<nextstrain/ncov#1180>. The code in this commit
is copied directly from <https://github.com/nextstrain/ncov/blob/137ffc479a1bfe4e2c6ae1bbcd1b22e3afc5f8dd/workflow/snakemake_rules/remote_files.smk>
with some changes to comments in the code as applicable.

Closes #56
j23414 pushed a commit that referenced this pull request Jul 21, 2025
@victorlin victorlin deleted the james/storage branch July 31, 2025 21:51
@victorlin victorlin restored the james/storage branch July 31, 2025 21:54

4 participants