WIP - snakemake v8+ remote storage support#89
jameshadfield wants to merge 7 commits into refactor-additional-inputs
Conversation
https://snakemake.readthedocs.io/en/stable/snakefiles/storage.html
https://snakemake.github.io/snakemake-plugin-catalog/plugins/storage/http.html
https://snakemake.github.io/snakemake-plugin-catalog/plugins/storage/s3.html
https://github.com/snakemake/snakemake-storage-plugin-s3/blob/main/snakemake_storage_plugin_s3/__init__.py
https://github.com/snakemake/snakemake-storage-plugin-http/blob/main/snakemake_storage_plugin_http/__init__.py
phylogenetic/rules/remote_files.smk
```python
# If the bucket is public then we make an unsigned request (i.e. no AWS credentials
# need to be set). Note that this won't work for uploads.
```
Maybe we should drop this special-treatment of known public buckets? It would simplify the code and let us avoid the force_signed argument. For dev purposes I find it frustrating to have to load credentials to fetch data in s3://nextstrain-data, but we could instead avoid this by using https://data.nextstrain.org/... URLs.
I'm open to public buckets being referred to by the https: URLs.
And then keeping s3 handling for less public (private? restricted use?) datasets in case that usecase shows up if it hasn't already.
+1 for removing the special treatment for public buckets.
I say follow our own guidance from ncov/remote_inputs docs:
If you’re running workflows on AWS or GCP compute that fetch this data, please use the S3 or GS URLs, respectively, for cheaper (for us) and faster (for you) data transfers. Otherwise, please use the https://data.nextstrain.org/ URLs.
Note that even though the s3://nextstrain-data/ and gs://nextstrain-data/ buckets are public, the defaults for most S3 and GS clients require some user to be authenticated, though the specific user/account doesn’t matter. In the rare case you need to access the S3 or GS buckets anonymously, the easiest way is to configure your inputs using https://nextstrain-data.s3.amazonaws.com/files/ncov/open/ or https://storage.googleapis.com/nextstrain-data/files/ncov/open/ URLs instead.
Hmm, I'd say that advice is conditioned on the situation where we have to be authenticated for public bucket requests. But this PR shows we don't. So if we pay a small code-complexity price we can get cheaper & faster transfers (S3) without the frustrating UX of needing to provide credentials. If it were only data downloading I'd keep this PR as is, it was only when testing/considering uploads that I started to question it. I'll think about this some more.
With the misleading errors around credentials, should we enforce credential requirements here, e.g.:
```python
assert "AWS_ACCESS_KEY_ID" in os.environ and "AWS_SECRET_ACCESS_KEY" in os.environ, \
    "Must set `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables for S3 files"
```
I've updated the code and I'm happy with the UX now. Here's the logic of when we use signed vs unsigned:
If the bucket is public then we may use an unsigned request, which has the nice UX of not needing credentials to be present. If we've made other signed requests or credentials are present then we just sign everything. This has implications for upload: if you attempt to upload to a public bucket without credentials then we allow that here, and you'll get a subsequent AccessDenied error when the upload is attempted.
and a table of situations:
| | S3 buckets | credentials present | credentials missing |
|---|---|---|---|
| download | private / private + public | signed | Error 1 |
| download | public | signed | unsigned |
| upload | private / private + public | signed | Error 1 |
| upload | public | signed | Error 2 |
Error 1:
AWS credentials are required to access <S3 URI>
Error 2:
WorkflowError:
Failed to store output in storage <S3 URI>
ClientError: An error occurred (AccessDenied) when calling the CreateBucket operation: Anonymous users cannot invoke this API. Please authenticate.
WorkflowError:
Failed to get mtime of <S3 URI>
ClientError: An error occurred (AccessDenied) when calling the ListObjects operation: Access Denied
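For illustration, the decision logic in the table above might be sketched like this. This is a hypothetical helper, not the PR's actual code: `PUBLIC_BUCKETS`, the function name, and the error string just mirror the behaviour described in the comment.

```python
import os

# Assumption: the set of buckets we treat as public (the PR mentions nextstrain-data).
PUBLIC_BUCKETS = {"nextstrain-data"}

def use_unsigned(bucket, env=os.environ):
    """Return True if we should make an unsigned (anonymous) S3 request.

    Credentials present -> always sign. Credentials missing -> unsigned is
    only allowed for known-public buckets; otherwise raise Error 1. Uploads
    attempted unsigned will still fail later with AccessDenied (Error 2).
    """
    have_creds = "AWS_ACCESS_KEY_ID" in env and "AWS_SECRET_ACCESS_KEY" in env
    if have_creds:
        return False
    if bucket not in PUBLIC_BUCKETS:
        raise RuntimeError(f"AWS credentials are required to access s3://{bucket}")
    return True
```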
Thanks for thinking this through @jameshadfield, I really like the outlined behavior!
Providing bad credentials produces slightly different errors, but they should be informative for the user:
$ nextstrain build --image nextstrain/base:branch-snakemake-v9 --env AWS_ACCESS_KEY_ID=foo --env AWS_SECRET_ACCESS_KEY=bar phylogenetic/ --forcerun merge_metadata
Assuming unrestricted shared filesystem usage.
host: 0f11191f17e0
Building DAG of jobs...
WorkflowError:
Failed to retrieve input from storage.
WorkflowError:
Failed to get mtime of s3://nextstrain-data/files/workflows/zika/metadata.tsv.zst
ClientError: An error occurred (InvalidAccessKeyId) when calling the ListObjects operation: The AWS Access Key Id you provided does not exist in our records.
$ nextstrain build --image nextstrain/base:branch-snakemake-v9 --env AWS_ACCESS_KEY_ID=foo --env AWS_SECRET_ACCESS_KEY=bar phylogenetic/ --forcerun filter
Assuming unrestricted shared filesystem usage.
host: 2d666b298294
Building DAG of jobs...
WorkflowError:
Failed to check existence of s3://nextstrain-data/files/workflows/zika/metadata.tsv.zst
ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
phylogenetic/rules/remote_files.smk
```python
if info.scheme in ['http', 'https']:
    return _storage_http(keep_local=keep_local)(uri)

# TODO XXX - Google? We allowed this in ncov <https://github.com/nextstrain/ncov/blob/41cf6470d3140963ff3e02c29241f80ae8ed9c33/workflow/snakemake_rules/remote_files.smk#L62>
```
+1 for supporting GS urls if possible (e.g. Terra users, etc)
+1, I've included the GS plugin (snakemake-storage-plugin-gcs) in nextstrain/docker-base#257
Sounds good - can one of you add it to this PR?
added in 70e16f8, but not sure how to test here...
@j23414 are you able to test the GS stuff? I presume Broad is mirroring the zika files?
Ah, seems like Broad is mirroring everything in nextstrain-data to GS, so I found
https://storage.googleapis.com/nextstrain-data/files/workflows/zika/metadata.tsv.zst
(gs://nextstrain-data/files/workflows/zika/metadata.tsv.zst) which we can use to test.
@j23414 and I discussed GS in person yesterday.
We ran into authentication errors that indicate you need to set up Application Default Credentials in order to use the GS plugin. I have not found any way of passing in credentials via envvars for this plugin.
Even the old Snakemake v7 GS provider required you to login via gcloud auth before running Snakemake. I'm not sure how well this ever worked with the Docker runtime. The gcloud auth command creates a JSON file that is stored at ~/.config/gcloud/application_default_credentials.json, which is not accessible in the docker runtime. You can use GOOGLE_APPLICATION_CREDENTIALS to define a separate location for the credentials file, so you could create the credentials JSON within the pathogen repo and pass the --env GOOGLE_APPLICATION_CREDENTIALS=... to use it in the Docker runtime, but I'm not sure how safe that is...
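The workaround mentioned above could be sketched roughly like this. To be clear, this is speculative: `GCP_CREDENTIALS_JSON` is an envvar name I'm inventing for illustration (not a Google or Snakemake convention), and as noted above, materialising credentials into a file has safety trade-offs.

```python
import os
import tempfile

def materialize_gcp_credentials(env):
    """Hypothetical workaround: if service-account JSON is supplied via an
    envvar (GCP_CREDENTIALS_JSON, an assumed name), write it to a temp file
    and point GOOGLE_APPLICATION_CREDENTIALS at it, so Application Default
    Credentials discovery can find it inside the Docker runtime."""
    raw = env.get("GCP_CREDENTIALS_JSON")
    if raw is None:
        return None
    fd, path = tempfile.mkstemp(suffix=".json")
    with os.fdopen(fd, "w") as f:
        f.write(raw)
    env["GOOGLE_APPLICATION_CREDENTIALS"] = path
    return path
```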
All that is to say, I'm walking back on my support for GS and think it's not worth the trouble.
Thanks for testing! I'm going to take this code over to ncov now and will remove GS. I'll add a specific conditional to catch the GS scheme (and HTTP) and raise an error telling users to get in touch with us, because we can add that functionality if pushed to do so.
phylogenetic/defaults/config.yaml
```yaml
metadata: "data/metadata_usvi.tsv"
sequences: "data/sequences_usvi.fasta"
# metadata: https://raw.githubusercontent.com/nextstrain/zika/refs/heads/main/phylogenetic/data/metadata_usvi.tsv
# sequences: https://raw.githubusercontent.com/nextstrain/zika/refs/heads/main/phylogenetic/data/sequences.fasta
```
However using raw.githubusercontent.com URLs didn't work (curling them was fine):
I'm surprised this didn't work, but in this case we can work around it by moving USVI data to a public s3 bucket (refer to it by http)
In case it helps, uploaded usvi to a public s3:
https://data.nextstrain.org/files/workflows/zika/sequences_usvi.fasta.zst
https://data.nextstrain.org/files/workflows/zika/metadata_usvi.tsv.zst
Happy to drop the ./data directory subsequently
Seems like this was a case of a typo resulting in an unexpected error? (I would have expected a 404 error.)
Works fine with
```diff
- # sequences: https://raw.githubusercontent.com/nextstrain/zika/refs/heads/main/phylogenetic/data/sequences.fasta
+ sequences: https://raw.githubusercontent.com/nextstrain/zika/refs/heads/main/phylogenetic/data/sequences_usvi.fasta
```
Typo indeed! Ok that fixes it, and I've pushed up a WIP commit which we can fix up later on: cc86bd7.
The error messaging around remote files (see main PR comment) doesn't seem ideal.
In case it helps, uploaded usvi to a public s3:
@j23414 can you test out those files and see if you can reproduce the error I'm getting:
$ curl -o metadata_usvi.tsv.zst https://data.nextstrain.org/files/workflows/zika/metadata_usvi.tsv.zst
$ zstd --decompress --stdout metadata_usvi.tsv.zst
[expected TSV output]
$ augur filter --metadata metadata_usvi.tsv.zst --output-metadata x
Traceback (most recent call last):
File "/Users/naboo/github/nextstrain/augur/augur/__init__.py", line 70, in run
return args.__command__.run(args)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/naboo/github/nextstrain/augur/augur/filter/__init__.py", line 161, in run
return _run(args)
^^^^^^^^^^
File "/Users/naboo/github/nextstrain/augur/augur/filter/_run.py", line 198, in run
metadata_reader = read_metadata(
^^^^^^^^^^^^^^
File "/Users/naboo/github/nextstrain/augur/augur/io/metadata.py", line 96, in read_metadata
metadata = pd.read_csv(
^^^^^^^^^^^^
File "/Users/naboo/miniconda3/envs/augur-dev-snakemake-v9/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1026, in read_csv
return _read(filepath_or_buffer, kwds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/naboo/miniconda3/envs/augur-dev-snakemake-v9/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 620, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/naboo/miniconda3/envs/augur-dev-snakemake-v9/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1620, in __init__
self._engine = self._make_engine(f, self.engine)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/naboo/miniconda3/envs/augur-dev-snakemake-v9/lib/python3.12/site-packages/pandas/io/parsers/readers.py", line 1898, in _make_engine
return mapping[engine](f, **self.options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/naboo/miniconda3/envs/augur-dev-snakemake-v9/lib/python3.12/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 93, in __init__
self._reader = parsers.TextReader(src, **kwds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "parsers.pyx", line 574, in pandas._libs.parsers.TextReader.__cinit__
File "parsers.pyx", line 663, in pandas._libs.parsers.TextReader._get_header
File "parsers.pyx", line 874, in pandas._libs.parsers.TextReader._tokenize_rows
File "parsers.pyx", line 891, in pandas._libs.parsers.TextReader._check_tokenize_status
File "parsers.pyx", line 2053, in pandas._libs.parsers.raise_parser_error
zstd.ZstdError: zstd decompress error: Unknown frame descriptor
An error occurred (see above) that has not been properly handled by Augur.
To report this, please open a new issue including the original command and the error above:
<https://github.com/nextstrain/augur/issues/new/choose>

(I'm getting the same errors using the Snakemake HTTP storage plugin, but the above curl seemed simpler to detail.)
Huh, I'm getting the same error
...
zstd.ZstdError: zstd decompress error: Unknown frame descriptor
Versions below; I'm using the isolated conda runtime:
nextstrain shell .
augur --version
#> augur 31.2.0
exit
nextstrain --version
#> Nextstrain CLI 10.2.0 (standalone)
Okay, I have an idea what happened. I've uploaded new fixed metadata_usvi.tsv.zst and sequences_usvi.fasta.zst files and the error disappeared.
Originally I had tried uploading via
# upload uncompressed TSV
envdir ~/nextstrain/env.d/seasonal-flu/ \
nextstrain remote upload s3://nextstrain-data/files/workflows/zika/ metadata_usvi.tsv
# downloaded to verify
mkdir local; cd local
wget https://data.nextstrain.org/files/workflows/zika/metadata_usvi.tsv
less metadata_usvi.tsv
Realized that metadata_usvi.tsv had been zstd compressed during upload... geh
# renamed downloaded file to metadata_usvi.tsv.zst
mv metadata_usvi.tsv metadata_usvi.tsv.zst
# Uploaded the zst file
envdir ~/nextstrain/env.d/seasonal-flu/ \
nextstrain remote upload s3://nextstrain-data/files/workflows/zika/ metadata_usvi.tsv.zst
I should have locally compressed it and uploaded the compressed file directly:
zstd metadata_usvi.tsv
envdir ~/nextstrain/env.d/seasonal-flu/ \
nextstrain remote upload s3://nextstrain-data/files/workflows/zika/ metadata_usvi.tsv.zst
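When debugging this kind of "Unknown frame descriptor" error, a quick first check is whether the bytes on disk actually begin a Zstandard frame. Zstd frames start with the 4-byte magic number `28 B5 2F FD`, so a small sketch like this (not part of the workflow, just a diagnostic aid) can distinguish a genuine `.zst` file from, say, a plain-text or otherwise mangled file under the `.zst` extension:

```python
# Zstandard frames begin with the magic number 0xFD2FB528 (little-endian
# on disk: 28 B5 2F FD). Checking the first four bytes is a cheap way to
# tell whether a ".zst" file is really zstd-compressed.
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"

def looks_like_zstd(path):
    with open(path, "rb") as f:
        return f.read(4) == ZSTD_MAGIC
```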
Realized that metadata_usvi.tsv had been zstd compressed during upload... geh
Yeah, nextstrain upload does that automagically, which is often what we want (e.g. for auspice JSONs), but when it's not, it's always frustrating to debug! Thanks for fixing.
phylogenetic/rules/remote_files.smk
```python
if info.scheme=='s3':
    return _storage_s3(bucket=info.netloc, keep_local=keep_local, force_signed=force_signed)(uri)

if info.scheme in ['http', 'https']:
```
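The scheme dispatch in the snippet above can be sketched with `urllib.parse`. To keep the sketch self-contained it returns a tuple rather than calling the real `_storage_s3`/`_storage_http` helpers from `remote_files.smk`:

```python
from urllib.parse import urlparse

def dispatch(uri):
    """Sketch of path_or_url's scheme dispatch (illustrative only).

    Returns (kind, netloc, key) instead of invoking a storage helper."""
    info = urlparse(uri)
    if info.scheme == "s3":
        # bucket=info.netloc, object key = path without the leading slash
        return ("s3", info.netloc, info.path.lstrip("/"))
    if info.scheme in ("http", "https"):
        return ("http", info.netloc, info.path.lstrip("/"))
    # e.g. gs:// -- unsupported, raise rather than silently treating as local
    raise ValueError(f"Unsupported scheme {info.scheme!r} in {uri!r}")
```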
Do we need to support plain HTTP? Requiring HTTPS doesn't feel onerous and is kind of a security best-practice…
It's currently supported in ncov so I think we're trying to keep the functionality the same here.
...ok, but -- is it actually used in ncov? because this feels like a great time for a deprecation…
True! Especially since the Snakemake upgrade is a breaking change anyways...
The plugin requires a Google Cloud Project, and I'm unsure how that is usually set. Currently it can be set with the snakemake envvar `SNAKEMAKE_STORAGE_GCS_PROJECT`. https://snakemake.github.io/snakemake-plugin-catalog/plugins/storage/gcs.html#settings
I dug into the confusing
I've pushed this to a point where I am happy with the implementation and have done a fair bit of further testing today, including the S3 uploads functionality. I think a little extra testing from others would be great, especially of Google Storage. When we're happy with the functionality I'll squash all the commits and remove any commented-out code etc. ncov's adoption of Snakemake v9 is going to need this code, so I wonder if nextstrain/ncov#1180 should be the first place we implement this code (instead of in Zika), as it's a drop-in replacement of ncov's
Probably easier for development to add/implement in ncov first then move it out to shared and propagate as needed.
The previous version of `path_or_url` was not compatible with Snakemake v8, which implemented a storage plugin architecture [1]. This is a complete rewrite of the approach and was originally written for zika [2]. See [2] for more commentary as well as the docs added to `remote_files.smk`.

Google Storage URIs are not supported due to authn complexity, as discussed in [3], and HTTP support is dropped as requested in [4].

The default of `keep_local` is now True. Since all (n=3) existing usages of `path_or_url` specify `keep_local=True` this is consistent with previous behaviour.

[1] <https://snakemake.readthedocs.io/en/stable/snakefiles/storage.html>
[2] <nextstrain/zika#89>
[3] <nextstrain/zika#89 (comment)>
[4] <nextstrain/zika#89 (comment)>
I'm going to close this PR as the code (as of 5319043) is now part of nextstrain/ncov#1180. I think there are two paths we can take:
Or if we'd like to work on #80 right now:
I prefer the second one, so I've added the remote-storage code to @j23414's #80 in order to unblock progress there.
The code here is lifted directly from <nextstrain/ncov#1180> which itself was lifted from <#89>. See those PRs for in-depth discussions. The `remote_files.smk` file added here will be moved to the github.com/nextstrain/shared/ repo and vendored across pathogen repos, including this one. Assuming this doesn't change cf. the ncov implementation I think it's fine to have the file copy-pasted around a few repos before it arrives in the shared repo.
I originally implemented this in <nextstrain/zika#89> and that's where the bulk of the discussion and review happened. That PR wasn't merged and instead it was added to ncov via <nextstrain/ncov#1180>. The code in this commit is copied directly from <https://github.com/nextstrain/ncov/blob/137ffc479a1bfe4e2c6ae1bbcd1b22e3afc5f8dd/workflow/snakemake_rules/remote_files.smk> Closes #56
I originally implemented this in <nextstrain/zika#89> and that's where the bulk of the discussion and review happened. That PR wasn't merged and instead it was added to ncov via <nextstrain/ncov#1180>. The code in this commit is copied directly from <https://github.com/nextstrain/ncov/blob/137ffc479a1bfe4e2c6ae1bbcd1b22e3afc5f8dd/workflow/snakemake_rules/remote_files.smk> with some changes to comments in the code as applicable. Closes #56
Initial work to support remote files (S3, HTTP) using Snakemake storage support. The intention is for this to eventually live in `shared/` and end up in many repos. All contributions, testing etc welcome. Please feel free to add commits directly to this PR. We should probably add Google Storage, since it's in ncov?
Dependencies needed:
Where's data stored?
Within `.snakemake/storage/<tag-name>/`. This is a nicer solution than the current ncov version, which stores files in top-level directories that require git-ignoring. Here's the canonical curated data from our S3 bucket:

Goodbye ./data?

Since the fetched data is stored within `.snakemake` (see above), our workflows won't use `./data` directories any more. (Zika currently uses it for USVI data, but we'll change that as part of this PR or the parent PR.)

External analysis directories
These work out of the box, which is nice! The files are stored within the analysis directories' `.snakemake` folder, as expected.

S3

I've set up S3 to use different storage providers depending on whether we think credentials are needed or not, which mirrors our previous `aws s3 cp --no-sign-request` approach.

Missing credentials can be confusing
If credentials are required (an easy way to test this is to set `PUBLIC_BUCKETS` to an empty set so that we attempt to use credentials for the canonical nextstrain-data) then depending on how we invoke snakemake we get different errors. I believe this depends on whether the inputs of the target rule are fetched from remote storage, or if they TKTK

Case 1: The target rule doesn't itself use a remote storage file, but somewhere in the DAG we do fetch files from S3. E.g. define multiple inputs such that we fetch one set of data from S3 and another set from local USVI data. The behaviour here is what I would expect:

Case 2: The target rule uses `path_or_url` directly as an input. For instance, we run the `merge_metadata` rule from the default config, which sources a file directly from S3. Without credentials we get a very confusing "Missing input files" error:

If you have credentials then this works as expected (see case 1 above).
I presume the same confusion occurs with credentials which don't grant authz, but I haven't checked.
My hunch is that this behaviour happens because we aren't directly using `storage.<tag>` wrappers in the inputs/outputs of rules, but I haven't done much digging.

Is data re-fetched as needed?
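For context, the direct-wrapper form that hunch refers to would look something like this in Snakemake v8+ syntax. This is a sketch only: the `s3_signed` tag and the rule shown are illustrative, not the PR's code.

```python
# Snakefile sketch (Snakemake v8+): a tagged storage provider used directly
# in a rule's input, rather than via a path_or_url helper function.
storage s3_signed:
    provider="s3"

rule merge_metadata:
    input:
        metadata=storage.s3_signed("s3://nextstrain-data/files/workflows/zika/metadata.tsv.zst")
    output:
        "results/metadata_merged.tsv"
```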
We use `keep_local=True` so that we keep the downloaded files and thus don't have to re-fetch them upon each snakemake invocation. The presence of a mtime method in the S3 plugin implies that a cache check is made. This would mirror the behaviour in our ncov implementation. We should document how we can avoid this, because it can trigger a workflow re-run when not desired (common for dev work).

More credential strangeness

There's some subtlety I don't fully understand here. E.g. if you have successfully completed a credentialed fetch of the files (e.g. S3 sequences & metadata), and thus they're stored on disk at `.snakemake/storage/s3_signed`, and we then run another invocation without credentials:

- `snakemake --cores 1 -pf results/filtered.fasta` (no AWS credentials in env) works just fine, i.e. `rule filter` runs successfully.
- `snakemake --cores 1 -npf results/metadata_merged.tsv` (no AWS credentials in env) gives the same "Missing input files" error.

Uploading data
I don't know if we used this in ncov, but it could be helpful. Tested using the following rule:

This is the reason for the `force_signed` keyword argument, as uploading to (e.g.) nextstrain-data needs credentials but downloading doesn't.

HTTP[S]
Running with a single set of inputs from data.nextstrain.org (and no additional inputs) worked:

However using raw.githubusercontent.com URLs didn't work (curling them was fine). This was a typo, now fixed in this PR.

Uploading data

Didn't test! Seems possible via a POST, but ???
Google Storage
Todo?
File Compression
This isn't really about snakemake storage solutions at all, but we might as well sort it out here.
From some basic testing:
- `augur merge` does the right thing.
- `augur filter` does the right thing (for zst).

So assuming we switch from `seqkit` to `augur merge` for sequences (which will use... `seqkit`) then I think we support compression of remote files if `augur filter` and `augur merge` support it.
https://snakemake.readthedocs.io/en/stable/snakefiles/storage.html
https://snakemake.github.io/snakemake-plugin-catalog/plugins/storage/http.html
https://snakemake.github.io/snakemake-plugin-catalog/plugins/storage/s3.html
https://github.com/snakemake/snakemake-storage-plugin-s3/blob/main/snakemake_storage_plugin_s3/__init__.py
https://github.com/snakemake/snakemake-storage-plugin-http/blob/main/snakemake_storage_plugin_http/__init__.py