Allow for multiple inputs from the config file #80
Conversation
jameshadfield
left a comment
Implementation copied and modified from [avian-flu]
I think this would be a good point to try out @trs's suggestion regarding built-in remote file support. We could do that here or in avian-flu. This is also a good time to consider the best way to spread this code around repos (e.g. as a stand-alone rules file, or vendored, or... ).
To be clear, I'm happy with the interface, but before we spread the implementation around different pathogen repos we should improve it as much as possible.
P.S. Avian-flu PR nextstrain/avian-flu#106 was superseded by nextstrain/avian-flu#112
+1 for getting built-in remote file support. Should I work on this or do you want to? You can push to this branch, and zika is a simpler test case than avian-flu (segment wildcards abound).
I lean toward the "stand-alone rules" file for the moment, because I can imagine needing to adjust the `augur merge` parameters based on the spiked-in private data for a particular pathogen. But I'm open to snakemake modules (inferred example*):

```
module ingest_ncbi:
    snakefile:
        github(
            "nextstrain/pathogen-repo-guide",
            path="ingest/rules/fetch_from_ncbi.smk",
            branch="main",
        )
```

or creating a "vendored" directory for phylogenetic workflows (currently only used for ingest workflows).
Agree that non-segmented is an easier place to build this out, and I think any such solutions would also work for the segmented cases. I'm not able to work on this in the short term (one of the reasons I merged avian-flu when I did is that I needed its capabilities for the ongoing outbreak work). Perhaps dev-chat would be a good place to surface this?
I'm not pushing for any particular solution, but I know it's a subject that people in the group bring up regularly, so I wanted to point it out here.
In terms of interface (as opposed to implementation), I'm unclear about why there is affordance to have both extra elements in

As a user, I also expect the ability to use HTTP as well as local files as entries.
Yes, this should be documented more generally - right now the only place is the avian-flu readme. TL;DR you want
Thanks James! I'm still confused, however (even after reading the nice documentation in the avian-flu readme), about whether there's a difference between and
Those are identical. The benefit comes when we have a default YAML (e.g. in the pathogen repo) and an overlay config YAML (in the analysis directory), which is the imagined use case for:

```yaml
additional_inputs:
  - name: user-data
    metadata: user_data/metadata.tsv
    sequences: user_data/sequences.fasta
```

And, if you want to just analyse your data, you'd use
Got it. This makes sense, thank you. I see how this is a smart strategy, even if it adds some uncertainty about the standard path for adding user data. We should have standard advice for using overlays that are then compatible with
jameshadfield
left a comment
(Left a few comments but ran out of time to fully test/review - I'll pick this up next week)
joverlee521
left a comment
Left some comments and suggestions.
The last dev chat made it seem like we don't want to slow this down with implementation details, so you can consider my comments non-blocking. I do think there's a clearer path for vendoring ncov's remote file support now than when this was implemented in avian-flu, so I'd be in favor of doing that now rather than later.
```diff
@@ -1,7 +1,14 @@
 # Sequences must be FASTA and metadata must be TSV
 # Both files must be zstd compressed
```
Is it going to be confusing that the `additional_inputs` files are not compressed?
Good question! I'll drop these comments for now as out of date. Ideally we should be able to handle compressed or uncompressed inputs from S3, HTTPS, local, or workflow-specific sources in the final implementation.
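As a sketch of what "handle compressed or uncompressed inputs" could look like, here's a hypothetical helper (not part of this PR, names are illustrative) that maps a file extension to a shell command emitting the file's decompressed contents on stdout:

```python
# Hypothetical helper, not the workflow's actual code: choose a shell command
# that streams a file's (de)compressed contents to stdout, so downstream
# rules can treat compressed and uncompressed inputs uniformly.
def stream_command(path: str) -> str:
    if path.endswith(".zst"):
        return f"zstd -dcq {path}"   # zstandard, as used by the default inputs
    if path.endswith(".gz"):
        return f"gzip -dc {path}"
    if path.endswith(".xz"):
        return f"xz -dc {path}"
    return f"cat {path}"             # assume uncompressed
```

A rule's shell block could then pipe `stream_command(input_path)` into whatever consumes the data, regardless of how the user compressed their files.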
Reminder to update this comment. I investigated the supported compressions a little bit as part of #89 (under the "File Compression" section of the PR description). What compressions we actually support will be influenced by what you do with the merge rules.
Dropped comments entirely in 72f9d21
But let me know if I misunderstood something here
I think it'd be great to document what compressions are supported, although this is not entirely generalizable across pathogen repos in the way we're aiming for with the multiple inputs, so we have to be careful with how we write the comment. As I understand it, and assuming we use separate merge_sequences and merge_metadata rules, the supported compressions are:

- Metadata
  - Multiple sources: whatever `augur merge` supports
  - Single source: what's supported by the rules where `input_metadata` is used (4 different rules for zika as it currently stands)
- Sequences
  - Multiple sources: whatever `augur merge` supports
  - Single source: what's supported by the rules where `input_sequences` is used (only one rule for zika as it currently stands)
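For the multiple-source case, the metadata merge boils down to an `augur merge` invocation over named inputs. A rough sketch of building that command (the `NAME=FILE` form follows my reading of `augur merge`'s interface; the input names and paths are illustrative, so double-check against `augur merge --help`):

```python
# Sketch only: build an `augur merge` command for metadata from named inputs.
# Verify the NAME=FILE syntax and flags against `augur merge --help`.
inputs = {
    "ncbi": "data/metadata.tsv.zst",
    "user-data": "user_data/metadata.tsv",
}
metadata_args = " ".join(f"{name}={path}" for name, path in inputs.items())
cmd = (
    f"augur merge --metadata {metadata_args} "
    f"--output-metadata results/metadata_merged.tsv"
)
```

With only one input, a rule could skip `augur merge` entirely and pass the file straight through, which is why the single-source compression support differs.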
Perhaps I can link to the augur docs on different compression support, similar to the Nextclade docs?
- https://docs.nextstrain.org/projects/nextclade/en/stable/user/input-files/compression.html
If it exists in augur docs somewhere...
Rebase and maintain the purpose of these lines:

* https://github.com/nextstrain/zika/blame/3c17db54698cf7836641308f6aa50d714959992b/phylogenetic/rules/merge_sequences_usvi.smk#L30-L33

Where the `workflow.source_path("../` is for datasets that live in the workflow source directory (not the local analysis directory), specifically when using `nextstrain run`.
Updated to match the latest guidelines from Nextstrain's Snakemake style guide¹:

- Use raw, triple-quoted shell blocks
- Log standard out and error output to log files and the terminal
- Always use the benchmark directive

¹ <https://docs.nextstrain.org/en/latest/reference/snakemake-style-guide.html>
Co-authored-by: John SJ Anderson <janders4@fredhutch.org>
Co-authored-by: John SJ Anderson <janders4@fredhutch.org>
The code here is lifted directly from <nextstrain/ncov#1180> which itself was lifted from <#89>. See those PRs for in-depth discussions. The `remote_files.smk` file added here will be moved to the github.com/nextstrain/shared/ repo and vendored across pathogen repos, including this one. Assuming this doesn't change relative to the ncov implementation, I think it's fine to have the file copy-pasted around a few repos before it arrives in the shared repo.
| """ | ||
|
|
||
| include: "remote_files.smk" | ||
| include: "../../shared/vendored/snakemake/remote_files.smk" |
I'm on the fence about this include statement going here or the main Snakefile. Does anyone have a preference?
I have a slight preference for adding the include statement to the main Snakefile in case we want to use the shared `path_or_url` function in other parts of the workflow.
Good point about scoping! I leaned toward the main Snakefile, so fixed up in 75dc2e0
Agreed - the intention was for `path_or_url` to be available for wider use than just the inputs.

One use-case we don't support (as discussed elsewhere) is something like:

```
inputs:
    foo = path_or_url(config['bar'])
```

- If `config['bar']` is a URL it'll just work when run from either the pathogen repo or an analysis directory
- If `config['bar']` is a path which exists in the analysis directory (and we're running it in an analysis directory) it'll just work
- If `config['bar']` is a path which exists in the pathogen repo and we're running it in an analysis directory then it won't work

(There's no easy solution as far as I can see. You can't just use something like `resolve_config_path` within `path_or_url` because we support output files and because some workflows use modified versions of `resolve_config_path`. We could expand `path_or_url` to take a `path_resolver` argument which would force the path to exist, i.e. only be useful in the context of params/inputs. Happy to implement this if people want.)
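The `path_resolver` idea floated above might look something like this purely illustrative sketch (the real vendored `path_or_url` hands URLs to Snakemake's storage support rather than returning them unchanged; the function shape here is an assumption drawn from the discussion, not the actual implementation):

```python
# Purely illustrative: a path_or_url that accepts an optional path_resolver,
# as proposed above. The real implementation routes URLs through Snakemake's
# storage/remote-file machinery instead of returning them as-is.
def path_or_url(uri: str, path_resolver=None) -> str:
    if uri.startswith(("http://", "https://", "s3://")):
        return uri  # remote: leave for Snakemake's storage support
    if path_resolver is not None:
        # e.g. a resolve_config_path-style callable that checks the analysis
        # directory first, then falls back to the workflow source directory
        return path_resolver(uri)
    return uri  # plain local path, relative to the working directory
```

Passing a resolver only makes sense for inputs/params (the path must already exist), which matches the caveat above about output files.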
> If `config['bar']` is a path which exists in the pathogen repo and we're running it in an analysis directory then it won't work
Hmm, I thought we could wrap `resolve_config_path` around `path_or_url`, since `path_or_url` should return the local file location for a URL.
```
inputs:
    foo = resolve_config_path(path_or_url(config['bar']))
```

but this currently runs into a weird WorkflowError (reported in snakemake/snakemake#3027):
Error:

```
WorkflowError:
Flags ({'storage_object': <snakemake_storage_plugin_http.StorageObject object at 0xffffa0415f90>}) in file pattern '.snakemake/storage/http/github.com/nextstrain/augur/blob/fe3f57975c34ff180cb437bebfefd68fa4a4afed/augur/data/geolocation_rules.tsv' given to expand() are invalid. Flags (e.g. temp(), directory()) have to be applied outside of expand (e.g. 'temp(expand("plots/{sample}.pdf", sample=SAMPLES))').
Wildcards:
```
Beyond the workflow error you get, conceptually the `.snakemake/storage` file may not exist when the inputs are evaluated, so our path resolution approach wouldn't work. This is somewhat analogous to using `path_or_url` for outputs. Path evaluation within `path_or_url` will be straightforward I think (via something like `path_or_url(uri, path_resolver)`).
joverlee521
left a comment
I left some small nit-picky comments, but I think this should be good to merge?
Co-authored-by: Jover Lee <joverlee521@gmail.com>
Description of proposed changes
to-be-written
Implementation copied and modified from:
This is inspired by the work in:
nextstrain/avian-flu#106
https://github.com/nextstrain/avian-flu?tab=readme-ov-file#use-additional-metadata-andor-sequences
Related issue(s)
Checklist