Allow running workflow from outside of repo #103
Force-pushed from e6b070d to 275f585
```
Resolve a relative *path* given in a configuration value. Before resolving,
any '{x}' substrings are replaced by their corresponding wildcards (if the
`wildcards` argument is provided).

Search order (first match returned):
1. Relative to the analysis directory
2. Relative to the directory the entry snakefile was in. Typically this
   is not the Snakefile you are looking at now but (e.g.) the one in
   avian-flu/gisaid
3. Relative to where this Snakefile is (i.e. `avian-flu/`)
```
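A minimal, standalone sketch of how a search order like this could work. The `search_dirs` parameter and error handling are illustrative assumptions, not the PR's actual signature — the real function would derive its directories from Snakemake's `workflow` object:

```python
import os

def resolve_config_path(original_path, search_dirs, wildcards=None):
    # Substitute any '{x}' placeholders using the provided wildcards
    # (a plain dict here; Snakemake's Wildcards object would also work).
    path = original_path.format(**wildcards) if wildcards else original_path
    # Return the first existing match, honouring the documented order:
    # analysis dir, then entry snakefile's dir, then this Snakefile's dir.
    for base in search_dirs:
        candidate = os.path.join(base, path)
        if os.path.exists(candidate):
            return candidate
    raise FileNotFoundError(f"{path!r} not found in any of {search_dirs}")
```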
@joverlee521 if we do implement a search order like this it opens up the possibility of moving the "default" files from (e.g.) ./config/ to ./gisaid/ or ./gisaid/defaults/. Similarly for custom rules. If we go this direction I think it warrants a rethink of how we want to structure repos.
Force-pushed from ad0f15c to 90b994c
```python
class InvalidConfigError(Exception):
    pass


def resolve_config_path(original_path, wildcards=None):
```
Reminder to me: need to resolve any custom rules using this (?) function, currently they're just

```python
for rule_file in config.get('custom_rules', []):
    include: rule_file
```
There's a decision for us to make here which wasn't needed in measles. The h5n1-cattle-outbreak config uses a custom rule which needs to be searched for within the avian-flu repo. Other workflows use a similar approach for CI - e.g. zika.
Approach 1: Use the same resolving approach as resolve_config_path to search for rulefiles.
Approach 2: Use the approach of measles (custom rules must be relative to the working directory) and replace our usage of custom_rules with a "include" directive within h5n1-cattle-outbreak/Snakefile.
There are pros and cons to each I think. Given that we use custom_rules in CI, approach 1 is nice as it'll allow CI to be run in a separate directory as desired. Approach 2 also has the "gotcha" where you develop a workflow with custom-rules within avian-flu, it all works, but then it won't work when you run it in another directory. There's also the bigger picture issue where if multiple configfiles define custom_rules only one of them will be used because they're lists.
Given the above I think using the search order from Approach 1 is the best, but we should also reference cattle-flu.smk within h5n1-cattle-outbreak/Snakefile so that overlays with custom_rules don't have to specify it.
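Approach 1 could be sketched roughly as follows. The function name and the `search_dirs` parameter are hypothetical, standing in for the same directories `resolve_config_path` would search; the resolved paths would then be handed to Snakemake's `include:` directive:

```python
import os

def resolve_rule_files(config, search_dirs):
    # Locate each custom rule file using the same first-match search
    # order as resolve_config_path (Approach 1).
    resolved = []
    for rule_file in config.get("custom_rules", []):
        for base in search_dirs:
            candidate = os.path.join(base, rule_file)
            if os.path.exists(candidate):
                resolved.append(candidate)
                break
        else:
            raise FileNotFoundError(f"custom rule {rule_file!r} not found")
    return resolved
```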
Hmm. It feels strange to be letting "custom" rule files be sourced from the workflow source, i.e. "stock custom rules", which seems contradictory.
This is why, FWIW, in measles I intentionally resolved custom rule files only against the working analysis directory.
For the current "build-configs" pattern like CI or our own automation, I'd think to convert those to analysis directories. I didn't demo this for measles to keep changes to a minimum, but it's an easy change and keeps our interfaces consistent.
> It feels strange to be letting "custom" rule files be sourced from the workflow source, i.e. "stock custom rules", which seems contradictory
Yeah I see that. I guess a clarifying question is "do we want to allow all workflows to be run from an external analysis directory". If that's the case, and I think it probably is, then we either avoid custom_rules entirely, expand the interface to have something like workflow_rules, or shift our usage of custom rules to their own (sub-)workflows (concrete example). Is the latter what you mean by "I'd think to convert those to analysis directories"?
If we use different workflows for CI rather than "custom_rules" as we do now, then a cattle-outbreak CI might look like `avian-flu/phylogenetic/ci/h5n1-cattle-outbreak/Snakefile`, which would `include: ../../h5n1-cattle-outbreak/Snakefile`, which would `include: "../Snakefile"`. I haven't really thought through how inheritance like this will work in practice.
cc @joverlee521
> we're answering yes, but saying that CI isn't a "workflow" in and of itself so it's ok that it can't be run from an external analysis directory?
I think we have to cut off at some point. The CI seems like a very Nextstrain/workflow developer specific config that would not need to be run by an external user. If someone really wanted to, they can copy the contents of the CI directory to their own analysis directory.
> If someone really wanted to
Just for the absence of doubt, here I'm saying that someone is us as it relates to how we run CI (on GitHub actions). Currently we check out the repo and run with --configfile <ci.yaml>. With these changes we would be doing the same, but with a slightly different invocation syntax, e.g. as Tom sketched out above. But we wouldn't be using an approach where we ran CI in a separate analysis directory with nextstrain run ... as if it were just another workflow. As I said, I'm not involved much in the CI space, but just wanted to explain myself fully here.
Ah gotcha. Yeah, for CI I'd think we'd want to test the source code so it'd be running nextstrain build.
Thanks @joverlee521!
I've updated the PR to search for custom rules relative to the analysis (working) directory. The h5n1-cattle-outbreak/Snakemake workflow now imports its own "custom" rules as expected.
I haven't moved the "base" Snakefile to rules/base.smk, but this PR is probably the place to do it if we want to.
> @jameshadfield: But we wouldn't be using an approach where we ran CI in a separate analysis directory with `nextstrain run ...` as if it were just another workflow.

> @joverlee521: Yeah, for CI I'd think we'd want to test the source code so it'd be running `nextstrain build`.
We could have this either way. It's very doable to use nextstrain run for this, and we might want to do so in order for it to be an instructive example/regularly test that bit of interface.
Submitting comments-in-progress as I'd started reviewing this WIP last week, then the holiday happened and I didn't click submit. (I realize the WIP here has changed again since that time too.)
Although, it seems like some comments I remember writing got lost… ugh.
Force-pushed from 90b994c to 5c81b87
Shifts to a concept where custom-rules are only for use in analysis directories, and the custom snakemake file is sourced relative to that working directory. See <#103 (comment)> for more discussion about the benefits and limitations of this. The cattle-outbreak workflow (`h5n1-cattle-outbreak/Snakefile`) now directly imports the rules it needs rather than using the custom-rules machinery.
Force-pushed from 5c81b87 to 0bc582d
By having all phylogenetic workflows start from two lists of inputs (`config.inputs`, `config.additional_inputs`) we enable a broad range of uses with a consistent interface.

1. Using local ingest files is trivial (see added docs) and doesn't need a bunch of special-cased logic that is prone to falling out of date (as it had indeed done)
2. Adding extra / private data follows a similar pattern, with an additional config list being used so that we are explicit that the new data is additional, and to enforce an ordering which is needed for predictable `augur merge` behaviour. The canonical data can be removed / replaced via step (1) if needed.

I considered adding additional data after the subtype-filtering step, which would avoid the need to add subtype in the metadata but requires encoding this in the config overlay. I felt the chosen way was simpler and more powerful.

Note that this workflow uses an old version of the CI workflow, <https://github.com/nextstrain/.github/blob/v0/.github/workflows/pathogen-repo-ci.yaml#L233-L240>, which copies `example_data`. We could upgrade to the latest version and use a config overlay to swap out the canonical inputs with the example data.

Note that one of the side effects of the current implementation is that merged inputs use the same filepath irrespective of the workflow. For instance, both gisaid & h5n1-cattle-outbreak use the intermediate path `results/metadata_merged.tsv`, which means it's not possible to maintain runs of both those analyses concurrently if both were to use merged inputs. Using separate analysis directories, e.g. <#103>, will help avoid this shortcoming.
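The ordering guarantee in point 2 amounts to something like the following sketch (the helper name is illustrative; the config keys mirror those above):

```python
def ordered_inputs(config):
    # Canonical inputs come first, additional inputs after, so that
    # `augur merge` (which gives later inputs precedence) lets
    # additional/private metadata override canonical values.
    return [*config.get("inputs", []), *config.get("additional_inputs", [])]
```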
```
Search order (first match returned):
1. Relative to the analysis directory
2. Relative to the directory the entry snakefile was in. Typically this is
   not the Snakefile you are looking at now but (e.g.) the one in
   avian-flu/gisaid
3. Relative to where this Snakefile is (i.e. `avian-flu/`)
```
I've casually skimmed thru parts of this PR but haven't actually reviewed it. I wanted to start discussion early though on this search order. I think it would be better to have a single fallback, e.g. a search order with only two locations: the working analysis directory and a fixed directory based on the workflow source. That is, we'd pick one of workflow.basedir or workflow.current_basedir and use that one consistently, not allow either. The more search paths, the harder the behaviour is to explain, and the easier it is to get tripped up in a tangle you didn't expect as paths in different parts of the repo collide.
This would be a good aspect of the interface for us to keep thinking about. I reworked this PR with the intention of merging it ASAP and ended up using a resolution order as follows:
Search order (first match returned):
1. Relative to the analysis directory
2. Relative to the `avian-flu` directory
For certain workflows I think this makes sense - in avian flu most of the code, comments, etc. are still common and live within avian-flu/config, avian-flu/rules - so this seemed an ok solution. In other situations workflows may be more (or completely) siloed within the workflow dir, and in such situations I think it's better to resolve paths to that directory instead of (or in preference to) the base (e.g. avian-flu, or phylogenetic) directory.

This touches on a bigger topic, namely who defines and is responsible for this interface. As I see it, the only constraint that workflows-as-programs puts on this is for the workflow to search the analysis directory first to allow analysis-directory files to be used. The rest of this interface is up to the workflow author, both to implement and document.
By having all phylogenetic workflows start from two lists of inputs (`config.inputs`, `config.additional_inputs`) we enable a broad range of uses with a consistent interface.

1. Using local ingest files is trivial (see added docs) and doesn't need a bunch of special-cased logic that is prone to falling out of date (as it had indeed done)
2. Adding extra / private data follows a similar pattern, with an additional config list being used so that we are explicit that the new data is additional, and to enforce an ordering which is needed for predictable `augur merge` behaviour. The canonical data can be removed / replaced via step (1) if needed.

I considered adding additional data after the subtype-filtering step, which would avoid the need to add subtype in the metadata but requires encoding this in the config overlay. I felt the chosen way was simpler and more powerful.

When considering sequences the structure is more complex than metadata because the influenza genome is segmented and we wish to allow users to provide additional data for only some segments (see docstring for `_parse_config_input`). For non-segmented pathogens the simpler structure used here for metadata could also be used for sequences.

This workflow uses an old version of the CI workflow, <https://github.com/nextstrain/.github/blob/v0/.github/workflows/pathogen-repo-ci.yaml#L233-L240>, which copies `example_data`. We could upgrade to the latest version and use a config overlay to swap out the canonical inputs with the example data.

Note that one of the side effects of the current implementation is that merged inputs use the same filepath irrespective of the workflow. For instance, both gisaid & h5n1-cattle-outbreak use the intermediate path `results/metadata_merged.tsv`, which means it's not possible to maintain runs of both those analyses concurrently if both were to use merged inputs. Using separate analysis directories, e.g. <#103>, will help avoid this shortcoming.
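The per-segment sequence handling described above might, under assumed names (this is a hedged sketch of the idea behind `_parse_config_input`, not its actual code), look like:

```python
def parse_sequence_inputs(sequences, segments):
    # A 'sequences' entry may be a single template path covering every
    # segment (via a '{segment}' placeholder) or a mapping keyed by
    # segment that provides additional data for only a subset of segments.
    if sequences is None:
        return {}
    if isinstance(sequences, str):
        return {seg: sequences.format(segment=seg) for seg in segments}
    return {seg: path for seg, path in sequences.items() if seg in segments}
```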
Force-pushed from 0bc582d to 66066a5
Force-pushed from 66066a5 to 2cadac9
Force-pushed from 6d92c79 to 0c6f7bb
Force-pushed from 19c5160 to 6229eb9
joverlee521 left a comment
Mostly reviewed to catch up with all the work that is being done here.
Generally, I still need to wrap my head around the new repo layout with multiple entrypoint Snakefiles, but definitely great to support arbitrary analysis directories.
```python
# load the default config which must exist.
# NOTE: despite any `--config` or `--configfile` having been already loaded into the `config` variable by this stage,
# the following `configfile: ...` expression doesn't override this, rather it recomputes the entire `config` structure
# (by merging) such that the precedence of "configfile: < --configfile < --config" is maintained.
if not os.path.exists(os.path.join(workflow.basedir, 'config.yaml')):
    raise InvalidConfigError("No default config - this is the result of an error/bug in the underlying workflow.")
configfile: os.path.join(workflow.basedir, 'config.yaml')


# load a config.yaml file if it exists in the current working directory
# (current working directory is most often the directory where you ran `snakemake` from, but can be changed via
# Snakemake's --directory arg)
if os.path.exists("config.yaml"):
    configfile: "config.yaml"
```
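The precedence described in the NOTE can be illustrated with a recursive merge where later layers win. This is a sketch of the behaviour, not Snakemake's actual implementation:

```python
def merged(*layers):
    # Later layers override earlier ones; nested dicts merge recursively,
    # while lists and scalars are replaced wholesale.
    out = {}
    for layer in layers:
        for key, value in layer.items():
            if isinstance(value, dict) and isinstance(out.get(key), dict):
                out[key] = merged(out[key], value)
            else:
                out[key] = value
    return out

# precedence: `configfile:` directive < --configfile < --config
config = merged(
    {"segments": ["ha"], "colors": "config/colors.tsv"},  # configfile: directive
    {"segments": ["ha", "na"]},                           # --configfile
    {"colors": "my-colors.tsv"},                          # --config
)
```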
non-blocking
FWIW, setting the configfile directive here feels very hidden to me. I was initially confused how genome-focused-d1.1/Snakefile and genome-focused/Snakefile were using different config files. I had to trace from genome-focused/Snakefile -> rules/main.smk -> rules/config.smk to get here.
I think it should be immediately visible within each entrypoint Snakefile that it's setting some default configfile. I'd say it's fine to duplicate this bit across the couple of entrypoint Snakefiles that are in this repo.
If you really want to encapsulate this, it might be clearer to directly include this file in the entrypoint Snakefiles:

```python
include: "../rules/config.smk"
include: "../rules/main.smk"
include: "../rules/genome.smk"

rule _all:
    input: rules.all.input
```

Another alternative is to define a shared function that returns the expected configfile paths:

```python
def resolve_configfiles():
    configfiles = []
    if not os.path.exists(os.path.join(workflow.basedir, 'config.yaml')):
        raise InvalidConfigError("No default config - this is the result of an error/bug in the underlying workflow.")
    configfiles.append(os.path.join(workflow.basedir, 'config.yaml'))
    if os.path.exists("config.yaml"):
        configfiles.append("config.yaml")
    return configfiles
```

Then set the configfiles within the entrypoint Snakefiles (e.g. genome-focused/Snakefile):

```python
include: "../rules/config.smk"
include: "../rules/main.smk"
include: "../rules/genome.smk"

for configfile in resolve_configfiles():
    configfile: configfile

rule _all:
    input: rules.all.input
```

Good feedback, thanks! I've switched to your first suggestion.

I'm happy for post-merge tweaks to this - we could add in the `configfile:` directives to the Snakefile if that's felt clearer, and we could remove the dummy `_all` rule and instead define the actual rule here too:

```python
rule all:
    input:
        auspice_json = expand_target_patterns()
```
Yeah, I'd think it'd be clearest to have each entrypoint Snakefile define the high level things such as configfile and rule all.
```bash
nextstrain build . --configfile config/gisaid.yaml -f deploy_all
nextstrain build . --snakefile segment-focused/Snakefile -f deploy_all
```
non-blocking
Huh, I was initially going to suggest running as nextstrain build segment-focused -f deploy_all, but then realized that is slightly different because the output files will be under segment-focused instead of at the top level.
Yeah - it's basically (exactly?) saying that segment-focused/ is the analysis directory. Because of how our .gitignore is written this'll work fine & keep things ignored, although I think that wasn't the intention. Maybe running in this way is something for us to consider in the future - it's kind of similar to splitting the ingest up into {fauna,ncbi,joined-ncbi} each with their own {data,results} - although I haven't thought about this much.
We can run the avian-flu workflows from analysis directory/directories separate to this repo. Imagine the following directory structure:

```
├── avian-flu
│   ├── README.md
│   ├── segment-focused
│   │   ├── Snakefile
│   │   └── config.yaml
│   └── ... (lots more files)
└── analysis_dir
```
non-blocking
Maybe worth noting that through the work you've done to resolve paths, it is also possible now to specify arbitrary analysis directories within the repo.
This makes it possible to run with the nextstrain build command:

```
nextstrain build . -s genome-focused/Snakefile --directory test
```

The outputs will be within `avian-flu/test/`
I've added a note about the --directory argument to the README, but felt it was too much to propose having this inside the avian-flu repo to users. It's totally possible, and for those who know how to do it it may be useful. The .gitignore will keep them out of version control so this may be quite ergonomic for some people.
Yeah - in the context of this PR the additional boilerplate / cognitive overhead of two essentially dummy […]

P.S. Thanks for the review - much appreciated!
This starts the process of redefining the workflow interface from a single workflow / entrypoint Snakefile, with different configfiles to choose behaviour, to an interface of multiple Snakefiles, each with their own default configs. Ultimately this will facilitate running the workflows in separate working directories and workflows-as-programs (`nextstrain run ...`).

Note that because we cannot use the same config file to build both D1.1 and B3.13 builds we use a separate workflow for each. With the move to a more expressive config syntax in <#104> we will drop the D1.1 workflow.

I initially chose `ncbi/` and `gisaid/` as the entrypoint/workflow names but moved away from that as I want to decouple the workflow from the data source. As described in the README changes here it's trivial to run either workflow using either data input. One day we may even merge these workflows into a single config-definable workflow, but if we ever do so it feels a long way off.
Since the top-level Snakefile is no longer intended to be run directly (see previous commit) removing it avoids confusion. We also change the name of the 'cattle-flu' rules to better align with the aims of the 'genome-focused' pipeline
Preferentially resolving paths to the analysis directory allows us to
run workflows in a separate analysis directory and override certain
files (defaults) as needed. Combined with the nicety of automatically
using a working directory's `config.yaml`, this allows us to run
workflows from separate analysis directories in a way that is both
convenient and powerful.
Note that when we need to combine path resolution with resolving the
underlying config value itself we lose the benefit of currying - e.g.
the following is clunky and can be improved in a subsequent commit:
```
colors = lambda w: resolve_config_path(get_config('colors', 'hardcoded', w))(w),
```
This makes the (common) task of resolving a path from a config lookup much more ergonomic, and the style we use to resolve a config value vs path is now the same (`resolve_config_value` and `resolve_config_path`, respectively). This implementation will also make <#104> simpler to implement.
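A sketch of what that curried style might look like. The lookup logic and names here are assumptions for illustration, not the PR's exact code:

```python
def resolve_config_value(config, *keys):
    # Returns a function of wildcards, matching Snakemake's input-function
    # convention, so rule inputs read as resolve_config_value(...)(w)
    # rather than the clunky double call shown in the commit message.
    def resolve(wildcards):
        value = config
        for key in keys:
            value = value[key]
        # Fill '{x}' placeholders from wildcards for string values.
        return value.format(**wildcards) if isinstance(value, str) else value
    return resolve
```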
Force-pushed from 6229eb9 to 9ef6cf2
* fix subject-verb agreement in a couple places
* rewrite two section headers for consistency with the rest of the doc
…of-repo-branch

Some nits I noticed while reviewing #103
```diff
 uses: nextstrain/.github/.github/workflows/pathogen-repo-ci.yaml@v0
 with:
-  build-args: --configfile config/gisaid.yaml -pf test_target
+  build-args: --snakefile segment-focused/Snakefile -pf test_target
```
Totally missed this change during review, required small follow ups in various CI workflows.
🤦 Thanks so much for sorting this out!
Following on from nextstrain/measles#55, partly as a learning exercise and partly because I want to use it for avian-flu. The main differences here are:

* […] `--configfile`, i.e. there's no "default configuration values". This is how mpox works too, among others.

Tested using two scenarios:

1. A `custom-gisaid.yaml` config of […] and running via […]
   (The explicit target file was needed due to the config structure - see the note in the YAML above - and this makes the config override of `segments` irrelevant.)
2. A `config.yaml` config of […] and ran via `snakemake --snakefile ${AVIAN_FLU}/Snakefile --cores 3 -pf`

Overall it worked really well. I'll start some threads in the code to continue discussion of various parts.