Skip to content

Allow running workflow from outside of repo#103

Merged
jameshadfield merged 4 commits intomasterfrom
james/run-outside-of-repo
Mar 5, 2025
Merged

Allow running workflow from outside of repo#103
jameshadfield merged 4 commits intomasterfrom
james/run-outside-of-repo

Conversation

@jameshadfield
Copy link
Copy Markdown
Member

Following on from nextstrain/measles#55 partly as a learning exercise and partly because I want to use it for avian-flu. The main differences here are

  • A much more complicated snakemake workflow and config structure
  • A workflow which expected to be run with --configfile, i.e. there's no "default configuration values". This is how mpox works too, among others.

Tested using two scenarios:

  1. GISAID builds. Running in a separate directory outside of the avian-flu repo. I used a custom-gisaid.yaml config of
extends: gisaid.yaml
segments:
  - pb2

# NOTE - this reveals a (rather big) issue with our config style as this _won't_
# overwrite the builds as we intend it to, it'll merge in (and thus not change anything)
builds:
  h7n9: 
    - all-time

target_sequences_per_tree: 1000 # for testing

And running via

AVIAN_FLU="path/to/avian-flu/repo"
snakemake --snakefile ${AVIAN_FLU}/Snakefile --cores 3 --configfile custom-gisaid.yaml -pf auspice/avian-flu_h7n9_ha_all-time.json

(The explicit target file was needed due to the config structure - see the note in the YAML above - and this makes the config override of segments irrelevant.)

  1. H5N1-cattle-outbreak. Running in a separate directory outside of the avian-flu repo (and different from the GISAID builds above). I used a config.yaml config of
extends: h5n1-cattle-outbreak.yaml
segments:
  - pb2

And ran via snakemake --snakefile ${AVIAN_FLU}/Snakefile --cores 3 -pf


Overall it worked really well. I'll start some threads in the code to continue discussion of various parts.

Comment thread Snakefile Outdated
Comment thread Snakefile Outdated
Comment on lines +29 to +38
Resolve a relative *path* given in a configuration value. Before resolving
any '{x}' substrings are replaced by their corresponding wildcards (if the
`wildcards` argument is provided).

Search order (first match returned):
1. Relative to the analysis directory
2. Relative to the directory the entry snakefile was in. Typically this
is not the Snakefile you are looking at now but (e.g.) the one in
avian-flu/gisaid
3. Relative to where this Snakefile is (i.e. `avian-flu/`)
Copy link
Copy Markdown
Member Author

@jameshadfield jameshadfield Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joverlee521 if we do implement a search order like this it opens up the possibility of moving the "default" files from (e.g.) ./config/ to ./gisaid/ or ./gisaid/defaults/. Similarly for custom rules. If we go this direction I think it warrants a rethink of how we want to structure repos.

@jameshadfield jameshadfield force-pushed the james/run-outside-of-repo branch 3 times, most recently from ad0f15c to 90b994c Compare November 28, 2024 03:12
Comment thread Snakefile Outdated
Comment thread Snakefile Outdated
class InvalidConfigError(Exception):
pass

def resolve_config_path(original_path, wildcards=None):
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reminder to me: need to resolve any custom rules using this (?) function, currently they're just

for rule_file in config.get('custom_rules', []):
    include: rule_file

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a decision for us to make here which wasn't needed in measles. The h5n1-cattle-outbreak config uses a custom rule which needs to be searched for within the avian-flu repo. Other workflows use a similar approach for CI - e.g. zika.

Approach 1: Use the same resolving approach as resolve_config_path to search for rulefiles.

Approach 2: Use the approach of measles (custom rules must be relative to the working directory) and replace our usage of custom_rules with a "include" directive within h5n1-cattle-outbreak/Snakefile.

There are pros and cons to each I think. Given that we use custom_rules in CI, approach 1 is nice as it'll allow CI to be run in a separate directory as desired. Approach 2 also has the "gotcha" where you develop a workflow with custom-rules within avian-flu, it all works, but then it won't work when you run it in another directory. There's also the bigger picture issue where if multiple configfiles define custom_rules only one of them will be used because they're lists.

Given the above I think using the search order from Approach 1 is the best, but we should also reference cattle-flu.smk within h5n1-cattle-outbreak/Snakefile so that overlays with custom_rules don't have to specify it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. It feels strange to be letting "custom" rule files be sourced from the workflow source, i.e. "stock custom rules", which seems contradictory.

This is why, FWIW, in measles I intentionally only resolved custom rules files based on the working analysis directory.

For the current "build-configs" pattern like CI or our own automation, I'd think to convert those to analysis directories. I didn't demo this for measles to keep changes to a minimum, but it's an easy change and keeps our interfaces consistent.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels strange to be letting "custom" rule files be sourced from the workflow source, i.e. "stock custom rules", which seems contradictory

Yeah I see that. I guess a clarifying question is "do we want to allow all workflows to be run from an external analysis directory". If that's the case, and I think it probably is, then we either avoid custom_rules entirely, expand the interface to have something like workflow_rules, or shift our usage of custom rules to their own (sub-)workflows (concrete example). Is the latter what you mean by "I'd think to convert those to analysis directories"?

If we use different workflows for CI rather than "custom_rules" as we do now, then a cattle-outbreak CI might look like avian-flu/phylogenetic/ci/h5n1-cattle-outbreak/Snakefile which would include: ../../h5n1-cattle-outbreak/Snakefile which would include: "../Snakefile". I haven't really though through how inheritance like this will work in practice.

cc @joverlee521

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we're answering yes, but saying that CI isn't a "workflow" in and of itself so it's ok that it can't be run from an external analysis directory?

I think we have to cut off at some point. The CI seems like a very Nextstrain/workflow developer specific config that would not need to be run by an external user. If someone really wanted to, they can copy the contents of the CI directory to their own analysis directory.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If someone really wanted to

Just for the absence of doubt, here I'm saying that someone is us as it relates to how we run CI (on GitHub actions). Currently we check out the repo and run with --configfile <ci.yaml>. With these changes we would be doing the same, but with a slightly different invocation syntax, e.g. as Tom sketched out above. But we wouldn't be using an approach where we ran CI in a separate analysis directory with nextstrain run ... as if it were just another workflow. As I said, I'm not involved much in the CI space, but just wanted to explain myself fully here.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah gotcha. Yeah, for CI I'd think we'd want to test the source code so it'd be running nextstrain build.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @joverlee521!

I've updated the PR to search for custom rules relative to the analysis (working) directory. The h5n1-cattle-outbreak/Snakemake workflow now imports it's own "custom" rules as expected.

I haven't moved the "base" Snakefile to rules/base.smk, but this PR is probably the place to do it if we want to.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jameshadfield: But we wouldn't be using an approach where we ran CI in a separate analysis directory with nextstrain run ... as if it were just another workflow.

@joverlee521: Yeah, for CI I'd think we'd want to test the source code so it'd be running nextstrain build.

We could have this either way. It's very doable to use nextstrain run for this, and we might want to do so in order for it to be an instructive example/regularly test that bit of interface.

Copy link
Copy Markdown
Contributor

@tsibley tsibley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Submitting comments-in-progress as I'd started reviewing this WIP last week, then the holiday happened and I didn't click submit. (I realize the WIP here has changed again since that time too.)

Although, it seems like some comments I remember writing got lost… ugh.

Comment thread Snakefile Outdated
Comment thread Snakefile Outdated
Comment thread Snakefile Outdated
@jameshadfield jameshadfield force-pushed the james/run-outside-of-repo branch from 90b994c to 5c81b87 Compare December 8, 2024 22:57
jameshadfield added a commit that referenced this pull request Dec 12, 2024
Shifts to a concept where custom-rules are only for use in analysis
directories, and the custom snakemake file is sourced relative to that
working directory. See <#103 (comment)>
for more discussion about the benefits and limitations of this.

The cattle-outbreak workflow (`h5n1-cattle-outbreak/Snakefile`) now
directly imports the rules it needs rather than using the custom-rules
machinery.
@jameshadfield jameshadfield force-pushed the james/run-outside-of-repo branch from 5c81b87 to 0bc582d Compare December 12, 2024 03:55
jameshadfield added a commit that referenced this pull request Dec 16, 2024
By having all phylogenetic workflows start from two lists of inputs
(`config.inputs`, `config.additional_inputs`) we enable a broad range of
uses with a consistent interface.

1. Using local ingest files is trivial (see added docs) and doesn't need
   a bunch of special-cased logic that is prone to falling out of date
   (as it had indeed done)
2. Adding extra / private data follows the similar pattern, with an
   additional config list being used so that we are explicit that the
   new data is additional and enforce an ordering which is needed for
   predictable `augur merge` behaviour. The canonical data can be
   removed / replaced via step (1) if needed.

I considered adding additional data after the subtype-filtering step,
which would avoid the need to add subtype in the metadata but requires
encoding this in the config overlay. I felt the chosen way was simpler
and more powerful.

Note that this workflow uses an old version of the CI workflow,
<https://github.com/nextstrain/.github/blob/v0/.github/workflows/pathogen-repo-ci.yaml#L233-L240>
which copies `example_data`. We could upgrade to the latest version
and use a config overlay to swap out the canonical inputs with the
example data.

Note that one of the side effects of the current implementation is that
merged inputs use the same filepath irrespective of the workflow. For
instance, both gisaid & h5n1-cattle-outbreak use the intermediate path
`results/metadata_merged.tsv`, which means it's not possible to maintain
runs of both those analysis concurrently if both were to use merged
inputs. Using separate analysis directories, e.g.
<#103> will help avoid this
shortcoming.
Comment thread Snakefile Outdated
Comment thread Snakefile Outdated
Comment thread Snakefile Outdated
Comment on lines +33 to +38
Search order (first match returned):
1. Relative to the analysis directory
2. Relative to the directory the entry snakefile was in. Typically this is
not the Snakefile you are looking at now but (e.g.) the one in
avian-flu/gisaid
3. Relative to where this Snakefile is (i.e. `avian-flu/`)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've casually skimmed thru parts of this PR but haven't actually reviewed it. I wanted to start discussion early though on this search order. I think it would be better to have a single fallback, e.g. a search order with only two locations: the working analysis directory and a fixed directory based on the workflow source. That is, we'd pick one of workflow.basedir or workflow.current_basedir and use that one consistently, not allow either. The more search paths, the harder the behaviour is to explain, and the easier it is to get tripped up in a tangle you didn't expect as paths in different parts of the repo collide.

Copy link
Copy Markdown
Member Author

@jameshadfield jameshadfield Mar 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be a good aspect of the interface for us to keep thinking about. I reworked this PR with the intention of merging it ASAP and ended up using a resolution order as follows:

    Search order (first match returned):
        1. Relative to the analysis directory
        2. Relative to the `avian-flu` directory

For certain workflows I think this makes sense - in avian flu most of the code, comments etc is still common and within avian-flu/config, avian-flu/rules - so this seemed an ok solution. In other situations workflows may be more (or completely) siloed within the workflow dir and in such situations I think it's better to resolve paths to that directory instead of (or in preference to) the base (e.g. avian-flu, or phylogenetic) directory.

This touches on a bigger topic, namely who defines and is responsible for this interface. As I see it it, the only constraint that workflows as programs puts on this is for the workflow to search the analysis directory first to allow analysis-directory files to be used. The rest of this interface is up to the workflow author, both to implement and document.

jameshadfield added a commit that referenced this pull request Jan 1, 2025
By having all phylogenetic workflows start from two lists of inputs
(`config.inputs`, `config.additional_inputs`) we enable a broad range of
uses with a consistent interface.

1. Using local ingest files is trivial (see added docs) and doesn't need
   a bunch of special-cased logic that is prone to falling out of date
   (as it had indeed done)
2. Adding extra / private data follows the similar pattern, with an
   additional config list being used so that we are explicit that the
   new data is additional and enforce an ordering which is needed for
   predictable `augur merge` behaviour. The canonical data can be
   removed / replaced via step (1) if needed.

I considered adding additional data after the subtype-filtering step,
which would avoid the need to add subtype in the metadata but requires
encoding this in the config overlay. I felt the chosen way was simpler
and more powerful.

When considering sequences the structure is more complex than metadata
because the influenza genome is segmented and we wish to allow users to
provide additional data for only some segments (see docstring for
`_parse_config_input`). For non-segmented pathogens the simpler
structure used here for metadata could also be used for sequences.

This workflow uses an old version of the CI workflow,
<https://github.com/nextstrain/.github/blob/v0/.github/workflows/pathogen-repo-ci.yaml#L233-L240>
which copies `example_data`. We could upgrade to the latest version
and use a config overlay to swap out the canonical inputs with the
example data.

Note that one of the side effects of the current implementation is that
merged inputs use the same filepath irrespective of the workflow. For
instance, both gisaid & h5n1-cattle-outbreak use the intermediate path
`results/metadata_merged.tsv`, which means it's not possible to maintain
runs of both those analysis concurrently if both were to use merged
inputs. Using separate analysis directories, e.g.
<#103> will help avoid this
shortcoming.
jameshadfield added a commit that referenced this pull request Jan 7, 2025
By having all phylogenetic workflows start from two lists of inputs
(`config.inputs`, `config.additional_inputs`) we enable a broad range of
uses with a consistent interface.

1. Using local ingest files is trivial (see added docs) and doesn't need
   a bunch of special-cased logic that is prone to falling out of date
   (as it had indeed done)
2. Adding extra / private data follows the similar pattern, with an
   additional config list being used so that we are explicit that the
   new data is additional and enforce an ordering which is needed for
   predictable `augur merge` behaviour. The canonical data can be
   removed / replaced via step (1) if needed.

I considered adding additional data after the subtype-filtering step,
which would avoid the need to add subtype in the metadata but requires
encoding this in the config overlay. I felt the chosen way was simpler
and more powerful.

When considering sequences the structure is more complex than metadata
because the influenza genome is segmented and we wish to allow users to
provide additional data for only some segments (see docstring for
`_parse_config_input`). For non-segmented pathogens the simpler
structure used here for metadata could also be used for sequences.

This workflow uses an old version of the CI workflow,
<https://github.com/nextstrain/.github/blob/v0/.github/workflows/pathogen-repo-ci.yaml#L233-L240>
which copies `example_data`. We could upgrade to the latest version
and use a config overlay to swap out the canonical inputs with the
example data.

Note that one of the side effects of the current implementation is that
merged inputs use the same filepath irrespective of the workflow. For
instance, both gisaid & h5n1-cattle-outbreak use the intermediate path
`results/metadata_merged.tsv`, which means it's not possible to maintain
runs of both those analysis concurrently if both were to use merged
inputs. Using separate analysis directories, e.g.
<#103> will help avoid this
shortcoming.
jameshadfield added a commit that referenced this pull request Jan 7, 2025
By having all phylogenetic workflows start from two lists of inputs
(`config.inputs`, `config.additional_inputs`) we enable a broad range of
uses with a consistent interface.

1. Using local ingest files is trivial (see added docs) and doesn't need
   a bunch of special-cased logic that is prone to falling out of date
   (as it had indeed done)
2. Adding extra / private data follows the similar pattern, with an
   additional config list being used so that we are explicit that the
   new data is additional and enforce an ordering which is needed for
   predictable `augur merge` behaviour. The canonical data can be
   removed / replaced via step (1) if needed.

I considered adding additional data after the subtype-filtering step,
which would avoid the need to add subtype in the metadata but requires
encoding this in the config overlay. I felt the chosen way was simpler
and more powerful.

When considering sequences the structure is more complex than metadata
because the influenza genome is segmented and we wish to allow users to
provide additional data for only some segments (see docstring for
`_parse_config_input`). For non-segmented pathogens the simpler
structure used here for metadata could also be used for sequences.

This workflow uses an old version of the CI workflow,
<https://github.com/nextstrain/.github/blob/v0/.github/workflows/pathogen-repo-ci.yaml#L233-L240>
which copies `example_data`. We could upgrade to the latest version
and use a config overlay to swap out the canonical inputs with the
example data.

Note that one of the side effects of the current implementation is that
merged inputs use the same filepath irrespective of the workflow. For
instance, both gisaid & h5n1-cattle-outbreak use the intermediate path
`results/metadata_merged.tsv`, which means it's not possible to maintain
runs of both those analysis concurrently if both were to use merged
inputs. Using separate analysis directories, e.g.
<#103> will help avoid this
shortcoming.
jameshadfield added a commit that referenced this pull request Feb 17, 2025
By having all phylogenetic workflows start from two lists of inputs
(`config.inputs`, `config.additional_inputs`) we enable a broad range of
uses with a consistent interface.

1. Using local ingest files is trivial (see added docs) and doesn't need
   a bunch of special-cased logic that is prone to falling out of date
   (as it had indeed done)
2. Adding extra / private data follows the similar pattern, with an
   additional config list being used so that we are explicit that the
   new data is additional and enforce an ordering which is needed for
   predictable `augur merge` behaviour. The canonical data can be
   removed / replaced via step (1) if needed.

I considered adding additional data after the subtype-filtering step,
which would avoid the need to add subtype in the metadata but requires
encoding this in the config overlay. I felt the chosen way was simpler
and more powerful.

When considering sequences the structure is more complex than metadata
because the influenza genome is segmented and we wish to allow users to
provide additional data for only some segments (see docstring for
`_parse_config_input`). For non-segmented pathogens the simpler
structure used here for metadata could also be used for sequences.

This workflow uses an old version of the CI workflow,
<https://github.com/nextstrain/.github/blob/v0/.github/workflows/pathogen-repo-ci.yaml#L233-L240>
which copies `example_data`. We could upgrade to the latest version
and use a config overlay to swap out the canonical inputs with the
example data.

Note that one of the side effects of the current implementation is that
merged inputs use the same filepath irrespective of the workflow. For
instance, both gisaid & h5n1-cattle-outbreak use the intermediate path
`results/metadata_merged.tsv`, which means it's not possible to maintain
runs of both those analysis concurrently if both were to use merged
inputs. Using separate analysis directories, e.g.
<#103> will help avoid this
shortcoming.
@jameshadfield jameshadfield force-pushed the james/run-outside-of-repo branch from 0bc582d to 66066a5 Compare March 3, 2025 21:20
@jameshadfield jameshadfield changed the base branch from master to james/genoflu-colors March 3, 2025 21:21
@jameshadfield jameshadfield changed the title WIP - allow running workflow from outside of repo Allow running workflow from outside of repo Mar 3, 2025
@jameshadfield jameshadfield marked this pull request as ready for review March 3, 2025 21:32
@jameshadfield jameshadfield force-pushed the james/run-outside-of-repo branch from 66066a5 to 2cadac9 Compare March 3, 2025 21:43
@jameshadfield jameshadfield force-pushed the james/genoflu-colors branch from 6d92c79 to 0c6f7bb Compare March 4, 2025 00:17
@jameshadfield jameshadfield force-pushed the james/run-outside-of-repo branch from 19c5160 to 6229eb9 Compare March 4, 2025 00:18
@jameshadfield
Copy link
Copy Markdown
Member Author

Rebased onto master. Successful GitHub actions tests:

Copy link
Copy Markdown
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly reviewed to catch up with all the work that is being done here.

Generally, I still need to wrap my head around the new repo layout with multiple entrypoint Snakefiles, but definitely great to support arbitrary analysis directories.

Comment thread rules/config.smk
Comment on lines +12 to +24
# load the default config which must exist.
# NOTE: despite any `--config` or `--configfile` having been already loaded into the `config` variable by this stage,
# the following `configfile: ...` expression doesn't override this, rather it recomputes the entire `config` structure
# (by merging) such that the precedence of "configfile: < --configfile < --config" is maintained.
if not os.path.exists(os.path.join(workflow.basedir, 'config.yaml')):
raise InvalidConfigError("No default config - this is the result of an error/bug in the underlying workflow.")
configfile: os.path.join(workflow.basedir, 'config.yaml')

# load a config.yaml file if it exists in the current working directory
# (current working directory is most often the directory where you ran `snakemake` from, but can be changed via
# Snakemake's --directory arg)
if os.path.exists("config.yaml"):
configfile: "config.yaml"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking

FWIW, setting the configfile directive here feels very hidden to me. I was initially confused how genome-focused-d1.1/Snakefile and genome-focused/Snakefile were using different config files. I had to trace from genome-focused/Snakefile -> rules/main.smk -> rules/config.smk to get here.

I think it should be immediately visible within each entrypoint Snakefile that it's setting some default configfile. I'd say it's fine to duplicate this bit across the couple of entrypoint Snakefiles that are in this repo.


If you really want to encapsulate this, it might be clearer to directly include this file in the entrypoint Snakefiles

include: "../rules/config.smk"

include: "../rules/main.smk"

include: "../rules/genome.smk"

rule _all:
    input: rules.all.input

Another alternative is to define a shared function that returns the expected configfile paths.

def resolve_configfiles():
    configfiles = []
    if not os.path.exists(os.path.join(workflow.basedir, 'config.yaml')):
         raise InvalidConfigError("No default config - this is the result of an error/bug in the underlying workflow.")

    configfiles.append(os.path.join(workflow.basedir, 'config.yaml'))

    if os.path.exists("config.yaml"):
        configfiles.append("config.yaml")
    return configfiles

Then set the configfiles within the entrypoint Snakefiles (e.g. genome-focused/Snakefile)

include: "../rules/config.smk"

include: "../rules/main.smk"

include: "../rules/genome.smk"

for configfile in resolve_configfiles():
    configfile: configfile

rule _all:
    input: rules.all.input

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good feedback, thanks! I've switched to your first suggestion.

I'm happy for post-merge tweaks to this - we could add in the configfile: directives to the Snakefile if that's felt clearer, and we could remove the dummy _all rule and instead define the actual rule here too:

rule all:
    input:
        auspice_json = expand_target_patterns()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'd think it'd be clearest to have each entrypoint Snakefile define the high level things such as configfile and rule all.

Comment thread README.md

```bash
nextstrain build . --configfile config/gisaid.yaml -f deploy_all
nextstrain build . --snakefile segment-focused/Snakefile -f deploy_all
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking

Huh, I was initially going to suggest running as nextstrain build segment-focused -f deploy_all, but then realized that is slightly different because the output files will be under segment-focused instead of at the top level.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah - it's basically (exactly?) saying that segment-focused/ is the analysis directory. Because of how our .gitignore is written this'll work fine & keep things ignored, although I think that wasn't the intention. Maybe running in this way is something for use to consider in the future - it's kind of similar to splitting the ingest up into {fauna,ncbi,joined-ncbi} each with their own {data,results} - although I haven't thought about this much.

Comment thread README.md Outdated
Comment on lines +220 to +230
We can run the avian-flu worfklows from analysis directory/directories separate to this repo. Imagine the following directory structure:

```
├── avian-flu
│ ├── README.md
│ ├── segment-focused
│ │ ├── Snakefile
│ │ └── config.yaml
│ └── ... (lots more files)
└── analysis_dir
```
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking

Maybe worth noting that through the work you've done to resolve paths, it is also possible now to specify arbitrary analysis directories within the repo.

This makes it possible to run with the nextstrain build command

nextstrain build . -s genome-focused/Snakefile --directory test

The outputs will be within avian-flu/test/

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a note about the --directory argument to the README, but felt it was too much to propose having this inside the avian-flu repo to users. It's totally possible, and for those who know how to do it it may be useful. The .gitignore will keep them out of version control so this may be quite ergonomic for some people.

@jameshadfield
Copy link
Copy Markdown
Member Author

Generally, I still need to wrap my head around the new repo layout with multiple entrypoint Snakefiles, but definitely great to support arbitrary analysis directories.

Yeah - in the context of this PR the additional boilerplate / cognitive overhead of two essentially dummy Snakefiles seems a little tenuous. It's not needed to allow arbitrary analysis directories -- the original version of this PR (many moons ago) used a different approach with a single Snakefile entrypoint -- but this approach was chosen to better fit with the workflows-as-programs concept & thus nextstrain run. Where it'll become quite useful is when/if we start to have separate workflows which don't share so much common code like these ones do and better fit the spirit of "different workflows"; in a sense ingest/phylo is exactly this.

P.S. Thanks for the review - much appreciated!

Base automatically changed from james/genoflu-colors to master March 5, 2025 08:34
This starts the process of redefining the workflow interface from a
single workflow / entrypoint /Snakefile with different configfiles to
choose behaviour to an interface of multiple Snakefiles, each with their
own default configs. Ultimately this will facilitate running the
workflows in separate working directories and workflows-as-programs
(`nextstrain run ...`)

Note that because we cannot use the same config file to build both D1.1
and B3.13 builds we use a separate workflow for each. With the move to
a more expressive config syntax in <#104>
we will drop the D1.1 workflow.

I initially chose `ncbi/` and `gisaid/` as the entrypoint/workflow names
but moved away from that as I want to decouple the workflow from the
data source. As described in the README changes here it's trivial to run
either workflow using either data input. One day we may even merge these
workflows into a single config-definable workflow, but if we ever do so
it feels a long way off.
Since the top-level Snakefile is no longer intended to be run directly
(see previous commit) removing it avoids confusion. We also change the
name of the 'cattle-flu' rules to better align with the aims of the
'genome-focused' pipeline
Preferentially resolving paths to the analysis directory allows us to
run workflows in a separate workflow and override certain files
(defaults) as needed. Combined with the nicety of automatically using a
working directories `config.yaml` this allows us to run workflows from
separate analysis directories in both a convenient and powerful way.

Note that when we need to combine path resolution with resolving the
underlying config value itself we lose the benefit of currying - e.g.
the following is clunky and can be improved in a subsequent commit:

```
 colors = lambda w: resolve_config_path(get_config('colors', 'hardcoded', w))(w),
```
This makes the (common) task of resolving a path from a config lookup
much more ergonomic, and the style we use to resolve a config value vs
path is now the same (`resolve_config_value` and `resolve_config_path`,
respectively). This implementation will also make
<#104> simpler to implement.
@jameshadfield jameshadfield force-pushed the james/run-outside-of-repo branch from 6229eb9 to 9ef6cf2 Compare March 5, 2025 08:36
@jameshadfield jameshadfield merged commit 4f0ea44 into master Mar 5, 2025
6 checks passed
@jameshadfield jameshadfield deleted the james/run-outside-of-repo branch March 5, 2025 08:37
genehack added a commit that referenced this pull request Mar 10, 2025
* fix subject-verb agreement in a couple places
* rewrite two section headers for consistency with the rest of the doc
jameshadfield added a commit that referenced this pull request Mar 10, 2025
…of-repo-branch

Some nits I noticed while reviewing #103
joverlee521 added a commit to nextstrain/augur that referenced this pull request Mar 11, 2025
joverlee521 added a commit to nextstrain/docker-base that referenced this pull request Mar 11, 2025
joverlee521 added a commit to nextstrain/conda-base that referenced this pull request Mar 11, 2025
Comment thread .github/workflows/ci.yaml
uses: nextstrain/.github/.github/workflows/pathogen-repo-ci.yaml@v0
with:
build-args: --configfile config/gisaid.yaml -pf test_target
build-args: --snakefile segment-focused/Snakefile -pf test_target
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally missed this change during review, required small follow ups in various CI workflows.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤦 Thanks so much for sorting this out!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants