
Multiple inputs / overriding inputs#112

Merged
jameshadfield merged 1 commit into master from james/refactor-inputs-mk2
Feb 17, 2025

Conversation

@jameshadfield (Member) commented Dec 16, 2024

This is a refactored version of #106 but on top of the current master branch. Feedback indicated the ideas in that PR were ready for review and merge.

By having all phylogenetic workflows start from two lists of inputs (config.inputs, config.additional_inputs) we enable a broad range of uses with a consistent interface.

  1. Using local ingest files is trivial (see added docs) and doesn't need a bunch of special-cased logic that is prone to falling out of date (as it had indeed done)
  2. Adding extra / private data follows a similar pattern, using a separate config list so that it is explicit that the new data is additional, and to enforce the ordering needed for predictable augur merge behaviour. The canonical data can be removed / replaced via step (1) if needed.
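To make the interface concrete, here is a minimal sketch of how the two lists combine into `augur merge` arguments. The field names, input names, and the s3 address are hypothetical, not the workflow's exact schema:

```python
# Hypothetical sketch of the two-list interface; names, fields and the
# s3 address below are illustrative, not the workflow's exact schema.
config = {
    "inputs": [  # canonical data
        {"name": "gisaid", "metadata": "s3://example-bucket/metadata.tsv.zst"},
    ],
    "additional_inputs": [  # extra / private data
        {"name": "private", "metadata": "private/metadata.tsv"},
    ],
}

# List order is preserved: canonical inputs first, additional inputs
# last, so `augur merge` resolves conflicts predictably (later inputs
# take precedence).
ordered = config["inputs"] + config["additional_inputs"]
merge_args = [f"{entry['name']}={entry['metadata']}" for entry in ordered]
```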

I considered adding additional data after the subtype-filtering step, which would avoid the need to add the subtype to the metadata but would require encoding this in the config overlay. I felt the chosen way was simpler and more powerful.

Note that this workflow uses an old version of the CI workflow, https://github.com/nextstrain/.github/blob/v0/.github/workflows/pathogen-repo-ci.yaml#L233-L240 which copies example_data. We could upgrade to the latest version and use a config overlay to swap out the canonical inputs with the example data.

Note that one of the side effects of the current implementation is that merged inputs use the same filepath irrespective of the workflow. For instance, both gisaid & h5n1-cattle-outbreak use the intermediate path results/metadata_merged.tsv, which means it's not possible to run both of those analyses concurrently if both were to use merged inputs. Using separate analysis directories, e.g. #103, will help avoid this shortcoming.

@tsibley (Contributor) left a comment

I think the outward-facing config interface is pretty sound and good, but I think there's a lot of refinement left for the implementation in the workflow. Refinement that would benefit us as authors and anyone else reading the workflow. I would not want to promulgate this implementation onwards to other workflows/pathogens without further refinement first, so I think it's worth iterating on here and now rather than later.

Comment on lines +85 to +106
# Address may include the '{segment}' wildcard
formatted_address = address.format(segment=segment) \
if (segment and '{segment}' in address) \
else address

# addresses may be a remote filepath or a local file
if address.startswith('s3://'):
# path defines the location of the downloaded file (i.e. produced by a download rule)
path = f"data/{name}/metadata.tsv" \
if is_metadata \
else f"data/{name}/sequences_{segment}.fasta"
elif address.lower().startswith(r'http[s]:\/\/'):
raise InvalidConfigError("Workflow cannot yet handle HTTP[S] inputs")
else:
path = formatted_address

return {
'name': name,
'path': path,
'merge_arg': f"{name}={path}",
'address': formatted_address
}
Contributor

The awkward path vs. address distinction and handling here (and having to reference the output paths of distant rules) would go away if we used something like Snakemake's remote files support (e.g. as in ncov).
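For reference, Snakemake's built-in remote-file support (the pre-v8 `snakemake.remote` providers) looks roughly like the sketch below; note that ncov's `remote_files.smk` is its own implementation, not this, and the rule name and URL here are illustrative:

```python
# Snakefile sketch (illustrative rule name and URL). Uses Snakemake's
# pre-v8 built-in remote providers; this is not ncov's remote_files.smk.
from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider

HTTP = HTTPRemoteProvider()

rule download_metadata:
    input:
        # downloaded once and kept locally for downstream rules
        HTTP.remote("data.example.org/metadata.tsv", keep_local=True)
    output:
        "data/metadata.tsv"
    shell:
        "cp {input} {output}"
```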

Member Author

Yup, it would be interesting to see how you imagine remote files being used more widely than just ncov.

Contributor

As a first step, I think we can extract ncov's remote_files.smk and maintain it as a vendored file used (include-ed) across our workflows.

Looking at the above code snippet from this workflow again, I noticed a bug:

    elif address.lower().startswith(r'http[s]:\/\/'):

That looks like an attempt at a regex pattern to me, but str.startswith() takes a fixed string.
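A quick demonstration of the behaviour (the address is illustrative), along with one straightforward fix using the tuple-of-prefixes form that `str.startswith()` accepts:

```python
address = "https://data.example.org/metadata.tsv"

# The regex-like string is taken as a literal prefix, so this check
# can never match a real http(s) address:
assert not address.lower().startswith(r'http[s]:\/\/')

# str.startswith() accepts a tuple of plain prefixes:
assert address.lower().startswith(('http://', 'https://'))
```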

Contributor

I think it'll be less work now and in the future to start with re-using our existing work for remote files support than it would be to maintain new ad-hoc code like the above.

Member Author

> As a first step...

Sounds good - such a PR / issue would be a good place to discuss any concerns with vendoring and the current ncov implementation. While the code introduced here is new, it's no more "ad-hoc" than the previous implementation with S3_SRC and LOCAL_INGEST. It doesn't seem unreasonable to implement remote file support before spreading the config.inputs + config.additional_inputs interface across repos, if you'd like to prioritise this.

> I noticed a bug

Thanks. Fixed.

@jameshadfield force-pushed the james/refactor-inputs-mk2 branch 3 times, most recently from 1a66e34 to fde2126 on January 7, 2025 at 21:27
By having all phylogenetic workflows start from two lists of inputs
(`config.inputs`, `config.additional_inputs`) we enable a broad range of
uses with a consistent interface.

1. Using local ingest files is trivial (see added docs) and doesn't need
   a bunch of special-cased logic that is prone to falling out of date
   (as it had indeed done)
2. Adding extra / private data follows a similar pattern, using a
   separate config list so that it is explicit that the new data is
   additional, and to enforce the ordering needed for predictable
   `augur merge` behaviour. The canonical data can be removed /
   replaced via step (1) if needed.

I considered adding additional data after the subtype-filtering step,
which would avoid the need to add the subtype to the metadata but
would require encoding this in the config overlay. I felt the chosen
way was simpler and more powerful.

When considering sequences the structure is more complex than metadata
because the influenza genome is segmented and we wish to allow users to
provide additional data for only some segments (see docstring for
`_parse_config_input`). For non-segmented pathogens the simpler
structure used here for metadata could also be used for sequences.
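A sketch of that wildcard handling, mirroring the `formatted_address` logic in the Snakefile (the helper name is illustrative):

```python
def format_address(address, segment=None):
    """Fill in the optional '{segment}' wildcard so users can supply
    sequence files for only some segments; metadata addresses (no
    wildcard) pass through unchanged. Illustrative helper name."""
    if segment and '{segment}' in address:
        return address.format(segment=segment)
    return address

# e.g.
# format_address("private/sequences_{segment}.fasta", "ha")
#   -> "private/sequences_ha.fasta"
```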

This workflow uses an old version of the CI workflow,
<https://github.com/nextstrain/.github/blob/v0/.github/workflows/pathogen-repo-ci.yaml#L233-L240>
which copies `example_data`. We could upgrade to the latest version
and use a config overlay to swap out the canonical inputs with the
example data.

Note that one of the side effects of the current implementation is that
merged inputs use the same filepath irrespective of the workflow. For
instance, both gisaid & h5n1-cattle-outbreak use the intermediate path
`results/metadata_merged.tsv`, which means it's not possible to run
both of those analyses concurrently if both were to use merged inputs.
Using separate analysis directories, e.g.
<#103>, will help avoid this
shortcoming.