Allow different (multiple) inputs #106

Closed
jameshadfield wants to merge 1 commit into james/update-config-syntax from james/refactor-inputs

Member

@jameshadfield jameshadfield commented Dec 2, 2024

By having all phylogenetic workflows start from two lists of inputs
(`config.inputs`, `config.additional_inputs`) we enable a broad range of
uses with a consistent interface.

1. Using local ingest files is trivial (see added docs) and doesn't need
   a bunch of special-cased logic that is prone to falling out of date
   (as it had indeed done).
2. Adding extra / private data follows a similar pattern, with an
   additional config list being used so that we are explicit that the
   new data is additional and so that we enforce an ordering, which is
   needed for predictable `augur merge` behaviour. The canonical data can
   be removed / replaced via step (1) if needed.
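
To make the two-list idea concrete, here is a minimal Python sketch of how the ordered inputs might be gathered. It is not the code from this PR: the function name `collect_inputs_sketch`, its explicit `config` argument, and the `ingest/results/...` paths are all assumptions made for illustration. Only the config keys (`inputs`, `additional_inputs`, `name`, `metadata`, `sequences`) and the `{segment}` placeholder come from the discussion.

```python
def collect_inputs_sketch(config, segment=None, kind="sequences"):
    """Hypothetical helper: return input paths in merge order.

    Canonical entries (config["inputs"]) come first and additional entries
    (config["additional_inputs"]) come last, so the order is predictable.
    """
    ordered_sources = [*config.get("inputs", []), *config.get("additional_inputs", [])]
    paths = []
    for source in ordered_sources:
        template = source.get(kind)
        if template is None:
            continue  # this source does not provide this kind of file
        # Expand the {segment} placeholder only when the template contains one
        if "{segment}" in template:
            template = template.format(segment=segment)
        paths.append(template)
    return paths


# Example config; the "ingest" paths are made up for this sketch
config = {
    "inputs": [
        {
            "name": "ingest",
            "metadata": "ingest/results/metadata.tsv",
            "sequences": "ingest/results/sequences_{segment}.fasta",
        },
    ],
    "additional_inputs": [
        {
            "name": "secret",
            "metadata": "secret.tsv",
            "sequences": "secret_{segment}.fasta",
        },
    ],
}

sequence_paths = collect_inputs_sketch(config, segment="ha", kind="sequences")
metadata_paths = collect_inputs_sketch(config, kind="metadata")
```

In this sketch, swapping the canonical data for local ingest files only means changing the `inputs` list, and private data only ever appends to the end of the order.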

I considered adding additional data after the subtype-filtering step,
which would avoid the need to add subtype in the metadata but requires
encoding this in the config overlay. I felt the chosen way was simpler
and more powerful.

Note that this workflow uses an old version of the CI workflow,
<https://github.com/nextstrain/.github/blob/v0/.github/workflows/pathogen-repo-ci.yaml#L233-L240>
which copies `example_data`. We could upgrade to the latest version
and use a config overlay to swap out the canonical inputs with the
example data.
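
As a rough illustration of the overlay idea, the Python sketch below merges an overlay into a base config. It assumes that nested dictionaries are merged key by key while lists such as `inputs` are replaced wholesale; whether the workflow's actual config loading behaves this way is not confirmed here. The helper name `apply_overlay` and the `example_data/...` and `ingest/results/...` paths are hypothetical.

```python
import copy


def apply_overlay(base, overlay):
    """Hypothetical helper: return a new config with the overlay applied.

    Dictionaries are merged recursively; any other value, including a
    list, replaces the value in the base config.
    """
    merged = copy.deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overlay(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Base config with made-up canonical paths
base_config = {
    "inputs": [
        {
            "name": "ingest",
            "metadata": "ingest/results/metadata.tsv",
            "sequences": "ingest/results/sequences_{segment}.fasta",
        },
    ],
    "additional_inputs": [],
}

# Overlay that swaps the canonical inputs for example data (paths are made up)
ci_overlay = {
    "inputs": [
        {
            "name": "example",
            "metadata": "example_data/metadata.tsv",
            "sequences": "example_data/sequences_{segment}.fasta",
        },
    ],
}

ci_config = apply_overlay(base_config, ci_overlay)
```

Under these assumptions the CI run would see only the example data in `inputs`, and the base config would be left untouched.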

See added docs for examples.

Member

@victorlin victorlin left a comment

Comments regarding sequence merge

Comment thread Snakefile
Comment on lines +320 to +332
    input:
        metadata = lambda w: collect_inputs(segment=w.segment)
    output:
        metadata = "results/sequences_merged_{segment}.fasta"
Member

Naming nitpick:

Suggested change
-    input:
-        metadata = lambda w: collect_inputs(segment=w.segment)
-    output:
-        metadata = "results/sequences_merged_{segment}.fasta"
+    input:
+        sequences = lambda w: collect_inputs(segment=w.segment)
+    output:
+        sequences = "results/sequences_merged_{segment}.fasta"

Comment thread README.md
additional_inputs:
  - name: secret
    metadata: secret.tsv
    sequencs: secret_{segment}.fasta
Member Author

typo

@jameshadfield
Member Author

Closing in favor of #112

@victorlin victorlin deleted the james/refactor-inputs branch December 17, 2024 23:31
2 participants