Allow for multiple inputs from the config file by j23414 · Pull Request #80 · nextstrain/zika

j23414 · 2025-02-24T03:52:36Z

Description of proposed changes

to-be-written

Implementation copied and modified from:

https://github.com/nextstrain/avian-flu/blob/ba8d73cbf0ed3134783724182762fc690be8fdd7/Snakefile#L86-L227

This is inspired by the work in:

nextstrain/avian-flu#106
https://github.com/nextstrain/avian-flu?tab=readme-ov-file#use-additional-metadata-andor-sequences

Related issue(s)

Checklist

Checks pass

jameshadfield

Implementation copied and modified from [avian-flu]

I think this would be a good point to try out @trs's suggestion regarding built-in remote file support. We could do that here or in avian-flu. This is also a good time to consider the best way to spread this code around repos (e.g. as a stand-alone rules file, or vendored, or... ).

To be clear, I'm happy with the interface, but before we spread the implementation around different pathogen repos we should improve it as much as possible.

P.S. Avian-flu PR nextstrain/avian-flu#106 was superseded by nextstrain/avian-flu#112

j23414 · 2025-02-24T20:20:45Z

+1 for getting built-in remote file support

Should I work on this or do you want to? You can push to this branch and zika is a simpler test case than avian (segment wildcards abound).

best way to spread this code around repos (e.g. as a stand-alone rules file, or vendored, or... ).

I lean toward the "stand-alone rules" file for the moment, because I can imagine needing to adjust the "augur merge" parameters based on the spiked-in-private data for a particular pathogen. But I'm open to snakemake modules:

inferred example*

module ingest_ncbi:
    snakefile:
        github(
            "nextstrain/pathogen-repo-guide",
            path="ingest/rules/fetch_from_ncbi.smk",
            branch="main",
        )

or creating a "vendored" directory for phylogenetic workflows (currently only for ingest workflows).

jameshadfield · 2025-02-24T20:30:52Z

Should I work on this or do you want to? You can push to this branch and zika is a simpler test case than avian (segment wildcards abound).

Agree that non-segmented is an easier place to build this out, and I think any such solutions would also work for the segmented cases. I'm not able to work on this in the short term (one of the reasons I merged avian-flu when I did is that I needed it's capabilities for the ongoing outbreak work). Perhaps dev-chat would be a good place to surface this?

best way to spread this code around repos

I'm not pushing for any particular solution, but I know it's a subject that a people in the group bring up regularly so wanted to point it out here.

trvrb · 2025-06-26T20:23:28Z

In terms of interface (as opposed to implementation), I'm unclear about why there is affordance to have both extra elements in inputs and extra elements in additional_inputs. As an outside user, I wouldn't know whether to add my data to inputs or to additional_inputs. Why not just use inputs?

As a user, I also expect the ability to use HTTP as well as local files as entries.

jameshadfield · 2025-06-26T20:30:44Z

In terms of interface (as opposed to implementation), I'm unclear about why there is affordance to have both extra elements in inputs and extra elements in additional_inputs. As an outside user, I wouldn't know whether to add my data to inputs or to additional_inputs. Why not just use inputs?

Yes, this should be documented more generally - right now the only place is the avian-flu readme. TL;DR you want inputs to overwrite any defaults with your own data, use additional_inputs if you want to spike in your data.

trvrb · 2025-06-26T20:39:36Z

Thanks James! I'm still confused however (even after reading the nice documentation in avian-flu readme) if there's a difference between

inputs:
  - name: ncbi
    metadata: s3://nextstrain-data/files/workflows/zika/metadata.tsv.zst
    sequences: s3://nextstrain-data/files/workflows/zika/sequences.fasta.zst
additional_inputs:
  - name: user-data
    metadata: user_data/metadata.tsv
    sequences: user_data/sequences.fasta

and

inputs:
  - name: ncbi
    metadata: s3://nextstrain-data/files/workflows/zika/metadata.tsv.zst
    sequences: s3://nextstrain-data/files/workflows/zika/sequences.fasta.zst
  - name: user-data
    metadata: user_data/metadata.tsv
    sequences: user_data/sequences.fasta

jameshadfield · 2025-06-26T20:44:18Z

Thanks James! I'm still confused however (even after reading the nice documentation in avian-flu readme) if there's a difference between

Those are identical. The benefit comes when we have a default YAML (e.g. in the pathogen repo) and an overlay config yaml (in the analysis directory), which is the imagined use case for nextstrain run. If you want to spike in your private data you could write a config overlay the same as ~~your second example~~ either of your examples, but that requires knowing the internal location of our intermediate files. You could instead simply write:

additional_inputs:
  - name: user-data
    metadata: user_data/metadata.tsv
    sequences: user_data/sequences.fasta

And, if you want to just analyse your data, you'd use inputs in that YAML not additional_inputs

trvrb · 2025-06-26T21:41:36Z

Got it. This makes sense. Thank you. I see how this is a smart strategy even if it adds some uncertainty about standard path for adding user data. We should have standard advice for using overlays that are then compatible with nextstrain run. I don't have much else in terms of interface in this case (aside from HTTP).

jameshadfield

(Left a few comments but ran out of time to fully test / review - i'll pick this up next week)

joverlee521

Left some comments and suggestions.

The last dev chat made it seem like we don't want to slow this down with implementation details so you can consider my comments as non-blocking. I do think that there's a clearer path for vendoring ncov's remote file support now than when this was implemented in avian-flu so I'd be in favor of doing that now rather than later.

genehack · 2025-07-01T17:16:52Z

@@ -1,7 +1,14 @@
 # Sequences must be FASTA and metadata must be TSV
 # Both files must be zstd compressed


is it going to be confusing that the additional_inputs files are not compressed?

Good question! I'll drop these comments for now as out of date. Ideally we should be able to handle compressed or uncompressed inputs from s3, https, local, or workflow specific in the final implementation

Reminder to update this comment. I investigated the supported compressions a little bit as part of #89 (under the "File Compression" section of the PR description). What compressions we actually support will be influenced by what you do with the merge rules.

Dropped comments entirely in 72f9d21
But let me know if I misunderstood something here

I think it'd be great to document what compressions are supported, although this is not entirely generalizable across pathogen repos in the way we're aiming for with the multiple_inputs so we have to be careful with how we write the comment. As I understand it, and assuming we use separate merge_sequences and merge_metadata rules, then the supported compressions are:

Metadata

Multiple sources: whatever augur merge supports

Single source: what's supported by the rules where input_metadata is used (4 different rules for zika as it currently stands)

Sequences

Multiple sources: whatever augur merge supports

Single source: what's supported by the rules where input_sequences is used (only one rule for zika as it currently stands)

Perhaps I can link to the augur docs on different compression support, similar to Nextclade docs?

https://docs.nextstrain.org/projects/nextclade/en/stable/user/input-files/compression.html
If it exists in augur docs somewhere...

Migrating this to an issue: #93

This is inspired by the work in: nextstrain/avian-flu#106 https://github.com/nextstrain/avian-flu?tab=readme-ov-file#use-additional-metadata-andor-sequences

Rebase and maintain the purpose of these lines: * https://github.com/nextstrain/zika/blame/3c17db54698cf7836641308f6aa50d714959992b/phylogenetic/rules/merge_sequences_usvi.smk#L30-L33 Where the workflow.source_path("../ is for datasets that live in the workflow source directory (not the local analysis directory), specifically when using nextstrain run.

Updated to match latest guidelines from Nextstrain's Snakemake style guide¹: - Use raw, triple-quoted shell blocks - Log standard out and error output to log files and the terminal - Always use the benchmark directive ¹ <https://docs.nextstrain.org/en/latest/reference/snakemake-style-guide.html>

Co-authored-by: John SJ Anderson <janders4@fredhutch.org>

… complexity

The code here is lifted directly from <nextstrain/ncov#1180> which itself was lifted from <#89>. See those PRs for in-depth discussions. The `remote_files.smk` file added here will be moved to the github.com/nextstrain/shared/ repo and vendored across pathogen repos, including this one. Assuming this doesn't change cf. the ncov implementation I think it's fine to have the file copy-pasted around a few repos before it arrives in the shared repo.

j23414 · 2025-07-22T20:19:12Z

 """

-include: "remote_files.smk"
+include: "../../shared/vendored/snakemake/remote_files.smk"


I'm on the fence about this include statement going here or the main Snakefile. Does anyone have a preference?

I have slight preference for adding the include statment to the main Snakefile in case we want to use the shared path_or_url function in other parts of the workflow.

Good point about scoping! I leaned toward main Snakefile so fixed up in 75dc2e0

Agreed - the intention was for path_or_url to be available for wider use than just the inputs.

One use-case we don't support (as discussed elsewhere) is something like:

inputs: foo = path_or_url(config['bar'])

If config['bar'] is a URL it'll just work when run from either the pathogen repo or an analysis directory

If config['bar'] is a path which exists in the analysis directory (and we're running it in an analysis directory) it'll just work

If config['bar'] is a path which exists in the pathogen repo and we're running it in an analysis directory then it won't work

(There's no easy solution as far as I can see. You can't just use something like resolve_config_path within path_or_url because we support output files and because some workflows use modified versions of resolve_config_path. We could expand path_or_url to take a path_resolver argument which would force the path to exist, i.e. only be useful in the context of params/inputs. Happy to implement this if people want.)

If config['bar'] is a path which exists in the pathogen repo and we're running it in an analysis directory then it won't work

Hmm, I thought we could wrap resolve_config_path around path_or_url, since path_or_url should return the local file location for a URL.

inputs: foo = resolve_config_path(path_or_url(config['bar']))

but this currently runs into a weird WorkflowError (reported in snakemake/snakemake#3027)

Error: WorkflowError: Flags ({'storage_object': <snakemake_storage_plugin_http.StorageObject object at 0xffffa0415f90>}) in file pattern '.snakemake/storage/http/github.com/nextstrain/augur/blob/fe3f57975c34ff180cb437bebfefd68fa4a4afed/augur/data/geolocation_rules.tsv' given to expand() are invalid. Flags (e.g. temp(), directory()) have to be applied outside of expand (e.g. 'temp(expand("plots/{sample}.pdf", sample=SAMPLES))'). Wildcards:

Beyond the workflow error you get, conceptually the .snakemake/storage file may not exist when the inputs are evaluated so our path resolution approach wouldn't work. This is somewhat analogous to using path_or_url for outputs. Path evaluation within path_or_url will be straightforward I think (via something like path_or_url(uri, path_resolver))

joverlee521

I left some small nit-picky comments, but I think this should be good to merge?

Co-authored-by: Jover Lee <joverlee521@gmail.com>

jameshadfield reviewed Feb 24, 2025

View reviewed changes

jameshadfield mentioned this pull request Feb 24, 2025

Merge additional data during phylogenetic (instead of ingest) workflow nextstrain/WNV#68

Merged

1 task

j23414 force-pushed the refactor-additional-inputs branch from c74621b to ab23615 Compare March 3, 2025 19:05

j23414 commented Jun 26, 2025

View reviewed changes

Comment thread phylogenetic/build-configs/ci/config.yaml

jameshadfield reviewed Jun 27, 2025

View reviewed changes

Comment thread phylogenetic/rules/merge_sequences_usvi.smk Outdated

Comment thread phylogenetic/rules/merge_sequences_usvi.smk Outdated

Comment thread phylogenetic/rules/merge_sequences_usvi.smk Outdated

j23414 force-pushed the refactor-additional-inputs branch from ab23615 to 5a014cb Compare June 27, 2025 21:06

jameshadfield requested changes Jun 29, 2025

View reviewed changes

j23414 changed the title ~~WIP: Allow for multiple inputs from the config file~~ Allow for multiple inputs from the config file Jun 30, 2025

j23414 marked this pull request as ready for review June 30, 2025 18:35

joverlee521 reviewed Jul 1, 2025

View reviewed changes

Comment thread phylogenetic/rules/merge_sequences.smk Outdated

Comment thread phylogenetic/rules/merge_sequences.smk Outdated

Comment thread phylogenetic/rules/merge_sequences.smk Outdated

genehack reviewed Jul 1, 2025

View reviewed changes

This was referenced Jul 1, 2025

phylogenetic workflow updates nextstrain/mpox#321

Merged

phylogenetic: support standardized multiple input nextstrain/mpox#324

Closed

jameshadfield mentioned this pull request Jul 4, 2025

WIP - snakemake v8+ remote storage support #89

Closed

j23414 force-pushed the refactor-additional-inputs branch from be3f42f to 0822b83 Compare July 14, 2025 17:55

joverlee521 mentioned this pull request Jul 18, 2025

Add remote_files helper functions nextstrain/shared#56

Closed

jameshadfield mentioned this pull request Jul 21, 2025

Update shared/vendored #92

Merged

j23414 added 6 commits July 21, 2025 15:00

Allow for multiple inputs from the config file

0d0af17

This is inspired by the work in: nextstrain/avian-flu#106 https://github.com/nextstrain/avian-flu?tab=readme-ov-file#use-additional-metadata-andor-sequences

Simplifies connecting phylogenetic example data

67148bd

Merge: Update documentation

1ca6131

Merge: drop usvi from smk filename

a8bdc68

j23414 and others added 6 commits July 21, 2025 15:00

fixup: grammar

c95f115

Co-authored-by: John SJ Anderson <janders4@fredhutch.org>

Fixup: startswith can take a tuple of strings

be7eb50

Co-authored-by: John SJ Anderson <janders4@fredhutch.org>

Fixup: Generalize merge_sequences.smk to merge_inputs.smk

a2bbd48

Fixup: Remove commented out code

a8793cc

fixup: remove lambda_none since it's a holdover from segment-specific…

37b1340

… complexity

j23414 force-pushed the refactor-additional-inputs branch from 0822b83 to a7da6c3 Compare July 21, 2025 22:00

j23414 added 2 commits July 22, 2025 12:30

Switch to use the shared/vendored remote_files.smk

4afc48f

Pull USVI data from s3 instead of locally

3aeebf8

j23414 commented Jul 22, 2025

View reviewed changes

Fixup: Switch to use the shared/vendored remote_files.smk

75dc2e0

jameshadfield reviewed Jul 22, 2025

View reviewed changes

Comment thread phylogenetic/build-configs/ci/config.yaml Outdated

j23414 added 3 commits July 22, 2025 15:13

Place both ncbi and usvi data under 'inputs'

1efbb4b

Example Data: Place both ncbi and usvi data under 'inputs'

69eff79

Cleanup out of date comments

72f9d21

victorlin mentioned this pull request Jul 23, 2025

merge: Inconsistent handling of single input nextstrain/augur#1865

Closed

j23414 force-pushed the refactor-additional-inputs branch from d7df968 to 72f9d21 Compare July 23, 2025 17:10

victorlin mentioned this pull request Jul 23, 2025

merge: Unhelpful error message with named sequence inputs nextstrain/augur#1867

Open

joverlee521 approved these changes Jul 24, 2025

View reviewed changes

Comment thread phylogenetic/data/README.md

Comment thread phylogenetic/rules/merge_inputs.smk Outdated

Comment thread phylogenetic/defaults/config.yaml

j23414 and others added 5 commits July 24, 2025 23:16

fixup: quoted params

f65c5e1

Co-authored-by: Jover Lee <joverlee521@gmail.com>

Fixup: Use augur merge over SeqKit

ee72582

Docs: Add more documentation of USVI data to description.md

fe67761

Docs: Add more documentation of USVI data to phylogenetic/README.md

ea7a722

Docs: Document major change of multiple inputs in CHANGELOG

d60cb9b

j23414 mentioned this pull request Jul 28, 2025

Document input compression support #93

Open

j23414 merged commit 59f50ef into main Jul 28, 2025
4 checks passed

j23414 deleted the refactor-additional-inputs branch July 28, 2025 17:26

This was referenced Jul 29, 2025

phylogenetic: Document how to use inputs and additional_inputs config #94

Closed

Multi input followup #95

Merged

joverlee521 mentioned this pull request Aug 8, 2025

phylogenetic: update to use shared input handling nextstrain/WNV#99

Closed

		@@ -1,7 +1,14 @@
		# Sequences must be FASTA and metadata must be TSV
		# Both files must be zstd compressed

Conversation

j23414 commented Feb 24, 2025

Description of proposed changes

Related issue(s)

Checklist

Uh oh!

jameshadfield left a comment

Choose a reason for hiding this comment

Uh oh!

j23414 commented Feb 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jameshadfield commented Feb 24, 2025

Uh oh!

trvrb commented Jun 26, 2025

Uh oh!

jameshadfield commented Jun 26, 2025

Uh oh!

trvrb commented Jun 26, 2025

Uh oh!

Uh oh!

jameshadfield commented Jun 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

trvrb commented Jun 26, 2025

Uh oh!

jameshadfield left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

joverlee521 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

j23414 Jul 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

j23414 Jul 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jameshadfield Jul 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

joverlee521 left a comment

j23414 commented Feb 24, 2025 •

edited

Loading

jameshadfield commented Jun 26, 2025 •

edited

Loading

j23414 Jul 25, 2025 •

edited

Loading

j23414 Jul 22, 2025 •

edited

Loading

jameshadfield Jul 22, 2025 •

edited

Loading