Skip to content

Allow for multiple inputs from the config file#80

Merged
j23414 merged 23 commits intomainfrom
refactor-additional-inputs
Jul 28, 2025
Merged

Allow for multiple inputs from the config file#80
j23414 merged 23 commits intomainfrom
refactor-additional-inputs

Conversation

@j23414
Copy link
Copy Markdown
Contributor

@j23414 j23414 commented Feb 24, 2025

Description of proposed changes

to-be-written

Implementation copied and modified from:

This is inspired by the work in:

nextstrain/avian-flu#106
https://github.com/nextstrain/avian-flu?tab=readme-ov-file#use-additional-metadata-andor-sequences

Related issue(s)

Checklist

  • Checks pass

Copy link
Copy Markdown
Member

@jameshadfield jameshadfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementation copied and modified from [avian-flu]

I think this would be a good point to try out @trs's suggestion regarding built-in remote file support. We could do that here or in avian-flu. This is also a good time to consider the best way to spread this code around repos (e.g. as a stand-alone rules file, or vendored, or... ).

To be clear, I'm happy with the interface, but before we spread the implementation around different pathogen repos we should improve it as much as possible.

P.S. Avian-flu PR nextstrain/avian-flu#106 was superseded by nextstrain/avian-flu#112

@j23414
Copy link
Copy Markdown
Contributor Author

j23414 commented Feb 24, 2025

+1 for getting built-in remote file support

Should I work on this or do you want to? You can push to this branch and zika is a simpler test case than avian (segment wildcards abound).

best way to spread this code around repos (e.g. as a stand-alone rules file, or vendored, or... ).

I lean toward the "stand-alone rules" file for the moment, because I can imagine needing to adjust the "augur merge" parameters based on the spiked-in-private data for a particular pathogen. But I'm open to snakemake modules:

inferred example*

module ingest_ncbi:
    snakefile:
        github(
            "nextstrain/pathogen-repo-guide",
            path="ingest/rules/fetch_from_ncbi.smk",
            branch="main",
        )

or creating a "vendored" directory for phylogenetic workflows (currently only for ingest workflows).

@jameshadfield
Copy link
Copy Markdown
Member

Should I work on this or do you want to? You can push to this branch and zika is a simpler test case than avian (segment wildcards abound).

Agree that non-segmented is an easier place to build this out, and I think any such solutions would also work for the segmented cases. I'm not able to work on this in the short term (one of the reasons I merged avian-flu when I did is that I needed it's capabilities for the ongoing outbreak work). Perhaps dev-chat would be a good place to surface this?

best way to spread this code around repos

I'm not pushing for any particular solution, but I know it's a subject that a people in the group bring up regularly so wanted to point it out here.

@j23414 j23414 force-pushed the refactor-additional-inputs branch from c74621b to ab23615 Compare March 3, 2025 19:05
@trvrb
Copy link
Copy Markdown
Member

trvrb commented Jun 26, 2025

In terms of interface (as opposed to implementation), I'm unclear about why there is affordance to have both extra elements in inputs and extra elements in additional_inputs. As an outside user, I wouldn't know whether to add my data to inputs or to additional_inputs. Why not just use inputs?

As a user, I also expect the ability to use HTTP as well as local files as entries.

@jameshadfield
Copy link
Copy Markdown
Member

In terms of interface (as opposed to implementation), I'm unclear about why there is affordance to have both extra elements in inputs and extra elements in additional_inputs. As an outside user, I wouldn't know whether to add my data to inputs or to additional_inputs. Why not just use inputs?

Yes, this should be documented more generally - right now the only place is the avian-flu readme. TL;DR you want inputs to overwrite any defaults with your own data, use additional_inputs if you want to spike in your data.

@trvrb
Copy link
Copy Markdown
Member

trvrb commented Jun 26, 2025

Thanks James! I'm still confused however (even after reading the nice documentation in avian-flu readme) if there's a difference between

inputs:
  - name: ncbi
    metadata: s3://nextstrain-data/files/workflows/zika/metadata.tsv.zst
    sequences: s3://nextstrain-data/files/workflows/zika/sequences.fasta.zst
additional_inputs:
  - name: user-data
    metadata: user_data/metadata.tsv
    sequences: user_data/sequences.fasta

and

inputs:
  - name: ncbi
    metadata: s3://nextstrain-data/files/workflows/zika/metadata.tsv.zst
    sequences: s3://nextstrain-data/files/workflows/zika/sequences.fasta.zst
  - name: user-data
    metadata: user_data/metadata.tsv
    sequences: user_data/sequences.fasta

Comment thread phylogenetic/build-configs/ci/config.yaml
@jameshadfield
Copy link
Copy Markdown
Member

jameshadfield commented Jun 26, 2025

Thanks James! I'm still confused however (even after reading the nice documentation in avian-flu readme) if there's a difference between

Those are identical. The benefit comes when we have a default YAML (e.g. in the pathogen repo) and an overlay config yaml (in the analysis directory), which is the imagined use case for nextstrain run. If you want to spike in your private data you could write a config overlay the same as your second example either of your examples, but that requires knowing the internal location of our intermediate files. You could instead simply write:

additional_inputs:
  - name: user-data
    metadata: user_data/metadata.tsv
    sequences: user_data/sequences.fasta

And, if you want to just analyse your data, you'd use inputs in that YAML not additional_inputs

@trvrb
Copy link
Copy Markdown
Member

trvrb commented Jun 26, 2025

Got it. This makes sense. Thank you. I see how this is a smart strategy even if it adds some uncertainty about standard path for adding user data. We should have standard advice for using overlays that are then compatible with nextstrain run. I don't have much else in terms of interface in this case (aside from HTTP).

Copy link
Copy Markdown
Member

@jameshadfield jameshadfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Left a few comments but ran out of time to fully test / review - i'll pick this up next week)

Comment thread phylogenetic/rules/merge_sequences_usvi.smk Outdated
Comment thread phylogenetic/rules/merge_sequences_usvi.smk Outdated
Comment thread phylogenetic/rules/merge_sequences_usvi.smk Outdated
@j23414 j23414 force-pushed the refactor-additional-inputs branch from ab23615 to 5a014cb Compare June 27, 2025 21:06
Comment thread phylogenetic/rules/merge_sequences.smk Outdated
Comment thread phylogenetic/rules/merge_sequences.smk Outdated
Comment thread phylogenetic/rules/merge_sequences.smk Outdated
Comment thread phylogenetic/defaults/config.yaml Outdated
Comment thread phylogenetic/defaults/config.yaml Outdated
@j23414 j23414 changed the title WIP: Allow for multiple inputs from the config file Allow for multiple inputs from the config file Jun 30, 2025
@j23414 j23414 marked this pull request as ready for review June 30, 2025 18:35
Copy link
Copy Markdown
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments and suggestions.

The last dev chat made it seem like we don't want to slow this down with implementation details so you can consider my comments as non-blocking. I do think that there's a clearer path for vendoring ncov's remote file support now than when this was implemented in avian-flu so I'd be in favor of doing that now rather than later.

Comment thread phylogenetic/rules/merge_sequences.smk Outdated
Comment thread phylogenetic/rules/merge_sequences.smk Outdated
Comment thread phylogenetic/rules/merge_sequences.smk Outdated
Comment thread phylogenetic/defaults/config.yaml Outdated
@@ -1,7 +1,14 @@
# Sequences must be FASTA and metadata must be TSV
# Both files must be zstd compressed
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it going to be confusing that the additional_inputs files are not compressed?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question! I'll drop these comments for now as out of date. Ideally we should be able to handle compressed or uncompressed inputs from s3, https, local, or workflow specific in the final implementation

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reminder to update this comment. I investigated the supported compressions a little bit as part of #89 (under the "File Compression" section of the PR description). What compressions we actually support will be influenced by what you do with the merge rules.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dropped comments entirely in 72f9d21
But let me know if I misunderstood something here

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be great to document what compressions are supported, although this is not entirely generalizable across pathogen repos in the way we're aiming for with the multiple_inputs so we have to be careful with how we write the comment. As I understand it, and assuming we use separate merge_sequences and merge_metadata rules, then the supported compressions are:

  • Metadata
    • Multiple sources: whatever augur merge supports
    • Single source: what's supported by the rules where input_metadata is used (4 different rules for zika as it currently stands)
  • Sequences
    • Multiple sources: whatever augur merge supports
    • Single source: what's supported by the rules where input_sequences is used (only one rule for zika as it currently stands)

Copy link
Copy Markdown
Contributor Author

@j23414 j23414 Jul 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps I can link to the augur docs on different compression support, similar to Nextclade docs?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Migrating this to an issue: #93

Comment thread phylogenetic/rules/merge_sequences.smk Outdated
Comment thread phylogenetic/rules/merge_sequences.smk Outdated
Comment thread phylogenetic/rules/merge_sequences.smk Outdated
j23414 added 6 commits July 21, 2025 15:00
Rebase and maintain the purpose of these lines:

* https://github.com/nextstrain/zika/blame/3c17db54698cf7836641308f6aa50d714959992b/phylogenetic/rules/merge_sequences_usvi.smk#L30-L33

Where the workflow.source_path("../ is for datasets that live in the workflow source directory (not the local analysis directory), specifically when using nextstrain run.
Updated to match latest guidelines from Nextstrain's Snakemake style guide¹:

- Use raw, triple-quoted shell blocks
- Log standard out and error output to log files and the terminal
- Always use the benchmark directive

¹ <https://docs.nextstrain.org/en/latest/reference/snakemake-style-guide.html>
j23414 and others added 6 commits July 21, 2025 15:00
Co-authored-by: John SJ Anderson <janders4@fredhutch.org>
Co-authored-by: John SJ Anderson <janders4@fredhutch.org>
The code here is lifted directly from <nextstrain/ncov#1180>
which itself was lifted from <#89>.
See those PRs for in-depth discussions.

The `remote_files.smk` file added here will be moved to the
github.com/nextstrain/shared/ repo and vendored across pathogen repos,
including this one. Assuming this doesn't change cf. the ncov
implementation I think it's fine to have the file copy-pasted around a
few repos before it arrives in the shared repo.
@j23414 j23414 force-pushed the refactor-additional-inputs branch from 0822b83 to a7da6c3 Compare July 21, 2025 22:00
Comment thread phylogenetic/rules/merge_inputs.smk Outdated
"""

include: "remote_files.smk"
include: "../../shared/vendored/snakemake/remote_files.smk"
Copy link
Copy Markdown
Contributor Author

@j23414 j23414 Jul 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm on the fence about this include statement going here or the main Snakefile. Does anyone have a preference?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have slight preference for adding the include statment to the main Snakefile in case we want to use the shared path_or_url function in other parts of the workflow.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point about scoping! I leaned toward main Snakefile so fixed up in 75dc2e0

Copy link
Copy Markdown
Member

@jameshadfield jameshadfield Jul 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed - the intention was for path_or_url to be available for wider use than just the inputs.

One use-case we don't support (as discussed elsewhere) is something like:

inputs:
  foo = path_or_url(config['bar'])
  • If config['bar'] is a URL it'll just work when run from either the pathogen repo or an analysis directory
  • If config['bar'] is a path which exists in the analysis directory (and we're running it in an analysis directory) it'll just work
  • If config['bar'] is a path which exists in the pathogen repo and we're running it in an analysis directory then it won't work

(There's no easy solution as far as I can see. You can't just use something like resolve_config_path within path_or_url because we support output files and because some workflows use modified versions of resolve_config_path. We could expand path_or_url to take a path_resolver argument which would force the path to exist, i.e. only be useful in the context of params/inputs. Happy to implement this if people want.)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If config['bar'] is a path which exists in the pathogen repo and we're running it in an analysis directory then it won't work

Hmm, I thought we could wrap resolve_config_path around path_or_url, since path_or_url should return the local file location for a URL.

inputs:
  foo = resolve_config_path(path_or_url(config['bar']))

but this currently runs into a weird WorkflowError (reported in snakemake/snakemake#3027)

Error:
  WorkflowError:
    Flags ({'storage_object': <snakemake_storage_plugin_http.StorageObject object at 0xffffa0415f90>}) in file pattern '.snakemake/storage/http/github.com/nextstrain/augur/blob/fe3f57975c34ff180cb437bebfefd68fa4a4afed/augur/data/geolocation_rules.tsv' given to expand() are invalid. Flags (e.g. temp(), directory()) have to be applied outside of expand (e.g. 'temp(expand("plots/{sample}.pdf", sample=SAMPLES))').
Wildcards:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beyond the workflow error you get, conceptually the .snakemake/storage file may not exist when the inputs are evaluated so our path resolution approach wouldn't work. This is somewhat analogous to using path_or_url for outputs. Path evaluation within path_or_url will be straightforward I think (via something like path_or_url(uri, path_resolver))

Comment thread phylogenetic/build-configs/ci/config.yaml Outdated
Copy link
Copy Markdown
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some small nit-picky comments, but I think this should be good to merge?

Comment thread phylogenetic/data/README.md
Comment thread phylogenetic/rules/merge_inputs.smk Outdated
Comment thread phylogenetic/defaults/config.yaml
@j23414 j23414 merged commit 59f50ef into main Jul 28, 2025
4 checks passed
@j23414 j23414 deleted the refactor-additional-inputs branch July 28, 2025 17:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants