Commit 6b32071

Drop support for old sequence/metadata inputs
Removes the deprecated `sequences` and `metadata` inputs from the configuration file, along with the Snakemake logic required to support them. Also removes references to this deprecated input format from the example profiles and the "multiple inputs" tutorial.

Since we no longer support this old input format, we no longer need empty origin wildcards. We drop support for empty origin wildcards, remove all trimming of origin wildcards that start with an underscore, and update all rules to reference the origin wildcard with an explicit underscore in the filename.

We also now print helpful errors when inputs aren't defined properly, by checking for configurations with old-style input definitions or with no inputs defined at all. These error messages recommend how to update the workflow configuration to fix the issue.
1 parent 3647377 commit 6b32071

11 files changed

Lines changed: 112 additions & 114 deletions
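The configuration migration this commit asks users to perform can be illustrated with a small, hypothetical helper (not part of this commit; the name `migrate_config` and the `my-data` input name are illustrative assumptions):

```python
def migrate_config(config):
    """Convert a deprecated top-level `sequences`/`metadata` config
    into the new `inputs` format introduced by this commit (sketch)."""
    if "sequences" not in config and "metadata" not in config:
        return config  # nothing to migrate
    # Keep everything else, replace the two deprecated keys with one named input.
    migrated = {k: v for k, v in config.items() if k not in ("sequences", "metadata")}
    migrated["inputs"] = [{
        "name": "my-data",
        "metadata": config.get("metadata", ""),
        "sequences": config.get("sequences", ""),
    }]
    return migrated

old = {"sequences": "data/sequences.fasta", "metadata": "data/metadata.tsv", "builds": {}}
new = migrate_config(old)
# The deprecated keys are gone and a single named input remains.
```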

Snakefile

Lines changed: 22 additions & 1 deletion
@@ -8,6 +8,7 @@ from getpass import getuser
 from snakemake.logging import logger
 from snakemake.utils import validate
 from collections import OrderedDict
+import textwrap
 import time
 
 # Store the user's configuration prior to loading defaults, so we can check for
@@ -72,6 +73,26 @@ if "builds" not in config:
 
 include: "workflow/snakemake_rules/reference_build_definitions.smk"
 
+# Check for old-style input file references and alert users to the new format.
+if "sequences" in config or "metadata" in config:
+    logger.error("ERROR: Your configuration file includes references to a deprecated specification of input files (e.g., `config['sequences']` or `config['metadata']`).")
+    logger.error("Update your configuration file (e.g., 'builds.yaml') to define your inputs as follows and try running the workflow again:")
+    logger.error(textwrap.indent(
+        f"\ninputs:\n  metadata: {config['metadata']}\n  sequences: {config['sequences']}\n",
+        "  "
+    ))
+    sys.exit(1)
+
+# Check for missing inputs.
+if "inputs" not in config:
+    logger.error("ERROR: Your workflow does not define any input files to start with.")
+    logger.error("Update your configuration file (e.g., 'builds.yaml') to define at least one input dataset as follows and try running the workflow again:")
+    logger.error(textwrap.indent(
+        f"\ninputs:\n  metadata: data/example_metadata.tsv\n  sequences: data/example_sequences.fasta.gz\n",
+        "  "
+    ))
+    sys.exit(1)
+
 # Allow users to specify a list of active builds from the command line.
 if config.get("active_builds"):
     BUILD_NAMES = config["active_builds"].split(",")
@@ -91,7 +112,7 @@ wildcard_constraints:
     # but not special strings used for Nextstrain builds.
     build_name = r'(?:[_a-zA-Z-](?!(tip-frequencies)))+',
     date = r"[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]",
-    origin = r"(_[a-zA-Z0-9-]+)?" # origin starts with an underscore _OR_ it's the empty string
+    origin = r"[a-zA-Z0-9-_]+"
 
 localrules: download_metadata, download_sequences, download, upload, clean
 
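The tightened `origin` wildcard constraint above can be checked with a quick regex sketch (illustrative only; Snakemake applies the constraint during rule matching):

```python
import re

# The new wildcard constraint from the Snakefile above.
ORIGIN = re.compile(r"[a-zA-Z0-9-_]+")

assert ORIGIN.fullmatch("worldwide")         # plain names are valid
assert ORIGIN.fullmatch("gisaid-data_2021")  # hyphens and underscores are fine
assert not ORIGIN.fullmatch("")              # the empty origin is no longer supported
assert not ORIGIN.fullmatch("data/evil")     # path separators are rejected
```

The old pattern `(_[a-zA-Z0-9-]+)?` matched the empty string, which is exactly the deprecated-input case this commit removes.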

defaults/parameters.yaml

Lines changed: 0 additions & 5 deletions
@@ -6,11 +6,6 @@
 # This must be a relative path to the top-level Snakefile directory (e.g., `ncov/`).
 conda_environment: "workflow/envs/nextstrain.yaml"
 
-# These are the two main starting files for the run.
-# If they do not exist, we will attempt to fetch them from a S3 bucket (see below)
-sequences: "data/sequences.fasta"
-metadata: "data/metadata.tsv"
-
 reference_node_name: "USA/WA1/2020"
 
 # Define files used for external configuration. Common examples consist of a

docs/multiple_inputs.md

Lines changed: 9 additions & 11 deletions
@@ -51,15 +51,17 @@ my_profiles/example_multiple_inputs/my_auspice_config.json
 
 ## Setting up the config
 
-Typically, inside the `builds.yaml` one would specify input files such as
+You can define a single input dataset in `builds.yaml` as follows.
 
 ```yaml
-# traditional syntax for specifying starting files
-sequences: "data/sequences.fasta"
-metadata: "data/metadata.tsv"
+inputs:
+  - name: my-data
+    metadata: "data/metadata.tsv"
+    sequences: "data/sequences.fasta"
 ```
 
-For multiple inputs, we shall use the new `inputs` section of the config to specify that we have two different inputs, and we will give them the names "aus" and "worldwide":
+For multiple inputs, you can add another entry to the `inputs` config list.
+Here, we will give them the names "aus" and "worldwide":
 
 ```yaml
 # my_profiles/example_multiple_inputs/builds.yaml
@@ -72,15 +74,11 @@ inputs:
     sequences: "data/example_sequences_worldwide.fasta"
 ```
 
-> Note that if you also specify `sequences` or `metadata` as top level entries in the config, they will be ignored.
-
 ### Snakemake terminology
 
 Inside the Snakemake rules, we use a wildcard `origin` to define different starting points.
-For instance, if we ask for the file `results/aligned_worldwide.fasta` then `wildcards.origin="_worldwide"` and we expect that the config has defined
-a sequences input via `config["sequences"]["worldwide"]=<path to fasta>` (note the leading `_` has been stripped from the `origin` in the config).
-If we use the older syntax (specifying `sequences` or `metadata` as top level entries in the config) then `wildcards.origin=""`.
-
+For instance, if we ask for the file `results/aligned_worldwide.fasta` then `wildcards.origin="worldwide"` and we expect that the config has defined
+a sequences input as shown above.
 
 ## How is metadata combined?
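The filename-to-wildcard mapping the tutorial describes can be sketched in plain Python (a simplification of what Snakemake does when matching the output template `results/aligned_{origin}.fasta`):

```python
import re

# Pattern corresponding to the rule output "results/aligned_{origin}.fasta",
# with origin constrained as in the Snakefile.
pattern = re.compile(r"results/aligned_(?P<origin>[a-zA-Z0-9-_]+)\.fasta")

match = pattern.fullmatch("results/aligned_worldwide.fasta")
origin = match.group("origin")
# origin == "worldwide": used directly as a key into config["inputs"],
# with no leading-underscore trimming required anymore.

# A filename with an empty origin no longer matches at all.
assert pattern.fullmatch("results/aligned_.fasta") is None
```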

my_profiles/example/builds.yaml

Lines changed: 6 additions & 0 deletions
@@ -11,6 +11,12 @@
 
 # In this example, we use these default methods. See other templates for examples of how to customize this subsampling scheme.
 
+# Define input files.
+inputs:
+  - name: example-data
+    metadata: data/example_metadata.tsv
+    sequences: data/example_sequences.fasta.gz
+
 builds:
   # Focus on King County (location) in Washington State (division) in the USA (country)
   # with a build name that will produce the following URL fragment on Nextstrain/auspice:

my_profiles/example/config.yaml

Lines changed: 0 additions & 4 deletions
@@ -10,10 +10,6 @@ configfile:
   - defaults/parameters.yaml # Pull in the default values
   - my_profiles/example/builds.yaml # Pull in our list of desired builds
 
-config:
-  - sequences=data/example_sequences.fasta
-  - metadata=data/example_metadata.tsv
-
 # Set the maximum number of cores you want Snakemake to use for this pipeline.
 cores: 2
 

my_profiles/getting_started/builds.yaml

Lines changed: 6 additions & 0 deletions
@@ -8,6 +8,12 @@
 # These subsample primarily from the area of interest ("focus"), and add in background ("contextual") sequences from the rest of the world.
 # Contextual sequences that are genetically similar to (hamming distance) and geographically near the focal sequences are heavily prioritized.
 
+# Define input files.
+inputs:
+  - name: example-data
+    metadata: data/example_metadata.tsv
+    sequences: data/example_sequences.fasta.gz
+
 # In this example, we use these default methods. See other templates for examples of how to customize this subsampling scheme.
 builds:
   # This build samples evenly from the globe

my_profiles/getting_started/config.yaml

Lines changed: 0 additions & 4 deletions
@@ -10,10 +10,6 @@ configfile:
   - defaults/parameters.yaml # Pull in the default values
   - my_profiles/getting_started/builds.yaml # Pull in our list of desired builds
 
-config:
-  - sequences=data/example_sequences.fasta
-  - metadata=data/example_metadata.tsv
-
 # Set the maximum number of cores you want Snakemake to use for this pipeline.
 cores: 1
 

workflow/snakemake_rules/common.smk

Lines changed: 21 additions & 41 deletions
@@ -25,89 +25,69 @@ def numeric_date(dt=None):
 
     return res
 
-def _trim_origin(origin):
-    """the origin wildcard includes a leading `_`. This function returns the value without this `_`"""
-    if origin=="":
-        return ""
-    return origin[1:]
-
 def _get_subsampling_scheme_by_build_name(build_name):
     return config["builds"][build_name].get("subsampling_scheme", build_name)
 
 def _get_filter_value(wildcards, key):
     default = config["filter"].get(key, "")
     if wildcards["origin"] == "":
         return default
-    return config["filter"].get(_trim_origin(wildcards["origin"]), {}).get(key, default)
+    return config["filter"].get(wildcards["origin"], {}).get(key, default)
 
 def _get_path_for_input(stage, origin_wildcard):
     """
     A function called to define an input for a Snakemake rule
     This function always returns a local filepath, the format of which decides whether rules should
     create this by downloading from a remote resource, or create it by a local compute rule.
     """
-    if not origin_wildcard:
-        # No origin wildcards => deprecated single inputs (e.g. `config["sequences"]`) which cannot
-        # be downloaded from remote resources
-        if config.get("inputs"):
-            raise Exception("ERROR: empty origin wildcard but config defines 'inputs`")
-        path_or_url = config[stage] if stage in ["metadata", "sequences"] else ""
-        remote = False
-    else:
-        trimmed_origin = _trim_origin(origin_wildcard)
-        path_or_url = config.get("inputs", {}).get(trimmed_origin, {}).get(stage, "")
-        scheme = urlsplit(path_or_url).scheme
-        remote = bool(scheme)
+    path_or_url = config.get("inputs", {}).get(origin_wildcard, {}).get(stage, "")
+    scheme = urlsplit(path_or_url).scheme
+    remote = bool(scheme)
 
-        # Following checking should be the remit of the rule which downloads the remote resource
-        if scheme and scheme!="s3":
-            raise Exception(f"Input defined scheme {scheme} which is not yet supported.")
+    # Following checking should be the remit of the rule which downloads the remote resource
+    if scheme and scheme!="s3":
+        raise Exception(f"Input defined scheme {scheme} which is not yet supported.")
 
-        ## Basic checking which could be taken care of by the config schema
-        ## If asking for metadata/sequences, the config _must_ supply a `path_or_url`
-        if path_or_url=="" and stage in ["metadata", "sequences"]:
-            raise Exception(f"ERROR: config->input->{trimmed_origin}->{stage} is not defined.")
+    ## Basic checking which could be taken care of by the config schema
+    ## If asking for metadata/sequences, the config _must_ supply a `path_or_url`
+    if path_or_url=="" and stage in ["metadata", "sequences"]:
+        raise Exception(f"ERROR: config->input->{origin_wildcard}->{stage} is not defined.")
 
     if stage=="metadata":
-        return f"data/downloaded{origin_wildcard}.tsv" if remote else path_or_url
+        return f"data/downloaded_{origin_wildcard}.tsv" if remote else path_or_url
     if stage=="sequences":
-        return f"data/downloaded{origin_wildcard}.fasta" if remote else path_or_url
+        return f"data/downloaded_{origin_wildcard}.fasta" if remote else path_or_url
     if stage=="aligned":
-        return f"results/precomputed-aligned{origin_wildcard}.fasta" if remote else f"results/aligned{origin_wildcard}.fasta"
+        return f"results/precomputed-aligned_{origin_wildcard}.fasta" if remote else f"results/aligned_{origin_wildcard}.fasta"
     if stage=="to-exclude":
-        return f"results/precomputed-to-exclude{origin_wildcard}.txt" if remote else f"results/to-exclude{origin_wildcard}.txt"
+        return f"results/precomputed-to-exclude_{origin_wildcard}.txt" if remote else f"results/to-exclude_{origin_wildcard}.txt"
     if stage=="masked":
-        return f"results/precomputed-masked{origin_wildcard}.fasta" if remote else f"results/masked{origin_wildcard}.fasta"
+        return f"results/precomputed-masked_{origin_wildcard}.fasta" if remote else f"results/masked_{origin_wildcard}.fasta"
     if stage=="filtered":
         if remote:
-            return f"results/precomputed-filtered{origin_wildcard}.fasta"
+            return f"results/precomputed-filtered_{origin_wildcard}.fasta"
         elif path_or_url:
            return path_or_url
        else:
-            return f"results/filtered{origin_wildcard}.fasta"
+            return f"results/filtered_{origin_wildcard}.fasta"
 
    raise Exception(f"_get_path_for_input with unknown stage \"{stage}\"")
 
 
 def _get_unified_metadata(wildcards):
     """
     Returns a single metadata file representing the input metadata file(s).
-    If there was only one supplied metadata file (e.g. the deprecated
-    `config["metadata"]` syntax, or one entry in the `config["inputs"] dict`)
+    If there was only one supplied metadata file in the `config["inputs"] dict`,
     then that file is returned. Else "results/combined_metadata.tsv" is returned
     which will run the `combine_input_metadata` rule to make it.
     """
-    if not config.get("inputs"):
-        return config["metadata"]
     if len(list(config["inputs"].keys()))==1:
-        return _get_path_for_input("metadata", "_"+list(config["inputs"].keys())[0])
+        return _get_path_for_input("metadata", list(config["inputs"].keys())[0])
     return "results/combined_metadata.tsv"
 
 def _get_unified_alignment(wildcards):
-    if not config.get("inputs"):
-        return "results/filtered.fasta"
     if len(list(config["inputs"].keys()))==1:
-        return _get_path_for_input("filtered", "_"+list(config["inputs"].keys())[0])
+        return _get_path_for_input("filtered", list(config["inputs"].keys())[0])
     return "results/combined_sequences_for_subsampling.fasta",
 
 def _get_metadata_by_build_name(build_name):
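A trimmed-down, self-contained sketch of the simplified `_get_path_for_input` logic (metadata stage only; the real function handles several more stages and a module-level `config`):

```python
from urllib.parse import urlsplit

def get_path_for_metadata(config, origin):
    """Return a local filepath for an input's metadata; remote (s3) inputs
    map to a download target named after the origin, with the underscore
    now written explicitly into the filename template."""
    path_or_url = config.get("inputs", {}).get(origin, {}).get("metadata", "")
    if path_or_url == "":
        raise ValueError(f"config->input->{origin}->metadata is not defined.")
    scheme = urlsplit(path_or_url).scheme
    if scheme and scheme != "s3":
        raise ValueError(f"Input defined scheme {scheme} which is not yet supported.")
    # Remote inputs become local download targets; local paths pass through.
    return f"data/downloaded_{origin}.tsv" if scheme else path_or_url

config = {"inputs": {
    "worldwide": {"metadata": "s3://my-bucket/metadata.tsv.gz"},
    "aus": {"metadata": "data/example_metadata.tsv"},
}}
get_path_for_metadata(config, "worldwide")  # -> "data/downloaded_worldwide.tsv"
get_path_for_metadata(config, "aus")        # -> "data/example_metadata.tsv"
```

Note that with the origin wildcard no longer carrying its own leading underscore, the wildcard value can index `config["inputs"]` directly, which is what removes the need for `_trim_origin`.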

workflow/snakemake_rules/download.smk

Lines changed: 24 additions & 24 deletions
@@ -8,8 +8,8 @@
 # rule: align
 # input: _get_path_for_input
 # ...
-# will result in an input file looking like "results/aligned{origin}.fasta" or
-# "results/download-aligned{origin}.fasta" (which one is chosen depends on the
+# will result in an input file looking like "results/aligned_{origin}.fasta" or
+# "results/download-aligned_{origin}.fasta" (which one is chosen depends on the
 # supplied `config`). In the latter case, `rule download_aligned` will be used.
 # See https://github.com/nextstrain/ncov/compare/remote-files for an example of
 # how we could leverage snakemake to do this without needing a separate rule!
@@ -33,11 +33,11 @@ def _infer_decompression(input):
 rule download_sequences:
     message: "Downloading sequences from {params.address} -> {output.sequences}"
     output:
-        sequences = "data/downloaded{origin}.fasta"
+        sequences = "data/downloaded_{origin}.fasta"
     conda: config["conda_environment"]
     params:
-        address = lambda w: config["inputs"][_trim_origin(w.origin)]["sequences"],
-        deflate = lambda w: _infer_decompression(config["inputs"][_trim_origin(w.origin)]["sequences"])
+        address = lambda w: config["inputs"][w.origin]["sequences"],
+        deflate = lambda w: _infer_decompression(config["inputs"][w.origin]["sequences"])
     shell:
         """
         aws s3 cp {params.address} - | {params.deflate} > {output.sequences:q}
@@ -46,10 +46,10 @@ rule download_sequences:
 rule download_metadata:
     message: "Downloading metadata from {params.address} -> {output.metadata}"
     output:
-        metadata = "data/downloaded{origin}.tsv"
+        metadata = "data/downloaded_{origin}.tsv"
     conda: config["conda_environment"]
     params:
-        address = lambda w: config["inputs"][_trim_origin(w.origin)]["metadata"]
+        address = lambda w: config["inputs"][w.origin]["metadata"]
     shell:
         """
         aws s3 cp {params.address} - | gunzip -cq >{output.metadata:q}
@@ -58,11 +58,11 @@ rule download_metadata:
 rule download_aligned:
     message: "Downloading aligned fasta files from {params.address} -> {output.sequences}"
     output:
-        sequences = "results/precomputed-aligned{origin}.fasta"
+        sequences = "results/precomputed-aligned_{origin}.fasta"
     conda: config["conda_environment"]
     params:
-        address = lambda w: config["inputs"][_trim_origin(w.origin)]["aligned"],
-        deflate = lambda w: _infer_decompression(config["inputs"][_trim_origin(w.origin)]["aligned"])
+        address = lambda w: config["inputs"][w.origin]["aligned"],
+        deflate = lambda w: _infer_decompression(config["inputs"][w.origin]["aligned"])
     shell:
         """
         aws s3 cp {params.address} - | {params.deflate} > {output.sequences:q}
@@ -76,17 +76,17 @@ rule download_diagnostic:
         {params.to_exclude_address} -> {output.to_exclude}
         """
     output:
-        diagnostics = "results/precomputed-sequence-diagnostics{origin}.tsv",
-        flagged = "results/precomputed-flagged-sequences{origin}.tsv",
-        to_exclude = "results/precomputed-to-exclude{origin}.txt"
+        diagnostics = "results/precomputed-sequence-diagnostics_{origin}.tsv",
+        flagged = "results/precomputed-flagged-sequences_{origin}.tsv",
+        to_exclude = "results/precomputed-to-exclude_{origin}.txt"
     conda: config["conda_environment"]
     params:
         # Only `to-exclude` is defined via the config, so we make some assumptions about the format of the other filenames
-        to_exclude_address = lambda w: config["inputs"][_trim_origin(w.origin)]["to-exclude"],
-        flagged_address = lambda w: config["inputs"][_trim_origin(w.origin)]["to-exclude"].replace(f'to-exclude{w.origin}.txt', f'flagged-sequences{w.origin}.tsv'),
-        diagnostics_address = lambda w: config["inputs"][_trim_origin(w.origin)]["to-exclude"].replace(f'to-exclude{w.origin}.txt', f'sequence-diagnostics{w.origin}.tsv'),
+        to_exclude_address = lambda w: config["inputs"][w.origin]["to-exclude"],
+        flagged_address = lambda w: config["inputs"][w.origin]["to-exclude"].replace(f'to-exclude{w.origin}.txt', f'flagged-sequences{w.origin}.tsv'),
+        diagnostics_address = lambda w: config["inputs"][w.origin]["to-exclude"].replace(f'to-exclude{w.origin}.txt', f'sequence-diagnostics{w.origin}.tsv'),
         # assume the compression is the same across all 3 addresses
-        deflate = lambda w: _infer_decompression(config["inputs"][_trim_origin(w.origin)]["to-exclude"])
+        deflate = lambda w: _infer_decompression(config["inputs"][w.origin]["to-exclude"])
     shell:
         """
         aws s3 cp {params.to_exclude_address} - | {params.deflate} > {output.to_exclude:q}
@@ -98,11 +98,11 @@ rule download_diagnostic:
 rule download_masked:
     message: "Downloading aligned & masked FASTA from {params.address} -> {output.sequences}"
     output:
-        sequences = "results/precomputed-masked{origin}.fasta"
+        sequences = "results/precomputed-masked_{origin}.fasta"
     conda: config["conda_environment"]
     params:
-        address = lambda w: config["inputs"][_trim_origin(w.origin)]["masked"],
-        deflate = lambda w: _infer_decompression(config["inputs"][_trim_origin(w.origin)]["masked"])
+        address = lambda w: config["inputs"][w.origin]["masked"],
+        deflate = lambda w: _infer_decompression(config["inputs"][w.origin]["masked"])
     shell:
         """
         aws s3 cp {params.address} - | {params.deflate} > {output.sequences:q}
@@ -112,12 +112,12 @@ rule download_masked:
 rule download_filtered:
     message: "Downloading pre-computed filtered alignment from {params.address} -> {output.sequences}"
     output:
-        sequences = "results/precomputed-filtered{origin}.fasta"
+        sequences = "results/precomputed-filtered_{origin}.fasta"
     conda: config["conda_environment"]
     params:
-        address = lambda w: config["inputs"][_trim_origin(w.origin)]["filtered"],
-        deflate = lambda w: _infer_decompression(config["inputs"][_trim_origin(w.origin)]["filtered"])
+        address = lambda w: config["inputs"][w.origin]["filtered"],
+        deflate = lambda w: _infer_decompression(config["inputs"][w.origin]["filtered"])
     shell:
         """
         aws s3 cp {params.address} - | {params.deflate} > {output.sequences:q}
-        """
+        """
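These rules all call `_infer_decompression`, whose body the diff does not show. A plausible sketch, purely an assumption about how such a helper might pick a shell filter from the remote file's extension (the actual implementation may differ):

```python
def infer_decompression(address):
    """Guess the shell filter needed to decompress a remote input,
    based on its file extension (sketch; not the repo's actual helper)."""
    if address.endswith(".gz"):
        return "gunzip -cq"
    if address.endswith(".xz"):
        return "xz -dcq"
    return "cat"  # already uncompressed: pass bytes through unchanged

infer_decompression("s3://bucket/sequences.fasta.gz")  # -> "gunzip -cq"
infer_decompression("s3://bucket/metadata.tsv")        # -> "cat"
```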

workflow/snakemake_rules/export_for_nextstrain.smk

Lines changed: 4 additions & 4 deletions
@@ -94,14 +94,14 @@ rule mutation_summary:
         reference = config["files"]["alignment_reference"],
         genemap = config["files"]["annotation"]
     output:
-        mutation_summary = "results/mutation_summary{origin}.tsv"
+        mutation_summary = "results/mutation_summary_{origin}.tsv"
     log:
-        "logs/mutation_summary{origin}.txt"
+        "logs/mutation_summary_{origin}.txt"
     benchmark:
-        "benchmarks/mutation_summary{origin}.txt"
+        "benchmarks/mutation_summary_{origin}.txt"
     params:
         outdir = "results/translations",
-        basename = "seqs{origin}"
+        basename = "seqs_{origin}"
     conda: config["conda_environment"]
     shell:
         """
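Every renamed template above follows one pattern: a literal underscore before the `{origin}` wildcard. A plain-Python sketch of how such a template expands per origin (Snakemake performs the equivalent substitution when resolving outputs):

```python
# Output template from the mutation_summary rule, expanded per origin name.
template = "results/mutation_summary_{origin}.tsv"
origins = ["aus", "worldwide"]
outputs = [template.format(origin=o) for o in origins]
# -> ["results/mutation_summary_aus.tsv", "results/mutation_summary_worldwide.tsv"]
```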
