Commit 40ae575

Update syntax for multiple inputs and allow downloading

Please see `docs/multiple_inputs.md` for a full explanation of the changes introduced here. Briefly, we modify the config syntax to define multiple inputs via the `config["inputs"]` dict. Entries here may be downloaded from S3 as needed. Furthermore, defining a "filtered" entry for a given input source will result in the filtered file being downloaded, thus avoiding having to perform alignment (etc). This commit introduces a lot of changes, some of which are potentially breaking:

* Inputs via the old syntax can no longer be downloaded from S3 buckets (the nextstrain config has been updated accordingly). You will need to declare the address for such files in the `config["inputs"]` dict (see tutorial).
* `config["S3_BUCKET"]` is no longer used. Addresses should be specified in the `inputs` dict (see tutorial).
* Nextstrain core builds: the uploaded preprocessed files (`rule upload`) have new filenames to reflect which input they originate from. The following config values are now used for this step: `S3_DST_BUCKET`, `S3_DST_COMPRESSION` and `S3_DST_ORIGINS`.

1 parent 9f39dc7, commit 40ae575

15 files changed: 501 additions & 350 deletions

.github/workflows/trial-build.yml

Lines changed: 0 additions & 1 deletion
```diff
@@ -40,7 +40,6 @@ jobs:
     deploy_to_staging \
     upload \
     --config \
-    S3_SRC_BUCKET=nextstrain-ncov-private \
     S3_DST_BUCKET=nextstrain-ncov-private/trial/"$TRIAL_NAME" \
     s3_staging_url=s3://nextstrain-staging/trial_"$TRIAL_NAME"_ \
     --profile nextstrain_profiles/nextstrain \
```

Snakefile

Lines changed: 4 additions & 10 deletions
```diff
@@ -31,19 +31,9 @@ import time
 user_subsampling = copy.deepcopy(config.get("subsampling", {}))

 configfile: "defaults/parameters.yaml"
-if "sequences" not in config: config["sequences"]=config["default_sequences"]
-if "metadata" not in config: config["metadata"]=config["default_metadata"]

 # Check config file for errors
 validate(config, schema="workflow/schemas/config.schema.yaml")
-try:
-    assert type(config["sequences"])==type(config["metadata"])
-    if isinstance(config["sequences"], dict):
-        # todo - this assertion could be relaxed as we combine multiple metadata files straight away...
-        assert sorted(list(config["sequences"].keys())) == sorted(list(config["metadata"].keys()))
-except AssertionError:
-    print("The specified sequences & metadata file(s) didn't match! They must both be strings or dictionaries (each with the same keys)")
-    sys.exit(2)

 # Check for overlapping subsampling schemes in user and default
 # configurations. For now, issue a deprecation warning, so users know they
@@ -124,6 +114,10 @@ include: "workflow/snakemake_rules/common.smk"
 # to output of auspice JSONs for a default build.
 include: "workflow/snakemake_rules/main_workflow.smk"

+# Include rules to allow downloading of input-specific files from s3 buckets.
+# These have to be opted-into via config params.
+include: "workflow/snakemake_rules/download.smk"
+
 # Include a custom Snakefile that specifies `localrules` required by the user's
 # workflow environment.
 if "localrules" in config:
```
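The deleted `try`/`assert` block above was the old sanity check that `sequences` and `metadata` had matching shapes. For readers wondering what an equivalent check might look like under the new `config["inputs"]` syntax, here is a minimal sketch. The function name is hypothetical; the commit itself delegates real validation to `workflow/schemas/config.schema.yaml`, and the set of stage keywords is taken from the tutorial below.

```python
def validate_inputs(config: dict) -> None:
    """Hypothetical sanity check for the new `inputs` config syntax:
    each named input needs a metadata entry plus at least one
    sequence-bearing stage. Sketch only; the real workflow validates
    against a JSON schema instead."""
    inputs = config.get("inputs", {})
    if not isinstance(inputs, dict):
        raise SystemExit("config['inputs'] must be a dict of named inputs")
    stages = {"sequences", "prefiltered", "aligned", "masked", "filtered"}
    for name, entry in inputs.items():
        if "metadata" not in entry:
            raise SystemExit(f"input '{name}' is missing a 'metadata' entry")
        if not stages & entry.keys():
            raise SystemExit(f"input '{name}' needs one of: {sorted(stages)}")

# Passes silently for a well-formed config
validate_inputs({"inputs": {
    "aus": {"metadata": "data/example_metadata_aus.tsv",
            "sequences": "data/example_sequences_aus.fasta"},
}})
```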

defaults/parameters.yaml

Lines changed: 2 additions & 21 deletions
```diff
@@ -9,30 +9,11 @@ conda_environment: "workflow/envs/nextstrain.yaml"

 # These are the two main starting files for the run.
 # If they do not exist, we will attempt to fetch them from a S3 bucket (see below)
-# NOTE: if you are defining these in your own configs, you should remove the `default_` prefix.
-default_sequences: "data/sequences.fasta"
-default_metadata: "data/metadata.tsv"
+sequences: "data/sequences.fasta"
+metadata: "data/metadata.tsv"

 reference_node_name: "USA/WA1/2020"

-# Certain steps may require a S3 bucket (e.g. "download pre-aligned sequences").
-# This is optional - as long as you provide the starting metadata + sequences,
-# all steps can run locally without using any S3 buckets.
-S3_BUCKET: "your_S3_bucket_here"
-
-# Set these instead to download from a source (SRC) bucket and upload to a
-# different destination (DST) bucket. If either is empty/undefined, then the
-# value of S3_BUCKET is used instead.
-S3_SRC_BUCKET: ""
-S3_DST_BUCKET: ""
-
-# Preprocess steps involve the creation and storage of intermediate files
-# (alignments etc). These steps are opt-in, and are not run by default.
-# These settings define the compression/decompression settings for these files
-preprocess:
-  compression: "xz"
-  deflate: "xz -dcq"
-
 # Define files used for external configuration. Common examples consist of a
 # list of strains to include and exclude from analyses, a reference sequence to
 # align sequences to, colors for strain attributes in auspice, and auspice
```

docs/index.md

Lines changed: 4 additions & 0 deletions
```diff
@@ -39,6 +39,10 @@ The starting point for this section is a JSON file. You can alternately use our
 10. [Writing a narrative to highlight key findings](narratives.md)
 11. _Case studies: interpreting your data (coming soon!)_

+### Multiple inputs
+
+12. [Running the pipeline starting with multiple inputs](multiple_inputs.md)
+
 ## Help

 If something in this tutorial is broken or unclear, please [open an issue](https://github.com/nextstrain/ncov/issues/new/choose) so we can improve it for everyone.
```

docs/multiple_inputs.md

Lines changed: 133 additions & 46 deletions
````diff
@@ -24,19 +24,20 @@ Our aim is to produce an analysis of the 91 Australian genomes with select world



-## Files
+## Overview of the files used in this tutorial

 The **sequences and metadata** for this tutorial are in `data/example_multiple_inputs.tar.xz` and must be decompressed via `tar xf data/example_multiple_inputs.tar.xz --directory data/`.

 You should now see the following starting files:
 ```sh
-data/example_metadata_aus.tsv # Aus data (n=91)
+data/example_metadata_aus.tsv # Aus data (n=91) from Seemann et al.
 data/example_sequences_aus.fasta
 data/example_metadata_worldwide.tsv # Worldwide, contextual data (n=327)
 data/example_sequences_worldwide.fasta
 ```

-The files are small enough to be examined in a text editor -- the format of the worldwide metadata is similar to the `nextmeta.tsv` file which you may download from GISAID, whereas the format of the Australian metadata is more limited, only containing sampling date and geographic details. Note: see `data/example_metadata.tsv` for the full metadata of these Australian samples, we've intentionally restricted this here to mimic a real-world scenario.
+The files are small enough to be examined in a text editor -- the format of the worldwide metadata is similar to the `nextmeta.tsv` file which you may download from GISAID, whereas the format of the Australian metadata is more limited, only containing sampling date and geographic details, which may be more realistic for a newly generated sequencing run.
+Note: see `data/example_metadata.tsv` for the full metadata of these Australian samples; we've intentionally restricted it here to mimic a real-world scenario.


 The **build-specific configs** etc are in `my_profiles/example_multiple_inputs`
````
````diff
@@ -47,44 +48,48 @@ my_profiles/example_multiple_inputs/builds.yaml # this is where the input files
 my_profiles/example_multiple_inputs/my_auspice_config.json
 ```

-## Snakemake terminology
-
-Inside the Snakemake rules, we use a wildcard `origin` to define different starting points.
-For instance, if we ask for the file `results/aligned_seqRun42.fasta` then `wildcards.origin="_seqRun42"` and we expect that the config has defined
-a sequences input via `config["sequences"]["seqRun42"]=<path to fasta>` (note the leading `_` has been stripped).
-If there's only one starting point, then this wildcard is empty.
-For instance, asking for `results/aligned.fasta` results in `wildcards.origin=""` and we expect `config["sequences"]=<path to fasta>`.

+## Setting up the config

-# Setting up the config
-
-Typically, inside the `builds.yaml`, you typically specify input files such as
+Typically, inside the `builds.yaml` one would specify input files such as

 ```yaml
+# traditional syntax for specifying starting files
 sequences: "data/sequences.fasta"
 metadata: "data/metadata.tsv"
 ```

-For multiple inputs, we shall specify a dictionary for each of these, such as:
+For multiple inputs, we shall use the new `inputs` section of the config to specify that we have two different inputs, and we will give them the names "aus" and "worldwide":

 ```yaml
 # my_profiles/example_multiple_inputs/builds.yaml
-sequences:
-  aus: "data/example_sequences_aus.fasta"
-  worldwide: "data/example_sequences_worldwide.fasta"
-metadata:
-  aus: "data/example_metadata_aus.tsv"
-  worldwide: "data/example_metadata_worldwide.tsv"
+inputs:
+  aus:
+    metadata: "data/example_metadata_aus.tsv"
+    sequences: "data/example_sequences_aus.fasta"
+  worldwide:
+    metadata: "data/example_metadata_worldwide.tsv"
+    sequences: "data/example_sequences_worldwide.fasta"
 ```

-# Creating combined metadata
+> Note that if you also specify `sequences` or `metadata` as top level entries in the config, they will be ignored.
+
+### Snakemake terminology
+
+Inside the Snakemake rules, we use a wildcard `origin` to define different starting points.
+For instance, if we ask for the file `results/aligned_worldwide.fasta` then `wildcards.origin="_worldwide"` and we expect that the config has defined
+a sequences input via `config["sequences"]["worldwide"]=<path to fasta>` (note the leading `_` has been stripped from the `origin` in the config).
+If we use the older syntax (specifying `sequences` or `metadata` as top level entries in the config) then `wildcards.origin=""`.

-The different provided metadata files (`aus` and `worldwide`, defined above) are combined during the pipeline.
-The combined metadata file includes all columns present: the `worldwide` metadata contains many more columns than the `aus` metadata does, so the latter samples will have a number of empty values.

-In the case of conflicts, the order of the entries in the YAML matters, with the last value being used.
+## How is metadata combined?

-Finally, extra columns will be added for each input (e.g. `aus` and `worldwide`), with values `"yes"` or `"no"`, representing which samples are contained in each set of sequences.
+The different provided metadata files (for `aus` and `worldwide`, defined above) are combined during the pipeline, and the combined metadata file includes all columns present across the different metadata files.
+Looking at the individual TSVs, the `worldwide` metadata contains many more columns than the `aus` metadata does, so we can expect the `aus` samples to have many empty values in the combined metadata.
+In the case of **conflicts**, the order of the entries in the YAML matters, with the last value being used.
+
+Finally, we use one-hot encoding to express the origin of each row of metadata.
+This means that **extra columns** will be added for each input (e.g. `aus` and `worldwide`), with values of `"yes"` or `""`, representing which samples are contained in each set of sequences.
 We are going to use this to our advantage, by adding a coloring to highlight the source of sequences in auspice via `my_profiles/example_multiple_inputs/my_auspice_config.json`:

 ```json
````
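The merge semantics described in this section (union of columns, last-listed input winning on conflicts, plus a `"yes"`/`""` origin column per input) can be sketched in a few lines of plain Python. This is an illustration of the documented behaviour on a toy table, not the pipeline's actual implementation:

```python
def combine_metadata(inputs: dict) -> dict:
    """Merge per-input metadata (strain -> {column: value}) into one table.

    Later inputs overwrite earlier ones on conflicting values, and each
    input gets a one-hot column ("yes" / "") recording which inputs a
    strain appeared in. Sketch of the documented semantics only.
    """
    combined = {}
    for origin, rows in inputs.items():  # dict order mirrors YAML order
        for strain, record in rows.items():
            merged = combined.setdefault(strain, {})
            merged.update(record)        # last value wins on conflicts
            merged[origin] = "yes"
    # Fill in the one-hot columns for strains absent from an input
    for merged in combined.values():
        for origin in inputs:
            merged.setdefault(origin, "")
    return combined

combined = combine_metadata({
    "aus": {"A1": {"date": "2020-03-01", "division": "Victoria"}},
    "worldwide": {"W1": {"date": "2020-02-10", "country": "UK"},
                  "A1": {"date": "2020-03-02"}},  # conflict: worldwide wins
})
# combined["A1"]["date"] == "2020-03-02"; combined["W1"]["aus"] == ""
```

Note how `A1` keeps its `division` from the `aus` input while its conflicting `date` is taken from `worldwide`, the last input listed.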
````diff
@@ -100,58 +105,65 @@ We are going to use this to our advantage, by adding a coloring to highlight the
 }
 ```

-# (Pre-) Filtering input-specific parameters

-The parameters used for filtering steps are typically defined by the "filter" dict in the `builds.yaml`, with sensible defaults provided (by `defaults/parameters.yaml`).
-For multiple inputs, we can overwrite these for each input.
+# Input-specific filtering parameters

-In this tutorial, we wish to make sure we include all the Australian samples, even if they may be partial genomes etc
+The first stage of the pipeline performs filtering, masking and alignment (note that this is different to subsampling).
+If we have multiple inputs, this stage of the pipeline is done independently for each input.
+The parameters used for filtering steps are typically defined by the "filter" dict in the `builds.yaml`, with sensible defaults provided (see `defaults/parameters.yaml`).
+For multiple inputs, we can overwrite these for each input.

-per-input level, such as the example tutorial does for `input1` (the North American genomes).
+As an example, in this tutorial let's ensure we include all the `aus` samples, even if they may be partial genomes etc

 ```yaml
 # my_profiles/example_multiple_inputs/builds.yaml
 filter:
   aus:
-    min_length: 5000 # Allow shorter genomes. Parameter used in the prefilter & filter rules
-    exclude_where: country=Canada # Would remove all Canadian sequences (there aren't any!)
-    min_date: "2020-02-01" # used by the filter rule. Will remove all sequences from Jan 2020
-    exclude_ambiguous_dates_by: year # used by the filter rule.
+    min_length: 5000 # Allow shorter (partial) genomes
     skip_diagnostics: True # skip diagnostics (which can remove genomes) for this input
 ```

 # Subsampling parameters

-For subsampling, we utilise the fact that the combined metadata has additional columns to represent the input source: `aus` and `worldwide`.
-This allows us to have per-input subsampling steps by restricting the step to sequences from an individual input (or inputs).
+The second stage of the pipeline subsamples the (often large) dataset.
+By this stage, the multiple inputs will have been combined into a unified alignment and metadata file (see above), however we may utilise the fact that the combined metadata has additional columns to represent which samples came from which input source (the columns `aus` and `worldwide`).
+This allows us to have per-input subsampling steps.


-In this example, we want to include _all_ of the samples from `aus` (i.e. all Australian genomes) and then create a contextual subsampling of the genomes from `worldwide` based on genetic distance from the `aus` sample.
+In this example, we want to produce a dataset which contains:
+1. _All_ of the samples from the `aus` input (i.e. all of the Australian genomes)
+2. A worldwide sampling which prioritises sequences close to (1)
+3. A random, background worldwide sampling

 ```yaml
+# my_profiles/example_multiple_inputs/builds.yaml
 builds:
   multiple-inputs:
     subsampling_scheme: custom-scheme # use a custom subsampling scheme defined below

+# STAGE 2: Subsampling parameters
 subsampling:
   custom-scheme:
     # Use metadata key to include ALL from `input1`
     allFromAus:
-      exclude: "--exclude-where 'aus=no'" # subset to sequences from input `aus`
-      group_by: year # needed for pipeline to work!
-      seq_per_group: 1000000 # needed for pipeline to work!
-    # Proximity subsampling from `input2` to provide context
+      exclude: "--exclude-where 'aus!=yes'" # subset to sequences from input `aus`
+      group_by: year # needed for pipeline to work!
+      seq_per_group: 1000000 # needed for pipeline to work!
+    # Proximity subsampling from `worldwide` input to provide context
     worldwideContext:
       exclude: "--exclude-where 'aus=yes'" # i.e. subset to sequences _not_ from input `aus`
-      group_by: year month
-      seq_per_group: 10
+      max_sequences: 100
+      group_by: year # needed for pipeline to work!
       priorities:
         type: "proximity"
         focus: "allFromAus"
-
+    worldwideBackground:
+      exclude: "--exclude-where 'aus=yes'"
+      group_by: year month
+      seq_per_group: 5
 ```

-# Run the build
+## Run the build

 The following commands will run this tutorial

````
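The `--exclude-where 'aus!=yes'` and `'aus=yes'` expressions in the scheme above work against the one-hot origin columns in the combined metadata. As a toy illustration of that `column=value` matching (this mimics the idea behind `augur filter`'s `--exclude-where`, not augur's actual code):

```python
def exclude_where(rows, expression):
    """Drop rows matching a single augur-style exclude expression,
    either "column=value" or "column!=value". Toy re-implementation
    for illustration only; augur's real matching logic differs."""
    if "!=" in expression:
        column, value = expression.split("!=")
        matches = lambda row: row.get(column, "") != value
    else:
        column, value = expression.split("=")
        matches = lambda row: row.get(column, "") == value
    return [row for row in rows if not matches(row)]

rows = [{"strain": "A1", "aus": "yes"},   # from the `aus` input
        {"strain": "W1", "aus": ""}]      # from the `worldwide` input
only_aus = exclude_where(rows, "aus!=yes")       # keeps A1
only_worldwide = exclude_where(rows, "aus=yes")  # keeps W1
```

This is why `aus!=yes` selects exactly the Australian genomes: every non-`aus` row carries an empty string in the `aus` column and is excluded.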
````diff
@@ -163,4 +175,79 @@ snakemake --profile my_profiles/example_multiple_inputs -f auspice/ncov_multiple
 The resulting JSON can be dropped onto [auspice.us](https://auspice.us) for visualization.

 > P.S. If you want to see the DAG to help understand the pipeline steps, you can run
-`snakemake --profile my_profiles/example_multiple_inputs -f auspice/ncov_multiple-inputs.json --dag | dot -Tpdf > dag.pdf`
+`snakemake --profile my_profiles/example_multiple_inputs -f auspice/ncov_multiple-inputs.json --dag | dot -Tpdf > dag.pdf`
+
+
+## Extra examples
+
+
+### What if I need to preprocess input files beforehand?
+
+A common use case may be that some of your input sequences and/or metadata require use-case-specific preprocessing before the pipeline even starts.
+To provide an example of this, let's imagine the situation where we haven't uncompressed the starting files, and our "custom preprocessing" step will be to decompress them.
+In other words, our preprocessing step will replace the need to run `tar xf data/example_multiple_inputs.tar.xz --directory data/`.
+
+We can achieve this by creating a snakemake rule which produces all (or some) of the config-specified input files:
+
+```python
+# my_profiles/example_multiple_inputs/rules.smk
+rule make_starting_files:
+    message:
+        """
+        Creating starting files for the multiple inputs tutorial by decompressing {input.archive}
+        """
+    input:
+        archive = "data/example_multiple_inputs.tar.xz"
+    output:
+        # Note: the command doesn't use these, but adding them here makes snakemake
+        # aware that this rule produces them
+        aus_meta = "data/example_metadata_aus.tsv",
+        aus_seqs = "data/example_sequences_aus.fasta",
+        world_meta = "data/example_metadata_worldwide.tsv",
+        world_seqs = "data/example_sequences_worldwide.fasta"
+    shell:
+        """
+        tar xf {input.archive} --directory data/
+        """
+```
+
+And then making our build aware of these custom rules:
+```yaml
+# my_profiles/example_multiple_inputs/builds.yaml
+custom_rules:
+  - my_profiles/example_multiple_inputs/rules.smk
+```
+
+
+### What if my starting files are stored remotely?
+
+Currently we can handle files stored in S3 buckets rather than locally, by simply declaring this as the input location:
+
+```yaml
+# your pipeline's builds.yaml config
+inputs:
+  worldwide:
+    metadata: "s3://your_bucket_name/metadata.tsv"
+    sequences: "s3://your_bucket_name/sequences.fasta.xz"
+```
+
+> If your S3 bucket is private, make sure you have the following env variables set: `$AWS_SECRET_ACCESS_KEY` and `$AWS_ACCESS_KEY_ID`.
+
+> You may use `.xz` or `.gz` compression - we automatically infer this from the filename suffix.
+
+### Can I start from intermediate files stored remotely?
+
+Yes, however this functionality is new and the syntax may change -- please beware!
+
+If you define the `filtered` keyword as an input, then the pipeline will download this file and avoid aligning and filtering this input, which can save a lot of compute time:
+```yaml
+# your builds.yaml config file
+inputs:
+  worldwide:
+    metadata: <local path or s3 address>
+    filtered: "s3://your_bucket_name/path_to_filtered_sequences.fasta.xz"
+```
+The same functionality is available for the `masked`, `aligned` and `prefiltered` stages, however beware that these may change in the future.
+
+> Note that if intermediate files are present locally, then snakemake will automatically avoid recreating them.
+For instance, if you have an input `worldwide` defined in your config (as above) and the file `results/aligned_worldwide.fasta` exists, then Snakemake will know not to recreate this!
````
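The suffix-based compression inference mentioned above (`.xz` vs `.gz`) could be implemented along these lines. This is a hedged sketch, not the actual logic in `workflow/snakemake_rules/download.smk`; the `xz -dcq` style mirrors the `deflate` setting this repo used elsewhere, and the function name is hypothetical:

```python
def decompress_command(path: str) -> str:
    """Pick a shell command to stream a (possibly compressed) file to
    stdout, inferred from the filename suffix. Illustrative sketch only;
    the workflow's real download rules may differ."""
    if path.endswith(".xz"):
        return f"xz -dcq {path}"
    if path.endswith(".gz"):
        return f"gzip -dcq {path}"
    return f"cat {path}"  # assume uncompressed

print(decompress_command("s3://your_bucket_name/sequences.fasta.xz"))
# xz -dcq s3://your_bucket_name/sequences.fasta.xz
```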
