Context
Currently, --subsample-max-sequences effectively calculates a value for --sequences-per-group which applies to all groups specified by --group-by.
This behavior does not work for all scenarios. Example: There are 5 countries with vastly different population sizes. 60 sequences are requested for subsampling. The command would be:
augur filter \
--group-by country \
--subsample-max-sequences 60
This means 12 sequences will be sampled from each country, regardless of population size. One may want the sample to be representative of country population sizes and not uniform. This limitation is what prompts higher-level subsampling workarounds such as nextstrain/ncov#1074.
Tasks
Rollout
Original proposed solution
Implement an option --subsample-weights, which reads a file that specifies weights per --group-by column. A simple example:
augur filter \
--group-by country \
--subsample-max-sequences 60 \
--subsample-weights weights.yaml
weights.yaml:
# Weight countries by population size.
country:
  A: 1000
  B: 1000
  C: 300
  D: 100
  E: 600
With this information, a different number of sequences can be calculated per group, in proportion to each group's share of the total weight (3000 here).
A would have 60*1000/3000 = 20 sequences.
C would have 60*300/3000 = 6 sequences.
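The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the proportional calculation, not augur's implementation; the function name `allocate_by_weight` is hypothetical, and how fractional results would be rounded is left open here.

```python
def allocate_by_weight(weights, max_sequences):
    """Split max_sequences across groups in proportion to their weights.

    Hypothetical helper for illustration; results may be fractional.
    """
    total_weight = sum(weights.values())
    return {
        group: max_sequences * weight / total_weight
        for group, weight in weights.items()
    }


# Weights taken from the weights.yaml example above.
country_weights = {"A": 1000, "B": 1000, "C": 300, "D": 100, "E": 600}
targets = allocate_by_weight(country_weights, 60)
```

With these weights, `targets["A"]` and `targets["C"]` match the two worked values above, and the per-country targets sum to the requested 60.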
The absence of a column from the weights file can imply equal weighting. In other words, the example can be updated to use --group-by country month while keeping weights.yaml as-is to have weighted country sampling for each time bin.
Or, a more complex example where time is also weighted:
augur filter \
--group-by country month \
--subsample-max-sequences 60 \
--subsample-weights weights.yaml
weights.yaml:
# Weight countries by population size.
country:
  A: 1000
  B: 1000
  C: 300
  D: 100
  E: 600
# Get twice the amount of sequences from 2021 compared to 2020.
month:
  2020-01: 1
  2020-02: 1
  2020-03: 1
  # … all months in 2020 are weighted with 1
  2021-01: 2
  2021-02: 2
  2021-03: 2
  # … all months in 2021 are weighted with 2
Notes:
- The file format is up for debate. At the least, it can be JSON or YAML, but not anything tabular (not enough dimensions to cover multiple --group-by columns).
- This would require every possible value of a column to have a weight, so that the denominator can be calculated by summing all of them.
- Weights should be relative within each column.
- (I think) as long as the weights are non-negative, the values can be multiplied across columns to get effective weighting for all combinations.
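The last note, multiplying weights across columns, could look roughly like the sketch below. It assumes every value of every weighted column is listed, as the notes require; the function name `combined_targets` and the two-month weight table are made up for illustration and are not part of augur.

```python
from itertools import product
from math import prod


def combined_targets(column_weights, max_sequences):
    """Multiply per-column weights for every group combination,
    then normalize so the targets sum to max_sequences.

    Hypothetical helper; assumes all weights are non-negative.
    """
    columns = list(column_weights)
    combined = {}
    for values in product(*(column_weights[c] for c in columns)):
        # Effective weight of a combination is the product of its column weights.
        combined[values] = prod(
            column_weights[c][v] for c, v in zip(columns, values)
        )
    denominator = sum(combined.values())
    return {
        group: max_sequences * weight / denominator
        for group, weight in combined.items()
    }


# Illustrative weights: countries as above, plus two example months.
weights = {
    "country": {"A": 1000, "B": 1000, "C": 300, "D": 100, "E": 600},
    "month": {"2020-12": 1, "2021-01": 2},
}
targets = combined_targets(weights, 60)
```

In this sketch each country keeps its overall share (the two month targets for country A still add up to 20), while within a country the 2021 month receives twice the target of the 2020 month.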