Context
@trvrb from nextstrain/ncov#957:
With this narrow of timespans there is some unavoidable funny interaction with how augur filter subsamples based on --vpm, ie viruses per month. We have common situations where if current date is say May 15 we end up with
- min date of March 15
- desire by augur filter to equally sample viruses from March, April and May categories
so that March and May have 2 weeks for sampling of X viruses and April has 4 weeks for sampling of X viruses. This results in more densely sampled, in terms of viruses per day, months of March and May compared to April.
This effect will be more pronounced in scenarios where current date is, say, May 28, and so X viruses are sampled in 3 days in March and 30 days in April.
To fully address this we'd need to extend augur filter to have the option of per-week sampling categories in addition to per-month sampling categories. Or perhaps some continuous specification. However, I don't think this is too big of an issue in terms of the current PR and it's something we can refine once Augur is updated.
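To make the uneven coverage concrete, here is a minimal Python sketch (not augur's code) that counts how many days of the March 15 to May 15 window fall into each per-month group versus each per-week group. The helper names are hypothetical, and defining weeks as ISO 8601 weeks is an assumption made only for illustration.

```python
from collections import Counter
from datetime import date, timedelta


def days_per_group(min_date, max_date, key):
    """Count the days of the inclusive window [min_date, max_date] in each group."""
    counts = Counter()
    day = min_date
    while day <= max_date:
        counts[key(day)] += 1
        day += timedelta(days=1)
    return dict(counts)


def month_key(day):
    return (day.year, day.month)


def iso_week_key(day):
    # Assumption: weeks are ISO 8601 weeks (Monday to Sunday).
    iso_year, iso_week, _ = day.isocalendar()
    return (iso_year, iso_week)


window = (date(2022, 3, 15), date(2022, 5, 15))
by_month = days_per_group(*window, month_key)
by_week = days_per_group(*window, iso_week_key)

print(by_month)                  # {(2022, 3): 17, (2022, 4): 30, (2022, 5): 15}
print(sorted(by_week.values()))  # [6, 7, 7, 7, 7, 7, 7, 7, 7]
```

With per-month groups, the same X viruses are drawn from 17, 30, and 15 days. With per-week groups, only the edge weeks can be partial, so the viruses-per-day density is much closer to uniform.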
Example
cat > metadata.tsv << ~~
strain date
SEQ1 2022-03-21
SEQ2 2022-03-22
SEQ3 2022-03-23
SEQ4 2022-04-01
SEQ5 2022-04-02
SEQ6 2022-04-03
SEQ7 2022-05-01
SEQ8 2022-05-02
SEQ9 2022-05-03
SEQ10 2022-05-04
~~
augur filter \
--metadata metadata.tsv \
--min-date 2022-03-15 \
--max-date 2022-05-15 \
--group-by year month \
--subsample-max-sequences 8 \
--subsample-seed 0 \
--output-metadata out.tsv
# Sampling at 2 per group.
# 4 strains were dropped during filtering
# 4 of these were dropped because of subsampling criteria
# 6 strains passed all filters
cat out.tsv | sort -k 2
# SEQ1 2022-03-21
# SEQ2 2022-03-22
# SEQ4 2022-04-01
# SEQ5 2022-04-02
# SEQ7 2022-05-01
# SEQ9 2022-05-03
# strain date
When requesting --subsample-max-sequences, augur filter samples evenly from the 3 groups 2022-03, 2022-04, and 2022-05. However, --min-date and --max-date restrict the sampling window to roughly half of 2022-03, all of 2022-04, and half of 2022-05. An ideal sampling strategy would sample in proportion to each group's share of the sampling window (e.g. a 2/4/2 split).
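One way such a proportional split could be computed is sketched below in Python. This is an illustration under stated assumptions, not augur's implementation: the function name proportional_targets is hypothetical, group weights are taken to be the number of window days in each year-month, and largest-remainder rounding is one possible choice for turning fractional shares into whole sequence counts.

```python
from collections import Counter
from datetime import date, timedelta


def proportional_targets(min_date, max_date, max_sequences):
    """Split max_sequences across year-month groups in proportion to the
    number of days each group contributes to the inclusive date window."""
    days = Counter()
    day = min_date
    while day <= max_date:
        days[(day.year, day.month)] += 1
        day += timedelta(days=1)
    total_days = sum(days.values())

    # Largest-remainder rounding (an assumed choice): floor each exact share,
    # then give the leftover sequences to the largest fractional parts.
    exact = {group: max_sequences * n / total_days for group, n in days.items()}
    targets = {group: int(share) for group, share in exact.items()}
    leftover = max_sequences - sum(targets.values())
    by_remainder = sorted(exact, key=lambda g: exact[g] - targets[g], reverse=True)
    for group in by_remainder[:leftover]:
        targets[group] += 1
    return targets


targets = proportional_targets(date(2022, 3, 15), date(2022, 5, 15), 8)
print(targets)  # {(2022, 3): 2, (2022, 4): 4, (2022, 5): 2}
```

The targets describe only the ideal split. In the example metadata, 2022-04 has just 3 sequences, so a real implementation would also need a rule for groups that cannot meet their target.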