filter: Reduce over-sampling in partial months with --group-by month #960

@victorlin

Context

@trvrb from nextstrain/ncov#957:

With timespans this narrow, there is an unavoidable, odd interaction with how augur filter subsamples based on --vpm, i.e. viruses per month. A common situation: if the current date is, say, May 15, we end up with

  • min date of March 15
  • desire by augur filter to equally sample viruses from March, April and May categories

so that March and May each have 2 weeks for sampling X viruses, while April has 4 weeks for sampling X viruses. As a result, March and May are sampled more densely, in terms of viruses per day, than April.

The effect is more pronounced when the current date is, say, May 28, so that X viruses are sampled from 3 days in March and from 30 days in April.

To fully address this we'd need to extend augur filter to have the option of per-week sampling categories in addition to per-month sampling categories. Or perhaps some continuous specification. However, I don't think this is too big of an issue in terms of the current PR and it's something we can refine once Augur is updated.
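The uneven density described in the quote can be checked with a few lines of Python. This is only an illustrative sketch, not Augur code: `days_per_month` is a hypothetical helper, the year 2022 is borrowed from the example further down, and `x = 100` is an arbitrary stand-in for "X viruses".

```python
from collections import Counter
from datetime import date, timedelta


def days_per_month(min_date, max_date):
    """Count the days of each (year, month) inside [min_date, max_date]."""
    counts = Counter()
    day = min_date
    while day <= max_date:
        counts[(day.year, day.month)] += 1
        day += timedelta(days=1)
    return dict(counts)


# Window from the quote: min date March 15, current date May 15.
window = days_per_month(date(2022, 3, 15), date(2022, 5, 15))

x = 100  # hypothetical number of viruses sampled per month
for (year, month), n_days in sorted(window.items()):
    print(f"{year}-{month:02d}: {n_days} days, {x / n_days:.1f} viruses/day")
# 2022-03: 17 days, 5.9 viruses/day
# 2022-04: 30 days, 3.3 viruses/day
# 2022-05: 15 days, 6.7 viruses/day
```

With equal per-month counts, the two partial months end up sampled at roughly twice the per-day rate of April.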

Example

cat > metadata.tsv << ~~
strain	date
SEQ1	2022-03-21
SEQ2	2022-03-22
SEQ3	2022-03-23
SEQ4	2022-04-01
SEQ5	2022-04-02
SEQ6	2022-04-03
SEQ7	2022-05-01
SEQ8	2022-05-02
SEQ9	2022-05-03
SEQ10	2022-05-04
~~

augur filter \
--metadata metadata.tsv \
--min-date 2022-03-15 \
--max-date 2022-05-15 \
--group-by year month \
--subsample-max-sequences 8 \
--subsample-seed 0 \
--output-metadata out.tsv
# Sampling at 2 per group.
# 4 strains were dropped during filtering
# 	4 of these were dropped because of subsampling criteria
# 6 strains passed all filters

cat out.tsv | sort -k 2
# SEQ1	2022-03-21
# SEQ2	2022-03-22
# SEQ4	2022-04-01
# SEQ5	2022-04-02
# SEQ7	2022-05-01
# SEQ9	2022-05-03
# strain	date

With --subsample-max-sequences, augur filter samples evenly from the 3 groups 2022-03, 2022-04, and 2022-05. However, --min-date and --max-date restrict the sampling window to roughly half of 2022-03, all of 2022-04, and half of 2022-05. An ideal strategy would sample in proportion to each group's share of the sampling window (e.g. a 2/4/2 split of the 8 requested sequences).
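One possible way to derive such a split is to weight each month by the number of its days that fall inside the --min-date/--max-date window and distribute the requested total with a largest-remainder rounding. The sketch below is an assumption about how this could work, not how augur filter behaves today; `proportional_allocation` is a hypothetical name.

```python
from collections import Counter
from datetime import date, timedelta


def proportional_allocation(min_date, max_date, max_sequences):
    """Split max_sequences across year-month groups by days inside the window."""
    days = Counter()
    day = min_date
    while day <= max_date:
        days[f"{day.year}-{day.month:02d}"] += 1
        day += timedelta(days=1)

    total_days = sum(days.values())
    # Exact (fractional) share of the total for each group.
    exact = {group: max_sequences * n / total_days for group, n in days.items()}
    # Start from the floor of each share ...
    alloc = {group: int(share) for group, share in exact.items()}
    # ... then hand the leftover sequences to the largest remainders.
    leftover = max_sequences - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda g: exact[g] - alloc[g], reverse=True)
    for group in by_remainder[:leftover]:
        alloc[group] += 1
    return alloc


targets = proportional_allocation(date(2022, 3, 15), date(2022, 5, 15), 8)
print(targets)
# {'2022-03': 2, '2022-04': 4, '2022-05': 2}
```

These numbers are per-group targets only. In the example metadata, 2022-04 has just 3 sequences, so a real implementation would also need to decide what to do when a group cannot fill its target.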

Labels

enhancement (New feature or request), proposal (Proposals that warrant further discussion)