Allow weighted subsampling #1318

@victorlin

Context

Currently, --subsample-max-sequences effectively calculates a value for --sequences-per-group which applies to all groups specified by --group-by.

This behavior does not suit every scenario. For example, suppose there are 5 countries with vastly different population sizes and 60 sequences are requested for subsampling. The command would be:

augur filter \
  --group-by country \
  --subsample-max-sequences 60

This means 12 sequences (60 / 5) will be sampled from each country, regardless of population size. One may want the sample to be proportional to country population sizes rather than uniform. This limitation is what prompts higher-level subsampling workarounds such as nextstrain/ncov#1074.

Original proposed solution

Implement an option --subsample-weights, which reads a file that specifies weights per --group-by column. A simple example:

augur filter \
  --group-by country \
  --subsample-max-sequences 60 \
  --subsample-weights weights.yaml

weights.yaml:

# Weight countries by population size.
country:
    A: 1000
    B: 1000
    C: 300
    D: 100
    E: 600

With this information, a different number of sequences can be calculated per group. The weights sum to 3000, so:

  • A would have 60*1000/3000 = 20 sequences.
  • C would have 60*300/3000 = 6 sequences.

The absence of a column from the weights file can imply equal weighting. In other words, the example can be updated to use --group-by country month while keeping weights.yaml as-is to have weighted country sampling for each time bin.
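
One way to read "absence implies equal weighting" is sketched below. It assumes the weights file has already been parsed into a dict keyed by column name, and that the set of values observed in the metadata for each column is known. The helper name `column_weights` and the choice to raise an error for an unlisted value are assumptions made for illustration.

```python
def column_weights(column, parsed_weights, observed_values):
    """Return a weight for every observed value of a --group-by column.

    `parsed_weights` is the weights file loaded as a dict: column -> {value: weight}.
    A column missing from the file gets a weight of 1 for every observed value.
    """
    if column not in parsed_weights:
        # Column absent from the weights file: all values are weighted equally.
        return {value: 1 for value in observed_values}

    weights = parsed_weights[column]
    missing = [value for value in observed_values if value not in weights]
    if missing:
        # Assumption: a weighted column must list every observed value.
        raise ValueError(f"No weight given for {column!r} values: {missing}")
    return {value: weights[value] for value in observed_values}


parsed = {"country": {"A": 1000, "B": 1000, "C": 300, "D": 100, "E": 600}}
print(column_weights("country", parsed, ["A", "C"]))
print(column_weights("month", parsed, ["2020-01", "2020-02"]))
```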

Or, a more complex example where time is also weighted:

augur filter \
  --group-by country month \
  --subsample-max-sequences 60 \
  --subsample-weights weights.yaml

weights.yaml:

# Weight countries by population size.
country:
    A: 1000
    B: 1000
    C: 300
    D: 100
    E: 600

# Get twice as many sequences from 2021 as from 2020.
month:
    2020-01: 1
    2020-02: 1
    2020-03: 1
    # … all months in 2020 are weighted with 1
    2021-01: 2
    2021-02: 2
    2021-03: 2
    # … all months in 2021 are weighted with 2

Notes:

  1. The file format is up for debate. At the least, it can be JSON or YAML, but not anything tabular (a single table does not have enough dimensions to cover multiple --group-by columns).
  2. This requires every possible value of a weighted column to have a weight, so that the denominator can be calculated by summing all of that column's weights.
  3. Weights should be relative within each column.
  4. (I think) as long as the weights are non-negative, the values can be multiplied across columns to get effective weighting for all combinations.
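
Notes 2 to 4 can be illustrated with a small sketch that multiplies per-column weights to get a weight for every combination of group values, then normalizes by the sum over all combinations. The function names are hypothetical and the weights are deliberately small so the arithmetic is easy to check by hand. This is one possible reading of the proposal, not a settled design.

```python
from itertools import product
from math import prod


def combined_weights(weights_by_column):
    """Multiply per-column weights to get one weight per combination of values.

    `weights_by_column` maps column name -> {value: weight}. Keys of the result
    are tuples of values, ordered like the columns in `weights_by_column`.
    """
    per_column_items = [weights.items() for weights in weights_by_column.values()]
    return {
        tuple(value for value, _ in combo): prod(weight for _, weight in combo)
        for combo in product(*per_column_items)
    }


def sequences_per_combination(weights_by_column, max_sequences):
    """Split max_sequences across all combinations in proportion to combined weight."""
    combined = combined_weights(weights_by_column)
    denominator = sum(combined.values())
    return {key: max_sequences * w / denominator for key, w in combined.items()}


# Toy weights: country A is weighted 3x country B, and 2021-01 is weighted 2x 2020-01.
example = {
    "country": {"A": 3, "B": 1},
    "month": {"2020-01": 1, "2021-01": 2},
}
allocation = sequences_per_combination(example, 12)
for key, target in allocation.items():
    print(key, target)
```

Because the weights are only multiplied and summed, the denominator equals the product of the per-column sums (here 4 * 3 = 12), which is why each column's weights only need to be relative within that column.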
