Allow weighted subsampling #1318

## Context

Currently, `--subsample-max-sequences` effectively calculates a single value for `--sequences-per-group`, which applies uniformly to all groups specified by `--group-by`.

This uniform allocation does not suit every scenario. Example: There are 5 countries with vastly different population sizes, and 60 sequences are requested for subsampling. The command would be:

```
augur filter \
  --group-by country \
  --subsample-max-sequences 60
```

This means 12 sequences will be sampled from each country, regardless of population size. One may want the sample to be representative of country population sizes and not uniform. This limitation is what prompts higher-level subsampling workarounds such as https://github.com/nextstrain/ncov/pull/1074.
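As a rough illustration of the arithmetic behind that number (not Augur's actual implementation), the uniform split can be sketched in a few lines. The country names `A` to `E` are placeholders, and the sketch assumes the total divides evenly across groups.

```python
# Hypothetical illustration of uniform allocation; not Augur's internal code.
countries = ["A", "B", "C", "D", "E"]  # placeholder names for the 5 countries
max_sequences = 60

# Every group receives the same share, regardless of population size.
sequences_per_group = max_sequences // len(countries)
allocation = {country: sequences_per_group for country in countries}

print(allocation)
# {'A': 12, 'B': 12, 'C': 12, 'D': 12, 'E': 12}
```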

## Tasks

- [x] #1454
- [x] Release in a new version of Augur: [25.3.0](https://github.com/nextstrain/augur/actions/runs/10512823207)
- [x] https://github.com/nextstrain/docs.nextstrain.org/pull/223

## Rollout

- [x] Use it in ncov workflow: https://github.com/nextstrain/ncov/issues/1141
- [ ] Use it in other workflows?

## Original proposed solution

Implement an option `--subsample-weights`, which reads a file that specifies weights per `--group-by` column. A simple example:

```
augur filter \
  --group-by country \
  --subsample-max-sequences 60 \
  --subsample-weights weights.yaml
```

`weights.yaml`:

```yaml
# Weight countries by population size.
country:
    A: 1000
    B: 1000
    C: 300
    D: 100
    E: 600
```

With this information, a different number of sequences can be calculated per group. The weights sum to 3000, so:

- `A` would have 60*1000/3000 = 20 sequences.
- `C` would have 60*300/3000 = 6 sequences.
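The same calculation for all five countries can be sketched as follows. This is only a minimal illustration of the proposed arithmetic, with the weights copied from `weights.yaml` above into a plain dictionary; it does not address how non-integer targets would be handled.

```python
# Sketch of the proposed per-group calculation; not an actual Augur API.
weights = {"A": 1000, "B": 1000, "C": 300, "D": 100, "E": 600}
max_sequences = 60

# The denominator is the sum of all weights in the column.
total_weight = sum(weights.values())

# Each group's target is its proportional share of the requested total.
targets = {
    country: max_sequences * weight / total_weight
    for country, weight in weights.items()
}

for country, target in targets.items():
    print(country, target)
```

In this example every target happens to be a whole number (20, 20, 6, 2, and 12), and the targets add up to the requested 60.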

The absence of a column from the weights file can imply equal weighting. In other words, the example can be updated to use `--group-by country month` while keeping `weights.yaml` as-is to have weighted `country` sampling for each time bin.

Or, a more complex example where time is also weighted:

```
augur filter \
  --group-by country month \
  --subsample-max-sequences 60 \
  --subsample-weights weights.yaml
```

`weights.yaml`:

```yaml
# Weight countries by population size.
country:
    A: 1000
    B: 1000
    C: 300
    D: 100
    E: 600

# Get twice as many sequences from 2021 as from 2020.
month:
    2020-01: 1
    2020-02: 1
    2020-03: 1
    # … all months in 2020 are weighted with 1
    2021-01: 2
    2021-02: 2
    2021-03: 2
    # … all months in 2021 are weighted with 2
```

Notes:

1. The file format is up for debate. At the least, it can be JSON or YAML, but not anything tabular (not enough dimensions to cover multiple `--group-by` columns).
2. This would require every possible value of a weighted column to have a weight, so that the denominator can be calculated by summing up all the values.
3. Weights should be relative within each column.
4. (I think) as long as the weights are non-negative, the values can be multiplied across columns to get effective weighting for all combinations.
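To make notes 2 to 4 concrete, here is a hedged sketch of how per-column weights might be combined. The function names (`column_weights`, `group_targets`), the `observed` dictionary of values per column, and the shortened month list are all hypothetical and exist only for illustration. The sketch assumes a column absent from the weights file is weighted equally, and that a weighted column must cover every observed value.

```python
from itertools import product
from math import prod


def column_weights(column, observed_values, weights):
    """Return a weight for each observed value of one group-by column."""
    if column not in weights:
        # Assumption: a column missing from the weights file is weighted equally.
        return {value: 1 for value in observed_values}
    missing = [v for v in observed_values if v not in weights[column]]
    if missing:
        # Note 2: every value of a weighted column needs a weight.
        raise ValueError(f"No weight for {column} values: {missing}")
    selected = {v: weights[column][v] for v in observed_values}
    if any(w < 0 for w in selected.values()):
        # Note 4: weights must be non-negative.
        raise ValueError(f"Negative weight in column {column}")
    return selected


def group_targets(group_by, observed, weights, max_sequences):
    """Multiply weights across columns and scale them to max_sequences."""
    per_column = [column_weights(c, observed[c], weights) for c in group_by]
    combined = {}
    for combo in product(*(cw.items() for cw in per_column)):
        key = tuple(value for value, _ in combo)
        combined[key] = prod(weight for _, weight in combo)
    total = sum(combined.values())
    if total == 0:
        raise ValueError("All weights are zero")
    return {key: max_sequences * w / total for key, w in combined.items()}


weights = {
    "country": {"A": 1000, "B": 1000, "C": 300, "D": 100, "E": 600},
    "month": {"2020-12": 1, "2021-01": 2},  # shortened, hypothetical month list
}
observed = {
    "country": ["A", "B", "C", "D", "E"],
    "month": ["2020-12", "2021-01"],
}

targets = group_targets(["country", "month"], observed, weights, 60)
for group, target in targets.items():
    print(group, round(target, 2))
```

Because the weights are relative within each column (note 3), multiplying them preserves each column's proportions: country `A` still receives a third of the total across both months, and each 2021 month receives twice the share of a 2020 month. The targets are generally fractional, and how to round or sample from them is left open here.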

