filter: Make under-sampling more apparent

### Context

Under-sampling occurs in `augur filter` when a group has fewer available sequences than the targeted group size. The shortfall is not reported in any output. The behavior is explained in this recently added [docs section](https://docs.nextstrain.org/en/latest/guides/bioinformatics/filtering-and-subsampling.html#caveats):

> consider a dataset with 200 sequences available from 2023 and 100 sequences available from 2024. `--group-by year --subsample-max-sequences 300` is equivalent to `--group-by year --sequences-per-group 150`. This will take 150 sequences from 2023 and all 100 sequences from 2024 for a total of 250 sequences, which is less than the target of 300.
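
To make the arithmetic in that example concrete, here is a minimal sketch. It is not augur's implementation: `expected_output` is a hypothetical helper, and it assumes the per-group target is simply the maximum divided evenly across groups, which matches the docs example but may not cover every case `augur filter` handles.

```python
def expected_output(available, max_sequences):
    """Illustrative only: split max_sequences evenly across groups.

    available: dict mapping group name -> number of available sequences.
    Returns the per-group target and the number actually sampled per group.
    """
    # Assumption: an even split, as in the docs example (300 / 2 groups = 150).
    per_group = max_sequences // len(available)
    # A group can never contribute more sequences than it has.
    sampled = {group: min(n, per_group) for group, n in available.items()}
    return per_group, sampled


per_group, sampled = expected_output({"2023": 200, "2024": 100}, 300)
print(per_group)              # 150
print(sampled)                # {'2023': 150, '2024': 100}
print(sum(sampled.values()))  # 250, short of the target of 300
```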

Some historical context from https://github.com/nextstrain/augur/pull/1454#discussion_r1723786170:

> In the original formulation of only `--sequences-per-group` the idea was to say specify `--sequences-per-group 10` and `--group-by country` would target 10 sequences per country and randomly sample these sequences for each country group. In the original formulation, we wouldn't top-up other countries. I think this is a semantic complication with adding the convenience parameter of `--subsample-max-sequences`. I'd think of `--subsample-max-sequences` as solely specifying `--sequences-per-group`.



### Possible solutions

Roughly sorted from least to most work involved.

1. Add warnings. Example:

    ```
    WARNING: Targeted 150 sequences for group [year='2024'] but only 100 are available.
    ```

2. Add an option `--output-group-by-sizes` to highlight any discrepancies. Example:

    | year | target size | available sequences | output size |
    |------|-------------|---------------------|-------------|
    | 2023 | 150         | 200                 | 150         |
    | 2024 | 150         | 100                 | 100         |

Both (1) and (2) have been adopted for `--group-by-weights` in #1454, but they could be extended to other sampling methods.

3. Create an "`augur filter` GUI" with a sidebar of controls for adjusting `augur filter` parameters and graphs in the main view that show the spread of the output data.
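
Options (1) and (2) could share one pass over the group sizes. The sketch below is only an illustration of that idea under stated assumptions: `report_group_sizes` is a hypothetical function (not part of augur), the grouping column is hard-coded to `year`, and the tab-separated layout is a guess at what a sizes file might contain.

```python
import sys


def report_group_sizes(available, target_size, column="year"):
    """Illustrative only: build table rows and warnings for each group.

    available: dict mapping group value -> number of available sequences.
    target_size: targeted number of sequences per group.
    """
    rows = []
    warnings = []
    for group, n_available in sorted(available.items()):
        output_size = min(n_available, target_size)
        rows.append((group, target_size, n_available, output_size))
        if n_available < target_size:
            warnings.append(
                f"WARNING: Targeted {target_size} sequences for group "
                f"[{column}={group!r}] but only {n_available} are available."
            )
    return rows, warnings


rows, warnings = report_group_sizes({"2023": 200, "2024": 100}, 150)

# Option (1): warnings go to stderr.
for message in warnings:
    print(message, file=sys.stderr)

# Option (2): a tab-separated table of sizes (hypothetical column names).
print("year\ttarget size\tavailable sequences\toutput size")
for row in rows:
    print("\t".join(str(value) for value in row))
```

With the example dataset this yields one warning (for 2024) and the two table rows shown in option (2).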
