filter: Make under-sampling more apparent #1590

@victorlin

Context

Under-sampling occurs in augur filter when the number of sequences available in a group is lower than the targeted group size. This is not reported in any output. It is explained in this recently added docs section:

consider a dataset with 200 sequences available from 2023 and 100 sequences available from 2024. --group-by year --subsample-max-sequences 300 is equivalent to --group-by year --sequences-per-group 150. This will take 150 sequences from 2023 and all 100 sequences from 2024 for a total of 250 sequences, which is less than the target of 300.
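The arithmetic in the quoted example can be reproduced with a minimal sketch. This is not augur's implementation: the function name `sample_sizes` is hypothetical, and it assumes the per-group target is simply the maximum divided evenly (integer division) across the groups, which matches the example but may not cover every case augur handles.

```python
def sample_sizes(available, max_sequences):
    """Return the number of sequences taken from each group.

    Assumption: the per-group target is max_sequences split evenly
    across groups. A group with fewer sequences than the target
    contributes everything it has, and the shortfall is not
    redistributed to other groups.
    """
    per_group = max_sequences // len(available)
    return {group: min(count, per_group) for group, count in available.items()}


# Dataset from the docs example: 200 sequences from 2023, 100 from 2024.
available = {"2023": 200, "2024": 100}
sizes = sample_sizes(available, max_sequences=300)
total = sum(sizes.values())

print(sizes)  # {'2023': 150, '2024': 100}
print(total)  # 250, which is less than the target of 300
```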

Some historical context from #1454 (comment):

In the original formulation of only --sequences-per-group, the idea was that specifying --sequences-per-group 10 and --group-by country would target 10 sequences per country and randomly sample these sequences for each country group. In the original formulation, we wouldn't top up other countries. I think this is a semantic complication with adding the convenience parameter of --subsample-max-sequences. I'd think of --subsample-max-sequences as solely specifying --sequences-per-group.

Possible solutions

Roughly sorted from least to most work involved.

  1. Add warnings. Example:

    WARNING: Targeted 150 sequences for group [year='2024'] but only 100 are available.
    
  2. Add an option --output-group-by-sizes to highlight any discrepancies. Example:

    year  target size  available sequences  output size
    2023  150          200                  150
    2024  150          100                  100

Both (1) and (2) have been adopted for --group-by-weights in #1454, but they could be extended to other sampling methods.

  3. Create an "augur filter GUI" that has a sidebar with controls to adjust augur filter parameters and graphs in the main view that show the spread of output data.
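Options (1) and (2) could be prototyped roughly as follows. This is only a sketch under stated assumptions: `group_size_report` and `undersampling_warnings` are hypothetical helpers, not existing augur functions, and the input is assumed to be a plain mapping from group value to the number of available sequences.

```python
import sys


def group_size_report(available, target_size):
    """Build one row per group: (group, target size, available, output size).

    Hypothetical helper; assumes every group shares the same target size.
    """
    rows = []
    for group, n_available in sorted(available.items()):
        output_size = min(n_available, target_size)
        rows.append((group, target_size, n_available, output_size))
    return rows


def undersampling_warnings(rows, column="year"):
    """Return one warning per group with fewer sequences than targeted."""
    return [
        f"WARNING: Targeted {target} sequences for group "
        f"[{column}='{group}'] but only {n_available} are available."
        for group, target, n_available, _ in rows
        if n_available < target
    ]


rows = group_size_report({"2023": 200, "2024": 100}, target_size=150)
warnings = undersampling_warnings(rows)

# Option (1): warnings go to stderr so they do not mix with regular output.
for message in warnings:
    print(message, file=sys.stderr)

# Option (2): a tab-separated table of per-group sizes.
print("\t".join(["year", "target size", "available sequences", "output size"]))
for row in rows:
    print("\t".join(str(value) for value in row))
```

Building the rows once and deriving both the warnings and the table from them keeps the two outputs consistent with each other.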

Labels

enhancement: New feature or request
