Context
Undersampling occurs in augur filter when the number of available sequences is lower than the targeted group size. This is not reported in any output. It is explained in this recently added docs section:
Consider a dataset with 200 sequences available from 2023 and 100 sequences available from 2024. --group-by year --subsample-max-sequences 300 is equivalent to --group-by year --sequences-per-group 150. This will take 150 sequences from 2023 and all 100 sequences from 2024 for a total of 250 sequences, which is less than the target of 300.
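The arithmetic in the quoted example can be sketched as follows. This is a simplified model of the behavior described in the docs excerpt, not augur's actual implementation: it assumes the total cap is split evenly across groups by integer division, and the function names `per_group_target` and `sample_sizes` are hypothetical.

```python
def per_group_target(max_sequences: int, n_groups: int) -> int:
    """Uniform per-group target implied by a total cap (assumed: even split)."""
    return max_sequences // n_groups


def sample_sizes(available: dict[str, int], max_sequences: int) -> dict[str, int]:
    """Sequences taken per group: the target, or fewer if the group is too small."""
    target = per_group_target(max_sequences, len(available))
    # No top-up: a shortfall in one group is not redistributed to the others.
    return {group: min(count, target) for group, count in available.items()}


available = {"2023": 200, "2024": 100}
sizes = sample_sizes(available, max_sequences=300)
print(sizes)                # {'2023': 150, '2024': 100}
print(sum(sizes.values()))  # 250, short of the requested 300
```

The shortfall of 50 comes entirely from the 2024 group, which is the undersampling this issue is about.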
Some historical context from #1454 (comment):
In the original formulation of only --sequences-per-group, the idea was that specifying --sequences-per-group 10 and --group-by country would target 10 sequences per country and randomly sample these sequences for each country group. In the original formulation, we wouldn't top-up other countries. I think this is a semantic complication with adding the convenience parameter of --subsample-max-sequences. I'd think of --subsample-max-sequences as solely specifying --sequences-per-group.
Possible solutions
Roughly sorted from least to most work involved.
- Add warnings. Example:
  WARNING: Targeted 150 sequences for group [year='2024'] but only 100 are available.
- Add an option --output-group-by-sizes to highlight any discrepancies. Example:

  | year | target size | available sequences | output size |
  |------|-------------|---------------------|-------------|
  | 2023 | 150         | 200                 | 150         |
  | 2024 | 150         | 100                 | 100         |

Both the warnings and the --output-group-by-sizes option have been adopted for --group-by-weights in #1454, but they could be extended to other sampling methods.
- Create an "augur filter GUI" that has a sidebar with controls to adjust augur filter parameters and graphs on the main view that show the spread of output data.
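The warning and group-size table proposals above could be produced from the same per-group counts. The following is a minimal sketch under stated assumptions, not augur's code: the function name `group_size_report` is hypothetical, the grouping column is hard-coded to `year`, and the message and column names simply copy the examples in this issue.

```python
def group_size_report(available: dict[str, int], target: int):
    """Build table rows and undersampling warnings for uniform per-group sampling."""
    rows, warnings = [], []
    for year, count in available.items():
        output = min(count, target)
        rows.append({
            "year": year,
            "target size": target,
            "available sequences": count,
            "output size": output,
        })
        if count < target:
            # {year!r} renders the string with quotes, e.g. year='2024'.
            warnings.append(
                f"WARNING: Targeted {target} sequences for group "
                f"[year={year!r}] but only {count} are available."
            )
    return rows, warnings


rows, warnings = group_size_report({"2023": 200, "2024": 100}, target=150)
for message in warnings:
    print(message)
for row in rows:
    print("\t".join(str(value) for value in row.values()))
```

Computing both outputs in one pass keeps the warning text and the table consistent with each other, since they are derived from the same target and available counts.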