Current Behavior
Although the documentation for augur filter's --probabilistic-sampling option says that we only sample probabilistically "when --subsample-max-sequences is provided", we actually always probabilistically filter unless the user has disabled it with the additional --no-probabilistic-sampling flag.
Expected behavior
When the user provides a specific value for --sequences-per-group, augur filter should not use that value as a parameter for Poisson sampling. Instead, augur filter should produce at most the requested number of sequences per group and no more.
How to reproduce
cd tests/functional
augur filter \
--metadata filter/metadata.tsv \
--group-by region \
--sequences-per-group 1 \
--output-strains test.txt
The output is (with some variation due to random sampling):
10 strains were dropped during filtering
10 of these were dropped because of subsampling criteria
1 strains passed all filters
The output should be the following (since there are three distinct regions in the Zika metadata for Augur tests and 11 total strains):
8 strains were dropped during filtering
8 of these were dropped because of subsampling criteria
3 strains passed all filters
Possible solution
Create a top-level boolean variable for probabilistic_sampling in filter's run function initialized to the value from the argument parser. Then set this variable to False if the user has provided a specific --sequences-per-group. Use this variable downstream instead of referring directly to args.probabilistic_sampling.
Current Behavior
Although the documentation for augur filter's
--probabilistic-samplingoption says that we only sample probabilistically "when--subsample-max-sequencesis provided", we actually always probabilistically filter unless the user has disabled it with the additional--no-probabilistic-samplingflag.Expected behavior
When the user provides a specific value for
--sequences-per-group, augur filter should not use that value as a parameter for Poisson sampling. Instead, augur filter should produce at most the requested number of sequences per group and no more.How to reproduce
The output is (with some variation due to random sampling):
The output should be the following (since there are three distinct regions in the Zika metadata for Augur tests and 11 total strains):
Possible solution
Create a top-level boolean variable for
probabilistic_samplingin filter'srunfunction initialized to the value from the argument parser. Then set this variable toFalseif the user has provided a specific--sequences-per-group. Use this variable downstream instead of referring directly toargs.probabilistic_sampling.