Skip to content

filter: Do not subsample probabilistically when users specify --sequences-per-group #736

@huddlej

Description

@huddlej

Current Behavior

Although the documentation for augur filter's --probabilistic-sampling option says that we only sample probabilistically "when --subsample-max-sequences is provided", we actually always probabilistically filter unless the user has disabled it with the additional --no-probabilistic-sampling flag.

Expected behavior

When the user provides a specific value for --sequences-per-group, augur filter should not use that value as a parameter for Poisson sampling. Instead, augur filter should produce at most the requested number of sequences per group and no more.

How to reproduce

cd tests/functional
augur filter \
    --metadata filter/metadata.tsv \
    --group-by region \
    --sequences-per-group 1 \
    --output-strains test.txt

The output is (with some variation due to random sampling):

10 strains were dropped during filtering
	10 of these were dropped because of subsampling criteria
1 strains passed all filters

The output should be the following (since there are three distinct regions in the Zika metadata for Augur tests and 11 total strains):

8 strains were dropped during filtering
	8 of these were dropped because of subsampling criteria
3 strains passed all filters

Possible solution

Create a top-level boolean variable for probabilistic_sampling in filter's run function initialized to the value from the argument parser. Then set this variable to False if the user has provided a specific --sequences-per-group. Use this variable downstream instead of referring directly to args.probabilistic_sampling.

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions