Implement weighted sampling by victorlin · Pull Request #1454 · nextstrain/augur

victorlin · 2024-04-25T23:39:37Z

Description of proposed changes

This new option takes a weights file to determine group sizes. Summarized by the help text:

Lines 73 to 97 in 9a2090b

    
               subsample_group.add_argument('--group-by-weights', type=str, metavar="FILE", help=""" 
        
                   TSV file defining weights for grouping. Requirements: 
        
                   (1) Lines starting with '#' are treated as comment lines. 
        
                   (2) The first non-comment line must be a header row. 
        
                   (3) There must be a numeric ``weight`` column (weights can take on any 
        
                       non-negative values). 
        
                   (4) Other columns must be a subset of columns used in ``--group-by``, 
        
                       with combinations of values covering all combinations present in the 
        
                       metadata. 
        
                   (5) This option only applies when ``--group-by`` and 
        
                       ``--subsample-max-sequences`` are provided. 
        
                   (6) This option cannot be used with ``--no-probabilistic-sampling``. 
        
                   Notes: 
        
                   (1) Any ``--group-by`` columns absent from this file will be given equal 
        
                       weighting across all values *within* groups defined by the other 
        
                       weighted columns. 
        
                   (2) An entry with the value ``default`` under all columns will be 
        
                       treated as the default weight for specific groups present in the 
        
                       metadata but missing from the weights file. If there is no default 
        
                       weight and the metadata contains rows that are not covered by the 
        
                       given weights, augur filter will exit with an error. 
        
               """)

Related issue(s)

Allow weighted subsampling #1318

Notable comment threads

How to handle a mix of weighted and unweighted columns? Apply weights within unweighted groups
How to handle weighted groups that don't appear in metadata? Drop them
Add an option to show number of outputted sequences? Done
How to handle over-targeting group sizes with only weighted columns? Added warnings in output
How to handle over-targeting group sizes with a mix of weighted and unweighted columns? Added warnings in output
Clarify interaction with probabilistic sampling Done

Checklist

Prototype usage in an existing workflow: Use weighted sampling for Asia builds ncov#1106
Automated checks pass
Check if you need to add a changelog message
Check if you need to add tests (added)
Check if you need to update docs
Create issue for FIXME: filter: Make under-sampling more apparent #1590

augur/filter/subsample.py

victorlin · 2024-04-30T23:42:38Z

This should be in a working state, at least based on limited testing in subsample-weighted.t. Planning to test using #1444 + nextstrain/ncov#1102 to get a feel for how add an additional weights.tsv file fits in a pathogen repo and make sure it works as expected on real data.

UPDATE: first testing in ncov independently of augur subsample

augur/filter/subsample.py

jameshadfield

This is going to be really powerful @victorlin, nice work. I tested this out a little bit, but haven't gone through every part of it. There's a few aspects that this brought up for which I'll think through a little more, but I think the overall direction is sound.

augur/filter/__init__.py

augur/filter/subsample.py

<#1454 (comment)>

genehack · 2024-05-20T20:04:52Z

Having just read this from scratch, and hoping to help it get moving again, it seems like the remaining outstanding discussion items are:

#1 adding an (optional?) output with requested and calculated group sizes – which seems like a pretty valuable debugging tool to me, but maybe non-blocking for this particular PR
#2 validating that the current behavior when a group-by group is empty is desired convo continued in second comment – seems like this is the case but maybe a bit more explicit validation/commentary would be good?

Did I miss anything @victorlin ?

victorlin · 2024-05-20T20:35:46Z

Thanks for bumping this @genehack. I added a comment on the thread for (1). Testing and discussion in nextstrain/ncov#1106 should address (2) and impact the overall direction of this PR. I want to make sure we have at least one solid use case for this feature.

<#1454 (comment)>

codecov · 2024-07-17T19:03:04Z

Codecov Report

Attention: Patch coverage is 98.72611% with 2 lines in your changes missing coverage. Please review.

Project coverage is 70.98%. Comparing base (d8faf01) to head (9a2090b).
Report is 120 commits behind head on master.

Files with missing lines	Patch %	Lines
augur/filter/weights_file.py	94.59%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1454      +/-   ##
==========================================
+ Coverage   70.51%   70.98%   +0.46%     
==========================================
  Files          78       79       +1     
  Lines        8107     8244     +137     
  Branches     1966     2002      +36     
==========================================
+ Hits         5717     5852     +135     
- Misses       2100     2101       +1     
- Partials      290      291       +1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚨 Try these New Features:

Flaky Tests Detection - Detect and resolve failed and flaky tests

new changes

joverlee521

I haven't fully grokked what's going on in get_weighted_group_sizes, but the behaviors described in the functional tests make sense to me.

Once this feature is released, it definitely deserves it's own section in the Filtering and Subsampling docs.

augur/filter/weights_file.py

jameshadfield · 2024-08-05T23:19:11Z

tests/functional/filter/cram/subsample-weighted-and-uniform-mix.t

+
+Using 1:1 weights is similarly straightforward, with 50 sequences from each location.
+
+  $ cat >weights.tsv <<~~


[general question about test best practices]

I commonly use cram --keep-tmpdir and then inspect the files themselves to understand the tests & thus behaviour of augur. This test overwrites files (e.g. weights.tsv) so that approach doesn't work. It also seems like it'd hide bugs where a command fails, but exits code 0, and thus we're not using the version of the file we think we are. Thoughts on using weights1.tsv, weights2.tsv etc?

I've added more specific file names: b7fc345.

The other approach which I've found better for debugging tests is to split the test file into multiple files. However, in this case it's better to keep all in the same file to share the temporary metadata.tsv.

It also seems like it'd hide bugs where a command fails, but exits code 0, and thus we're not using the version of the file we think we are.

I assume "command" = "Augur command", but I'm having trouble understanding your scenario.

Weights files are created separately from the Augur commands.

Augur command failures will exit with a nonzero exit code and be shown as a test failure.

tests/functional/filter/cram/subsample-weighted-and-uniform-mix.t

victorlin · 2024-08-07T22:54:00Z

[@joverlee521] I haven't fully grokked what's going on in get_weighted_group_sizes

I've split this function up into smaller parts and added types to make it more readable: c305fe8

augur/filter/__init__.py

trvrb

@victorlin: This is great! I appreciate all the documentation and tests. I'm good for this to be merged as it stands. ~~However, I do think we should follow up shortly with a separate PR that implements support for --probabilistic-sampling + --group-by-weights per #1454 (comment).~~ I was confused. Nothing more to be done with probabilistic sampling.

This reverts "Rename probabilistic sampling group" (a8f8827). That change was made prematurely and it is no longer necessary.

This makes it easier to weight on month or week values from a TSV file. No reason to use a Python list object when all that's needed is a unique value per year-month or year-week.

victorlin · 2024-08-21T00:14:20Z

Finished addressing all conversations and cleaned up commits. I'm going to re-run trial builds once more on nextstrain/ncov#1106 with latest changes then merge this tomorrow if all looks good.

Add a new option --group-by-weights and --output-group-by-sizes to support the new feature.

victorlin self-assigned this Apr 25, 2024

victorlin force-pushed the victorlin/weighted-sampling branch from 96a7a0a to adc1d0b Compare April 27, 2024 00:15

jameshadfield reviewed Apr 29, 2024

View reviewed changes

augur/filter/subsample.py Outdated Show resolved Hide resolved

victorlin force-pushed the victorlin/weighted-sampling branch 2 times, most recently from 335325d to dc1d1ab Compare May 3, 2024 22:23

victorlin mentioned this pull request May 3, 2024

Prepare for weighted sampling #1460

Merged

3 tasks

victorlin changed the base branch from master to victorlin/prep-weighted-sampling May 3, 2024 22:25

victorlin force-pushed the victorlin/prep-weighted-sampling branch from c18acb5 to d5004bf Compare May 3, 2024 22:27

victorlin force-pushed the victorlin/weighted-sampling branch from dc1d1ab to 0a321a6 Compare May 3, 2024 22:27

victorlin mentioned this pull request May 3, 2024

Use weighted sampling for Asia builds nextstrain/ncov#1106

Merged

9 tasks

victorlin force-pushed the victorlin/prep-weighted-sampling branch from d5004bf to 80eb23b Compare May 6, 2024 22:44

victorlin force-pushed the victorlin/weighted-sampling branch from 0a321a6 to 9be1e84 Compare May 6, 2024 22:44

victorlin commented May 7, 2024

View reviewed changes

augur/filter/subsample.py Outdated Show resolved Hide resolved

jameshadfield reviewed May 10, 2024

View reviewed changes

victorlin added a commit that referenced this pull request May 10, 2024

🚧 add test for discussion

740e748

<#1454 (comment)>

victorlin mentioned this pull request May 23, 2024

Add subsampling command with YAML config, basic weighted sampling #1444

Closed

genehack previously approved these changes Jun 7, 2024

View reviewed changes

victorlin force-pushed the victorlin/weighted-sampling branch from 19c685f to b88a8c4 Compare July 5, 2024 18:56

victorlin added a commit that referenced this pull request Jul 5, 2024

🚧 add test for discussion

66f1c1b

<#1454 (comment)>

victorlin force-pushed the victorlin/weighted-sampling branch from b88a8c4 to 7d041b4 Compare July 17, 2024 18:28

victorlin force-pushed the victorlin/prep-weighted-sampling branch from 80eb23b to a8f8827 Compare July 17, 2024 18:38

victorlin force-pushed the victorlin/weighted-sampling branch 2 times, most recently from e81cdc5 to 32627ad Compare July 17, 2024 18:45

victorlin force-pushed the victorlin/weighted-sampling branch from 9e0a4b8 to 5c318c0 Compare July 17, 2024 19:25

victorlin marked this pull request as ready for review July 17, 2024 19:38

joverlee521 approved these changes Jul 29, 2024

View reviewed changes

augur/filter/weights_file.py Outdated Show resolved Hide resolved

augur/filter/weights_file.py Outdated Show resolved Hide resolved

genehack reviewed Jul 31, 2024

View reviewed changes

augur/filter/weights_file.py Outdated Show resolved Hide resolved

Base automatically changed from victorlin/prep-weighted-sampling to master August 5, 2024 22:03

victorlin force-pushed the victorlin/weighted-sampling branch 2 times, most recently from 648e53b to 87df694 Compare August 5, 2024 22:04

jameshadfield reviewed Aug 6, 2024

View reviewed changes

victorlin force-pushed the victorlin/weighted-sampling branch 3 times, most recently from 18dbff0 to 6b3c63b Compare August 6, 2024 22:34

victorlin force-pushed the victorlin/weighted-sampling branch 4 times, most recently from c860492 to 66d0df4 Compare August 14, 2024 17:19

victorlin mentioned this pull request Aug 14, 2024

Allow weighted subsampling #1318

Closed

5 tasks

trvrb reviewed Aug 20, 2024

View reviewed changes

augur/filter/__init__.py Outdated Show resolved Hide resolved

trvrb self-requested a review August 20, 2024 20:25

trvrb approved these changes Aug 20, 2024

View reviewed changes

victorlin mentioned this pull request Aug 20, 2024

filter: Make under-sampling more apparent #1590

Open

victorlin added 2 commits August 20, 2024 17:08

Restore probabilistic sampling group name

9ebbd48

This reverts "Rename probabilistic sampling group" (a8f8827). That change was made prematurely and it is no longer necessary.

Encode months/weeks as strings

c9217fe

This makes it easier to weight on month or week values from a TSV file. No reason to use a Python list object when all that's needed is a unique value per year-month or year-week.

victorlin force-pushed the victorlin/weighted-sampling branch from 66d0df4 to 8e3d6be Compare August 21, 2024 00:09

victorlin added 2 commits August 21, 2024 10:06

Implement weighted sampling

84f5f0e

Add a new option --group-by-weights and --output-group-by-sizes to support the new feature.

Update changelog

9a2090b

victorlin force-pushed the victorlin/weighted-sampling branch from 8e3d6be to 9a2090b Compare August 21, 2024 17:06

victorlin merged commit 1e9d131 into master Aug 21, 2024

victorlin deleted the victorlin/weighted-sampling branch August 21, 2024 20:45

This was referenced Aug 21, 2024

Update changelog entry for weighted sampling #1595

Merged

Subsample has uncaught assertion in ncov #1598

Closed

	subsample_group.add_argument('--group-by-weights', type=str, metavar="FILE", help="""
	TSV file defining weights for grouping. Requirements:

	(1) Lines starting with '#' are treated as comment lines.
	(2) The first non-comment line must be a header row.
	(3) There must be a numeric ``weight`` column (weights can take on any
	non-negative values).
	(4) Other columns must be a subset of columns used in ``--group-by``,
	with combinations of values covering all combinations present in the
	metadata.
	(5) This option only applies when ``--group-by`` and
	``--subsample-max-sequences`` are provided.
	(6) This option cannot be used with ``--no-probabilistic-sampling``.

	Notes:

	(1) Any ``--group-by`` columns absent from this file will be given equal
	weighting across all values within groups defined by the other
	weighted columns.
	(2) An entry with the value ``default`` under all columns will be
	treated as the default weight for specific groups present in the
	metadata but missing from the weights file. If there is no default
	weight and the metadata contains rows that are not covered by the
	given weights, augur filter will exit with an error.
	""")


		Using 1:1 weights is similarly straightforward, with 50 sequences from each location.

		$ cat >weights.tsv <<~~

Conversation

victorlin commented Apr 25, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description of proposed changes

Related issue(s)

Notable comment threads

Checklist

Uh oh!

Uh oh!

victorlin commented Apr 30, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

jameshadfield left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

genehack commented May 20, 2024

Uh oh!

victorlin commented May 20, 2024

Uh oh!

codecov bot commented Jul 17, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

joverlee521 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jameshadfield Aug 5, 2024

Choose a reason for hiding this comment

Uh oh!

victorlin Aug 6, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

victorlin commented Aug 7, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

trvrb left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

victorlin commented Aug 21, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

victorlin commented Apr 25, 2024 •

edited

Loading

victorlin commented Apr 30, 2024 •

edited

Loading

codecov bot commented Jul 17, 2024 •

edited

Loading

victorlin Aug 6, 2024 •

edited

Loading

victorlin commented Aug 7, 2024 •

edited

Loading

trvrb left a comment •

edited

Loading