Conversation
96a7a0a to
adc1d0b
Compare
|
This should be in a working state, at least based on limited testing in subsample-weighted.t. Planning to test using #1444 + nextstrain/ncov#1102 to get a feel for how add an additional weights.tsv file fits in a pathogen repo and make sure it works as expected on real data. UPDATE: first testing in ncov independently of |
335325d to
dc1d1ab
Compare
c18acb5 to
d5004bf
Compare
dc1d1ab to
0a321a6
Compare
d5004bf to
80eb23b
Compare
0a321a6 to
9be1e84
Compare
jameshadfield
left a comment
There was a problem hiding this comment.
This is going to be really powerful @victorlin, nice work. I tested this out a little bit, but haven't gone through every part of it. There's a few aspects that this brought up for which I'll think through a little more, but I think the overall direction is sound.
|
Having just read this from scratch, and hoping to help it get moving again, it seems like the remaining outstanding discussion items are: #1 adding an (optional?) output with requested and calculated group sizes – which seems like a pretty valuable debugging tool to me, but maybe non-blocking for this particular PR Did I miss anything @victorlin ? |
|
Thanks for bumping this @genehack. I added a comment on the thread for (1). Testing and discussion in nextstrain/ncov#1106 should address (2) and impact the overall direction of this PR. I want to make sure we have at least one solid use case for this feature. |
19c685f to
b88a8c4
Compare
b88a8c4 to
7d041b4
Compare
80eb23b to
a8f8827
Compare
e81cdc5 to
32627ad
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1454 +/- ##
==========================================
+ Coverage 70.51% 70.98% +0.46%
==========================================
Files 78 79 +1
Lines 8107 8244 +137
Branches 1966 2002 +36
==========================================
+ Hits 5717 5852 +135
- Misses 2100 2101 +1
- Partials 290 291 +1 ☔ View full report in Codecov by Sentry. 🚨 Try these New Features:
|
9e0a4b8 to
5c318c0
Compare
joverlee521
left a comment
There was a problem hiding this comment.
I haven't fully grokked what's going on in get_weighted_group_sizes, but the behaviors described in the functional tests make sense to me.
Once this feature is released, it definitely deserves it's own section in the Filtering and Subsampling docs.
648e53b to
87df694
Compare
|
|
||
| Using 1:1 weights is similarly straightforward, with 50 sequences from each location. | ||
|
|
||
| $ cat >weights.tsv <<~~ |
There was a problem hiding this comment.
[general question about test best practices]
I commonly use cram --keep-tmpdir and then inspect the files themselves to understand the tests & thus behaviour of augur. This test overwrites files (e.g. weights.tsv) so that approach doesn't work. It also seems like it'd hide bugs where a command fails, but exits code 0, and thus we're not using the version of the file we think we are. Thoughts on using weights1.tsv, weights2.tsv etc?
There was a problem hiding this comment.
I've added more specific file names: b7fc345.
The other approach which I've found better for debugging tests is to split the test file into multiple files. However, in this case it's better to keep all in the same file to share the temporary metadata.tsv.
It also seems like it'd hide bugs where a command fails, but exits code 0, and thus we're not using the version of the file we think we are.
I assume "command" = "Augur command", but I'm having trouble understanding your scenario.
- Weights files are created separately from the Augur commands.
- Augur command failures will exit with a nonzero exit code and be shown as a test failure.
tests/functional/filter/cram/subsample-weighted-and-uniform-mix.t
Outdated
Show resolved
Hide resolved
18dbff0 to
6b3c63b
Compare
I've split this function up into smaller parts and added types to make it more readable: c305fe8 |
c860492 to
66d0df4
Compare
There was a problem hiding this comment.
@victorlin: This is great! I appreciate all the documentation and tests. I'm good for this to be merged as it stands. However, I do think we should follow up shortly with a separate PR that implements support for I was confused. Nothing more to be done with probabilistic sampling.--probabilistic-sampling + --group-by-weights per #1454 (comment).
This reverts "Rename probabilistic sampling group" (a8f8827). That change was made prematurely and it is no longer necessary.
This makes it easier to weight on month or week values from a TSV file. No reason to use a Python list object when all that's needed is a unique value per year-month or year-week.
66d0df4 to
8e3d6be
Compare
|
Finished addressing all conversations and cleaned up commits. I'm going to re-run trial builds once more on nextstrain/ncov#1106 with latest changes then merge this tomorrow if all looks good. |
Add a new option --group-by-weights and --output-group-by-sizes to support the new feature.
8e3d6be to
9a2090b
Compare
Description of proposed changes
This new option takes a weights file to determine group sizes. Summarized by the help text:
augur/augur/filter/__init__.py
Lines 73 to 97 in 9a2090b
Related issue(s)
Notable comment threads
Checklist