Use weighted sampling for other builds #1151
This is great @victorlin! Please go ahead and plan to rebase and merge whenever you'd like. I did want to note for the future that the switch to country-level only for North America has resulted in US states with more samples getting over-represented relative to population size. Here's the trial build: https://nextstrain.org/staging/ncov/gisaid/trial/victorlin-all-builds-weighted/north-america/6m?f_region=North%20America
Here's the comparison to previous behavior:
Overall, I think it's a big improvement to have a country like Costa Rica, with many admin divisions, get down-weighted to balance its population size. We could consider implementing division population weights at some point in the future, but I wouldn't worry too much about it immediately. As it stands, states with large population sizes like CA, NY and TX tend to submit more sequences anyway, so the resulting sampling may be more representative than the lack of division weights suggests.
Extend the weighted sampling approach from Asia builds to other regional builds. This comes with the added benefit of reducing redundancy in subsampling schemes.
Extend the weighted sampling approach from regional builds to global builds. This comes with the added benefit of simplifying logic to avoid region/country-specific max_sequences.
d5c9d19 to 7d9f31c


Description of proposed changes
Extend the weighted sampling approach from Asia builds to all other builds.
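For readers unfamiliar with the idea, here is a minimal sketch of population-weighted subsampling in plain Python. It is an illustration only, not the workflow's or Augur's actual implementation: the function name `weighted_subsample`, the `(strain, country)` record shape, and the `weights` mapping are all hypothetical. Each country's target is its share of the total weight times the overall cap, limited by how many sequences that country actually has.

```python
import random
from collections import defaultdict


def weighted_subsample(records, weights, max_sequences, seed=0):
    """Sample strains so each country's share follows its weight.

    records: list of (strain, country) tuples (hypothetical shape).
    weights: dict mapping country -> weight, e.g. population size.
    max_sequences: overall target number of sequences.
    """
    rng = random.Random(seed)

    # Group strains by country.
    by_country = defaultdict(list)
    for strain, country in records:
        by_country[country].append(strain)

    # Only countries that actually have sequences take part in the split.
    total_weight = sum(weights[country] for country in by_country)

    sampled = {}
    for country, strains in by_country.items():
        # Target is proportional to weight, capped by what is available.
        target = round(max_sequences * weights[country] / total_weight)
        sampled[country] = rng.sample(strains, min(target, len(strains)))
    return sampled


# Two countries with equal numbers of sequences but a 1:3 weight ratio.
records = [(f"A{i}", "A") for i in range(100)] + [(f"B{i}", "B") for i in range(100)]
result = weighted_subsample(records, {"A": 1, "B": 3}, max_sequences=40)
counts = {country: len(strains) for country, strains in result.items()}
```

With these inputs, `counts` is `{"A": 10, "B": 30}`: both countries submitted the same number of sequences, but the split follows the weights rather than the number of sequences or admin divisions. A country with fewer sequences than its target simply contributes everything it has, so the total can fall below `max_sequences` in this simplified version.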
Trial build links
Related issue(s)
Closes #1141
Checklist
docs/src/reference/change_log.md in this pull request to document these changes by the date they were added.