Restrict contextual samples to 1 year back in 1m/2m/6m builds #1129
Closed
When subsampling in the Nextstrain GISAID profile, rather than treating early contextual samples as spanning from the origin of the pandemic to the beginning of the focal window (e.g. for the 6m analysis, from 2020 to 6m ago), instead use a consistent 24m of additional context. So, for 6m, this is context from 30m ago to 6m ago and focal from 6m ago to present. Additionally, reduce the number of contextual sequences included, from a 4:1 ratio of focal to context to a 10:1 ratio of focal to context.
Drop forced inclusion of the Wuhan/1 root in the Nextstrain GISAID profile and swap rooting to use "best", i.e. temporally optimal rooting. This allows the root to be the common ancestor of the subsampled sequences. With the changes to time-based subsampling in the previous commit, the "6m" analysis then includes samples from the previous 30m and the TMRCA is in ~2021. This setup should be significantly more future-proof than needing to continually make new clade-specific (e.g. /21L/) roots as selective sweeps occur.
Member
genehack
approved these changes
Jul 26, 2024
Member
Author
After working through the coloring option in PR #1132 I'm definitely more of a fan of the color update. Unless there are conflicting preferences, I'll plan to just close this PR.
Member
Author
I just added the "revisit sometime" label. As time accrues, leaving the older context all the way back to Wuhan will get increasingly clunky. In perhaps 6 months or a year we should implement something like this PR and include a

Description of proposed changes
Currently, we focus on a recent time window in many of our ncov analyses. For example, ncov/gisaid/north-america/6m does more intensive sampling of the previous 6 months (aiming for ~4000 "recent" samples and ~1000 "early" samples). However, as the past continues to recede (as it does), we're getting more and more early context that isn't so relevant for understanding circulating diversity. Here's the live 6m tree for example:
Once selective sweeps occur, we can largely forget about past evolution when looking at current diversity. This had previously prompted us to make the "21L" builds that root to clade 21L / lineage BA.2.
At this point, we're getting closer and closer to wanting the same thing with a clade 24A / lineage JN.1 rooting. However, this strategy is clearly not sustainable. In 4 more years we don't want to have 4 different rootings, all of which require updating, with it not being clear to users which one they should be looking at.
This PR addresses the issue in a simple fashion, basically making ncov work more like seasonal-flu or avian-flu, where there are recent focal samples and older contextual samples, but the contextual samples only go back a year rather than many.
Here's the resulting global 6m tree:
Results from running this PR can be seen at:
etc...
This is using a 10:1 ratio of recent to early samples and doing a +1 year back for early samples. For the 6m analysis, it's 0m to 6m back as recent focal samples and 6m to 18m back as early contextual samples.
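The window arithmetic above can be sketched in a few lines of Python. This is only an illustration of the date ranges and the ratio split, not the workflow's actual subsampling configuration; the function names (`months_back`, `sampling_windows`, `split_samples`) and the example totals are hypothetical.

```python
import calendar
from datetime import date


def months_back(anchor: date, months: int) -> date:
    """Return the date `months` calendar months before `anchor`.

    The day is clamped to the last valid day of the target month.
    """
    total = anchor.year * 12 + (anchor.month - 1) - months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    day = min(anchor.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def sampling_windows(today: date, focal_months: int, context_months: int = 12):
    """Return (start, end) date ranges for early context and recent focal samples."""
    focal_start = months_back(today, focal_months)
    context_start = months_back(today, focal_months + context_months)
    return {
        "early": (context_start, focal_start),
        "recent": (focal_start, today),
    }


def split_samples(total: int, focal_to_context: int = 10):
    """Split a total sample budget into (focal, context) counts at the given ratio."""
    context = total // (focal_to_context + 1)
    return total - context, context


# Hypothetical 6m build dated 2024-07-26 with +1 year of context
windows = sampling_windows(date(2024, 7, 26), focal_months=6)
print(windows["early"])   # 6m to 18m back
print(windows["recent"])  # 0m to 6m back
print(split_samples(4400))                      # 10:1 split of an assumed budget
print(split_samples(5000, focal_to_context=4))  # the current 4:1 split
```

With the current 4:1 ratio, a budget of 5000 splits into the ~4000 recent and ~1000 early samples mentioned above; at 10:1 the early share shrinks to roughly a tenth of the recent count.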
I had tried a +2 year context as well, but it didn't seem to add much understanding while taking up additional screen real estate and additional color ramp. You can compare here however: global/6m
The biggest worry I see here is that people currently landing at ncov/gisaid/global/6m can see what's currently circulating and get all the context that they may need going back to the beginning of the pandemic (with well-known VOCs, etc.), and this change would drop that older context.
If we did merge this, we should make two showcase cards on the splash page to direct to 6m vs all-time to (partially) address this. Also, if we did merge this, I'd imagine deprecating the 21L builds, where we'd remove them from the automated GitHub Actions rebuild, remove them from the manifest and add redirects to go from ncov/gisaid/21L/global/6m to ncov/gisaid/global/6m.
The other approach to this same issue would be to take older clades (and conceivably Pango lineages) and make these gray while keeping the color ramp only for more recent clades. This strategy is also not mutually exclusive and we could do both or neither. I'll try to put together a separate PR for the clade colors idea. Though even if the colors update fixes things enough for the time being, I do think we'll eventually want to do something like this strategy. But it's possible this is a couple years down the road.
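As a rough sketch of the clade colors idea (not the implementation in the separate PR), the mapping could look like the following. The function name `color_clades`, the gray value, and the ramp hex codes are all hypothetical and chosen only for illustration.

```python
GRAY = "#AAAAAA"  # assumed placeholder color for older clades


def color_clades(clades_oldest_first, ramp, n_recent):
    """Map each clade to a color: gray for older clades, ramp colors for the newest.

    `clades_oldest_first` is an ordered list of clade names, `ramp` is a list
    of hex colors, and `n_recent` is how many of the newest clades keep a
    ramp color.
    """
    if n_recent > len(ramp):
        raise ValueError("ramp must have at least n_recent colors")
    cutoff = max(len(clades_oldest_first) - n_recent, 0)
    colors = {}
    for i, clade in enumerate(clades_oldest_first):
        # Clades before the cutoff are grayed out; the rest walk along the ramp
        colors[clade] = GRAY if i < cutoff else ramp[i - cutoff]
    return colors


# Hypothetical example: only the two newest clades keep ramp colors
example = color_clades(["21L", "22B", "23A", "24A"], ["#4041C7", "#E14F2A"], n_recent=2)
print(example)
```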
In addition to code review, I'd appreciate 👍 / 👎 feedback on whether you prefer this to the current sampling strategy.
Testing
Tested locally and via GitHub Action trial builds.
Release checklist
If this pull request introduces new features, complete the following steps:
Update docs/src/reference/change_log.md in this pull request to document these changes by the date they were added.