Update Omicron clade definitions to be robust to BA.4 and BA.5#913
Trial builds are running from:
defaults/clades.tsv
21L (Omicron) nuc 27259 C
21M (Omicron) nuc 8782 C
21M (Omicron) nuc 10449 A
To prevent inversion of clade labels, we've started to include all ancestral mutations in the children, so the line 21M (Omicron) nuc 10449 A should also be added to 21L and 21K, following that rule.
Probably nothing will happen if we don't do this, but it served us well with Delta.
Soon I'll also rewrite clades.tsv to take advantage of our new inheritance feature that was released in Augur a while ago. Now everyone should have had a chance to update to that version.
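A minimal sketch of the rule described above, assuming a simplified model in which each clade is a set of position/allele pairs and a genome matches a clade only if it carries all of them. The helper names (`add_ancestral_mutations`, `matches`) and the `parent_of` mapping are hypothetical and are not Augur's implementation; only the definition lines quoted in this thread are used, and treating 21M as the parent of 21L is an assumption taken from the discussion.

```python
from typing import Dict, Optional

CladeDefs = Dict[str, Dict[int, str]]  # clade name -> {nucleotide position: allele}


def add_ancestral_mutations(defs: CladeDefs, parent_of: Dict[str, str]) -> CladeDefs:
    """Return definitions in which every clade also lists its ancestors' mutations.

    Assumes parent_of contains no cycles and every parent has an entry in defs.
    """
    expanded: CladeDefs = {}
    for clade in defs:
        # Collect the chain clade -> parent -> grandparent -> ...
        chain = []
        current: Optional[str] = clade
        while current is not None:
            chain.append(current)
            current = parent_of.get(current)
        # Apply the oldest ancestor first so a child's own alleles win on conflict.
        merged: Dict[int, str] = {}
        for name in reversed(chain):
            merged.update(defs[name])
        expanded[clade] = merged
    return expanded


def matches(genotype: Dict[int, str], definition: Dict[int, str]) -> bool:
    """True if the genotype carries every allele in the definition."""
    return all(genotype.get(pos) == allele for pos, allele in definition.items())


# Definition lines quoted in this thread.
DEFS: CladeDefs = {
    "21M (Omicron)": {8782: "C", 10449: "A"},
    "21L (Omicron)": {27259: "C"},
}
# Assumed hierarchy: 21L sits inside the basal 21M clade.
PARENT_OF = {"21L (Omicron)": "21M (Omicron)"}

EXPANDED = add_ancestral_mutations(DEFS, PARENT_OF)
```

With the expanded definitions, any genotype that matches 21L necessarily matches 21M as well, so in this simplified model the child clade cannot end up labelled outside its parent.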
Got it. I think this may have been the issue with https://nextstrain.org/staging/ncov/open/trial/clades-update/global, where 21M appears as a subclade of 21K. I'll revise as you suggest and confirm that this fixes the issue.
Yes, I think that's what happened.
It's crazy that all of the defining mutations are so flaky with Omicron. Maybe we'll need more redundant clade logic, like 1 out of these 3 mutations to be more robust.
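As a rough illustration of the "1 out of these 3" idea, here is a sketch of a k-of-n match. This is a hypothetical rule, not something the thread says clades.tsv supports; the function name `matches_k_of_n` is made up, and the three candidate sites are simply the 21M-associated positions mentioned in this thread, used as an example.

```python
from typing import Dict, Iterable, Tuple


def matches_k_of_n(
    genotype: Dict[int, str],
    candidates: Iterable[Tuple[int, str]],
    min_hits: int = 1,
) -> bool:
    """True if the genotype carries at least min_hits of the candidate alleles."""
    hits = sum(1 for pos, allele in candidates if genotype.get(pos) == allele)
    return hits >= min_hits


# Illustrative candidates only: positions mentioned in this thread for 21M.
CANDIDATES = [(8782, "C"), (10449, "A"), (18163, "G")]

# A genome that lost two of the three sites still matches with min_hits=1.
flaky_genome = {8782: "C", 10449: "G", 18163: "A"}
print(matches_k_of_n(flaky_genome, CANDIDATES))
```

The trade-off is that a lower `min_hits` tolerates reversions and dropouts at individual sites but makes the rule less specific, so the candidate sites would still need to be chosen carefully.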
For whatever reason 10449A is fully congruent with Omicron clades in GISAID builds but not in GenBank builds. Swapping to 18163G makes things consistent across the board. Although 18163G occurs on the branch leading to 21M, including it here for 21K and 21L helps to keep clades from becoming "inverted".
This should be good now. I think it's working across latest
Description of proposed changes
Including BA.4 and BA.5 sequences can alter tree topology and change clade placement. These clade updates should make existing 21K, 21L and 21M clades robust to these tree topology changes. Existing non-artifactual genomes should receive the same clade label as previously. For the moment, this groups BA.4 and BA.5 into 21L along with BA.2. This was chosen to be robust to stochastic differences in phylogenetic reconstruction.
Here, I walked through my build with spiked-in BA.4 and BA.5 genomes, which showed stochastically different topologies for different regional build targets (global, Africa, Asia, etc.).
I iterated to remove homoplasic sites from the clades list and supplement it with unique sites. With these changes, BA.2, BA.4 and BA.5 consistently group together into a single 21L clade, BA.1 groups into its own 21K clade, and BA.3 as well as other non-specific Omicron sequences fall into the basal 21M clade.
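One way to screen for homoplasic sites could look like the sketch below, assuming you already have the mutations inferred on each branch of a tree. The input format, the function names and the toy branch and mutation labels are all hypothetical; this is not the procedure actually used for this PR, just an illustration of the idea that a mutation arising on more than one branch is a poor clade-defining site.

```python
from collections import Counter
from typing import Dict, List, Set


def homoplasic_mutations(branch_mutations: Dict[str, List[str]]) -> Set[str]:
    """Return mutations that arise independently on more than one branch."""
    counts = Counter(
        mutation
        for mutations in branch_mutations.values()
        for mutation in set(mutations)  # count each mutation once per branch
    )
    return {mutation for mutation, n in counts.items() if n > 1}


def drop_homoplasic(defining: List[str], homoplasic: Set[str]) -> List[str]:
    """Keep only the defining mutations that are not homoplasic."""
    return [mutation for mutation in defining if mutation not in homoplasic]


# Toy data (made up): "C2G" arises on two separate branches.
TOY_BRANCHES = {
    "branch_a": ["A1T", "C2G"],
    "branch_b": ["C2G"],
    "branch_c": ["G3A"],
}

flagged = homoplasic_mutations(TOY_BRANCHES)
kept = drop_homoplasic(["A1T", "C2G", "G3A"], flagged)
```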
This update should really be the least impactful to our designations and if BA.4 and/or BA.5 grow they can be given their own clade labels to distinguish from BA.2.
In the linked builds I've highlighted mutations G12160A and G27788T as together these are quite specific for BA.4 and BA.5.
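For checking individual genomes outside the tree view, a small sketch such as the following could flag sequences that carry both mutations. The genotype representation (position to allele) and the helper names are assumptions made for illustration; only the two mutation labels come from the description above.

```python
import re
from typing import Dict, Iterable, Tuple

MUTATION_RE = re.compile(r"^([ACGT])(\d+)([ACGT])$")

# The two mutations highlighted in the linked builds.
BA4_BA5_MARKERS = ["G12160A", "G27788T"]


def parse_mutation(label: str) -> Tuple[str, int, str]:
    """Split a label such as 'G12160A' into (reference, position, alternate)."""
    match = MUTATION_RE.match(label)
    if match is None:
        raise ValueError(f"Unrecognized mutation label: {label!r}")
    ref, pos, alt = match.groups()
    return ref, int(pos), alt


def carries_all(genotype: Dict[int, str], markers: Iterable[str]) -> bool:
    """True if the genotype has the alternate allele at every marker position."""
    for label in markers:
        _, pos, alt = parse_mutation(label)
        if genotype.get(pos) != alt:
            return False
    return True
```

Requiring both markers together matches the observation that they are specific in combination; either one alone may not be.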
You can see three primary topologies.
Clade labels are now robust with regard to topology in this test dataset with BA.4 and BA.5 spiked in.
I'll do a trial run to make sure we get consistent behavior on the normal subsampled dataset, but I'd think that this is unlikely to fail.
Related issue(s)
This PR supersedes #908 and #912.
Testing
I've tested fairly extensively locally. Following up with trial builds.
Release checklist
If this pull request introduces new features, complete the following steps:
Update docs/src/reference/change_log.md in this pull request to document these changes by the date they were added.