Surface Nextclade QC metrics as color-bys #861
Updates Auspice config(s) to include colors for Nextclade QC information that now gets passed through metadata to augur export. When these columns are present in the metadata, users can color their trees by reversions, potential contaminants, and rare mutations. Users can now skip the diagnostic filters in the workflow to keep as much data in the tree as possible, while still identifying the strains that would have been filtered by diagnostics.
|
If these are all counts, should they be on some discrete color scale instead so scale values like |
We can add a Here's an example of custom legend entries, albeit for percentages not counts: |
|
I don't think the scale appearance is in the scope of this PR. We had this discussion about S1 mutations a year ago and never agreed on an outcome. This isn't to say we shouldn't fix the appearance, but this PR is primarily about which columns to surface or not. A fix to the appearance of counts in ncov should be its own separate PR. |
|
Ah, I see. Agreed not in scope here. I had thought this was just flipping the |
|
That's great! Good to surface this, could help spot regions where there's something going on, either not enough coverage on the Nextclade reference tree or some recombination/amplicon dropout etc. I can tell you that the expected range of contamination is 0-5, very rarely >10, so making the color scale 0,1,2,3,4,... up to 10, then in steps of 5 or so would make sense. Reversions appear a bit more often, so maybe 0,1,2,3,4,5-6,7-8,9-10,11-15,>15? |
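To make the suggested reversion scale concrete, here is a minimal sketch of a binning helper that maps a count to the legend labels proposed above. The function name `bin_reversion_count` is hypothetical and is not part of the workflow or of Auspice.

```python
def bin_reversion_count(count: int) -> str:
    """Map a reversion count to one of the legend labels suggested above.

    Labels: 0, 1, 2, 3, 4, 5-6, 7-8, 9-10, 11-15, >15.
    """
    if count < 0:
        raise ValueError("count must be non-negative")
    if count <= 4:
        # Small counts each get their own legend entry.
        return str(count)
    # (inclusive upper bound, label) pairs, checked in ascending order.
    for upper, label in ((6, "5-6"), (8, "7-8"), (10, "9-10"), (15, "11-15")):
        if count <= upper:
            return label
    return ">15"


labels = [bin_reversion_count(c) for c in (0, 3, 5, 10, 12, 40)]
```

The same pattern would work for potential contaminants with a different set of upper bounds.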
|
@corneliusroemer Maybe it would be helpful to see what the numbers are for a proper build? From there we could determine whether custom scale(s) would improve the UX... Are there any other fields you'd think to include besides the three proposed here? Here's the list of what's added by the QC table: |
|
After discussion in the group, we should pull in the overall QC columns from Nextclade's QC TSV in the join script:
These columns should go into the default Auspice config and the main Nextstrain builds. We don't need to include the reversion mutations, etc. in the main Nextstrain build config. |
Also, allow users to filter by overall QC status.
Addresses the metadata portion of nextstrain/ncov#861

See the symmetrical PR in ncov: TODO

This adds the copying of the following columns from `nextclade.tsv` to `metadata.tsv`:

```
"qc.overallScore": "QC_overall_score",
"qc.overallStatus": "QC_overall_status",
```
Addresses the metadata portion of nextstrain/ncov#861

See the symmetrical PR in ncov: nextstrain/ncov#865

This adds the copying of the following columns from `nextclade.tsv` to `metadata.tsv`:

```
"qc.overallScore": "QC_overall_score",
"qc.overallStatus": "QC_overall_status",
```
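For readers unfamiliar with the join step, the column copy could look roughly like the sketch below. It is an illustration only, not the ncov-ingest implementation: the helper names `read_tsv` and `copy_qc_columns` are hypothetical, and it assumes that `strain` in the metadata and `seqName` in the Nextclade output identify the same sequence.

```python
import csv
import io

# Mapping quoted above: nextclade.tsv column -> metadata.tsv column.
COLUMN_MAP = {
    "qc.overallScore": "QC_overall_score",
    "qc.overallStatus": "QC_overall_status",
}


def read_tsv(text):
    """Parse TSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def copy_qc_columns(metadata_rows, nextclade_rows, key="strain", nextclade_key="seqName"):
    """Return copies of the metadata rows with the mapped QC columns added.

    Rows without a Nextclade match get empty strings (an assumption of this sketch).
    """
    by_name = {row[nextclade_key]: row for row in nextclade_rows}
    merged = []
    for row in metadata_rows:
        qc = by_name.get(row[key], {})
        new_row = dict(row)
        for source, destination in COLUMN_MAP.items():
            new_row[destination] = qc.get(source, "")
        merged.append(new_row)
    return merged


# Illustrative inputs, not real data.
metadata = read_tsv("strain\tdate\nA\t2022-01-01\nB\t2022-01-02\n")
nextclade = read_tsv("seqName\tqc.overallScore\tqc.overallStatus\nA\t12.5\tgood\n")
merged = copy_qc_columns(metadata, nextclade)
```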
|
@huddlej I made 2 symmetrical PRs in ingest and ncov to add the columns:
The order of operations:
Upd: see John's comment below |
|
@huddlej On an actual build the numbers will be pretty much what I outlined above. The problem with automatic scales is that a single outlier can make them very coarse. If most values are 0-5 but one value comes in at 50, all of the 0-5 values appear as one category, even though it would be best to have something like 0,1,2,3,4,5,5-10,10+. I also think that just including the overall QC score may not be hugely helpful in our builds for two reasons:
Did I understand it correctly yesterday that the reason for limiting extra color-bys to 1 was that we didn't want to exacerbate the Auspice bug nextstrain/auspice#1453 (overlapping items in filter dropdown)? I can think of use cases for almost all of the fields. There's a general question whether we should use more coarse grained

- Divergence is useful in time view, just as date is useful in divergence view.
- Missing_data, nonACGTN, substitutions, deletions, insertions: not sure
- reversion_mutations: useful to investigate problematic parts of the tree and reason for bad qc
- QC_missing_data, QC_mixed_sites, QC_rare_mutations, QC_snp_clusters, QC_frame_shifts, QC_stop_codons, clock_deviation: would help quickly spot if basal sequences fit the clock or could be there because of reference backfilling/reversion |
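The outlier concern can be illustrated with a naive equal-width scale. This is only a sketch of the general effect: the helper `equal_width_bin` is hypothetical, and equal-width binning is an assumption here, not necessarily how Auspice builds its legend.

```python
def equal_width_bin(value, max_value, n_bins=10):
    """Index of the equal-width bin over [0, max_value] that contains `value`."""
    width = max_value / n_bins
    # Clamp so that value == max_value lands in the last bin.
    return min(int(value // width), n_bins - 1)


counts = [0, 1, 2, 3, 4, 50]  # illustrative values, not real build data
bins = [equal_width_bin(c, max(counts)) for c in counts]
# With a maximum of 50 and 10 bins, each bin spans 5 counts,
# so every value from 0 to 4 shares the first bin.
```

With the single value of 50 removed, the same helper would spread 0 to 4 across separate bins, which is why hand-picked legend entries are attractive for these counts.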
|
Thanks, @ivan-aksamentov and @corneliusroemer! @ivan-aksamentov, regarding:
6d325f2 in this PR adds the same columns to the join script. I'm not too worried about the order of operations, here. If expected Nextclade QC columns don't exist in the user's metadata already, the script will annotate the metadata with the newly generated QC columns from the workflow. Both the Docker image and the conda environment use the latest Nextclade, so these results should be the same as what we'd get from the ingest (unless I'm missing something). Updating ncov-ingest makes sense, though. The actual use of these columns depends on the user's Auspice config JSON to define the columns as colors or filters (as in the default config in this PR).
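The fallback described above, where user-supplied QC columns are kept and missing ones are filled from the workflow's own Nextclade run, could be sketched as follows. The names `EXPECTED_QC_COLUMNS` and `annotate_if_missing` are hypothetical, and this is not the join script from 6d325f2.

```python
# Column names taken from the mapping discussed in this thread.
EXPECTED_QC_COLUMNS = ("QC_overall_score", "QC_overall_status")


def annotate_if_missing(metadata_row, workflow_qc):
    """Keep QC values already in the metadata; fill in only the absent columns."""
    annotated = dict(metadata_row)
    for column in EXPECTED_QC_COLUMNS:
        if column not in annotated:
            # Fall back to the value generated by the workflow, if any.
            annotated[column] = workflow_qc.get(column, "")
    return annotated


workflow_qc = {"QC_overall_score": "3.2", "QC_overall_status": "good"}
from_ingest = annotate_if_missing(
    {"strain": "A", "QC_overall_score": "40", "QC_overall_status": "mediocre"},
    workflow_qc,
)
from_user = annotate_if_missing({"strain": "B"}, workflow_qc)
```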
That's right. The purpose of surfacing these data is to help users with their own builds, especially in the case where users want to skip diagnostic filters to show as much data as possible. In this case, these data allow users to identify problematic sequences they allowed into their build. This is a common use case with US public health labs and our Africa CDC partners. So, that said, we do not need to include these additional columns in our Auspice config JSONs. It should be sufficient to include them in the default and then people can choose to use them or not per build.
This is an orthogonal issue that @trvrb has been concerned about with the Auspice UI/UX more generally. The problem would definitely be exacerbated by adding all of these columns to the main Nextstrain builds, but I think we generally agree that our main Nextstrain builds don't need all of these columns. @rneher and @trvrb seemed to agree that the overall QC score and/or status might be good enough.
Yeah, the more we've discussed this, the more I think folks should be able to opt in to whichever fields they find most helpful. I can imagine the quantitative fields have little meaning for people independent of some qualitative thresholding (e.g., "Is 50 a good or bad QC score?"). We just demoed the overall QC, reversion, potential contaminants, and rare mutations fields at office hours and folks were excited about these. Based on this feedback, I think our plan could be:
|
|
Alright, I merged nextstrain/ncov-ingest#283 and closed #865. The new columns should be in the metadata after tomorrow's ingest. |

Description of proposed changes
Updates Auspice config(s) to include colors for Nextclade QC information that now gets passed through metadata to augur export. When these columns are present in the metadata, users can color their trees by reversions, potential contaminants, and rare mutations. Users can now skip the diagnostic filters in the workflow to keep as much data in the tree as possible, while still identifying the strains that would have been filtered by diagnostics.
Note: This PR only updates the default Auspice config at the moment. It will update all configs once we agree on the fields to include.
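As a rough illustration of what opting in looks like for a custom build, the sketch below appends QC colorings to an Auspice config dictionary. The `colorings` entries follow the `key`/`title`/`type` shape used by Auspice config JSON. The helper `add_colorings` is hypothetical, and the column keys `potential_contaminants` and `rare_mutations` are assumptions; use whichever names appear in your metadata.

```python
import json

# Keys are assumptions for illustration; match them to your metadata columns.
qc_colorings = [
    {"key": "reversion_mutations", "title": "Reversion mutations", "type": "continuous"},
    {"key": "potential_contaminants", "title": "Potential contaminants", "type": "continuous"},
    {"key": "rare_mutations", "title": "Rare mutations", "type": "continuous"},
]


def add_colorings(config, colorings):
    """Return a copy of the config with colorings appended, skipping existing keys."""
    current = list(config.get("colorings", []))
    existing_keys = {coloring["key"] for coloring in current}
    updated = dict(config)
    updated["colorings"] = current + [c for c in colorings if c["key"] not in existing_keys]
    return updated


base_config = {"colorings": [{"key": "region", "title": "Region", "type": "categorical"}]}
config = add_colorings(base_config, qc_colorings)
print(json.dumps(config, indent=2))
```

Applying the helper twice leaves the config unchanged, so it is safe to run on a config that already defines some of these colorings.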
Reversion mutations for the Nextstrain CI tree look like this:

Potential contaminants look like this:

Testing
Release checklist
Update docs/src/reference/change_log.md in this pull request to document these changes by the date they were added.