Surface Nextclade QC metrics as color-bys by huddlej · Pull Request #861 · nextstrain/ncov

huddlej · 2022-02-07T22:26:15Z

Description of proposed changes

Updates Auspice config(s) to include colors for Nextclade QC information that now gets passed through metadata to augur export. When these columns are present in the metadata, users can color their trees by reversions, potential contaminants, and rare mutations. Users can now skip the diagnostic filters in the workflow to keep as much data in the tree as possible, and still identify the strains that would have been filtered by diagnostics.

Note: This PR only updates the default Auspice config at the moment. It will update all configs once we agree on the fields to include.

Reversion mutations for the Nextstrain CI tree look like this:

Potential contaminants look like this:

Testing

Test by CI
Test with full Nextstrain builds

Release checklist

Update docs/src/reference/change_log.md in this pull request to document these changes by the date they were added.

tsibley · 2022-02-07T22:33:49Z

If these are all counts, should they be on some discrete color scale instead so scale values like 0.555 don't show up?

jameshadfield · 2022-02-07T22:56:15Z

> If these are all counts, should they be on some discrete color scale instead so scale values like 0.555 don't show up?

We can add a "legend" item to the coloring which will bin them into groups (e.g. "<5", "5-10"). Since we don't know these values ahead of time, the way to do this would be by adding this to the dataset JSON once it's created (similar to epiweeks).

Here's an example of custom legend entries, albeit for percentages not counts:
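The linked example is not reproduced in this thread. As a rough sketch of the same idea for counts rather than percentages, the snippet below post-processes an already-exported dataset JSON and attaches binned legend entries to one coloring. The shape of a legend entry (`value`, `display`, `bounds`), the bin edges, and the helper name `add_count_legend` are assumptions for illustration and should be checked against the Auspice dataset schema.

```python
import json


def add_count_legend(dataset, key, bins):
    """Attach binned legend entries to the coloring named `key`.

    `bins` is a list of (label, lower, upper) tuples. Returns True if the
    coloring was found and updated, False otherwise.
    """
    for coloring in dataset["meta"]["colorings"]:
        if coloring["key"] == key:
            coloring["legend"] = [
                {
                    # Assumed schema: a representative value, a display
                    # label, and the numeric bounds of the bin.
                    "value": (lower + upper) / 2,
                    "display": label,
                    "bounds": [lower, upper],
                }
                for label, lower, upper in bins
            ]
            return True
    return False


# Hypothetical, minimal stand-in for a dataset JSON produced by augur export.
dataset = {
    "meta": {
        "colorings": [
            {
                "key": "reversion_mutations",
                "title": "Reversion mutations",
                "type": "continuous",
            }
        ]
    }
}

# Illustrative bins only; real edges would be chosen after looking at a build.
bins = [("<5", -1, 4), ("5-10", 4, 10), (">10", 10, 1000)]
add_count_legend(dataset, "reversion_mutations", bins)
print(json.dumps(dataset["meta"]["colorings"][0]["legend"], indent=2))
```

Because the bins depend on values that are only known once the dataset exists, a step like this would run after export rather than living in the static Auspice config.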

huddlej · 2022-02-07T23:45:32Z

I don't think the scale appearance is in the scope of this PR. We had this discussion about S1 mutations a year ago and never agreed on an outcome. This isn't to say we shouldn't fix the appearance, but this PR is primarily about which columns to surface or not. A fix to the appearance of counts in ncov should be its own separate PR.

tsibley · 2022-02-08T00:36:23Z

Ah, I see. Agreed not in scope here. I had thought this was just flipping the type on the colorings entry and Auspice would DTRT, but I see now it's not that simple.

corneliusroemer · 2022-02-08T05:55:47Z

That's great! Good to surface this, could help spot regions where there's something going on, either not enough coverage on the Nextclade reference tree or some recombination/amplicon dropout etc.

I can tell you that the expected range of contamination is 0-5, very rarely >10, so making the color scale 0,1,2,3,4,... up to 10, then in steps of 5 or so would make sense.

Reversions appear a bit more often, so maybe 0,1,2,3,4,5-6,7-8,9-10,11-15,>15?
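The reversion bins suggested above can be sketched as a small lookup. This is only an illustration of the proposed grouping; the function name `bin_label` and the table `REVERSION_BINS` are hypothetical and not part of the ncov workflow.

```python
# Inclusive (lower, upper) ranges matching the proposal:
# 0,1,2,3,4,5-6,7-8,9-10,11-15,>15
REVERSION_BINS = [
    (0, 0), (1, 1), (2, 2), (3, 3), (4, 4),
    (5, 6), (7, 8), (9, 10), (11, 15),
]


def bin_label(count, bins=REVERSION_BINS):
    """Return the legend label for a non-negative integer count."""
    if count < 0:
        raise ValueError("count must be non-negative")
    for lower, upper in bins:
        if lower <= count <= upper:
            # Single-value bins are labelled with the value itself.
            return str(lower) if lower == upper else f"{lower}-{upper}"
    # Anything past the last explicit bin falls into the open-ended bin.
    return f">{bins[-1][1]}"


for example in (0, 6, 15, 40):
    print(example, "->", bin_label(example))
```

A similar table with single steps up to 10 and steps of 5 afterwards would cover the contamination counts described above.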

huddlej · 2022-02-08T18:06:38Z

@corneliusroemer Maybe it would be helpful to see what the numbers are for a proper build? From there we could determine whether custom scale(s) would improve the UX...

Are there any other fields you'd think to include besides the three proposed here? Here's the list of what's added by the QC table:

divergence
missing_data
nonACGTN
substitutions
deletions
insertions
reversion_mutations
potential_contaminants
rare_mutations
frame_shifts
aaSubstitutions
QC_missing_data
QC_mixed_sites
QC_rare_mutations
QC_snp_clusters
QC_frame_shifts
QC_stop_codons
clock_deviation

huddlej · 2022-02-09T20:36:31Z

After discussion in the group, we should pull in the overall QC columns from Nextclade's QC TSV in the join script:

qc.overallScore -> QC_overall_score
qc.overallStatus -> QC_overall_status

These columns should go into the default Auspice config and the main Nextstrain builds. We don't need to include the reversion mutations, etc. in the main Nextstrain build config.

Also, allow users to filter by overall QC status.

Addresses the metadata portion of nextstrain/ncov#861. See the symmetrical PR in ncov: nextstrain/ncov#865.

This adds the copying of the following columns from `nextclade.tsv` to `metadata.tsv`:

```
"qc.overallScore": "QC_overall_score",
"qc.overallStatus": "QC_overall_status",
```
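A minimal sketch of what copying these two columns could look like, assuming a tab-separated Nextclade output keyed by a `seqName` column. The function name `extract_qc_columns` and the sample rows are hypothetical; the actual ingest code is not shown in this thread.

```python
import csv
import io

# Mapping taken from the PR description: Nextclade column -> metadata column.
COLUMN_MAP = {
    "qc.overallScore": "QC_overall_score",
    "qc.overallStatus": "QC_overall_status",
}


def extract_qc_columns(nextclade_tsv, id_column="seqName"):
    """Return {strain: {metadata_column: value}} from Nextclade TSV text."""
    reader = csv.DictReader(io.StringIO(nextclade_tsv), delimiter="\t")
    return {
        row[id_column]: {new: row[old] for old, new in COLUMN_MAP.items()}
        for row in reader
    }


# Made-up example input, not real Nextclade output.
sample = (
    "seqName\tqc.overallScore\tqc.overallStatus\n"
    "strainA\t12.5\tgood\n"
    "strainB\t140.0\tbad\n"
)
qc_by_strain = extract_qc_columns(sample)
print(qc_by_strain["strainB"])
```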

ivan-aksamentov · 2022-02-10T07:38:47Z

@huddlej I made 2 symmetrical PRs in ingest and ncov to add the columns:

The order of operations:

review and merge PR in ingest #283
run daily ingest to update the metadata
~~wait until metadata is updated, then merge PR in ncov #865~~
~~use the new columns (I am not familiar with this part, so that's on augur/auspice gurus)~~

Upd: see John's comment below

corneliusroemer · 2022-02-10T08:38:50Z

@huddlej On an actual build the numbers will be pretty much what I outlined above. The problem of automatic scales is that they can be rendered very coarse by a single outlier. If most values are 0-5 but one value comes in at 50, it'll make all the 0-5 appear as one category, even though it'd be best to have something like 0,1,2,3,4,5,5-10,10+.
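The outlier effect described above can be illustrated with a simplified stand-in for an automatic scale: equal-width bins spanning the observed minimum and maximum. This is not how Auspice computes its color scale; `equal_width_bin` is a hypothetical helper used only to show how one large value flattens the rest.

```python
def equal_width_bin(value, lo, hi, n_bins):
    """Index of the equal-width bin that `value` falls into on [lo, hi]."""
    if hi == lo:
        return 0
    width = (hi - lo) / n_bins
    # Clamp so that the maximum value lands in the last bin.
    return min(int((value - lo) / width), n_bins - 1)


def auto_bins(counts, n_bins=5):
    """Bin every count on a scale derived from the data's own range."""
    lo, hi = min(counts), max(counts)
    return [equal_width_bin(c, lo, hi, n_bins) for c in counts]


without_outlier = auto_bins([0, 1, 2, 3, 4, 5])
with_outlier = auto_bins([0, 1, 2, 3, 4, 5, 50])

print(without_outlier)  # [0, 1, 2, 3, 4, 4]
print(with_outlier)     # [0, 0, 0, 0, 0, 0, 4]
```

With the single value of 50 present, every count from 0 to 5 collapses into the first bin, which is the loss of resolution that fixed, hand-picked bins would avoid.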

I also think that just including the overall QC score may not be hugely helpful in our builds for two reasons:

We filter out most bad sequences in diagnostic.py
There are a lot of potential reasons why something could be bad, so I'd be more interested in how it breaks down than in just whether it's good or bad

Did I understand it correctly yesterday that the reason for limiting extra color bys to 1 was that we didn't want to exacerbate the Auspice bug nextstrain/auspice#1453 (overlapping items in filter dropdown)?

I can think of use cases for almost all of the fields. There's a general question whether we should use more coarse grained status (just good, mediocre, bad) or continuous quantitative scales. I'd probably be in favour of continuous quant scales.

Divergence is useful in time view, just as date is useful in divergence view.

Missing_data, nonACGTN, substitutions, deletions, insertions: not sure

reversion_mutations: useful to investigate problematic parts of the tree and reason for bad qc
potential_contaminants: useful to investigate problematic parts of the tree and reason for bad qc
rare_mutations: useful to investigate problematic parts of the tree and reason for bad qc
frame_shifts: useful to investigate problematic parts of the tree and reason for bad qc

QC_missing_data, QC_mixed_sites, QC_rare_mutations, QC_snp_clusters, QC_frame_shifts, QC_stop_codons:
For these, the question is score (continuous) vs. status (categorical). There's some overlap with the raw counts above, but they should be useful for finding out what caused a bad QC. They would definitely be interesting in the tip tooltip, if one can see all of them at a glance.

clock_deviation: would help quickly spot if basal sequences fit the clock or could be there because of reference backfilling/reversion

huddlej · 2022-02-10T19:58:27Z

Thanks, @ivan-aksamentov and @corneliusroemer!

@ivan-aksamentov, regarding:

> The order of operations:

6d325f2 in this PR adds the same columns to the join script. I'm not too worried about the order of operations, here. If expected Nextclade QC columns don't exist in the user's metadata already, the script will annotate the metadata with the newly generated QC columns from the workflow. Both the Docker image and the conda environment use the latest Nextclade, so these results should be the same as what we'd get from the ingest (unless I'm missing something).
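The fallback behavior described above might look roughly like the following. This is a guess at the logic from the description, not the join script itself; `annotate_missing_qc` and the example rows are hypothetical.

```python
QC_COLUMNS = ["QC_overall_score", "QC_overall_status"]


def annotate_missing_qc(metadata_row, workflow_qc):
    """Fill in QC columns from workflow results only where they are absent.

    Columns already present in the user's metadata are left untouched.
    """
    annotated = dict(metadata_row)  # avoid mutating the caller's row
    for column in QC_COLUMNS:
        if column not in annotated and column in workflow_qc:
            annotated[column] = workflow_qc[column]
    return annotated


# Metadata that predates the new ingest columns gets them from the workflow.
old_row = {"strain": "strainA", "date": "2022-01-15"}
# Metadata that already carries QC columns keeps its own values.
new_row = {"strain": "strainB", "QC_overall_score": "3.0", "QC_overall_status": "good"}
workflow_qc = {"QC_overall_score": "12.5", "QC_overall_status": "good"}

print(annotate_missing_qc(old_row, workflow_qc))
print(annotate_missing_qc(new_row, workflow_qc))
```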

Updating ncov-ingest makes sense, though. The actual use of these columns depends on the user's Auspice config JSON to define the columns as colors or filters (as in the default config in this PR).
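For orientation, a custom Auspice config that opts in to these columns could be generated along these lines. The `colorings` and `filters` keys follow the usual Auspice config layout, but the titles and the choice of `type` per column are assumptions, and this is not a copy of the default config changed in this PR.

```python
import json

# Hypothetical config fragment; titles and types are illustrative choices.
auspice_config = {
    "colorings": [
        {"key": "QC_overall_score", "title": "Overall QC score", "type": "continuous"},
        {"key": "QC_overall_status", "title": "Overall QC status", "type": "categorical"},
        {"key": "reversion_mutations", "title": "Reversion mutations", "type": "continuous"},
        {"key": "potential_contaminants", "title": "Potential contaminants", "type": "continuous"},
        {"key": "rare_mutations", "title": "Rare mutations", "type": "continuous"},
    ],
    # Metadata columns that should be offered as filters in the UI.
    "filters": ["QC_overall_status"],
}

config_text = json.dumps(auspice_config, indent=2)
print(config_text)
```

A build that does not want a given column simply omits its entry; the metadata column can still be present without being surfaced.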

@corneliusroemer:

> I also think that just including the overall QC score may not be hugely helpful in our builds

That's right. The purpose of surfacing these data is to help users with their own builds, especially in the case where users want to skip diagnostic filters to show as much data as possible. In this case, these data allow users to identify problematic sequences they allowed into their build. This is a common use case with US public health labs and our Africa CDC partners.

So, that said, we do not need to include these additional columns in our Auspice config JSONs. It should be sufficient to include them in the default and then people can choose to use them or not per build.

> Did I understand it correctly yesterday that the reason for limiting extra color bys to 1 was that we didn't want to exacerbate the Auspice bug

This is an orthogonal issue that @trvrb has been concerned about with the Auspice UI/UX more generally. The problem would definitely be exacerbated by adding all of these columns to the main Nextstrain builds, but I think we generally agree that our main Nextstrain builds don't need all of these columns. @rneher and @trvrb seemed to agree that the overall QC score and/or status might be good enough.

> I can think of use cases for almost all of the fields. There's a general question whether we should use more coarse grained status (just good, mediocre, bad) or continuous quantitative scales. I'd probably be in favour of continuous quant scales.

Yeah, the more we've discussed this, the more I think folks should be able to opt in to whichever fields they find most helpful. I can imagine the quantitative fields have little meaning for people independent of some qualitative thresholding (e.g., "Is 50 a good or bad QC score?").

We just demoed the overall QC, reversion, potential contaminants, and rare mutations fields at office hours and folks were excited about these. Based on this feedback, I think our plan could be:

Merge this PR as it is with only changes to the default Auspice config
Add overall QC score/status to main Nextstrain build Auspice configs once @ivan-aksamentov's PRs go in
Write up a separate PR with a how-to guide about surfacing ncov-specific metadata fields as colors or filters in Auspice.

ivan-aksamentov · 2022-02-10T20:18:26Z

Alright, I merged the nextstrain/ncov-ingest#283 and closed #865. The new columns should be in the metadata after tomorrow's ingest.

Add colors for overall QC scores from Nextclade

6d325f2

Also, allow users to filter by overall QC status.

corneliusroemer mentioned this pull request Feb 10, 2022

Overlapping filter dropdown options nextstrain/auspice#1453

Closed

This was referenced Feb 10, 2022

feat: add overallScore and overallStatus QC columns to metadata.tsv nextstrain/ncov-ingest#283

Merged

feat: add overallScore and overallStatus QC columns to metadata.tsv #865

Closed

huddlej marked this pull request as ready for review February 10, 2022 19:58

Note new default colors for Auspice config

2ebbe06

huddlej merged commit 9d8b9f6 into master Feb 11, 2022

huddlej deleted the add-nextclade-colorbys branch February 11, 2022 23:14

huddlej merged 3 commits into master from add-nextclade-colorbys
