Surface Nextclade QC metrics as color-bys #861

Merged: huddlej merged 3 commits into master from add-nextclade-colorbys on Feb 11, 2022

Conversation

@huddlej
Contributor

@huddlej huddlej commented Feb 7, 2022

Description of proposed changes

Updates Auspice config(s) to include colors for Nextclade QC information that now gets passed through metadata to augur export. When these columns are present in the metadata, users can color their trees by reversions, potential contaminants, and rare mutations. Users can now skip the diagnostic filters in the workflow to keep as much data in the tree as possible, while still identifying the strains that would have been filtered by diagnostics.

Note: This PR only updates the default Auspice config at the moment. It will update all configs once we agree on the fields to include.
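
As a rough sketch of the kind of config change described here, the Python below appends coloring entries to an Auspice config dictionary. The `colorings` list with `key`, `title`, and `type` fields follows the Auspice config format; the titles, the choice of `continuous`, and the helper name `add_qc_colorings` are illustrative assumptions, not the actual contents of this PR.

```python
import json

# Hypothetical titles and types; the keys are the Nextclade QC column names
# discussed in this thread.
QC_COLORINGS = [
    {"key": "reversion_mutations", "title": "Reversion mutations", "type": "continuous"},
    {"key": "potential_contaminants", "title": "Potential contaminants", "type": "continuous"},
    {"key": "rare_mutations", "title": "Rare mutations", "type": "continuous"},
]


def add_qc_colorings(config):
    """Return a copy of an Auspice config with the QC colorings appended once."""
    updated = dict(config)
    colorings = list(updated.get("colorings", []))
    existing_keys = {entry["key"] for entry in colorings}
    for entry in QC_COLORINGS:
        if entry["key"] not in existing_keys:
            colorings.append(dict(entry))
    updated["colorings"] = colorings
    return updated


config = {"colorings": [{"key": "region", "title": "Region", "type": "categorical"}]}
print(json.dumps(add_qc_colorings(config), indent=2))
```

Auspice only offers a coloring when the corresponding column is present in the metadata, so entries like these are harmless for builds that lack the QC columns.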

Reversion mutations for the Nextstrain CI tree look like this:
image

Potential contaminants look like this:
image

Testing

  • Test by CI
  • Test with full Nextstrain builds

Release checklist

  • Update docs/src/reference/change_log.md in this pull request to document these changes by the date they were added.

@tsibley
Contributor

tsibley commented Feb 7, 2022

If these are all counts, should they be on some discrete color scale instead so scale values like 0.555 don't show up?

@jameshadfield
Member

> If these are all counts, should they be on some discrete color scale instead so scale values like 0.555 don't show up?

We can add a "legend" item to the coloring which will bin them into groups (e.g. "<5", "5-10"). Since we don't know these values ahead of time, the way to do this would be by adding this to the dataset JSON once it's created (similar to epiweeks).

Here's an example of custom legend entries, albeit for percentages not counts:

image
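
A minimal sketch of the post-processing idea above: build binned legend entries and attach them to a coloring in an already exported dataset JSON. It assumes that legend entries take `value`, `display`, and `bounds` fields and that colorings live under `meta.colorings`; both should be checked against the Auspice dataset schema. The helper names `make_count_legend` and `attach_legend` are hypothetical.

```python
def make_count_legend(edges):
    """Build one legend entry per consecutive pair of bin edges."""
    legend = []
    for lower, upper in zip(edges, edges[1:]):
        legend.append({
            "value": upper,                   # representative value for the bin
            "display": f"{lower}-{upper}",    # label shown in the legend
            "bounds": [lower, upper],         # assumed field: range matched to this entry
        })
    return legend


def attach_legend(dataset, coloring_key, legend):
    """Attach a legend to the named coloring; return False if it is absent."""
    for coloring in dataset["meta"]["colorings"]:
        if coloring["key"] == coloring_key:
            coloring["legend"] = legend
            return True
    return False


dataset = {"meta": {"colorings": [{"key": "rare_mutations", "type": "continuous"}]}}
attach_legend(dataset, "rare_mutations", make_count_legend([0, 5, 10, 20]))
print(dataset["meta"]["colorings"][0]["legend"])
```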

@huddlej
Contributor Author

huddlej commented Feb 7, 2022

I don't think the scale appearance is in the scope of this PR. We had this discussion about S1 mutations a year ago and never agreed on an outcome. This isn't to say we shouldn't fix the appearance, but this PR is primarily about which columns to surface or not. A fix to the appearance of counts in ncov should be its own separate PR.

@tsibley
Contributor

tsibley commented Feb 8, 2022

Ah, I see. Agreed not in scope here. I had thought this was just flipping the type on the colorings entry and Auspice would DTRT, but I see now it's not that simple.

@corneliusroemer
Member

That's great! Good to surface this, could help spot regions where there's something going on, either not enough coverage on the Nextclade reference tree or some recombination/amplicon dropout etc.

I can tell you that the expected range of contamination is 0-5, very rarely >10, so making the color scale 0,1,2,3,4,... up to 10, then in steps of 5 or so would make sense.

Reversions appear a bit more often, so maybe 0,1,2,3,4,5-6,7-8,9-10,11-15,>15?
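
For illustration only, the reversion bins proposed above could be expressed as a lookup like the following. The names `REVERSION_BINS` and `bin_label` are hypothetical, and nothing here is part of the workflow.

```python
import bisect

# Inclusive upper bound and label for each bin: 0,1,2,3,4,5-6,7-8,9-10,11-15,>15.
REVERSION_BINS = [
    (0, "0"), (1, "1"), (2, "2"), (3, "3"), (4, "4"),
    (6, "5-6"), (8, "7-8"), (10, "9-10"), (15, "11-15"),
]


def bin_label(count, bins=REVERSION_BINS, overflow=">15"):
    """Map a non-negative count to the label of the bin that contains it."""
    if count < 0:
        raise ValueError("count must be non-negative")
    upper_bounds = [upper for upper, _ in bins]
    # Index of the first bin whose upper bound is >= count.
    index = bisect.bisect_left(upper_bounds, count)
    return bins[index][1] if index < len(bins) else overflow


for count in (0, 5, 9, 15, 40):
    print(count, bin_label(count))
```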

@huddlej
Contributor Author

huddlej commented Feb 8, 2022

@corneliusroemer Maybe it would be helpful to see what the numbers are for a proper build? From there we could determine whether custom scale(s) would improve the UX...

Are there any other fields you'd think to include besides the three proposed here? Here's the list of what's added by the QC table:

divergence
missing_data
nonACGTN
substitutions
deletions
insertions
reversion_mutations
potential_contaminants
rare_mutations
frame_shifts
aaSubstitutions
QC_missing_data
QC_mixed_sites
QC_rare_mutations
QC_snp_clusters
QC_frame_shifts
QC_stop_codons
clock_deviation

@huddlej
Contributor Author

huddlej commented Feb 9, 2022

After discussion in the group, we should pull in the overall QC columns from Nextclade's QC TSV in the join script:

  • qc.overallScore -> QC_overall_score
  • qc.overallStatus -> QC_overall_status

These columns should go into the default Auspice config and the main Nextstrain builds. We don't need to include the reversion mutations, etc. in the main Nextstrain build config.

Also, allow users to filter by overall QC status.
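
A sketch of the column mapping listed above, using only the standard library. It assumes the Nextclade TSV identifies sequences in a `seqName` column; the function name `extract_qc_columns` is hypothetical and this is not the actual join script.

```python
import csv
import io

# Nextclade column -> metadata column, as listed in the comment above.
COLUMN_MAP = {
    "qc.overallScore": "QC_overall_score",
    "qc.overallStatus": "QC_overall_status",
}


def extract_qc_columns(nextclade_tsv, id_column="seqName"):
    """Read a Nextclade TSV and return {sequence name: renamed QC columns}."""
    reader = csv.DictReader(nextclade_tsv, delimiter="\t")
    return {
        row[id_column]: {new: row[old] for old, new in COLUMN_MAP.items()}
        for row in reader
    }


example = io.StringIO(
    "seqName\tqc.overallScore\tqc.overallStatus\n"
    "strainA\t12.5\tgood\n"
    "strainB\t140\tbad\n"
)
qc_by_strain = extract_qc_columns(example)
print(qc_by_strain["strainB"])
```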
ivan-aksamentov added a commit to nextstrain/ncov-ingest that referenced this pull request Feb 10, 2022
Addresses the metadata portion of nextstrain/ncov#861

See the symmetrical PR in ncov: TODO

This adds the copying of the following columns from `nextclade.tsv` to `metadata.tsv`:

```
    "qc.overallScore": "QC_overall_score",
    "qc.overallStatus": "QC_overall_status",
```
ivan-aksamentov added a commit to nextstrain/ncov-ingest that referenced this pull request Feb 10, 2022
Addresses the metadata portion of nextstrain/ncov#861

See the symmetrical PR in ncov: nextstrain/ncov#865

This adds the copying of the following columns from `nextclade.tsv` to `metadata.tsv`:

```
    "qc.overallScore": "QC_overall_score",
    "qc.overallStatus": "QC_overall_status",
```
@ivan-aksamentov
Member

ivan-aksamentov commented Feb 10, 2022

@huddlej I made 2 symmetrical PRs in ingest and ncov to add the columns:

The order of operations:

  • review and merge PR in ingest #283
  • run daily ingest to update the metadata
  • wait until metadata is updated, then merge PR in ncov #865
  • use the new columns (I am not familiar with this part, so that's on augur/auspice gurus)

Upd: see John's comment below

@corneliusroemer
Member

@huddlej On an actual build the numbers will be pretty much what I outlined above. The problem of automatic scales is that they can be rendered very coarse by a single outlier. If most values are 0-5 but one value comes in at 50, it'll make all the 0-5 appear as one category, even though it'd be best to have something like 0,1,2,3,4,5,5-10,10+.
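
The outlier effect described above can be shown with a toy equal-width binning. This is a simplified stand-in, not Auspice's actual scale logic, and the function name `equal_width_bin` is hypothetical.

```python
def equal_width_bin(value, low, high, n_bins):
    """Return the index of the equal-width bin that contains value."""
    if high == low:
        return 0
    width = (high - low) / n_bins
    # Clamp so the maximum value falls in the last bin.
    return min(int((value - low) / width), n_bins - 1)


counts = [0, 1, 2, 3, 4, 5, 50]  # mostly 0-5, with a single outlier at 50
bins = [equal_width_bin(c, min(counts), max(counts), 5) for c in counts]
print(bins)  # [0, 0, 0, 0, 0, 0, 4]
```

With five bins spanning 0 to 50, every value from 0 to 5 lands in the first bin, so the differences that matter most are invisible.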

I also think that just including the overall QC score may not be hugely helpful in our builds for two reasons:

  • we filter out most bad sequences in diagnostic.py
  • There are a lot of potential reasons why something could be bad, I'd be more interested in how it breaks down than just whether it's good or bad

Did I understand it correctly yesterday that the reason for limiting extra color bys to 1 was that we didn't want to exacerbate the Auspice bug nextstrain/auspice#1453 (overlapping items in filter dropdown)?

I can think of use cases for almost all of the fields. There's a general question whether we should use more coarse grained status (just good, mediocre, bad) or continuous quantitative scales. I'd probably be in favour of continuous quant scales.

Divergence is useful in time view, just as date is useful in divergence view.

Missing_data, nonACGTN, substitutions, deletions, insertions: not sure

reversion_mutations: useful to investigate problematic parts of the tree and reason for bad qc
potential_contaminants: useful to investigate problematic parts of the tree and reason for bad qc
rare_mutations: useful to investigate problematic parts of the tree and reason for bad qc
frame_shifts: useful to investigate problematic parts of the tree and reason for bad qc

QC_missing_data, QC_mixed_sites, QC_rare_mutations, QC_snp_clusters, QC_frame_shifts, QC_stop_codons: for these, the question is score (continuous) vs. status (categorical). There's some overlap with the raw counts above, but they should be useful for finding out what caused a bad QC. It would definitely be interesting to see all of these at a glance in the tip tooltip.

clock_deviation: would help quickly spot if basal sequences fit the clock or could be there because of reference backfilling/reversion

@huddlej
Contributor Author

huddlej commented Feb 10, 2022

Thanks, @ivan-aksamentov and @corneliusroemer!

@ivan-aksamentov, regarding:

> The order of operations:

6d325f2 in this PR adds the same columns to the join script. I'm not too worried about the order of operations, here. If expected Nextclade QC columns don't exist in the user's metadata already, the script will annotate the metadata with the newly generated QC columns from the workflow. Both the Docker image and the conda environment use the latest Nextclade, so these results should be the same as what we'd get from the ingest (unless I'm missing something).
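
A minimal sketch of the "annotate only when missing" behavior described above, assuming metadata and Nextclade records are plain dictionaries. The function name `annotate_missing_columns` is hypothetical and this is not the join script itself.

```python
def annotate_missing_columns(metadata_row, nextclade_row, columns):
    """Copy QC columns from Nextclade output only where metadata lacks them."""
    annotated = dict(metadata_row)
    for column in columns:
        # Existing metadata values (e.g. from ingest) take precedence.
        if column not in annotated and column in nextclade_row:
            annotated[column] = nextclade_row[column]
    return annotated


qc_columns = ["QC_overall_score", "QC_overall_status"]
metadata_row = {"strain": "strainA", "QC_overall_status": "good"}
nextclade_row = {"QC_overall_score": "12.5", "QC_overall_status": "mediocre"}
print(annotate_missing_columns(metadata_row, nextclade_row, qc_columns))
```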

Updating ncov-ingest makes sense, though. The actual use of these columns depends on the user's Auspice config JSON to define the columns as colors or filters (as in the default config in this PR).

@corneliusroemer:

> I also think that just including the overall QC score may not be hugely helpful in our builds

That's right. The purpose of surfacing these data is to help users with their own builds, especially in the case where users want to skip diagnostic filters to show as much data as possible. In this case, these data allow users to identify problematic sequences they allowed into their build. This is a common use case with US public health labs and our Africa CDC partners.

So, that said, we do not need to include these additional columns in our Auspice config JSONs. It should be sufficient to include them in the default and then people can choose to use them or not per build.

> Did I understand it correctly yesterday that the reason for limiting extra color bys to 1 was that we didn't want to exacerbate the Auspice bug

This is an orthogonal issue that @trvrb has been concerned about with the Auspice UI/UX more generally. The problem would definitely be exacerbated by adding all of these columns to the main Nextstrain builds, but I think we generally agree that our main Nextstrain builds don't need all of these columns. @rneher and @trvrb seemed to agree that the overall QC score and/or status might be good enough.

> I can think of use cases for almost all of the fields. There's a general question whether we should use more coarse grained status (just good, mediocre, bad) or continuous quantitative scales. I'd probably be in favour of continuous quant scales.

Yeah, the more we've discussed this, the more I think folks should be able to opt in to whichever fields they find most helpful. I can imagine the quantitative fields have little meaning for people independent of some qualitative thresholding (e.g., "Is 50 a good or bad QC score?").
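
To make the thresholding point concrete, here is a sketch that maps a numeric score to the three statuses mentioned in this thread. The cutoffs of 30 and 100 are an assumption based on Nextclade's documented convention and should be verified against the Nextclade version in use; the function name `qc_status` is hypothetical.

```python
def qc_status(score, mediocre_from=30, bad_from=100):
    """Classify a QC score; the default cutoffs are assumptions, not confirmed here."""
    if score < 0:
        raise ValueError("score must be non-negative")
    if score >= bad_from:
        return "bad"
    if score >= mediocre_from:
        return "mediocre"
    return "good"


for score in (0, 29.9, 50, 100):
    print(score, qc_status(score))
```

Under these assumed cutoffs, a score of 50 would read as "mediocre", which is the kind of context a bare number does not give.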

We just demoed the overall QC, reversion, potential contaminants, and rare mutations fields at office hours and folks were excited about these. Based on this feedback, I think our plan could be:

  1. Merge this PR as it is with only changes to the default Auspice config
  2. Add overall QC score/status to main Nextstrain build Auspice configs once @ivan-aksamentov's PRs go in
  3. Write up a separate PR with a how-to guide about surfacing ncov-specific metadata fields as colors or filters in Auspice.

@huddlej huddlej marked this pull request as ready for review February 10, 2022 19:58
@ivan-aksamentov
Member

Alright, I merged nextstrain/ncov-ingest#283 and closed #865. The new columns should be in the metadata after tomorrow's ingest.

@huddlej huddlej merged commit 9d8b9f6 into master Feb 11, 2022
@huddlej huddlej deleted the add-nextclade-colorbys branch February 11, 2022 23:14

5 participants