Conversation
This includes a script to grab live estimates of MLR lineage fitness and apply these as a tip-level coloring in the tree.
This is mostly updating the description.md files to include a sentence describing the MLR fitness coloring.
The mutational fitness analysis relies on results posted to https://github.com/bkotzen/sars-cov2-modeling. However, this analysis hasn't been updated since 2024-07-22. As a result, all circulating strains in the key 6m tree have essentially identical fitness, which makes this coloring uninformative and potentially misleading. If these results become updated more frequently, we should restore this analysis. In that case, we should also contact the authors to ask for a more stable GitHub URL. The current strategy of dated analyses with no "latest" tag makes automated updates difficult.
The logistic growth analysis used frequencies of strains in the tree to estimate logistic growth rates. However, analyses at nextstrain.org/sars-cov-2/forecasts use all the available sequence data and provide a more sophisticated estimate of growth rate accounting for competition between circulating viruses. These updated results are available in the `mlr_lineage_fitness` analysis. The `mlr_lineage_fitness` results should be substantially more accurate.
huddlej
left a comment
I really like this idea and how much simpler it makes the workflow by removing the older fitness estimates! I left some minor code style comments below.
scripts/fetch_mlr_lineage_fitness.py
Outdated
```python
except FileNotFoundError as e:
    print(f"Error reading metadata file: {e}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")
```
Do we not want these errors to stop the workflow? As these lines are written, the script can fail in many different ways and still return a 0 exit code that allows the workflow to run.
Related to troubleshooting errors, it's easier to track error messages when they are written to sys.stderr like this:
```python
except FileNotFoundError as e:
    print(f"Error reading metadata file: {e}", file=sys.stderr)
except Exception as e:
    print(f"An unexpected error occurred: {e}", file=sys.stderr)
```
This error handling is something I'm not very familiar with. I intentionally set this up to fail and proceed with the workflow. If the remote URL can't be fetched, then

```python
metadata[args.metadata_clade_attribute] = math.nan
```
still results in a file that's just
```json
"Scotland/CLIMB-CM7YFURA/2024": {
    "mlr_lineage_fitness": NaN
},
```
I'd prefer an empty coloring to a workflow that crashes if S3 can't be reached.
I updated to use sys.stderr with 6f5d767.
FWIW, NaN is not a valid JSON value. Python's json module will read/write it by default, but it violates the JSON spec and will cause other parsers to error.
It "works" here because this node data file is only being read by augur export in this case, which then suppresses the NaN value as a side-effect of other filtering, but not because it intentionally considers NaN or the related values Infinity/-Infinity. (In fact, putting the latter in a node data JSON will cause an error in Augur because it loads the value into a float (against the JSON spec) but fails to expect/handle it.)
We really shouldn't be emitting NaN (or Infinity/-Infinity) in JSON outputs.
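The spec violation is easy to demonstrate with Python's stdlib alone (a minimal sketch, not code from the workflow):

```python
import json
import math

# By default, Python's json module happily emits the non-standard token NaN,
# which strict JSON parsers in other languages will reject.
lenient = json.dumps({"mlr_lineage_fitness": math.nan})

# Passing allow_nan=False enforces the JSON spec and raises instead.
try:
    json.dumps({"mlr_lineage_fitness": math.nan}, allow_nan=False)
    strict_raised = False
except ValueError:
    strict_raised = True
```

The `allow_nan=False` behavior is what a spec-compliant parser on the consuming side would effectively do.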
How do you recommend labeling these instead? "?"? "NaN"? I don't think it's appropriate to just leave these missing entries out of the branch JSON.
You could set them to Python's None (JSON's null) if you want an explicit missing/unknown value. But I'd wonder why not just leave them out? What's the benefit to including all missing values?
I'd prefer an empty coloring to workflow that crashes if S3 can't be reached.
Is the idea that in the face of an error you want the coloring to be present, even if there's no data for it, rather than the coloring to be absent?
Yes. And this is what happens in the current Auspice JSON. It's lacking any entry for mlr_lineage_fitness for Philippines/PH-PGC-52649/2021. This should be due to how augur export is working. However, regarding #1169 (comment), I wonder if this is an issue with write_json that should be documented?
I'm assuming this has to do with something in `augur.utils.write_json`?
Sigh. Looking at (and quickly testing) write_json(), I don't think it is converting None → NaN, but Pandas may be. It's the sort of thing Pandas does in some cases, e.g.
```python
>>> pd.Series([1, None, 2])
0    1.0
1    NaN
2    2.0
dtype: float64
```
though I don't know if it applies here without digging into the specifics of this data frame and the series dtypes that were sniffed, and on top of that it may be Pandas-version dependent.
(This sort of behaviour is why I loathe Pandas. Behaviours which are convenient for interactive usage, like in a notebook, are often very hard to reliably reason about for repeated and unattended usage.)
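One way to guard against that coercion, whatever dtype Pandas sniffed, is to sanitize values just before serialization. This is a minimal sketch assuming a flat `{strain: {attribute: value}}` mapping; the function names are illustrative, not from the script:

```python
import math

def to_json_safe(value):
    """Map non-finite floats (NaN, Infinity, -Infinity), which are invalid
    in strict JSON, to None so they serialize as null."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    return value

def sanitize_node_data(nodes):
    # nodes: {strain: {attribute: value}} as written to a node data JSON
    return {
        strain: {attr: to_json_safe(value) for attr, value in attrs.items()}
        for strain, attrs in nodes.items()
    }
```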
It felt cleaner and more inspectable for the script to explicitly write out each entry and flag the strains that have missing data.
I'm talking about the

```python
else:
    metadata[args.metadata_clade_attribute] = math.nan
```

case where all the strains have missing data, though. (Am I being thick and missing something?)
I was thinking that this URL fetch would potentially fail transiently when the workflow is run. I didn't want a single failure in one of the 72 Snakemake endpoints to blow up the entire workflow on S3. I'm okay with this (presumably rare) situation resulting in a live dataset that is just completely gray for the MLR fitness coloring.
This felt like the most graceful way for the script to fail.
Ok. The downside is that failure is silent. Our workflows tend to use loud noisy failures, so this is a departure. That's fine, to be clear, but notable. When introducing non-fatal failures that can be ignored/moved past, it's operationally useful to have a way to still surface those as prominent warnings and/or put the monitoring in a "degraded" status. (Historically we've used Slack for the former and don't have the infra in place for the latter.)
In any case, this feels like the weeds for this PR, and I didn't intend to wade into actual code review here. Please do not consider anything I'm saying as a blocker.
> Ok. The downside is that failure is silent. Our workflows tend to use loud noisy failures, so this is a departure. That's fine, to be clear, but notable. When introducing non-fatal failures that can be ignored/moved past, it's operationally useful to have a way to still surface those as prominent warnings and/or put the monitoring in a "degraded" status.
This is me not knowing how to use Snakemake. Is there a way for the rule `mlr_lineage_fitness` to emit an error, still produce the `mlr_lineage_fitness.json` file, and then for the dataset to be produced?
Or I don't understand how to "still surface those as prominent warnings" in a way that is different from

```python
except Exception as e:
    print(f"Error fetching the JSON file: {e}", file=sys.stderr)
    return None
```
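For what it's worth, one pattern that keeps the Snakemake rule green while making the failure visible is to always write the output file and reserve stderr for a loud, greppable warning. A hypothetical sketch (the function name and message format are illustrative, not the script's actual interface):

```python
import json
import sys

def write_node_data(path, nodes, error=None):
    # Always produce the node data JSON so downstream rules can run; if the
    # fetch failed, emit a prominent warning and an empty coloring instead
    # of crashing the workflow.
    if error is not None:
        print(f"WARNING: mlr_lineage_fitness unavailable ({error}); "
              "emitting an empty coloring.", file=sys.stderr)
        nodes = {}
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"nodes": nodes}, fh, indent=2)
```

A log-scanning step or Slack notifier could then watch for `WARNING:` lines to flag the run as degraded without failing it.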
Okay. f088d49 fixes the NaN vs null issue. I didn't realize that this was how Pandas worked. Thanks for the pointer.
@tsibley: Please let me know if you have a specific change you'd recommend for the error / warning issue. I'm confused on what you're currently advocating. Otherwise, I'll plan to merge.
I'm going to go ahead with merging this now. @tsibley: if you have recommended changes I'm happy to make these on a subsequent branch. I'll leave this thread open.
jameshadfield
left a comment
+1 for this direction
[beyond this PR] we should work out a way to add provenance information to the auspice JSON which points to a full summary of the information & methods used to generate the models. Right now the closest we have (I think) is the updated date in the MLR JSON. By provenance I mean evofr version, versions / date cutoffs for the input files, config YAML etc, basically all the information one would need to reproduce the JSON. It wouldn't be a bad idea to sketch out how this could work for other parts of the auspice JSON, perhaps starting with including the git commit of the workflow.
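As a thought experiment, such a provenance block might be assembled like this. Every key name below is entirely hypothetical; there is no agreed Auspice schema for this yet:

```python
def provenance_block(workflow_commit, evofr_version, mlr_updated, config_path):
    # Hypothetical shape for provenance metadata to embed in an Auspice JSON.
    # All keys here are illustrative only.
    return {
        "provenance": {
            "workflow_commit": workflow_commit,   # git commit of the workflow
            "evofr_version": evofr_version,       # model software version
            "mlr_results_updated": mlr_updated,   # updated date from the MLR JSON
            "config": config_path,                # config YAML used for the run
        }
    }
```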
The only change requested is the changelog tweak in #1169 (comment); everything else is soft suggestions. Sorry, I was looking at the wrong diff. I see it's already been added.
Small refactor of fetch_mlr_lineage_fitness script primarily to break out a `main()` function and to call this function from `if __name__ == '__main__':`
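The shape of that refactor, sketched with placeholder arguments (the `--output` flag is illustrative, not the script's real interface):

```python
import argparse
import sys

def main(argv=None):
    # Parsing arguments inside main() keeps the module importable
    # (e.g. for tests) without side effects at import time.
    parser = argparse.ArgumentParser(description="Fetch MLR lineage fitness")
    parser.add_argument("--output", required=True, help="node data JSON path")
    args = parser.parse_args(argv)
    print(f"would write {args.output}", file=sys.stderr)
    return 0

if __name__ == "__main__":
    sys.exit(main())
```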
Explicitly set these entries to Python's None rather than the NaN that Pandas wants to use. Then the resulting JSON gets values of `null` rather than `NaN`.
This includes a script to grab live estimates of MLR lineage fitness and apply these as a tip-level coloring in the tree.
This script currently matches strains to lineage fitness using https://data.nextstrain.org/files/workflows/forecasts-ncov/gisaid/pango_lineages/global/mlr/latest_results.json, which backs the live estimates on nextstrain.org/sars-cov-2/forecasts
This uses the `Nextclade_pango` metadata label to derive sequence counts from GISAID data and estimate relative growth advantages across collapsed Pango lineages. It will be most relevant for `1m`, `2m` and `6m` builds, but is not at all broken for the `all-time` builds. It would be possible to swap this to key on clade instead, but I think the greater detail of lineages is better in this case.

Here are the results from trial builds:
Coloring 6m tree by MLR lineage fitness - https://nextstrain.org/staging/ncov/gisaid/trial/mlr-lineage-fitness/global/6m?c=mlr_lineage_fitness
You can see that, as expected, high fitness strains are increasing in frequency:
We can also do fun things like compare fitness to S1 mutations - https://nextstrain.org/staging/ncov/gisaid/trial/mlr-lineage-fitness/global/6m?branches=hide&l=scatter&regression=show&scatterX=S1_mutations&scatterY=mlr_lineage_fitness
Or just plot out fitness across Pango lineages - https://nextstrain.org/staging/ncov/gisaid/trial/mlr-lineage-fitness/global/6m?branches=hide&l=scatter&regression=show&scatterX=mlr_lineage_fitness&scatterY=Nextclade_pango
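To give intuition for what the MLR coloring represents, here is a toy two-lineage calculation of a relative growth advantage from changing frequencies. This is only a back-of-the-envelope sketch, not the evofr model behind the forecasts, and the generation time is an assumed constant:

```python
import math

def pairwise_growth_advantage(freq_start, freq_end, days, generation_days=5.2):
    # Change in log-odds of the focal lineage vs. everything else, converted
    # to a per-generation fold advantage. The 5.2-day generation time is
    # illustrative, not a value from the workflow.
    delta_logit = (math.log(freq_end / (1 - freq_end))
                   - math.log(freq_start / (1 - freq_start)))
    beta = delta_logit / days  # growth rate in log-odds per day
    return math.exp(beta * generation_days)

# A lineage rising from 10% to 50% frequency over 30 days has a clear advantage.
advantage = pairwise_growth_advantage(0.10, 0.50, 30)
```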
Correspondingly, I've deprecated and removed the previous `mutational_fitness` and `logistic_growth` analyses.

The mutational fitness analysis relied on results posted to https://github.com/bkotzen/sars-cov2-modeling. However, this analysis hasn't been updated since 2024-07-22. As a result, all circulating strains in the key 6m tree have essentially identical fitness, which makes this coloring uninformative and potentially misleading. If these results become updated more frequently, we should restore this analysis. In that case, we should also contact the authors to ask for a more stable GitHub URL. The current strategy of dated analyses with no "latest" tag makes automated updates difficult.

The logistic growth analysis used frequencies of strains in the tree to estimate logistic growth rates. However, analyses at nextstrain.org/sars-cov-2/forecasts use all the available sequence data and provide a more sophisticated estimate of growth rate accounting for competition between circulating viruses. These updated results are available in the `mlr_lineage_fitness` analysis. The `mlr_lineage_fitness` results should be substantially more accurate.

I opted to take the previous scripts of `parse_mutational_fitness_tsv_into_distance_map.py` and `calculate_delta_frequency.py` and move them to `scripts/deprecated/` and additionally to just comment out the relevant rules in `main_workflow.smk`. These scripts are going to be useful for one-off scientific analyses and I don't want to have to hunt through Git to recover them.

Trial builds completed successfully.