Skip to content

Include MLR lineage fitness#1169

Merged
trvrb merged 9 commits intomasterfrom
mlr-lineage-fitness
Jan 29, 2025
Merged

Include MLR lineage fitness#1169
trvrb merged 9 commits intomasterfrom
mlr-lineage-fitness

Conversation

@trvrb
Copy link
Copy Markdown
Member

@trvrb trvrb commented Jan 23, 2025

This include a script to grab live estimate of MLR lineage fitness and apply this as a tip-level coloring in the tree.

This script is currently assuming a match on lineage fitness that uses https://data.nextstrain.org/files/workflows/forecasts-ncov/gisaid/pango_lineages/global/mlr/latest_results.json, which backs the live estimates on nextstrain.org/sars-cov-2/forecasts

This uses Nextclade_pango metadata label to derive sequence counts from GISAID data and estimate relative growth advantages across collapsed Pango lineages. It will be most relevant for 1m, 2m and 6m builds, but is not at all broken for the all-time builds. It would be possible to swap this to key on clade instead, but I think the greater detail of lineages is better in this case.

Here are the results from trial builds:

Coloring 6m tree by MLR lineage fitness - https://nextstrain.org/staging/ncov/gisaid/trial/mlr-lineage-fitness/global/6m?c=mlr_lineage_fitness

Screenshot 2025-01-23 at 10 29 10 AM

You can see that, as expected, high fitness strains are increasing in frequency:

Screenshot 2025-01-23 at 10 30 02 AM

We can also do fun things like compare fitness to S1 mutations - https://nextstrain.org/staging/ncov/gisaid/trial/mlr-lineage-fitness/global/6m?branches=hide&l=scatter&regression=show&scatterX=S1_mutations&scatterY=mlr_lineage_fitness

Screenshot 2025-01-23 at 10 31 08 AM

Or just plot out fitness across Pango lineages - https://nextstrain.org/staging/ncov/gisaid/trial/mlr-lineage-fitness/global/6m?branches=hide&l=scatter&regression=show&scatterX=mlr_lineage_fitness&scatterY=Nextclade_pango

Screenshot 2025-01-23 at 10 32 02 AM

Correspondingly, I've deprecated and removed the previous mutational_fitness and logistic_growth analyses.

The mutational fitness analysis relied on results posted to https://github.com/bkotzen/sars-cov2-modeling. However, this analysis hasn't been updated since 2024-07-22. This results in all circulating strains in the key 6m tree to have essentially identical fitness. This makes this coloring not informative and potentially misleading. If these results become updated more frequently, we should restore this analysis. In this case, we should also contact the authors to ask for a more stable GitHub URL. The current strategy of dated analyses with now "latest" tag makes automated updated difficult.

The logistic growth analysis used frequencies of strains in the tree to estimate logistic growth rates. However, analyses at nextstrain.org/sars-cov-2/forecasts use all the available sequence data and provide a more sophisticated estimate of growth rate accounting for competition between circulating viruses. These updated results are available in the mlr_lineage_fitness analysis. The mlr_lineage_fitness results should be substantially more accurate.

I opted to take the previous scripts of parse_mutational_fitness_tsv_into_distance_map.py and calculate_delta_frequency.py and move them to scripts/deprecated/ and additionally to just comment out the relevant rules in main_workflow.smk. These scripts are going to be useful for 1-off scientific analyses and I don't want to have to hunt through Git to recover them.


Trials builds completed successfully.

trvrb added 5 commits January 22, 2025 17:58
This include a script to grab live estimate of MLR lineage fitness and apply this as a tip-level coloring in the tree.
This is mostly updating the description.md files to include sentence describing the MLR fitness coloring.
The mutational fitness analysis relies on results posted to https://github.com/bkotzen/sars-cov2-modeling. However, this analysis hasn't been updated since 2024-07-22. This results in all circulating strains in the key 6m tree to have essentially identical fitness. This makes this coloring not informative and potentially misleading. If these results become updated more frequently, we should restore this analysis. In this case, we should also contact the authors to ask for a more stable GitHub URL. The current strategy of dated analyses with now "latest" tag makes automated updated difficult.
The logistic growth analysis used frequencies of strains in the tree to estimate logistic growth rates. However, analyses at nextstrain.org/sars-cov-2/forecasts use all the available sequence data and provide a more sophisticated estimate of growth rate accounting for competition between circulating viruses. These updated results are available in the `mlr_lineage_fitness` analysis. The `mlr_lineage_fitness` results should be substantially more accurate.
@trvrb trvrb marked this pull request as ready for review January 23, 2025 19:14
@trvrb trvrb requested a review from victorlin January 23, 2025 19:14
Copy link
Copy Markdown
Contributor

@huddlej huddlej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like this idea and how much simpler it makes the workflow by removing the older fitness estimates! I left some minor code style comments below.

Comment on lines +72 to +75
except FileNotFoundError as e:
print(f"Error reading metadata file: {e}")
except Exception as e:
print(f"An unexpected error occurred: {e}")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we not want these errors to stop the workflow? As these lines are written, the script can fail in many different ways and still return a 0 exit code that allows the workflow to run.

Related to troubleshooting errors, it's easier to track error messages when they are written to sys.stderr like this:

except FileNotFoundError as e:
    print(f"Error reading metadata file: {e}", file=sys.stderr)
except Exception as e:
    print(f"An unexpected error occurred: {e}", file=sys.stderr)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error handling is something I'm not very familiar with. I intensionally set this up to fail and proceed with workflow. If the remote URL can't be fetched, then

metadata[args.metadata_clade_attribute] = math.nan

still results in a file that's just

    "Scotland/CLIMB-CM7YFURA/2024": {
      "mlr_lineage_fitness": NaN
    },

I'd prefer an empty coloring to workflow that crashes if S3 can't be reached.

I updated to use sys.stderr with 6f5d767.

Copy link
Copy Markdown
Contributor

@tsibley tsibley Jan 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, NaN is not a valid JSON value. Python's json module will read/write it by default, but it violates the JSON spec and will cause other parsers to error.

It "works" here because this node data file is only being read by augur export in this case which then suppresses the NaN value as a side-effect of other filtering:

https://github.com/nextstrain/augur/blob/73f1fa94e03775340843a04d63ce97af1dee372f/augur/export_v2.py#L812

but not intentionally because it considers NaN or related values Infinity/-Infinity. (In fact, putting the latter in a node data JSON will cause an error in Augur because it loads the value into a float (against JSON spec) but fails to expect/handle it.)

We really shouldn't be emitting NaN (or Infinity/-Infinity) in JSON outputs.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you recommend labeling these instead? "?"? "NaN"? I don't think it's appropriate to just leave these missing entries out of the branch JSON.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could set them to Python's None (JSON's null) if you want an explicit missing/unknown value. But I'd wonder why not just leave them out? What's the benefit to including all missing values?

I'd prefer an empty coloring to workflow that crashes if S3 can't be reached.

Is the idea that in the face of an error you want the coloring to be present, even if there's no data for it, rather than the coloring to be absent?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. And this is what happens in the current Auspice JSON. It's lacking any entry for mlr_lineage_fitness for Philippines/PH-PGC-52649/2021. This should be due to how augur export is working. However, regarding #1169 (comment), I wonder if this is an issue with write_json that should be documented?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming this has to do with something in augur.utils.write_json?

Sigh. Looking at (and quickly testing) write_json(), I don't think it is converting NoneNaN, but Pandas may be. It's the sort of thing Pandas does in some cases, e.g.

>>> pd.Series([1, None, 2])
0    1.0
1    NaN
2    2.0
dtype: float64

though I don't know if it applies here without digging into the specifics of this data frame and the series dtypes that were sniffed, and on top of that it may be Pandas-version dependent.

(This sort of behaviour is why I loathe Pandas. Behaviours which are convenient for interactive usage, like in a notebook, are often very hard to reliably reason about for repeated and unattended usage.)

It felt cleaner and more inspectable for the script to explicitly write out each entry and flag the strains that have missing data.

I'm talking about the

else:
    metadata[args.metadata_clade_attribute] = math.nan

case where all the strains having missing data, though. (Am I being thick and missing something?)

I was thinking that this URL fetch would potentially fail transiently when the workflow is run. I didn't want a single failure in one of the 72 Snakemake endpoints to blow up the entire workflow on S3. I'm okay with this (presumably rare) situation resulting in a live dataset that is just completely gray for the MLR fitness coloring.

This felt like the most graceful way for the script to fail.

Ok. The downside is that failure is silent. Our workflows tend to use loud noisy failures, so this is a departure. That's fine, to be clear, but notable. When introducing non-fatal failures that can be ignored/moved past, it's operationally useful to have a way to still surface those as prominent warnings and/or put the monitoring in a "degraded" status. (Historically we've used Slack for the former and don't have the infra in place for the latter.)


In any case, this feels like the weeds for this PR, and I didn't intend to wade into actual code review here. Please do not consider anything I'm saying as a blocker.

Copy link
Copy Markdown
Member Author

@trvrb trvrb Jan 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. The downside is that failure is silent. Our workflows tend to use loud noisy failures, so this is a departure. That's fine, to be clear, but notable. When introducing non-fatal failures that can be ignored/moved past, it's operationally useful to have a way to still surface those as prominent warnings and/or put the monitoring in a "degraded" status.

This is me not knowing how to use Snakemake. Is there a way for the rule mlr_lineage_fitness to emit an error, still produce the mlr_lineage_fitness.json file and then for the dataset to be produced?

Or I don't understand how to "still surface those as prominent warnings" that is different than

except Exception as e:
	print(f"Error fetching the JSON file: {e}", file=sys.stderr)
	return None

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. f088d49 fixes the NaN vs null issue. I didn't realize that this was how Pandas worked. Thanks for the pointer.

@tsibley: Please let me know if you have a specific change you'd recommend for the error / warning issue. I'm confused on what you're currently advocating. Otherwise, I'll plan to merge.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to go ahead with merging this now. @tsibley: if you have recommended changes I'm happy to make these on a subsequent branch. I'll leave this thread open.

Copy link
Copy Markdown
Member

@jameshadfield jameshadfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for this direction

[beyond this PR] we should work out a way to add provenance information to the auspice JSON which points to a full summary of the information & methods used to generate the models. Right now the closest we have (I think) is the updated date in the MLR JSON. By provenance I mean evofr version, versions / date cutoffs for the input files, config YAML etc, basically all the information one would need to reproduce the JSON. It wouldn't be a bad idea to sketch out how this could work for other parts of the auspice JSON, perhaps starting with including the git commit of the workflow.

Copy link
Copy Markdown
Member

@victorlin victorlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only change requested is changelog tweak in #1169 (comment), everything else is soft suggestions. Sorry, I was looking at the wrong diff. I see it's already been added.

Small refactor of fetch_mlr_lineage_fitness script primarily to break out a `main()` function and to call this function from `if __name__ == '__main__':`
Explicitly set these entries as Python None rather than the NaN that Pandas wants to do. Then the resulting JSON gets values of `null` rather than `NaN`.
@trvrb trvrb self-assigned this Jan 29, 2025
@trvrb trvrb merged commit b97d749 into master Jan 29, 2025
@trvrb trvrb deleted the mlr-lineage-fitness branch January 29, 2025 23:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants