Conversation
This includes a script to grab live estimates of MLR lineage fitness and apply these as a tip-level coloring in the tree.
This is mostly updating the description.md files to include a sentence describing the MLR fitness coloring.
The mutational fitness analysis relies on results posted to https://github.com/bkotzen/sars-cov2-modeling. However, this analysis hasn't been updated since 2024-07-22. As a result, all circulating strains in the key 6m tree have essentially identical fitness, which makes this coloring uninformative and potentially misleading. If these results become updated more frequently, we should restore this analysis. In that case, we should also contact the authors to ask for a more stable GitHub URL. The current strategy of dated analyses with no "latest" tag makes automated updates difficult.
The logistic growth analysis used frequencies of strains in the tree to estimate logistic growth rates. However, analyses at nextstrain.org/sars-cov-2/forecasts use all the available sequence data and provide a more sophisticated estimate of growth rate accounting for competition between circulating viruses. These updated results are available in the `mlr_lineage_fitness` analysis. The `mlr_lineage_fitness` results should be substantially more accurate.
huddlej
left a comment
I really like this idea and how much simpler it makes the workflow by removing the older fitness estimates! I left some minor code style comments below.
scripts/fetch_mlr_lineage_fitness.py
Outdated
```python
except FileNotFoundError as e:
    print(f"Error reading metadata file: {e}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")
```
Do we not want these errors to stop the workflow? As these lines are written, the script can fail in many different ways and still return a 0 exit code that allows the workflow to run.
Related to troubleshooting errors, it's easier to track error messages when they are written to sys.stderr like this:
```python
except FileNotFoundError as e:
    print(f"Error reading metadata file: {e}", file=sys.stderr)
except Exception as e:
    print(f"An unexpected error occurred: {e}", file=sys.stderr)
```
This error handling is something I'm not very familiar with. I intentionally set this up to fail and proceed with the workflow. If the remote URL can't be fetched, then

```python
metadata[args.metadata_clade_attribute] = math.nan
```
still results in a file that's just
```json
"Scotland/CLIMB-CM7YFURA/2024": {
    "mlr_lineage_fitness": NaN
},
```
I'd prefer an empty coloring to a workflow that crashes if S3 can't be reached.
I updated to use sys.stderr with 6f5d767.
FWIW, NaN is not a valid JSON value. Python's json module will read/write it by default, but it violates the JSON spec and will cause other parsers to error.
It "works" here because this node data file is only being read by augur export in this case, which then suppresses the NaN value as a side-effect of other filtering, but not because it intentionally considers NaN or the related values Infinity/-Infinity. (In fact, putting the latter in a node data JSON will cause an error in Augur because it loads the value into a float (against the JSON spec) but fails to expect/handle it.)
We really shouldn't be emitting NaN (or Infinity/-Infinity) in JSON outputs.
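The spec violation is easy to demonstrate with Python's stdlib alone (a minimal sketch, not code from the workflow):

```python
import json
import math

# By default, Python's json module happily emits the non-standard token NaN,
# which strict JSON parsers in other languages will reject.
lenient = json.dumps({"mlr_lineage_fitness": math.nan})

# Passing allow_nan=False enforces the JSON spec and raises instead.
try:
    json.dumps({"mlr_lineage_fitness": math.nan}, allow_nan=False)
    strict_raised = False
except ValueError:
    strict_raised = True
```

The `allow_nan=False` behavior is what a spec-compliant parser on the consuming side would effectively do.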
How do you recommend labeling these instead? "?"? "NaN"? I don't think it's appropriate to just leave these missing entries out of the branch JSON.
You could set them to Python's None (JSON's null) if you want an explicit missing/unknown value. But I'd wonder why not just leave them out? What's the benefit to including all missing values?
I'd prefer an empty coloring to workflow that crashes if S3 can't be reached.
Is the idea that in the face of an error you want the coloring to be present, even if there's no data for it, rather than the coloring to be absent?
Yes. And this is what happens in the current Auspice JSON. It's lacking any entry for mlr_lineage_fitness for Philippines/PH-PGC-52649/2021. This should be due to how augur export is working. However, regarding #1169 (comment), I wonder if this is an issue with write_json that should be documented?
I'm assuming this has to do with something in `augur.utils.write_json`?
Sigh. Looking at (and quickly testing) write_json(), I don't think it is converting None → NaN, but Pandas may be. It's the sort of thing Pandas does in some cases, e.g.
```python
>>> pd.Series([1, None, 2])
0    1.0
1    NaN
2    2.0
dtype: float64
```
though I don't know if it applies here without digging into the specifics of this data frame and the series dtypes that were sniffed, and on top of that it may be Pandas-version dependent.
(This sort of behaviour is why I loathe Pandas. Behaviours which are convenient for interactive usage, like in a notebook, are often very hard to reliably reason about for repeated and unattended usage.)
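One way to guard against that coercion, whatever dtype Pandas sniffed, is to sanitize values just before serialization. This is a minimal sketch assuming a flat `{strain: {attribute: value}}` mapping; the function names are illustrative, not from the script:

```python
import math

def to_json_safe(value):
    """Map non-finite floats (NaN, Infinity, -Infinity), which are invalid
    in strict JSON, to None so they serialize as null."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    return value

def sanitize_node_data(nodes):
    # nodes: {strain: {attribute: value}} as written to a node data JSON
    return {
        strain: {attr: to_json_safe(value) for attr, value in attrs.items()}
        for strain, attrs in nodes.items()
    }
```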
It felt cleaner and more inspectable for the script to explicitly write out each entry and flag the strains that have missing data.
I'm talking about the

```python
else:
    metadata[args.metadata_clade_attribute] = math.nan
```

case where all the strains have missing data, though. (Am I being thick and missing something?)
I was thinking that this URL fetch would potentially fail transiently when the workflow is run. I didn't want a single failure in one of the 72 Snakemake endpoints to blow up the entire workflow on S3. I'm okay with this (presumably rare) situation resulting in a live dataset that is just completely gray for the MLR fitness coloring.
This felt like the most graceful way for the script to fail.
Ok. The downside is that failure is silent. Our workflows tend to use loud noisy failures, so this is a departure. That's fine, to be clear, but notable. When introducing non-fatal failures that can be ignored/moved past, it's operationally useful to have a way to still surface those as prominent warnings and/or put the monitoring in a "degraded" status. (Historically we've used Slack for the former and don't have the infra in place for the latter.)
In any case, this feels like the weeds for this PR, and I didn't intend to wade into actual code review here. Please do not consider anything I'm saying as a blocker.
> Ok. The downside is that failure is silent. Our workflows tend to use loud noisy failures, so this is a departure. That's fine, to be clear, but notable. When introducing non-fatal failures that can be ignored/moved past, it's operationally useful to have a way to still surface those as prominent warnings and/or put the monitoring in a "degraded" status.
This is me not knowing how to use Snakemake. Is there a way for the rule `mlr_lineage_fitness` to emit an error, still produce the `mlr_lineage_fitness.json` file, and then for the dataset to be produced?
Or I don't understand how to "still surface those as prominent warnings" in a way that is different from

```python
except Exception as e:
    print(f"Error fetching the JSON file: {e}", file=sys.stderr)
    return None
```
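For what it's worth, one pattern that keeps the Snakemake rule green while making the failure visible is to always write the output file and reserve stderr for a loud, greppable warning. A hypothetical sketch (the function name and message format are illustrative, not the script's actual interface):

```python
import json
import sys

def write_node_data(path, nodes, error=None):
    # Always produce the node data JSON so downstream rules can run; if the
    # fetch failed, emit a prominent warning and an empty coloring instead
    # of crashing the workflow.
    if error is not None:
        print(f"WARNING: mlr_lineage_fitness unavailable ({error}); "
              "emitting an empty coloring.", file=sys.stderr)
        nodes = {}
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"nodes": nodes}, fh, indent=2)
```

A log-scanning step or Slack notifier could then watch for `WARNING:` lines to flag the run as degraded without failing it.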
Okay. f088d49 fixes the NaN vs null issue. I didn't realize that this was how Pandas worked. Thanks for the pointer.
@tsibley: Please let me know if you have a specific change you'd recommend for the error / warning issue. I'm confused on what you're currently advocating. Otherwise, I'll plan to merge.
I'm going to go ahead with merging this now. @tsibley: if you have recommended changes I'm happy to make these on a subsequent branch. I'll leave this thread open.
jameshadfield
left a comment
+1 for this direction
[beyond this PR] we should work out a way to add provenance information to the auspice JSON which points to a full summary of the information & methods used to generate the models. Right now the closest we have (I think) is the updated date in the MLR JSON. By provenance I mean evofr version, versions / date cutoffs for the input files, config YAML etc, basically all the information one would need to reproduce the JSON. It wouldn't be a bad idea to sketch out how this could work for other parts of the auspice JSON, perhaps starting with including the git commit of the workflow.
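As a thought experiment, such a provenance block might be assembled like this. Every key name below is entirely hypothetical; there is no agreed Auspice schema for this yet:

```python
def provenance_block(workflow_commit, evofr_version, mlr_updated, config_path):
    # Hypothetical shape for provenance metadata to embed in an Auspice JSON.
    # All keys here are illustrative only.
    return {
        "provenance": {
            "workflow_commit": workflow_commit,   # git commit of the workflow
            "evofr_version": evofr_version,       # model software version
            "mlr_results_updated": mlr_updated,   # updated date from the MLR JSON
            "config": config_path,                # config YAML used for the run
        }
    }
```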
The only change requested is the changelog tweak in #1169 (comment); everything else is soft suggestions. Sorry, I was looking at the wrong diff. I see it's already been added.
Small refactor of fetch_mlr_lineage_fitness script primarily to break out a `main()` function and to call this function from `if __name__ == '__main__':`
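The shape of that refactor, sketched with placeholder arguments (the `--output` flag is illustrative, not the script's real interface):

```python
import argparse
import sys

def main(argv=None):
    # Parsing arguments inside main() keeps the module importable
    # (e.g. for tests) without side effects at import time.
    parser = argparse.ArgumentParser(description="Fetch MLR lineage fitness")
    parser.add_argument("--output", required=True, help="node data JSON path")
    args = parser.parse_args(argv)
    print(f"would write {args.output}", file=sys.stderr)
    return 0

if __name__ == "__main__":
    sys.exit(main())
```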
Explicitly set these entries to Python's None rather than the NaN that Pandas wants to use. Then the resulting JSON gets values of `null` rather than `NaN`.
This includes a script to grab live estimates of MLR lineage fitness and apply these as a tip-level coloring in the tree.
This script currently matches strains to lineage fitness using https://data.nextstrain.org/files/workflows/forecasts-ncov/gisaid/pango_lineages/global/mlr/latest_results.json, which backs the live estimates on nextstrain.org/sars-cov-2/forecasts
This uses the `Nextclade_pango` metadata label to derive sequence counts from GISAID data and estimate relative growth advantages across collapsed Pango lineages. It will be most relevant for `1m`, `2m` and `6m` builds, but is not at all broken for the `all-time` builds. It would be possible to swap this to key on clade instead, but I think the greater detail of lineages is better in this case.

Here are the results from trial builds:
Coloring 6m tree by MLR lineage fitness - https://nextstrain.org/staging/ncov/gisaid/trial/mlr-lineage-fitness/global/6m?c=mlr_lineage_fitness
You can see that, as expected, high fitness strains are increasing in frequency:
We can also do fun things like compare fitness to S1 mutations - https://nextstrain.org/staging/ncov/gisaid/trial/mlr-lineage-fitness/global/6m?branches=hide&l=scatter&regression=show&scatterX=S1_mutations&scatterY=mlr_lineage_fitness
Or just plot out fitness across Pango lineages - https://nextstrain.org/staging/ncov/gisaid/trial/mlr-lineage-fitness/global/6m?branches=hide&l=scatter&regression=show&scatterX=mlr_lineage_fitness&scatterY=Nextclade_pango
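To give intuition for what the MLR coloring represents, here is a toy two-lineage calculation of a relative growth advantage from changing frequencies. This is only a back-of-the-envelope sketch, not the evofr model behind the forecasts, and the generation time is an assumed constant:

```python
import math

def pairwise_growth_advantage(freq_start, freq_end, days, generation_days=5.2):
    # Change in log-odds of the focal lineage vs. everything else, converted
    # to a per-generation fold advantage. The 5.2-day generation time is
    # illustrative, not a value from the workflow.
    delta_logit = (math.log(freq_end / (1 - freq_end))
                   - math.log(freq_start / (1 - freq_start)))
    beta = delta_logit / days  # growth rate in log-odds per day
    return math.exp(beta * generation_days)

# A lineage rising from 10% to 50% frequency over 30 days has a clear advantage.
advantage = pairwise_growth_advantage(0.10, 0.50, 30)
```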
Correspondingly, I've deprecated and removed the previous `mutational_fitness` and `logistic_growth` analyses.

The mutational fitness analysis relied on results posted to https://github.com/bkotzen/sars-cov2-modeling. However, this analysis hasn't been updated since 2024-07-22. As a result, all circulating strains in the key 6m tree have essentially identical fitness, which makes this coloring uninformative and potentially misleading. If these results become updated more frequently, we should restore this analysis. In that case, we should also contact the authors to ask for a more stable GitHub URL. The current strategy of dated analyses with no "latest" tag makes automated updates difficult.

The logistic growth analysis used frequencies of strains in the tree to estimate logistic growth rates. However, analyses at nextstrain.org/sars-cov-2/forecasts use all the available sequence data and provide a more sophisticated estimate of growth rate accounting for competition between circulating viruses. These updated results are available in the `mlr_lineage_fitness` analysis. The `mlr_lineage_fitness` results should be substantially more accurate.

I opted to take the previous scripts of `parse_mutational_fitness_tsv_into_distance_map.py` and `calculate_delta_frequency.py` and move them to `scripts/deprecated/` and additionally to just comment out the relevant rules in `main_workflow.smk`. These scripts are going to be useful for one-off scientific analyses and I don't want to have to hunt through Git to recover them.

Trial builds completed successfully.