Track intermediate candidates and evaluation scores in GEPA optimizer #20198
Conversation
🛠 DevTools 🛠
Install mlflow from this PR
For Databricks, use the following command:
@chenmoneygithub Thank you for the contribution! Could you fix the following issue(s)?

⚠ DCO check
The DCO check failed. Please sign off your commit(s) by following the instructions here. See https://github.com/mlflow/mlflow/blob/master/CONTRIBUTING.md#sign-your-work for more details.
Pull request overview
This PR adds tracking capabilities to the GEPA optimizer to monitor intermediate candidates and their evaluation scores during the optimization process. This enhancement provides visibility into the optimization workflow by logging validation candidates, aggregate scores, per-record scores, and individual scorer results.
Changes:
- Extended EvaluationResultRecord to include individual scorer results
- Modified evaluation metric functions to return individual scores alongside aggregate scores
- Implemented artifact logging for validation candidates in the GEPA optimizer
- Added comprehensive test coverage for the new tracking functionality
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| mlflow/genai/optimize/types.py | Added individual_scores field to EvaluationResultRecord with default factory |
| mlflow/genai/optimize/util.py | Updated create_metric_from_scorers to return tuple with individual scores |
| mlflow/genai/optimize/optimize.py | Modified _build_eval_fn to unpack and pass individual scores from metric function |
| mlflow/genai/optimize/optimizers/gepa_optimizer.py | Added tracking logic in MlflowGEPAAdapter with _log_validation_candidate method for artifact logging |
| tests/genai/optimize/optimizers/test_gepa_optimizer.py | Added comprehensive test for prompt candidate logging functionality |
Documentation preview for 9f524f0 is available at: More info
mlflow/genai/optimize/util.py
Outdated
```python
# Log per-scorer metrics (for API response parsing)
if output.initial_eval_score_per_scorer:
    for scorer_name, score in output.initial_eval_score_per_scorer.items():
        mlflow.log_metric(f"initial_eval_score.{scorer_name}", score)
```
Let's use log_metrics for efficient logging
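A minimal sketch of the suggested change: instead of calling `mlflow.log_metric` once per scorer, build a single dict and pass it to `mlflow.log_metrics`, which logs all values in one batch. The scorer names and scores below are illustrative, not from the PR.

```python
# Batched alternative to per-scorer mlflow.log_metric calls: build one
# dict of namespaced metric names and log it in a single call.
def build_metric_batch(per_scorer_scores: dict[str, float]) -> dict[str, float]:
    """Namespace each scorer's score under the initial_eval_score prefix."""
    return {
        f"initial_eval_score.{name}": score
        for name, score in per_scorer_scores.items()
    }

batch = build_metric_batch({"correctness": 0.9, "fluency": 0.75})
# mlflow.log_metrics(batch)  # one logging call instead of len(batch) calls
```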
```diff
    batch: List of data instances to evaluate
    candidate: Proposed text components (prompts)
-   capture_traces: Whether to capture execution traces
+   capture_traces: Whether to capture execution traces.
```
Can we either not add this period, or add periods to all args for consistency?
```python
outputs = [result.outputs for result in eval_results]
scores = [result.score for result in eval_results]
trajectories = eval_results if capture_traces else None
objective_scores = [result.individual_scores for result in eval_results]
```
q: is this parameter present for all supported versions?
I will bump the GEPA requirement to >=0.0.26; a critical bug was fixed in gepa-ai/gepa#171 and released with 0.0.26. So yes, the parameter is available for all valid gepa versions.
```python
        candidate: The candidate prompts being validated
        eval_results: Evaluation results containing scores
    """
    import mlflow
```
nit: Can we move this to the module level?
```python
    """
    import mlflow

    active_run = mlflow.active_run()
```
any reason not to use self.tracking_enabled?
good call, changed
```python
# Compute per-scorer average scores
scorer_names = set()
for result in eval_results:
    scorer_names.update(result.individual_scores.keys())
```
nit: we can use scorer_names |= result.individual_scores.keys()
gotcha, changed
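The two forms are equivalent: `dict.keys()` returns a set-like view, so the in-place union operator `|=` works on a set without an explicit `.update()` call. A toy sketch with plain dicts standing in for the real eval-result objects:

```python
# Collecting all scorer names seen across results, using the |= operator
# the reviewer suggested instead of set.update().
eval_results = [
    {"correctness": 0.9, "fluency": 0.8},
    {"correctness": 0.7, "safety": 1.0},
]

scorer_names = set()
for individual_scores in eval_results:
    scorer_names |= individual_scores.keys()
```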
```python
# Build the evaluation results table and log to MLflow as a table artifact
eval_results_table = {
    "inputs": [json.dumps(r.inputs) for r in eval_results],
```
Do we need json.dumps here? Isn't this handled by mlflow.log_table?
good call, I wasn't aware of that part, removed!
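A sketch of the simplification: `mlflow.log_table` accepts a dict of columns (or a pandas DataFrame) and handles serialization itself, so the inputs column can hold the raw dicts rather than pre-serialized `json.dumps` strings. The rows below are illustrative.

```python
# Build the table columns from raw values; no manual JSON encoding needed
# before handing them to mlflow.log_table.
eval_results = [
    {"inputs": {"question": "What is MLflow?"}, "score": 1.0},
    {"inputs": {"question": "What is GEPA?"}, "score": 0.5},
]

eval_results_table = {
    "inputs": [r["inputs"] for r in eval_results],  # raw dicts, no json.dumps
    "score": [r["score"] for r in eval_results],
}
# mlflow.log_table(data=eval_results_table, artifact_file="eval_results.json")
```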
```python
for result in eval_results:
    scorer_names.update(result.individual_scores.keys())

per_scorer_scores = {}
```
Can we move this logic below so that the variable generation logic is closer to where it's used?
```python
prompt_path = tmp_path / f"{prompt_name}.txt"
with open(prompt_path, "w") as f:
    f.write(prompt_text)
mlflow.log_artifact(str(prompt_path), artifact_path=iteration_dir)
```
Can't we pass Path to mlflow.log_artifact?
good call, AI-redundant code, changed!
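A small sketch of the pathlib-only version: `Path.write_text` replaces the manual open/write pair, and (per the review exchange) the `Path` object can then be handed to `mlflow.log_artifact` without wrapping it in `str()`. File and prompt names are illustrative.

```python
# Write a prompt candidate with pathlib instead of open()/write().
from pathlib import Path
import tempfile

tmp_path = Path(tempfile.mkdtemp())
prompt_path = tmp_path / "system_prompt.txt"
prompt_path.write_text("You are a helpful assistant.")
# mlflow.log_artifact(prompt_path, artifact_path="prompt_candidates/iteration_0")
```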
```python
    r.individual_scores.get(scorer_name) for r in eval_results
]

iteration_dir = f"prompt_candidates/iteration_{iteration}"
```
I guess these path names will be used for returning the optimization result from the REST API, so shall we extract these path/file names as consts?
good call, the new commit defines a few constants for reuse.
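A sketch of the extracted constants: module-level names shared between the optimizer's logging code and any REST-API code that reads the artifacts back, so the directory layout is defined in exactly one place. The names below are illustrative, not the ones in the commit.

```python
# Single source of truth for candidate artifact paths.
PROMPT_CANDIDATES_DIR = "prompt_candidates"
ITERATION_DIR_TEMPLATE = "iteration_{iteration}"

def candidate_artifact_dir(iteration: int) -> str:
    """Build the artifact directory for one optimization iteration."""
    subdir = ITERATION_DIR_TEMPLATE.format(iteration=iteration)
    return f"{PROMPT_CANDIDATES_DIR}/{subdir}"
```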
```python
score: float | None
trace: Trace
rationales: dict[str, str]
individual_scores: dict[str, float] = field(default_factory=dict)
```
Per-scorer scores are not necessarily numeric. Please refer to Scorer.__call__
Yes, that's a good callout. I am now planning to only allow numeric scorers: #20197 (comment). Let's discuss!
```python
    if scorer_name in r.individual_scores
]
if scores:
    per_scorer_scores[scorer_name] = sum(scores) / len(scores)
```
I think individual scorers cannot always be aggregated using sum. Please see my comment below.
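A sketch of the guarded aggregation the comment points toward: average only the scores that are actually numeric, since a scorer may also return strings or Feedback-style objects. The filtering predicate and sample data are illustrative.

```python
# Average per-scorer scores, skipping non-numeric values rather than
# summing them blindly.
def per_scorer_means(results, scorer_names):
    means = {}
    for name in scorer_names:
        numeric = [
            r[name] for r in results
            if name in r
            and isinstance(r[name], (int, float))
            and not isinstance(r[name], bool)
        ]
        if numeric:
            means[name] = sum(numeric) / len(numeric)
    return means

results = [{"correctness": 1.0, "style": "concise"}, {"correctness": 0.5}]
means = per_scorer_means(results, {"correctness", "style"})
```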
```python
# Log scores summary as JSON artifact
scores_data = {
    "aggregate": aggregate_score,
    "per_scorer": per_scorer_scores,
```
I wonder if we should log metrics for each scorer name where possible. This enables users to see the time progression of each scorer result.
yes good idea, done!
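A sketch of the time-series idea: logging each scorer under its own metric name with the iteration as the step lets the MLflow UI chart how each scorer's result progresses across candidates. The metric prefix and helper are illustrative, not the names used in the commit.

```python
# Produce (metric_name, value, step) triples suitable for
# mlflow.log_metric(name, value, step=step), one series per scorer.
def scorer_metric_entries(per_scorer_scores, iteration):
    """Yield one metric entry per scorer for this iteration."""
    for name, score in per_scorer_scores.items():
        yield (f"eval_score.{name}", score, iteration)

entries = list(scorer_metric_entries({"correctness": 0.8}, iteration=2))
# for name, value, step in entries:
#     mlflow.log_metric(name, value, step=step)
```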
TomeHirata left a comment:
LGTM if we support string output in the follow-up PR
Related Issues/PRs
#xxx

What changes are proposed in this pull request?
Track intermediate candidates and evaluation scores in the GEPA optimizer, which is useful for tracking the GEPA optimization process. See the screenshot below for a sample:
We are tracking 3 things for each candidate that hits the full validation:
Please note that we don't track candidates that don't hit full validation, i.e., those not added to the Pareto frontier of the GEPA optimizer.
How is this PR tested?
Does this PR require documentation update?
Release Notes
Is this a user-facing change?
What component(s), interfaces, languages, and integrations does this PR affect?
Components
- area/tracking: Tracking Service, tracking client APIs, autologging
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
- area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
- area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
- area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
- area/projects: MLproject format, project running backends
- area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- area/build: Build and test infrastructure for MLflow
- area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:
- rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
- rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
- rn/feature - A new user-facing feature worth mentioning in the release notes
- rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
- rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?
- Yes should be selected for bug fixes, documentation updates, and other small changes.
- No should be selected for new features and larger changes.

If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
Bug fixes, doc updates and new features usually go into minor releases.
Bug fixes and doc updates usually go into patch releases.