[MLflow Demo] Add Eval simulation data#20046
Conversation
Pull request overview
This PR adds evaluation simulation data to the MLflow Demo feature, building upon the traces generation framework established in previous PRs. It simulates a progressive improvement workflow by creating evaluation datasets and runs for both baseline (v1) and improved (v2) agent outputs.
Changes:
- Adds `EvaluationDemoGenerator` to create evaluation datasets and runs based on trace versions
- Implements deterministic pass/fail scorers that simulate LLM judges with reproducible results
- Adds comprehensive test coverage for evaluation generation and integration tests
Reviewed changes
Copilot reviewed 16 out of 17 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| mlflow/demo/generators/evaluation.py | Core evaluation generator with dataset creation and scoring logic |
| mlflow/demo/data.py | Demo trace data definitions with v1/v2 responses and expected answers |
| mlflow/demo/generators/__init__.py | Registers EvaluationDemoGenerator with the demo registry |
| tests/demo/test_evaluation_generator.py | Unit tests for evaluation generator functionality |
| tests/demo/test_demo_integration.py | Integration tests for full demo data lifecycle |
| .github/workflows/master.yml | Adds dedicated CI job for demo tests |
mlflow/demo/generators/evaluation.py
Outdated
```python
def _create_pass_fail_scorer(
    name: str,
    pass_rate: float,
    rationale_fn,
```

The `rationale_fn` parameter lacks a type hint. Consider adding `Callable[[bool], str]` to improve type safety and code clarity.
mlflow/demo/data.py
Outdated
```python
tools=[
    ToolCall(
        name="web_search",
        input={"query": "MLflow latest release 2024"},
```

The hardcoded year '2024' in the search query will become outdated. Consider using a more generic query or documenting that this is intentionally historical demo data.

Suggested change:

```diff
-        input={"query": "MLflow latest release 2024"},
+        input={"query": "MLflow latest release"},
```
Force-pushed 5f9860e to 1e789e5
mlflow/demo/generators/traces.py
Outdated
```diff
 name = DemoFeature.TRACES
-version = 1
+version = 2  # Bumped for timestamp and token count changes
```

What's this change for?

Legacy artifact - fixed in a later branch, just missed it in this one. Will address!
```python
traces_generator = TracesDemoGenerator()
if not traces_generator.is_generated():
    traces_generator.generate()
    traces_generator.store_version()
```

Should we invoke this inside `generate` instead?

The dependency check is at the top of `generate()` intentionally. This ensures traces exist before evaluation runs. Moving it elsewhere would separate the dependency logic from where it's used.
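The pattern described in that reply can be sketched with minimal stand-in classes; the real generators carry more state, and everything here beyond the four calls shown in the diff is a placeholder.

```python
class TracesDemoGenerator:
    """Minimal stand-in for the real traces generator."""

    _generated = False  # class-level flag, purely for illustration

    def is_generated(self) -> bool:
        return TracesDemoGenerator._generated

    def generate(self) -> None:
        TracesDemoGenerator._generated = True

    def store_version(self) -> None:
        pass


class EvaluationDemoGenerator:
    def generate(self) -> None:
        # The dependency check lives at the top of generate(): traces are
        # guaranteed to exist before any evaluation data is produced.
        traces_generator = TracesDemoGenerator()
        if not traces_generator.is_generated():
            traces_generator.generate()
            traces_generator.store_version()
        # ... evaluation dataset/run creation would follow here ...
```

Keeping the check inside `generate()` means a caller can invoke the evaluation generator alone and still get a consistent result.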
mlflow/demo/generators/evaluation.py
Outdated
```python
experiment = mlflow.get_experiment_by_name(DEMO_EXPERIMENT_NAME)
if experiment is None:
    raise RuntimeError("Demo experiment not found after trace generation")
```

Do we need this check? Lines 127-129 should already cover it?

Good point - no need to add extra defensive checks. Will remove!
mlflow/demo/generators/evaluation.py
Outdated
```python
client = mlflow.MlflowClient()
all_traces = client.search_traces(
```

Suggested change:

```diff
-client = mlflow.MlflowClient()
-all_traces = client.search_traces(
+all_traces = mlflow.search_traces(
```

Agreed - will move to fluent APIs for consistency
mlflow/demo/generators/evaluation.py
Outdated
```python
    locations=[experiment_id],
    max_results=100,
)
return [t for t in all_traces if t.info.trace_metadata.get(DEMO_VERSION_TAG) == version]
```

Could we use `filter_string` in `search_traces` instead of manual filtering here?
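If that suggestion is adopted, a small helper could push the version filter into the backend instead of filtering in Python. The exact metadata filter grammar (backtick-quoted keys, `metadata.` prefix) is an assumption and should be checked against the `search_traces` documentation before use.

```python
def demo_version_filter(tag_key: str, version: str) -> str:
    """Build a filter_string for search_traces instead of filtering in Python.

    Hypothetical usage (not executed here):
        mlflow.search_traces(filter_string=demo_version_filter(DEMO_VERSION_TAG, "1"))
    """
    # Backtick-quote the key since demo metadata keys typically contain dots.
    return f"metadata.`{tag_key}` = '{version}'"
```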
```python
mlflow.set_experiment(experiment_id=experiment_id)
```

Could we set it once in initialization?

Same as in the earlier PR - this is intentional to have here.
```python
client = mlflow.MlflowClient()
client.set_tag(result.run_id, "mlflow.runName", run_name)
client.log_param(result.run_id, "scorer_version", scorer_version)
client.log_param(result.run_id, "description", description)
client.log_param(result.run_id, "demo", "true")
```

Could we use fluent APIs?

Actually, do we need to log these?

It's purely for identifying that this is demo data as a visual clue to users. Will move to fluent APIs, though!
mlflow/demo/generators/evaluation.py
Outdated
```python
entities = [
    f"eval_runs:{len(run_ids)}",
    f"expectations:{expectation_count}",
    f"feedback:{total_feedback}",
    f"v1_dataset_records:{v1_dataset_count}",
    f"v2_dataset_records:{v2_dataset_count}",
]
```

Not explicitly. It's purely for debugging and maintainability. I'll remove.
```python
    _logger.debug("Failed to check if evaluation demo exists", exc_info=True)
    return False


def delete_demo(self) -> None:
```

Do we need to support this? IMO users shouldn't delete the demo data, and it should be pre-generated for users; if they really don't like it, they can delete the experiment.

Yes. `delete_demo()` is called automatically on version mismatch to clean up stale data before regeneration. It's also used by the Settings page "Clear demo data" feature for users who want to remove demo data from their server.
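The regeneration flow that reply describes can be sketched as follows; only `delete_demo` appears in the diff, so every other method and attribute below is hypothetical.

```python
class DemoGenerator:
    """Illustrative lifecycle: stale demo data is cleared on version mismatch."""

    version = 2  # current generator version
    _stored_version = 1  # pretend the server still holds v1 demo data
    deleted = False

    def stored_version(self) -> int:
        return self._stored_version

    def delete_demo(self) -> None:
        # Also reachable from the Settings page "Clear demo data" action.
        self.deleted = True

    def regenerate(self) -> None:
        self._stored_version = self.version

    def ensure_current(self) -> None:
        # On version mismatch, delete stale demo data before regenerating.
        if self.stored_version() != self.version:
            self.delete_demo()
            self.regenerate()
```

This is why the version bump seen earlier in `traces.py` matters: it is the trigger that routes existing installations through `delete_demo()`.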
tests/demo/conftest.py
Outdated
```python
def tracking_uri(tmp_path):
    uri = f"sqlite:///{tmp_path / 'mlflow.db'}"
    mlflow.set_tracking_uri(uri)
    yield uri
    mlflow.set_tracking_uri(None)
```

The default `tracking_uri_mock` fixture in tests/conftest doesn't work?

Oh, great point! Thanks for the reminder :) Will update to use the standard!
Force-pushed 1e789e5 to fd5015b
Force-pushed 4c0def3 to 5c66c66
Force-pushed 35cb050 to 3e033ad
Force-pushed 3e033ad to 9016f7f
mlflow/demo/generators/evaluation.py
Outdated
```python
    return expectation_count


def _find_expected_answer(self, query: str) -> str | None:
    expected_answers = EXPECTED_ANSWERS
```

nit: Do we need this assignment?

Nah, I forgot to clean that up when moving to the constant and just replaced it. Fixed!
Force-pushed 9016f7f to 380849b
…plates

- Add timestamp distribution across 7 days (v1 in days 0-3.5, v2 in days 3.5-7)
- Add token count estimation as span attributes (SpanAttributeKey.CHAT_USAGE)
- Add prompt-based traces with template rendering and variables
- Restructure sessions to have 2-4 turns each across 3 sessions
- Fix trace metadata by using InMemoryTraceManager directly
- Update test expectations for new trace counts (34 total: 4 RAG, 4 agent, 12 prompt, 14 session)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
Force-pushed 380849b to 97f9646
🥞 Stacked PR
Related Issues/PRs
#xxx

What changes are proposed in this pull request?
Adds evaluation dataset creation based on the iterative "trace versions" to simulate progressive improvement of GenAI applications.
Adds evaluation run simulation of the "before" and "after" states by running evaluation through surrogate built-in scorers.
How is this PR tested?
Does this PR require documentation update?
Release Notes
Is this a user-facing change?
What component(s), interfaces, languages, and integrations does this PR affect?
Components
- area/tracking: Tracking Service, tracking client APIs, autologging
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
- area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
- area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
- area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
- area/projects: MLproject format, project running backends
- area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- area/build: Build and test infrastructure for MLflow
- area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:
- rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
- rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
- rn/feature - A new user-facing feature worth mentioning in the release notes
- rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
- rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?
"Yes" should be selected for bug fixes, documentation updates, and other small changes. "No" should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
Bug fixes, doc updates and new features usually go into minor releases.
Bug fixes and doc updates usually go into patch releases.