
[MLflow Demo] Add Eval simulation data#20046

Merged
BenWilson2 merged 2 commits into mlflow:master from BenWilson2:stack/demo/eval
Jan 29, 2026

Conversation

@BenWilson2 BenWilson2 (Member) commented Jan 16, 2026

🥞 Stacked PR

Use this link to review incremental changes.


Related Issues/PRs

#xxx

What changes are proposed in this pull request?

Adds evaluation dataset creation based on the iterative "trace versions" to simulate progressive improvement of GenAI applications.
Adds simulated "before" and "after" evaluation runs, executed through surrogate built-in scorers.


How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/tracking: Tracking Service, tracking client APIs, autologging
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
  • area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
  • area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
  • area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
  • area/projects: MLproject format, project running backends
  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
  • Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
    Bug fixes, doc updates and new features usually go into minor releases.
  • Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
    Bug fixes and doc updates usually go into patch releases.
  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

@BenWilson2 BenWilson2 changed the title from "add eval generation" to "[MLflow Demo] Add Eval simulation data" Jan 16, 2026
@BenWilson2 BenWilson2 marked this pull request as ready for review January 16, 2026 01:37
Copilot AI review requested due to automatic review settings January 16, 2026 01:37
@github-actions (Contributor)

🛠 DevTools 🛠

Install mlflow from this PR

# mlflow
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/20046/merge
# mlflow-skinny
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/20046/merge#subdirectory=libs/skinny

For Databricks, use the following command:

%sh curl -LsSf https://raw.githubusercontent.com/mlflow/mlflow/HEAD/dev/install-skinny.sh | sh -s pull/20046/merge

@github-actions github-actions bot added the area/tracking (Tracking service, tracking client APIs, autologging) and rn/feature (Mention under Features in Changelogs) labels Jan 16, 2026
Copilot AI (Contributor) left a comment

Pull request overview

This PR adds evaluation simulation data to the MLflow Demo feature, building upon the traces generation framework established in previous PRs. It simulates a progressive improvement workflow by creating evaluation datasets and runs for both baseline (v1) and improved (v2) agent outputs.

Changes:

  • Adds EvaluationDemoGenerator to create evaluation datasets and runs based on trace versions
  • Implements deterministic pass/fail scorers that simulate LLM judges with reproducible results
  • Adds comprehensive test coverage for evaluation generation and integration tests

Reviewed changes

Copilot reviewed 16 out of 17 changed files in this pull request and generated 2 comments.

Summary per file:
  • mlflow/demo/generators/evaluation.py: Core evaluation generator with dataset creation and scoring logic
  • mlflow/demo/data.py: Demo trace data definitions with v1/v2 responses and expected answers
  • mlflow/demo/generators/__init__.py: Registers EvaluationDemoGenerator with the demo registry
  • tests/demo/test_evaluation_generator.py: Unit tests for evaluation generator functionality
  • tests/demo/test_demo_integration.py: Integration tests for the full demo data lifecycle
  • .github/workflows/master.yml: Adds a dedicated CI job for demo tests


def _create_pass_fail_scorer(
name: str,
pass_rate: float,
rationale_fn,
Copilot AI Jan 16, 2026
The rationale_fn parameter lacks a type hint. Consider adding Callable[[bool], str] to improve type safety and code clarity.

tools=[
ToolCall(
name="web_search",
input={"query": "MLflow latest release 2024"},
Copilot AI Jan 16, 2026

The hardcoded year '2024' in the search query will become outdated. Consider using a more generic query or documenting that this is intentionally historical demo data.

Suggested change
input={"query": "MLflow latest release 2024"},
input={"query": "MLflow latest release"},


github-actions bot commented Jan 16, 2026

Documentation preview for 97f9646 is available at:


@BenWilson2 BenWilson2 force-pushed the stack/demo/eval branch 2 times, most recently from 5f9860e to 1e789e5, on January 17, 2026 04:24
@BenWilson2 BenWilson2 added the team-review (Trigger a team review request) label Jan 18, 2026

name = DemoFeature.TRACES
version = 1
version = 2 # Bumped for timestamp and token count changes
Collaborator

What's this change for?

Member Author

Legacy artifact - fixed in a later branch, just missed it in this one. Will address!

traces_generator = TracesDemoGenerator()
if not traces_generator.is_generated():
traces_generator.generate()
traces_generator.store_version()
Collaborator

Should we invoke this inside generate instead?

Member Author

The dependency check is at the top of generate() intentionally. This ensures traces exist before evaluation runs. Moving it elsewhere would separate the dependency logic from where it's used.
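As a rough illustration of the pattern described in this reply, with `TracesDemoGenerator` stubbed out and the evaluation logic elided (method names are taken from the diff above; the class-level flag is a stand-in for real persistence):

```python
class TracesDemoGenerator:
    """Stand-in for the real trace generator (method names from the diff)."""

    _generated = False  # class-level flag, for this sketch only

    def is_generated(self) -> bool:
        return TracesDemoGenerator._generated

    def generate(self) -> None:
        TracesDemoGenerator._generated = True

    def store_version(self) -> None:
        pass


class EvaluationDemoGenerator:
    def generate(self) -> None:
        # Dependency check at the top of generate(): traces must exist
        # before evaluation runs, and the check lives next to its use.
        traces = TracesDemoGenerator()
        if not traces.is_generated():
            traces.generate()
            traces.store_version()
        # ... create evaluation datasets and runs here ...
```

A second call to `generate()` finds the traces already present and skips regeneration, which keeps the demo setup idempotent.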

Comment on lines +131 to +133
experiment = mlflow.get_experiment_by_name(DEMO_EXPERIMENT_NAME)
if experiment is None:
raise RuntimeError("Demo experiment not found after trace generation")
Collaborator

Do we need this check? line 127-129 should already cover it?

Member Author

Good point - no need to add extra defensive checks. will remove!

Comment on lines +234 to +235
client = mlflow.MlflowClient()
all_traces = client.search_traces(
Collaborator

Suggested change
client = mlflow.MlflowClient()
all_traces = client.search_traces(
all_traces = mlflow.search_traces(

Member Author

Agreed - will move to fluent APIs for consistency

locations=[experiment_id],
max_results=100,
)
return [t for t in all_traces if t.info.trace_metadata.get(DEMO_VERSION_TAG) == version]
Collaborator

Could we use filter_string in search_traces instead of manual filtering here?

),
]

mlflow.set_experiment(experiment_id=experiment_id)
Collaborator

Could we set it once in initialization?

Member Author

Same as in the earlier PR - this is intentional to have here.

Comment on lines +348 to +352
client = mlflow.MlflowClient()
client.set_tag(result.run_id, "mlflow.runName", run_name)
client.log_param(result.run_id, "scorer_version", scorer_version)
client.log_param(result.run_id, "description", description)
client.log_param(result.run_id, "demo", "true")
Collaborator

Could we use fluent APIs?

Collaborator

Actually do we need to log these?

Member Author

It's purely to identify this as demo data, as a visual cue for users. Will move to fluent APIs, though!

Comment on lines +179 to +185
entities = [
f"eval_runs:{len(run_ids)}",
f"expectations:{expectation_count}",
f"feedback:{total_feedback}",
f"v1_dataset_records:{v1_dataset_count}",
f"v2_dataset_records:{v2_dataset_count}",
]
Collaborator

Do we need this?

Member Author

Not explicitly. It's purely a debugging aid for maintainability. I'll remove.

_logger.debug("Failed to check if evaluation demo exists", exc_info=True)
return False

def delete_demo(self) -> None:
Collaborator

Do we need to support this? IMO users shouldn't delete the demo data and it should be pre-generated for users, if they really don't like it they can delete the experiment

Member Author

Yes. delete_demo() is called automatically on version mismatch to clean up stale data before regeneration. It's also used by the Settings page 'Clear demo data' feature for users who want to remove demo data from their server.

Comment on lines +78 to +82
def tracking_uri(tmp_path):
uri = f"sqlite:///{tmp_path / 'mlflow.db'}"
mlflow.set_tracking_uri(uri)
yield uri
mlflow.set_tracking_uri(None)
Collaborator

The default tracking_uri_mock fixture in tests/conftest doesn't work?

Member Author

Oh, great point! Thanks for the reminder :) Will update to use the standard!

@B-Step62 B-Step62 (Collaborator) left a comment

LGTM

return expectation_count

def _find_expected_answer(self, query: str) -> str | None:
expected_answers = EXPECTED_ANSWERS
Collaborator

nit: Do we need this assignment?

Member Author

nah, I forgot to clean that up when moving to the constant and just replaced it. Fixed!

BenWilson2 and others added 2 commits January 29, 2026 15:37
…plates

- Add timestamp distribution across 7 days (v1 in days 0-3.5, v2 in days 3.5-7)
- Add token count estimation as span attributes (SpanAttributeKey.CHAT_USAGE)
- Add prompt-based traces with template rendering and variables
- Restructure sessions to have 2-4 turns each across 3 sessions
- Fix trace metadata by using InMemoryTraceManager directly
- Update test expectations for new trace counts (34 total: 4 RAG, 4 agent, 12 prompt, 14 session)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
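The timestamp split in the first commit bullet (v1 traces in days 0-3.5, v2 traces in days 3.5-7) could be sketched roughly as follows; `distribute_timestamps` and the even spacing within each half are hypothetical, for illustration only:

```python
from datetime import datetime, timedelta


def distribute_timestamps(n_v1: int, n_v2: int, start: datetime) -> list[datetime]:
    """Spread v1 traces over days 0-3.5 and v2 traces over days 3.5-7,
    evenly spaced within each half of the window (a sketch, not the
    actual implementation in this PR)."""
    half = timedelta(days=3.5)
    # v1 timestamps fill the first half of the 7-day window...
    v1 = [start + half * (i / n_v1) for i in range(n_v1)]
    # ...and v2 timestamps fill the second half, starting at day 3.5.
    v2 = [start + half + half * (i / n_v2) for i in range(n_v2)]
    return v1 + v2
```

Distributing timestamps this way makes the "before" and "after" versions visually separable on any time-series chart in the UI.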
@BenWilson2 BenWilson2 added this pull request to the merge queue Jan 29, 2026
Merged via the queue into mlflow:master with commit 2d59c3d Jan 29, 2026
50 checks passed
@BenWilson2 BenWilson2 deleted the stack/demo/eval branch January 29, 2026 21:53

Labels

  • area/tracking: Tracking service, tracking client APIs, autologging
  • rn/feature: Mention under Features in Changelogs
  • team-review: Trigger a team review request


4 participants