
[MLflow Demo] Add Eval simulation data#20046

Merged
BenWilson2 merged 2 commits into mlflow:master from BenWilson2:stack/demo/eval
Jan 29, 2026

Conversation

@BenWilson2 BenWilson2 (Member) commented Jan 16, 2026

🥞 Stacked PR

Use this link to review incremental changes.


Related Issues/PRs

#xxx

What changes are proposed in this pull request?

Adds evaluation dataset creation based on the iterative "trace versions" to simulate progressive improvement of GenAI applications.
Adds simulated "before" and "after" evaluation runs, executed through surrogate built-in scorers.


How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/tracking: Tracking Service, tracking client APIs, autologging
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
  • area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
  • area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
  • area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
  • area/projects: MLproject format, project running backends
  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
  • Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
    Bug fixes, doc updates and new features usually go into minor releases.
  • Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
    Bug fixes and doc updates usually go into patch releases.
  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

@BenWilson2 BenWilson2 changed the title from "add eval generation" to "[MLflow Demo] Add Eval simulation data" Jan 16, 2026
@BenWilson2 BenWilson2 marked this pull request as ready for review January 16, 2026 01:37
Copilot AI review requested due to automatic review settings January 16, 2026 01:37
@github-actions (Contributor)

🛠 DevTools 🛠

Install mlflow from this PR

# mlflow
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/20046/merge
# mlflow-skinny
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/20046/merge#subdirectory=libs/skinny

For Databricks, use the following command:

%sh curl -LsSf https://raw.githubusercontent.com/mlflow/mlflow/HEAD/dev/install-skinny.sh | sh -s pull/20046/merge

@github-actions github-actions bot added the area/tracking (Tracking service, tracking client APIs, autologging) and rn/feature (Mention under Features in Changelogs) labels Jan 16, 2026
Copilot AI (Contributor) left a comment

Pull request overview

This PR adds evaluation simulation data to the MLflow Demo feature, building upon the traces generation framework established in previous PRs. It simulates a progressive improvement workflow by creating evaluation datasets and runs for both baseline (v1) and improved (v2) agent outputs.

Changes:

  • Adds EvaluationDemoGenerator to create evaluation datasets and runs based on trace versions
  • Implements deterministic pass/fail scorers that simulate LLM judges with reproducible results
  • Adds comprehensive test coverage for evaluation generation and integration tests

Reviewed changes

Copilot reviewed 16 out of 17 changed files in this pull request and generated 2 comments.

Summary per file:
  • mlflow/demo/generators/evaluation.py: Core evaluation generator with dataset creation and scoring logic
  • mlflow/demo/data.py: Demo trace data definitions with v1/v2 responses and expected answers
  • mlflow/demo/generators/__init__.py: Registers EvaluationDemoGenerator with the demo registry
  • tests/demo/test_evaluation_generator.py: Unit tests for evaluation generator functionality
  • tests/demo/test_demo_integration.py: Integration tests for the full demo data lifecycle
  • .github/workflows/master.yml: Adds a dedicated CI job for demo tests


def _create_pass_fail_scorer(
name: str,
pass_rate: float,
rationale_fn,
Copilot AI Jan 16, 2026
The rationale_fn parameter lacks a type hint. Consider adding Callable[[bool], str] to improve type safety and code clarity.

tools=[
ToolCall(
name="web_search",
input={"query": "MLflow latest release 2024"},
Copilot AI Jan 16, 2026

The hardcoded year '2024' in the search query will become outdated. Consider using a more generic query or documenting that this is intentionally historical demo data.

Suggested change
input={"query": "MLflow latest release 2024"},
input={"query": "MLflow latest release"},


github-actions bot commented Jan 16, 2026

Documentation preview for 97f9646 is available at:


@BenWilson2 BenWilson2 force-pushed the stack/demo/eval branch 2 times, most recently from 5f9860e to 1e789e5, on January 17, 2026 04:24
@BenWilson2 BenWilson2 added the team-review (Trigger a team review request) label Jan 18, 2026

name = DemoFeature.TRACES
version = 1
version = 2 # Bumped for timestamp and token count changes
Collaborator

What's this change for?

Member Author

Legacy artifact - fixed in a later branch, just missed it in this one. Will address!

traces_generator = TracesDemoGenerator()
if not traces_generator.is_generated():
traces_generator.generate()
traces_generator.store_version()
Collaborator

Should we invoke this inside generate instead?

Member Author

The dependency check is at the top of generate() intentionally. This ensures traces exist before evaluation runs. Moving it elsewhere would separate the dependency logic from where it's used.
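As a rough illustration of the pattern described in this reply, with `TracesDemoGenerator` stubbed out and the evaluation logic elided (method names are taken from the diff above; the class-level flag is a stand-in for real persistence):

```python
class TracesDemoGenerator:
    """Stand-in for the real trace generator (method names from the diff)."""

    _generated = False  # class-level flag, for this sketch only

    def is_generated(self) -> bool:
        return TracesDemoGenerator._generated

    def generate(self) -> None:
        TracesDemoGenerator._generated = True

    def store_version(self) -> None:
        pass


class EvaluationDemoGenerator:
    def generate(self) -> None:
        # Dependency check at the top of generate(): traces must exist
        # before evaluation runs, and the check lives next to its use.
        traces = TracesDemoGenerator()
        if not traces.is_generated():
            traces.generate()
            traces.store_version()
        # ... create evaluation datasets and runs here ...
```

A second call to `generate()` finds the traces already present and skips regeneration, which keeps the demo setup idempotent.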

Comment on lines +131 to +133
experiment = mlflow.get_experiment_by_name(DEMO_EXPERIMENT_NAME)
if experiment is None:
raise RuntimeError("Demo experiment not found after trace generation")
Collaborator

Do we need this check? line 127-129 should already cover it?

Member Author

Good point - no need to add extra defensive checks. will remove!

Comment on lines +234 to +235
client = mlflow.MlflowClient()
all_traces = client.search_traces(
Collaborator

Suggested change
client = mlflow.MlflowClient()
all_traces = client.search_traces(
all_traces = mlflow.search_traces(

Member Author

Agreed - will move to fluent APIs for consistency

locations=[experiment_id],
max_results=100,
)
return [t for t in all_traces if t.info.trace_metadata.get(DEMO_VERSION_TAG) == version]
Collaborator

Could we use filter_string in search_traces instead of manual filtering here?

),
]

mlflow.set_experiment(experiment_id=experiment_id)
Collaborator

Could we set it once in initialization?

Member Author

Same as in the earlier PR - this is intentional to have here.

Comment on lines +348 to +352
client = mlflow.MlflowClient()
client.set_tag(result.run_id, "mlflow.runName", run_name)
client.log_param(result.run_id, "scorer_version", scorer_version)
client.log_param(result.run_id, "description", description)
client.log_param(result.run_id, "demo", "true")
Collaborator

Could we use fluent APIs?

Collaborator

Actually do we need to log these?

Member Author

It's purely to identify this as demo data, as a visual cue for users. Will move to fluent APIs, though!

Comment on lines +179 to +185
entities = [
f"eval_runs:{len(run_ids)}",
f"expectations:{expectation_count}",
f"feedback:{total_feedback}",
f"v1_dataset_records:{v1_dataset_count}",
f"v2_dataset_records:{v2_dataset_count}",
]
Collaborator

Do we need this?

Member Author

Not explicitly. It's purely a debugging aid for maintainability. I'll remove.

_logger.debug("Failed to check if evaluation demo exists", exc_info=True)
return False

def delete_demo(self) -> None:
Collaborator

Do we need to support this? IMO users shouldn't delete the demo data and it should be pre-generated for users, if they really don't like it they can delete the experiment

Member Author

Yes. delete_demo() is called automatically on version mismatch to clean up stale data before regeneration. It's also used by the Settings page 'Clear demo data' feature for users who want to remove demo data from their server.

Comment on lines +78 to +82
def tracking_uri(tmp_path):
uri = f"sqlite:///{tmp_path / 'mlflow.db'}"
mlflow.set_tracking_uri(uri)
yield uri
mlflow.set_tracking_uri(None)
Collaborator

The default tracking_uri_mock fixture in tests/conftest doesn't work?

Member Author

Oh, great point! Thanks for the reminder :) Will update to use the standard!

@B-Step62 B-Step62 (Collaborator) left a comment

LGTM

return expectation_count

def _find_expected_answer(self, query: str) -> str | None:
expected_answers = EXPECTED_ANSWERS
Collaborator

nit: Do we need this assignment?

Member Author

nah, I forgot to clean that up when moving to the constant and just replaced it. Fixed!

BenWilson2 and others added 2 commits January 29, 2026 15:37
…plates

- Add timestamp distribution across 7 days (v1 in days 0-3.5, v2 in days 3.5-7)
- Add token count estimation as span attributes (SpanAttributeKey.CHAT_USAGE)
- Add prompt-based traces with template rendering and variables
- Restructure sessions to have 2-4 turns each across 3 sessions
- Fix trace metadata by using InMemoryTraceManager directly
- Update test expectations for new trace counts (34 total: 4 RAG, 4 agent, 12 prompt, 14 session)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
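The timestamp split in the first commit bullet (v1 traces in days 0-3.5, v2 traces in days 3.5-7) could be sketched roughly as follows; `distribute_timestamps` and the even spacing within each half are hypothetical, for illustration only:

```python
from datetime import datetime, timedelta


def distribute_timestamps(n_v1: int, n_v2: int, start: datetime) -> list[datetime]:
    """Spread v1 traces over days 0-3.5 and v2 traces over days 3.5-7,
    evenly spaced within each half of the window (a sketch, not the
    actual implementation in this PR)."""
    half = timedelta(days=3.5)
    # v1 timestamps fill the first half of the 7-day window...
    v1 = [start + half * (i / n_v1) for i in range(n_v1)]
    # ...and v2 timestamps fill the second half, starting at day 3.5.
    v2 = [start + half + half * (i / n_v2) for i in range(n_v2)]
    return v1 + v2
```

Distributing timestamps this way makes the "before" and "after" versions visually separable on any time-series chart in the UI.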
@BenWilson2 BenWilson2 added this pull request to the merge queue Jan 29, 2026
Merged via the queue into mlflow:master with commit 2d59c3d Jan 29, 2026
50 checks passed
@BenWilson2 BenWilson2 deleted the stack/demo/eval branch January 29, 2026 21:53

Labels

  • area/tracking: Tracking service, tracking client APIs, autologging
  • rn/feature: Mention under Features in Changelogs
  • team-review: Trigger a team review request


4 participants