
Cleanup evaluation datasets docs#18766

Merged
BenWilson2 merged 5 commits into mlflow:master from BenWilson2:cleanup-datasets-docs
Nov 12, 2025

Conversation

@BenWilson2
Member

@BenWilson2 BenWilson2 commented Nov 10, 2025

🛠 DevTools 🛠

Open in GitHub Codespaces

Install mlflow from this PR

# mlflow
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/18766/merge
# mlflow-skinny
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/18766/merge#subdirectory=libs/skinny

For Databricks, use the following command:

%sh curl -LsSf https://raw.githubusercontent.com/mlflow/mlflow/HEAD/dev/install-skinny.sh | sh -s pull/18766/merge

Related Issues/PRs

#xxx

What changes are proposed in this pull request?

Remove a bunch of redundant information, add in a video of Datasets in the UI, and reduce the coverage of topics in the concepts entry for datasets to focus more clearly on the purpose of the page.

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/tracking: Tracking Service, tracking client APIs, autologging
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
  • area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
  • area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
  • area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
  • area/projects: MLproject format, project running backends
  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
  • Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
    Bug fixes, doc updates and new features usually go into minor releases.
  • Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
    Bug fixes and doc updates usually go into patch releases.
  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
@github-actions github-actions bot added the v3.6.1, area/docs (Documentation issues), and rn/documentation (Mention under Documentation Changes in Changelogs) labels Nov 10, 2025
Contributor

Copilot AI left a comment


Pull Request Overview

This PR refactors the evaluation datasets documentation to make it more concise and reference-focused. The changes transform verbose tutorial-style content into streamlined API reference documentation.

Key changes:

  • Converted sdk-guide.mdx from a workflow tutorial to an API reference guide
  • Added SQL backend requirement warnings across documentation files
  • Simplified content by removing extensive workflow diagrams and detailed patterns

Reviewed Changes

Copilot reviewed 4 out of 6 changed files in this pull request and generated 3 comments.

File Description
docs/docs/genai/datasets/sdk-guide.mdx Major restructure from tutorial to API reference; removed workflow patterns, simplified content structure
docs/docs/genai/datasets/index.mdx Added SQL backend warning, UI-based dataset creation guidance, and corrected core component descriptions
docs/docs/genai/datasets/end-to-end-workflow.mdx Replaced workflow diagram with SQL backend warning, added UI screenshot reference for expectations
docs/docs/genai/concepts/evaluation-datasets.mdx Streamlined concepts documentation, removed verbose use cases and workflow diagrams


There are several ways to create evaluation datasets, each suited to different stages of your GenAI development process. **Expectations are the cornerstone of effective evaluation**—they define the ground truth against which your AI's outputs are measured, enabling systematic quality assessment across iterations.
There are several ways to create evaluation datasets, each suited to different stages of your GenAI development process.

The simplest way to create one is through MLflow's UI. Navigate to an Expeirment that you want the evaluation dataset to be associated with and you can directly create a new one by supplying a unique name.

Copilot AI Nov 10, 2025


Corrected spelling of 'Expeirment' to 'Experiment'.

Suggested change
The simplest way to create one is through MLflow's UI. Navigate to an Expeirment that you want the evaluation dataset to be associated with and you can directly create a new one by supplying a unique name.
The simplest way to create one is through MLflow's UI. Navigate to an Experiment that you want the evaluation dataset to be associated with and you can directly create a new one by supplying a unique name.

```python
# When adding traces directly (automatic TRACE source)
traces = mlflow.search_traces(experiment_ids=["0"], return_type="list")
traces = mlflow.search_traces(locations=["0"], return_type="list")
```

Copilot AI Nov 10, 2025


The search_traces function uses experiment_ids as the parameter name, not locations. This should be experiment_ids=["0"] to be consistent with the API and other documentation examples.


# Or when using DataFrame from search_traces
traces_df = mlflow.search_traces(experiment_ids=["0"]) # Returns DataFrame
traces_df = mlflow.search_traces(locations=["0"]) # Returns DataFrame

Copilot AI Nov 10, 2025


The search_traces function uses experiment_ids as the parameter name, not locations. This should be experiment_ids=["0"] to be consistent with the API and other documentation examples.

Suggested change
traces_df = mlflow.search_traces(locations=["0"]) # Returns DataFrame
traces_df = mlflow.search_traces(experiment_ids=["0"]) # Returns DataFrame

@github-actions
Contributor

github-actions bot commented Nov 10, 2025

Documentation preview for b62441d is available at:

Changed Pages (4)

Collaborator


Awesome demonstration!

Collaborator


This doesn't look like an expectation but rather feedback. Isn't the expected_answer sufficient for demonstrating how to record ground truth?

Collaborator


Can we just set max_results=20 and remove the slicing below?
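
The reviewer's point can be sketched in isolation: capping results in the query is equivalent to over-fetching and slicing afterwards, but avoids transferring rows you then discard. This is a pure-Python illustration with a hypothetical `search` stand-in, not the MLflow `search_traces` implementation:

```python
# Hypothetical stand-in for a search API with server-side result capping.
# Purely illustrative; not the MLflow implementation.
def search(items, max_results=None):
    """Return at most max_results items, mimicking a capped query."""
    return list(items) if max_results is None else list(items)[:max_results]

rows = list(range(100))

# Capping at the query...
capped = search(rows, max_results=20)

# ...gives the same result as fetching everything and slicing afterwards.
assert capped == search(rows)[:20]
```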

Collaborator


nit: Can we combine this and next steps? Having 6 card links looks like a bit too much (and the eval one is overlapping).


## Key Features

<ConceptOverview concepts={[
Collaborator


nit: Can we remove border? It is confusing with other cards we have in the website which have links.

Evaluation datasets are the foundation of systematic GenAI application testing. They provide a centralized way to manage test data, ground truth expectations, and evaluation results—enabling you to measure and improve the quality of your AI applications with confidence.

:::warning[SQL Backend Required]
Evaluation Datasets require an MLflow Tracking Server with a SQL backend (PostgreSQL, MySQL, SQLite, or MSSQL).
Collaborator


:::

MLflow provides a fluent API for working with evaluation datasets that makes common workflows simple and intuitive:
## Quick API Overview
Collaborator


Since this page is titled "SDK guide", can we actually show how to use these APIs?


## Creating a dataset

## Adding new records to the dataset

## Removing records from the dataset

...
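
One way to flesh out those sections is a minimal active-record-style container. The class and method names below are hypothetical stand-ins for illustration, not the MLflow `EvaluationDataset` API:

```python
# Minimal active-record-style dataset sketch (hypothetical names;
# not the MLflow EvaluationDataset implementation).
class DatasetSketch:
    def __init__(self, name):
        self.name = name
        self.records = []

    def merge_records(self, records):
        """Add records, deduplicating on identical inputs."""
        existing = {repr(sorted(r["inputs"].items())) for r in self.records}
        for r in records:
            key = repr(sorted(r["inputs"].items()))
            if key not in existing:
                self.records.append(r)
                existing.add(key)

    def remove_records(self, predicate):
        """Drop records matching a predicate."""
        self.records = [r for r in self.records if not predicate(r)]


ds = DatasetSketch("qa-regression")
ds.merge_records([{"inputs": {"q": "Capital of France?"},
                   "expectations": {"answer": "Paris"}}])
ds.merge_records([{"inputs": {"q": "Capital of France?"},
                   "expectations": {"answer": "Paris"}}])  # duplicate, skipped
ds.remove_records(lambda r: r["expectations"]["answer"] == "Berlin")
```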

"expectations": {
"answer": "The capital of France is Paris.",
"answer": "Paris",
"confidence": 0.95,
Collaborator


nit: Similarly to the above comment, I think this is metric rather than ground truth?

When new fields are introduced in subsequent records, they're automatically incorporated into the schema. Existing records without those fields are handled gracefully during evaluation and analysis.
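
The schema behavior described above can be sketched as a simple union over record fields (pure Python, illustrative only; not the actual schema-inference code):

```python
# Illustrative only: infer an expectations "schema" as the union of fields
# seen across records, so later records can introduce new fields.
records = [
    {"inputs": {"q": "Capital of France?"}, "expectations": {"answer": "Paris"}},
    {"inputs": {"q": "2 + 2?"}, "expectations": {"answer": "4", "difficulty": "easy"}},
]

schema = set()
for record in records:
    schema.update(record["expectations"])

# Earlier records simply lack the newer field; readers treat it as absent.
missing = [r for r in records if "difficulty" not in r["expectations"]]
```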

Tags are key-value pairs that help categorize and organize datasets. Tags can be arbitrary values and are entirely searchable.
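
Since tags are arbitrary key-value pairs, the filtering they enable is easy to sketch client-side (hypothetical data; the real search API filters server-side):

```python
# Illustrative client-side tag filter; MLflow's search filters server-side.
datasets = [
    {"name": "qa-regression", "tags": {"team": "genai", "stage": "prod"}},
    {"name": "qa-smoke", "tags": {"team": "genai", "stage": "dev"}},
    {"name": "summarize", "tags": {"team": "nlp"}},
]

def filter_by_tag(items, key, value):
    """Keep items whose tags contain the given key-value pair."""
    return [d for d in items if d["tags"].get(key) == value]

prod = filter_by_tag(datasets, "stage", "prod")
```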
## Source Types
Collaborator


Maybe we can move this to the main guide in genai/datasets.

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
@BenWilson2 BenWilson2 requested a review from B-Step62 November 12, 2025 00:40
Collaborator

@B-Step62 B-Step62 left a comment


LGTM! Thanks for the cleanup!

Collaborator


Suggested change
experiment_ids=["0"], max_results=20, return_type="list" # Get list[Trace] objects

nit

Member Author


ahhhh forgot that one

Collaborator


nit: Not sure if we should document this here as a pandas dataframe pattern, for me it sounds more like "from traces"

maybe just clone https://pr-18766--mlflow-docs-preview.netlify.app/docs/latest/genai/datasets/sdk-guide/#adding-records-to-a-dataset with create_dataset call added?

The `EvaluationDataset` object follows an active record pattern—it's both a data container and provides methods to interact with the backend:
</TabItem>
<TabItem value="search" label="Search Datasets">

Collaborator


nit: Can we add a link to search filter section?

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
@BenWilson2 BenWilson2 added this pull request to the merge queue Nov 12, 2025
Merged via the queue into mlflow:master with commit 0857e22 Nov 12, 2025
64 of 66 checks passed
@BenWilson2 BenWilson2 deleted the cleanup-datasets-docs branch November 12, 2025 21:58
BenWilson2 added a commit to BenWilson2/mlflow that referenced this pull request Nov 14, 2025
Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
Tian-Sky-Lan pushed a commit to Tian-Sky-Lan/mlflow that referenced this pull request Nov 24, 2025
Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
Signed-off-by: Tian Lan <sky.blue266000@gmail.com>

Labels

area/docs (Documentation issues) · rn/documentation (Mention under Documentation Changes in Changelogs) · v3.6.1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] mlflow.genai.datasets.create_dataset throws ConnectionError when registering dataset

3 participants