
Cleanup evaluation datasets docs#18766

Merged
BenWilson2 merged 5 commits into mlflow:master from BenWilson2:cleanup-datasets-docs
Nov 12, 2025

Conversation

@BenWilson2
Member

@BenWilson2 BenWilson2 commented Nov 10, 2025

🛠 DevTools 🛠

Open in GitHub Codespaces

Install mlflow from this PR

# mlflow
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/18766/merge
# mlflow-skinny
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/18766/merge#subdirectory=libs/skinny

For Databricks, use the following command:

%sh curl -LsSf https://raw.githubusercontent.com/mlflow/mlflow/HEAD/dev/install-skinny.sh | sh -s pull/18766/merge

Related Issues/PRs

#xxx

What changes are proposed in this pull request?

Remove a bunch of redundant information, add in a video of Datasets in the UI, and reduce the coverage of topics in the concepts entry for datasets to focus more clearly on the purpose of the page.

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/tracking: Tracking Service, tracking client APIs, autologging
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
  • area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
  • area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
  • area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
  • area/projects: MLproject format, project running backends
  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
  • Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
    Bug fixes, doc updates and new features usually go into minor releases.
  • Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
    Bug fixes and doc updates usually go into patch releases.
  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
@github-actions github-actions bot added the v3.6.1, area/docs (Documentation issues), and rn/documentation (Mention under Documentation Changes in Changelogs) labels Nov 10, 2025
Contributor

Copilot AI left a comment


Pull Request Overview

This PR refactors the evaluation datasets documentation to make it more concise and reference-focused. The changes transform verbose tutorial-style content into streamlined API reference documentation.

Key changes:

  • Converted sdk-guide.mdx from a workflow tutorial to an API reference guide
  • Added SQL backend requirement warnings across documentation files
  • Simplified content by removing extensive workflow diagrams and detailed patterns

Reviewed Changes

Copilot reviewed 4 out of 6 changed files in this pull request and generated 3 comments.

File Description
docs/docs/genai/datasets/sdk-guide.mdx Major restructure from tutorial to API reference; removed workflow patterns, simplified content structure
docs/docs/genai/datasets/index.mdx Added SQL backend warning, UI-based dataset creation guidance, and corrected core component descriptions
docs/docs/genai/datasets/end-to-end-workflow.mdx Replaced workflow diagram with SQL backend warning, added UI screenshot reference for expectations
docs/docs/genai/concepts/evaluation-datasets.mdx Streamlined concepts documentation, removed verbose use cases and workflow diagrams


There are several ways to create evaluation datasets, each suited to different stages of your GenAI development process. **Expectations are the cornerstone of effective evaluation**—they define the ground truth against which your AI's outputs are measured, enabling systematic quality assessment across iterations.
There are several ways to create evaluation datasets, each suited to different stages of your GenAI development process.

The simplest way to create one is through MLflow's UI. Navigate to an Expeirment that you want the evaluation dataset to be associated with and you can directly create a new one by supplying a unique name.

Copilot AI Nov 10, 2025


Corrected spelling of 'Expeirment' to 'Experiment'.

Suggested change
The simplest way to create one is through MLflow's UI. Navigate to an Expeirment that you want the evaluation dataset to be associated with and you can directly create a new one by supplying a unique name.
The simplest way to create one is through MLflow's UI. Navigate to an Experiment that you want the evaluation dataset to be associated with and you can directly create a new one by supplying a unique name.

```python
# When adding traces directly (automatic TRACE source)
traces = mlflow.search_traces(experiment_ids=["0"], return_type="list")
traces = mlflow.search_traces(locations=["0"], return_type="list")
```

Copilot AI Nov 10, 2025


The search_traces function uses experiment_ids as the parameter name, not locations. This should be experiment_ids=["0"] to be consistent with the API and other documentation examples.


# Or when using DataFrame from search_traces
traces_df = mlflow.search_traces(experiment_ids=["0"]) # Returns DataFrame
traces_df = mlflow.search_traces(locations=["0"]) # Returns DataFrame

Copilot AI Nov 10, 2025


The search_traces function uses experiment_ids as the parameter name, not locations. This should be experiment_ids=["0"] to be consistent with the API and other documentation examples.

Suggested change
traces_df = mlflow.search_traces(locations=["0"]) # Returns DataFrame
traces_df = mlflow.search_traces(experiment_ids=["0"]) # Returns DataFrame

@github-actions
Contributor

github-actions bot commented Nov 10, 2025

Documentation preview for b62441d is available at:

Changed Pages (4)

Collaborator


Awesome demonstration!

Collaborator


This doesn't look like an expectation but rather feedback. Isn't the expected_answer sufficient for demonstrating how to record ground truth?

Collaborator


Can we just set max_results=20 and remove the slicing below?
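
The reviewer's point can be sketched in isolation: capping results in the query is equivalent to over-fetching and slicing afterwards, but avoids transferring rows you then discard. This is a pure-Python illustration with a hypothetical `search` stand-in, not the MLflow `search_traces` implementation:

```python
# Hypothetical stand-in for a search API with server-side result capping.
# Purely illustrative; not the MLflow implementation.
def search(items, max_results=None):
    """Return at most max_results items, mimicking a capped query."""
    return list(items) if max_results is None else list(items)[:max_results]

rows = list(range(100))

# Capping at the query...
capped = search(rows, max_results=20)

# ...gives the same result as fetching everything and slicing afterwards.
assert capped == search(rows)[:20]
```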

Collaborator


nit: Can we combine this and next steps? Having 6 card links looks like a bit too much (and the eval one is overlapping).


## Key Features

<ConceptOverview concepts={[
Collaborator


nit: Can we remove border? It is confusing with other cards we have in the website which have links.

Evaluation datasets are the foundation of systematic GenAI application testing. They provide a centralized way to manage test data, ground truth expectations, and evaluation results—enabling you to measure and improve the quality of your AI applications with confidence.

:::warning[SQL Backend Required]
Evaluation Datasets require an MLflow Tracking Server with a SQL backend (PostgreSQL, MySQL, SQLite, or MSSQL).
Collaborator


:::

MLflow provides a fluent API for working with evaluation datasets that makes common workflows simple and intuitive:
## Quick API Overview
Collaborator


Since this page is titled "SDK guide", can we actually show how to use these APIs?


## Creating a dataset

## Adding new records to the dataset

## Removing records from the dataset

...
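
One way to flesh out those sections is a minimal active-record-style container. The class and method names below are hypothetical stand-ins for illustration, not the MLflow `EvaluationDataset` API:

```python
# Minimal active-record-style dataset sketch (hypothetical names;
# not the MLflow EvaluationDataset implementation).
class DatasetSketch:
    def __init__(self, name):
        self.name = name
        self.records = []

    def merge_records(self, records):
        """Add records, deduplicating on identical inputs."""
        existing = {repr(sorted(r["inputs"].items())) for r in self.records}
        for r in records:
            key = repr(sorted(r["inputs"].items()))
            if key not in existing:
                self.records.append(r)
                existing.add(key)

    def remove_records(self, predicate):
        """Drop records matching a predicate."""
        self.records = [r for r in self.records if not predicate(r)]


ds = DatasetSketch("qa-regression")
ds.merge_records([{"inputs": {"q": "Capital of France?"},
                   "expectations": {"answer": "Paris"}}])
ds.merge_records([{"inputs": {"q": "Capital of France?"},
                   "expectations": {"answer": "Paris"}}])  # duplicate, skipped
ds.remove_records(lambda r: r["expectations"]["answer"] == "Berlin")
```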

"expectations": {
"answer": "The capital of France is Paris.",
"answer": "Paris",
"confidence": 0.95,
Collaborator


nit: Similarly to the above comment, I think this is metric rather than ground truth?

When new fields are introduced in subsequent records, they're automatically incorporated into the schema. Existing records without those fields are handled gracefully during evaluation and analysis.
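
The schema behavior described above can be sketched as a simple union over record fields (pure Python, illustrative only; not the actual schema-inference code):

```python
# Illustrative only: infer an expectations "schema" as the union of fields
# seen across records, so later records can introduce new fields.
records = [
    {"inputs": {"q": "Capital of France?"}, "expectations": {"answer": "Paris"}},
    {"inputs": {"q": "2 + 2?"}, "expectations": {"answer": "4", "difficulty": "easy"}},
]

schema = set()
for record in records:
    schema.update(record["expectations"])

# Earlier records simply lack the newer field; readers treat it as absent.
missing = [r for r in records if "difficulty" not in r["expectations"]]
```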

Tags are key-value pairs that help categorize and organize datasets. Tags can be arbitrary values and are entirely searchable.
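
Since tags are arbitrary key-value pairs, the filtering they enable is easy to sketch client-side (hypothetical data; the real search API filters server-side):

```python
# Illustrative client-side tag filter; MLflow's search filters server-side.
datasets = [
    {"name": "qa-regression", "tags": {"team": "genai", "stage": "prod"}},
    {"name": "qa-smoke", "tags": {"team": "genai", "stage": "dev"}},
    {"name": "summarize", "tags": {"team": "nlp"}},
]

def filter_by_tag(items, key, value):
    """Keep items whose tags contain the given key-value pair."""
    return [d for d in items if d["tags"].get(key) == value]

prod = filter_by_tag(datasets, "stage", "prod")
```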
## Source Types
Collaborator


Maybe we can move this to the main guide in genai/datasets.

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
@BenWilson2 BenWilson2 requested a review from B-Step62 November 12, 2025 00:40
Collaborator

@B-Step62 B-Step62 left a comment


LGTM! Thanks for the cleanup!

Collaborator


Suggested change
experiment_ids=["0"], max_results=20, return_type="list" # Get list[Trace] objects

nit

Member Author


ahhhh forgot that one

Collaborator


nit: Not sure if we should document this here as a pandas dataframe pattern, for me it sounds more like "from traces"

maybe just clone https://pr-18766--mlflow-docs-preview.netlify.app/docs/latest/genai/datasets/sdk-guide/#adding-records-to-a-dataset with create_dataset call added?

The `EvaluationDataset` object follows an active record pattern—it's both a data container and provides methods to interact with the backend:
</TabItem>
<TabItem value="search" label="Search Datasets">

Collaborator


nit: Can we add a link to search filter section?

Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
@BenWilson2 BenWilson2 added this pull request to the merge queue Nov 12, 2025
Merged via the queue into mlflow:master with commit 0857e22 Nov 12, 2025
64 of 66 checks passed
@BenWilson2 BenWilson2 deleted the cleanup-datasets-docs branch November 12, 2025 21:58
BenWilson2 added a commit to BenWilson2/mlflow that referenced this pull request Nov 14, 2025
Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
Tian-Sky-Lan pushed a commit to Tian-Sky-Lan/mlflow that referenced this pull request Nov 24, 2025
Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
Signed-off-by: Tian Lan <sky.blue266000@gmail.com>

Labels

area/docs (Documentation issues) · rn/documentation (Mention under Documentation Changes in Changelogs) · v3.6.1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] mlflow.genai.datasets.create_dataset throws ConnectionError when registering dataset

3 participants