Cleanup evaluation datasets docs #18766
Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
Pull Request Overview
This PR refactors the evaluation datasets documentation to make it more concise and reference-focused. The changes transform verbose tutorial-style content into streamlined API reference documentation.
Key changes:
- Converted sdk-guide.mdx from a workflow tutorial to an API reference guide
- Added SQL backend requirement warnings across documentation files
- Simplified content by removing extensive workflow diagrams and detailed patterns
Reviewed Changes
Copilot reviewed 4 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| docs/docs/genai/datasets/sdk-guide.mdx | Major restructure from tutorial to API reference; removed workflow patterns, simplified content structure |
| docs/docs/genai/datasets/index.mdx | Added SQL backend warning, UI-based dataset creation guidance, and corrected core component descriptions |
| docs/docs/genai/datasets/end-to-end-workflow.mdx | Replaced workflow diagram with SQL backend warning, added UI screenshot reference for expectations |
| docs/docs/genai/concepts/evaluation-datasets.mdx | Streamlined concepts documentation, removed verbose use cases and workflow diagrams |
docs/docs/genai/datasets/index.mdx
Outdated
> - There are several ways to create evaluation datasets, each suited to different stages of your GenAI development process. **Expectations are the cornerstone of effective evaluation**—they define the ground truth against which your AI's outputs are measured, enabling systematic quality assessment across iterations.
> + There are several ways to create evaluation datasets, each suited to different stages of your GenAI development process.
> + The simplest way to create one is through MLflow's UI. Navigate to an Expeirment that you want the evaluation dataset to be associated with and you can directly create a new one by supplying a unique name.
Corrected spelling of 'Expeirment' to 'Experiment'.
> - The simplest way to create one is through MLflow's UI. Navigate to an Expeirment that you want the evaluation dataset to be associated with and you can directly create a new one by supplying a unique name.
> + The simplest way to create one is through MLflow's UI. Navigate to an Experiment that you want the evaluation dataset to be associated with and you can directly create a new one by supplying a unique name.
```diff
  # When adding traces directly (automatic TRACE source)
- traces = mlflow.search_traces(experiment_ids=["0"], return_type="list")
+ traces = mlflow.search_traces(locations=["0"], return_type="list")
```
There was a problem hiding this comment.
The `search_traces` function uses `experiment_ids` as the parameter name, not `locations`. This should be `experiment_ids=["0"]` to be consistent with the API and other documentation examples.
```diff
  # Or when using DataFrame from search_traces
- traces_df = mlflow.search_traces(experiment_ids=["0"])  # Returns DataFrame
+ traces_df = mlflow.search_traces(locations=["0"])  # Returns DataFrame
```
The `search_traces` function uses `experiment_ids` as the parameter name, not `locations`. This should be `experiment_ids=["0"]` to be consistent with the API and other documentation examples.
```diff
- traces_df = mlflow.search_traces(locations=["0"])  # Returns DataFrame
+ traces_df = mlflow.search_traces(experiment_ids=["0"])  # Returns DataFrame
```
Documentation preview for b62441d is available at: Changed Pages (4)
docs/docs/genai/datasets/index.mdx
Outdated
This doesn't look like an expectation but rather feedback. Isn't `expected_answer` sufficient for demonstrating how to record ground truth?
docs/docs/genai/datasets/index.mdx
Outdated
Can we just set `max_results=20` and remove the slicing below?
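The reviewer's point is that capping results in the query is equivalent to fetching everything and slicing afterwards, just cheaper. A toy sketch in plain Python (`search` here is a stand-in, not `mlflow.search_traces`):

```python
# Toy stand-in for a search API: `max_results` caps results at the source,
# which yields the same result as slicing the full list afterwards.
def search(records, max_results=None):
    return records if max_results is None else records[:max_results]

traces = list(range(100))  # pretend these are Trace objects
assert search(traces, max_results=20) == search(traces)[:20]
```

The difference only matters for cost: limiting at the source avoids transferring records that the slice would discard anyway.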
docs/docs/genai/datasets/index.mdx
Outdated
nit: Can we combine this and next steps? Having 6 card links looks a bit too much (and the eval one is overlapping).
> ## Key Features
>
> `<ConceptOverview concepts={[`
nit: Can we remove the border? It is confusing with other cards we have on the website, which have links.
docs/docs/genai/datasets/index.mdx
Outdated
> Evaluation datasets are the foundation of systematic GenAI application testing. They provide a centralized way to manage test data, ground truth expectations, and evaluation results—enabling you to measure and improve the quality of your AI applications with confidence.
>
> :::warning[SQL Backend Required]
> Evaluation Datasets require an MLflow Tracking Server with a SQL backend (PostgreSQL, MySQL, SQLite, or MSSQL).
Can we add a link to the SQL backend setup? https://mlflow.org/docs/latest/self-hosting/architecture/backend-store/#types-of-backend-stores
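For context on the warning being discussed, a tracking server with the lightest-weight SQL backend (SQLite) can be started like this (a minimal sketch; the database path, host, and port are placeholders):

```shell
# SQLite file-based backend store; swap the URI for
# PostgreSQL/MySQL/MSSQL in production deployments.
mlflow server \
  --backend-store-uri sqlite:///mlflow.db \
  --host 127.0.0.1 --port 5000
```

The default file-based (`./mlruns`) backend store is what the warning rules out for evaluation datasets.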
> :::
>
> - MLflow provides a fluent API for working with evaluation datasets that makes common workflows simple and intuitive:
> + ## Quick API Overview
Since this page is titled "SDK guide", can we actually show how to use these APIs?

## Creating a dataset
## Adding new records to the dataset
## Removing records from the dataset
...
```diff
  "expectations": {
-     "answer": "The capital of France is Paris.",
+     "answer": "Paris",
      "confidence": 0.95,
```
nit: Similarly to the above comment, I think this is a metric rather than ground truth?
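The distinction the reviewers are drawing around the `confidence` field can be sketched in plain Python (field names illustrative, not MLflow's schema): expectations hold ground truth the app should produce, while scores like confidence are something an evaluator computes.

```python
# A dataset record holds inputs plus ground-truth expectations...
record = {
    "inputs": {"question": "What is the capital of France?"},
    "expectations": {"answer": "Paris"},  # what the app *should* produce
}

# ...while a confidence score is assessment *output*, so it belongs with
# feedback/metrics rather than in the ground truth.
assessment = {"name": "confidence", "value": 0.95}

assert "confidence" not in record["expectations"]
```

Keeping computed scores out of `expectations` avoids the confusion flagged in both comments above.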
> When new fields are introduced in subsequent records, they're automatically incorporated into the schema. Existing records without those fields are handled gracefully during evaluation and analysis.
>
> - Tags are key-value pairs that help categorize and organize datasets. Tags can be arbitrary values and are entirely searchable.
> + ## Source Types
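The schema-evolution behavior described in the quoted text ("new fields are automatically incorporated into the schema") can be sketched in plain Python; this is illustrative only, not MLflow's implementation:

```python
# Build a schema as the union of fields seen across records; records
# missing a field simply don't contribute it.
def infer_schema(records):
    fields = set()
    for record in records:
        fields.update(record)
    return sorted(fields)

records = [
    {"inputs": "q1", "expectations": "a1"},
    {"inputs": "q2", "expectations": "a2", "tags": {"split": "eval"}},  # new field
]
assert infer_schema(records) == ["expectations", "inputs", "tags"]
```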
Maybe we can move this to the main guide in genai/datasets.
B-Step62 left a comment:
LGTM! Thanks for the cleanup!
docs/docs/genai/datasets/index.mdx
Outdated
> `experiment_ids=["0"], max_results=20, return_type="list"  # Get list[Trace] objects`
nit
ahhhh forgot that one
docs/docs/genai/datasets/index.mdx
Outdated
nit: Not sure if we should document this here as a pandas DataFrame pattern; for me it sounds more like "from traces".
Maybe just clone https://pr-18766--mlflow-docs-preview.netlify.app/docs/latest/genai/datasets/sdk-guide/#adding-records-to-a-dataset with a `create_dataset` call added?
> The `EvaluationDataset` object follows an active record pattern—it's both a data container and provides methods to interact with the backend:
>
> `</TabItem>`
> `<TabItem value="search" label="Search Datasets">`
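The "active record pattern" mentioned in the quoted line can be illustrated with a toy class (hypothetical names, not MLflow's actual `EvaluationDataset` API): the object is both a data container and knows how to persist itself to a backend.

```python
# Toy active-record-style dataset: holds records *and* talks to a backend.
class ToyDataset:
    def __init__(self, name, backend):
        self.name = name
        self.records = []
        self._backend = backend  # stand-in for the tracking-server store

    def merge_records(self, new_records):
        self.records.extend(new_records)
        self._backend[self.name] = list(self.records)  # persist immediately
        return self

backend = {}
ds = ToyDataset("qa-eval", backend)
ds.merge_records([{"inputs": "q1", "expectations": "a1"}])
assert backend["qa-eval"] == ds.records
```

The point of the pattern is that callers never handle a separate "save" step; mutating the object updates the store.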
nit: Can we add a link to the search filter section?
Signed-off-by: Tian Lan <sky.blue266000@gmail.com>
Related Issues/PRs
#xxx

What changes are proposed in this pull request?
Remove a bunch of redundant information, add in a video of Datasets in the UI, and reduce the coverage of topics in the concepts entry for datasets to focus more clearly on the purpose of the page.
How is this PR tested?
Does this PR require documentation update?
Release Notes
Is this a user-facing change?
What component(s), interfaces, languages, and integrations does this PR affect?
Components
- area/tracking: Tracking Service, tracking client APIs, autologging
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
- area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
- area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
- area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
- area/projects: MLproject format, project running backends
- area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- area/build: Build and test infrastructure for MLflow
- area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:
- rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
- rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
- rn/feature - A new user-facing feature worth mentioning in the release notes
- rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
- rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?
- `Yes` should be selected for bug fixes, documentation updates, and other small changes.
- `No` should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
Bug fixes, doc updates and new features usually go into minor releases.
Bug fixes and doc updates usually go into patch releases.