Documentation for online evaluation / scoring#20103
Conversation
🛠 DevTools 🛠
Install mlflow from this PR. For Databricks, use the following command:
Signed-off-by: dbczumar <corey.zumar@databricks.com>
Pull request overview
This PR adds a sidebar navigation entry for new "Online Evaluations" documentation as part of the MLflow online evaluation/scoring feature. The PR is marked as WIP (Work in Progress) and introduces documentation for automated LLM judges that run as traces are logged.
Changes:
- Added a new "Online Evaluations" entry to the GenAI sidebar navigation
Documentation preview for e2e739f is available (1 changed page).
- Fix prerequisites section with arrow-style links
- Add AI Gateway endpoint prerequisite
- Update UI instructions with Configure the judge section
- Fix SDK example to use correct register/start API pattern
- Add ScorerSamplingConfig and model parameter
- Change 'Streamlined Development' to 'Streamlined Quality Improvement'
- Various copy improvements

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: dbczumar <corey.zumar@databricks.com>
- Add ImageBox import
- Mention Sessions tab in addition to Traces tab
- Use ImageBox with 80% width and caption
- Remove redundant filter text

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: dbczumar <corey.zumar@databricks.com>
- UI: Note about clicking existing judge to edit
- SDK: Add get_scorer, update, and stop examples

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: dbczumar <corey.zumar@databricks.com>
- Judges tab icon
- New LLM judge button
- Evaluation settings section

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: dbczumar <corey.zumar@databricks.com>
# Online Evaluations

_Automatically evaluate traces and multi-turn conversations as they're logged - no code required_
Do we need "no code required"?
We've had a couple of Databricks customers who want non-coders (e.g. PMs) to be able to create and run judges, so I'd like to keep this
_Automatically evaluate traces and multi-turn conversations as they're logged - no code required_

Online evaluations run your LLM judges automatically on traces and multi-turn conversations as they're logged to MLflow, without requiring manual execution of code. This enables two key use cases:
Could we unify the term to use scorers and add a link to https://pr-20103--mlflow-docs-preview.netlify.app/docs/latest/genai/eval-monitor/scorers/?
Let's keep judges, since:
- We're going to refactor scorers to "LLM Judges and Scorers" as part of the eval documentation overhaul (which is also what we do on Databricks)
- We only support online evaluation of LLM judges
1. **The MLflow Server is running** (see [Set Up MLflow Server](/genai/getting-started/connect-environment))
2. **MLflow Tracing is enabled** in your agent or LLM application (see [Tracing Quickstart](/genai/tracing/quickstart))
   - **For multi-turn conversation evaluation**, traces must include session IDs (see [Track Users & Sessions](/genai/tracing/track-users-sessions))
3. **An AI Gateway endpoint is configured** for LLM judge execution (see [Create and Manage Endpoints](/genai/governance/ai-gateway/endpoints/create-and-manage))
Could we add the link to the bolded text directly instead of adding "see ..." on each line?
Absolutely - done!
```python
# Import path assumed from context; ScorerSamplingConfig ships with mlflow.genai
from mlflow.genai.scorers import ScorerSamplingConfig

# Create and start a trace-level judge
tool_judge = ToolCallCorrectness(model="gateway:/your-endpoint")
registered_tool_judge = tool_judge.register(name="tool_call_correctness")
registered_tool_judge.start(
    sampling_config=ScorerSamplingConfig(sample_rate=0.5),  # Evaluate 50% of traces
)
```

```python
# Create and start a session-level judge for multi-turn conversations
frustration_judge = ConversationalGuidelines(
    name="user_frustration",
    guidelines="The user should not express frustration, confusion, or dissatisfaction during the conversation.",
    model="gateway:/your-endpoint",
)
registered_frustration_judge = frustration_judge.register(name="user_frustration")
registered_frustration_judge.start(
    sampling_config=ScorerSamplingConfig(sample_rate=1.0),  # Evaluate all conversations
)
```
I feel like we only need one example for 'starting a judge'.
I think it's important to show that trace-level and session-level judges are supported
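For intuition, a `sample_rate` of 0.5 means each trace is independently selected for evaluation with roughly 50% probability. Here is a minimal sketch of deterministic, hash-based sampling; this is illustrative only, not MLflow's actual implementation, and `should_evaluate` is a hypothetical helper:

```python
import hashlib

def should_evaluate(trace_id: str, sample_rate: float) -> bool:
    """Deterministically decide whether a trace is sampled for evaluation.

    Hashing the trace ID yields a stable, roughly uniform value in [0, 1),
    so the same trace always gets the same decision.
    """
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate

print(should_evaluate("tr-123", 1.0))  # True: rate 1.0 evaluates every trace
print(should_evaluate("tr-123", 0.0))  # False: rate 0.0 evaluates none
```

Hash-based selection keeps the decision stable per trace, so a re-logged or retried trace does not flip in or out of the sample.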
Assessments from online evaluation appear directly in the MLflow UI. For traces, assessments typically appear within a minute or two of logging. For multi-turn sessions, evaluation begins after 5 minutes of inactivity (no new traces) by default—this is <APILink fn="mlflow.environment_variables.MLFLOW_ONLINE_SCORING_DEFAULT_SESSION_COMPLETION_BUFFER_SECONDS">configurable</APILink>. Navigate to your experiment's **Traces** or **Sessions** tab to see results.

<ImageBox
Could we include the charts in the Quality tab of the overview as well?
Great idea. Done!
| Issue | Solution |
| --- | --- |
| **Missing assessments** | Verify that the judge is active, the filter matches your traces, the sampling rate is greater than zero, and the traces are less than one hour old |
| **Unexpected or unsatisfactory judge results** | Review the judge's prompt/guidelines and test with known inputs |
Maybe mention `align` for judge optimization here?
Great idea - done!
1. **The MLflow Server is running** (see [Set Up MLflow Server](/genai/getting-started/connect-environment))
2. **MLflow Tracing is enabled** in your agent or LLM application (see [Tracing Quickstart](/genai/tracing/quickstart))
   - **For multi-turn conversation evaluation**, traces must include session IDs (see [Track Users & Sessions](/genai/tracing/track-users-sessions))
3. **An AI Gateway endpoint is configured** for LLM judge execution (see [Create and Manage Endpoints](/genai/governance/ai-gateway/endpoints/create-and-manage))
Can we give a bit more detailed description for the 3rd step? The AI Gateway product category is generally unrelated to evaluation/quality (yet), so it might be confusing why this step is required.
Probably a "How it works" section would also be helpful, so that users/admins don't feel online scoring is black-box magic.
Totally - I added the detailed description for the 3rd step and introduced a "How it works" section before troubleshooting. Let me know if this reads better :)
Yeah looks very clear, thanks!
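To make the 3rd prerequisite concrete: the judge needs an LLM to call, and that LLM is reached through an AI Gateway endpoint. As a sketch, a minimal gateway configuration might look like the following; the endpoint name, provider, and model are placeholders, and the exact schema should be checked against the AI Gateway docs:

```yaml
endpoints:
  - name: your-endpoint
    endpoint_type: llm/v1/chat
    model:
      provider: openai
      name: gpt-4o
      config:
        openai_api_key: $OPENAI_API_KEY
```

The `gateway:/your-endpoint` model URI in the SDK examples then refers to this endpoint by name.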
Assessments from online evaluation appear directly in the MLflow UI. For traces, assessments typically appear within a minute or two of logging. For multi-turn sessions, evaluation begins after 5 minutes of inactivity (no new traces) by default—this is <APILink fn="mlflow.environment_variables.MLFLOW_ONLINE_SCORING_DEFAULT_SESSION_COMPLETION_BUFFER_SECONDS">configurable</APILink>. Navigate to your experiment's **Traces** or **Sessions** tab to see results.

<ImageBox
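The 5-minute inactivity buffer described above behaves like a debounce: a session becomes eligible for evaluation only once no new trace has arrived within the buffer window. A minimal illustrative sketch of that rule (not MLflow's implementation; names are hypothetical):

```python
# Default buffer in seconds; overridable via the
# MLFLOW_ONLINE_SCORING_DEFAULT_SESSION_COMPLETION_BUFFER_SECONDS env var
SESSION_COMPLETION_BUFFER_SECONDS = 300.0

def session_is_complete(
    last_trace_time: float,
    now: float,
    buffer_seconds: float = SESSION_COMPLETION_BUFFER_SECONDS,
) -> bool:
    """A session is considered complete once it has been idle for the full buffer."""
    return now - last_trace_time >= buffer_seconds

print(session_is_complete(last_trace_time=0.0, now=299.0))  # False: still within buffer
print(session_is_complete(last_trace_time=0.0, now=300.0))  # True: 5 minutes of inactivity
```

Every new trace in the session effectively resets `last_trace_time`, pushing evaluation out until the conversation goes quiet.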
|
|
||
| ## Next Steps | ||
|
|
||
| <CardGroup cols={2}> |
nit: Can we use TilesGrid for consistency with other pages?
Totally - done :)
## Online vs Offline Evaluation

| | Online Evaluation | Offline Evaluation |
This comparison feels like it narrows down offline evaluation's capabilities a bit.
- Offline evaluation is still commonly used in the internal QA phase, not only for regressions or bug fixes.
- The "data source" part sounds like offline evaluation does not support traces/conversations.
Here is Claude's suggestion when given the original table, to share another data point:
Makes sense - updated - let me know what you think :)
Signed-off-by: dbczumar <corey.zumar@databricks.com> Co-authored-by: Claude <noreply@anthropic.com>
Related Issues/PRs
#xxx

What changes are proposed in this pull request?
Docs for online evaluation (scoring)
How is this PR tested?
Does this PR require documentation update?
Release Notes
MLflow now supports online evaluation, enabling developers to automatically run LLM judges as traces are logged. This creates a more seamless experience for identifying quality issues during development and for production quality monitoring.
Is this a user-facing change?
What component(s), interfaces, languages, and integrations does this PR affect?
Components
- area/tracking: Tracking Service, tracking client APIs, autologging
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
- area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
- area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
- area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
- area/projects: MLproject format, project running backends
- area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- area/build: Build and test infrastructure for MLflow
- area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:
- rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
- rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
- rn/feature - A new user-facing feature worth mentioning in the release notes
- rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
- rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?
"Yes" should be selected for bug fixes, documentation updates, and other small changes. "No" should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
Bug fixes, doc updates and new features usually go into minor releases.
Bug fixes and doc updates usually go into patch releases.