
Documentation for online evaluation / scoring#20103

Merged
dbczumar merged 47 commits into mlflow:master from dbczumar:online_docs
Jan 21, 2026
Conversation

@dbczumar
Collaborator

Related Issues/PRs

#xxx

What changes are proposed in this pull request?

Docs for online evaluation (scoring)

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

Release Notes

MLflow now supports online evaluation, enabling developers to automatically run LLM judges as traces are logged. This creates a more seamless experience for identifying quality issues during development and for production quality monitoring.

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/tracking: Tracking Service, tracking client APIs, autologging
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
  • area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
  • area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
  • area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
  • area/projects: MLproject format, project running backends
  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
  • Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
    Bug fixes, doc updates and new features usually go into minor releases.
  • Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
    Bug fixes and doc updates usually go into patch releases.
  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

Signed-off-by: dbczumar <corey.zumar@databricks.com>
Copilot AI review requested due to automatic review settings January 18, 2026 07:01
@github-actions
Contributor

🛠 DevTools 🛠

Install mlflow from this PR

# mlflow
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/20103/merge
# mlflow-skinny
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/20103/merge#subdirectory=libs/skinny

For Databricks, use the following command:

%sh curl -LsSf https://raw.githubusercontent.com/mlflow/mlflow/HEAD/dev/install-skinny.sh | sh -s pull/20103/merge

@github-actions github-actions bot added area/docs Documentation issues area/evaluation MLflow Evaluation rn/feature Mention under Features in Changelogs. labels Jan 18, 2026
Signed-off-by: dbczumar <corey.zumar@databricks.com>
Contributor

Copilot AI left a comment

Pull request overview

This PR adds a sidebar navigation entry for new "Online Evaluations" documentation as part of the MLflow online evaluation/scoring feature. The PR is marked as WIP (Work in Progress) and introduces documentation for automated LLM judges that run as traces are logged.

Changes:

  • Added a new "Online Evaluations" entry to the GenAI sidebar navigation


@github-actions
Contributor

github-actions bot commented Jan 18, 2026

Documentation preview for e2e739f is available at:

Changed Pages (1)


dbczumar and others added 20 commits January 18, 2026 21:02
Signed-off-by: dbczumar <corey.zumar@databricks.com>
- Fix prerequisites section with arrow-style links
- Add AI Gateway endpoint prerequisite
- Update UI instructions with Configure the judge section
- Fix SDK example to use correct register/start API pattern
- Add ScorerSamplingConfig and model parameter
- Change 'Streamlined Development' to 'Streamlined Quality Improvement'
- Various copy improvements

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: dbczumar <corey.zumar@databricks.com>
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: dbczumar <corey.zumar@databricks.com>
- Add ImageBox import
- Mention Sessions tab in addition to Traces tab
- Use ImageBox with 80% width and caption
- Remove redundant filter text

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: dbczumar <corey.zumar@databricks.com>
- UI: Note about clicking existing judge to edit
- SDK: Add get_scorer, update, and stop examples

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: dbczumar <corey.zumar@databricks.com>
- Judges tab icon
- New LLM judge button
- Evaluation settings section

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: dbczumar <corey.zumar@databricks.com>

# Online Evaluations

_Automatically evaluate traces and multi-turn conversations as they're logged - no code required_
Collaborator

Do we need "no code required"?

Collaborator Author

We've had a couple of Databricks customers who want non-coders (e.g. PMs) to be able to create and run judges, so I'd like to keep this


_Automatically evaluate traces and multi-turn conversations as they're logged - no code required_

Online evaluations run your LLM judges automatically on traces and multi-turn conversations as they're logged to MLflow, without requiring manual execution of code. This enables two key use cases:
Collaborator Author

Let's keep judges, since:

  1. We're going to refactor scorers to "LLM Judges and Scorers" as part of the eval documentation overhaul (which is also what we do on Databricks)
  2. We only support online evaluation of LLM judges

Comment on lines +29 to +32
1. **The MLflow Server is running** (see [Set Up MLflow Server](/genai/getting-started/connect-environment))
2. **MLflow Tracing is enabled** in your agent or LLM application (see [Tracing Quickstart](/genai/tracing/quickstart))
- **For multi-turn conversation evaluation**, traces must include session IDs (see [Track Users & Sessions](/genai/tracing/track-users-sessions))
3. **An AI Gateway endpoint is configured** for LLM judge execution (see [Create and Manage Endpoints](/genai/governance/ai-gateway/endpoints/create-and-manage))
Collaborator

Could we add the link to the bolded text directly instead of adding "see ..." on each line?

Collaborator Author

Absolutely - done!

Comment on lines +91 to +107
# Create and start a trace-level judge
tool_judge = ToolCallCorrectness(model="gateway:/your-endpoint")
registered_tool_judge = tool_judge.register(name="tool_call_correctness")
registered_tool_judge.start(
    sampling_config=ScorerSamplingConfig(sample_rate=0.5),  # Evaluate 50% of traces
)

# Create and start a session-level judge for multi-turn conversations
frustration_judge = ConversationalGuidelines(
    name="user_frustration",
    guidelines="The user should not express frustration, confusion, or dissatisfaction during the conversation.",
    model="gateway:/your-endpoint",
)
registered_frustration_judge = frustration_judge.register(name="user_frustration")
registered_frustration_judge.start(
    sampling_config=ScorerSamplingConfig(sample_rate=1.0),  # Evaluate all conversations
)
Collaborator

I feel like we only need one example for "starting a judge".

Collaborator Author

I think it's important to show that trace-level and session-level judges are supported


Assessments from online evaluation appear directly in the MLflow UI. For traces, assessments typically appear within a minute or two of logging. For multi-turn sessions, evaluation begins after 5 minutes of inactivity (no new traces) by default—this is <APILink fn="mlflow.environment_variables.MLFLOW_ONLINE_SCORING_DEFAULT_SESSION_COMPLETION_BUFFER_SECONDS">configurable</APILink>. Navigate to your experiment's **Traces** or **Sessions** tab to see results.

<ImageBox
Collaborator

Could we include charts in quality tab of overview as well?

Collaborator

+1

Collaborator Author

Great idea. Done!
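The session-inactivity buffer quoted above is controlled by the environment variable `MLFLOW_ONLINE_SCORING_DEFAULT_SESSION_COMPLETION_BUFFER_SECONDS`. A small sketch of shortening it, assuming (as the name suggests) the value is an integer number of seconds and that it must be set in the environment of the process that performs online scoring before that process starts:

```python
import os

# Shorten the session-completion buffer from the 5-minute default to 60 seconds,
# so multi-turn evaluation kicks in after one minute of inactivity.
os.environ["MLFLOW_ONLINE_SCORING_DEFAULT_SESSION_COMPLETION_BUFFER_SECONDS"] = "60"
```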

| Issue | Solution |
| ---------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Missing assessments** | Verify that the judge is active, the filter matches your traces, the sampling rate is greater than zero, and the traces are less than one hour old |
| **Unexpected or unsatisfactory judge results** | Review the judge's prompt/guidelines and test with known inputs |
Collaborator

Maybe mention `align` for judge optimization here?

Collaborator Author

Great idea - done!

1. **The MLflow Server is running** (see [Set Up MLflow Server](/genai/getting-started/connect-environment))
2. **MLflow Tracing is enabled** in your agent or LLM application (see [Tracing Quickstart](/genai/tracing/quickstart))
- **For multi-turn conversation evaluation**, traces must include session IDs (see [Track Users & Sessions](/genai/tracing/track-users-sessions))
3. **An AI Gateway endpoint is configured** for LLM judge execution (see [Create and Manage Endpoints](/genai/governance/ai-gateway/endpoints/create-and-manage))
Collaborator

Can we give a bit more detailed description for the 3rd step? The AI Gateway product category is generally unrelated to evaluation/quality (yet), so it might be confusing why this step is required.

Having a "How it works" section would probably also help, so that users/admins don't see online scoring as black-box magic.

Collaborator Author

@dbczumar dbczumar Jan 20, 2026

Totally - I added the detailed description for the 3rd step and introduced a "How it works" section before troubleshooting. Let me know if this reads better :)

Collaborator

Yeah looks very clear, thanks!




## Next Steps

<CardGroup cols={2}>
Collaborator

nit: Can we use TilesGrid for consistency with other pages?

Collaborator Author

Totally - done :)


## Online vs Offline Evaluation

| | Online Evaluation | Offline Evaluation |
Collaborator

@B-Step62 B-Step62 Jan 20, 2026

This comparison feels like it unduly narrows offline evaluation's capabilities.

  • Offline evaluation is still commonly used for the internal QA phase, not only for regressions or bug fixes.
  • The "data source" part sounds like offline evaluation does not support traces/conversations.

Here is Claude's suggestion when given the original table, to share another data point:

(Screenshot of the suggested comparison table, captured 2026-01-20.)

Collaborator Author

Makes sense - updated - let me know what you think :)

Collaborator

@B-Step62 B-Step62 left a comment

LGTM!

@dbczumar dbczumar added this pull request to the merge queue Jan 21, 2026
Merged via the queue into mlflow:master with commit 27a5b53 Jan 21, 2026
48 checks passed
@dbczumar dbczumar deleted the online_docs branch January 21, 2026 18:09
harupy pushed a commit to harupy/mlflow that referenced this pull request Jan 28, 2026
Signed-off-by: dbczumar <corey.zumar@databricks.com>
Co-authored-by: Claude <noreply@anthropic.com>
harupy pushed a commit to harupy/mlflow that referenced this pull request Jan 28, 2026
Signed-off-by: dbczumar <corey.zumar@databricks.com>
Co-authored-by: Claude <noreply@anthropic.com>
harupy pushed a commit that referenced this pull request Jan 28, 2026
Signed-off-by: dbczumar <corey.zumar@databricks.com>
Co-authored-by: Claude <noreply@anthropic.com>

Labels

area/docs Documentation issues area/evaluation MLflow Evaluation rn/feature Mention under Features in Changelogs. v3.9.0

4 participants