Documentation for online evaluation / scoring#20103
Conversation
🛠 DevTools 🛠
Install mlflow from this PR. For Databricks, use the following command:
Signed-off-by: dbczumar <corey.zumar@databricks.com>
Pull request overview
This PR adds a sidebar navigation entry for new "Online Evaluations" documentation as part of the MLflow online evaluation/scoring feature. The PR is marked as WIP (Work in Progress) and introduces documentation for automated LLM judges that run as traces are logged.
Changes:
- Added a new "Online Evaluations" entry to the GenAI sidebar navigation
Documentation preview for e2e739f is available (1 changed page).
- Fix prerequisites section with arrow-style links
- Add AI Gateway endpoint prerequisite
- Update UI instructions with Configure the judge section
- Fix SDK example to use correct register/start API pattern
- Add ScorerSamplingConfig and model parameter
- Change 'Streamlined Development' to 'Streamlined Quality Improvement'
- Various copy improvements

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: dbczumar <corey.zumar@databricks.com>
- Add ImageBox import
- Mention Sessions tab in addition to Traces tab
- Use ImageBox with 80% width and caption
- Remove redundant filter text

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: dbczumar <corey.zumar@databricks.com>
- UI: Note about clicking existing judge to edit
- SDK: Add get_scorer, update, and stop examples

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: dbczumar <corey.zumar@databricks.com>
- Judges tab icon
- New LLM judge button
- Evaluation settings section

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: dbczumar <corey.zumar@databricks.com>
# Online Evaluations

_Automatically evaluate traces and multi-turn conversations as they're logged - no code required_
Do we need "no code required"?
We've had a couple of Databricks customers who want non-coders (e.g. PMs) to be able to create and run judges, so I'd like to keep this
_Automatically evaluate traces and multi-turn conversations as they're logged - no code required_

Online evaluations run your LLM judges automatically on traces and multi-turn conversations as they're logged to MLflow, without requiring manual execution of code. This enables two key use cases:
Could we unify the term to use scorers and add a link to https://pr-20103--mlflow-docs-preview.netlify.app/docs/latest/genai/eval-monitor/scorers/?
Let's keep judges, since:
- We're going to refactor scorers to "LLM Judges and Scorers" as part of the eval documentation overhaul (which is also what we do on Databricks)
- We only support online evaluation of LLM judges
1. **The MLflow Server is running** (see [Set Up MLflow Server](/genai/getting-started/connect-environment))
2. **MLflow Tracing is enabled** in your agent or LLM application (see [Tracing Quickstart](/genai/tracing/quickstart))
   - **For multi-turn conversation evaluation**, traces must include session IDs (see [Track Users & Sessions](/genai/tracing/track-users-sessions))
3. **An AI Gateway endpoint is configured** for LLM judge execution (see [Create and Manage Endpoints](/genai/governance/ai-gateway/endpoints/create-and-manage))
Could we add the link to the bolded text directly instead of adding "see ..." on each line?
Absolutely - done!
```python
# Import path assumed from context; ScorerSamplingConfig ships with mlflow.genai
from mlflow.genai.scorers import ScorerSamplingConfig

# Create and start a trace-level judge
tool_judge = ToolCallCorrectness(model="gateway:/your-endpoint")
registered_tool_judge = tool_judge.register(name="tool_call_correctness")
registered_tool_judge.start(
    sampling_config=ScorerSamplingConfig(sample_rate=0.5),  # Evaluate 50% of traces
)
```

```python
# Create and start a session-level judge for multi-turn conversations
frustration_judge = ConversationalGuidelines(
    name="user_frustration",
    guidelines="The user should not express frustration, confusion, or dissatisfaction during the conversation.",
    model="gateway:/your-endpoint",
)
registered_frustration_judge = frustration_judge.register(name="user_frustration")
registered_frustration_judge.start(
    sampling_config=ScorerSamplingConfig(sample_rate=1.0),  # Evaluate all conversations
)
```
I feel like we only need one example for 'starting a judge'.
I think it's important to show that trace-level and session-level judges are supported
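For intuition, a `sample_rate` of 0.5 means each trace is independently selected for evaluation with roughly 50% probability. Here is a minimal sketch of deterministic, hash-based sampling; this is illustrative only, not MLflow's actual implementation, and `should_evaluate` is a hypothetical helper:

```python
import hashlib

def should_evaluate(trace_id: str, sample_rate: float) -> bool:
    """Deterministically decide whether a trace is sampled for evaluation.

    Hashing the trace ID yields a stable, roughly uniform value in [0, 1),
    so the same trace always gets the same decision.
    """
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate

print(should_evaluate("tr-123", 1.0))  # True: rate 1.0 evaluates every trace
print(should_evaluate("tr-123", 0.0))  # False: rate 0.0 evaluates none
```

Hash-based selection keeps the decision stable per trace, so a re-logged or retried trace does not flip in or out of the sample.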
Assessments from online evaluation appear directly in the MLflow UI. For traces, assessments typically appear within a minute or two of logging. For multi-turn sessions, evaluation begins after 5 minutes of inactivity (no new traces) by default—this is <APILink fn="mlflow.environment_variables.MLFLOW_ONLINE_SCORING_DEFAULT_SESSION_COMPLETION_BUFFER_SECONDS">configurable</APILink>. Navigate to your experiment's **Traces** or **Sessions** tab to see results.

<ImageBox
Could we include the charts in the Quality tab of the overview as well?
Great idea. Done!
| Issue | Solution |
| --- | --- |
| **Missing assessments** | Verify that the judge is active, the filter matches your traces, the sampling rate is greater than zero, and the traces are less than one hour old |
| **Unexpected or unsatisfactory judge results** | Review the judge's prompt/guidelines and test with known inputs |
Maybe mention `align` for judge optimization here?
Great idea - done!
1. **The MLflow Server is running** (see [Set Up MLflow Server](/genai/getting-started/connect-environment))
2. **MLflow Tracing is enabled** in your agent or LLM application (see [Tracing Quickstart](/genai/tracing/quickstart))
   - **For multi-turn conversation evaluation**, traces must include session IDs (see [Track Users & Sessions](/genai/tracing/track-users-sessions))
3. **An AI Gateway endpoint is configured** for LLM judge execution (see [Create and Manage Endpoints](/genai/governance/ai-gateway/endpoints/create-and-manage))
Can we give a bit more detailed description for the 3rd step? The AI Gateway product category is generally unrelated to evaluation/quality (yet), so it might be confusing why this step is required.
Probably a "How it works" section would also be helpful, so that users/admins don't feel online scoring is black-box magic.
Totally - I added the detailed description for the 3rd step and introduced a "How it works" section before troubleshooting. Let me know if this reads better :)
Yeah looks very clear, thanks!
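To make the 3rd prerequisite concrete: the judge needs an LLM to call, and that LLM is reached through an AI Gateway endpoint. As a sketch, a minimal gateway configuration might look like the following; the endpoint name, provider, and model are placeholders, and the exact schema should be checked against the AI Gateway docs:

```yaml
endpoints:
  - name: your-endpoint
    endpoint_type: llm/v1/chat
    model:
      provider: openai
      name: gpt-4o
      config:
        openai_api_key: $OPENAI_API_KEY
```

The `gateway:/your-endpoint` model URI in the SDK examples then refers to this endpoint by name.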
Assessments from online evaluation appear directly in the MLflow UI. For traces, assessments typically appear within a minute or two of logging. For multi-turn sessions, evaluation begins after 5 minutes of inactivity (no new traces) by default—this is <APILink fn="mlflow.environment_variables.MLFLOW_ONLINE_SCORING_DEFAULT_SESSION_COMPLETION_BUFFER_SECONDS">configurable</APILink>. Navigate to your experiment's **Traces** or **Sessions** tab to see results.

<ImageBox
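The 5-minute inactivity buffer described above behaves like a debounce: a session becomes eligible for evaluation only once no new trace has arrived within the buffer window. A minimal illustrative sketch of that rule (not MLflow's implementation; names are hypothetical):

```python
# Default buffer in seconds; overridable via the
# MLFLOW_ONLINE_SCORING_DEFAULT_SESSION_COMPLETION_BUFFER_SECONDS env var
SESSION_COMPLETION_BUFFER_SECONDS = 300.0

def session_is_complete(
    last_trace_time: float,
    now: float,
    buffer_seconds: float = SESSION_COMPLETION_BUFFER_SECONDS,
) -> bool:
    """A session is considered complete once it has been idle for the full buffer."""
    return now - last_trace_time >= buffer_seconds

print(session_is_complete(last_trace_time=0.0, now=299.0))  # False: still within buffer
print(session_is_complete(last_trace_time=0.0, now=300.0))  # True: 5 minutes of inactivity
```

Every new trace in the session effectively resets `last_trace_time`, pushing evaluation out until the conversation goes quiet.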
|
|
||
| ## Next Steps | ||
|
|
||
| <CardGroup cols={2}> |
nit: Can we use TilesGrid for consistency with other pages?
Totally - done :)
## Online vs Offline Evaluation

| | Online Evaluation | Offline Evaluation |
This comparison feels like it narrows down offline evaluation's capabilities a bit.
- Offline evaluation is still commonly used in the internal QA phase, not only for regressions or bug fixes.
- The "data source" part sounds like offline evaluation does not support traces/conversations.
Here is Claude's suggestion when given the original table, to share another data point:
Makes sense - updated - let me know what you think :)
Signed-off-by: dbczumar <corey.zumar@databricks.com> Co-authored-by: Claude <noreply@anthropic.com>
Related Issues/PRs
#xxx

What changes are proposed in this pull request?
Docs for online evaluation (scoring)
How is this PR tested?
Does this PR require documentation update?
Release Notes
MLflow now supports online evaluation, enabling developers to automatically run LLM judges as traces are logged. This creates a more seamless experience for identifying quality issues during development and for production quality monitoring.
Is this a user-facing change?
What component(s), interfaces, languages, and integrations does this PR affect?
Components
- area/tracking: Tracking Service, tracking client APIs, autologging
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
- area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
- area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
- area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
- area/projects: MLproject format, project running backends
- area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- area/build: Build and test infrastructure for MLflow
- area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:
- rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
- rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
- rn/feature - A new user-facing feature worth mentioning in the release notes
- rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
- rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?
"Yes" should be selected for bug fixes, documentation updates, and other small changes. "No" should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
Bug fixes, doc updates and new features usually go into minor releases.
Bug fixes and doc updates usually go into patch releases.