
Add description field to all built-in scorers#18547

Merged
BenWilson2 merged 5 commits into mlflow:master from alkispoly-db:mlflow-builtin-descriptions on Nov 3, 2025

Conversation

@alkispoly-db (Collaborator) commented Oct 28, 2025

🛠 DevTools 🛠

Open in GitHub Codespaces

Install mlflow from this PR

# mlflow
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/18547/merge
# mlflow-skinny
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/18547/merge#subdirectory=libs/skinny

For Databricks, use the following command:

%sh curl -LsSf https://raw.githubusercontent.com/mlflow/mlflow/HEAD/dev/install-skinny.sh | sh -s pull/18547/merge

Related Issues/PRs

N/A

What changes are proposed in this pull request?

This PR adds a description field to all 9 built-in scorer classes in MLflow to improve discoverability and documentation. Each scorer now includes a concise, human-readable description that explains what it evaluates:

  • RetrievalRelevance: Evaluate whether each retrieved context chunk is relevant to the input request
  • RetrievalSufficiency: Evaluate whether the information in the last retrieval is sufficient to generate the expected facts
  • RetrievalGroundedness: Assess whether the facts in the response are implied by the retrieval information (no hallucinations)
  • Guidelines: Evaluate whether the agent's response follows specific constraints or instructions
  • ExpectationsGuidelines: Evaluate whether responses follow row-specific constraints
  • RelevanceToQuery: Ensure responses directly address the user's input
  • Safety: Ensure responses do not contain harmful, offensive, or toxic content
  • Correctness: Check whether the response matches expected facts
  • Equivalence: Compare outputs against expected outputs for semantic equivalence
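The resulting pattern can be sketched with a minimal, self-contained example. Note that the class and field names below are simplified stand-ins to illustrate the shape of the change, not the actual MLflow implementation:

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative sketch only: these classes are hypothetical stand-ins
# for MLflow's scorer classes, not the real mlflow API.
@dataclass
class Scorer:
    name: str
    description: Optional[str] = None  # custom scorers still default to None

@dataclass
class Safety(Scorer):
    # Built-in scorers now ship with a concise default description.
    name: str = "safety"
    description: str = (
        "Ensure responses do not contain harmful, offensive, or toxic content"
    )

if __name__ == "__main__":
    print(Safety().description)
```

Because the description lives on the class as a default, UIs and documentation tooling can surface it without instantiating scorers with any extra configuration.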

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests

Ran the full test suite for builtin_scorers: 58 tests passed, 4 skipped. All existing tests continue to pass with no breaking changes.

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

Added description fields to all built-in scorers (RetrievalRelevance, RetrievalSufficiency, RetrievalGroundedness, Guidelines, ExpectationsGuidelines, RelevanceToQuery, Safety, Correctness, Equivalence) to improve discoverability and make the API more self-documenting.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • `area/tracking`: Tracking Service, tracking client APIs, autologging
  • `area/models`: MLmodel format, model serialization/deserialization, flavors
  • `area/model-registry`: Model Registry service, APIs, and the fluent client calls for Model Registry
  • `area/scoring`: MLflow Model server, model deployment tools, Spark UDFs
  • `area/evaluation`: MLflow model evaluation features, evaluation metrics, and evaluation workflows
  • `area/gateway`: MLflow AI Gateway client APIs, server, and third-party integrations
  • `area/prompts`: MLflow prompt engineering features, prompt templates, and prompt management
  • `area/tracing`: MLflow Tracing features, tracing APIs, and LLM tracing functionality
  • `area/projects`: MLproject format, project running backends
  • `area/uiux`: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • `area/build`: Build and test infrastructure for MLflow
  • `area/docs`: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:

  • `rn/none` - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • `rn/breaking-change` - The PR will be mentioned in the "Breaking Changes" section
  • `rn/feature` - A new user-facing feature worth mentioning in the release notes
  • `rn/bug-fix` - A user-facing bug fix worth mentioning in the release notes
  • `rn/documentation` - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

`Yes` should be selected for bug fixes, documentation updates, and other small changes. `No` should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

Added concise description fields to all built-in scorer classes to improve
discoverability and documentation. Each description provides a brief summary
of what the scorer evaluates:

- RetrievalRelevance: Evaluates chunk relevance to input request
- RetrievalSufficiency: Checks if retrieval info is sufficient for expected facts
- RetrievalGroundedness: Assesses if response facts are implied by retrieval (no hallucinations)
- Guidelines: Checks adherence to specified constraints/instructions
- ExpectationsGuidelines: Validates per-row guideline adherence
- RelevanceToQuery: Ensures response addresses user input without deviation
- Safety: Ensures no harmful, offensive, or toxic content
- Correctness: Verifies response matches expected facts
- Equivalence: Compares outputs for semantic equivalence

This change improves the scorer API by providing human-readable descriptions
that can be displayed in UIs and documentation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
@alkispoly-db added the `rn/feature` (Mention under Features in Changelogs) and `area/evaluation` (MLflow Evaluation) labels on Oct 28, 2025
github-actions bot (Contributor) commented Oct 31, 2025

Documentation preview for ebfde36 is available.

Updated two tests that expected built-in scorers to have `description=None`.
With the addition of default descriptions to built-in scorers, these tests
now correctly verify that built-in scorers have non-empty string descriptions.

Changes:
- test_builtin_scorer_without_description: Now verifies scorers have default descriptions
- test_backward_compatibility_scorer_without_description: Updated to check that built-in
  scorers have default descriptions while custom scorers/judges still default to None
- Added clarifying comments explaining the new behavior

All 14 tests in test_scorer_description.py now pass.
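The updated test expectations can be illustrated with a self-contained sketch. The class and test names here are hypothetical stand-ins for the real scorers and test helpers in test_scorer_description.py:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-ins used only to illustrate the updated expectations;
# the real classes live in MLflow's builtin_scorers module.
@dataclass
class Scorer:
    name: str
    description: Optional[str] = None

@dataclass
class Correctness(Scorer):
    # Built-in scorers now carry a default description.
    name: str = "correctness"
    description: str = "Check whether the response matches expected facts"

def test_builtin_scorer_has_default_description():
    scorer = Correctness()
    # Previously this asserted description is None; built-in scorers
    # now provide a non-empty string description by default.
    assert isinstance(scorer.description, str) and scorer.description

def test_custom_scorer_description_defaults_to_none():
    # Backward compatibility: custom scorers still default to None
    # unless a description is passed explicitly.
    assert Scorer(name="my_scorer").description is None

test_builtin_scorer_has_default_description()
test_custom_scorer_description_defaults_to_none()
```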

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
@BenWilson2 BenWilson2 added this pull request to the merge queue Nov 3, 2025
Merged via the queue into mlflow:master with commit 35cf507 Nov 3, 2025
46 of 48 checks passed