Skip to content

Implement automatic discovery for builtin scorers#19443

Merged
alkispoly-db merged 5 commits intomlflow:masterfrom
alkispoly-db:mlflow-auto-builtin
Dec 17, 2025
Merged

Implement automatic discovery for builtin scorers#19443
alkispoly-db merged 5 commits intomlflow:masterfrom
alkispoly-db:mlflow-auto-builtin

Conversation

@alkispoly-db
Copy link
Collaborator

@alkispoly-db alkispoly-db commented Dec 17, 2025

🛠 DevTools 🛠

Open in GitHub Codespaces

Install mlflow from this PR

# mlflow
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/19443/merge
# mlflow-skinny
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/19443/merge#subdirectory=libs/skinny

For Databricks, use the following command:

%sh curl -LsSf https://raw.githubusercontent.com/mlflow/mlflow/HEAD/dev/install-skinny.sh | sh -s pull/19443/merge

Related Issues/PRs

N/A

What changes are proposed in this pull request?

This PR implements automatic discovery for builtin scorers, replacing the hardcoded list in get_all_scorers() with a dynamic introspection-based approach using Python's __subclasses__() method.

Key changes:

  • Added _get_all_concrete_builtin_scorers() helper function that recursively discovers all concrete BuiltInScorer subclasses
  • Updated get_all_scorers() to use automatic discovery instead of maintaining a hardcoded list
  • Added proper exception handling for scorers requiring constructor arguments (catches both TypeError and pydantic.ValidationError)
  • Removed unused is_databricks_uri import
  • Updated tests to expect 16+ scorers instead of the previous 9/11
  • Added test_builtin_scorer_discovery() to verify the discovery mechanism

Benefits:

  • Zero maintenance: New scorers are automatically discovered without code changes
  • Complete coverage: Now discovers all 16 instantiable scorers (vs previous 9 OSS / 11 Databricks)
  • Adds missing scorers: Includes 5+ scorers that were missing from the hardcoded list:
    • Fluency
    • Summarization
    • ConversationalSafety
    • ConversationalToolCallEfficiency
    • ConversationalRoleAdherence
  • Future-proof: Adding new scorers requires no changes to get_all_scorers()

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests

Testing:

  • Updated test_get_all_scorers_oss to verify 16+ scorers are discovered
  • Added test_builtin_scorer_discovery to validate discovery of all expected scorer classes
  • Verified CLI command mlflow scorers list --builtin returns all 16 scorers
  • All tests pass with both OSS and Databricks tracking URIs

Does this PR require documentation update?

  • No. You can skip the rest of this section.

The function docstring has been updated to explain the automatic discovery mechanism. No external documentation changes needed.

Release Notes

Is this a user-facing change?

  • Yes. Give a description of this change to be included in the release notes for MLflow users.

mlflow scorers list --builtin now automatically discovers and returns all builtin scorers (16 total) instead of a hardcoded subset (9 OSS / 11 Databricks). This includes 5+ previously unavailable scorers: Fluency, Summarization, ConversationalSafety, ConversationalToolCallEfficiency, and ConversationalRoleAdherence. The get_all_scorers() function now uses automatic discovery, eliminating the need to manually maintain the scorer list.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows

How should the PR be classified in the release notes? Choose one:

  • rn/feature - A new user-facing feature worth mentioning in the release notes

Should this PR be included in the next patch release?

  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

This is a feature enhancement that increases the number of available scorers, making it more appropriate for a minor release.

Replace hardcoded list in get_all_scorers() with automatic discovery
using Python's __subclasses__() introspection. This eliminates the need
to manually maintain the scorer list when adding new scorers.

Key changes:
- Add _get_all_concrete_builtin_scorers() helper function that
  recursively discovers all concrete BuiltInScorer subclasses
- Update get_all_scorers() to use automatic discovery instead of
  hardcoded list
- Handle scorers requiring constructor args (e.g., Guidelines) by
  catching both TypeError and pydantic.ValidationError
- Remove unused is_databricks_uri import
- Update tests to expect 16+ scorers instead of 9/11
- Add test_builtin_scorer_discovery() to verify discovery mechanism

Benefits:
- Zero maintenance: new scorers automatically discovered
- Complete coverage: discovers all 16 instantiable scorers (vs 9/11)
- Adds 5+ scorers previously missing from hardcoded list: Fluency,
  Summarization, ConversationalSafety, ConversationalToolCallEfficiency,
  ConversationalRoleAdherence
- Future-proof: no code changes needed for new scorers

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
@github-actions github-actions bot added area/evaluation MLflow Evaluation rn/feature Mention under Features in Changelogs. labels Dec 17, 2025
Changes based on ALKIS comments:
- Move inspect import to top-level (no local imports)
- Consolidate tests: remove redundant test_get_all_scorers_oss
- Rewrite test to use public API get_all_scorers() instead of private
  _get_all_concrete_builtin_scorers()
- Improve test to verify exact set of expected scorers
- Simplify docstring and comments

The new test_get_all_scorers() is more comprehensive:
- Tests the public API directly (better encapsulation)
- Verifies exact count and exact set of scorers
- Explicitly checks Guidelines is excluded
- No dependency on tracking URI (simpler)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
@github-actions
Copy link
Contributor

github-actions bot commented Dec 17, 2025

Documentation preview for 03ede60 is available at:

More info
  • Ignore this comment if this PR does not change the documentation.
  • The preview is updated when a new commit is pushed to this PR.
  • This comment was created by this workflow run.
  • The documentation was built by this workflow run.

alkispoly-db and others added 2 commits December 17, 2025 00:46
Address ALKIS feedback:
- Remove hardcoded count (expected_count = 16) in favor of directly
  comparing scorer_class_names with expected_scorers
- Remove all explanatory comments from test body
- Remove custom assertion messages - rely on pytest introspection
- Convert scorer_names check to simpler set comparison for duplicates

The test is now more concise and focuses purely on comparing the
actual set of discovered scorers against the expected set.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
The assertions checking that BuiltInScorer, BuiltInSessionLevelScorer,
and Guidelines are not in scorer_class_names are redundant since
the equality check `scorer_class_names == expected_scorers` already
guarantees these classes are excluded.

This further simplifies the test to its essential validation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
The test_list_builtin_scorers_shows_all_available_scorers test was
trying to patch is_databricks_uri which no longer exists in
builtin_scorers.py after our refactoring.

Since get_all_scorers() now returns all scorers regardless of
environment (no Databricks-specific conditional logic), the
parameterized test checking different behaviors is obsolete.

Simplified the test to verify that the CLI returns the same scorers
as get_all_scorers() without mocking or parameterization.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
Copy link
Collaborator

@AveshCSingh AveshCSingh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@alkispoly-db alkispoly-db added this pull request to the merge queue Dec 17, 2025
Merged via the queue into mlflow:master with commit bc45779 Dec 17, 2025
47 of 49 checks passed
@alkispoly-db alkispoly-db deleted the mlflow-auto-builtin branch December 17, 2025 18:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/evaluation MLflow Evaluation rn/feature Mention under Features in Changelogs.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants