Skip to content

Refactor Judge utils#18622

Merged
smoorjani merged 11 commits intomlflow:masterfrom
smoorjani:refactor-judge-utils
Nov 6, 2025
Merged

Refactor Judge utils#18622
smoorjani merged 11 commits intomlflow:masterfrom
smoorjani:refactor-judge-utils

Conversation

@smoorjani
Copy link
Collaborator

@smoorjani smoorjani commented Nov 3, 2025

🛠 DevTools 🛠

Open in GitHub Codespaces

Install mlflow from this PR

# mlflow
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/18622/merge
# mlflow-skinny
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/18622/merge#subdirectory=libs/skinny

For Databricks, use the following command:

%sh curl -LsSf https://raw.githubusercontent.com/mlflow/mlflow/HEAD/dev/install-skinny.sh | sh -s pull/18622/merge

Related Issues/PRs

#xxx

What changes are proposed in this pull request?

Refactoring the judge utils to split up the file (1200+ lines) to make further changes.

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/tracking: Tracking Service, tracking client APIs, autologging
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
  • area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
  • area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
  • area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
  • area/projects: MLproject format, project running backends
  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
  • Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
    Bug fixes, doc updates and new features usually go into minor releases.
  • Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
    Bug fixes and doc updates usually go into patch releases.
  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

@github-actions github-actions bot added area/evaluation MLflow Evaluation rn/none List under Small Changes in Changelogs. v3.5.2 labels Nov 3, 2025
@smoorjani smoorjani force-pushed the refactor-judge-utils branch from e64ae77 to 10116cf Compare November 3, 2025 16:41
Signed-off-by: Samraj Moorjani <samraj.moorjani@databricks.com>
@smoorjani smoorjani force-pushed the refactor-judge-utils branch from 10116cf to ea7af3e Compare November 3, 2025 19:51
@github-actions
Copy link
Contributor

github-actions bot commented Nov 3, 2025

Documentation preview for d41151a is available at:

More info
  • Ignore this comment if this PR does not change the documentation.
  • The preview is updated when a new commit is pushed to this PR.
  • This comment was created by this workflow run.
  • The documentation was built by this workflow run.

Signed-off-by: Samraj Moorjani <samraj.moorjani@databricks.com>
Signed-off-by: Samraj Moorjani <samraj.moorjani@databricks.com>
@@ -0,0 +1 @@
"""Adapters for different LLM providers used by judges."""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this file?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this docstring, but kept the init around


from databricks.rag_eval import context, env_vars

env_vars.RAG_EVAL_EVAL_SESSION_CLIENT_NAME.set(f"mlflow-judge-optimizer-v{VERSION}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this session name always correct? It seems this function is called by the _invoke_databricks_judge function as well, which is not related to judge optimizer.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I set a default here and set it specifically to the judge optimizer when called from an optimization function.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally the most confusing part of the original implementation was we have three very similar functions:

  • _invoke_databricks_model
  • _invoke_databricks_judge
  • _invoke_databricks_judge_model

Is there any way to get out of this state?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a rename of these to make it clearer so now it's much clearer -

  • _invoke_databricks_serving_endpoint
  • _invoke_databricks_serving_endpoint_judge
  • _invoke_databricks_default_judge

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, it looks better. To make it even clearer, shall we separate managed judge and serving endpoint to different adapters? They are both Databricks, but the invocation logic is totally different.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! This will be even nicer once we do the refactor below to have a generic Adapter which is extended by all of these.

Comment on lines +14 to +21
_check_databricks_agents_installed,
_invoke_databricks_judge,
_invoke_databricks_judge_model,
_invoke_databricks_model,
_parse_databricks_judge_response,
_parse_databricks_model_response,
_record_judge_model_usage_failure_databricks_telemetry,
_record_judge_model_usage_success_databricks_telemetry,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Non blocking] I love the adapter abstraction. However, it looks a bit of code smell if we need to expose all these private utilities after splitting them. Can we update adapter abstraction such that we don't need to expose them here?

Ideally we can refactor the logic to sth like this:

  • Each adapter exposes one invoke public API.
  • Adapters also expose a helper function like is_applicable(provider, model, ...) (or some factory returns which adapter to use).
  • The top level invoke_judge_model should not contain any adapter specific logic.

Feel free to address this in a follow-up to reduce the size of change tho.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cleaned up this file, but it makes sense to have some common abstraction. Let me do this in a follow-up so it's easier to review.

.
Signed-off-by: Samraj Moorjani <samraj.moorjani@databricks.com>
Signed-off-by: Samraj Moorjani <samraj.moorjani@databricks.com>
Signed-off-by: Samraj Moorjani <samraj.moorjani@databricks.com>
@smoorjani smoorjani requested a review from B-Step62 November 4, 2025 19:35
Signed-off-by: Samraj Moorjani <samraj.moorjani@databricks.com>
@harupy
Copy link
Member

harupy commented Nov 5, 2025

/review


🚀 Review workflow started


✅ Review completed.

Review Output

Perfect! I've completed the PR review. Let me provide a summary:

PR Review Summary

I've reviewed PR #18622 "Refactor Judge utils" which splits the large judge/utils.py file (1200+ lines) into a more organized module structure with adapters and utility submodules.

Issues Found

I identified 4 style guide violations and added review comments for each:

  1. databricks_adapter.py:220-225 - Loop-and-break pattern that should use next() for finding text content in reasoning responses

  2. litellm_adapter.py:318-324 - Loop-and-break pattern that should use next() with enumerate() for finding assistant messages with tool calls

  3. litellm_adapter.py:42-43 - Redundant docstring that merely restates the function name _is_litellm_available

  4. invocation_utils.py:42-43 - Redundant docstring that merely restates the function name _is_litellm_available

All issues are minor style improvements that would make the code more concise and aligned with the project's Python style guide. The refactoring itself appears well-structured, maintains backward compatibility through the __init__.py exports, and the test files follow proper testing practices with mock assertions and parametrization.

Comment on lines +220 to +225
if isinstance(content, list):
text_content = None
for item in content:
if isinstance(item, dict) and item.get("type") == "text":
text_content = item.get("text")
break
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per the Python style guide, consider using next() instead of a loop-and-break pattern:

Suggested change
if isinstance(content, list):
text_content = None
for item in content:
if isinstance(item, dict) and item.get("type") == "text":
text_content = item.get("text")
break
# Handle reasoning response (list of content items)
if isinstance(content, list):
text_content = next(
(item.get("text") for item in content if isinstance(item, dict) and item.get("type") == "text"),
None
)

This is more concise and functional.

🤖 Generated with Claude Code

Comment on lines +318 to +324
assistant_msg = None
assistant_idx = None
for i, msg in enumerate(pruned_messages):
if msg.role == "assistant" and msg.tool_calls:
assistant_msg = msg
assistant_idx = i
break
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per the Python style guide, consider using next() with enumerate() instead of a loop-and-break pattern:

Suggested change
assistant_msg = None
assistant_idx = None
for i, msg in enumerate(pruned_messages):
if msg.role == "assistant" and msg.tool_calls:
assistant_msg = msg
assistant_idx = i
break
# Find first assistant message with tool calls
result = next(
((i, msg) for i, msg in enumerate(pruned_messages) if msg.role == "assistant" and msg.tool_calls),
None
)
if result is None:
break # No more tool calls to remove
assistant_idx, assistant_msg = result

This is more concise and functional while still capturing both the index and message.

🤖 Generated with Claude Code

Comment on lines +42 to +43
def _is_litellm_available() -> bool:
"""Check if LiteLLM is available for import."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per the Python style guide, this docstring is redundant as it merely restates what the function name already conveys. Consider removing it:

Suggested change
def _is_litellm_available() -> bool:
"""Check if LiteLLM is available for import."""
def _is_litellm_available() -> bool:
try:

🤖 Generated with Claude Code

@mlflow mlflow deleted a comment from github-actions bot Nov 5, 2025
smoorjani and others added 2 commits November 5, 2025 17:17
Signed-off-by: Samraj Moorjani <samraj.moorjani@databricks.com>
Copy link
Collaborator

@B-Step62 B-Step62 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good once #18622 (comment) is addressed. Let's refactor the logic a bit more in the follow-up to make it easier to maintain and prevent mistake/regression.

Signed-off-by: Samraj Moorjani <samraj.moorjani@databricks.com>
@smoorjani smoorjani enabled auto-merge November 6, 2025 18:29
.
Signed-off-by: Samraj Moorjani <samraj.moorjani@databricks.com>
@smoorjani smoorjani added this pull request to the merge queue Nov 6, 2025
Merged via the queue into mlflow:master with commit 3a2ee2d Nov 6, 2025
64 of 66 checks passed
@smoorjani smoorjani deleted the refactor-judge-utils branch November 6, 2025 21:06
@B-Step62 B-Step62 added the v3.6.0 label Nov 7, 2025
B-Step62 pushed a commit to B-Step62/mlflow that referenced this pull request Nov 7, 2025
Signed-off-by: Samraj Moorjani <samraj.moorjani@databricks.com>
@github-actions github-actions bot added v3.6.1 and removed v3.6.0 labels Nov 8, 2025
B-Step62 pushed a commit to B-Step62/mlflow that referenced this pull request Nov 11, 2025
Signed-off-by: Samraj Moorjani <samraj.moorjani@databricks.com>
B-Step62 pushed a commit that referenced this pull request Nov 11, 2025
Signed-off-by: Samraj Moorjani <samraj.moorjani@databricks.com>
@B-Step62 B-Step62 added v3.6.0 and removed v3.6.1 labels Nov 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/evaluation MLflow Evaluation rn/none List under Small Changes in Changelogs. v3.5.2 v3.6.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants