
Support structured outputs in make_judge#18529

Merged
TomeHirata merged 25 commits into mlflow:master from TomeHirata:feat/make-judge/structured-output
Nov 3, 2025

Conversation


@TomeHirata TomeHirata commented Oct 27, 2025

🛠 DevTools 🛠


Install mlflow from this PR

# mlflow
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/18529/merge
# mlflow-skinny
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/18529/merge#subdirectory=libs/skinny

For Databricks, use the following command:

%sh curl -LsSf https://raw.githubusercontent.com/mlflow/mlflow/HEAD/dev/install-skinny.sh | sh -s pull/18529/merge

Related Issues/PRs

Resolve #18262

What changes are proposed in this pull request?

Support structured output in make_judge API

  • Added result_type parameter to make_judge for specifying the type of the Feedback object's value.
  • Updated InstructionsJudge to handle the new result_type and serialize it correctly. Because the type must be serializable, only primitive types are currently supported.
  • Modified invoke_judge_model to accept a response_format parameter for structured output.
from mlflow.genai import make_judge
from typing import Literal

judge = make_judge(
    name="conciseness", 
    instructions="the response {{outputs}} is concise enough", 
    result_type=Literal['yes', 'no'], 
)

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/tracking: Tracking Service, tracking client APIs, autologging
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
  • area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
  • area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
  • area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
  • area/projects: MLproject format, project running backends
  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
  • Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
    Bug fixes, doc updates and new features usually go into minor releases.
  • Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
    Bug fixes and doc updates usually go into patch releases.
  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

@github-actions github-actions bot added area/evaluation MLflow Evaluation rn/feature Mention under Features in Changelogs. v3.5.2 labels Oct 27, 2025
- Added `result_type` parameter to `make_judge` for specifying the type of the Feedback object's value.
- Implemented `_validate_result_type` function to validate supported types for serialization.
- Updated `InstructionsJudge` to handle the new `result_type` and serialize it correctly.
- Modified `invoke_judge_model` to accept a `response_format` parameter for structured output.
- Added serialization and deserialization methods for response formats in the Scorer class.
- Included tests for new functionality, ensuring proper handling of various result types and response formats.

Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
@TomeHirata TomeHirata force-pushed the feat/make-judge/structured-output branch from 2a89cfb to 80e52f5 Compare October 27, 2025 06:27

github-actions bot commented Oct 27, 2025

Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
@TomeHirata TomeHirata requested a review from Copilot October 27, 2025 07:16

Copilot AI left a comment


Pull Request Overview

This PR adds support for structured outputs in the make_judge API by introducing a result_type parameter that allows users to specify the type of the judge's feedback value (e.g., int, bool, float, Literal).

Key changes:

  • Added result_type parameter to make_judge and InstructionsJudge for specifying feedback value types
  • Implemented serialization/deserialization methods for result_type to support judge persistence
  • Updated invoke_judge_model and related functions to pass response_format for structured LLM outputs

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Summary per file:
  • mlflow/genai/judges/make_judge.py: Added validation for the result_type parameter and passed it to InstructionsJudge
  • mlflow/genai/judges/instructions_judge/__init__.py: Implemented result_type support with serialization/deserialization and dynamic Pydantic model creation
  • mlflow/genai/scorers/base.py: Added deserialization of result_type in Scorer.model_validate
  • mlflow/genai/judges/utils.py: Added a response_format parameter to judge invocation functions
  • tests/genai/judges/test_make_judge.py: Added comprehensive tests for result_type functionality
  • tests/genai/judges/test_judge_utils.py: Added tests for structured output with Databricks and LiteLLM models
  • docs/docs/genai/eval-monitor/scorers/llm-judge/make-judge.mdx: Added a documentation section for output format specification


Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
instructions: str,
model: str | None = None,
description: str | None = None,
result_type: type | None = None,

@dbczumar dbczumar Oct 27, 2025


Specifically, this is FeedbackValueType or Literal[<list of values in PbValueType>], right? Can we change the type hint to FeedbackValueType | Literal[<list of values in PbValueType>]?

For variable naming, I'd like us to be a bit more explicit too. feedback_value_type?

cc @alkispoly-db @AveshCSingh


@TomeHirata TomeHirata Oct 27, 2025


The argument value here is a "type" itself, so the more accurate type hint is type[FeedbackValueType]. Yes, we can support a list of PbValueType and a dict of PbValueType to be consistent with FeedbackValueType. Do we want to support Literal in this case? I think Literal is useful for categorical responses (e.g., yes, no), but it violates the type hint type[FeedbackValueType], and there's nothing like type[Literal] in Python iirc, so we'd end up with an Any type hint, which raises the type-accuracy concern.
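For illustration (not MLflow code), the typing constraint being discussed: a Literal is a typing special form, not a class, so it cannot satisfy a `type[...]` annotation and has to be detected through typing introspection instead.

```python
# Illustration only: Literal is not an instance of `type`.
from typing import Literal, get_args, get_origin

rating = Literal["yes", "no"]

print(isinstance(rating, type))       # False: not a class
print(get_origin(rating) is Literal)  # True: detected via typing introspection
print(get_args(rating))               # ('yes', 'no')
```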


@dbczumar dbczumar Oct 27, 2025


Yeah, sounds good! We should also support literal

Collaborator Author

@dbczumar Supported a list and dict of PbValueType, can you take another look?

Comment on lines +738 to +750
# Build request payload
payload = {
    "messages": [
        {
            "role": "user",
            "content": prompt,
        }
    ],
}

# Add response_schema if provided
if response_format is not None:
    payload["response_schema"] = response_format.model_json_schema()
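A rough sketch of what this produces (the `response_schema` payload key and the field layout are taken from the snippet above, not verified against the Databricks API; the model here is built ad hoc with pydantic.create_model):

```python
# Sketch: attach a Pydantic model's JSON schema to a request payload
# under `response_schema`, mirroring the snippet above.
import pydantic

ResponseFormat = pydantic.create_model(
    "ResponseFormat",
    result=(str, pydantic.Field(description="The result of the evaluation")),
    rationale=(str, pydantic.Field(description="The rationale for the evaluation")),
)

payload = {"messages": [{"role": "user", "content": "Is the answer concise?"}]}
payload["response_schema"] = ResponseFormat.model_json_schema()

print(sorted(payload["response_schema"]["properties"]))  # ['rationale', 'result']
```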

@dbczumar dbczumar Oct 27, 2025


Let's test this on Databricks to make sure it works with GPT / GPT OSS, Anthropic, Llama. We should introduce some retry logic if response schema is not supported

Collaborator Author

Yes, I tested FMAPI models and structured output worked well!

Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
@TomeHirata TomeHirata requested a review from dbczumar October 27, 2025 23:36
Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
)

@staticmethod
def _serialize_response_format(response_format: type) -> dict[str, Any]:
Collaborator

Could you elaborate on what is missing if we simply serialize the ResponseFormat Pydantic model using .model_json_schema()? We would still need to implement deserialization, but it would let us avoid maintaining our own serialization format.

Collaborator Author

There's no native way to convert a JSON schema back into a Pydantic model class. We could simplify the stored artifact and the deserialization logic with our own simple format rather than JSON schema. But I agree with the value of using a standard schema, so I'm fine with either way.

@TomeHirata TomeHirata Oct 28, 2025

Btw, to clarify: we only need to store the feedback_value_type, not the entire response_format sent to the LLM, which is constructed dynamically (I renamed the serialize method to make this clearer). The feedback value type itself is not a Pydantic model. To use JSON schema, we would need to store the entire response_format, which contains some duplicated information.

Collaborator

Yeah, no strong preference here. Just thought using Pydantic serialization would reduce complexity and spec ambiguity, even if we need to implement deserialization ourselves. Not a blocker.
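The "own simple format" idea discussed above could look roughly like this (a hypothetical sketch; serialize_type/deserialize_type and the spec layout are invented for illustration and are not MLflow's actual format):

```python
# Hypothetical sketch: store just the feedback value type in a small custom
# spec, since a JSON schema has no native inverse back to a Python type.
from typing import Literal, get_args, get_origin

_PRIMITIVES = {"str": str, "int": int, "float": float, "bool": bool}

def serialize_type(tp) -> dict:
    """Serialize a primitive or Literal feedback value type to a small spec."""
    if get_origin(tp) is Literal:
        return {"kind": "literal", "values": list(get_args(tp))}
    return {"kind": "primitive", "name": tp.__name__}

def deserialize_type(spec: dict):
    """Rebuild the Python type from the stored spec."""
    if spec["kind"] == "literal":
        return Literal[tuple(spec["values"])]
    return _PRIMITIVES[spec["name"]]

spec = serialize_type(Literal["yes", "no"])
print(spec)  # {'kind': 'literal', 'values': ['yes', 'no']}
print(deserialize_type(spec) == Literal["yes", "no"])  # True
```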

instructions: str,
model: str | None = None,
description: str | None = None,
feedback_value_type: Any = str,
@B-Step62 B-Step62 Oct 28, 2025

Do you think it would be too aggressive to default to boolean or make this field mandatory? The current default (free-form text) is usable, but users will get inconsistent results and broken aggregation. For example, when I first tried it, I got three different words representing "false". This happens with high-end LLMs as well, because there is no mechanism to enforce consistency by default.

It is not great that the default experience gives the impression that make_judge is broken. I would rather constrain it, or add an extra step, so that users get meaningful results from the default journey.

Collaborator Author

str is the default for backward compatibility. It's not a good experience if users' existing judges change their response type when upgrading the MLflow version. Given the low adoption of make_judge so far, I'm fine with making boolean or Literal["yes", "no"] our default. What do you think? cc: @dbczumar @alkispoly-db

@dbczumar dbczumar Oct 28, 2025

We're definitely going to break existing judges if we do this; let's keep a string default but update documentation to explain how to use the structured outputs and use it in all of our examples.

It is not great that the default experience gives impression of "make_judge is broken"

I haven't experienced this issue if I tell the judge to "return one of the following values exactly: ..."

@B-Step62 B-Step62 Oct 28, 2025

How many "existing judge" users do we have? It is much better to break earlier than later if we believe it makes the product better. The API is marked experimental too.

I haven't experienced this issue if I tell the judge to "return one of the following values exactly: ..."

Do competitors ask users to do something similar? Also, structured output is a theoretically guaranteed way to enforce output shape, so it is more stable than prompting.

@B-Step62 B-Step62 Oct 28, 2025

As a data point, OSS usage is fewer than 5 sessions per day on average. Isn't it a good trade-off to make the default experience better for the hundreds of future users who will be using the API?

Collaborator Author

@dbczumar can you take a look at Yuki's comment above?

Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
…rata/mlflow into feat/make-judge/structured-output
Comment on lines +41 to +61
# Check for dict[str, PbValueType]
if origin is dict:
    args = get_args(feedback_value_type)
    if len(args) == 2:
        key_type, value_type = args
        # Key must be str
        if key_type != str:
            from mlflow.exceptions import MlflowException

            raise MlflowException.invalid_parameter_value(
                f"dict key type must be str, got {key_type}"
            )
        # Value must be a PbValueType
        if value_type not in (str, int, float, bool):
            from mlflow.exceptions import MlflowException

            raise MlflowException.invalid_parameter_value(
                "The `feedback_value_type` argument does not support a dict type "
                f"with non-primitive values, but got {value_type.__name__}"
            )
    return
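A standalone, hypothetical version of the dict check above, using plain ValueError instead of MlflowException so it runs outside MLflow:

```python
# Hypothetical standalone version of the dict[str, PbValueType] check above.
from typing import get_args, get_origin

def check_dict_type(tp) -> None:
    """Reject dict types whose keys are not str or whose values are not primitives."""
    if get_origin(tp) is dict:
        key_type, value_type = get_args(tp)
        if key_type is not str:
            raise ValueError(f"dict key type must be str, got {key_type}")
        if value_type not in (str, int, float, bool):
            raise ValueError(f"unsupported dict value type: {value_type.__name__}")

check_dict_type(dict[str, int])  # passes silently
try:
    check_dict_type(dict[int, str])
except ValueError as err:
    print(err)  # dict key type must be str, got <class 'int'>
```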
@dbczumar dbczumar Oct 30, 2025

Can pydantic help with this validation, or is it too difficult to express in pydantic because we have to support v1 and v2?

@TomeHirata TomeHirata Oct 31, 2025

Afaik, Pydantic helps validate an object against a type, but this validation validates a type against a set of allowed types. So I'm not sure we can simplify this logic using Pydantic.

Comment on lines +405 to +412
response_format = pydantic.create_model(
    "ResponseFormat",
    result=(
        self._feedback_value_type or str,
        pydantic.Field(description=self.description or "The result of the evaluation"),
    ),
    rationale=(str, pydantic.Field(description="The rationale for the evaluation")),
)
Collaborator

@TomeHirata Anything we can do to make the is_trace_based == True case work with Databricks gpt-oss, which doesn't support structured outputs and tool calling together?

Collaborator

It seems like we could accept the response format as an argument to the _build_system_message method and use it to adjust the evaluation_rating_fields?

    def _build_system_message(self, is_trace_based: bool) -> str:
        """Build the system message based on whether this is trace-based or field-based."""
        output_fields = self.get_output_fields()

        if is_trace_based:
            evaluation_rating_fields = "\n".join(
                [f"- {field.name}: {field.description}" for field in output_fields]
            )
            return INSTRUCTIONS_JUDGE_TRACE_PROMPT_TEMPLATE.format(
                evaluation_rating_fields=evaluation_rating_fields,
                instructions=self._instructions,
            )
        else:
            base_prompt = format_prompt(
                INSTRUCTIONS_JUDGE_SYSTEM_PROMPT, instructions=self._instructions
            )
            return add_output_format_instructions(base_prompt, output_fields=output_fields)

@TomeHirata TomeHirata Oct 31, 2025

Agreed on including the type information in the instructions when is_trace_based == True. Added the type information to the evaluation rating field.

@dbczumar dbczumar left a comment

Thanks @TomeHirata ! Just a few small comments (pydantic one isn't blocking, but we should do it if it reduces code complexity)

Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
@TomeHirata TomeHirata force-pushed the feat/make-judge/structured-output branch from 505504d to 48a618f Compare October 31, 2025 00:12
Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
@TomeHirata TomeHirata requested a review from dbczumar October 31, 2025 07:15
Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
@dbczumar dbczumar left a comment

LGTM! Thanks @TomeHirata ! Can you do some manual testing with GPT OSS (and ideally at least one other model) on Databricks before merge?


TomeHirata commented Nov 3, 2025

Yes, I've verified that the following code works:

from mlflow.genai.judges import make_judge
from typing import Literal
import mlflow
import time

performance_judge = make_judge(
    name="performance_analyzer",
    instructions=(
        "Analyze the {{ trace }} for performance issues.\n\n"
        "Check for:\n"
        "- Operations taking longer than 2 seconds\n"
        "- Redundant API calls or database queries\n"
        "- Inefficient data processing patterns\n"
        "- Proper use of caching mechanisms\n\n"
        "Rate as: 'optimal', 'acceptable', or 'needs_improvement'"
    ),
    model="databricks:/databricks-gpt-oss-20b",
    feedback_value_type=Literal['optimal', 'acceptable', 'needs_improvement']
)


@mlflow.trace
def slow_data_processor(query: str):
    """Example application with performance bottlenecks."""
    with mlflow.start_span("fetch_data") as span:
        time.sleep(2.5)
        span.set_inputs({"query": query})
        span.set_outputs({"data": ["item1", "item2", "item3"]})

    with mlflow.start_span("process_data") as span:
        for i in range(3):
            with mlflow.start_span(f"redundant_api_call_{i}"):
                time.sleep(0.5)
        span.set_outputs({"processed": "results"})

    return "Processing complete"


result = slow_data_processor("SELECT * FROM users")
trace_id = mlflow.get_last_active_trace_id()
trace = mlflow.get_trace(trace_id)

feedback = performance_judge(trace=trace)

print(f"Performance Rating: {feedback.value}")
print(f"Analysis: {feedback.rationale}")

from mlflow.genai import make_judge
from typing import Literal

judge = make_judge(
    name="conciseness",
    instructions="the response {{outputs}} is concise enough",
    result_type=Literal['yes', 'no'],
    model="openai:/gpt-5-mini",
)
result = judge(outputs="The capital of France is Paris")
result

Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
@TomeHirata TomeHirata disabled auto-merge November 3, 2025 14:13
@TomeHirata TomeHirata enabled auto-merge November 3, 2025 14:16
@TomeHirata TomeHirata added this pull request to the merge queue Nov 3, 2025
Merged via the queue into mlflow:master with commit 0ead451 Nov 3, 2025
48 checks passed
@TomeHirata TomeHirata deleted the feat/make-judge/structured-output branch November 3, 2025 14:52
@B-Step62 B-Step62 added the v3.6.0 label Nov 7, 2025
B-Step62 pushed a commit to B-Step62/mlflow that referenced this pull request Nov 7, 2025
Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
Signed-off-by: TomuHirata <tomu.hirata@gmail.com>
@github-actions github-actions bot added v3.6.1 and removed v3.6.0 labels Nov 8, 2025
B-Step62 pushed a commit to B-Step62/mlflow that referenced this pull request Nov 11, 2025
Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
Signed-off-by: TomuHirata <tomu.hirata@gmail.com>
B-Step62 pushed a commit that referenced this pull request Nov 11, 2025
Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
Signed-off-by: TomuHirata <tomu.hirata@gmail.com>
@B-Step62 B-Step62 added v3.6.0 and removed v3.6.1 labels Nov 11, 2025

Labels

area/evaluation MLflow Evaluation rn/feature Mention under Features in Changelogs. v3.5.2 v3.6.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FR] Add typing for judge scores in templated make_judge items

4 participants