
Support structured outputs in make_judge#18529

Merged
TomeHirata merged 25 commits into mlflow:master from TomeHirata:feat/make-judge/structured-output
Nov 3, 2025

Conversation


@TomeHirata TomeHirata commented Oct 27, 2025

🛠 DevTools 🛠


Install mlflow from this PR

# mlflow
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/18529/merge
# mlflow-skinny
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/18529/merge#subdirectory=libs/skinny

For Databricks, use the following command:

%sh curl -LsSf https://raw.githubusercontent.com/mlflow/mlflow/HEAD/dev/install-skinny.sh | sh -s pull/18529/merge

Related Issues/PRs

Resolve #18262

What changes are proposed in this pull request?

Support structured output in make_judge API

  • Added result_type parameter to make_judge for specifying the type of the Feedback object's value.
  • Updated InstructionsJudge to handle the new result_type and serialize it correctly. Because the type must be serializable, only primitive types are currently supported.
  • Modified invoke_judge_model to accept a response_format parameter for structured output.
from mlflow.genai import make_judge
from typing import Literal

judge = make_judge(
    name="conciseness", 
    instructions="the response {{outputs}} is concise enough", 
    result_type=Literal['yes', 'no'], 
)

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/tracking: Tracking Service, tracking client APIs, autologging
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
  • area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
  • area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
  • area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
  • area/projects: MLproject format, project running backends
  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
  • Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
    Bug fixes, doc updates and new features usually go into minor releases.
  • Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
    Bug fixes and doc updates usually go into patch releases.
  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

@github-actions github-actions bot added area/evaluation MLflow Evaluation rn/feature Mention under Features in Changelogs. v3.5.2 labels Oct 27, 2025
- Added `result_type` parameter to `make_judge` for specifying the type of the Feedback object's value.
- Implemented `_validate_result_type` function to validate supported types for serialization.
- Updated `InstructionsJudge` to handle the new `result_type` and serialize it correctly.
- Modified `invoke_judge_model` to accept a `response_format` parameter for structured output.
- Added serialization and deserialization methods for response formats in the Scorer class.
- Included tests for new functionality, ensuring proper handling of various result types and response formats.

Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
@TomeHirata TomeHirata force-pushed the feat/make-judge/structured-output branch from 2a89cfb to 80e52f5 Compare October 27, 2025 06:27

github-actions bot commented Oct 27, 2025

Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
@TomeHirata TomeHirata requested a review from Copilot October 27, 2025 07:16

Copilot AI left a comment


Pull Request Overview

This PR adds support for structured outputs in the make_judge API by introducing a result_type parameter that allows users to specify the type of the judge's feedback value (e.g., int, bool, float, Literal).

Key changes:

  • Added result_type parameter to make_judge and InstructionsJudge for specifying feedback value types
  • Implemented serialization/deserialization methods for result_type to support judge persistence
  • Updated invoke_judge_model and related functions to pass response_format for structured LLM outputs

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Summary per file:
  • mlflow/genai/judges/make_judge.py: Added validation for the result_type parameter and passed it to InstructionsJudge
  • mlflow/genai/judges/instructions_judge/__init__.py: Implemented result_type support with serialization/deserialization and dynamic Pydantic model creation
  • mlflow/genai/scorers/base.py: Added deserialization of result_type in Scorer.model_validate
  • mlflow/genai/judges/utils.py: Added a response_format parameter to judge invocation functions
  • tests/genai/judges/test_make_judge.py: Added comprehensive tests for result_type functionality
  • tests/genai/judges/test_judge_utils.py: Added tests for structured output with Databricks and LiteLLM models
  • docs/docs/genai/eval-monitor/scorers/llm-judge/make-judge.mdx: Added a documentation section for output format specification


Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
instructions: str,
model: str | None = None,
description: str | None = None,
result_type: type | None = None,

@dbczumar dbczumar Oct 27, 2025


Specifically, this is FeedbackValueType or Literal[<list of values in PbValueType>], right? Can we change the type hint to FeedbackValueType | Literal[<list of values in PbValueType>]?

For variable naming, I'd like us to be a bit more explicit too. feedback_value_type?

cc @alkispoly-db @AveshCSingh


@TomeHirata TomeHirata Oct 27, 2025


The argument value here is a "type" itself, so the more accurate type hint is type[FeedbackValueType]. Yes, we can support a list of PbValueType and a dict of PbValueType to be consistent with FeedbackValueType. Do we want to support Literal in this case? I think Literal is useful for categorical responses (e.g., yes, no), but it violates the type hint type[FeedbackValueType], and there's nothing like type[Literal] in Python iirc, so we'd end up with an Any type hint, which raises the type-accuracy concern.
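For illustration (not MLflow code), the typing constraint being discussed: a Literal is a typing special form, not a class, so it cannot satisfy a `type[...]` annotation and has to be detected through typing introspection instead.

```python
# Illustration only: Literal is not an instance of `type`.
from typing import Literal, get_args, get_origin

rating = Literal["yes", "no"]

print(isinstance(rating, type))       # False: not a class
print(get_origin(rating) is Literal)  # True: detected via typing introspection
print(get_args(rating))               # ('yes', 'no')
```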


@dbczumar dbczumar Oct 27, 2025


Yeah, sounds good! We should also support literal

Collaborator Author

@dbczumar Supported a list and dict of PbValueType, can you take another look?

Comment on lines +738 to +750
# Build request payload
payload = {
    "messages": [
        {
            "role": "user",
            "content": prompt,
        }
    ],
}

# Add response_schema if provided
if response_format is not None:
    payload["response_schema"] = response_format.model_json_schema()
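A rough sketch of what this produces (the `response_schema` payload key and the field layout are taken from the snippet above, not verified against the Databricks API; the model here is built ad hoc with pydantic.create_model):

```python
# Sketch: attach a Pydantic model's JSON schema to a request payload
# under `response_schema`, mirroring the snippet above.
import pydantic

ResponseFormat = pydantic.create_model(
    "ResponseFormat",
    result=(str, pydantic.Field(description="The result of the evaluation")),
    rationale=(str, pydantic.Field(description="The rationale for the evaluation")),
)

payload = {"messages": [{"role": "user", "content": "Is the answer concise?"}]}
payload["response_schema"] = ResponseFormat.model_json_schema()

print(sorted(payload["response_schema"]["properties"]))  # ['rationale', 'result']
```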

@dbczumar dbczumar Oct 27, 2025


Let's test this on Databricks to make sure it works with GPT / GPT OSS, Anthropic, Llama. We should introduce some retry logic if response schema is not supported

Collaborator Author

Yes, I tested FMAPI models and structured output worked well!

Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
@TomeHirata TomeHirata requested a review from dbczumar October 27, 2025 23:36
Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
)

@staticmethod
def _serialize_response_format(response_format: type) -> dict[str, Any]:
Collaborator

Could you elaborate on what is missing if we simply serialize the ResponseFormat Pydantic model using .model_json_schema()? We would still need to implement deserialization, but it would let us avoid maintaining our own serialization format.

Collaborator Author

There's no native way to convert a JSON schema back into a Pydantic model class. We could simplify the stored artifact and the deserialization logic with our own simple format rather than JSON schema. But I agree with the value of using a standard schema, so I'm fine with either way.

@TomeHirata TomeHirata Oct 28, 2025

Btw, to clarify: we only need to store the feedback_value_type, not the entire response_format sent to the LLM, which is constructed dynamically (I renamed the serialize method to make this clearer). The feedback value type itself is not a Pydantic model. To use JSON schema, we would need to store the entire response_format, which contains some duplicated information.

Collaborator

Yeah, no strong preference here. Just thought using Pydantic serialization would reduce complexity and spec ambiguity, even if we need to implement deserialization ourselves. Not a blocker.
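The "own simple format" idea discussed above could look roughly like this (a hypothetical sketch; serialize_type/deserialize_type and the spec layout are invented for illustration and are not MLflow's actual format):

```python
# Hypothetical sketch: store just the feedback value type in a small custom
# spec, since a JSON schema has no native inverse back to a Python type.
from typing import Literal, get_args, get_origin

_PRIMITIVES = {"str": str, "int": int, "float": float, "bool": bool}

def serialize_type(tp) -> dict:
    """Serialize a primitive or Literal feedback value type to a small spec."""
    if get_origin(tp) is Literal:
        return {"kind": "literal", "values": list(get_args(tp))}
    return {"kind": "primitive", "name": tp.__name__}

def deserialize_type(spec: dict):
    """Rebuild the Python type from the stored spec."""
    if spec["kind"] == "literal":
        return Literal[tuple(spec["values"])]
    return _PRIMITIVES[spec["name"]]

spec = serialize_type(Literal["yes", "no"])
print(spec)  # {'kind': 'literal', 'values': ['yes', 'no']}
print(deserialize_type(spec) == Literal["yes", "no"])  # True
```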

instructions: str,
model: str | None = None,
description: str | None = None,
feedback_value_type: Any = str,
@B-Step62 B-Step62 Oct 28, 2025

Do you think it would be too aggressive to default to boolean or make this field mandatory? The current default (free-form text) is usable, but users will get inconsistent results and broken aggregation. For example, when I first tried it, I got three different words representing "false". This happens with high-end LLMs as well, because there is no mechanism to enforce consistency by default.

It is not great that the default experience gives the impression that make_judge is broken. I would rather constrain it, or add an extra step, so that users get meaningful results from the default journey.

Collaborator Author

str is the default for backward compatibility. It's not a good experience if users' existing judges change their response type when upgrading the MLflow version. Given the low adoption of make_judge so far, I'm fine with making boolean or Literal["yes", "no"] our default. What do you think? cc: @dbczumar @alkispoly-db

@dbczumar dbczumar Oct 28, 2025

We're definitely going to break existing judges if we do this; let's keep a string default but update documentation to explain how to use the structured outputs and use it in all of our examples.

It is not great that the default experience gives impression of "make_judge is broken"

I haven't experienced this issue if I tell the judge to "return one of the following values exactly: ..."

@B-Step62 B-Step62 Oct 28, 2025

How many "existing judge" users do we have? It is much better to break earlier than later if we believe it makes the product better. The API is marked experimental too.

I haven't experienced this issue if I tell the judge to "return one of the following values exactly: ..."

Do competitors ask users to do something similar? Also, structured output is a theoretically guaranteed way to enforce output shape, so it is more stable than prompting.

@B-Step62 B-Step62 Oct 28, 2025

As a data point, OSS usage is fewer than 5 sessions per day on average. Isn't it a good trade-off to make the default experience better for the hundreds of future users who will be using the API?

Collaborator Author

@dbczumar can you take a look at Yuki's comment above?

Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
…rata/mlflow into feat/make-judge/structured-output
Comment on lines +41 to +61
# Check for dict[str, PbValueType]
if origin is dict:
    args = get_args(feedback_value_type)
    if len(args) == 2:
        key_type, value_type = args
        # Key must be str
        if key_type != str:
            from mlflow.exceptions import MlflowException

            raise MlflowException.invalid_parameter_value(
                f"dict key type must be str, got {key_type}"
            )
        # Value must be a PbValueType
        if value_type not in (str, int, float, bool):
            from mlflow.exceptions import MlflowException

            raise MlflowException.invalid_parameter_value(
                "The `feedback_value_type` argument does not support a dict type "
                f"with non-primitive values, but got {value_type.__name__}"
            )
    return
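A standalone, hypothetical version of the dict check above, using plain ValueError instead of MlflowException so it runs outside MLflow:

```python
# Hypothetical standalone version of the dict[str, PbValueType] check above.
from typing import get_args, get_origin

def check_dict_type(tp) -> None:
    """Reject dict types whose keys are not str or whose values are not primitives."""
    if get_origin(tp) is dict:
        key_type, value_type = get_args(tp)
        if key_type is not str:
            raise ValueError(f"dict key type must be str, got {key_type}")
        if value_type not in (str, int, float, bool):
            raise ValueError(f"unsupported dict value type: {value_type.__name__}")

check_dict_type(dict[str, int])  # passes silently
try:
    check_dict_type(dict[int, str])
except ValueError as err:
    print(err)  # dict key type must be str, got <class 'int'>
```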
@dbczumar dbczumar Oct 30, 2025

Can pydantic help with this validation, or is it too difficult to express in pydantic because we have to support v1 and v2?

@TomeHirata TomeHirata Oct 31, 2025

Afaik, Pydantic helps validate an object against a type, but this validation validates a type against a set of allowed types. So I'm not sure we can simplify this logic using Pydantic.

Comment on lines +405 to +412
response_format = pydantic.create_model(
    "ResponseFormat",
    result=(
        self._feedback_value_type or str,
        pydantic.Field(description=self.description or "The result of the evaluation"),
    ),
    rationale=(str, pydantic.Field(description="The rationale for the evaluation")),
)
Collaborator

@TomeHirata Anything we can do to make the is_trace_based == True case work with Databricks gpt-oss, which doesn't support structured outputs and tool calling together?

Collaborator

It seems like we could accept the response format as an argument to the _build_system_message method and use it to adjust the evaluation_rating_fields?

    def _build_system_message(self, is_trace_based: bool) -> str:
        """Build the system message based on whether this is trace-based or field-based."""
        output_fields = self.get_output_fields()

        if is_trace_based:
            evaluation_rating_fields = "\n".join(
                [f"- {field.name}: {field.description}" for field in output_fields]
            )
            return INSTRUCTIONS_JUDGE_TRACE_PROMPT_TEMPLATE.format(
                evaluation_rating_fields=evaluation_rating_fields,
                instructions=self._instructions,
            )
        else:
            base_prompt = format_prompt(
                INSTRUCTIONS_JUDGE_SYSTEM_PROMPT, instructions=self._instructions
            )
            return add_output_format_instructions(base_prompt, output_fields=output_fields)

@TomeHirata TomeHirata Oct 31, 2025

Agreed on including the type information in the instructions when is_trace_based == True. Added the type information to the evaluation rating field.

@dbczumar dbczumar left a comment

Thanks @TomeHirata ! Just a few small comments (pydantic one isn't blocking, but we should do it if it reduces code complexity)

Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
@TomeHirata TomeHirata force-pushed the feat/make-judge/structured-output branch from 505504d to 48a618f Compare October 31, 2025 00:12
Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
@TomeHirata TomeHirata requested a review from dbczumar October 31, 2025 07:15
Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
@dbczumar dbczumar left a comment

LGTM! Thanks @TomeHirata ! Can you do some manual testing with GPT OSS (and ideally at least one other model) on Databricks before merge?


TomeHirata commented Nov 3, 2025

Yes, I've verified that the following code works:

from mlflow.genai.judges import make_judge
from typing import Literal
import mlflow
import time

performance_judge = make_judge(
    name="performance_analyzer",
    instructions=(
        "Analyze the {{ trace }} for performance issues.\n\n"
        "Check for:\n"
        "- Operations taking longer than 2 seconds\n"
        "- Redundant API calls or database queries\n"
        "- Inefficient data processing patterns\n"
        "- Proper use of caching mechanisms\n\n"
        "Rate as: 'optimal', 'acceptable', or 'needs_improvement'"
    ),
    model="databricks:/databricks-gpt-oss-20b",
    feedback_value_type=Literal['optimal', 'acceptable', 'needs_improvement']
)


@mlflow.trace
def slow_data_processor(query: str):
    """Example application with performance bottlenecks."""
    with mlflow.start_span("fetch_data") as span:
        time.sleep(2.5)
        span.set_inputs({"query": query})
        span.set_outputs({"data": ["item1", "item2", "item3"]})

    with mlflow.start_span("process_data") as span:
        for i in range(3):
            with mlflow.start_span(f"redundant_api_call_{i}"):
                time.sleep(0.5)
        span.set_outputs({"processed": "results"})

    return "Processing complete"


result = slow_data_processor("SELECT * FROM users")
trace_id = mlflow.get_last_active_trace_id()
trace = mlflow.get_trace(trace_id)

feedback = performance_judge(trace=trace)

print(f"Performance Rating: {feedback.value}")
print(f"Analysis: {feedback.rationale}")

from mlflow.genai import make_judge
from typing import Literal

judge = make_judge(
    name="conciseness",
    instructions="the response {{outputs}} is concise enough",
    result_type=Literal['yes', 'no'],
    model="openai:/gpt-5-mini",
)
result = judge(outputs="The capital of France is Paris")
result

Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
@TomeHirata TomeHirata disabled auto-merge November 3, 2025 14:13
@TomeHirata TomeHirata enabled auto-merge November 3, 2025 14:16
@TomeHirata TomeHirata added this pull request to the merge queue Nov 3, 2025
Merged via the queue into mlflow:master with commit 0ead451 Nov 3, 2025
48 checks passed
@TomeHirata TomeHirata deleted the feat/make-judge/structured-output branch November 3, 2025 14:52
@B-Step62 B-Step62 added the v3.6.0 label Nov 7, 2025
B-Step62 pushed a commit to B-Step62/mlflow that referenced this pull request Nov 7, 2025
Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
Signed-off-by: TomuHirata <tomu.hirata@gmail.com>
@github-actions github-actions bot added v3.6.1 and removed v3.6.0 labels Nov 8, 2025
B-Step62 pushed a commit to B-Step62/mlflow that referenced this pull request Nov 11, 2025
Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
Signed-off-by: TomuHirata <tomu.hirata@gmail.com>
B-Step62 pushed a commit that referenced this pull request Nov 11, 2025
Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
Signed-off-by: TomuHirata <tomu.hirata@gmail.com>
@B-Step62 B-Step62 added v3.6.0 and removed v3.6.1 labels Nov 11, 2025

Labels

area/evaluation MLflow Evaluation rn/feature Mention under Features in Changelogs. v3.5.2 v3.6.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FR] Add typing for judge scores in templated make_judge items

4 participants