Support structured outputs in make_judge (#18529)
Conversation
- Added `result_type` parameter to `make_judge` for specifying the type of the Feedback object's value.
- Implemented `_validate_result_type` function to validate supported types for serialization.
- Updated `InstructionsJudge` to handle the new `result_type` and serialize it correctly.
- Modified `invoke_judge_model` to accept a `response_format` parameter for structured output.
- Added serialization and deserialization methods for response formats in the Scorer class.
- Included tests for new functionality, ensuring proper handling of various result types and response formats.

Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
Force-pushed 2a89cfb to 80e52f5
Documentation preview for 60e55f3 is available at: Changed Pages (9)
Pull Request Overview
This PR adds support for structured outputs in the make_judge API by introducing a result_type parameter that allows users to specify the type of the judge's feedback value (e.g., int, bool, float, Literal).
Key changes:
- Added `result_type` parameter to `make_judge` and `InstructionsJudge` for specifying feedback value types
- Implemented serialization/deserialization methods for `result_type` to support judge persistence
- Updated `invoke_judge_model` and related functions to pass `response_format` for structured LLM outputs
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| mlflow/genai/judges/make_judge.py | Added validation for result_type parameter and passed it to InstructionsJudge |
| mlflow/genai/judges/instructions_judge/__init__.py | Implemented result_type support with serialization/deserialization and dynamic Pydantic model creation |
| mlflow/genai/scorers/base.py | Added deserialization of result_type in Scorer.model_validate |
| mlflow/genai/judges/utils.py | Added response_format parameter to judge invocation functions |
| tests/genai/judges/test_make_judge.py | Added comprehensive tests for result_type functionality |
| tests/genai/judges/test_judge_utils.py | Added tests for structured output with Databricks and LiteLLM models |
| docs/docs/genai/eval-monitor/scorers/llm-judge/make-judge.mdx | Added documentation section for output format specification |
```python
    instructions: str,
    model: str | None = None,
    description: str | None = None,
    result_type: type | None = None,
```
Specifically, this is FeedbackValueType or Literal[<list of values in PbValueType>], right? Can we change the type hint to FeedbackValueType | Literal[<list of values in PbValueType>]?
For variable naming, I'd like us to be a bit more explicit too. feedback_value_type?
The argument value here is a type itself, so the more accurate type hint is `type[FeedbackValueType]`. Yes, we can support a list of PbValueType and a dict of PbValueType to be consistent with FeedbackValueType. Do we want to support Literal in this case? I think Literal is useful for categorical responses (e.g., yes, no), but it violates the `type[FeedbackValueType]` hint, and there's nothing like `type[Literal]` in Python iirc; the result would be an `Any` type hint, which raises the type-accuracy concern.
Yeah, sounds good! We should also support `Literal`.
@dbczumar Supported a list and a dict of PbValueType; can you take another look?
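The typing subtlety discussed in this thread can be checked directly with the standard library. A standalone sketch (names are illustrative) showing why `Literal[...]` cannot satisfy a `type[...]` hint:

```python
from typing import Literal, get_args, get_origin

# A plain class like `int` is a type, so it satisfies a `type[...]` hint.
assert isinstance(int, type)

# A Literal is a typing special form, not a class, so `type[Literal[...]]`
# cannot be expressed; the annotation must be inspected instead.
yes_no = Literal["yes", "no"]
assert not isinstance(yes_no, type)
assert get_origin(yes_no) is Literal
assert get_args(yes_no) == ("yes", "no")
```

This is why accepting `Literal` widens the parameter's hint toward `Any`, as noted above.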
```python
# Build request payload
payload = {
    "messages": [
        {
            "role": "user",
            "content": prompt,
        }
    ],
}

# Add response_schema if provided
if response_format is not None:
    payload["response_schema"] = response_format.model_json_schema()
```
Let's test this on Databricks to make sure it works with GPT / GPT OSS, Anthropic, and Llama. We should introduce some retry logic for when the response schema is not supported.
Yes, I tested FMAPI models and structured output worked well!
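For reference, the payload-building path from the diff can be exercised standalone. The `ResponseFormat` model below is a hypothetical stand-in for the one the judge constructs dynamically:

```python
from pydantic import BaseModel, Field

# Hypothetical response format; the real one is created dynamically by the judge.
class ResponseFormat(BaseModel):
    result: bool = Field(description="The result of the evaluation")
    rationale: str = Field(description="The rationale for the evaluation")

payload = {"messages": [{"role": "user", "content": "Is the answer correct?"}]}

# Mirror the diff: attach the JSON schema when a response format is provided.
response_format = ResponseFormat
if response_format is not None:
    payload["response_schema"] = response_format.model_json_schema()

schema = payload["response_schema"]
assert schema["properties"]["result"]["type"] == "boolean"
assert set(schema["required"]) == {"result", "rationale"}
```

The schema carries the field descriptions too, which is what lets the model see what each output field means.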
Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
```python
    )

@staticmethod
def _serialize_response_format(response_format: type) -> dict[str, Any]:
```
Could you elaborate on what is missing if we simply serialize the ResponseFormat pydantic model using `.model_json_schema()`? We'd still need to implement deserialization, but it would let us avoid maintaining our own format.
There's no native way to convert a JSON schema back into a Pydantic base class. We could simplify the stored artifact and the deserialization logic with our own simple format rather than JSON schema. But I agree with the value of using a standard schema, so I'm fine with either way.
Btw, to clarify, we just need to store the feedback_value_type rather than the entire response_format for the LLM, which is dynamically constructed (renamed the serialize method to clarify). The feedback value type itself is not a Pydantic base model. To use JSON schema, we would need to store the entire response_format, which contains some duplicated information.
Yeah, no strong preference here. I just thought using Pydantic serialization would reduce complexity and spec ambiguity, even if we need to implement deserialization ourselves. Not a blocker.
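A minimal sketch of the "store only the feedback value type" option discussed above, assuming primitive types only. The helper names and the `{"feedback_value_type": ...}` format here are hypothetical, not MLflow's actual artifact format:

```python
# Hypothetical helpers: map primitive feedback value types to names and back.
_PRIMITIVE_TYPES = {"str": str, "int": int, "float": float, "bool": bool}

def serialize_feedback_value_type(tp: type) -> dict:
    if tp.__name__ not in _PRIMITIVE_TYPES:
        raise ValueError(f"Unsupported feedback value type: {tp!r}")
    return {"feedback_value_type": tp.__name__}

def deserialize_feedback_value_type(data: dict) -> type:
    return _PRIMITIVE_TYPES[data["feedback_value_type"]]

# Round-trip works without reconstructing a Pydantic model from JSON schema.
assert deserialize_feedback_value_type(serialize_feedback_value_type(bool)) is bool
```

The trade-off discussed above is exactly this: a tiny bespoke format with trivial deserialization, versus a standard JSON schema that cannot be mapped back to a Python type natively.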
mlflow/genai/judges/make_judge.py (Outdated)

```python
    instructions: str,
    model: str | None = None,
    description: str | None = None,
    feedback_value_type: Any = str,
```
Do you think it would be too aggressive to default to boolean or make this field mandatory? The current default (free-form text) is usable, but users will get inconsistent results and broken aggregation. For example, when I first tried it, I got three different words representing "false". This happens with high-end LLMs as well, because there is no mechanism to enforce consistency by default.
It is not great that the default experience gives the impression that "make_judge is broken". I would rather constrain it, or add an extra step, so that users can get meaningful results from the default journey.
`str` is for backward compatibility. It's not a good experience if users' existing judges change their response type when upgrading MLflow. Given the low adoption of make_judge so far, I'm fine with making boolean or `Literal["yes", "no"]` our default. What do you think? cc: @dbczumar @alkispoly-db
We're definitely going to break existing judges if we do this; let's keep a string default but update documentation to explain how to use the structured outputs and use it in all of our examples.
> It is not great that the default experience gives the impression of "make_judge is broken"
I haven't experienced this issue if I tell the judge to "return one of the following values exactly: ..."
How many "existing judge" users do we have? It is much better to break earlier than later if we believe it makes the product better. The API is marked experimental, too.
> I haven't experienced this issue if I tell the judge to "return one of the following values exactly: ..."
Do competitors ask users to do something similar? Also, structured outputs are a theoretically guaranteed way to enforce output values, so they are more stable than prompting.
As a data point, OSS usage averages fewer than 5 sessions per day. Isn't it a good trade-off to make the default experience better for the hundreds of future users who will be using this API?
@dbczumar can you take a look at Yuki's comment above?
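To illustrate the structured-outputs-versus-prompting point above: with a `Literal`-typed field, the allowed values are encoded in the JSON schema the model must satisfy, rather than requested in the prompt text. A sketch (the `Verdict` model is illustrative):

```python
from typing import Literal
from pydantic import BaseModel

class Verdict(BaseModel):
    result: Literal["yes", "no"]

# The constraint lives in the schema itself as an enum, so a compliant
# structured-output backend cannot return a third spelling of "no".
schema = Verdict.model_json_schema()
assert schema["properties"]["result"]["enum"] == ["yes", "no"]
```

This is why schema enforcement is more stable than asking in the prompt to "return one of the following values exactly".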
…rata/mlflow into feat/make-judge/structured-output
```python
# Check for dict[str, PbValueType]
if origin is dict:
    args = get_args(feedback_value_type)
    if len(args) == 2:
        key_type, value_type = args
        # Key must be str
        if key_type != str:
            from mlflow.exceptions import MlflowException

            raise MlflowException.invalid_parameter_value(
                f"dict key type must be str, got {key_type}"
            )
        # Value must be a PbValueType
        if value_type not in (str, int, float, bool):
            from mlflow.exceptions import MlflowException

            raise MlflowException.invalid_parameter_value(
                "The `feedback_value_type` argument does not support a dict type "
                f"with non-primitive values, but got {value_type.__name__}"
            )
    return
```
Can pydantic help with this validation, or is it too difficult to express in pydantic because we have to support v1 and v2?
Afaik, Pydantic helps validate an object against a type, but this code validates a type itself, so I'm not sure we can simplify this logic using Pydantic.
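The validation above works on annotations, not values, via `typing.get_origin`/`get_args`. A standalone check of the mechanics:

```python
from typing import get_args, get_origin

# get_origin recovers the container class, get_args its type parameters.
assert get_origin(dict[str, int]) is dict
assert get_args(dict[str, int]) == (str, int)

# A dict keyed by something other than str would be rejected by the check above.
key_type, value_type = get_args(dict[int, str])
assert key_type is not str

# For a bare (unparameterized) type, both return "empty" results.
assert get_origin(int) is None
assert get_args(int) == ()
```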
```python
response_format = pydantic.create_model(
    "ResponseFormat",
    result=(
        self._feedback_value_type or str,
        pydantic.Field(description=self.description or "The result of the evaluation"),
    ),
    rationale=(str, pydantic.Field(description="The rationale for the evaluation")),
)
```
@TomeHirata Anything we can do to make the is_trace_based == True case work with Databricks gpt-oss, which doesn't support structured outputs and tool calling together?
It seems like we could accept the response format as an argument to the _build_system_message method and use it to adjust the evaluation_rating_fields?
```python
def _build_system_message(self, is_trace_based: bool) -> str:
    """Build the system message based on whether this is trace-based or field-based."""
    output_fields = self.get_output_fields()
    if is_trace_based:
        evaluation_rating_fields = "\n".join(
            [f"- {field.name}: {field.description}" for field in output_fields]
        )
        return INSTRUCTIONS_JUDGE_TRACE_PROMPT_TEMPLATE.format(
            evaluation_rating_fields=evaluation_rating_fields,
            instructions=self._instructions,
        )
    else:
        base_prompt = format_prompt(
            INSTRUCTIONS_JUDGE_SYSTEM_PROMPT, instructions=self._instructions
        )
        return add_output_format_instructions(base_prompt, output_fields=output_fields)
```
Agreed, we should include the type information in the instructions when `is_trace_based == True`. Added the type information to the evaluation rating fields.
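The dynamic model construction from the diff can be reproduced standalone. Here `feedback_value_type = bool` stands in for `self._feedback_value_type`, and the field descriptions match those in the diff:

```python
import pydantic

feedback_value_type = bool  # stand-in for self._feedback_value_type

ResponseFormat = pydantic.create_model(
    "ResponseFormat",
    result=(
        feedback_value_type,
        pydantic.Field(description="The result of the evaluation"),
    ),
    rationale=(str, pydantic.Field(description="The rationale for the evaluation")),
)

# The generated model validates the judge's structured response.
parsed = ResponseFormat(result=True, rationale="Matches the reference answer.")
assert parsed.result is True
assert set(ResponseFormat.model_fields) == {"result", "rationale"}
```

Because the model is rebuilt from `feedback_value_type` on each invocation, only that type needs to be persisted, which is the serialization point discussed earlier in the thread.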
dbczumar left a comment:
Thanks @TomeHirata ! Just a few small comments (pydantic one isn't blocking, but we should do it if it reduces code complexity)
Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
Force-pushed 505504d to 48a618f
Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
dbczumar left a comment:
LGTM! Thanks @TomeHirata ! Can you do some manual testing with GPT OSS (and ideally at least one other model) on Databricks before merge?
Yes, I've verified that the following code works.
Signed-off-by: Tomu Hirata <tomu.hirata@gmail.com>
Related Issues/PRs
Resolve #18262
What changes are proposed in this pull request?
Support structured output in the `make_judge` API:

- Added `result_type` parameter to `make_judge` for specifying the type of the Feedback object's value.
- Updated `InstructionsJudge` to handle the new `result_type` and serialize it correctly. Due to serialization requirements, it currently supports primitives.
- Modified `invoke_judge_model` to accept a `response_format` parameter for structured output.

How is this PR tested?
Does this PR require documentation update?
Release Notes
Is this a user-facing change?
What component(s), interfaces, languages, and integrations does this PR affect?
Components
- area/tracking: Tracking Service, tracking client APIs, autologging
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
- area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
- area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
- area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
- area/projects: MLproject format, project running backends
- area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- area/build: Build and test infrastructure for MLflow
- area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:
- rn/none: No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
- rn/breaking-change: The PR will be mentioned in the "Breaking Changes" section
- rn/feature: A new user-facing feature worth mentioning in the release notes
- rn/bug-fix: A user-facing bug fix worth mentioning in the release notes
- rn/documentation: A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?
- Yes: should be selected for bug fixes, documentation updates, and other small changes.
- No: should be selected for new features and larger changes.

If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
Bug fixes, doc updates and new features usually go into minor releases.
Bug fixes and doc updates usually go into patch releases.