Fix InstructionsJudge using scorer description as assessment value #19121

alkispoly-db merged 5 commits into mlflow:master
Conversation
When InstructionsJudge scorers had custom descriptions, the LLM would echo back the scorer's description as the assessment value instead of generating an actual evaluation rating.

Root cause: the scorer's description was being used in both the JSON schema (`response_format`) and the system prompt instructions, causing the LLM to interpret it as what the result field should contain.

Fix: use generic field descriptions (`_RESULT_FIELD_DESCRIPTION`) instead of the scorer's description. Refactored to use `get_output_fields()` as the single source of truth for field definitions.

Example:
- Before: `value="Evaluates if the answer is concise"` (description string)
- After: `value=4` (actual rating)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
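The bug and fix can be sketched as follows. This is an illustrative stand-in, not MLflow's actual internals: the `build_response_format` helper and the exact schema nesting are assumptions, but the before/after contrast in the `result` field's description mirrors the fix.

```python
# Minimal sketch of the bug and fix. The schema shape is illustrative;
# only the "description" swap reflects the actual change.
_RESULT_FIELD_DESCRIPTION = "The evaluation rating/result"

def build_response_format(scorer_description: str, fixed: bool) -> dict:
    # Before the fix, the scorer's own description leaked into the
    # "result" field's schema, so the LLM tended to echo it back
    # verbatim as the value.
    desc = _RESULT_FIELD_DESCRIPTION if fixed else scorer_description
    return {
        "type": "json_schema",
        "json_schema": {
            "schema": {
                "type": "object",
                "properties": {"result": {"type": "string", "description": desc}},
                "required": ["result"],
            }
        },
    }

buggy = build_response_format("Evaluates if the answer is concise", fixed=False)
fixed = build_response_format("Evaluates if the answer is concise", fixed=True)
print(buggy["json_schema"]["schema"]["properties"]["result"]["description"])
# Evaluates if the answer is concise  <- what the LLM echoed
print(fixed["json_schema"]["schema"]["properties"]["result"]["description"])
# The evaluation rating/result
```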
Simplify test function docstrings per project guidelines to avoid redundant documentation that merely repeats what the test does.
Remove explanatory comments for cleaner code per project style. The implementation is self-documenting: `get_output_fields()` is clearly the single source of truth.
Documentation preview for 4d1a2da is available.
```python
fields = (
    {"rationale": rationale_field, "result": result_field}
    if self._generate_rationale_first
```
Shouldn't we keep the _generate_rationale_first logic?
My bad -- I will add back the logic for this.
Actually, this logic is now refactored into get_output_fields(), so it is correctly maintained.
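A minimal, self-contained sketch of how `get_output_fields()` can own the `_generate_rationale_first` ordering. The `OutputField` dataclass, `JudgeSketch` class, and the rationale field's description are assumptions for illustration, not MLflow's implementation:

```python
# Hypothetical sketch: get_output_fields() as the single source of
# truth for field order and descriptions.
from dataclasses import dataclass

_RESULT_FIELD_DESCRIPTION = "The evaluation rating/result"
_RATIONALE_FIELD_DESCRIPTION = "Step-by-step reasoning behind the rating"  # assumed

@dataclass
class OutputField:
    name: str
    description: str

class JudgeSketch:
    def __init__(self, generate_rationale_first: bool = True):
        self._generate_rationale_first = generate_rationale_first

    def get_output_fields(self) -> list:
        result = OutputField("result", _RESULT_FIELD_DESCRIPTION)
        rationale = OutputField("rationale", _RATIONALE_FIELD_DESCRIPTION)
        # Emitting rationale before result encourages the model to
        # reason before committing to a rating.
        if self._generate_rationale_first:
            return [rationale, result]
        return [result, rationale]

print([f.name for f in JudgeSketch().get_output_fields()])
# ['rationale', 'result']
```

Any caller that builds the `response_format` schema or the system prompt from this list automatically inherits the correct ordering.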
I'm curious which model causes the echo. I tested the following code using GPT, and it correctly returns the eval result without echoing. Also, did it happen even when

```python
from mlflow.genai import make_judge

judge = make_judge(
    name="conciseness",
    instructions="the response {{outputs}} is concise enough",
    description="Evaluates if the answer is concise",
    model="openai:/gpt-5-mini",
)
result = judge(outputs="The capital of France is Paris")
result.value
# -> "Yes"
```
```python
# NOT the scorer's description
result_description = schema["properties"]["result"]["description"]
assert result_description == _RESULT_FIELD_DESCRIPTION, (
    f"Response format should use generic field description, not scorer description.\n"
```
nit: probably we don't need assertion message
Agreed -- will remove.
```python
output_fields = judge.get_output_fields()
result_field = next(f for f in output_fields if f.name == "result")
assert result_field.description == _RESULT_FIELD_DESCRIPTION, (
    f"Output fields should use generic description in system prompt.\n"
```
TomeHirata left a comment:
Left some questions/comments. But agree to use hardcoded result field description to get constant outputs.
```python
assert output_fields_rationale_first[1].value_type == Literal["good", "bad"]  # result
```

```python
def test_response_format_uses_generic_description_when_scorer_has_description():
```
Can we combine the two tests? The logic is identical except for the description field.
Good idea -- done.
Combine test_response_format_uses_generic_description_when_scorer_has_description and test_response_format_uses_generic_description_when_scorer_has_no_description into a single parameterized test for better maintainability. The parameterized approach tests both scenarios (with and without custom description) using the same test logic, reducing code duplication.
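The combined test might look like the following self-contained sketch. `FakeJudge` stands in for the judge produced by the real `make_judge` (an assumption, so the block runs without MLflow); only the `pytest.mark.parametrize` structure reflects the commit:

```python
# Sketch of a single parameterized test covering both the
# "custom description" and "no description" cases.
from dataclasses import dataclass
import pytest

_RESULT_FIELD_DESCRIPTION = "The evaluation rating/result"

@dataclass
class OutputField:
    name: str
    description: str

class FakeJudge:
    """Stand-in for the real judge; not MLflow's implementation."""

    def __init__(self, description=None):
        self.description = description  # scorer's own description

    def get_output_fields(self):
        # The fix under test: always use the generic constant,
        # never self.description.
        return [OutputField("result", _RESULT_FIELD_DESCRIPTION)]

@pytest.mark.parametrize("description", [None, "Evaluates if the answer is concise"])
def test_response_format_uses_generic_description(description):
    judge = FakeJudge(description=description)
    result_field = next(f for f in judge.get_output_fields() if f.name == "result")
    assert result_field.description == _RESULT_FIELD_DESCRIPTION
```

Parameterizing over `description` keeps one copy of the assertion logic while still reporting the two scenarios as separate test cases.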
It happened with gpt-4o-mini, which is the current model used in Databricks.
…lflow#19121) Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com> Co-authored-by: Claude <noreply@anthropic.com>
🛠 DevTools 🛠
Install mlflow from this PR
For Databricks, use the following command:
Related Issues/PRs
N/A - Bug fix discovered during development
What changes are proposed in this pull request?
This PR fixes a bug where `InstructionsJudge` scorers with custom descriptions would cause the LLM to echo back the scorer's description as the assessment value instead of generating an actual evaluation rating.

Root Cause:
The scorer's description (e.g., "Evaluates if the answer is concise") was being used in:
- The JSON schema (`response_format`) sent to the LLM
- The system prompt instructions

This caused the LLM to interpret the description as what the "result" field should contain, leading it to echo the description instead of generating an assessment.
Solution:
- Use a generic field description (`_RESULT_FIELD_DESCRIPTION = "The evaluation rating/result"`) instead of the scorer's description
- Refactor `_create_response_format_model()` to use `get_output_fields()` as the single source of truth for field definitions

Example Impact:
- Before: `FeedbackValue(value="Evaluates if the answer is concise")` ❌
- After: `FeedbackValue(value=4)` ✅

Files Changed:
- `mlflow/genai/judges/instructions_judge/__init__.py`: Fixed field description logic in two methods
- `tests/genai/judges/test_make_judge.py`: Added comprehensive test coverage

How is this PR tested?
- `test_response_format_uses_generic_description_when_scorer_has_description`
- `test_response_format_uses_generic_description_when_scorer_has_no_description`

Does this PR require documentation update?
Release Notes
Is this a user-facing change?
Description: Fixed a bug where LLM judge scorers (InstructionsJudge) with custom descriptions would return the description text as the assessment value instead of generating actual evaluation ratings. Scorers now correctly return numeric or categorical assessments as intended.
What component(s), interfaces, languages, and integrations does this PR affect?
Components
- `area/evaluation`: MLflow model evaluation features, evaluation metrics, and evaluation workflows
- `area/tracing`: MLflow Tracing features, tracing APIs, and LLM tracing functionality
- `rn/bug-fix`: A user-facing bug fix worth mentioning in the release notes
Rationale: This is a bug fix that affects judge/scorer functionality. Users relying on custom-described judges are getting incorrect assessment values.