Fix InstructionsJudge using scorer description as assessment value#19121

Merged
alkispoly-db merged 5 commits into mlflow:master from alkispoly-db:scorer-bug
Dec 1, 2025

Conversation

@alkispoly-db (Collaborator) commented Nov 30, 2025

🛠 DevTools 🛠

Install mlflow from this PR

```sh
# mlflow
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/19121/merge
# mlflow-skinny
pip install git+https://github.com/mlflow/mlflow.git@refs/pull/19121/merge#subdirectory=libs/skinny
```

For Databricks, use the following command:

```sh
%sh curl -LsSf https://raw.githubusercontent.com/mlflow/mlflow/HEAD/dev/install-skinny.sh | sh -s pull/19121/merge
```

Related Issues/PRs

N/A - Bug fix discovered during development

What changes are proposed in this pull request?

This PR fixes a bug where InstructionsJudge scorers with custom descriptions would cause the LLM to echo back the scorer's description as the assessment value instead of generating an actual evaluation rating.

Root Cause:
The scorer's description (e.g., "Evaluates if the answer is concise") was being used in:

  1. The JSON schema (response_format) sent to the LLM
  2. The system prompt instructions

This caused the LLM to interpret the description as what the "result" field should contain, leading it to echo the description instead of generating an assessment.

Solution:

  • Use generic field description (_RESULT_FIELD_DESCRIPTION = "The evaluation rating/result") instead of the scorer's description
  • Refactored _create_response_format_model() to use get_output_fields() as the single source of truth for field definitions
  • Added explanatory comments to prevent future regressions

Example Impact:

  • Before: FeedbackValue(value="Evaluates if the answer is concise")
  • After: FeedbackValue(value=4)

Files Changed:

  • mlflow/genai/judges/instructions_judge/__init__.py: Fixed field description logic in two methods
  • tests/genai/judges/test_make_judge.py: Added comprehensive test coverage

How is this PR tested?

  • Existing unit/integration tests (all pass)
  • New unit/integration tests
    • test_response_format_uses_generic_description_when_scorer_has_description
    • test_response_format_uses_generic_description_when_scorer_has_no_description
  • Manual tests (verified with Databricks integration - scorer now returns numeric ratings instead of description strings)

Does this PR require documentation update?

  • No. You can skip the rest of this section.

Release Notes

Is this a user-facing change?

  • Yes. Give a description of this change to be included in the release notes for MLflow users.

Description: Fixed a bug where LLM judge scorers (InstructionsJudge) with custom descriptions would return the description text as the assessment value instead of generating actual evaluation ratings. Scorers now correctly return numeric or categorical assessments as intended.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
  • area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality

How should the PR be classified in the release notes? Choose one:

  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes

Should this PR be included in the next patch release?

  • Yes (this PR will be cherry-picked and included in the next patch release)

Rationale: This is a bug fix that affects judge/scorer functionality. Users relying on custom-described judges are getting incorrect assessment values.

When InstructionsJudge scorers had custom descriptions, the LLM would
echo back the scorer's description as the assessment value instead of
generating an actual evaluation rating.

Root cause: The scorer's description was being used in both the JSON
schema (response_format) and system prompt instructions, causing the
LLM to interpret it as what the result field should contain.

Fix: Use generic field descriptions (_RESULT_FIELD_DESCRIPTION) instead
of the scorer's description. Refactored to use get_output_fields() as
the single source of truth for field definitions.

Example:
- Before: value="Evaluates if the answer is concise" (description string)
- After: value=4 (actual rating)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
@github-actions github-actions bot added v3.7.0 area/evaluation MLflow Evaluation area/tracing MLflow Tracing and its integrations rn/bug-fix Mention under Bug Fixes in Changelogs. labels Nov 30, 2025
alkispoly-db and others added 2 commits November 30, 2025 23:49
Simplify test function docstrings per project guidelines to avoid
redundant documentation that merely repeats what the test does.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
Remove explanatory comments for cleaner code per project style.
The implementation is self-documenting: get_output_fields() is
clearly the single source of truth.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
github-actions bot commented Dec 1, 2025

Documentation preview for 4d1a2da is available at:



fields = (
{"rationale": rationale_field, "result": result_field}
if self._generate_rationale_first
Collaborator

Shouldn't we keep the _generate_rationale_first logic?

Collaborator Author

My bad -- I will add back the logic for this.

Collaborator Author

Actually, this logic is now refactored into get_output_fields(), so it is correctly maintained.

@TomeHirata (Collaborator) commented Dec 1, 2025

I'm curious which model causes the echo. I tested the following code using GPT, and it correctly returns the eval result without echoing. Also, did it happen even when feedback_value_type is specified?

```python
from mlflow.genai import make_judge

judge = make_judge(
    name="conciseness",
    instructions="the response {{outputs}} is concise enough",
    description="Evaluates if the answer is concise",
    model="openai:/gpt-5-mini")
result = judge(outputs="The capital of France is Paris")
result.value
# -> "Yes"
```

# NOT the scorer's description
result_description = schema["properties"]["result"]["description"]
assert result_description == _RESULT_FIELD_DESCRIPTION, (
f"Response format should use generic field description, not scorer description.\n"
Collaborator

nit: probably we don't need assertion message

Collaborator Author

Agreed -- will remove.

output_fields = judge.get_output_fields()
result_field = next(f for f in output_fields if f.name == "result")
assert result_field.description == _RESULT_FIELD_DESCRIPTION, (
f"Output fields should use generic description in system prompt.\n"
Collaborator

nit: ditto

Collaborator Author

Done.

@TomeHirata (Collaborator) left a comment

Left some questions/comments, but I agree with using a hardcoded result field description to get consistent outputs.

assert output_fields_rationale_first[1].value_type == Literal["good", "bad"] # result


def test_response_format_uses_generic_description_when_scorer_has_description():
@TomeHirata (Collaborator) commented Dec 1, 2025

Can we combine the two tests? The logic is identical except for the description field.

Collaborator Author

Good idea -- done.

Combine test_response_format_uses_generic_description_when_scorer_has_description
and test_response_format_uses_generic_description_when_scorer_has_no_description
into a single parameterized test for better maintainability.

The parameterized approach tests both scenarios (with and without custom
description) using the same test logic, reducing code duplication.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
@alkispoly-db (Collaborator Author)

I'm curious which model causes the echo. I tested the following code using GPT, and it correctly returns the eval result without echoing. Also, did it happen even when feedback_value_type is specified?

```python
from mlflow.genai import make_judge

judge = make_judge(
    name="conciseness",
    instructions="the response {{outputs}} is concise enough",
    description="Evaluates if the answer is concise",
    model="openai:/gpt-5-mini")
result = judge(outputs="The capital of France is Paris")
result.value
# -> "Yes"
```

It happened with gpt-4o-mini, which is the current model used in Databricks.

@alkispoly-db alkispoly-db added this pull request to the merge queue Dec 1, 2025
Merged via the queue into mlflow:master with commit 3c22da7 Dec 1, 2025
46 checks passed
@alkispoly-db alkispoly-db deleted the scorer-bug branch December 1, 2025 02:40
BenWilson2 pushed a commit to BenWilson2/mlflow that referenced this pull request Dec 4, 2025
…lflow#19121)

Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
Co-authored-by: Claude <noreply@anthropic.com>
BenWilson2 pushed a commit that referenced this pull request Dec 4, 2025
…19121)

Signed-off-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
Co-authored-by: Claude <noreply@anthropic.com>