
[Agentic Judges] Support expectation comparison for ToolCallCorrectness#19413

Closed
xsh310 wants to merge 1 commit into mlflow:master from
xsh310:stack/agentic-judges-support-expectation-comparison-for-tool-call-correctness

Conversation

xsh310 (Collaborator) commented Dec 15, 2025

🥞 Stacked PR

Use this link to review incremental changes.


What changes are proposed in this pull request?

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests

New Unit Tests:

Added new unit tests under builtin_scorers.py

Manual Test

Tested a few scenarios using an example trace; they return the expected values.

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/tracking: Tracking Service, tracking client APIs, autologging
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
  • area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
  • area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
  • area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
  • area/projects: MLproject format, project running backends
  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
  • Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
    Bug fixes, doc updates and new features usually go into minor releases.
  • Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
    Bug fixes and doc updates usually go into patch releases.
  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

@xsh310 xsh310 force-pushed the stack/agentic-judges-support-expectation-comparison-for-tool-call-correctness branch 2 times, most recently from a7bf46d to e8cb9e3 Compare December 15, 2025 22:52
@xsh310 xsh310 requested review from AveshCSingh, B-Step62, alkispoly-db, dbczumar and smoorjani and removed request for smoorjani December 15, 2025 22:55
@xsh310 xsh310 marked this pull request as ready for review December 15, 2025 22:56
@github-actions github-actions bot added area/evaluation MLflow Evaluation rn/none List under Small Changes in Changelogs. labels Dec 15, 2025
@xsh310 xsh310 force-pushed the stack/agentic-judges-support-expectation-comparison-for-tool-call-correctness branch from e8cb9e3 to 300bad2 Compare December 15, 2025 23:42
github-actions bot (Contributor) commented Dec 15, 2025

Documentation preview for b70bf27 is available at:

More info
  • Ignore this comment if this PR does not change the documentation.
  • The preview is updated when a new commit is pushed to this PR.
  • This comment was created by this workflow run.
  • The documentation was built by this workflow run.

@xsh310 xsh310 force-pushed the stack/agentic-judges-support-expectation-comparison-for-tool-call-correctness branch 6 times, most recently from 23844e6 to 857f649 Compare December 16, 2025 06:34
"""
if check_order:
    if len(actual_tools_called) < len(expected_tool_calls):
        return Feedback(
Collaborator:

the feedback construction here seems like a lot of redundant code, can we modularize it?


**Evaluation Modes:**

1. **With Expectations (Deterministic):** If you provide expected tool calls via the
Collaborator:

does this require name and args? or can it just take name?

Collaborator (Author):

No, args is not required and you can just pass the name

Collaborator:

Can we clarify the behavior when you don't specify args in this docstring?

),
)

# Check input parameters (subset check)
Collaborator:

for the tool call correctness judge, do we expect users to provide exact values even in the arguments? curious if you've tried some examples to help determine whether using an LLM is necessary. cc @AveshCSingh @alkispoly-db

Collaborator (Author):

My current implementation performs exact checks only on the arguments explicitly specified by the user in the expectation. If an argument is not included in the expectation, it is not validated against the actual tool call. If the user doesn't provide any argument, we only check that the tool name matches.

Collaborator:

I guess a few things here. My question above is asking whether exact checks are actually effective? I think this may work with simple args, but if there's a string, then I wonder how useful this is (exact equality vs. semantic equality)

"If an argument is not included in the expectation, it is not validated against the actual tool call."
I think we should consistently evaluate both the tools called and the arguments passed. Even if an argument is not passed in the expectation, we should still validate it.

Collaborator (Author):

I don’t think an exact match will work for arguments that require semantic equivalence. That’s why we allow users to provide only a subset of arguments, so they don’t have to explicitly specify these semantic fields.

cc @AveshCSingh , we discussed this last week, and I recall you suggesting not to rely on an LLM when there are expectations. Do you think it would make sense to fall back to an LLM-based comparison if the programmatic check fails?

Collaborator:

I don't think fallback makes sense, since we should clearly delineate between fuzzy matching and exact match.

There's use cases for both exact match and fuzzy matching. A tool call like search_database(table="users", id=123) requires exact match, whereas search_web("databricks series L") needs fuzzy matching. We may eventually need multiple judges. I would start with the exact match though, since you can get at fuzzy match with agent-as-a-judge. I also suspect that by providing just tool call names to the tool call correctness judge and not checking arguments is sufficient for many of these cases.

found_match = True
break

if not found_match:
Collaborator:

can we just do a set equality here instead? this way our error message can point to all missing/extra tool calls, instead of one at a time

Collaborator (Author):

Are you mostly concerned about having too many Feedback constructions here? I can extract that to a helper if that's the case.

I don't think set equality helps here since it will force us to exact match everything?

Collaborator:

isn't that what we already do? this logic seems pretty complex for what I imagine is:

expected_tool_calls = {ToolCall1, ToolCall2, ...}
actual_tool_calls = {ToolCall1, ToolCall3, ToolCall2, ...}
expected_tool_calls == actual_tool_calls

^ for the unordered case. For the ordered case, it's just a matter of using a list

Collaborator (Author):

@smoorjani

If my understanding is correct, the code that you provided is different from my code. Consider the following example: your code will say these don't match, while my code will say they do match.

expected_tool_calls: 
{
    FunctionCall(name='tool_1'), 
    FunctionCall(name='tool_2')
}

actual_tool_calls: 
{
    FunctionCall(name='tool_1', arguments={'param_1': 'value_1'}),
    FunctionCall(name='tool_2', arguments={'param_2': 'value_2'})
}

So as I described in my other comment, it only checks the fields explicitly specified by the user in the expectation.

Let me know if this makes sense to you
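The subset-matching semantics described in this comment could be sketched roughly as follows. This is illustrative, not the PR's implementation: `FunctionCall` is a minimal stand-in for the actual dataclass, and `matches_expectation` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FunctionCall:
    name: str
    arguments: Optional[dict] = None

def matches_expectation(expected: FunctionCall, actual: FunctionCall) -> bool:
    # Only fields explicitly set on the expectation are checked: the tool
    # name must match, and every expected argument must appear in the
    # actual call with an equal value. Arguments omitted from the
    # expectation are ignored.
    if expected.name != actual.name:
        return False
    if expected.arguments is None:
        return True  # name-only expectation: the name match is enough
    actual_args = actual.arguments or {}
    return all(
        key in actual_args and actual_args[key] == value
        for key, value in expected.arguments.items()
    )

# The example from the comment above: name-only expectations match actual
# calls that carry extra arguments.
expected = [FunctionCall("tool_1"), FunctionCall("tool_2")]
actual = [
    FunctionCall("tool_1", {"param_1": "value_1"}),
    FunctionCall("tool_2", {"param_2": "value_2"}),
]
print(all(any(matches_expectation(e, a) for a in actual) for e in expected))  # True
```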

),
)

# Check each expected tool call in order
Collaborator:

same here about using list equality instead - that way we can show all mismatches instead of one at a time
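The reviewer's suggestion of reporting every mismatch at once could be sketched like this, assuming hashable, frozen tool-call objects; `ToolCall` and `diff_tool_calls` are hypothetical names, not code from the PR.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen=True makes instances hashable, so sets work
class ToolCall:
    name: str

def diff_tool_calls(expected: set, actual: set) -> tuple:
    # Return (missing, extra) so a single error message can list every
    # missing and unexpected tool call, rather than failing one at a time.
    return expected - actual, actual - expected

missing, extra = diff_tool_calls(
    {ToolCall("tool_1"), ToolCall("tool_2")},
    {ToolCall("tool_1"), ToolCall("tool_3")},
)
assert missing == {ToolCall("tool_2")}
assert extra == {ToolCall("tool_3")}
```

Note that this only works for exact matches; the subset-matching semantics discussed above would need a custom `__eq__`/`__hash__`, which is where set equality breaks down.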


# If expectations are provided, validate tool calls against expectations
if fields.expectations:
    if "expected_tool_calls" not in fields.expectations:
Collaborator:

can we not error here? just fall back to the ground-truth free case

Collaborator (Author):

@smoorjani I believe throwing an exception is the better approach. Given that the user has already provided an expectation, we should explicitly notify them that the provided format is incorrect.

Collaborator:

The expectation can be provided for another metric. For instance, this would not work well if we used a correctness judge and passed expectations = {"expected_response": "..."} and also wanted to use a tool call correctness judge, but without ground truth.
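One way to avoid erroring here, per this comment, would be to treat a missing key as "no ground truth" and fall back. `resolve_expected_tool_calls` is a hypothetical helper sketching the idea, not the PR's code.

```python
def resolve_expected_tool_calls(expectations):
    # Expectations may be supplied for another scorer (for example,
    # {"expected_response": ...} for a correctness judge). Instead of
    # raising when "expected_tool_calls" is absent, return None so the
    # caller can fall back to the ground-truth-free LLM evaluation.
    if not expectations:
        return None
    return expectations.get("expected_tool_calls")

print(resolve_expected_tool_calls({"expected_response": "..."}))  # None
```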

@xsh310 xsh310 force-pushed the stack/agentic-judges-support-expectation-comparison-for-tool-call-correctness branch 3 times, most recently from d8be033 to 770f1b7 Compare December 17, 2025 00:12
@xsh310 xsh310 force-pushed the stack/agentic-judges-support-expectation-comparison-for-tool-call-correctness branch 2 times, most recently from f6290f8 to a112293 Compare December 17, 2025 06:38
@xsh310 xsh310 force-pushed the stack/agentic-judges-support-expectation-comparison-for-tool-call-correctness branch 4 times, most recently from eb9bee6 to a0a45bc Compare December 18, 2025 06:28
Signed-off-by: Xiang Shen <xshen.shc@gmail.com>
@xsh310 xsh310 force-pushed the stack/agentic-judges-support-expectation-comparison-for-tool-call-correctness branch from a0a45bc to b70bf27 Compare December 18, 2025 17:17
xsh310 (Collaborator, Author) commented Dec 18, 2025

Synced with @smoorjani and @AveshCSingh offline about expectation behavior. Updated the PR logic as follows for now to unblock before we fully finalize expectation behaviors:

  • If no expectation is provided, or an expectation is provided without the required key,
    • Use the LLM judge to evaluate correctness.
  • If any tool call specifies arguments, all arguments must match exactly.
    • The actual tool call must have the same parameters with the same values (no extra or
      missing parameters allowed).
  • If all expected tool calls omit arguments (tool call name only),
    • The scorer validates that the tool names match, then defers to the LLM judge to
      evaluate correctness.
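The dispatch described in the bullets above might be sketched as follows; `pick_evaluation_mode` and the mode names are hypothetical, chosen only to make the three branches explicit.

```python
def pick_evaluation_mode(expected_tool_calls):
    # Returns "llm", "exact", or "names_then_llm" per the rules above.
    if not expected_tool_calls:
        # No expectation (or no usable key): the LLM judge evaluates correctness.
        return "llm"
    if any(call.get("arguments") for call in expected_tool_calls):
        # Any tool call specifies arguments: all arguments must match exactly,
        # with no extra or missing parameters allowed.
        return "exact"
    # All expectations are name-only: validate names, then defer to the LLM judge.
    return "names_then_llm"

assert pick_evaluation_mode(None) == "llm"
assert pick_evaluation_mode([{"name": "search"}]) == "names_then_llm"
assert pick_evaluation_mode([{"name": "q", "arguments": {"t": "users"}}]) == "exact"
```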

xsh310 (Collaborator, Author) commented Dec 19, 2025

Closing: the team aligned on having more discussion about the expectation behavior.

@xsh310 xsh310 closed this Dec 19, 2025

Labels

area/evaluation MLflow Evaluation rn/none List under Small Changes in Changelogs.


3 participants