
[Agentic Judges] Support expectation comparison for ToolCallCorrectness#19413

Closed
xsh310 wants to merge 1 commit into mlflow:master from
xsh310:stack/agentic-judges-support-expectation-comparison-for-tool-call-correctness

Conversation

xsh310 (Collaborator) commented Dec 15, 2025

🥞 Stacked PR

Use this link to review incremental changes.


What changes are proposed in this pull request?

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests

New Unit Tests:

Added new unit tests under builtin_scorers.py

Manual Test

Tested a few scenarios using an example trace; they return the expected values.

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/tracking: Tracking Service, tracking client APIs, autologging
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
  • area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
  • area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
  • area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
  • area/projects: MLproject format, project running backends
  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
  • Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
    Bug fixes, doc updates and new features usually go into minor releases.
  • Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
    Bug fixes and doc updates usually go into patch releases.
  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

@xsh310 xsh310 force-pushed the stack/agentic-judges-support-expectation-comparison-for-tool-call-correctness branch 2 times, most recently from a7bf46d to e8cb9e3 Compare December 15, 2025 22:52
@xsh310 xsh310 requested review from AveshCSingh, B-Step62, alkispoly-db, dbczumar and smoorjani and removed request for smoorjani December 15, 2025 22:55
@xsh310 xsh310 marked this pull request as ready for review December 15, 2025 22:56
@github-actions github-actions bot added area/evaluation MLflow Evaluation rn/none List under Small Changes in Changelogs. labels Dec 15, 2025
@xsh310 xsh310 force-pushed the stack/agentic-judges-support-expectation-comparison-for-tool-call-correctness branch from e8cb9e3 to 300bad2 Compare December 15, 2025 23:42
github-actions bot (Contributor) commented Dec 15, 2025

Documentation preview for b70bf27 is available at:

More info
  • Ignore this comment if this PR does not change the documentation.
  • The preview is updated when a new commit is pushed to this PR.
  • This comment was created by this workflow run.
  • The documentation was built by this workflow run.

@xsh310 xsh310 force-pushed the stack/agentic-judges-support-expectation-comparison-for-tool-call-correctness branch 6 times, most recently from 23844e6 to 857f649 Compare December 16, 2025 06:34
"""
if check_order:
    if len(actual_tools_called) < len(expected_tool_calls):
        return Feedback(
Collaborator:

the feedback construction here seems like a lot of redundant code, can we modularize it?


**Evaluation Modes:**

1. **With Expectations (Deterministic):** If you provide expected tool calls via the
Collaborator:

does this require name and args? or can it just take name?

Collaborator (Author):

No, args is not required and you can just pass the name

Collaborator:

Can we clarify the behavior when you don't specify args in this docstring?

),
)

# Check input parameters (subset check)
Collaborator:

for the tool call correctness judge, do we expect users to provide exact values even in the arguments? curious if you've tried some examples to help determine whether using an LLM is necessary. cc @AveshCSingh @alkispoly-db

Collaborator (Author):

My current implementation performs exact checks only on the arguments explicitly specified by the user in the expectation. If an argument is not included in the expectation, it is not validated against the actual tool call. If the user doesn't provide any argument, we only check that the tool name matches.

Collaborator:

I guess a few things here. My question above is asking whether exact checks are actually effective? I think this may work with simple args, but if there's a string, then I wonder how useful this is (exact equality vs. semantic equality)

"If an argument is not included in the expectation, it is not validated against the actual tool call."
I think we should consistently evaluate both the tools called and the arguments passed. Even if an argument is not passed in the expectation, we should still validate it.

Collaborator (Author):

I don’t think an exact match will work for arguments that require semantic equivalence. That’s why we allow users to provide only a subset of arguments, so they don’t have to explicitly specify these semantic fields.

cc @AveshCSingh , we discussed this last week, and I recall you suggesting not to rely on an LLM when there are expectations. Do you think it would make sense to fall back to an LLM-based comparison if the programmatic check fails?

Collaborator:

I don't think fallback makes sense, since we should clearly delineate between fuzzy matching and exact match.

There's use cases for both exact match and fuzzy matching. A tool call like search_database(table="users", id=123) requires exact match, whereas search_web("databricks series L") needs fuzzy matching. We may eventually need multiple judges. I would start with the exact match though, since you can get at fuzzy match with agent-as-a-judge. I also suspect that by providing just tool call names to the tool call correctness judge and not checking arguments is sufficient for many of these cases.

found_match = True
break

if not found_match:
Collaborator:

can we just do a set equality here instead? this way our error message can point to all missing/extra tool calls, instead of one at a time

Collaborator (Author):

Are you mostly concerned about having too many Feedback constructions here? I can extract that to a helper if that's the case.

I don't think set equality helps here since it will force us to exact match everything?

Collaborator:

isn't that what we already do? this logic seems pretty complex for what I imagine is:

expected_tool_calls = {ToolCall1, ToolCall2, ...}
actual_tool_calls = {ToolCall1, ToolCall3, ToolCall2, ...}
expected_tool_calls == actual_tool_calls

^ for the unordered case. For the ordered case, it's just a matter of using a list

Collaborator (Author):

@smoorjani

If my understanding is correct, the code that you provided is different from my code. Consider the following example: your code will say these don't match, while my code will say they do match.

expected_tool_calls: 
{
    FunctionCall(name='tool_1'), 
    FunctionCall(name='tool_2')
}

actual_tool_calls: 
{
    FunctionCall(name='tool_1', arguments={'param_1': 'value_1'}),
    FunctionCall(name='tool_2', arguments={'param_2': 'value_2'})
}

So as I described in my other comment, it only checks the fields explicitly specified by the user in the expectation.

Let me know if this makes sense to you
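The subset-matching semantics described in this comment could be sketched roughly as follows. This is illustrative, not the PR's implementation: `FunctionCall` is a minimal stand-in for the actual dataclass, and `matches_expectation` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FunctionCall:
    name: str
    arguments: Optional[dict] = None

def matches_expectation(expected: FunctionCall, actual: FunctionCall) -> bool:
    # Only fields explicitly set on the expectation are checked: the tool
    # name must match, and every expected argument must appear in the
    # actual call with an equal value. Arguments omitted from the
    # expectation are ignored.
    if expected.name != actual.name:
        return False
    if expected.arguments is None:
        return True  # name-only expectation: the name match is enough
    actual_args = actual.arguments or {}
    return all(
        key in actual_args and actual_args[key] == value
        for key, value in expected.arguments.items()
    )

# The example from the comment above: name-only expectations match actual
# calls that carry extra arguments.
expected = [FunctionCall("tool_1"), FunctionCall("tool_2")]
actual = [
    FunctionCall("tool_1", {"param_1": "value_1"}),
    FunctionCall("tool_2", {"param_2": "value_2"}),
]
print(all(any(matches_expectation(e, a) for a in actual) for e in expected))  # True
```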

),
)

# Check each expected tool call in order
Collaborator:

same here about using list equality instead - that way we can show all mismatches instead of one at a time
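The reviewer's suggestion of reporting every mismatch at once could be sketched like this, assuming hashable, frozen tool-call objects; `ToolCall` and `diff_tool_calls` are hypothetical names, not code from the PR.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen=True makes instances hashable, so sets work
class ToolCall:
    name: str

def diff_tool_calls(expected: set, actual: set) -> tuple:
    # Return (missing, extra) so a single error message can list every
    # missing and unexpected tool call, rather than failing one at a time.
    return expected - actual, actual - expected

missing, extra = diff_tool_calls(
    {ToolCall("tool_1"), ToolCall("tool_2")},
    {ToolCall("tool_1"), ToolCall("tool_3")},
)
assert missing == {ToolCall("tool_2")}
assert extra == {ToolCall("tool_3")}
```

Note that this only works for exact matches; the subset-matching semantics discussed above would need a custom `__eq__`/`__hash__`, which is where set equality breaks down.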


# If expectations are provided, validate tool calls against expectations
if fields.expectations:
    if "expected_tool_calls" not in fields.expectations:
Collaborator:

can we not error here? just fall back to the ground-truth free case

Collaborator (Author):

@smoorjani I believe throwing an exception is the better approach. Given that the user has already provided an expectation, we should explicitly notify them that the provided format is incorrect.

Collaborator:

The expectation can be provided for another metric. For instance, this would not work well if we used a correctness judge and passed expectations = {"expected_response": "..."} and also wanted to use a tool call correctness judge, but without ground truth.
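One way to avoid erroring here, per this comment, would be to treat a missing key as "no ground truth" and fall back. `resolve_expected_tool_calls` is a hypothetical helper sketching the idea, not the PR's code.

```python
def resolve_expected_tool_calls(expectations):
    # Expectations may be supplied for another scorer (for example,
    # {"expected_response": ...} for a correctness judge). Instead of
    # raising when "expected_tool_calls" is absent, return None so the
    # caller can fall back to the ground-truth-free LLM evaluation.
    if not expectations:
        return None
    return expectations.get("expected_tool_calls")

print(resolve_expected_tool_calls({"expected_response": "..."}))  # None
```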

@xsh310 xsh310 force-pushed the stack/agentic-judges-support-expectation-comparison-for-tool-call-correctness branch 3 times, most recently from d8be033 to 770f1b7 Compare December 17, 2025 00:12
@xsh310 xsh310 force-pushed the stack/agentic-judges-support-expectation-comparison-for-tool-call-correctness branch 2 times, most recently from f6290f8 to a112293 Compare December 17, 2025 06:38
@xsh310 xsh310 force-pushed the stack/agentic-judges-support-expectation-comparison-for-tool-call-correctness branch 4 times, most recently from eb9bee6 to a0a45bc Compare December 18, 2025 06:28
Signed-off-by: Xiang Shen <xshen.shc@gmail.com>
@xsh310 xsh310 force-pushed the stack/agentic-judges-support-expectation-comparison-for-tool-call-correctness branch from a0a45bc to b70bf27 Compare December 18, 2025 17:17
xsh310 (Collaborator, Author) commented Dec 18, 2025

Synced with @smoorjani and @AveshCSingh offline about expectation behavior. Updated the PR logic as follows for now to unblock before we fully finalize expectation behaviors:

  • If no expectation is provided, or an expectation is provided without the required key,
    • Use the LLM judge to evaluate correctness.
  • If any tool call specifies arguments, all arguments must match exactly.
    • The actual tool call must have the same parameters with the same values (no extra or
      missing parameters allowed).
  • If all expected tool calls omit arguments (tool call name only),
    • The scorer validates that the tool names match, then defers to the LLM judge to
      evaluate correctness.
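The dispatch described in the bullets above might be sketched as follows; `pick_evaluation_mode` and the mode names are hypothetical, chosen only to make the three branches explicit.

```python
def pick_evaluation_mode(expected_tool_calls):
    # Returns "llm", "exact", or "names_then_llm" per the rules above.
    if not expected_tool_calls:
        # No expectation (or no usable key): the LLM judge evaluates correctness.
        return "llm"
    if any(call.get("arguments") for call in expected_tool_calls):
        # Any tool call specifies arguments: all arguments must match exactly,
        # with no extra or missing parameters allowed.
        return "exact"
    # All expectations are name-only: validate names, then defer to the LLM judge.
    return "names_then_llm"

assert pick_evaluation_mode(None) == "llm"
assert pick_evaluation_mode([{"name": "search"}]) == "names_then_llm"
assert pick_evaluation_mode([{"name": "q", "arguments": {"t": "users"}}]) == "exact"
```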

xsh310 (Collaborator, Author) commented Dec 19, 2025

Closing: the team aligned on having more discussion about the expectation behavior.

@xsh310 xsh310 closed this Dec 19, 2025

Labels

area/evaluation MLflow Evaluation rn/none List under Small Changes in Changelogs.


3 participants