[Agentic Judges] Support expectation comparison for ToolCallCorrectness#19413
Conversation
Documentation preview for b70bf27 is available.
"""
if check_order:
    if len(actual_tools_called) < len(expected_tool_calls):
        return Feedback(
the feedback construction here seems like a lot of redundant code, can we modularize it?
**Evaluation Modes:**

1. **With Expectations (Deterministic):** If you provide expected tool calls via the
does this require name and args? or can it just take name?
No, args is not required and you can just pass the name
Can we clarify the behavior when you don't specify args in this docstring?
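A possible docstring clarification (suggested wording only, not the merged text; the function signature is illustrative):

```python
def tool_call_correctness(outputs, expectations=None):
    """Judge whether the expected tools were called.

    **Evaluation Modes:**

    1. **With Expectations (Deterministic):** If you provide expected tool
       calls via the ``expected_tool_calls`` expectation, each expected call
       is matched against the actual calls by tool name. Arguments are
       optional: only the arguments explicitly listed in the expectation are
       compared (exact equality); arguments omitted from the expectation are
       not validated. An expectation carrying only a tool name therefore
       matches any call to that tool, regardless of its arguments.
    """
```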
),
)

# Check input parameters (subset check)
for the tool call correctness judge, do we expect users to provide exact values even in the arguments? curious if you've tried some examples to help determine whether using an LLM is necessary. cc @AveshCSingh @alkispoly-db
My current implementation performs exact checks only on the arguments explicitly specified by the user in the expectation. If an argument is not included in the expectation, it is not validated against the actual tool call. If the user doesn't provide any argument, we only check that the tool name matches.
I guess a few things here. My question above is asking whether exact checks are actually effective? I think this may work with simple args, but if there's a string, then I wonder how useful this is (exact equality vs. semantic equality)
> If an argument is not included in the expectation, it is not validated against the actual tool call.
I think we should consistently evaluate both the tools called and the arguments passed. Even if an argument is not passed in the expectation, we should still validate it.
I don’t think an exact match will work for arguments that require semantic equivalence. That’s why we allow users to provide only a subset of arguments, so they don’t have to explicitly specify these semantic fields.
cc @AveshCSingh , we discussed this last week, and I recall you suggesting not to rely on an LLM when there are expectations. Do you think it would make sense to fall back to an LLM-based comparison if the programmatic check fails?
I don't think fallback makes sense, since we should clearly delineate between fuzzy matching and exact match.
There are use cases for both exact match and fuzzy matching. A tool call like search_database(table="users", id=123) requires exact match, whereas search_web("databricks series L") needs fuzzy matching. We may eventually need multiple judges. I would start with exact match, though, since you can get at fuzzy match with agent-as-a-judge. I also suspect that providing just tool call names to the tool call correctness judge, without checking arguments, is sufficient for many of these cases.
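The subset semantics described in this thread can be sketched as follows. Plain dicts stand in for the actual FunctionCall objects, and the helper name is hypothetical, not the PR's implementation:

```python
def call_matches(expected: dict, actual: dict) -> bool:
    """Return True if the actual tool call satisfies the expectation.

    Matching is by tool name, plus exact equality on only the arguments
    the expectation explicitly lists (a subset check). Arguments absent
    from the expectation are ignored, so an expectation carrying just a
    name matches any call to that tool.
    """
    if expected["name"] != actual["name"]:
        return False
    expected_args = expected.get("arguments") or {}
    actual_args = actual.get("arguments") or {}
    # Every argument named in the expectation must match exactly;
    # extra actual arguments are not validated.
    return all(actual_args.get(k) == v for k, v in expected_args.items())
```

Under these semantics, `call_matches({"name": "tool_1"}, {"name": "tool_1", "arguments": {"p": "v"}})` succeeds because no arguments were specified in the expectation.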
found_match = True
break

if not found_match:
can we just do a set equality here instead? this way our error message can point to all missing/extra tool calls, instead of one at a time
Are you mostly concerned about having too many Feedback constructions here? I can extract that to a helper if that's the case.
I don't think set equality helps here since it will force us to exact match everything?
isn't that what we already do? this logic seems pretty complex for what I imagine is:
expected_tool_calls = {ToolCall1, ToolCall2, ...}
actual_tool_calls = {ToolCall1, ToolCall3, ToolCall2, ...}
expected_tool_calls == actual_tool_calls
^ for the unordered case. For the ordered case, it's just a matter of using a list
If my understanding is correct, the code you provided is different from mine. Consider the following example: your code will say these don't match, while my code will say they do match.
expected_tool_calls:
{
FunctionCall(name='tool_1'),
FunctionCall(name='tool_2')
}
actual_tool_calls:
{
FunctionCall(name='tool_1', arguments={'param_1': 'value_1'}),
FunctionCall(name='tool_2', arguments={'param_2': 'value_2'})
}
So as I described in my other comment, it only checks the fields explicitly specified by the user in the expectation.
Let me know if this makes sense to you
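The behavior in this example — unordered matching with subset argument checks, while still being able to report every missing expectation at once (per the review suggestion) — could look like the following sketch. Dicts stand in for FunctionCall objects, and the function name is hypothetical:

```python
def find_missing(expected_calls, actual_calls):
    """Greedily match each expected call against an unmatched actual call,
    using subset semantics on arguments, and return *all* expected calls
    that found no match (so one Feedback can list every miss at once)."""

    def matches(exp, act):
        if exp["name"] != act["name"]:
            return False
        exp_args = exp.get("arguments") or {}
        act_args = act.get("arguments") or {}
        return all(act_args.get(k) == v for k, v in exp_args.items())

    unmatched = list(actual_calls)
    missing = []
    for exp in expected_calls:
        for act in unmatched:
            if matches(exp, act):
                # Consume the matched actual call so it cannot satisfy
                # two different expectations.
                unmatched.remove(act)
                break
        else:
            missing.append(exp)
    return missing
```

With the example above, `find_missing` returns an empty list: both expectations match by name alone, since they specify no arguments.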
),
)

# Check each expected tool call in order
same here about using list equality instead - that way we can show all mismatches instead of one at a time
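For the ordered case, collecting all positional mismatches in one pass might look like this (a hypothetical sketch comparing names only):

```python
def ordered_mismatches(expected_calls, actual_calls):
    """Compare expected vs. actual tool calls position by position and
    return every (index, expected_name, actual_name) mismatch, rather
    than stopping at the first difference."""
    mismatches = []
    for i, exp in enumerate(expected_calls):
        # None signals that the actual trace ran out of tool calls.
        actual_name = actual_calls[i]["name"] if i < len(actual_calls) else None
        if exp["name"] != actual_name:
            mismatches.append((i, exp["name"], actual_name))
    return mismatches
```

The error message can then enumerate all mismatched positions in a single Feedback rationale.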
# If expectations are provided, validate tool calls against expectations
if fields.expectations:
    if "expected_tool_calls" not in fields.expectations:
can we not error here? just fall back to the ground-truth-free case
@smoorjani I believe throwing an exception is the better approach. Given that the user has already provided an expectation, we should explicitly notify them that the provided format is incorrect.
The expectation can be provided for another metric. For instance, this would not work well if we used a correctness judge and passed expectations = {"expected_response": "..."} and also wanted to use a tool call correctness judge, but without ground truth.
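A non-throwing variant matching this suggestion might look like the following (hypothetical helper name; not the PR's code):

```python
def resolve_expected_tool_calls(expectations):
    """Return the expected tool calls if present, or None to signal that
    the judge should fall back to its ground-truth-free (LLM-based) mode.

    Expectations may carry keys intended for other judges (e.g.
    "expected_response" for a correctness judge), so a missing
    "expected_tool_calls" key is not treated as an error here.
    """
    if not expectations:
        return None
    return expectations.get("expected_tool_calls")
```

This lets a single `expectations` dict serve several judges at once without forcing every judge to find its own key.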
Signed-off-by: Xiang Shen <xshen.shc@gmail.com>
Synced with @smoorjani and @AveshCSingh offline about expectation behavior. Updated the PR logic as follows for now to unblock, before we fully finalize expectation behaviors:

Closing: team aligned to have more discussion on the expectation behavior.
🥞 Stacked PR
Use this link to review incremental changes.
What changes are proposed in this pull request?
How is this PR tested?
New Unit Tests:
Added new unit tests under builtin_scorers.py
Manual Test:
Tested a few scenarios using an example trace; returns the expected values
Does this PR require documentation update?
Release Notes
Is this a user-facing change?
What component(s), interfaces, languages, and integrations does this PR affect?
Components
area/tracking: Tracking Service, tracking client APIs, autologgingarea/models: MLmodel format, model serialization/deserialization, flavorsarea/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registryarea/scoring: MLflow Model server, model deployment tools, Spark UDFsarea/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflowsarea/gateway: MLflow AI Gateway client APIs, server, and third-party integrationsarea/prompts: MLflow prompt engineering features, prompt templates, and prompt managementarea/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionalityarea/projects: MLproject format, project running backendsarea/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev serverarea/build: Build and test infrastructure for MLflowarea/docs: MLflow documentation pagesHow should the PR be classified in the release notes? Choose one:
- rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
- rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
- rn/feature - A new user-facing feature worth mentioning in the release notes
- rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
- rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?
- Yes should be selected for bug fixes, documentation updates, and other small changes.
- No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
Bug fixes, doc updates and new features usually go into minor releases.
Bug fixes and doc updates usually go into patch releases.