[Agentic Judges] Introduce tool call efficiency builtin judge#19358
Conversation
2f41148 to
490a86d
Compare
490a86d to
7a76f45
Compare
e35b252 to
c0a22d7
Compare
c0a22d7 to
31fc708
Compare
|
Documentation preview for 436bc45 is available at: More info
|
31fc708 to
21e7edb
Compare
mlflow/genai/utils/trace_utils.py
Outdated
|
|
||
|
|
||
| def parse_tool_calls_from_trace(trace: Trace) -> list[dict[str, str]]: | ||
| class ToolCallInfo(BaseModel): |
There was a problem hiding this comment.
I was discussing the dataclass to use for tool call judges with @smoorjani and @alkispoly-db.
cc @dbczumar , @B-Step62 does this dataclass structure looks good to you? We will be likely be reusing this for future tool call related judges as well.
There was a problem hiding this comment.
fwiw it'd be great if there's an existing interface we can use - e.g., FunctionToolCallArguments renamespaced or ToolCall.from_call_args(name=..., arguments={...})
There was a problem hiding this comment.
Removed ToolCallInfo and added FunctionCall instead that extends Function in chat.py.
4baa3c7 to
cd095ba
Compare
mlflow/genai/utils/trace_utils.py
Outdated
|
|
||
|
|
||
| def parse_tool_calls_from_trace(trace: Trace) -> list[dict[str, str]]: | ||
| class ToolCallInfo(BaseModel): |
There was a problem hiding this comment.
fwiw it'd be great if there's an existing interface we can use - e.g., FunctionToolCallArguments renamespaced or ToolCall.from_call_args(name=..., arguments={...})
mlflow/genai/utils/trace_utils.py
Outdated
| tool_spans = trace.search_spans(span_type=SpanType.TOOL) | ||
|
|
||
| for tool_span in sorted(tool_spans, key=lambda s: s.start_time_ns or 0): | ||
| if not _is_valid_str_dict(tool_span.inputs): |
There was a problem hiding this comment.
jc why is this something we need to validate? I can imagine we can check for a lot of invalid formats, but curious why we check this
There was a problem hiding this comment.
I'm setting input_parameters: dict[str, Any] in dataclass ToolCallInfo, so want to valid tool_span.inputs is indeed this type
There was a problem hiding this comment.
IMO this is something we can assume is already correctly formatted (it would break OTel if it wasn't) so I don't think we need this.
41ab7ab to
d9e99f3
Compare
36c701a to
208384a
Compare
| from mlflow.genai.judges.prompts.tool_call_efficiency import ( | ||
| TOOL_CALL_EFFICIENCY_FEEDBACK_NAME, | ||
| get_prompt, | ||
| ) |
There was a problem hiding this comment.
Any reason to inline the import?
There was a problem hiding this comment.
It should be safe to move this to the top level, but most other functions use inline imports, so I’m following the existing pattern. Since this isn’t reused elsewhere, an inline import seems fine.
There was a problem hiding this comment.
The reason we inline is because each module has a get_prompt so the other option is we rename. IMO the existing pattern is ok.
|
|
||
|
|
||
| @format_docstring(_MODEL_API_DOC) | ||
| def is_tool_call_efficient( |
There was a problem hiding this comment.
Why are we adding legacy built-in judges? Can users instead just call the Scorers interface?
There was a problem hiding this comment.
Interested in your thoughts as well here @smoorjani
There was a problem hiding this comment.
I think it's ok to do this - especially if someone doesn't like some aspect built-in implementation, it's easy to reuse components rather than writing everything from scratch
| return tools_called | ||
|
|
||
|
|
||
| def parse_tool_call_messages_from_trace(trace: Trace) -> list[dict[str, str]]: |
There was a problem hiding this comment.
Why do we need this function?
There was a problem hiding this comment.
I think this is this created for the multi turn tool call judges.
208384a to
12cc9db
Compare
|
Updated the PR to address @smoorjani 's comments |
smoorjani
left a comment
There was a problem hiding this comment.
left a few minor comments, otherwise looks good
| from mlflow.genai.judges.prompts.tool_call_efficiency import ( | ||
| TOOL_CALL_EFFICIENCY_FEEDBACK_NAME, | ||
| get_prompt, | ||
| ) |
There was a problem hiding this comment.
The reason we inline is because each module has a get_prompt so the other option is we rename. IMO the existing pattern is ok.
|
|
||
|
|
||
| @format_docstring(_MODEL_API_DOC) | ||
| def is_tool_call_efficient( |
There was a problem hiding this comment.
we should mark this as experimental
|
|
||
|
|
||
| @format_docstring(_MODEL_API_DOC) | ||
| def is_tool_call_efficient( |
There was a problem hiding this comment.
I think it's ok to do this - especially if someone doesn't like some aspect built-in implementation, it's easy to reuse components rather than writing everything from scratch
| Returns: | ||
| A formatted string containing exception information if found, None otherwise. | ||
| """ | ||
| exception_events = [event for event in span.events if event.name == "exception"] |
There was a problem hiding this comment.
do we have a common constant for this in MLflow?
There was a problem hiding this comment.
I don't think we have one for this right now
mlflow/types/chat.py
Outdated
| return ToolCall(id=id, type="function", function=self) | ||
|
|
||
|
|
||
| class FunctionCall(Function): |
There was a problem hiding this comment.
since this is not a generic/broadly used dataclass, let's just keep this in mlflow.genai.judges somewhere
There was a problem hiding this comment.
I'm moving it to mlflow/genai/utils/type.py next to trace_utils.py
mlflow/genai/utils/trace_utils.py
Outdated
| tool_spans = trace.search_spans(span_type=SpanType.TOOL) | ||
|
|
||
| for tool_span in sorted(tool_spans, key=lambda s: s.start_time_ns or 0): | ||
| if not _is_valid_str_dict(tool_span.inputs): |
There was a problem hiding this comment.
IMO this is something we can assume is already correctly formatted (it would break OTel if it wasn't) so I don't think we need this.
12cc9db to
2b95efc
Compare
|
Updated PR to address @smoorjani 's comments. |
Signed-off-by: Xiang Shen <xshen.shc@gmail.com>
2b95efc to
da29789
Compare
Co-authored-by: Samraj Moorjani <samrajmoorjani@gmail.com> Signed-off-by: Xiang Shen <xshen.shc@gmail.com>
…#19358) Signed-off-by: Xiang Shen <xshen.shc@gmail.com> Co-authored-by: Samraj Moorjani <samrajmoorjani@gmail.com>
Signed-off-by: Xiang Shen <xshen.shc@gmail.com> Co-authored-by: Samraj Moorjani <samrajmoorjani@gmail.com>
🥞 Stacked PR
Use this link to review incremental changes.
What changes are proposed in this pull request?
This PR introduces a new P0 agentic judge
ToolCallEfficiencyfor measuring whether the tool call pattern is efficient given the user query and detects the agent makes redundant/ineffcient tool calls.How is this PR tested?
Manual Testing
Tested with 5 inefficient patterns
And then tested with 5 efficient patterns:
Does this PR require documentation update?
Release Notes
Is this a user-facing change?
What component(s), interfaces, languages, and integrations does this PR affect?
Components
area/tracking: Tracking Service, tracking client APIs, autologgingarea/models: MLmodel format, model serialization/deserialization, flavorsarea/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registryarea/scoring: MLflow Model server, model deployment tools, Spark UDFsarea/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflowsarea/gateway: MLflow AI Gateway client APIs, server, and third-party integrationsarea/prompts: MLflow prompt engineering features, prompt templates, and prompt managementarea/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionalityarea/projects: MLproject format, project running backendsarea/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev serverarea/build: Build and test infrastructure for MLflowarea/docs: MLflow documentation pagesHow should the PR be classified in the release notes? Choose one:
rn/none- No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" sectionrn/breaking-change- The PR will be mentioned in the "Breaking Changes" sectionrn/feature- A new user-facing feature worth mentioning in the release notesrn/bug-fix- A user-facing bug fix worth mentioning in the release notesrn/documentation- A user-facing documentation change worth mentioning in the release notesShould this PR be included in the next patch release?
Yesshould be selected for bug fixes, documentation updates, and other small changes.Noshould be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.What is a minor/patch release?
Bug fixes, doc updates and new features usually go into minor releases.
Bug fixes and doc updates usually go into patch releases.