[Agentic Judges] Introduce tool call efficiency builtin judge by xsh310 · Pull Request #19358 · mlflow/mlflow

xsh310 · 2025-12-12T18:06:54Z

🥞 Stacked PR

Use this link to review incremental changes.

stack/ML-59978-introduce-tool-call-efficiency-builtin-judge [Files changed]
- stack/agentic-judges-introduce-tool-call-correctness-builtin-judge [Files changed]
  - stack/agentic-judges-support-expectation-comparison-for-tool-call-correctness [Files changed]

What changes are proposed in this pull request?

This PR introduces a new P0 agentic judge, ToolCallEfficiency, which measures whether the agent's tool call pattern is efficient given the user query and detects redundant or inefficient tool calls.

How is this PR tested?

Existing unit/integration tests
New unit/integration tests
Manual tests

Manual Testing

Tested with 5 inefficient patterns:

========== Inefficient Test Case 1: Inefficient Pagination ==========
User Query: 
{'query': 'Show me items with IDs 1, 2, and 3.'}


Available Tools: 
- get_items: Get items by ID range. Can retrieve multiple items efficiently.

    Args:
        start_id: Starting item ID (inclusive)
        end_id: Ending item ID (inclusive). If None, returns only the item with start_id

    Returns:
        Item information as a string
    - start_id (required): integer
    - end_id (optional): integer

- count_items: Get the total count of items directly.

    Returns:
        Total number of items


Tools Called: 
Tool Call 1: get_items
  Input Arguments: {'start_id': 1}
  Output: {'content': 'ID: 1, Name: Apple, Price: $1.5', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_M7rWzRvWYacmQGo78B4d6iaR', 'artifact': None, 'status': 'success'}

Tool Call 2: get_items
  Input Arguments: {'start_id': 2}
  Output: {'content': 'ID: 2, Name: Banana, Price: $0.8', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_i8bvWsW8cY7HXErEolAFmemo', 'artifact': None, 'status': 'success'}

Tool Call 3: get_items
  Input Arguments: {'start_id': 3}
  Output: {'content': 'ID: 3, Name: Orange, Price: $1.2', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_AuSHkEykAXvCQB60wnV7vG2u', 'artifact': None, 'status': 'success'}


____________Evaluation Result_________
Efficiency Prediction: no
Rationale: The user requested items with IDs 1, 2, and 3. The available tool 'get_items' supports fetching multiple items efficiently by specifying a start_id and an end_id covering a range. However, the agent called 'get_items' three separate times each with a single ID (start_id 1, 2, then 3). Instead of making three separate calls, the agent could have made a single call to 'get_items' with start_id 1 and end_id 3 to fetch all requested items at once. This approach reduces the number of calls and improves efficiency. Therefore, the agent's tool usage is redundant and not efficient.
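
For readers who want to reproduce the pattern, the two tools in these transcripts can be approximated with an in-memory stand-in. This is a minimal sketch and not the harness used for the manual tests: the `ITEMS` table and both function bodies are assumptions reconstructed from the tool descriptions and outputs shown in the logs.

```python
from typing import Optional

# Hypothetical catalog, reconstructed from the outputs in the transcripts.
ITEMS = {
    1: ("Apple", 1.5),
    2: ("Banana", 0.8),
    3: ("Orange", 1.2),
    4: ("Grape", 2.5),
    5: ("Mango", 3.0),
}


def get_items(start_id: int, end_id: Optional[int] = None) -> str:
    """Return items in [start_id, end_id]; only start_id if end_id is None."""
    last = start_id if end_id is None else end_id
    rows = [
        f"ID: {i}, Name: {ITEMS[i][0]}, Price: ${ITEMS[i][1]}"
        for i in range(start_id, last + 1)
        if i in ITEMS
    ]
    return "\n".join(rows) if rows else "ERROR 404: No items found"


def count_items() -> str:
    """Return the total number of items without fetching any of them."""
    return f"Total items: {len(ITEMS)}"


# Inefficient pattern from Test Case 1: one call per ID.
three_calls = "\n".join(get_items(i) for i in (1, 2, 3))
# Efficient alternative: a single ranged call returns the same rows.
one_call = get_items(1, 3)
```

Both approaches yield the same text, which is why the judge treats the three single-ID calls as consolidatable.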




========== Inefficient Test Case 2: Redundant Verification ==========
User Query: 
{'query': 'What is item 2?'}


Available Tools: 
- get_items: Get items by ID range. Can retrieve multiple items efficiently.

    Args:
        start_id: Starting item ID (inclusive)
        end_id: Ending item ID (inclusive). If None, returns only the item with start_id

    Returns:
        Item information as a string
    - start_id (required): integer
    - end_id (optional): integer

- count_items: Get the total count of items directly.

    Returns:
        Total number of items


Tools Called: 
Tool Call 1: get_items
  Input Arguments: {'start_id': 2}
  Output: {'content': 'ID: 2, Name: Banana, Price: $0.8', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_bLBfETEqZirZdQywV1BLpovA', 'artifact': None, 'status': 'success'}

Tool Call 2: get_items
  Input Arguments: {'start_id': 2}
  Output: {'content': 'ID: 2, Name: Banana, Price: $0.8', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_nKwOLIUdsCPh3IK6H08g3sWW', 'artifact': None, 'status': 'success'}

Tool Call 3: get_items
  Input Arguments: {'start_id': 2}
  Output: {'content': 'ID: 2, Name: Banana, Price: $0.8', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_38fdEtQ6iiPdwa8ZYEkw0M2l', 'artifact': None, 'status': 'success'}


____________Evaluation Result_________
Efficiency Prediction: no
Rationale: The user's query is to find information about item 2. The agent called the 'get_items' tool three times with the exact same argument {'start_id': 2}, and each call returned the same information about item 2: 'ID: 2, Name: Banana, Price: $0.8'. There is no indication any call failed and needed to be retried due to transient errors. Furthermore, since the tool allows retrieving multiple items efficiently, but here the query was about a single item, a single call with 'start_id': 2 would have sufficed. Therefore, the second and third calls to 'get_items' with identical arguments are redundant and unnecessary. The agent's tool usage includes redundant calls that could have been avoided by calling the tool once, making the usage inefficient.
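
The exact-duplicate pattern in this case can also be spotted with a simple deterministic check. The sketch below is only an illustration of the pattern, not how the judge works (the judge is prompt-based); the call-record shape with `name` and `arguments` keys and the helper name `find_duplicate_calls` are assumptions.

```python
import json
from collections import Counter
from typing import Any


def find_duplicate_calls(calls: list[dict[str, Any]]) -> dict[str, int]:
    """Map each repeated (tool name, arguments) pair to its number of extra calls."""
    # Serialize arguments with sorted keys so that key order does not matter.
    keys = [
        f"{c['name']}({json.dumps(c['arguments'], sort_keys=True)})" for c in calls
    ]
    counts = Counter(keys)
    return {key: n - 1 for key, n in counts.items() if n > 1}


# Test Case 2: the same call issued three times.
calls = [{"name": "get_items", "arguments": {"start_id": 2}}] * 3
duplicates = find_duplicate_calls(calls)
```

A check like this ignores tool outputs, so it would also flag a legitimate retry after a transient failure; the rationale above shows the judge taking the outputs into account.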




========== Inefficient Test Case 3: Overlapping Ranges ==========
User Query: 
{'query': 'Get items in the range 1 to 3.'}


Available Tools: 
- get_items: Get items by ID range. Can retrieve multiple items efficiently.

    Args:
        start_id: Starting item ID (inclusive)
        end_id: Ending item ID (inclusive). If None, returns only the item with start_id

    Returns:
        Item information as a string
    - start_id (required): integer
    - end_id (optional): integer

- count_items: Get the total count of items directly.

    Returns:
        Total number of items


Tools Called: 
Tool Call 1: get_items
  Input Arguments: {'start_id': 1, 'end_id': 2}
  Output: {'content': 'ID: 1, Name: Apple, Price: $1.5\nID: 2, Name: Banana, Price: $0.8', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_Hujezi8qDM6GkvtG7zGi6GI3', 'artifact': None, 'status': 'success'}

Tool Call 2: get_items
  Input Arguments: {'start_id': 2, 'end_id': 3}
  Output: {'content': 'ID: 2, Name: Banana, Price: $0.8\nID: 3, Name: Orange, Price: $1.2', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_NM1RVIP8tixQTkjN6TlIUXTs', 'artifact': None, 'status': 'success'}


____________Evaluation Result_________
Efficiency Prediction: no
Rationale: The user requested items in the range 1 to 3. The agent made two get_items calls: first for range 1 to 2, then for range 2 to 3. Both calls include item with ID 2, so there is redundancy in retrieving the item with ID 2 twice. Moreover, the get_items tool supports retrieving multiple items efficiently in a single call by specifying the start and end IDs. Therefore, the agent could have made a single call get_items with start_id=1 and end_id=3, which would be more efficient and not redundant. Hence, the tool usage is not efficient and includes redundant calls.
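
The overlap described in this rationale can be made concrete with a small interval check. This is a hedged sketch: the helper names `overlapping_ids` and `covering_range` are hypothetical, and each call is reduced to a `(start_id, end_id)` tuple following the `get_items` argument semantics described above.

```python
from typing import Optional


def overlapping_ids(ranges: list[tuple[int, Optional[int]]]) -> set[int]:
    """Return the IDs requested by more than one (start_id, end_id) call."""
    seen: set[int] = set()
    repeated: set[int] = set()
    for start, end in ranges:
        last = start if end is None else end  # end_id=None means a single item
        for item_id in range(start, last + 1):
            if item_id in seen:
                repeated.add(item_id)
            else:
                seen.add(item_id)
    return repeated


def covering_range(ranges: list[tuple[int, Optional[int]]]) -> tuple[int, int]:
    """Return the smallest single range that covers every call.

    Only a valid consolidation when the calls are contiguous or overlapping;
    otherwise it would fetch IDs that were never requested.
    """
    starts = [start for start, _ in ranges]
    ends = [start if end is None else end for start, end in ranges]
    return min(starts), max(ends)


# Test Case 3: ranges 1-2 and 2-3 both fetch item 2.
calls = [(1, 2), (2, 3)]
repeated = overlapping_ids(calls)
merged = covering_range(calls)
```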




========== Inefficient Test Case 4: Manual Counting ==========
User Query: 
{'query': 'How many items are in the database?'}


Available Tools: 
- get_items: Get items by ID range. Can retrieve multiple items efficiently.

    Args:
        start_id: Starting item ID (inclusive)
        end_id: Ending item ID (inclusive). If None, returns only the item with start_id

    Returns:
        Item information as a string
    - start_id (required): integer
    - end_id (optional): integer

- count_items: Get the total count of items directly.

    Returns:
        Total number of items


Tools Called: 
Tool Call 1: get_items
  Input Arguments: {'start_id': 1}
  Output: {'content': 'ID: 1, Name: Apple, Price: $1.5', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_pNGduBXYnu68F14tbOjLvxFG', 'artifact': None, 'status': 'success'}

Tool Call 2: get_items
  Input Arguments: {'start_id': 2}
  Output: {'content': 'ID: 2, Name: Banana, Price: $0.8', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_HdIUtp7L2gZU4smjuJgJxZED', 'artifact': None, 'status': 'success'}

Tool Call 3: get_items
  Input Arguments: {'start_id': 3}
  Output: {'content': 'ID: 3, Name: Orange, Price: $1.2', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_9rvmf3ULcqDUDBv3ATUTMXDQ', 'artifact': None, 'status': 'success'}

Tool Call 4: get_items
  Input Arguments: {'start_id': 4}
  Output: {'content': 'ID: 4, Name: Grape, Price: $2.5', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_xXZx8zOLoWbJy1xo6V8KgvE5', 'artifact': None, 'status': 'success'}

Tool Call 5: get_items
  Input Arguments: {'start_id': 5}
  Output: {'content': 'ID: 5, Name: Mango, Price: $3.0', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_GfL8Qx4HYvb1zBaMZtLyPEbP', 'artifact': None, 'status': 'success'}

Tool Call 6: get_items
  Input Arguments: {'start_id': 6}
  Output: {'content': 'ERROR 404: No items found', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_NfXdl899Z84BTJTemnBZpoZX', 'artifact': None, 'status': 'success'}


____________Evaluation Result_________
Efficiency Prediction: no
Rationale: The user query is to find out how many items are in the database. There is a tool called count_items available that directly provides the total count of items, which would be the most efficient and straightforward approach. However, the agent instead made multiple calls to get_items with start_id ranging from 1 through 6, retrieving items one at a time by specifying a single start_id and no end_id. This approach is inefficient because get_items can retrieve multiple items efficiently if a range (start_id and end_id) is specified, but the agent did not consolidate these calls into a single call. Additionally, calling get_items six times for individual item IDs instead of one call to count_items or one call to get_items with a range results in redundancy and inefficiency. Therefore, the tool usage is not efficient and exhibits redundancy.




========== Inefficient Test Case 5: Retrying Non-Transient Errors ==========
User Query: 
{'query': 'What is item 99?'}


Available Tools: 
- get_items: Get items by ID range. Can retrieve multiple items efficiently.

    Args:
        start_id: Starting item ID (inclusive)
        end_id: Ending item ID (inclusive). If None, returns only the item with start_id

    Returns:
        Item information as a string
    - start_id (required): integer
    - end_id (optional): integer

- count_items: Get the total count of items directly.

    Returns:
        Total number of items


Tools Called: 
Tool Call 1: get_items
  Input Arguments: {'start_id': 99}
  Output: {'content': 'ERROR 404: No items found', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_Z63Wr7p44R0A7ka9MNJYEfuJ', 'artifact': None, 'status': 'success'}

Tool Call 2: get_items
  Input Arguments: {'start_id': 99}
  Output: {'content': 'ERROR 404: No items found', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_6xNbNQmqxvmCjSrCfWkZXCf6', 'artifact': None, 'status': 'success'}

Tool Call 3: get_items
  Input Arguments: {'start_id': 99}
  Output: {'content': 'ERROR 404: No items found', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_EXRCGwqJ4RqJFMqkZ3AkGLq5', 'artifact': None, 'status': 'success'}

Tool Call 4: get_items
  Input Arguments: {'start_id': 99}
  Output: {'content': 'ERROR 404: No items found', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_15Xjcwt4LA2WH77ZiYrCOV4B', 'artifact': None, 'status': 'success'}

Tool Call 5: get_items
  Input Arguments: {'start_id': 99}
  Output: {'content': 'ERROR 404: No items found', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_H7EX8AIunwj5C4UCjeF55mTw', 'artifact': None, 'status': 'success'}


____________Evaluation Result_________
Efficiency Prediction: no
Rationale: The agent made five calls to the get_items tool, each with the exact same argument {'start_id': 99}. Each call returned an identical error message indicating no item 99 was found. Since the parameters and responses were the same each time, it is clear that these repeated calls are redundant and unnecessary. None of these calls appear to be retries due to transient failures, as the status is 'success' and the error is consistent, suggesting the item simply doesn't exist. Therefore, the agent could have avoided multiple identical calls and instead stopped after the first failed attempt, making the process more efficient. Hence, the tool usage is not efficient and contains redundancy.
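
The distinction the rationale draws (a consistent 404 is not worth retrying) can be sketched as a check over the call log. This is an illustration under stated assumptions, not the judge's logic: the set of status codes treated as transient, the `ERROR <code>` output prefix, and the helper names are all hypothetical.

```python
import re
from typing import Any, Optional

# Assumption: which HTTP-style codes count as transient.
TRANSIENT_CODES = {429, 502, 503, 504}


def error_code(output: str) -> Optional[int]:
    """Extract the code from outputs shaped like 'ERROR 404: ...', else None."""
    match = re.match(r"ERROR (\d{3})", output)
    return int(match.group(1)) if match else None


def unjustified_retries(calls: list[dict[str, Any]]) -> int:
    """Count calls that repeat the previous call although it did not fail transiently."""
    count = 0
    for prev, cur in zip(calls, calls[1:]):
        same_call = (
            prev["name"] == cur["name"] and prev["arguments"] == cur["arguments"]
        )
        if same_call and error_code(prev["output"]) not in TRANSIENT_CODES:
            count += 1
    return count


# Test Case 5: five identical calls that all return the same 404.
not_found_calls = [
    {
        "name": "get_items",
        "arguments": {"start_id": 99},
        "output": "ERROR 404: No items found",
    }
] * 5
wasted = unjustified_retries(not_found_calls)
```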

And then tested with 5 efficient patterns:

========== Efficient Test Case 1: Efficient Pagination ==========
User Query: 
{'query': 'Show me items with IDs 1, 2, and 3.'}


Available Tools: 
- get_items: Get items by ID range. Can retrieve multiple items efficiently.

    Args:
        start_id: Starting item ID (inclusive)
        end_id: Ending item ID (inclusive). If None, returns only the item with start_id

    Returns:
        Item information as a string
    - start_id (required): integer
    - end_id (optional): integer

- count_items: Get the total count of items directly.

    Returns:
        Total number of items


Tools Called: 
Tool Call 1: get_items
  Input Arguments: {'start_id': 1, 'end_id': 3}
  Output: {'content': 'ID: 1, Name: Apple, Price: $1.5\nID: 2, Name: Banana, Price: $0.8\nID: 3, Name: Orange, Price: $1.2', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_3csb1jvhA18HYZmXUWRI8wyN', 'artifact': None, 'status': 'success'}


____________Evaluation Result_________
Efficiency Prediction: yes
Rationale: The user's request is to show items with IDs 1, 2, and 3. The agent made a single call to the get_items tool with start_id=1 and end_id=3, which retrieves all requested items in one efficient call. There were no repeated or similar calls, and no redundant or consolidatable calls. Therefore, the agent's tool usage is efficient and free of redundancy.




========== Efficient Test Case 2: No Redundant Verification ==========
User Query: 
{'query': 'What is item 2?'}


Available Tools: 
- get_items: Get items by ID range. Can retrieve multiple items efficiently.

    Args:
        start_id: Starting item ID (inclusive)
        end_id: Ending item ID (inclusive). If None, returns only the item with start_id

    Returns:
        Item information as a string
    - start_id (required): integer
    - end_id (optional): integer

- count_items: Get the total count of items directly.

    Returns:
        Total number of items


Tools Called: 
Tool Call 1: get_items
  Input Arguments: {'start_id': 2}
  Output: {'content': 'ID: 2, Name: Banana, Price: $0.8', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_junTVuvvp8YUu7QsMetxpuCD', 'artifact': None, 'status': 'success'}


____________Evaluation Result_________
Efficiency Prediction: yes
Rationale: The user requested the description of item 2. The agent called the tool get_items once with start_id 2 and no end_id, correctly retrieving the information for item 2 only. There were no repeated calls, no calls with overlapping or similar arguments, and no unnecessary consolidation because only one item was requested. The call was efficient and without redundancy.




========== Efficient Test Case 3: Single Range Call ==========
User Query: 
{'query': 'Get items in the range 1 to 3.'}


Available Tools: 
- get_items: Get items by ID range. Can retrieve multiple items efficiently.

    Args:
        start_id: Starting item ID (inclusive)
        end_id: Ending item ID (inclusive). If None, returns only the item with start_id

    Returns:
        Item information as a string
    - start_id (required): integer
    - end_id (optional): integer

- count_items: Get the total count of items directly.

    Returns:
        Total number of items


Tools Called: 
Tool Call 1: get_items
  Input Arguments: {'start_id': 1, 'end_id': 3}
  Output: {'content': 'ID: 1, Name: Apple, Price: $1.5\nID: 2, Name: Banana, Price: $0.8\nID: 3, Name: Orange, Price: $1.2', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_QT6Z9TRmR1NpMjSXioDZlCuQ', 'artifact': None, 'status': 'success'}


____________Evaluation Result_________
Efficiency Prediction: yes
Rationale: The user's query was to get items in the range 1 to 3. The agent made a single call to the get_items tool with start_id 1 and end_id 3, which directly satisfies the user's request. There were no repeated calls to the same tool with identical or similar arguments, nor multiple calls that could have been consolidated. No retries due to errors were present. Therefore, the tool usage is efficient and free of redundancy.




========== Efficient Test Case 4: Use Direct Count Tool ==========
User Query: 
{'query': 'How many items are in the database?'}


Available Tools: 
- get_items: Get items by ID range. Can retrieve multiple items efficiently.

    Args:
        start_id: Starting item ID (inclusive)
        end_id: Ending item ID (inclusive). If None, returns only the item with start_id

    Returns:
        Item information as a string
    - start_id (required): integer
    - end_id (optional): integer

- count_items: Get the total count of items directly.

    Returns:
        Total number of items


Tools Called: 
Tool Call 1: count_items
  Input Arguments: {}
  Output: {'content': 'Total items: 5', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'count_items', 'id': None, 'tool_call_id': 'call_4dXBGuI8hS73iHnpYk16t97y', 'artifact': None, 'status': 'success'}


____________Evaluation Result_________
Efficiency Prediction: yes
Rationale: The user's request is to find out how many items are in the database. There are two tools available: count_items, which directly returns the total count, and get_items, which fetches items by ID range. The agent called count_items once, which is the most efficient way to answer the query since it directly returns the total count without needing to retrieve any items. No redundant or unnecessary tool calls were made, and no retries or inefficiencies were introduced. Thus, the tool usage is efficient and free of redundancy.




========== Efficient Test Case 5: Efficient Transient Error Retry ==========
User Query: 
{'query': 'What is item 3?'}


Available Tools: 
- try_get_item: Try to get an item by its ID.
    Args:
        item_id: The ID of the item to retrieve

    Returns:
        Item information or transient error
    - item_id (required): integer

- count_items: Get the total count of items directly.

    Returns:
        Total number of items


Tools Called: 
Tool Call 1: try_get_item
  Input Arguments: {'item_id': 3}
  Output: {'content': 'ERROR 503: Service temporarily unavailable. Please retry.', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'try_get_item', 'id': None, 'tool_call_id': 'call_ai99CKJmYsokP4NW8fJD2oib', 'artifact': None, 'status': 'success'}

Tool Call 2: try_get_item
  Input Arguments: {'item_id': 3}
  Output: {'content': 'ID: 3, Name: Orange, Price: $1.2', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'try_get_item', 'id': None, 'tool_call_id': 'call_fhnV08iiSsvPi7thP9qQfJ00', 'artifact': None, 'status': 'success'}


____________Evaluation Result_________
Efficiency Prediction: yes
Rationale: The user requested information about item 3, so the agent correctly used the try_get_item tool with item_id 3. The first call resulted in a transient error (ERROR 503), which is recognized as a temporary failure and is not considered inefficient or redundant. The agent retried the same call, which then succeeded in returning the item details. Since retries due to transient errors are explicitly allowed and are not considered redundant, the two calls are justified. No other unnecessary calls or consolidations were possible given the user's request and available tools. Therefore, the tool usage is efficient and free of redundancy.
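
On the agent side, the behavior that this case rewards could look like the retry loop below. It is a minimal sketch assuming tool outputs are plain strings and that only an `ERROR 503` prefix signals a transient failure; `call_with_retry` and `make_scripted_tool` are hypothetical helpers, not part of MLflow or of this PR.

```python
from typing import Callable


def make_scripted_tool(responses: list[str]) -> Callable[[], str]:
    """Build a fake tool that returns the given responses one call at a time."""
    remaining = iter(responses)
    return lambda: next(remaining)


def call_with_retry(tool: Callable[[], str], max_attempts: int = 3) -> tuple[str, int]:
    """Call the tool, retrying only while it reports a transient 503 error.

    Returns the last output together with the number of attempts made.
    """
    attempts = 0
    output = ""
    while attempts < max_attempts:
        attempts += 1
        output = tool()
        if not output.startswith("ERROR 503"):
            break  # success or a non-transient error: retrying would be redundant
    return output, attempts


flaky_tool = make_scripted_tool(
    [
        "ERROR 503: Service temporarily unavailable. Please retry.",
        "ID: 3, Name: Orange, Price: $1.2",
    ]
)
result, attempts_used = call_with_retry(flaky_tool)
```

With this policy a 404 response ends the loop after one attempt, which matches the stop-after-first-failure behavior expected in Inefficient Test Case 5.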

Does this PR require documentation update?

Release Notes

Is this a user-facing change?

No. You can skip the rest of this section.
Yes. Give a description of this change to be included in the release notes for MLflow users.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

How should the PR be classified in the release notes? Choose one:

rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
rn/feature - A new user-facing feature worth mentioning in the release notes
rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?

Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
Bug fixes, doc updates and new features usually go into minor releases.
Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
Bug fixes and doc updates usually go into patch releases.

Yes (this PR will be cherry-picked and included in the next patch release)
No (this PR will be included in the next minor release)

github-actions · 2025-12-15T17:01:14Z

Documentation preview for 436bc45 is available at:

https://pr-19358--mlflow-docs-preview.netlify.app/docs/latest/

More info

Ignore this comment if this PR does not change the documentation.
The preview is updated when a new commit is pushed to this PR.
This comment was created by this workflow run.
The documentation was built by this workflow run.

xsh310 · 2025-12-15T23:38:47Z

mlflow/genai/utils/trace_utils.py



-def parse_tool_calls_from_trace(trace: Trace) -> list[dict[str, str]]:
+class ToolCallInfo(BaseModel):


I was discussing the dataclass to use for tool call judges with @smoorjani and @alkispoly-db.

cc @dbczumar, @B-Step62 does this dataclass structure look good to you? We will likely be reusing it for future tool-call-related judges as well.

fwiw it'd be great if there's an existing interface we can use - e.g., FunctionToolCallArguments renamespaced or ToolCall.from_call_args(name=..., arguments={...})

Removed ToolCallInfo and added FunctionCall instead that extends Function in chat.py.

mlflow/genai/judges/prompts/tool_call_efficiency.py

mlflow/genai/scorers/builtin_scorers.py

smoorjani · 2025-12-16T05:31:37Z

mlflow/genai/utils/trace_utils.py



-def parse_tool_calls_from_trace(trace: Trace) -> list[dict[str, str]]:
+class ToolCallInfo(BaseModel):


fwiw it'd be great if there's an existing interface we can use - e.g., FunctionToolCallArguments renamespaced or ToolCall.from_call_args(name=..., arguments={...})

mlflow/genai/utils/trace_utils.py

smoorjani · 2025-12-16T05:32:57Z

mlflow/genai/utils/trace_utils.py

+    tool_spans = trace.search_spans(span_type=SpanType.TOOL)
+
+    for tool_span in sorted(tool_spans, key=lambda s: s.start_time_ns or 0):
+        if not _is_valid_str_dict(tool_span.inputs):


jc why is this something we need to validate? I can imagine we can check for a lot of invalid formats, but curious why we check this

I'm setting input_parameters: dict[str, Any] in the ToolCallInfo dataclass, so I want to validate that tool_span.inputs is indeed of this type.

IMO this is something we can assume is already correctly formatted (it would break OTel if it wasn't) so I don't think we need this.
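
For context, the ordering and the validation under discussion can be sketched outside MLflow as follows. The body of `_is_valid_str_dict` is not shown in this thread, so the check below is a guess at its intent; `FakeSpan` is a hypothetical stand-in and not the MLflow span class.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class FakeSpan:
    """Hypothetical stand-in exposing only the fields used in the diff."""

    name: str
    start_time_ns: Optional[int]
    inputs: Any


def is_valid_str_dict(value: Any) -> bool:
    """Guess at the check: a dict whose keys are all strings."""
    return isinstance(value, dict) and all(isinstance(key, str) for key in value)


spans = [
    FakeSpan("get_items", 200, {"start_id": 2}),
    FakeSpan("get_items", None, {"start_id": 1}),
    FakeSpan("count_items", 100, "not a dict"),
]

# Same ordering as in the diff: a missing start time sorts as 0, i.e. first.
ordered = sorted(spans, key=lambda s: s.start_time_ns or 0)
valid_names = [s.name for s in ordered if is_valid_str_dict(s.inputs)]
```

If span inputs are always well-formed dictionaries, as argued in the reply above, the filter never drops anything and only the sort matters.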

mlflow/genai/scorers/builtin_scorers.py

AveshCSingh · 2025-12-16T15:21:37Z

mlflow/genai/judges/builtin.py

+    from mlflow.genai.judges.prompts.tool_call_efficiency import (
+        TOOL_CALL_EFFICIENCY_FEEDBACK_NAME,
+        get_prompt,
+    )


Any reason to inline the import?

It should be safe to move this to the top level, but most other functions use inline imports, so I’m following the existing pattern. Since this isn’t reused elsewhere, an inline import seems fine.

The reason we inline is because each module has a get_prompt so the other option is we rename. IMO the existing pattern is ok.

AveshCSingh · 2025-12-16T15:22:27Z

mlflow/genai/judges/builtin.py



+@format_docstring(_MODEL_API_DOC)
+def is_tool_call_efficient(


Why are we adding legacy built-in judges? Can users instead just call the Scorers interface?

Interested in your thoughts as well here @smoorjani

I think it's ok to do this - especially if someone doesn't like some aspect of the built-in implementation, it's easy to reuse components rather than writing everything from scratch

AveshCSingh · 2025-12-16T15:28:03Z

mlflow/genai/utils/trace_utils.py

+    return tools_called
+
+
+def parse_tool_call_messages_from_trace(trace: Trace) -> list[dict[str, str]]:


Why do we need this function?

I think this was created for the multi-turn tool call judges.

xsh310 · 2025-12-16T23:30:57Z

Updated the PR to address @smoorjani 's comments

smoorjani

left a few minor comments, otherwise looks good

smoorjani · 2025-12-17T00:46:18Z

mlflow/genai/judges/builtin.py

+    from mlflow.genai.judges.prompts.tool_call_efficiency import (
+        TOOL_CALL_EFFICIENCY_FEEDBACK_NAME,
+        get_prompt,
+    )


The reason we inline is because each module has a get_prompt so the other option is we rename. IMO the existing pattern is ok.

smoorjani · 2025-12-17T00:47:32Z

mlflow/genai/judges/builtin.py



+@format_docstring(_MODEL_API_DOC)
+def is_tool_call_efficient(


we should mark this as experimental

smoorjani · 2025-12-17T00:47:59Z

mlflow/genai/judges/builtin.py



+@format_docstring(_MODEL_API_DOC)
+def is_tool_call_efficient(


I think it's ok to do this - especially if someone doesn't like some aspect of the built-in implementation, it's easy to reuse components rather than writing everything from scratch

smoorjani · 2025-12-17T00:51:24Z

mlflow/genai/utils/trace_utils.py

+    Returns:
+        A formatted string containing exception information if found, None otherwise.
+    """
+    exception_events = [event for event in span.events if event.name == "exception"]


do we have a common constant for this in MLflow?

I don't think we have one for this right now
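
If a shared constant were introduced, the filter from the diff might read as in the sketch below. `EXCEPTION_EVENT_NAME`, `StubEvent`, `StubSpan`, and `format_exception_info` are hypothetical names for illustration; the event name `exception` and the `exception.type` / `exception.message` attribute keys follow the OpenTelemetry semantic conventions, and the output format is an assumption.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical module-level constant replacing the inline "exception" literal.
EXCEPTION_EVENT_NAME = "exception"


@dataclass
class StubEvent:
    """Stand-in for a span event: a name plus string attributes."""

    name: str
    attributes: dict[str, str] = field(default_factory=dict)


@dataclass
class StubSpan:
    """Stand-in for a span that only carries events."""

    events: list[StubEvent] = field(default_factory=list)


def format_exception_info(span: StubSpan) -> Optional[str]:
    """Return one 'Type: message' line per exception event, or None if there are none."""
    exception_events = [e for e in span.events if e.name == EXCEPTION_EVENT_NAME]
    if not exception_events:
        return None
    return "\n".join(
        f"{e.attributes.get('exception.type', 'Exception')}: "
        f"{e.attributes.get('exception.message', '')}"
        for e in exception_events
    )


failed_span = StubSpan(
    events=[
        StubEvent("log"),
        StubEvent(
            "exception",
            {"exception.type": "ValueError", "exception.message": "bad id"},
        ),
    ]
)
info = format_exception_info(failed_span)
```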

smoorjani · 2025-12-17T00:53:15Z

mlflow/types/chat.py

        return ToolCall(id=id, type="function", function=self)


+class FunctionCall(Function):


since this is not a generic/broadly used dataclass, let's just keep this in mlflow.genai.judges somewhere

I'm moving it to mlflow/genai/utils/type.py next to trace_utils.py

smoorjani · 2025-12-17T00:55:31Z

mlflow/genai/utils/trace_utils.py

+    tool_spans = trace.search_spans(span_type=SpanType.TOOL)
+
+    for tool_span in sorted(tool_spans, key=lambda s: s.start_time_ns or 0):
+        if not _is_valid_str_dict(tool_span.inputs):


IMO this is something we can assume is already correctly formatted (it would break OTel if it wasn't) so I don't think we need this.

xsh310 · 2025-12-17T04:42:30Z

Updated PR to address @smoorjani 's comments.


This was referenced Dec 12, 2025

[Agentic Judges] Add util function to retrieve available tools from trace #19307

Merged

[Agentic Judges] Add fallback retrieval for available tools using LLM #19322

Merged

xsh310 changed the title ~~[ML-59978] Introduce tool call efficiency builtin judge~~ [Agentic Judges] Introduce tool call efficiency builtin judge Dec 12, 2025

xsh310 force-pushed the stack/ML-59978-introduce-tool-call-efficiency-builtin-judge branch from 2f41148 to 490a86d Compare December 12, 2025 20:14

xsh310 requested review from AveshCSingh, B-Step62, alkispoly-db, dbczumar and smoorjani and removed request for AveshCSingh December 12, 2025 20:15

xsh310 marked this pull request as ready for review December 12, 2025 20:16

github-actions bot added area/evaluation MLflow Evaluation rn/none List under Small Changes in Changelogs. labels Dec 12, 2025

xsh310 force-pushed the stack/ML-59978-introduce-tool-call-efficiency-builtin-judge branch from 490a86d to 7a76f45 Compare December 12, 2025 20:25

github-actions bot added the v3.7.1 label Dec 12, 2025

xsh310 added v3.8.0 and removed v3.7.1 labels Dec 12, 2025

xsh310 force-pushed the stack/ML-59978-introduce-tool-call-efficiency-builtin-judge branch 2 times, most recently from e35b252 to c0a22d7 Compare December 15, 2025 06:17

xsh310 mentioned this pull request Dec 15, 2025

[Agentic Judges] Introduce tool call correctness builtin judge #19391

Merged

29 tasks

xsh310 force-pushed the stack/ML-59978-introduce-tool-call-efficiency-builtin-judge branch from c0a22d7 to 31fc708 Compare December 15, 2025 16:53

xsh310 force-pushed the stack/ML-59978-introduce-tool-call-efficiency-builtin-judge branch from 31fc708 to 21e7edb Compare December 15, 2025 22:05

xsh310 mentioned this pull request Dec 15, 2025

[Agentic Judges] Support expectation comparison for ToolCallCorrectness #19413

Closed

29 tasks

xsh310 commented Dec 15, 2025

View reviewed changes

xsh310 force-pushed the stack/ML-59978-introduce-tool-call-efficiency-builtin-judge branch 3 times, most recently from 4baa3c7 to cd095ba Compare December 16, 2025 05:07

smoorjani requested changes Dec 16, 2025

View reviewed changes

xsh310 force-pushed the stack/ML-59978-introduce-tool-call-efficiency-builtin-judge branch 2 times, most recently from 41ab7ab to d9e99f3 Compare December 16, 2025 06:00

github-actions bot assigned smoorjani Dec 16, 2025

xsh310 force-pushed the stack/ML-59978-introduce-tool-call-efficiency-builtin-judge branch 2 times, most recently from 36c701a to 208384a Compare December 16, 2025 06:34

AveshCSingh reviewed Dec 16, 2025

View reviewed changes

github-actions bot assigned AveshCSingh Dec 16, 2025

xsh310 force-pushed the stack/ML-59978-introduce-tool-call-efficiency-builtin-judge branch from 208384a to 12cc9db Compare December 16, 2025 22:24

smoorjani requested changes Dec 17, 2025

View reviewed changes

xsh310 force-pushed the stack/ML-59978-introduce-tool-call-efficiency-builtin-judge branch from 12cc9db to 2b95efc Compare December 17, 2025 03:12

[ML-59978] Introduce tool call efficiency builtin judge

da29789

Signed-off-by: Xiang Shen <xshen.shc@gmail.com>

xsh310 force-pushed the stack/ML-59978-introduce-tool-call-efficiency-builtin-judge branch from 2b95efc to da29789 Compare December 17, 2025 06:38

smoorjani approved these changes Dec 17, 2025

View reviewed changes

mlflow/genai/utils/trace_utils.py Outdated Show resolved Hide resolved

Update mlflow/genai/utils/trace_utils.py

436bc45

Co-authored-by: Samraj Moorjani <samrajmoorjani@gmail.com> Signed-off-by: Xiang Shen <xshen.shc@gmail.com>

xsh310 added this pull request to the merge queue Dec 17, 2025

Merged via the queue into mlflow:master with commit 5f86f1f Dec 17, 2025
46 checks passed

xsh310 deleted the stack/ML-59978-introduce-tool-call-efficiency-builtin-judge branch December 17, 2025 18:08

WeichenXu123 pushed a commit that referenced this pull request Dec 19, 2025

[Agentic Judges] Introduce tool call efficiency builtin judge (#19358)

c724976

Signed-off-by: Xiang Shen <xshen.shc@gmail.com> Co-authored-by: Samraj Moorjani <samrajmoorjani@gmail.com>

smoorjani mentioned this pull request Dec 19, 2025

Add documentation for new tool calling scorers #19539

Merged

29 tasks



