[Agentic Judges] Introduce tool call efficiency builtin judge#19358

Merged: xsh310 merged 2 commits into mlflow:master from xsh310:stack/ML-59978-introduce-tool-call-efficiency-builtin-judge on Dec 17, 2025

xsh310 (Collaborator) commented Dec 12, 2025

🥞 Stacked PR

Use this link to review incremental changes.


What changes are proposed in this pull request?

This PR introduces a new P0 agentic judge, ToolCallEfficiency, which measures whether the agent's tool call pattern is efficient given the user query and detects redundant or inefficient tool calls.

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests

Manual Testing

Tested with 5 inefficient patterns:

========== Inefficient Test Case 1: Inefficient Pagination ==========
User Query: 
{'query': 'Show me items with IDs 1, 2, and 3.'}


Available Tools: 
- get_items: Get items by ID range. Can retrieve multiple items efficiently.

    Args:
        start_id: Starting item ID (inclusive)
        end_id: Ending item ID (inclusive). If None, returns only the item with start_id

    Returns:
        Item information as a string
    - start_id (required): integer
    - end_id (optional): integer

- count_items: Get the total count of items directly.

    Returns:
        Total number of items


Tools Called: 
Tool Call 1: get_items
  Input Arguments: {'start_id': 1}
  Output: {'content': 'ID: 1, Name: Apple, Price: $1.5', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_M7rWzRvWYacmQGo78B4d6iaR', 'artifact': None, 'status': 'success'}

Tool Call 2: get_items
  Input Arguments: {'start_id': 2}
  Output: {'content': 'ID: 2, Name: Banana, Price: $0.8', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_i8bvWsW8cY7HXErEolAFmemo', 'artifact': None, 'status': 'success'}

Tool Call 3: get_items
  Input Arguments: {'start_id': 3}
  Output: {'content': 'ID: 3, Name: Orange, Price: $1.2', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_AuSHkEykAXvCQB60wnV7vG2u', 'artifact': None, 'status': 'success'}


____________Evaluation Result_________
Efficiency Prediction: no
Rationale: The user requested items with IDs 1, 2, and 3. The available tool 'get_items' supports fetching multiple items efficiently by specifying a start_id and an end_id covering a range. However, the agent called 'get_items' three separate times each with a single ID (start_id 1, 2, then 3). Instead of making three separate calls, the agent could have made a single call to 'get_items' with start_id 1 and end_id 3 to fetch all requested items at once. This approach reduces the number of calls and improves efficiency. Therefore, the agent's tool usage is redundant and not efficient.
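
For readers who want to reproduce the scenario, the two test tools can be approximated as below. This is only a sketch reconstructed from the tool descriptions and outputs in the logs: the `ITEMS` dictionary and the function bodies are assumptions, not the code used for the manual tests. It shows that three single-ID calls and one ranged call return the same data.

```python
from typing import Optional

# Assumed in-memory catalog; the names and prices are copied from the logged outputs.
ITEMS = {
    1: ("Apple", 1.5),
    2: ("Banana", 0.8),
    3: ("Orange", 1.2),
    4: ("Grape", 2.5),
    5: ("Mango", 3.0),
}


def get_items(start_id: int, end_id: Optional[int] = None) -> str:
    """Get items by ID range (inclusive). If end_id is None, return only start_id."""
    last_id = start_id if end_id is None else end_id
    rows = [
        f"ID: {i}, Name: {ITEMS[i][0]}, Price: ${ITEMS[i][1]}"
        for i in range(start_id, last_id + 1)
        if i in ITEMS
    ]
    return "\n".join(rows) if rows else "ERROR 404: No items found"


def count_items() -> str:
    """Get the total count of items directly."""
    return f"Total items: {len(ITEMS)}"


# The inefficient pattern (three calls) and the efficient one (one ranged call).
three_calls = "\n".join(get_items(i) for i in (1, 2, 3))
one_call = get_items(1, 3)
```

Because both patterns yield identical text, the only difference the judge has to reason about is the number of tool calls.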




========== Inefficient Test Case 2: Redundant Verification ==========
User Query: 
{'query': 'What is item 2?'}


Available Tools: 
- get_items: Get items by ID range. Can retrieve multiple items efficiently.

    Args:
        start_id: Starting item ID (inclusive)
        end_id: Ending item ID (inclusive). If None, returns only the item with start_id

    Returns:
        Item information as a string
    - start_id (required): integer
    - end_id (optional): integer

- count_items: Get the total count of items directly.

    Returns:
        Total number of items


Tools Called: 
Tool Call 1: get_items
  Input Arguments: {'start_id': 2}
  Output: {'content': 'ID: 2, Name: Banana, Price: $0.8', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_bLBfETEqZirZdQywV1BLpovA', 'artifact': None, 'status': 'success'}

Tool Call 2: get_items
  Input Arguments: {'start_id': 2}
  Output: {'content': 'ID: 2, Name: Banana, Price: $0.8', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_nKwOLIUdsCPh3IK6H08g3sWW', 'artifact': None, 'status': 'success'}

Tool Call 3: get_items
  Input Arguments: {'start_id': 2}
  Output: {'content': 'ID: 2, Name: Banana, Price: $0.8', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_38fdEtQ6iiPdwa8ZYEkw0M2l', 'artifact': None, 'status': 'success'}


____________Evaluation Result_________
Efficiency Prediction: no
Rationale: The user's query is to find information about item 2. The agent called the 'get_items' tool three times with the exact same argument {'start_id': 2}, and each call returned the same information about item 2: 'ID: 2, Name: Banana, Price: $0.8'. There is no indication any call failed and needed to be retried due to transient errors. Furthermore, since the tool allows retrieving multiple items efficiently, but here the query was about a single item, a single call with 'start_id': 2 would have sufficed. Therefore, the second and third calls to 'get_items' with identical arguments are redundant and unnecessary. The agent's tool usage includes redundant calls that could have been avoided by calling the tool once, making the usage inefficient.
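
The redundancy flagged here (identical tool name and identical arguments) is the one pattern that can also be spotted without an LLM. The sketch below is a hypothetical pre-check, not part of the judge, which relies on a prompt; `find_exact_duplicates` and the `(name, args)` tuple format are assumptions made for illustration.

```python
import json
from collections import Counter


def find_exact_duplicates(calls):
    """Return {(tool_name, canonical_args_json): count} for calls made more than once."""
    counts = Counter(
        # sort_keys makes the JSON canonical, so argument order does not matter.
        (name, json.dumps(args, sort_keys=True))
        for name, args in calls
    )
    return {key: n for key, n in counts.items() if n > 1}


# The three identical calls from this test case.
calls = [("get_items", {"start_id": 2})] * 3
duplicates = find_exact_duplicates(calls)
```

A purely syntactic check like this cannot tell a justified retry after a transient error from a wasted call, which is one reason an LLM judge that also reads the tool outputs is useful.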




========== Inefficient Test Case 3: Overlapping Ranges ==========
User Query: 
{'query': 'Get items in the range 1 to 3.'}


Available Tools: 
- get_items: Get items by ID range. Can retrieve multiple items efficiently.

    Args:
        start_id: Starting item ID (inclusive)
        end_id: Ending item ID (inclusive). If None, returns only the item with start_id

    Returns:
        Item information as a string
    - start_id (required): integer
    - end_id (optional): integer

- count_items: Get the total count of items directly.

    Returns:
        Total number of items


Tools Called: 
Tool Call 1: get_items
  Input Arguments: {'start_id': 1, 'end_id': 2}
  Output: {'content': 'ID: 1, Name: Apple, Price: $1.5\nID: 2, Name: Banana, Price: $0.8', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_Hujezi8qDM6GkvtG7zGi6GI3', 'artifact': None, 'status': 'success'}

Tool Call 2: get_items
  Input Arguments: {'start_id': 2, 'end_id': 3}
  Output: {'content': 'ID: 2, Name: Banana, Price: $0.8\nID: 3, Name: Orange, Price: $1.2', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_NM1RVIP8tixQTkjN6TlIUXTs', 'artifact': None, 'status': 'success'}


____________Evaluation Result_________
Efficiency Prediction: no
Rationale: The user requested items in the range 1 to 3. The agent made two get_items calls: first for range 1 to 2, then for range 2 to 3. Both calls include item with ID 2, so there is redundancy in retrieving the item with ID 2 twice. Moreover, the get_items tool supports retrieving multiple items efficiently in a single call by specifying the start and end IDs. Therefore, the agent could have made a single call get_items with start_id=1 and end_id=3, which would be more efficient and not redundant. Hence, the tool usage is not efficient and includes redundant calls.
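
The consolidation described in this rationale can be worked through on the inclusive ID ranges directly. The helpers below are hypothetical and only illustrate the arithmetic; the judge itself does not run code like this.

```python
def overlapping_ids(ranges):
    """Return the sorted IDs that are fetched by more than one inclusive range."""
    seen, repeated = set(), set()
    for start, end in ranges:
        ids = set(range(start, end + 1))
        repeated |= seen & ids
        seen |= ids
    return sorted(repeated)


def covering_range(ranges):
    """Return one inclusive (start, end) range that replaces all calls, or None if there is a gap."""
    ordered = sorted(ranges)
    start, end = ordered[0]
    for next_start, next_end in ordered[1:]:
        if next_start > end + 1:
            return None  # a gap: a single call would fetch IDs nobody asked for
        end = max(end, next_end)
    return (start, end)


# The two calls from this test case: ranges 1..2 and 2..3.
calls = [(1, 2), (2, 3)]
repeated = overlapping_ids(calls)
merged = covering_range(calls)
```

Here `repeated` is `[2]` (item 2 is fetched twice) and `merged` is `(1, 3)`, matching the single call the rationale recommends.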




========== Inefficient Test Case 4: Manual Counting ==========
User Query: 
{'query': 'How many items are in the database?'}


Available Tools: 
- get_items: Get items by ID range. Can retrieve multiple items efficiently.

    Args:
        start_id: Starting item ID (inclusive)
        end_id: Ending item ID (inclusive). If None, returns only the item with start_id

    Returns:
        Item information as a string
    - start_id (required): integer
    - end_id (optional): integer

- count_items: Get the total count of items directly.

    Returns:
        Total number of items


Tools Called: 
Tool Call 1: get_items
  Input Arguments: {'start_id': 1}
  Output: {'content': 'ID: 1, Name: Apple, Price: $1.5', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_pNGduBXYnu68F14tbOjLvxFG', 'artifact': None, 'status': 'success'}

Tool Call 2: get_items
  Input Arguments: {'start_id': 2}
  Output: {'content': 'ID: 2, Name: Banana, Price: $0.8', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_HdIUtp7L2gZU4smjuJgJxZED', 'artifact': None, 'status': 'success'}

Tool Call 3: get_items
  Input Arguments: {'start_id': 3}
  Output: {'content': 'ID: 3, Name: Orange, Price: $1.2', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_9rvmf3ULcqDUDBv3ATUTMXDQ', 'artifact': None, 'status': 'success'}

Tool Call 4: get_items
  Input Arguments: {'start_id': 4}
  Output: {'content': 'ID: 4, Name: Grape, Price: $2.5', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_xXZx8zOLoWbJy1xo6V8KgvE5', 'artifact': None, 'status': 'success'}

Tool Call 5: get_items
  Input Arguments: {'start_id': 5}
  Output: {'content': 'ID: 5, Name: Mango, Price: $3.0', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_GfL8Qx4HYvb1zBaMZtLyPEbP', 'artifact': None, 'status': 'success'}

Tool Call 6: get_items
  Input Arguments: {'start_id': 6}
  Output: {'content': 'ERROR 404: No items found', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_NfXdl899Z84BTJTemnBZpoZX', 'artifact': None, 'status': 'success'}


____________Evaluation Result_________
Efficiency Prediction: no
Rationale: The user query is to find out how many items are in the database. There is a tool called count_items available that directly provides the total count of items, which would be the most efficient and straightforward approach. However, the agent instead made multiple calls to get_items with start_id ranging from 1 through 6, retrieving items one at a time by specifying a single start_id and no end_id. This approach is inefficient because get_items can retrieve multiple items efficiently if a range (start_id and end_id) is specified, but the agent did not consolidate these calls into a single call. Additionally, calling get_items six times for individual item IDs instead of one call to count_items or one call to get_items with a range results in redundancy and inefficiency. Therefore, the tool usage is not efficient and exhibits redundancy.




========== Inefficient Test Case 5: Retrying Non-Transient Errors ==========
User Query: 
{'query': 'What is item 99?'}


Available Tools: 
- get_items: Get items by ID range. Can retrieve multiple items efficiently.

    Args:
        start_id: Starting item ID (inclusive)
        end_id: Ending item ID (inclusive). If None, returns only the item with start_id

    Returns:
        Item information as a string
    - start_id (required): integer
    - end_id (optional): integer

- count_items: Get the total count of items directly.

    Returns:
        Total number of items


Tools Called: 
Tool Call 1: get_items
  Input Arguments: {'start_id': 99}
  Output: {'content': 'ERROR 404: No items found', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_Z63Wr7p44R0A7ka9MNJYEfuJ', 'artifact': None, 'status': 'success'}

Tool Call 2: get_items
  Input Arguments: {'start_id': 99}
  Output: {'content': 'ERROR 404: No items found', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_6xNbNQmqxvmCjSrCfWkZXCf6', 'artifact': None, 'status': 'success'}

Tool Call 3: get_items
  Input Arguments: {'start_id': 99}
  Output: {'content': 'ERROR 404: No items found', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_EXRCGwqJ4RqJFMqkZ3AkGLq5', 'artifact': None, 'status': 'success'}

Tool Call 4: get_items
  Input Arguments: {'start_id': 99}
  Output: {'content': 'ERROR 404: No items found', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_15Xjcwt4LA2WH77ZiYrCOV4B', 'artifact': None, 'status': 'success'}

Tool Call 5: get_items
  Input Arguments: {'start_id': 99}
  Output: {'content': 'ERROR 404: No items found', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_H7EX8AIunwj5C4UCjeF55mTw', 'artifact': None, 'status': 'success'}


____________Evaluation Result_________
Efficiency Prediction: no
Rationale: The agent made five calls to the get_items tool, each with the exact same argument {'start_id': 99}. Each call returned an identical error message indicating no item 99 was found. Since the parameters and responses were the same each time, it is clear that these repeated calls are redundant and unnecessary. None of these calls appear to be retries due to transient failures, as the status is 'success' and the error is consistent, suggesting the item simply doesn't exist. Therefore, the agent could have avoided multiple identical calls and instead stopped after the first failed attempt, making the process more efficient. Hence, the tool usage is not efficient and contains redundancy.

And then tested with 5 efficient patterns:

========== Efficient Test Case 1: Efficient Pagination ==========
User Query: 
{'query': 'Show me items with IDs 1, 2, and 3.'}


Available Tools: 
- get_items: Get items by ID range. Can retrieve multiple items efficiently.

    Args:
        start_id: Starting item ID (inclusive)
        end_id: Ending item ID (inclusive). If None, returns only the item with start_id

    Returns:
        Item information as a string
    - start_id (required): integer
    - end_id (optional): integer

- count_items: Get the total count of items directly.

    Returns:
        Total number of items


Tools Called: 
Tool Call 1: get_items
  Input Arguments: {'start_id': 1, 'end_id': 3}
  Output: {'content': 'ID: 1, Name: Apple, Price: $1.5\nID: 2, Name: Banana, Price: $0.8\nID: 3, Name: Orange, Price: $1.2', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_3csb1jvhA18HYZmXUWRI8wyN', 'artifact': None, 'status': 'success'}


____________Evaluation Result_________
Efficiency Prediction: yes
Rationale: The user's request is to show items with IDs 1, 2, and 3. The agent made a single call to the get_items tool with start_id=1 and end_id=3, which retrieves all requested items in one efficient call. There were no repeated or similar calls, and no redundant or consolidatable calls. Therefore, the agent's tool usage is efficient and free of redundancy.




========== Efficient Test Case 2: No Redundant Verification ==========
User Query: 
{'query': 'What is item 2?'}


Available Tools: 
- get_items: Get items by ID range. Can retrieve multiple items efficiently.

    Args:
        start_id: Starting item ID (inclusive)
        end_id: Ending item ID (inclusive). If None, returns only the item with start_id

    Returns:
        Item information as a string
    - start_id (required): integer
    - end_id (optional): integer

- count_items: Get the total count of items directly.

    Returns:
        Total number of items


Tools Called: 
Tool Call 1: get_items
  Input Arguments: {'start_id': 2}
  Output: {'content': 'ID: 2, Name: Banana, Price: $0.8', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_junTVuvvp8YUu7QsMetxpuCD', 'artifact': None, 'status': 'success'}


____________Evaluation Result_________
Efficiency Prediction: yes
Rationale: The user requested the description of item 2. The agent called the tool get_items once with start_id 2 and no end_id, correctly retrieving the information for item 2 only. There were no repeated calls, no calls with overlapping or similar arguments, and no unnecessary consolidation because only one item was requested. The call was efficient and without redundancy.




========== Efficient Test Case 3: Single Range Call ==========
User Query: 
{'query': 'Get items in the range 1 to 3.'}


Available Tools: 
- get_items: Get items by ID range. Can retrieve multiple items efficiently.

    Args:
        start_id: Starting item ID (inclusive)
        end_id: Ending item ID (inclusive). If None, returns only the item with start_id

    Returns:
        Item information as a string
    - start_id (required): integer
    - end_id (optional): integer

- count_items: Get the total count of items directly.

    Returns:
        Total number of items


Tools Called: 
Tool Call 1: get_items
  Input Arguments: {'start_id': 1, 'end_id': 3}
  Output: {'content': 'ID: 1, Name: Apple, Price: $1.5\nID: 2, Name: Banana, Price: $0.8\nID: 3, Name: Orange, Price: $1.2', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'get_items', 'id': None, 'tool_call_id': 'call_QT6Z9TRmR1NpMjSXioDZlCuQ', 'artifact': None, 'status': 'success'}


____________Evaluation Result_________
Efficiency Prediction: yes
Rationale: The user's query was to get items in the range 1 to 3. The agent made a single call to the get_items tool with start_id 1 and end_id 3, which directly satisfies the user's request. There were no repeated calls to the same tool with identical or similar arguments, nor multiple calls that could have been consolidated. No retries due to errors were present. Therefore, the tool usage is efficient and free of redundancy.




========== Efficient Test Case 4: Use Direct Count Tool ==========
User Query: 
{'query': 'How many items are in the database?'}


Available Tools: 
- get_items: Get items by ID range. Can retrieve multiple items efficiently.

    Args:
        start_id: Starting item ID (inclusive)
        end_id: Ending item ID (inclusive). If None, returns only the item with start_id

    Returns:
        Item information as a string
    - start_id (required): integer
    - end_id (optional): integer

- count_items: Get the total count of items directly.

    Returns:
        Total number of items


Tools Called: 
Tool Call 1: count_items
  Input Arguments: {}
  Output: {'content': 'Total items: 5', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'count_items', 'id': None, 'tool_call_id': 'call_4dXBGuI8hS73iHnpYk16t97y', 'artifact': None, 'status': 'success'}


____________Evaluation Result_________
Efficiency Prediction: yes
Rationale: The user's request is to find out how many items are in the database. There are two tools available: count_items, which directly returns the total count, and get_items, which fetches items by ID range. The agent called count_items once, which is the most efficient way to answer the query since it directly returns the total count without needing to retrieve any items. No redundant or unnecessary tool calls were made, and no retries or inefficiencies were introduced. Thus, the tool usage is efficient and free of redundancy.




========== Efficient Test Case 5: Efficient Transient Error Retry ==========
User Query: 
{'query': 'What is item 3?'}


Available Tools: 
- try_get_item: Try to get an item by its ID.
    Args:
        item_id: The ID of the item to retrieve

    Returns:
        Item information or transient error
    - item_id (required): integer

- count_items: Get the total count of items directly.

    Returns:
        Total number of items


Tools Called: 
Tool Call 1: try_get_item
  Input Arguments: {'item_id': 3}
  Output: {'content': 'ERROR 503: Service temporarily unavailable. Please retry.', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'try_get_item', 'id': None, 'tool_call_id': 'call_ai99CKJmYsokP4NW8fJD2oib', 'artifact': None, 'status': 'success'}

Tool Call 2: try_get_item
  Input Arguments: {'item_id': 3}
  Output: {'content': 'ID: 3, Name: Orange, Price: $1.2', 'additional_kwargs': {}, 'response_metadata': {}, 'type': 'tool', 'name': 'try_get_item', 'id': None, 'tool_call_id': 'call_fhnV08iiSsvPi7thP9qQfJ00', 'artifact': None, 'status': 'success'}


____________Evaluation Result_________
Efficiency Prediction: yes
Rationale: The user requested information about item 3, so the agent correctly used the try_get_item tool with item_id 3. The first call resulted in a transient error (ERROR 503), which is recognized as a temporary failure and is not considered inefficient or redundant. The agent retried the same call, which then succeeded in returning the item details. Since retries due to transient errors are explicitly allowed and are not considered redundant, the two calls are justified. No other unnecessary calls or consolidations were possible given the user's request and available tools. Therefore, the tool usage is efficient and free of redundancy.
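
The contrast between this case and Inefficient Test Case 5 comes down to which errors are worth retrying. A minimal sketch of that policy follows, assuming that only outputs starting with `ERROR 503` are transient; `call_with_retry`, `make_flaky_tool`, and `missing_item` are hypothetical helpers, not the tools used in the manual tests.

```python
# Assumption: only 503 responses are treated as transient and therefore retryable.
TRANSIENT_PREFIXES = ("ERROR 503",)


def call_with_retry(tool, *, max_attempts=3, **kwargs):
    """Call `tool`, retrying only while its output looks like a transient error."""
    outputs = []
    for _ in range(max_attempts):
        output = tool(**kwargs)
        outputs.append(output)
        if not output.startswith(TRANSIENT_PREFIXES):
            break  # success, or a non-transient error such as 404: stop here
    return outputs


def make_flaky_tool():
    """Build a tool that fails once with 503 and then succeeds."""
    state = {"calls": 0}

    def try_get_item(item_id):
        state["calls"] += 1
        if state["calls"] == 1:
            return "ERROR 503: Service temporarily unavailable. Please retry."
        return f"ID: {item_id}, Name: Orange, Price: $1.2"

    return try_get_item


def missing_item(item_id):
    """A tool whose error is permanent: the item does not exist."""
    return "ERROR 404: No items found"


flaky_outputs = call_with_retry(make_flaky_tool(), item_id=3)
missing_outputs = call_with_retry(missing_item, item_id=99)
```

With this policy the 503 case makes two calls (as in the efficient trace above), while the 404 case stops after one call instead of the five shown in the inefficient trace.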

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/tracking: Tracking Service, tracking client APIs, autologging
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
  • area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
  • area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
  • area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
  • area/projects: MLproject format, project running backends
  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Should this PR be included in the next patch release?

Yes should be selected for bug fixes, documentation updates, and other small changes. No should be selected for new features and larger changes. If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.

What is a minor/patch release?
  • Minor release: a release that increments the second part of the version number (e.g., 1.2.0 -> 1.3.0).
    Bug fixes, doc updates and new features usually go into minor releases.
  • Patch release: a release that increments the third part of the version number (e.g., 1.2.0 -> 1.2.1).
    Bug fixes and doc updates usually go into patch releases.
  • Yes (this PR will be cherry-picked and included in the next patch release)
  • No (this PR will be included in the next minor release)

xsh310 changed the title from "[ML-59978] Introduce tool call efficiency builtin judge" to "[Agentic Judges] Introduce tool call efficiency builtin judge" on Dec 12, 2025
xsh310 force-pushed the stack/ML-59978-introduce-tool-call-efficiency-builtin-judge branch from 2f41148 to 490a86d on December 12, 2025 20:14
xsh310 requested review from AveshCSingh, B-Step62, alkispoly-db, dbczumar and smoorjani, and removed the request for AveshCSingh, on December 12, 2025 20:15
xsh310 marked this pull request as ready for review on December 12, 2025 20:16
github-actions bot added the area/evaluation (MLflow Evaluation) and rn/none (List under Small Changes in Changelogs.) labels on Dec 12, 2025
xsh310 force-pushed the stack/ML-59978-introduce-tool-call-efficiency-builtin-judge branch from 490a86d to 7a76f45 on December 12, 2025 20:25
xsh310 added the v3.8.0 label and removed the v3.7.1 label on Dec 12, 2025
xsh310 force-pushed the stack/ML-59978-introduce-tool-call-efficiency-builtin-judge branch 2 times, most recently from e35b252 to c0a22d7, on December 15, 2025 06:17
xsh310 force-pushed the stack/ML-59978-introduce-tool-call-efficiency-builtin-judge branch from c0a22d7 to 31fc708 on December 15, 2025 16:53
github-actions bot (Contributor) commented Dec 15, 2025

Documentation preview for 436bc45 is available at:

More info
  • Ignore this comment if this PR does not change the documentation.
  • The preview is updated when a new commit is pushed to this PR.
  • This comment was created by this workflow run.
  • The documentation was built by this workflow run.

xsh310 force-pushed the stack/ML-59978-introduce-tool-call-efficiency-builtin-judge branch from 31fc708 to 21e7edb on December 15, 2025 22:05


def parse_tool_calls_from_trace(trace: Trace) -> list[dict[str, str]]:
class ToolCallInfo(BaseModel):
Collaborator Author

I was discussing the dataclass to use for tool call judges with @smoorjani and @alkispoly-db.

cc @dbczumar, @B-Step62: does this dataclass structure look good to you? We will likely be reusing this for future tool-call-related judges as well.

Collaborator

fwiw it'd be great if there's an existing interface we can use - e.g., FunctionToolCallArguments renamespaced or ToolCall.from_call_args(name=..., arguments={...})

Collaborator Author

Removed ToolCallInfo and added FunctionCall instead that extends Function in chat.py.

xsh310 force-pushed the stack/ML-59978-introduce-tool-call-efficiency-builtin-judge branch 3 times, most recently from 4baa3c7 to cd095ba, on December 16, 2025 05:07


tool_spans = trace.search_spans(span_type=SpanType.TOOL)

for tool_span in sorted(tool_spans, key=lambda s: s.start_time_ns or 0):
    if not _is_valid_str_dict(tool_span.inputs):
Collaborator

jc why is this something we need to validate? I can imagine we can check for a lot of invalid formats, but curious why we check this

Collaborator Author

I'm setting input_parameters: dict[str, Any] in the ToolCallInfo dataclass, so I want to validate that tool_span.inputs is indeed of this type.

Collaborator

IMO this is something we can assume is already correctly formatted (it would break OTel if it wasn't) so I don't think we need this.
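
To make the quoted loop concrete, here is a small sketch of what the sort key and a check in the spirit of `_is_valid_str_dict` might do. `FakeSpan` and `has_str_keys` are hypothetical stand-ins written for this example; they are not MLflow's `Span` class or the helper under review, whose implementation is not shown in this thread.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class FakeSpan:
    """Hypothetical stand-in for a trace span; not MLflow's Span class."""

    name: str
    start_time_ns: Optional[int]
    inputs: Any


def has_str_keys(value: Any) -> bool:
    """Hypothetical check: True if value is a dict whose keys are all strings."""
    return isinstance(value, dict) and all(isinstance(k, str) for k in value)


spans = [
    FakeSpan("second", 200, {"start_id": 2}),
    FakeSpan("no_timestamp", None, {"start_id": 9}),
    FakeSpan("first", 100, "not a dict"),
]

# `or 0` maps a missing start time to 0, so such spans sort first instead of
# raising TypeError when None is compared with an int.
ordered = sorted(spans, key=lambda s: s.start_time_ns or 0)
ordered_names = [s.name for s in ordered]
valid_names = [s.name for s in ordered if has_str_keys(s.inputs)]
```

The reviewer's point is that the second step is likely unnecessary in practice, since span inputs that were not well formed would already have failed earlier in the tracing pipeline.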

xsh310 force-pushed the stack/ML-59978-introduce-tool-call-efficiency-builtin-judge branch 2 times, most recently from 41ab7ab to d9e99f3, on December 16, 2025 06:00
xsh310 force-pushed the stack/ML-59978-introduce-tool-call-efficiency-builtin-judge branch 2 times, most recently from 36c701a to 208384a, on December 16, 2025 06:34
Comment on lines +449 to +452
from mlflow.genai.judges.prompts.tool_call_efficiency import (
    TOOL_CALL_EFFICIENCY_FEEDBACK_NAME,
    get_prompt,
)
Collaborator

Any reason to inline the import?

Collaborator Author

It should be safe to move this to the top level, but most other functions use inline imports, so I’m following the existing pattern. Since this isn’t reused elsewhere, an inline import seems fine.

Collaborator

The reason we inline is because each module has a get_prompt so the other option is we rename. IMO the existing pattern is ok.
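
The name-collision argument can be illustrated with standard-library modules standing in for the prompt modules. In this sketch `json.dumps` and `pickle.dumps` play the role of two different `get_prompt` functions; the wrapper function names are made up for the example.

```python
def encode_as_json(value):
    # Function-local import: `dumps` is bound only inside this function.
    from json import dumps

    return dumps(value)


def encode_as_pickle(value):
    # Same imported name from a different module; no clash with the one above.
    from pickle import dumps

    return dumps(value)


json_out = encode_as_json({"a": 1})
pickle_out = encode_as_pickle({"a": 1})
```

The alternative mentioned in the thread, renaming, would correspond to top-level aliases such as `from json import dumps as json_dumps`, at the cost of a distinct alias per module.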



@format_docstring(_MODEL_API_DOC)
def is_tool_call_efficient(
Collaborator

Why are we adding legacy built-in judges? Can users instead just call the Scorers interface?

Collaborator

Interested in your thoughts as well here @smoorjani

Collaborator

I think it's ok to do this - especially if someone doesn't like some aspect of the built-in implementation, it's easy to reuse components rather than writing everything from scratch

return tools_called


def parse_tool_call_messages_from_trace(trace: Trace) -> list[dict[str, str]]:
Collaborator

Why do we need this function?

Collaborator Author

I think this was created for the multi-turn tool call judges.

xsh310 force-pushed the stack/ML-59978-introduce-tool-call-efficiency-builtin-judge branch from 208384a to 12cc9db on December 16, 2025 22:24
xsh310 (Collaborator Author) commented Dec 16, 2025

Updated the PR to address @smoorjani's comments.

smoorjani (Collaborator) left a comment

left a few minor comments, otherwise looks good



@format_docstring(_MODEL_API_DOC)
def is_tool_call_efficient(
Collaborator

we should mark this as experimental



    Returns:
        A formatted string containing exception information if found, None otherwise.
    """
    exception_events = [event for event in span.events if event.name == "exception"]
Collaborator

do we have a common constant for this in MLflow?

Collaborator Author

I don't think we have one for this right now
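
If a shared constant were introduced, the filter could look roughly like the sketch below. `EXCEPTION_EVENT_NAME`, `FakeEvent`, `FakeSpan`, and `format_exception_info` are all hypothetical names for this example, and the output format is an assumption. The event name `"exception"` and the attribute keys `exception.type` and `exception.message` follow the OpenTelemetry semantic conventions for exceptions recorded on spans.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical shared constant; the thread notes MLflow does not define one today.
EXCEPTION_EVENT_NAME = "exception"


@dataclass
class FakeEvent:
    """Hypothetical stand-in for a span event."""

    name: str
    attributes: dict = field(default_factory=dict)


@dataclass
class FakeSpan:
    """Hypothetical stand-in for a span that carries events."""

    events: list


def format_exception_info(span) -> Optional[str]:
    """Return one 'Type: message' line per exception event, or None if there are none."""
    exception_events = [e for e in span.events if e.name == EXCEPTION_EVENT_NAME]
    if not exception_events:
        return None
    lines = []
    for event in exception_events:
        exc_type = event.attributes.get("exception.type", "Unknown")
        message = event.attributes.get("exception.message", "")
        lines.append(f"{exc_type}: {message}")
    return "\n".join(lines)


failed = FakeSpan(
    events=[
        FakeEvent("log"),
        FakeEvent(
            "exception",
            {"exception.type": "ValueError", "exception.message": "bad id"},
        ),
    ]
)
clean = FakeSpan(events=[FakeEvent("log")])

failed_info = format_exception_info(failed)
clean_info = format_exception_info(clean)
```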

return ToolCall(id=id, type="function", function=self)


class FunctionCall(Function):
Collaborator

since this is not a generic/broadly used dataclass, let's just keep this in mlflow.genai.judges somewhere

Collaborator Author

I'm moving it to mlflow/genai/utils/type.py next to trace_utils.py

xsh310 force-pushed the stack/ML-59978-introduce-tool-call-efficiency-builtin-judge branch from 12cc9db to 2b95efc on December 17, 2025 03:12
xsh310 (Collaborator Author) commented Dec 17, 2025

Updated the PR to address @smoorjani's comments.

Signed-off-by: Xiang Shen <xshen.shc@gmail.com>
xsh310 force-pushed the stack/ML-59978-introduce-tool-call-efficiency-builtin-judge branch from 2b95efc to da29789 on December 17, 2025 06:38
Co-authored-by: Samraj Moorjani <samrajmoorjani@gmail.com>
Signed-off-by: Xiang Shen <xshen.shc@gmail.com>
xsh310 added this pull request to the merge queue on Dec 17, 2025
Merged via the queue into mlflow:master with commit 5f86f1f on Dec 17, 2025
46 checks passed
xsh310 deleted the stack/ML-59978-introduce-tool-call-efficiency-builtin-judge branch on December 17, 2025 18:08
WeichenXu123 pushed a commit to WeichenXu123/mlflow that referenced this pull request Dec 19, 2025
…#19358)

Signed-off-by: Xiang Shen <xshen.shc@gmail.com>
Co-authored-by: Samraj Moorjani <samrajmoorjani@gmail.com>
WeichenXu123 pushed a commit that referenced this pull request Dec 19, 2025
Signed-off-by: Xiang Shen <xshen.shc@gmail.com>
Co-authored-by: Samraj Moorjani <samrajmoorjani@gmail.com>

Labels

area/evaluation MLflow Evaluation rn/none List under Small Changes in Changelogs. v3.8.0
