[Agentic Judges] Add fallback retrieval for available tools using LLM #19322
Conversation
force-pushed from 8768b08 to 1f7e5d4
force-pushed from 57c863f to 56daf95
I'm not sure whether we should be concerned about cost here, since @smoorjani you mentioned that ideally we don't want to use an agent for built-in scorers. With the current approach, for any trace that doesn't match the standard format, we might be running an agent call for each scorer call. I'm wondering whether we should make this fallback approach optional, so that users can choose whether or not to opt in based on their cost tolerance, and what a good way to expose this option might be. cc @dbczumar, @smoorjani, @AveshCSingh
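One way to expose such an opt-in is a separate fallback argument that defaults to off, so no LLM call is made unless the user explicitly supplies one. A minimal sketch, assuming hypothetical names (`extract_available_tools`, `parse_programmatically`, `llm_fallback` are illustrative, not the actual MLflow API):

```python
from typing import Callable, List, Optional

def extract_available_tools(
    trace: dict,
    parse_programmatically: Callable[[dict], List[str]],
    llm_fallback: Optional[Callable[[dict], List[str]]] = None,
) -> List[str]:
    # Cheap programmatic parsing always runs first.
    tools = parse_programmatically(trace)
    if tools:
        return tools
    # Only pay for an LLM call if the caller explicitly opted in.
    if llm_fallback is not None:
        return llm_fallback(trace)
    return []
```

With this shape, the default behavior is unchanged and the per-scorer agent call only happens for users who accept the cost.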
Documentation preview for 11452fe is available at: More info
force-pushed from dac7121 to 1b62d37
force-pushed from 1b62d37 to 51fb0bb
mlflow/genai/utils/trace_utils.py
Outdated
_logger.warning(f"Failed to link batch of traces to run: {e}")
class ExtractedToolFunction(BaseModel):
why do we need new classes here? can we use the existing ones defined in mlflow.types.llm? or extend if absolutely necessary
Removed all of these except for ExtractedToolsFromTrace, which wraps a list of ChatTool. We need this because get_chat_completions_with_structured_output's output_schema only accepts a BaseModel type
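A minimal sketch of that constraint, using simplified stand-ins for the mlflow.types.llm tool types (the real ChatTool carries more fields; these class bodies are illustrative only):

```python
from typing import List, Optional
from pydantic import BaseModel

# Simplified stand-ins for mlflow.types.llm's tool types (illustrative only).
class FunctionToolDefinition(BaseModel):
    name: str
    description: Optional[str] = None

class ChatTool(BaseModel):
    type: str = "function"
    function: FunctionToolDefinition

# Structured-output helpers typically require a single BaseModel as the
# output schema, so a bare List[ChatTool] has to be wrapped in a model:
class ExtractedToolsFromTrace(BaseModel):
    tools: List[ChatTool]
```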
assert len(extracted_tools) == 2
tool_names = [t.function.name for t in extracted_tools]
assert "add" in tool_names
nit: can we assert on the exact format of extracted tools? this is generally safer than asserting on a portion of extracted_tools
assert extracted_tools[0].function.name == "hard_to_extract_tool"
def test_extract_available_tools_llm_fallback_not_triggered_when_tools_found():
I think this test is redundant with the happy path above?
I think this is different from the test above. The one above tests the case where the tool is not programmatically parsable and asserts that the fallback is triggered. This one tests the case where the tool is programmatically parsable and asserts that the fallback is not triggered.
what's the difference between this and test_extract_available_tools_from_trace_with_multiple_spans or test_extract_available_tools_from_trace_basic? if there is none, let's remove this test
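For illustration, the distinction between the two cases could be pinned down with mocks along these lines (the module layout and names here are hypothetical stand-ins, not the actual mlflow.genai.utils.trace_utils test file):

```python
import types
from unittest import mock

# Hypothetical stand-in for the trace_utils module under test.
trace_utils = types.SimpleNamespace(
    _try_extract_available_tools_with_llm=lambda trace: ["llm_tool"]
)

def extract_available_tools_from_trace(trace):
    # Programmatic parsing runs first; the LLM fallback only fires when it fails.
    tools = trace.get("tools", [])
    if tools:
        return tools
    return trace_utils._try_extract_available_tools_with_llm(trace)

def test_fallback_triggered_when_not_parsable():
    with mock.patch.object(
        trace_utils, "_try_extract_available_tools_with_llm", return_value=["add"]
    ) as fallback:
        assert extract_available_tools_from_trace({}) == ["add"]
        fallback.assert_called_once()

def test_fallback_not_triggered_when_tools_found():
    with mock.patch.object(
        trace_utils, "_try_extract_available_tools_with_llm"
    ) as fallback:
        assert extract_available_tools_from_trace({"tools": ["add"]}) == ["add"]
        fallback.assert_not_called()
```

Asserting on the mock's call count makes the fallback-vs-no-fallback distinction explicit, which is what separates the two tests under discussion.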
force-pushed from 00ed261 to 60d87c0
force-pushed from 4d13906 to b0b984d
Updated the PR to address @smoorjani's comments
force-pushed from b0b984d to 6acf5f8
smoorjani
left a comment
left a few comments to address before merging, otherwise LGTM
    """
    if model is None:
        if is_databricks_uri(mlflow.get_tracking_uri()):
            # TODO: Add support for Databricks tool extraction with LLM fallback.
let's also file a ticket for this as a follow-up
mock_raise_error,
)
from mlflow.genai.utils.trace_utils import _try_extract_available_tools_with_llm
nit: use top-level imports
from mlflow.genai.utils.trace_utils import _try_extract_available_tools_with_llm
# Should return empty list, not raise exception
nit: one-line comment that doesn't really help readability.
Signed-off-by: Xiang Shen <xshen.shc@gmail.com>
force-pushed from 6acf5f8 to 11452fe
Updated the PR to address @smoorjani's comments
…mlflow#19322) Signed-off-by: Xiang Shen <xshen.shc@gmail.com>
🥞 Stacked PR
Use this link to review incremental changes.
What changes are proposed in this pull request?
Adding a fallback solution for parsing available_tools by using a trace-parsing agent.
How is this PR tested?
Manual Testing
Tested on Langchain and OpenAI agent:
Response from gpt-4o-mini (sometimes won't return all the tools available):
Response from gpt-4.1-mini (returns all the tools available with my example trace):
Does this PR require documentation update?
Release Notes
Is this a user-facing change?
What component(s), interfaces, languages, and integrations does this PR affect?
Components
area/tracking: Tracking Service, tracking client APIs, autologging
area/models: MLmodel format, model serialization/deserialization, flavors
area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
area/scoring: MLflow Model server, model deployment tools, Spark UDFs
area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
area/projects: MLproject format, project running backends
area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
area/build: Build and test infrastructure for MLflow
area/docs: MLflow documentation pages
How should the PR be classified in the release notes? Choose one:
rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
rn/feature - A new user-facing feature worth mentioning in the release notes
rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
rn/documentation - A user-facing documentation change worth mentioning in the release notes
Should this PR be included in the next patch release?
Yes should be selected for bug fixes, documentation updates, and other small changes.
No should be selected for new features and larger changes.
If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.
What is a minor/patch release?
Bug fixes, doc updates and new features usually go into minor releases.
Bug fixes and doc updates usually go into patch releases.