[Agentic Judges] Add fallback retrieval for available tools using LLM #19322
Conversation
force-pushed from 8768b08 to 1f7e5d4
force-pushed from 57c863f to 56daf95
I'm not sure whether we should be concerned about cost here, since @smoorjani you mentioned that ideally we don't want to use an agent for built-in scorers. With the current approach, for any trace that doesn't match the standard format, we might be running an agent call for each scorer call. I'm wondering whether we should make this fallback approach optional, so that users can choose whether or not to opt in based on their cost tolerance, and what a good way to expose this option might be. cc @dbczumar, @smoorjani, @AveshCSingh
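One way to expose such an opt-in is a separate fallback argument that defaults to off, so no LLM call is made unless the user explicitly supplies one. A minimal sketch, assuming hypothetical names (`extract_available_tools`, `parse_programmatically`, `llm_fallback` are illustrative, not the actual MLflow API):

```python
from typing import Callable, List, Optional

def extract_available_tools(
    trace: dict,
    parse_programmatically: Callable[[dict], List[str]],
    llm_fallback: Optional[Callable[[dict], List[str]]] = None,
) -> List[str]:
    # Cheap programmatic parsing always runs first.
    tools = parse_programmatically(trace)
    if tools:
        return tools
    # Only pay for an LLM call if the caller explicitly opted in.
    if llm_fallback is not None:
        return llm_fallback(trace)
    return []
```

With this shape, the default behavior is unchanged and the per-scorer agent call only happens for users who accept the cost.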
Documentation preview for 11452fe is available at: More info
force-pushed from dac7121 to 1b62d37
force-pushed from 1b62d37 to 51fb0bb
mlflow/genai/utils/trace_utils.py
Outdated
_logger.warning(f"Failed to link batch of traces to run: {e}")
class ExtractedToolFunction(BaseModel):
why do we need new classes here? can we use the existing ones defined in mlflow.types.llm? or extend if absolutely necessary
Removed all of these except for ExtractedToolsFromTrace, which wraps a list of ChatTool. We need this because get_chat_completions_with_structured_output's output_schema only accepts a BaseModel type
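A minimal sketch of that constraint, using simplified stand-ins for the mlflow.types.llm tool types (the real ChatTool carries more fields; these class bodies are illustrative only):

```python
from typing import List, Optional
from pydantic import BaseModel

# Simplified stand-ins for mlflow.types.llm's tool types (illustrative only).
class FunctionToolDefinition(BaseModel):
    name: str
    description: Optional[str] = None

class ChatTool(BaseModel):
    type: str = "function"
    function: FunctionToolDefinition

# Structured-output helpers typically require a single BaseModel as the
# output schema, so a bare List[ChatTool] has to be wrapped in a model:
class ExtractedToolsFromTrace(BaseModel):
    tools: List[ChatTool]
```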
assert len(extracted_tools) == 2
tool_names = [t.function.name for t in extracted_tools]
assert "add" in tool_names
nit: can we assert on the exact format of extracted tools? this is generally safer than asserting on a portion of extracted_tools
assert extracted_tools[0].function.name == "hard_to_extract_tool"
def test_extract_available_tools_llm_fallback_not_triggered_when_tools_found():
I think this test is redundant with the happy path above?
I think this is different from the test above. The one above tests the case where the tool is not programmatically parsable and asserts that the fallback is triggered. This one tests the case where the tool is programmatically parsable and asserts that the fallback is not triggered.
what's the difference between this and test_extract_available_tools_from_trace_with_multiple_spans or test_extract_available_tools_from_trace_basic? if there is none, let's remove this test
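For illustration, the distinction between the two cases could be pinned down with mocks along these lines (the module layout and names here are hypothetical stand-ins, not the actual mlflow.genai.utils.trace_utils test file):

```python
import types
from unittest import mock

# Hypothetical stand-in for the trace_utils module under test.
trace_utils = types.SimpleNamespace(
    _try_extract_available_tools_with_llm=lambda trace: ["llm_tool"]
)

def extract_available_tools_from_trace(trace):
    # Programmatic parsing runs first; the LLM fallback only fires when it fails.
    tools = trace.get("tools", [])
    if tools:
        return tools
    return trace_utils._try_extract_available_tools_with_llm(trace)

def test_fallback_triggered_when_not_parsable():
    with mock.patch.object(
        trace_utils, "_try_extract_available_tools_with_llm", return_value=["add"]
    ) as fallback:
        assert extract_available_tools_from_trace({}) == ["add"]
        fallback.assert_called_once()

def test_fallback_not_triggered_when_tools_found():
    with mock.patch.object(
        trace_utils, "_try_extract_available_tools_with_llm"
    ) as fallback:
        assert extract_available_tools_from_trace({"tools": ["add"]}) == ["add"]
        fallback.assert_not_called()
```

Asserting on the mock's call count makes the fallback-vs-no-fallback distinction explicit, which is what separates the two tests under discussion.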
force-pushed from 00ed261 to 60d87c0
force-pushed from 4d13906 to b0b984d
Updated the PR to address @smoorjani's comments
force-pushed from b0b984d to 6acf5f8
smoorjani
left a comment
left a few comments to address before merging, otherwise LGTM
    """
    if model is None:
        if is_databricks_uri(mlflow.get_tracking_uri()):
            # TODO: Add support for Databricks tool extraction with LLM fallback.
let's also file a ticket for this as a follow-up
mock_raise_error,
)
from mlflow.genai.utils.trace_utils import _try_extract_available_tools_with_llm
nit: use top-level imports
from mlflow.genai.utils.trace_utils import _try_extract_available_tools_with_llm
# Should return empty list, not raise exception
nit: one-line comment that doesn't really help readability.
Signed-off-by: Xiang Shen <xshen.shc@gmail.com>
force-pushed from 6acf5f8 to 11452fe
Updated the PR to address @smoorjani's comments
…mlflow#19322) Signed-off-by: Xiang Shen <xshen.shc@gmail.com>
🥞 Stacked PR
Use this link to review incremental changes.
What changes are proposed in this pull request?
Adding a fallback solution for parsing available_tools by using a trace-parsing agent.
How is this PR tested?
Manual Testing
Tested on Langchain and OpenAI agent:
Response from gpt-4o-mini (sometimes won't return all the tools available):
Response from gpt-4.1-mini (returns all the tools available with my example trace):
Does this PR require documentation update?
Release Notes
Is this a user-facing change?
What component(s), interfaces, languages, and integrations does this PR affect?
Components
area/tracking: Tracking Service, tracking client APIs, autologging
area/models: MLmodel format, model serialization/deserialization, flavors
area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
area/scoring: MLflow Model server, model deployment tools, Spark UDFs
area/evaluation: MLflow model evaluation features, evaluation metrics, and evaluation workflows
area/gateway: MLflow AI Gateway client APIs, server, and third-party integrations
area/prompts: MLflow prompt engineering features, prompt templates, and prompt management
area/tracing: MLflow Tracing features, tracing APIs, and LLM tracing functionality
area/projects: MLproject format, project running backends
area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
area/build: Build and test infrastructure for MLflow
area/docs: MLflow documentation pages
How should the PR be classified in the release notes? Choose one:
rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
rn/feature - A new user-facing feature worth mentioning in the release notes
rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
rn/documentation - A user-facing documentation change worth mentioning in the release notes
Should this PR be included in the next patch release?
Yes should be selected for bug fixes, documentation updates, and other small changes.
No should be selected for new features and larger changes.
If you're unsure about the release classification of this PR, leave this unchecked to let the maintainers decide.
What is a minor/patch release?
Bug fixes, doc updates and new features usually go into minor releases.
Bug fixes and doc updates usually go into patch releases.