fix(tools): isolate get_tool_definitions quiet_mode cache + dedup LCM injection (#17335)#17337
Closed
Sanjays2402 wants to merge 1 commit into
Closed
Conversation
… injection (NousResearch#17335) Long-lived Gateway processes were sending duplicate tool names to providers that enforce uniqueness: - DeepSeek: 'Tool names must be unique.' - Xiaomi MiMo: 'tools contains duplicate names: lcm_expand' - Moonshot/Kimi: 'function name lcm_grep is duplicated' TUI was unaffected because TUI runs with quiet_mode=False and skips the cache entirely. Root cause (two layered bugs) - model_tools.get_tool_definitions(quiet_mode=True) memoizes its result in _tool_defs_cache. The cache-hit path returned list(cached) (safe), but the FIRST uncached call stored and returned the SAME object. run_agent.py mutates self.tools (memory + LCM context-engine schemas) in-place, so the very first agent init in a Gateway process poisoned the cache, and every subsequent init appended LCM schemas again on top of the already-polluted list. - run_agent.py's context-engine injection (lcm_grep / lcm_describe / lcm_expand) had no dedup, unlike the memory-tools injection right above it which already skips already-present names. Fix (defense in depth, per the issue's suggested fix) - model_tools.get_tool_definitions: on the uncached branch, cache the computed list but return list(result) to the caller. Same pattern as the cache-hit path. - run_agent.py: build _existing_tool_names from self.tools and skip schemas whose names are already present, mirroring the memory-tools block. This also defends against plugin paths that may register the same schemas via ctx.register_tool(). Tests (tests/test_get_tool_definitions_cache_isolation.py) - test_first_uncached_call_returns_fresh_list \u2014 pins the fix; without it, first-call alias caused all the symptoms. - test_cache_hit_returns_fresh_list \u2014 pre-existing behavior stays. - test_caller_mutation_does_not_poison_cache \u2014 simulates run_agent appending lcm_grep / lcm_expand to the returned list and asserts the next call doesn't see them. - test_repeated_caller_mutation_does_not_accumulate \u2014 reproduces the long-lived Gateway accumulation pattern across 5 agent inits. - test_non_quiet_mode_does_not_use_cache \u2014 sanity, explains why TUI was fine. 5/5 pass on the new file; 23/23 still pass on tests/test_model_tools.py.
23 tasks
Contributor
|
Salvaged onto current |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #17335.
Problem
Long-lived Gateway processes (Feishu, etc.) were sending duplicate tool names to providers that enforce uniqueness:
Tool names must be unique.tools contains duplicate names: lcm_expandfunction name lcm_grep is duplicatedTUI was unaffected because TUI uses
quiet_mode=Falseand skips the cache.Root Cause (two layered bugs)
1.
model_tools.get_tool_definitions(quiet_mode=True)aliased the cached object on the first call.The cache-hit path (line 278) already returned
list(cached)— safe. But the first uncached call stored and returned the same object.run_agent.pythen mutatesself.toolsin-place (appending memory + LCM context-engine schemas), so the very first agent init in a Gateway process poisoned the cache, and every subsequent init appended LCM schemas again on top of the already-polluted list.2.
run_agent.py's context-engine injection had no dedup.Memory-tools injection (lines 1728–1748) already skips already-present names. The LCM injection right below it (lines 1986–1993) didn't. So even after fixing the cache, plugin paths that register schemas via
ctx.register_tool()could still produce duplicates.Fix (defense in depth, exactly as the issue suggested)
model_tools.py— on the uncached branch, cache the result but returnlist(result)to the caller, mirroring the cache-hit path:run_agent.py— build_existing_tool_namesfromself.toolsand skip already-present schemas, mirroring the memory-tools block above:Tests
New file
tests/test_get_tool_definitions_cache_isolation.py:test_first_uncached_call_returns_fresh_list— pins the fix; without it, the first-call alias is the entire bug.test_cache_hit_returns_fresh_list— pre-existing perf(tools): memoize get_tool_definitions + TTL-cache check_fn results #17098 behavior stays.test_caller_mutation_does_not_poison_cache— simulatesrun_agentappendinglcm_grep/lcm_expandto the returned list and asserts the next call doesn't see them.test_repeated_caller_mutation_does_not_accumulate— reproduces the long-lived Gateway accumulation across 5 agent inits.test_non_quiet_mode_does_not_use_cache— sanity, explains why TUI was unaffected.5/5 new tests pass; 23/23 existing
tests/test_model_tools.pystill pass.