fix(tools): isolate get_tool_definitions quiet_mode cache + dedup LCM injection (#17335) #17337

Closed
Sanjays2402 wants to merge 1 commit into NousResearch:main from Sanjays2402:fix/17335-quiet-mode-cache-pollution

@Sanjays2402

Contributor

Closes #17335.

Problem

Long-lived Gateway processes (Feishu, etc.) were sending duplicate tool names to providers that enforce uniqueness:

  • DeepSeek: Tool names must be unique.
  • Xiaomi MiMo: tools contains duplicate names: lcm_expand
  • Moonshot/Kimi: function name lcm_grep is duplicated

TUI was unaffected because TUI uses quiet_mode=False and skips the cache.

Root Cause (two layered bugs)

1. model_tools.get_tool_definitions(quiet_mode=True) aliased the cached object on the first call.
The cache-hit path (line 278) already returned list(cached) — safe. But the first uncached call stored and returned the same object. run_agent.py then mutates self.tools in-place (appending memory + LCM context-engine schemas), so the very first agent init in a Gateway process poisoned the cache, and every subsequent init appended LCM schemas again on top of the already-polluted list.

2. run_agent.py's context-engine injection had no dedup.
Memory-tools injection (lines 1728–1748) already skips already-present names. The LCM injection right below it (lines 1986–1993) didn't. So even after fixing the cache, plugin paths that register schemas via ctx.register_tool() could still produce duplicates.

Fix (defense in depth, exactly as the issue suggested)

model_tools.py — on the uncached branch, cache the result but return list(result) to the caller, mirroring the cache-hit path:

result = _compute_tool_definitions(...)
if quiet_mode:
    _tool_defs_cache[cache_key] = result
    return list(result)
return result
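
One property of `list(result)` worth keeping in mind: it is a shallow copy. It isolates the cache from appends and removals on the returned list, which is the mutation described in this PR, but the schema dicts inside are still shared. Whether any caller mutates those dicts in place is not stated here, so the following is only an illustration of the Python semantics, with made-up data:

```python
cached = [{"type": "function", "function": {"name": "read_file"}}]
returned = list(cached)  # shallow copy, as in the fix above

# Appending to the returned list does not reach the cached list.
returned.append({"type": "function", "function": {"name": "lcm_grep"}})
cached_len_after_append = len(cached)

# Mutating an element dict does, because both lists hold the same dict objects.
returned[0]["function"]["name"] = "renamed"
cached_name_after_edit = cached[0]["function"]["name"]

print(cached_len_after_append, cached_name_after_edit)
```

If in-place edits of individual schemas ever became a concern, `copy.deepcopy` would be the heavier alternative.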

run_agent.py — build _existing_tool_names from self.tools and skip already-present schemas, mirroring the memory-tools block above:

_existing_tool_names = {t.get("function", {}).get("name") for t in self.tools if isinstance(t, dict)}
for _schema in self.context_compressor.get_tool_schemas():
    _tname = _schema.get("name", "")
    if _tname and _tname in _existing_tool_names:
        continue
    ...
    _existing_tool_names.add(_tname)
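
The dedup loop above elides the append step. A self-contained sketch of the same pattern follows; `inject_schemas` is a hypothetical helper, and wrapping each schema as `{"type": "function", "function": schema}` is an assumption inferred from how `_existing_tool_names` is built, not something this description confirms.

```python
from typing import Any


def inject_schemas(
    tools: list[dict[str, Any]], schemas: list[dict[str, Any]]
) -> list[dict[str, Any]]:
    """Append schemas to tools, skipping names that are already present."""
    existing = {
        t.get("function", {}).get("name") for t in tools if isinstance(t, dict)
    }
    for schema in schemas:
        name = schema.get("name", "")
        if name and name in existing:
            continue  # already registered, e.g. by a plugin or an earlier pass
        # Assumed wrapper shape; the real append is elided in the snippet above.
        tools.append({"type": "function", "function": schema})
        existing.add(name)
    return tools


tools = [
    {"type": "function", "function": {"name": "read_file"}},
    {"type": "function", "function": {"name": "lcm_grep"}},
]
schemas = [{"name": "lcm_grep"}, {"name": "lcm_expand"}, {"name": "lcm_expand"}]
names = [t["function"]["name"] for t in inject_schemas(tools, schemas)]
print(names)
```

Adding each name to the set as it is appended also covers the case where the incoming schema list itself contains repeats.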

Tests

New file tests/test_get_tool_definitions_cache_isolation.py:

  • test_first_uncached_call_returns_fresh_list: pins the fix; without it, the first-call alias is the entire bug.
  • test_cache_hit_returns_fresh_list: the pre-existing behavior from #17098 (perf(tools): memoize get_tool_definitions + TTL-cache check_fn results) stays.
  • test_caller_mutation_does_not_poison_cache: simulates run_agent appending lcm_grep / lcm_expand to the returned list and asserts the next call doesn't see them.
  • test_repeated_caller_mutation_does_not_accumulate: reproduces the long-lived Gateway accumulation across 5 agent inits.
  • test_non_quiet_mode_does_not_use_cache: sanity, explains why TUI was unaffected.

$ python -m pytest tests/test_get_tool_definitions_cache_isolation.py tests/test_model_tools.py -q
............................                                            [100%]
28 passed in 0.78s

5/5 new tests pass; 23/23 existing tests/test_model_tools.py still pass.

… injection (NousResearch#17335)

Long-lived Gateway processes were sending duplicate tool names to
providers that enforce uniqueness:

  - DeepSeek:        'Tool names must be unique.'
  - Xiaomi MiMo:     'tools contains duplicate names: lcm_expand'
  - Moonshot/Kimi:   'function name lcm_grep is duplicated'

TUI was unaffected because TUI runs with quiet_mode=False and skips the
cache entirely.

Root cause (two layered bugs)
- model_tools.get_tool_definitions(quiet_mode=True) memoizes its result
  in _tool_defs_cache. The cache-hit path returned list(cached) (safe),
  but the FIRST uncached call stored and returned the SAME object.
  run_agent.py mutates self.tools (memory + LCM context-engine schemas)
  in-place, so the very first agent init in a Gateway process
  poisoned the cache, and every subsequent init appended LCM schemas
  again on top of the already-polluted list.
- run_agent.py's context-engine injection (lcm_grep / lcm_describe /
  lcm_expand) had no dedup, unlike the memory-tools injection right
  above it which already skips already-present names.

Fix (defense in depth, per the issue's suggested fix)
- model_tools.get_tool_definitions: on the uncached branch, cache the
  computed list but return list(result) to the caller. Same pattern as
  the cache-hit path.
- run_agent.py: build _existing_tool_names from self.tools and skip
  schemas whose names are already present, mirroring the memory-tools
  block. This also defends against plugin paths that may register the
  same schemas via ctx.register_tool().

Tests (tests/test_get_tool_definitions_cache_isolation.py)
- test_first_uncached_call_returns_fresh_list: pins the fix; without
  it, first-call alias caused all the symptoms.
- test_cache_hit_returns_fresh_list: pre-existing behavior stays.
- test_caller_mutation_does_not_poison_cache: simulates run_agent
  appending lcm_grep / lcm_expand to the returned list and asserts the
  next call doesn't see them.
- test_repeated_caller_mutation_does_not_accumulate: reproduces the
  long-lived Gateway accumulation pattern across 5 agent inits.
- test_non_quiet_mode_does_not_use_cache: sanity, explains why TUI
  was fine.

5/5 pass on the new file; 23/23 still pass on tests/test_model_tools.py.

@alt-glitch added the labels type/bug (Something isn't working), P1 (High: major feature broken, no workaround), comp/tools (Tool registry, model_tools, toolsets), and comp/agent (Core agent loop, run_agent.py, prompt builder) on Apr 29, 2026

@teknium1

Contributor

Salvaged onto current main via #17889 (merge commit e0fa2cf972). Your authorship is preserved on the commit. Thanks!

@teknium1 closed this on Apr 30, 2026

Development

Successfully merging this pull request may close these issues.

[Bug]: get_tool_definitions() quiet-mode cache pollution causes duplicate LCM tool schemas in Gateway
