feat(tools): progressive tool disclosure for MCP and plugin tools#31163
teknium1 wants to merge 2 commits into
Adds Tool Search, a structured-tools progressive-disclosure layer that
replaces MCP and non-core plugin tools in the model-visible tools array
with three bridge tools (tool_search / tool_describe / tool_call) when
the deferrable surface would consume more than a configurable percentage
of the active model's context window. Core Hermes tools are never deferred.
Default mode is 'auto' with a 10% context threshold, so small toolsets
pay no overhead. Set tools.tool_search.enabled to 'on' to force or 'off'
to disable.
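The threshold gate described above can be sketched roughly as follows. This is a minimal illustration, not the code in tools/tool_search.py: the `threshold_pct` key, the function names, and the behavior of 'on' with an empty tool list are assumptions made for the example. Only the three `enabled` modes, the 10% default, and the char/4 token heuristic come from the commit message.

```python
import json

# Hypothetical config shape; only 'enabled' and its three modes are named in the source.
DEFAULT_TOOL_SEARCH = {"enabled": "auto", "threshold_pct": 10}


def estimate_tokens(tool_defs):
    """Cheap char/4 estimate of the serialized tool schemas."""
    return len(json.dumps(tool_defs)) // 4


def should_defer(deferrable_defs, context_window, config=None):
    """Return True when deferrable tools should be swapped for the bridge tools."""
    config = config or DEFAULT_TOOL_SEARCH
    mode = config["enabled"]
    if mode == "off":
        return False
    if mode == "on":
        # Assumption: forcing the feature is a no-op when nothing is deferrable.
        return bool(deferrable_defs)
    # 'auto': defer only when the schemas exceed the configured share of context.
    budget = context_window * config["threshold_pct"] / 100
    return estimate_tokens(deferrable_defs) > budget
```

With this shape, a handful of small schemas stays under the budget and is passed through unchanged, which matches the "small toolsets pay no overhead" claim.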
Design carefully reflects the OpenClaw production failure modes
documented in the openclaw-tool-search-report:
- Core tools never defer (toolsets._HERMES_CORE_TOOLS). Addresses the
'tools silently missing from isolated cron turns' regression class
(openclaw#84141) by construction: there is no code path that can
drop a core tool.
- Catalog is stateless across turns — rebuilt from the live tool-defs
list on every assembly. No session-keyed Map that can drift out of
sync with the registry.
- tool_call unwraps the bridge call before any hook fires, so plugin
pre/post hooks, guardrails, approval flows, and the activity feed
all see the underlying tool name, not the bridge (addresses
openclaw#85588 and the verbose-mode complaint on openclaw#79823).
- The unwrap happens in both the parallel and sequential paths of
agent/tool_executor.py and also in handle_function_call, so direct
callers (sandboxed code, eval harnesses) are covered too.
- Bridge tools cannot invoke each other (recursion guard) and cannot
invoke core tools (those must be called directly).
- Tools mode only — no JS-sandbox code-mode. Keeps the surface small.
- Token estimation via cheap char/4 heuristic; precision isn't needed
for the threshold decision.
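A rough sketch of the unwrap and guard behavior listed above, under stated assumptions. The name `resolve_underlying_call` appears in the PR description, but its signature here, the `{"name": ..., "arguments": ...}` payload shape, the `BridgeCallError` type, and the `run_with_hooks` helper are hypothetical. The core-tool set is an illustrative subset, not the real toolsets._HERMES_CORE_TOOLS.

```python
BRIDGE_TOOLS = {"tool_search", "tool_describe", "tool_call"}
# Illustrative subset only; the real set lives in toolsets._HERMES_CORE_TOOLS.
CORE_TOOLS = {"terminal", "read_file", "write_file", "patch"}


class BridgeCallError(ValueError):
    """Raised when tool_call targets a tool it must not dispatch."""


def resolve_underlying_call(name, args):
    """Map a (possibly bridged) call to the tool that should actually run."""
    if name != "tool_call":
        return name, args
    target = args.get("name")
    if not target:
        raise BridgeCallError("tool_call requires a target tool name")
    if target in BRIDGE_TOOLS:
        # Recursion guard: bridge tools cannot invoke each other.
        raise BridgeCallError(f"{target} is a bridge tool and cannot be invoked via tool_call")
    if target in CORE_TOOLS:
        raise BridgeCallError(f"{target} is a core tool; call it directly")
    return target, args.get("arguments") or {}


def run_with_hooks(name, args, pre_hooks, dispatch):
    """Unwrap first, then fire hooks, so hooks see the underlying tool name."""
    real_name, real_args = resolve_underlying_call(name, args)
    for hook in pre_hooks:
        hook(real_name, real_args)
    return dispatch(real_name, real_args)
```

Because the unwrap runs before any hook, a guardrail written against `mcp_github_action_5` keeps working whether the model called that tool directly or through the bridge.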
Files:
- tools/tool_search.py — new module (BM25 retrieval, classification,
threshold gate, bridge dispatch, unwrap helper).
- tests/tools/test_tool_search.py — 35 tests including the OpenClaw
#84141 regression guard.
- model_tools.py — wires assembly into _compute_tool_definitions as the
final step, adds skip_tool_search_assembly kwarg so the bridge can
see the real catalog, dispatches the three bridge tools.
- agent/tool_executor.py — unwraps tool_call in both parallel and
sequential parsing loops so checkpointing, guardrails, plugin hooks,
and tool-progress callbacks all observe the underlying tool name.
- hermes_cli/config.py — DEFAULT_CONFIG['tools']['tool_search'] block.
- website/docs/user-guide/features/tool-search.md — user docs.
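For readers unfamiliar with the BM25 retrieval mentioned for tools/tool_search.py, here is a small self-contained sketch of BM25 ranking over tool names and descriptions. It is not the module's implementation: the tokenizer, the `k1` and `b` values, the `top_k` default, and the dict shape of a tool are all assumptions for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_search(query, tools, top_k=3, k1=1.5, b=0.75):
    """Rank tool dicts ({'name', 'description'}) against a query with BM25."""
    docs = [tokenize(f"{t['name']} {t['description']}") for t in tools]
    if not docs:
        return []
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: number of tools whose text contains each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scored = []
    for tool, doc in zip(tools, docs):
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            scored.append((score, tool["name"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:top_k]]
```

Splitting on underscores means a name such as `mcp_github_create_issue` contributes the terms `github` and `issue`, so keyword queries can match tool names as well as descriptions.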
Validation:
- 35/35 new tests pass.
- Existing tool/registry/model_tools/config/coercion/executor tests
(82 + 74 + small adjacents) green.
- Live E2E: 20 fake MCP tools registered, get_tool_definitions returns
3 bridges, tool_search returns top 3 hits, tool_describe returns
full schema, tool_call dispatches to the real underlying handler
and the underlying result is what the model sees.
- Reserved-name recursion guard verified live.
- Core-tool refusal via tool_call verified live.
🔎 Lint report:
| Rule | Count |
|---|---|
| PLW1514 | 1 |
First entries
scripts/tool_search_livetest.py:358: [PLW1514] `pathlib.Path(...).write_text` without explicit `encoding` argument
✅ Fixed issues: none
Unchanged: 0 pre-existing issues carried over.
ty (type checker)
Total: 9427 on HEAD, 9422 on base (🆕 +5)
🆕 New issues (5):
| Rule | Count |
|---|---|
| invalid-assignment | 3 |
| unresolved-import | 1 |
| invalid-argument-type | 1 |
First entries
scripts/tool_search_livetest.py:410: [invalid-assignment] invalid-assignment: Object of type `bound method ToolRegistry.dispatch(name: str, args: dict[Unknown, Unknown], **kwargs) -> str` is not assignable to attribute `dispatch` of type `def dispatch(self, name: str, args: dict[Unknown, Unknown], **kwargs) -> str`
tests/tools/test_tool_search.py:15: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
model_tools.py:840: [invalid-assignment] invalid-assignment: Object of type `None` is not assignable to `<module 'tools.tool_search'>`
scripts/tool_search_livetest.py:375: [invalid-assignment] invalid-assignment: Object of type `def logging_dispatch(name, args, **kw) -> Unknown` is not assignable to attribute `dispatch` of type `def dispatch(self, name: str, args: dict[Unknown, Unknown], **kwargs) -> str`
scripts/tool_search_livetest.py:387: [invalid-argument-type] invalid-argument-type: Argument to `AIAgent.__init__` is incorrect: Expected `list[str]`, found `None`
✅ Fixed issues: none
Unchanged: 4890 pre-existing issues carried over.
Diagnostics are surfaced as warnings — this check never fails the build.
Sharing a test report: we tried this PR as the inference layer for a small local model in a degraded-mode fallback lane. Posting in case the data is useful; not a review, just a write-up of what we did, the environment, the changes we needed on our end, and the results.

What we wanted to learn

Whether progressive tool disclosure as implemented here is a viable fit for our specific use case: a small (4B) instruction-tuned model running on consumer hardware behind an OpenAI-compatible llama.cpp endpoint, used as a degraded-mode fallback for an agent that normally runs on a large cloud model with ~23 tools enabled. This is a narrow fit-for-purpose test from one specific angle, not an evaluation of the PR against its design goals.

Environment

Changes we made on our side before the test was stable

Three modifications were necessary in our environment. Sharing them so the conditions of the test are clear.

1. llama.cpp Adding
2. Reduced coreset. We exposed roughly 10 tools to the model (memory, terminal, session_search, read_file, write_file, search_files, patch, todo, clarify, plus this PR's bridge tool) instead of our full ~23. This is a separate ongoing line of work on what a small-model-friendly coreset looks like; we tested both arms below using this reduced set as the "treatment" baseline and our full ~23-tool set as the "control" baseline.
3. Bridge-tool description tightening. We made small adjustments to the description text on the discovery tool to make the 4B model invoke it more reliably. Diff available if useful.

Test set

Six sequential prompts in a single session (chained via Step 5 referenced a phrase (

Results

Both arms used

Arm A: full coreset, this PR's mechanism disabled (our baseline)

Arm B: reduced coreset, this PR's mechanism enabled

Reference: same six prompts against our normal large-model agent (Claude Opus-class)

5/6 real tool executions. Step 5 returned 0 hits because the seeded phrase lives in the sandbox state.db, not the production one; the model correctly reported no result rather than fabricating one.

Token observations (informational)

Arm B wins on cold turns and on chat-heavy mixes. Arm B's per-turn cost grows when discovery results pile up in conversation history across many tool-using turns in close succession. Whether the net is positive depends on call mix; for our intended workload (chat-heavy with occasional tool use) Arm B is meaningfully cheaper.

Fit-for-our-purpose result

For the specific lane we're investigating (small local model behind an OpenAI-compatible llama.cpp endpoint as a degraded-mode fallback), this PR's mechanism produced a working agent in our test where our baseline configuration did not. We're going to keep iterating on this configuration internally and would not have a usable small-model lane without it. We're not claiming this generalises beyond our setup, and we make no claim about whether this is the use case the PR was designed for; we are just sharing the data in case it's useful as one additional point. Happy to share the test driver script, the bridge-tool description diff, or full session JSON dumps if any of that would help.

This comment was written by a Hermes Agent instance (Claude Opus class model running on the Hermes Agent stack) on behalf of its operator, who ran the experiment and reviewed the report before posting.
Adds a real-model live test for the tool_search feature. Spins up a real
AIAgent against Claude Haiku 4.5 via OpenRouter, registers 20 fake MCP
tools with realistic shapes, runs 5 scenarios twice each (tool_search ON
and OFF), and records the full transcript per run.
Captures both the bridge call sequence the model emitted (tool_search /
tool_describe / tool_call) and the underlying tool calls that actually
executed through the registry. Records iteration count, elapsed time,
and final response for an A/B comparison.
Scenarios cover:
A. Obvious single tool — direct keyword match
B. Vague paraphrased intent — stress retrieval quality
C. Multi-step chain — two deferred tools in sequence
D. Mixed core + deferred — verify core tools (read_file) get called
directly, not through tool_call
E. No tool needed — verify no spurious tool_search invocations
Baseline run included in scripts/out/ for reference. All 10 runs
(5 scenarios x 2 modes) pass — every expected underlying tool was
invoked, no core tool was incorrectly routed through tool_call, no
tool name was hallucinated.
Round-trip cost observed: tool_search enabled added +3 to +4 model
round trips per task vs disabled. Single-tool tasks completed in ~16-20s
vs ~10-11s direct. Multi-tool tasks ~20s vs ~14s. The bridge overhead
is real and measurable but the task completion rate is identical.
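The pass criteria stated above (every expected underlying tool invoked, no core tool routed through tool_call, no hallucinated tool name) could be checked against a recorded run along these lines. The record field names `underlying_calls` and `bridge_targets` are hypothetical; the actual transcript format written by scripts/tool_search_livetest.py is not shown in this PR page.

```python
def check_run(record, expected_tools, core_tools, known_tools):
    """Return a list of failure strings for one recorded run (empty means pass)."""
    failures = []
    executed = record["underlying_calls"]  # tool names that ran through the registry
    bridged = record["bridge_targets"]     # target names the model passed to tool_call
    for tool in expected_tools:
        if tool not in executed:
            failures.append(f"expected tool not invoked: {tool}")
    for tool in bridged:
        if tool in core_tools:
            failures.append(f"core tool routed through tool_call: {tool}")
        elif tool not in known_tools:
            failures.append(f"hallucinated tool name: {tool}")
    return failures
```

Returning a list of reasons rather than a boolean makes it easier to see which of the three criteria a failing scenario violated.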
Live test results

Ran a real-model end-to-end test against Claude Haiku 4.5 via OpenRouter. Five

10/10 runs passed. Every expected underlying tool was invoked. Zero

Side-by-side

Sample trace (Scenario A, ON)

Sample trace (Scenario D, ON): the safety guarantee in action

Note that

Observed costs

Confidence to ship

Behavior matches design. The bridge tools are usable by a real model without

Future work, separate from this PR:
    suffix = "enabled" if enabled else "disabled"
    out_path = out_dir / f"{scenario['id']}__{suffix}.json"
    out_path.write_text(json.dumps(record, indent=2, default=str))

    })

    summary_path = out_dir / "_summary.json"
    summary_path.write_text(json.dumps(summary, indent=2))
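The `write_text` calls in these snippets are the kind of call the PLW1514 lint entry earlier on this page flags. A minimal sketch of the usual remedy is to pass an explicit encoding; the `write_json` helper name is hypothetical and is not part of the PR.

```python
import json
from pathlib import Path


def write_json(path: Path, payload) -> None:
    """Serialize payload as JSON, passing an explicit encoding as PLW1514 asks."""
    path.write_text(json.dumps(payload, indent=2, default=str), encoding="utf-8")
```

Without the argument, `Path.write_text` falls back to the locale's default encoding, which can differ between machines.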
Merged via #34493 (rebased onto current main, your commit authorship preserved in

The salvage carried the feature forward and closed a toolset-scoping hole found in review: the bridge read its catalog from the global registry, so a restricted-toolset session (subagent / kanban worker / curated gateway session) could

Also: dropped the 11 checked-in
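The comment above is truncated, so what exactly a restricted session could do is not recoverable from this page. One plausible reading of "toolset-scoping" is that the searchable catalog should be built only from the toolsets enabled for the current session rather than from the global registry. The sketch below illustrates that idea under that assumption; the `build_catalog` name and the `toolset` key are hypothetical and do not come from #34493.

```python
def build_catalog(tool_defs, enabled_toolsets, core_tools):
    """Rebuild the searchable catalog from the live tool list on every call.

    tool_defs: iterable of dicts with hypothetical 'name' and 'toolset' keys.
    Only tools from toolsets enabled for this session are searchable, and
    core tools are excluded because they are never deferred.
    """
    allowed = set(enabled_toolsets)
    return {
        t["name"]: t
        for t in tool_defs
        if t["toolset"] in allowed and t["name"] not in core_tools
    }
```

Because nothing is cached between calls, this also fits the stateless-catalog design stated in the commit message: there is no per-session map that can drift from the registry.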
Summary
Adds Tool Search, a structured-tools progressive-disclosure layer that hides MCP and non-core plugin tools behind three bridge tools (tool_search / tool_describe / tool_call) when the deferrable surface would consume more than ~10% of the active model's context window. Core Hermes tools are never deferred.

Design choices are explicitly shaped by the published failure modes from OpenClaw's tool-search implementation (full research report in [my chat with Teknium that produced this PR]). Each of OpenClaw's open or recent reliability issues has a corresponding architectural defense or test here.

What it does

Default behavior (enabled: auto):
- tool_search to find what it needs, tool_describe for full schema, tool_call to invoke.

Changes

- tools/tool_search.py (new): module with classification, threshold gate, BM25 retrieval, bridge dispatch, unwrap helper.
- tests/tools/test_tool_search.py (new): 35 tests including a named regression guard for the OpenClaw cron-tool-loss class of bug (#84141 in their tracker).
- model_tools.py: wires assembly into _compute_tool_definitions as the final step; adds skip_tool_search_assembly kwarg so the bridge can read the real (pre-assembly) catalog; dispatches the three bridge tools.
- agent/tool_executor.py: unwraps tool_call in both the parallel and the sequential parsing paths so checkpointing, guardrails, plugin pre/post hooks, and the tool-progress callback observe the underlying tool name, not the bridge.
- hermes_cli/config.py: DEFAULT_CONFIG['tools']['tool_search'] block.
- website/docs/user-guide/features/tool-search.md: user docs.

No edits to cli.py, gateway/run.py, run_agent.py, or toolsets.py.

Reliability defenses by construction

These map 1:1 against the OpenClaw failure modes I documented:
- heartbeat_respond via tool_search)
- toolsets._HERMES_CORE_TOOLS: terminal, read_file, write_file, patch, search_files, todo, memory, browser_*, send_message, the messaging primitives. The model is never asked to search for them.
- Map that can drift out of sync with the registry. A dedicated TestRegression_OpenClawCron84141 class enforces this.
- tool_call entry on the assistant message is left untouched so transcripts and tool_call_id matching stay exactly as the model emitted them; unwrap is for hook/display only.
- tool_search invoked recursively
- resolve_underlying_call: tool_call cannot invoke any bridge tool.
- enabled: auto (default) with a 10% context threshold. Small toolsets pay zero overhead.

Test plan

Live E2E (isolated HERMES_HOME, real registry, 20 fake mcp-github tools):
- get_tool_definitions(enabled_toolsets=['mcp-github']) returns 3 bridge tools, no raw MCP schemas in output.
- tool_search("github issue") returns top 3 hits with total_available: 20.
- tool_describe("mcp_github_action_5") returns the full schema.
- tool_call("mcp_github_action_5", {repo: "foo/bar"}) dispatches and returns the underlying handler's {"ok": true}.
- tool_call("tool_call", {...}) rejected with recursion guard message.
- tool_call("terminal", {...}) rejected; model is told to call core tools directly.

What this PR explicitly does not include

- tool_search_code is a 1,500-line subprocess + permission-mode + IPC bridge. The three structured bridge tools deliver the same value with a tenth the surface area. Adding code mode is a future PR if there is demand.
- ~/.hermes/tool-search-catalog.json. The whole design assumes the catalog is cheap to rebuild and not worth caching across processes.
- defer_loading and OpenAI's hosted tool_search would let us push the work to the provider when available. Cleanest to add after we have benchmark data showing whether the generic path is good enough on its own.
- hermes tools UI changes. The feature is config-driven; the TUI doesn't need an entry yet.

Follow-up work

- auto default's 10% threshold is right or needs tuning per-model.