feat(tools): progressive tool disclosure for MCP and plugin tools by teknium1 · Pull Request #31163 · NousResearch/hermes-agent

teknium1 · 2026-05-23T22:22:53Z


Summary

Adds Tool Search, a structured-tools progressive-disclosure layer that hides MCP and non-core plugin tools behind three bridge tools (tool_search / tool_describe / tool_call) when the deferrable surface would consume more than ~10% of the active model's context window. Core Hermes tools are never deferred.

Design choices are explicitly shaped by the published failure modes from OpenClaw's tool-search implementation (full research report in [my chat with Teknium that produced this PR]). Each of OpenClaw's open or recent reliability issues has a corresponding architectural defense or test here.

What it does

Default behavior (enabled: auto):

Toolset is small or context is huge → no-op, tools array passes through unchanged.
Toolset is large (MCP servers, plugin tools push >10% of context) → MCP and plugin tools removed from the model-visible array, three bridge tools added. Model uses tool_search to find what it needs, tool_describe for full schema, tool_call to invoke.

tools:
  tool_search:
    enabled: auto       # auto | on | off
    threshold_pct: 10
    search_default_limit: 5
    max_search_limit: 20
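A minimal sketch of how the `auto` gate could be evaluated, assuming the char/4 token heuristic mentioned in the commit message. The helper names (`estimate_tokens`, `should_defer`) and the JSON serialization are illustrative assumptions, not the PR's actual API.

```python
import json


def estimate_tokens(tool_schemas):
    """Rough token estimate: serialized characters divided by 4 (heuristic)."""
    return sum(len(json.dumps(schema)) for schema in tool_schemas) // 4


def should_defer(deferrable_schemas, context_window, enabled="auto", threshold_pct=10):
    """Decide whether deferrable tools should be swapped for the bridge tools.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    if enabled == "off":
        return False
    if enabled == "on":
        return True
    # "auto": defer only when the deferrable surface exceeds the budget.
    budget = context_window * threshold_pct / 100
    return estimate_tokens(deferrable_schemas) > budget


small = [{"name": "mcp_demo_ping", "description": "Ping a server."}]
large = [{"name": f"mcp_demo_tool_{i}", "description": "x" * 2000} for i in range(20)]

print(should_defer(small, context_window=128_000))  # False
print(should_defer(large, context_window=65_536))   # True
```

The gate is a pure function of the current tool list and context size, so it can be recomputed on every assembly without keeping state.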

Changes

tools/tool_search.py (new) — module with classification, threshold gate, BM25 retrieval, bridge dispatch, unwrap helper.
tests/tools/test_tool_search.py (new) — 35 tests including a named regression guard for the OpenClaw cron-tool-loss class of bug (#84141 in their tracker).
model_tools.py — wires assembly into _compute_tool_definitions as the final step; adds skip_tool_search_assembly kwarg so the bridge can read the real (pre-assembly) catalog; dispatches the three bridge tools.
agent/tool_executor.py — unwraps tool_call in both the parallel and the sequential parsing paths so checkpointing, guardrails, plugin pre/post hooks, and the tool-progress callback observe the underlying tool name, not the bridge.
hermes_cli/config.py — DEFAULT_CONFIG['tools']['tool_search'] block.
website/docs/user-guide/features/tool-search.md — user docs.

No edits to cli.py, gateway/run.py, run_agent.py, or toolsets.py.
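For readers unfamiliar with BM25, here is a small self-contained sketch of the standard Okapi BM25 scoring formula applied to tool names and descriptions. The tokenizer, the `k1`/`b` defaults, and the function name `bm25_search` are assumptions for illustration; the PR's own retrieval code in tools/tool_search.py may differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumerics (so snake_case names split too)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_search(query, tools, limit=5, k1=1.5, b=0.75):
    """Rank tools (name -> description) against a query with Okapi BM25."""
    docs = {name: tokenize(f"{name} {desc}") for name, desc in tools.items()}
    n_docs = len(docs)
    avg_len = sum(len(toks) for toks in docs.values()) / max(n_docs, 1)

    # Document frequency: in how many tool documents each term appears.
    df = Counter()
    for toks in docs.values():
        df.update(set(toks))

    scores = {}
    for name, toks in docs.items():
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)[:limit]


tools = {
    "github_create_issue": "Create a new issue in a GitHub repository",
    "github_list_pulls": "List open pull requests",
    "slack_send_message": "Post a message to a Slack channel",
}
print(bm25_search("create github issue", tools))
# ['github_create_issue', 'github_list_pulls']
```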

Reliability defenses by construction

These map 1:1 against the OpenClaw failure modes I documented:

| OpenClaw failure | Our defense |
| --- | --- |
| Retrieval misses critical tools (their PR #85588 telling the model not to look for `heartbeat_respond` via `tool_search`) | Core tools never enter the catalog. Always-load list is `toolsets._HERMES_CORE_TOOLS`: `terminal`, `read_file`, `write_file`, `patch`, `search_files`, `todo`, `memory`, `browser_*`, `send_message`, the messaging primitives. The model is never asked to search for them. |
| Isolated cron turns drop the requested tool (their #84141, still open) | Catalog is stateless. It is rebuilt from the live tool-defs list on every assembly, with no session-keyed `Map` that can drift out of sync with the registry. A dedicated `TestRegression_OpenClawCron84141` class enforces this. |
| Transcript shape leaks the bridge to external consumers (PR #79823 review) | The `tool_call` entry on the assistant message is left untouched so transcripts and `tool_call_id` matching stay exactly as the model emitted them; unwrap is for hook/display only. |
| Verbose-mode display hides what's running (PR #79823 review by jalehman) | Unwrap fires before the tool-progress callback. The activity feed sees the underlying tool name and arguments, not the bridge. |
| `tool_search` invoked recursively | Hard recursion guard in `resolve_underlying_call`: `tool_call` cannot invoke any bridge tool. |
| Two-step indirection cost for small toolsets | `enabled: auto` (default) with a 10% context threshold. Small toolsets pay zero overhead. |
| Cost / token regressions on small surfaces | Threshold gate computed every assembly; below it, no swap, no bridge tools, no extra cost. |
| Trade-off math no one published | Documented in the feature page: extra round trips on cold tools, no static-cache benefit on deferred schemas, model-quality dependence, toolset-edit cache invalidation. We can't remove these; we make them visible. |
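The recursion guard and the core-tool refusal described in the table can be sketched roughly as follows. `resolve_underlying_call` is the name used in this PR, but the signature, the payload shape (`{"name": ..., "arguments": ...}`), the exception type, and the core-tool subset below are assumptions for illustration only.

```python
BRIDGE_TOOLS = frozenset({"tool_search", "tool_describe", "tool_call"})
# Illustrative subset of the always-loaded core tools named in the table.
CORE_TOOLS = frozenset(
    {"terminal", "read_file", "write_file", "patch", "search_files", "todo", "memory"}
)


class BridgeCallRejected(ValueError):
    """Raised when tool_call targets something it must not invoke (hypothetical)."""


def resolve_underlying_call(tool_name, arguments):
    """Return the (name, args) pair that hooks and displays should observe.

    Non-bridge calls pass through unchanged. The original assistant message
    is not modified here; only the returned pair is unwrapped.
    """
    if tool_name != "tool_call":
        return tool_name, arguments
    target = arguments.get("name")
    if not isinstance(target, str) or not target:
        raise BridgeCallRejected("tool_call requires a target tool name")
    if target in BRIDGE_TOOLS:
        raise BridgeCallRejected(f"recursion guard: tool_call cannot invoke {target!r}")
    if target in CORE_TOOLS:
        raise BridgeCallRejected(f"{target!r} is a core tool; call it directly")
    return target, arguments.get("arguments") or {}


call = {"name": "mcp_github_action_5", "arguments": {"repo": "foo/bar"}}
print(resolve_underlying_call("tool_call", call))
# ('mcp_github_action_5', {'repo': 'foo/bar'})
```

Because the function returns a new pair rather than mutating its input, the transcript entry and `tool_call_id` matching can stay exactly as the model emitted them.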

Test plan

scripts/run_tests.sh tests/tools/test_tool_search.py
  → 35/35 passing
scripts/run_tests.sh tests/test_model_tools.py tests/tools/test_registry.py \
    tests/test_toolsets.py tests/run_agent/test_tool_arg_coercion.py \
    tests/run_agent/test_tool_call_guardrail_runtime.py \
    tests/run_agent/test_tool_executor_contextvar_propagation.py \
    tests/hermes_cli/test_config.py
  → 8 files, all green, no regressions

Live E2E (isolated HERMES_HOME, real registry, 20 fake mcp-github tools):

get_tool_definitions(enabled_toolsets=['mcp-github']) returns 3 bridge tools, no raw MCP schemas in output.
tool_search("github issue") returns top 3 hits with total_available: 20.
tool_describe("mcp_github_action_5") returns the full schema.
tool_call("mcp_github_action_5", {repo: "foo/bar"}) dispatches and returns the underlying handler's {"ok": true}.
tool_call("tool_call", {...}) rejected with recursion guard message.
tool_call("terminal", {...}) rejected — model is told to call core tools directly.
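The first E2E check (only bridge tools visible, raw MCP schemas hidden) follows from an assembly step that rebuilds the catalog from the live tool list each time. A rough sketch, assuming OpenAI-style function tool definitions; `assemble`, `bridge_stub`, and the core-tool subset are hypothetical names, not the PR's code.

```python
BRIDGE_NAMES = ("tool_search", "tool_describe", "tool_call")
CORE_TOOLS = frozenset({"terminal", "read_file", "write_file"})  # illustrative subset


def make_tool(name, description=""):
    """Build an OpenAI-style function tool definition (assumed shape)."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {"type": "object", "properties": {}},
        },
    }


def bridge_stub(name):
    """Placeholder schema for a bridge tool."""
    return make_tool(name, f"{name} bridge (placeholder description)")


def assemble(tool_defs, defer):
    """Return (model_visible_tools, catalog), rebuilt from scratch on every call."""
    if not defer:
        return list(tool_defs), {}
    visible, catalog = [], {}
    for tool in tool_defs:
        name = tool["function"]["name"]
        if name in CORE_TOOLS:
            visible.append(tool)   # core tools are never deferred
        else:
            catalog[name] = tool   # hidden behind the bridge
    if catalog:
        visible.extend(bridge_stub(n) for n in BRIDGE_NAMES)
    return visible, catalog


fake_mcp = [make_tool(f"mcp_github_action_{i}") for i in range(20)]
visible, catalog = assemble(fake_mcp, defer=True)
print([t["function"]["name"] for t in visible])
# ['tool_search', 'tool_describe', 'tool_call']
print(len(catalog))  # 20
```

Nothing is cached between calls, so there is no per-session state that can fall out of sync with the registry.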

What this PR explicitly does not include

No JS sandbox / code mode. OpenClaw's tool_search_code is a 1,500-line subprocess + permission-mode + IPC bridge. The three structured bridge tools deliver the same value with a tenth of the surface area. Adding code mode is a future PR if there is demand.
No catalog persistence. No ~/.hermes/tool-search-catalog.json. The whole design assumes the catalog is cheap to rebuild and not worth caching across processes.
No provider-native paths. Anthropic's defer_loading and OpenAI's hosted tool_search would let us push the work to the provider when available. Cleanest to add after we have benchmark data showing whether the generic path is good enough on its own.
No hermes tools UI changes. The feature is config-driven; the TUI doesn't need an entry yet.

Follow-up work

Benchmark harness against the metrics laid out in the research PDF (token savings static + dynamic, cost per turn cached/uncached, latency including time-to-first-useful-action, BM25 retrieval Recall@K, accuracy with vs. without).
Once we have benchmark data, decide whether auto default's 10% threshold is right or needs tuning per-model.

Adds Tool Search, a structured-tools progressive-disclosure layer that replaces MCP and non-core plugin tools in the model-visible tools array with three bridge tools (tool_search / tool_describe / tool_call) when the deferrable surface would consume more than a configurable percentage of the active model's context window. Core Hermes tools are never deferred.

Default mode is 'auto' with a 10% context threshold, so small toolsets pay no overhead. Set tools.tool_search.enabled to 'on' to force or 'off' to disable.

Design carefully reflects the OpenClaw production failure modes documented in the openclaw-tool-search-report:

- Core tools never defer (toolsets._HERMES_CORE_TOOLS). Addresses the 'tools silently missing from isolated cron turns' regression class (openclaw#84141) by construction: there is no code path that can drop a core tool.
- Catalog is stateless across turns: rebuilt from the live tool-defs list on every assembly. No session-keyed Map that can drift out of sync with the registry.
- tool_call unwraps the bridge call before any hook fires, so plugin pre/post hooks, guardrails, approval flows, and the activity feed all see the underlying tool name, not the bridge (addresses openclaw#85588 and the verbose-mode complaint on openclaw#79823).
- The unwrap happens in both the parallel and sequential paths of agent/tool_executor.py and also in handle_function_call, so direct callers (sandboxed code, eval harnesses) are covered too.
- Bridge tools cannot invoke each other (recursion guard) and cannot invoke core tools (those must be called directly).
- Tools mode only: no JS-sandbox code-mode. Keeps the surface small.
- Token estimation via cheap char/4 heuristic; precision isn't needed for the threshold decision.

Files:

- tools/tool_search.py: new module (BM25 retrieval, classification, threshold gate, bridge dispatch, unwrap helper).
- tests/tools/test_tool_search.py: 35 tests including the OpenClaw #84141 regression guard.
- model_tools.py: wires assembly into _compute_tool_definitions as the final step, adds skip_tool_search_assembly kwarg so the bridge can see the real catalog, dispatches the three bridge tools.
- agent/tool_executor.py: unwraps tool_call in both parallel and sequential parsing loops so checkpointing, guardrails, plugin hooks, and tool-progress callbacks all observe the underlying tool name.
- hermes_cli/config.py: DEFAULT_CONFIG['tools']['tool_search'] block.
- website/docs/user-guide/features/tool-search.md: user docs.

Validation:

- 35/35 new tests pass.
- Existing tool/registry/model_tools/config/coercion/executor tests (82 + 74 + small adjacents) green.
- Live E2E: 20 fake MCP tools registered, get_tool_definitions returns 3 bridges, tool_search returns top 3 hits, tool_describe returns full schema, tool_call dispatches to the real underlying handler and the underlying result is what the model sees.
- Reserved-name recursion guard verified live.
- Core-tool refusal via tool_call verified live.

github-actions · 2026-05-23T22:23:29Z

🔎 Lint report: `hermes/hermes-2b79b6da` vs `origin/main`

ruff

Total: 1 on HEAD, 0 on base (🆕 +1)

🆕 New issues (1):

| Rule | Count |
| --- | --- |
| `PLW1514` | 1 |

First entries

scripts/tool_search_livetest.py:358: [PLW1514] `pathlib.Path(...).write_text` without explicit `encoding` argument

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.
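PLW1514 flags text writes that rely on the platform's default encoding. A minimal sketch of the usual remedy, passing `encoding` explicitly on both write and read; the `record` payload and file name are placeholders, not the harness's real data.

```python
import json
import tempfile
from pathlib import Path

# Placeholder payload with non-ASCII characters to make the encoding matter.
record = {"scenario": "A", "note": "naïve café → ok"}

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "example.json"
    # Explicit encoding keeps the output identical across platforms.
    out_path.write_text(
        json.dumps(record, indent=2, ensure_ascii=False),
        encoding="utf-8",
    )
    restored = json.loads(out_path.read_text(encoding="utf-8"))

print(restored == record)  # True
```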

ty (type checker)

Total: 9427 on HEAD, 9422 on base (🆕 +5)

🆕 New issues (5):

| Rule | Count |
| --- | --- |
| `invalid-assignment` | 3 |
| `unresolved-import` | 1 |
| `invalid-argument-type` | 1 |

First entries

scripts/tool_search_livetest.py:410: [invalid-assignment] invalid-assignment: Object of type `bound method ToolRegistry.dispatch(name: str, args: dict[Unknown, Unknown], **kwargs) -> str` is not assignable to attribute `dispatch` of type `def dispatch(self, name: str, args: dict[Unknown, Unknown], **kwargs) -> str`
tests/tools/test_tool_search.py:15: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
model_tools.py:840: [invalid-assignment] invalid-assignment: Object of type `None` is not assignable to `<module 'tools.tool_search'>`
scripts/tool_search_livetest.py:375: [invalid-assignment] invalid-assignment: Object of type `def logging_dispatch(name, args, **kw) -> Unknown` is not assignable to attribute `dispatch` of type `def dispatch(self, name: str, args: dict[Unknown, Unknown], **kwargs) -> str`
scripts/tool_search_livetest.py:387: [invalid-argument-type] invalid-argument-type: Argument to `AIAgent.__init__` is incorrect: Expected `list[str]`, found `None`

✅ Fixed issues: none

Unchanged: 4890 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

pleite · 2026-05-28T15:07:40Z

Sharing a test report — we tried this PR as the inference layer for a small local model in a degraded-mode fallback lane. Posting in case the data is useful; not a review, just a write-up of what we did, the environment, the changes we needed on our end, and the results.

What we wanted to learn

Whether progressive tool disclosure as implemented here is a viable fit for our specific use case: a small (4B) instruction-tuned model running on consumer hardware behind an OpenAI-compatible llama.cpp endpoint, used as a degraded-mode fallback for an agent that normally runs on a large cloud model with ~23 tools enabled.

This is a narrow fit-for-purpose test from one specific angle, not an evaluation of the PR against its design goals.

Environment

| Layer | Detail |
| --- | --- |
| Hermes | Sandbox container built from this PR branch (`hermes-pr31163`), separate state.db from production |
| Model | Qwen3-4B-Instruct-2507, Q4_K_M GGUF |
| Inference | llama.cpp `b9360` on an M1 MacBook Air (8 GB), OpenAI-compat server at `:8081` |
| Context | `--ctx-size 65536 --cache-type-k q4_0 --cache-type-v q4_0 --parallel 1 --cont-batching --mlock` |
| Test driver | Hermes `api_server` at `/v1/chat/completions` with shared `X-Hermes-Session-Id` header across 6 sequential prompts |

Changes we made on our side before the test was stable

Three modifications were necessary in our environment. Sharing them so the conditions of the test are clear.

1. llama.cpp --predict 4096 cap. Without it, our 4B model occasionally dropped out of structured tool emission mid-loop and began producing natural-language planning monologue with no stop condition. One observed run reached 9,335 generated tokens in a single assistant turn. Client-side HTTP timeout from the Hermes side did not propagate a cancel to llama.cpp, so the slot stayed busy and subsequent requests failed with Connection error until the slot drained.

Adding --predict 4096 to llama-server caps a runaway at ~4 min wall and frees the slot cleanly with finish_reason=length. Legitimate longest answers in our test set were ~700 tokens, so the cap is non-disruptive. This is a llama.cpp-side change, not a PR concern — flagging because it was a prerequisite for getting clean test data.

2. Reduced coreset. We exposed roughly 10 tools to the model (memory, terminal, session_search, read_file, write_file, search_files, patch, todo, clarify, plus this PR's bridge tool) instead of our full ~23. This is a separate ongoing line of work on what a small-model-friendly coreset looks like; we tested both arms below using this reduced set as the "treatment" baseline and our full ~23-tool set as the "control" baseline.

3. Bridge-tool description tightening. We made small adjustments to the description text on the discovery tool to make the 4B model invoke it more reliably. Diff available if useful.

Test set

Six sequential prompts in a single session (chained via X-Hermes-Session-Id), covering: pure chat, deferred-memory write, recall, two-tools-mixed (terminal + memory), two-tools-chained (search + memory), and a summary turn that exposes hallucination if earlier tools didn't actually execute.

Step 5 referenced a phrase (HELLO_AB4_BASELINE) that had been seeded into the sandbox state.db in an earlier session, so a true-positive answer exists and can be distinguished from a hallucinated one.

Results

Both arms used --predict 4096 and were run after warm KV. Verification of "did a tool actually execute" was by direct inspection of state.db (tool_calls rows) rather than HTTP status, because in one arm we saw the model emitting tool-call-shaped JSON that arrived as plain content and was reported as 200 OK.
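A sketch of the kind of database check described above: count assistant rows that carry real tool calls and rows holding tool results, instead of trusting the HTTP status. The `messages` table and its `role` / `tool_calls` columns are assumptions made for this example; the real state.db schema may differ.

```python
import sqlite3


def count_tool_executions(conn):
    """Return (rows with non-NULL tool_calls, rows with role='tool').

    Assumes a `messages` table with `role` and `tool_calls` columns.
    """
    calls = conn.execute(
        "SELECT COUNT(*) FROM messages WHERE tool_calls IS NOT NULL"
    ).fetchone()[0]
    results = conn.execute(
        "SELECT COUNT(*) FROM messages WHERE role = 'tool'"
    ).fetchone()[0]
    return calls, results


# In-memory stand-in for state.db so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (role TEXT, content TEXT, tool_calls TEXT)")
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [
        # Tool-call-shaped JSON that arrived as plain content: not an execution.
        ("assistant", '{"name": "memory", "arguments": {}}', None),
        # A structured tool call and its result.
        ("assistant", None, '[{"name": "memory"}]'),
        ("tool", '{"ok": true}', None),
    ],
)
print(count_tool_executions(conn))  # (1, 1)
```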

Arm A — full coreset, this PR's mechanism disabled (our baseline)

| Step | wall | prompt tok | actually executed? |
| --- | --- | --- | --- |
| 1 chat | 1.3 s | 15,060 | n/a (wrong answer: 581 vs correct 391) |
| 2 memory.add | 4.4 s | 15,095 | no: JSON arrived as content |
| 3 recall | 4.3 s | 15,136 | no: model claimed save that didn't happen |
| 4 terminal+memory | 4.0 s | 15,188 | no |
| 5 search+memory | 4.3 s | 15,241 | no |
| 6 summary | 6.2 s | 15,285 | no: summary hallucinated prior tool results |

state.db shows 0 rows with non-NULL tool_calls for this arm.

Arm B — reduced coreset, this PR's mechanism enabled

| Step | wall | prompt tok | actually executed? |
| --- | --- | --- | --- |
| 1 chat | 1.3 s | 7,214 | n/a (correct: 391) |
| 2 memory.add | 29.6 s | 32,138 | yes: verified in state.db |
| 3 recall | 4.4 s | 8,592 | correct recall |
| 4 terminal+memory | 18.3 s | 26,229 | yes: both calls executed |
| 5 search+memory | 100.9 s | 55,113 | yes: returned the seeded session ID, verified true-positive |
| 6 summary | 9.9 s | 13,124 | accurate to actual events |

state.db shows real tool_calls rows and corresponding role=tool results.

Reference — same six prompts against our normal large-model agent (Claude Opus-class)

5/6 real tool executions. Step 5 returned 0 hits because the seeded phrase lives in the sandbox state.db, not the production one — model correctly reported no result rather than fabricating one.

Token observations (informational)

| | Arm A baseline | Arm B with this PR |
| --- | --- | --- |
| Prompt overhead per non-tool turn | ~15,000 tok flat | ~7,200–8,600 tok |
| Peak prompt in our test | 15,285 tok | 55,113 tok (step 5 discovery results accumulated in history) |

Arm B wins on cold turns and on chat-heavy mixes. Arm B's per-turn cost grows when discovery results pile up in conversation history across many tool-using turns in close succession. Whether the net is positive depends on call mix; for our intended workload (chat-heavy with occasional tool use) Arm B is meaningfully cheaper.
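To make the call-mix dependence concrete, the prompt-token columns from the two results tables can be totalled directly. This is only arithmetic on the reported numbers and assumes the "prompt tok" column is comparable across arms; note that Arm A executed no tools, so the totals are not a like-for-like cost comparison.

```python
# Prompt tokens per step, copied from the Arm A and Arm B results tables.
arm_a = [15_060, 15_095, 15_136, 15_188, 15_241, 15_285]
arm_b = [7_214, 32_138, 8_592, 26_229, 55_113, 13_124]

total_a, total_b = sum(arm_a), sum(arm_b)
per_step_delta = [b - a for a, b in zip(arm_a, arm_b)]
# Steps (1-indexed) where Arm B used fewer prompt tokens than Arm A.
cheaper_steps = [i + 1 for i, delta in enumerate(per_step_delta) if delta < 0]

print(total_a, total_b)  # 91005 142410
print(cheaper_steps)     # [1, 3, 6]
```

In this six-prompt session Arm B is cheaper on the chat, recall, and summary steps and more expensive on the three tool-using steps, which is consistent with the observation that the net depends on how chat-heavy the workload is.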

Fit-for-our-purpose result

For the specific lane we're investigating (small local model behind an OpenAI-compatible llama.cpp endpoint as a degraded-mode fallback), this PR's mechanism produced a working agent in our test where our baseline configuration did not. We're going to keep iterating on this configuration internally and would not have a usable small-model lane without it.

We're not claiming this generalises beyond our setup, and we make no claim about whether this is the use case the PR was designed for — just sharing the data in case it's useful as one additional point.

Happy to share the test driver script, the bridge-tool description diff, or full session JSON dumps if any of that would help.

This comment was written by a Hermes Agent instance (Claude Opus class model running on the Hermes Agent stack) on behalf of its operator, who ran the experiment and reviewed the report before posting.

Adds a real-model live test for the tool_search feature. Spins up a real AIAgent against Claude Haiku 4.5 via OpenRouter, registers 20 fake MCP tools with realistic shapes, runs 5 scenarios twice each (tool_search ON and OFF), and records the full transcript per run.

Captures both the bridge call sequence the model emitted (tool_search / tool_describe / tool_call) and the underlying tool calls that actually executed through the registry. Records iteration count, elapsed time, and final response for an A/B comparison.

Scenarios cover:

A. Obvious single tool: direct keyword match
B. Vague paraphrased intent: stress retrieval quality
C. Multi-step chain: two deferred tools in sequence
D. Mixed core + deferred: verify core tools (read_file) get called directly, not through tool_call
E. No tool needed: verify no spurious tool_search invocations

Baseline run included in scripts/out/ for reference. All 10 runs (5 scenarios x 2 modes) pass: every expected underlying tool was invoked, no core tool was incorrectly routed through tool_call, no tool name was hallucinated.

Round-trip cost observed: tool_search enabled added +3 to +4 model round trips per task vs disabled. Single-tool tasks completed in ~16-20s vs ~10-11s direct. Multi-tool tasks ~20s vs ~14s. The bridge overhead is real and measurable but the task completion rate is identical.

teknium1 · 2026-05-29T06:36:50Z

Live test results

Ran a real-model end-to-end test against Claude Haiku 4.5 via OpenRouter. Five
scenarios, each run twice (tool_search ON and OFF), with 20 fake MCP tools
registered. Harness and transcripts in scripts/tool_search_livetest.py,
scripts/analyze_livetest.py, and scripts/out/.

10/10 runs passed. Every expected underlying tool was invoked. Zero
hallucinated tool names. Zero attempts to route a core tool through tool_call.
Display unwrap working in the CLI activity feed.

Side-by-side

| Scenario | ON: bridges + underlying / iters / elapsed | OFF: underlying / iters / elapsed | Δ round-trips |
| --- | --- | --- | --- |
| A obvious_single | 3 + 1 / 4 / 18.5s | 1 / 2 / 9.7s | +3 |
| B vague_paraphrased | 3 + 1 / 4 / 15.6s | 1 / 2 / 11.3s | +3 |
| C multi_tool_chain | 4 + 2 / 4 / 20.3s | 2 / 3 / 14.1s | +4 |
| D core_plus_deferred | 3 + 2 / 5 / 33.1s | 2 / 3 / 9.8s | +3 |
| E no_tool_needed | 0 + 0 / 1 / 8.2s | 0 / 1 / 2.8s | 0 |

Sample trace (Scenario A, ON)

bridges:    tool_search('create github issue')
         →  tool_describe(github_create_issue)
         →  tool_call → github_create_issue
underlying: github_create_issue

Sample trace (Scenario D, ON) — the safety guarantee in action

underlying: read_file → slack_send_message
bridges:    tool_search('post message Slack channel')
         →  tool_describe(slack_send_message)
         →  tool_call → slack_send_message

Note that read_file was called directly, not through tool_call. The
model correctly identified it as a core tool already in the visible tools
array and skipped the bridge for it. This is the safety invariant the report
flagged and that the implementation enforces by construction.

Observed costs

ON adds +3 to +4 model round trips per task with deferred tools
Single-tool tasks: ~16-20s vs ~10-11s direct (~2× wall time)
Multi-tool chains: ~20s vs ~14s (~1.4× wall time)
Pure-knowledge prompts: 0 extra round trips (no spurious tool_search)
Token savings on the static side are real and measurable; the cost is paid
in latency and additional round trips on cold-cache tool invocations

Confidence to ship

Behavior matches design. The bridge tools are usable by a real model without
prompt-engineering tricks. The auto threshold (default 10% of context)
means small toolsets pay no overhead — the +3 round trip tax only applies
when the deferrable surface is large enough to justify it. Recommended:
ship with enabled: auto as the default (already the case in this PR).

Future work, separate from this PR:

A2/A3 prompts to test smaller models (Qwen, GPT-5.2 nano) — Haiku 4.5
is a strong model and may not surface retrieval-quality failures that
weaker models would hit
Larger toolset (50+ deferred tools) to stress retrieval ranking
Cost measurements with real cached/uncached pricing data

+
+    suffix = "enabled" if enabled else "disabled"
+    out_path = out_dir / f"{scenario['id']}__{suffix}.json"
+    out_path.write_text(json.dumps(record, indent=2, default=str))


+            })
+
+    summary_path = out_dir / "_summary.json"
+    summary_path.write_text(json.dumps(summary, indent=2))


teknium1 · 2026-05-29T09:04:43Z

Merged via #34493 (rebased onto current main, your commit authorship preserved in git log).

The salvage carried the feature forward and closed a toolset-scoping hole found in review: the bridge read its catalog from the global registry, so a restricted-toolset session (subagent / kanban worker / curated gateway session) could tool_search the whole process registry and tool_call any plugin/MCP tool it was never granted. Now scoped to the session's own toolsets, with a defense-in-depth gate in both the bridge dispatch and the executor unwrap.
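The scoping fix can be pictured as filtering the catalog by the session's granted toolsets and then re-checking at dispatch time. This is a hedged sketch only: the registry shape (tool name mapped to owning toolset) and the function names `scoped_catalog` and `dispatch_bridge_call` are hypothetical, not taken from #34493.

```python
def scoped_catalog(registry, session_toolsets):
    """Keep only tools whose owning toolset was granted to this session.

    `registry` maps tool name -> toolset name (assumed shape).
    """
    allowed = set(session_toolsets)
    return {name: toolset for name, toolset in registry.items() if toolset in allowed}


def dispatch_bridge_call(target, registry, session_toolsets):
    """Defense-in-depth gate: re-check scope at dispatch, not only at search."""
    if target not in scoped_catalog(registry, session_toolsets):
        raise PermissionError(f"{target!r} is not available in this session")
    return f"dispatched {target}"  # stand-in for the real handler call


registry = {
    "mcp_github_create_issue": "mcp-github",
    "mcp_slack_send_message": "mcp-slack",
}
session = ["mcp-github"]  # e.g. a restricted subagent session

print(sorted(scoped_catalog(registry, session)))
# ['mcp_github_create_issue']
print(dispatch_bridge_call("mcp_github_create_issue", registry, session))
# dispatched mcp_github_create_issue
```

Checking scope in both places means a tool name guessed by the model, rather than found via tool_search, is still rejected.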

Also: dropped the 11 checked-in scripts/out/*.json transcripts (kept the harness, gitignored the output dir), routed the harness's key loading through load_hermes_dotenv, added _redact_secrets() over transcript/console output, and encoding="utf-8" on all file I/O.

On main as a87f0a8 (+ 369075d, 7427b9d, 1709776, 18c9e89).

alt-glitch added type/feature New feature or request comp/tools Tool registry, model_tools, toolsets comp/agent Core agent loop, run_agent.py, prompt builder tool/mcp MCP client and OAuth area/config Config system, migrations, profiles P2 Medium — degraded but workaround exists labels May 23, 2026

github-advanced-security AI found potential problems May 29, 2026


teknium1 mentioned this pull request May 29, 2026

feat(tools): progressive tool disclosure for MCP and plugin tools (scoped) #34493

Merged

teknium1 closed this May 29, 2026

gal-checksum mentioned this pull request Jun 2, 2026

[codex] docs: add Tool Search to sidebar #37512

Closed

gal064 mentioned this pull request Jun 2, 2026

[codex] docs: add Tool Search to sidebar #37514

Open

teknium1 wants to merge 2 commits into main from hermes/hermes-2b79b6da

4 participants
