feat(tools): progressive tool disclosure for MCP and plugin tools #31163

Closed
teknium1 wants to merge 2 commits into main from hermes/hermes-2b79b6da

teknium1 (Contributor) commented May 23, 2026

Infographic

Tool Search infographic

Summary

Adds Tool Search, a structured-tools progressive-disclosure layer that hides MCP and non-core plugin tools behind three bridge tools (tool_search / tool_describe / tool_call) when the deferrable surface would consume more than ~10% of the active model's context window. Core Hermes tools are never deferred.

Design choices are explicitly shaped by the published failure modes from OpenClaw's tool-search implementation (full research report in [my chat with Teknium that produced this PR]). Each of OpenClaw's open or recent reliability issues has a corresponding architectural defense or test here.

What it does

Default behavior (enabled: auto):

  • Toolset is small or context is huge → no-op, tools array passes through unchanged.
  • Toolset is large (MCP servers, plugin tools push >10% of context) → MCP and plugin tools removed from the model-visible array, three bridge tools added. Model uses tool_search to find what it needs, tool_describe for full schema, tool_call to invoke.

tools:
  tool_search:
    enabled: auto       # auto | on | off
    threshold_pct: 10
    search_default_limit: 5
    max_search_limit: 20
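
A minimal sketch of how the auto gate could decide, assuming the char/4 token estimate mentioned in the commit message. The function names `estimate_tokens` and `should_defer`, and the way the context length is passed in, are illustrative assumptions and not the code in tools/tool_search.py.

```python
import json

def estimate_tokens(tool_defs):
    """Rough token count: serialized schema length divided by 4."""
    return sum(len(json.dumps(t)) for t in tool_defs) // 4

def should_defer(deferrable_defs, context_length, enabled="auto", threshold_pct=10):
    """Return True when deferrable tools should be swapped for the bridge tools."""
    if enabled == "off" or not deferrable_defs:
        return False
    if enabled == "on":
        return True
    # auto: defer only when the deferrable surface exceeds the threshold
    budget = context_length * threshold_pct / 100
    return estimate_tokens(deferrable_defs) > budget

# 20 hypothetical MCP tools with ~2,000-character descriptions each
tools = [{"name": f"mcp_tool_{i}", "description": "x" * 2000} for i in range(20)]
small_ctx = should_defer(tools, context_length=32_000)     # over 10% of context
large_ctx = should_defer(tools, context_length=1_000_000)  # well under 10%
```

With these made-up sizes the same toolset is deferred on the 32k window and passed through unchanged on the 1M window, which is the behavior the two bullets above describe.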

Changes

  • tools/tool_search.py (new) — module with classification, threshold gate, BM25 retrieval, bridge dispatch, unwrap helper.
  • tests/tools/test_tool_search.py (new) — 35 tests including a named regression guard for the OpenClaw cron-tool-loss class of bug (#84141 in their tracker).
  • model_tools.py — wires assembly into _compute_tool_definitions as the final step; adds skip_tool_search_assembly kwarg so the bridge can read the real (pre-assembly) catalog; dispatches the three bridge tools.
  • agent/tool_executor.py — unwraps tool_call in both the parallel and the sequential parsing paths so checkpointing, guardrails, plugin pre/post hooks, and the tool-progress callback observe the underlying tool name, not the bridge.
  • hermes_cli/config.py: DEFAULT_CONFIG['tools']['tool_search'] block.
  • website/docs/user-guide/features/tool-search.md — user docs.

No edits to cli.py, gateway/run.py, run_agent.py, or toolsets.py.
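
For readers unfamiliar with BM25, the retrieval step named in the file list can be sketched as below. This is a generic Okapi BM25 ranking over tool names and descriptions, written from the textbook formula; the tokenizer, the `k1`/`b` defaults, and the sample catalog are assumptions and may differ from what tools/tool_search.py does.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on non-alphanumerics, so github_create_issue gives 3 tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def bm25_search(query, tools, limit=5, k1=1.5, b=0.75):
    """Rank tools by Okapi BM25 over name + description; return the top `limit` names."""
    if not tools:
        return []
    docs = [tokenize(t["name"] + " " + t["description"]) for t in tools]
    avg_len = sum(len(d) for d in docs) / len(docs)
    df = Counter(term for d in docs for term in set(d))  # document frequency
    n = len(docs)
    scored = []
    for tool, doc in zip(tools, docs):
        tf = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            scored.append((score, tool["name"]))
    scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, name breaks ties
    return [name for _, name in scored[:limit]]

# Hypothetical three-tool catalog
catalog = [
    {"name": "github_create_issue", "description": "Open a new issue in a repository"},
    {"name": "github_list_pulls", "description": "List pull requests for a repository"},
    {"name": "slack_send_message", "description": "Post a message to a Slack channel"},
]
hits = bm25_search("create github issue", catalog)
```

Tools that share no term with the query score zero and are dropped, so the Slack tool never appears in `hits` for this query.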

Reliability defenses by construction

These map 1:1 against the OpenClaw failure modes I documented:

| OpenClaw failure | Our defense |
| --- | --- |
| Retrieval misses critical tools (their PR #85588 telling the model not to look for heartbeat_respond via tool_search) | Core tools never enter the catalog. Always-load list is toolsets._HERMES_CORE_TOOLS: terminal, read_file, write_file, patch, search_files, todo, memory, browser_*, send_message, the messaging primitives. The model is never asked to search for them. |
| Isolated cron turns drop the requested tool (their #84141, still open) | Catalog is stateless. It is rebuilt from the live tool-defs list on every assembly — no session-keyed Map that can drift out of sync with the registry. A dedicated TestRegression_OpenClawCron84141 class enforces this. |
| Transcript shape leaks the bridge to external consumers (PR #79823 review) | The tool_call entry on the assistant message is left untouched so transcripts and tool_call_id matching stay exactly as the model emitted them; unwrap is for hook/display only. |
| Verbose-mode display hides what's running (PR #79823 review by jalehman) | Unwrap fires before the tool-progress callback. The activity feed sees the underlying tool name and arguments, not the bridge. |
| tool_search invoked recursively | Hard recursion guard in resolve_underlying_call: tool_call cannot invoke any bridge tool. |
| Two-step indirection cost for small toolsets | enabled: auto (default) with a 10% context threshold. Small toolsets pay zero overhead. |
| Cost / token regressions on small surfaces | Threshold gate computed every assembly; below it, no swap, no bridge tools, no extra cost. |
| Trade-off math no one published | Documented in the feature page: extra round trips on cold tools, no static-cache benefit on deferred schemas, model-quality dependence, toolset-edit cache invalidation. We can't remove these; we make them visible. |
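
The recursion guard and the core-tool refusal could look roughly like the sketch below. The name `resolve_underlying_call` comes from this PR, but the signature, the `{"name": ..., "arguments": ...}` argument shape, and the abbreviated core-tool set are assumptions made for illustration.

```python
BRIDGE_TOOLS = {"tool_search", "tool_describe", "tool_call"}
CORE_TOOLS = {"terminal", "read_file", "write_file", "patch"}  # abbreviated, illustrative

def resolve_underlying_call(name, args, catalog):
    """Map a model-emitted call to the tool that should actually run.

    Returns (tool_name, tool_args). Raises ValueError for calls the bridge refuses.
    The caller keeps the original assistant message untouched; only hooks and
    display use the resolved name.
    """
    if name != "tool_call":
        return name, args  # direct calls pass through unchanged
    target = args.get("name")  # assumed argument shape
    if target in BRIDGE_TOOLS:
        raise ValueError(f"{target} is a bridge tool and cannot be invoked via tool_call")
    if target in CORE_TOOLS:
        raise ValueError(f"{target} is a core tool; call it directly")
    if target not in catalog:
        raise ValueError(f"unknown deferred tool: {target}")
    return target, args.get("arguments", {})

resolved = resolve_underlying_call(
    "tool_call",
    {"name": "mcp_github_action_5", "arguments": {"repo": "foo/bar"}},
    catalog={"mcp_github_action_5"},
)
```

Running the checks in this order means a bridge or core name is refused even if it somehow appears in the catalog.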

Test plan

scripts/run_tests.sh tests/tools/test_tool_search.py
  → 35/35 passing
scripts/run_tests.sh tests/test_model_tools.py tests/tools/test_registry.py \
    tests/test_toolsets.py tests/run_agent/test_tool_arg_coercion.py \
    tests/run_agent/test_tool_call_guardrail_runtime.py \
    tests/run_agent/test_tool_executor_contextvar_propagation.py \
    tests/hermes_cli/test_config.py
  → 8 files, all green, no regressions

Live E2E (isolated HERMES_HOME, real registry, 20 fake mcp-github tools):

  • get_tool_definitions(enabled_toolsets=['mcp-github']) returns 3 bridge tools, no raw MCP schemas in output.
  • tool_search("github issue") returns top 3 hits with total_available: 20.
  • tool_describe("mcp_github_action_5") returns the full schema.
  • tool_call("mcp_github_action_5", {repo: "foo/bar"}) dispatches and returns the underlying handler's {"ok": true}.
  • tool_call("tool_call", {...}) rejected with recursion guard message.
  • tool_call("terminal", {...}) rejected — model is told to call core tools directly.

What this PR explicitly does not include

  • No JS sandbox / code mode. OpenClaw's tool_search_code is a 1,500-line subprocess + permission-mode + IPC bridge. The three structured bridge tools deliver the same value with a tenth of the surface area. Adding code mode is a future PR if there is demand.
  • No catalog persistence. No ~/.hermes/tool-search-catalog.json. The whole design assumes the catalog is cheap to rebuild and not worth caching across processes.
  • No provider-native paths. Anthropic's defer_loading and OpenAI's hosted tool_search would let us push the work to the provider when available. Cleanest to add after we have benchmark data showing whether the generic path is good enough on its own.
  • No hermes tools UI changes. The feature is config-driven; the TUI doesn't need an entry yet.

Follow-up work

  • Benchmark harness against the metrics laid out in the research PDF (token savings static + dynamic, cost per turn cached/uncached, latency including time-to-first-useful-action, BM25 retrieval Recall@K, accuracy with vs. without).
  • Once we have benchmark data, decide whether the auto default's 10% threshold is right or needs per-model tuning.

Adds Tool Search, a structured-tools progressive-disclosure layer that
replaces MCP and non-core plugin tools in the model-visible tools array
with three bridge tools (tool_search / tool_describe / tool_call) when
the deferrable surface would consume more than a configurable percentage
of the active model's context window. Core Hermes tools are never deferred.

Default mode is 'auto' with a 10% context threshold, so small toolsets
pay no overhead. Set tools.tool_search.enabled to 'on' to force or 'off'
to disable.

Design carefully reflects the OpenClaw production failure modes
documented in the openclaw-tool-search-report:

  - Core tools never defer (toolsets._HERMES_CORE_TOOLS). Addresses the
    'tools silently missing from isolated cron turns' regression class
    (openclaw#84141) by construction: there is no code path that can
    drop a core tool.
  - Catalog is stateless across turns — rebuilt from the live tool-defs
    list on every assembly. No session-keyed Map that can drift out of
    sync with the registry.
  - tool_call unwraps the bridge call before any hook fires, so plugin
    pre/post hooks, guardrails, approval flows, and the activity feed
    all see the underlying tool name, not the bridge (addresses
    openclaw#85588 and the verbose-mode complaint on openclaw#79823).
  - The unwrap happens in both the parallel and sequential paths of
    agent/tool_executor.py and also in handle_function_call, so direct
    callers (sandboxed code, eval harnesses) are covered too.
  - Bridge tools cannot invoke each other (recursion guard) and cannot
    invoke core tools (those must be called directly).
  - Tools mode only — no JS-sandbox code-mode. Keeps the surface small.
  - Token estimation via cheap char/4 heuristic; precision isn't needed
    for the threshold decision.

Files:
  - tools/tool_search.py — new module (BM25 retrieval, classification,
    threshold gate, bridge dispatch, unwrap helper).
  - tests/tools/test_tool_search.py — 35 tests including the OpenClaw
    #84141 regression guard.
  - model_tools.py — wires assembly into _compute_tool_definitions as the
    final step, adds skip_tool_search_assembly kwarg so the bridge can
    see the real catalog, dispatches the three bridge tools.
  - agent/tool_executor.py — unwraps tool_call in both parallel and
    sequential parsing loops so checkpointing, guardrails, plugin hooks,
    and tool-progress callbacks all observe the underlying tool name.
  - hermes_cli/config.py — DEFAULT_CONFIG['tools']['tool_search'] block.
  - website/docs/user-guide/features/tool-search.md — user docs.

Validation:
  - 35/35 new tests pass.
  - Existing tool/registry/model_tools/config/coercion/executor tests
    (82 + 74 + small adjacents) green.
  - Live E2E: 20 fake MCP tools registered, get_tool_definitions returns
    3 bridges, tool_search returns top 3 hits, tool_describe returns
    full schema, tool_call dispatches to the real underlying handler
    and the underlying result is what the model sees.
  - Reserved-name recursion guard verified live.
  - Core-tool refusal via tool_call verified live.
github-actions Bot (Contributor) commented May 23, 2026

🔎 Lint report: hermes/hermes-2b79b6da vs origin/main

ruff

Total: 1 on HEAD, 0 on base (🆕 +1)

🆕 New issues (1):

| Rule | Count |
| --- | --- |
| PLW1514 | 1 |

First entries:
scripts/tool_search_livetest.py:358: [PLW1514] `pathlib.Path(...).write_text` without explicit `encoding` argument

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 9427 on HEAD, 9422 on base (🆕 +5)

🆕 New issues (5):

| Rule | Count |
| --- | --- |
| invalid-assignment | 3 |
| unresolved-import | 1 |
| invalid-argument-type | 1 |

First entries:
scripts/tool_search_livetest.py:410: [invalid-assignment] invalid-assignment: Object of type `bound method ToolRegistry.dispatch(name: str, args: dict[Unknown, Unknown], **kwargs) -> str` is not assignable to attribute `dispatch` of type `def dispatch(self, name: str, args: dict[Unknown, Unknown], **kwargs) -> str`
tests/tools/test_tool_search.py:15: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
model_tools.py:840: [invalid-assignment] invalid-assignment: Object of type `None` is not assignable to `<module 'tools.tool_search'>`
scripts/tool_search_livetest.py:375: [invalid-assignment] invalid-assignment: Object of type `def logging_dispatch(name, args, **kw) -> Unknown` is not assignable to attribute `dispatch` of type `def dispatch(self, name: str, args: dict[Unknown, Unknown], **kwargs) -> str`
scripts/tool_search_livetest.py:387: [invalid-argument-type] invalid-argument-type: Argument to `AIAgent.__init__` is incorrect: Expected `list[str]`, found `None`

✅ Fixed issues: none

Unchanged: 4890 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

alt-glitch added labels on May 23, 2026: type/feature (New feature or request), comp/tools (Tool registry, model_tools, toolsets), comp/agent (Core agent loop, run_agent.py, prompt builder), tool/mcp (MCP client and OAuth), area/config (Config system, migrations, profiles), P2 (Medium: degraded but workaround exists)
pleite commented May 28, 2026

Sharing a test report — we tried this PR as the inference layer for a small local model in a degraded-mode fallback lane. Posting in case the data is useful; not a review, just a write-up of what we did, the environment, the changes we needed on our end, and the results.

What we wanted to learn

Whether progressive tool disclosure as implemented here is a viable fit for our specific use case: a small (4B) instruction-tuned model running on consumer hardware behind an OpenAI-compatible llama.cpp endpoint, used as a degraded-mode fallback for an agent that normally runs on a large cloud model with ~23 tools enabled.

This is a narrow fit-for-purpose test from one specific angle, not an evaluation of the PR against its design goals.

Environment

| Layer | Detail |
| --- | --- |
| Hermes | Sandbox container built from this PR branch (hermes-pr31163), separate state.db from production |
| Model | Qwen3-4B-Instruct-2507, Q4_K_M GGUF |
| Inference | llama.cpp b9360 on an M1 MacBook Air (8 GB), OpenAI-compat server at :8081 |
| Context | --ctx-size 65536 --cache-type-k q4_0 --cache-type-v q4_0 --parallel 1 --cont-batching --mlock |
| Test driver | Hermes api_server at /v1/chat/completions with shared X-Hermes-Session-Id header across 6 sequential prompts |

Changes we made on our side before the test was stable

Three modifications were necessary in our environment. Sharing them so the conditions of the test are clear.

1. llama.cpp --predict 4096 cap. Without it, our 4B model occasionally dropped out of structured tool emission mid-loop and began producing a natural-language planning monologue with no stop condition. One observed run reached 9,335 generated tokens in a single assistant turn. The client-side HTTP timeout on the Hermes side did not propagate a cancel to llama.cpp, so the slot stayed busy and subsequent requests failed with Connection error until the slot drained.

Adding --predict 4096 to llama-server caps a runaway at ~4 min wall and frees the slot cleanly with finish_reason=length. Legitimate longest answers in our test set were ~700 tokens, so the cap is non-disruptive. This is a llama.cpp-side change, not a PR concern — flagging because it was a prerequisite for getting clean test data.

2. Reduced coreset. We exposed roughly 10 tools to the model (memory, terminal, session_search, read_file, write_file, search_files, patch, todo, clarify, plus this PR's bridge tool) instead of our full ~23. This is a separate ongoing line of work on what a small-model-friendly coreset looks like; we tested both arms below using this reduced set as the "treatment" baseline and our full ~23-tool set as the "control" baseline.

3. Bridge-tool description tightening. We made small adjustments to the description text on the discovery tool to make the 4B model invoke it more reliably. Diff available if useful.

Test set

Six sequential prompts in a single session (chained via X-Hermes-Session-Id), covering: pure chat, deferred-memory write, recall, two-tools-mixed (terminal + memory), two-tools-chained (search + memory), and a summary turn that exposes hallucination if earlier tools didn't actually execute.

Step 5 referenced a phrase (HELLO_AB4_BASELINE) that had been seeded into the sandbox state.db in an earlier session, so a true-positive answer exists and can be distinguished from a hallucinated one.

Results

Both arms used --predict 4096 and were run after warm KV. Verification of "did a tool actually execute" was by direct inspection of state.db (tool_calls rows) rather than HTTP status, because in one arm we saw the model emitting tool-call-shaped JSON that arrived as plain content and was reported as 200 OK.
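
The state.db check described above can be approximated with a short query. The sketch below runs against an in-memory database with a hypothetical `messages` table; the real state.db schema is not shown in this thread, so the table and column names other than `tool_calls` are assumptions.

```python
import sqlite3

def count_executed_tool_calls(conn, session_id):
    """Count rows in a session whose tool_calls column is non-NULL."""
    row = conn.execute(
        "SELECT COUNT(*) FROM messages WHERE session_id = ? AND tool_calls IS NOT NULL",
        (session_id,),
    ).fetchone()
    return row[0]

# In-memory stand-in for state.db (hypothetical schema)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (session_id TEXT, role TEXT, content TEXT, tool_calls TEXT)"
)
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?, ?)",
    [
        # Arm A style: tool-call-shaped JSON arrived as plain content
        ("arm_a", "assistant", '{"name": "memory", "arguments": {}}', None),
        # Arm B style: a structured tool call plus its role=tool result row
        ("arm_b", "assistant", "", '[{"name": "memory"}]'),
        ("arm_b", "tool", '{"ok": true}', None),
    ],
)
arm_a_calls = count_executed_tool_calls(conn, "arm_a")
arm_b_calls = count_executed_tool_calls(conn, "arm_b")
```

Counting stored rows rather than HTTP responses is what separates the two cases: both sample sessions would return 200 OK, but only the second has a structured call on record.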

Arm A — full coreset, this PR's mechanism disabled (our baseline)

| Step | wall | prompt tok | actually executed? |
| --- | --- | --- | --- |
| 1 chat | 1.3 s | 15,060 | — (wrong answer: 581 vs correct 391) |
| 2 memory.add | 4.4 s | 15,095 | no — JSON arrived as content |
| 3 recall | 4.3 s | 15,136 | no — model claimed save that didn't happen |
| 4 terminal+memory | 4.0 s | 15,188 | no |
| 5 search+memory | 4.3 s | 15,241 | no |
| 6 summary | 6.2 s | 15,285 | no — summary hallucinated prior tool results |

state.db shows 0 rows with non-NULL tool_calls for this arm.

Arm B — reduced coreset, this PR's mechanism enabled

| Step | wall | prompt tok | actually executed? |
| --- | --- | --- | --- |
| 1 chat | 1.3 s | 7,214 | — (correct: 391) |
| 2 memory.add | 29.6 s | 32,138 | yes — verified in state.db |
| 3 recall | 4.4 s | 8,592 | correct recall |
| 4 terminal+memory | 18.3 s | 26,229 | yes — both calls executed |
| 5 search+memory | 100.9 s | 55,113 | yes — returned the seeded session ID, verified true-positive |
| 6 summary | 9.9 s | 13,124 | accurate to actual events |

state.db shows real tool_calls rows and corresponding role=tool results.

Reference — same six prompts against our normal large-model agent (Claude Opus-class)

5/6 real tool executions. Step 5 returned 0 hits because the seeded phrase lives in the sandbox state.db, not the production one — model correctly reported no result rather than fabricating one.

Token observations (informational)

| Metric | Arm A baseline | Arm B with this PR |
| --- | --- | --- |
| Prompt overhead per non-tool turn | ~15,000 tok flat | ~7,200–8,600 tok |
| Peak prompt in our test | 15,285 tok | 55,113 tok (step 5 discovery results accumulated in history) |

Arm B wins on cold turns and on chat-heavy mixes. Arm B's per-turn cost grows when discovery results pile up in conversation history across many tool-using turns in close succession. Whether the net is positive depends on call mix; for our intended workload (chat-heavy with occasional tool use) Arm B is meaningfully cheaper.
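
As a quick check on the call-mix point, the prompt-token figures reported in the two result tables can be totalled directly. The grouping of steps 1, 3 and 6 as the turns whose labels name no tool is an assumption made for this sketch, and Arm A executed no tools at all, so its prompts never grew with tool results.

```python
# Prompt tokens per step, copied from the Arm A and Arm B tables above
arm_a = [15_060, 15_095, 15_136, 15_188, 15_241, 15_285]
arm_b = [7_214, 32_138, 8_592, 26_229, 55_113, 13_124]

total_a = sum(arm_a)
total_b = sum(arm_b)

# Steps 1, 3 and 6 (indices 0, 2, 5): chat, recall and summary turns
non_tool = [0, 2, 5]
non_tool_a = sum(arm_a[i] for i in non_tool)
non_tool_b = sum(arm_b[i] for i in non_tool)

print(f"all six steps:    A={total_a:,}  B={total_b:,}")
print(f"steps 1, 3 and 6: A={non_tool_a:,}  B={non_tool_b:,}")
```

On this six-prompt run Arm B uses fewer prompt tokens on the turns without a tool in the label and more across the whole session, which matches the statement that the net depends on how often tools are used.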

Fit-for-our-purpose result

For the specific lane we're investigating (small local model behind an OpenAI-compatible llama.cpp endpoint as a degraded-mode fallback), this PR's mechanism produced a working agent in our test where our baseline configuration did not. We're going to keep iterating on this configuration internally and would not have a usable small-model lane without it.

We're not claiming this generalises beyond our setup, and we make no claim about whether this is the use case the PR was designed for — just sharing the data in case it's useful as one additional point.

Happy to share the test driver script, the bridge-tool description diff, or full session JSON dumps if any of that would help.


This comment was written by a Hermes Agent instance (Claude Opus class model running on the Hermes Agent stack) on behalf of its operator, who ran the experiment and reviewed the report before posting.

Adds a real-model live test for the tool_search feature. Spins up a real
AIAgent against Claude Haiku 4.5 via OpenRouter, registers 20 fake MCP
tools with realistic shapes, runs 5 scenarios twice each (tool_search ON
and OFF), and records the full transcript per run.

Captures both the bridge call sequence the model emitted (tool_search /
tool_describe / tool_call) and the underlying tool calls that actually
executed through the registry. Records iteration count, elapsed time,
and final response for an A/B comparison.

Scenarios cover:
  A. Obvious single tool — direct keyword match
  B. Vague paraphrased intent — stress retrieval quality
  C. Multi-step chain — two deferred tools in sequence
  D. Mixed core + deferred — verify core tools (read_file) get called
     directly, not through tool_call
  E. No tool needed — verify no spurious tool_search invocations

Baseline run included in scripts/out/ for reference. All 10 runs
(5 scenarios x 2 modes) pass — every expected underlying tool was
invoked, no core tool was incorrectly routed through tool_call, no
tool name was hallucinated.

Round-trip cost observed: tool_search enabled added +3 to +4 model
round trips per task vs disabled. Single-tool tasks completed in ~16-20s
vs ~10-11s direct. Multi-tool tasks ~20s vs ~14s. The bridge overhead
is real and measurable but the task completion rate is identical.
teknium1 (Contributor, Author) commented

Live test results

Ran a real-model end-to-end test against Claude Haiku 4.5 via OpenRouter. Five
scenarios, each run twice (tool_search ON and OFF), with 20 fake MCP tools
registered. Harness and transcripts in scripts/tool_search_livetest.py,
scripts/analyze_livetest.py, and scripts/out/.

10/10 runs passed. Every expected underlying tool was invoked. Zero
hallucinated tool names. Zero attempts to route a core tool through tool_call.
Display unwrap working in the CLI activity feed.

Side-by-side

| Scenario | ON: bridges + underlying / iters / elapsed | OFF: underlying / iters / elapsed | Δ round-trips |
| --- | --- | --- | --- |
| A obvious_single | 3 + 1 / 4 / 18.5s | 1 / 2 / 9.7s | +3 |
| B vague_paraphrased | 3 + 1 / 4 / 15.6s | 1 / 2 / 11.3s | +3 |
| C multi_tool_chain | 4 + 2 / 4 / 20.3s | 2 / 3 / 14.1s | +4 |
| D core_plus_deferred | 3 + 2 / 5 / 33.1s | 2 / 3 / 9.8s | +3 |
| E no_tool_needed | 0 + 0 / 1 / 8.2s | 0 / 1 / 2.8s | 0 |

Sample trace (Scenario A, ON)

bridges:    tool_search('create github issue')
         →  tool_describe(github_create_issue)
         →  tool_call → github_create_issue
underlying: github_create_issue

Sample trace (Scenario D, ON) — the safety guarantee in action

underlying: read_file → slack_send_message
bridges:    tool_search('post message Slack channel')
         →  tool_describe(slack_send_message)
         →  tool_call → slack_send_message

Note that read_file was called directly, not through tool_call. The
model correctly identified it as a core tool already in the visible tools
array and skipped the bridge for it. This is the safety invariant the report
flagged and that the implementation enforces by construction.
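
A transcript audit for this invariant might look like the sketch below. It is not taken from scripts/analyze_livetest.py; the call-record shape, the abbreviated core-tool set, and the file path in the sample trace are hypothetical.

```python
CORE_TOOLS = {"terminal", "read_file", "write_file", "patch"}  # abbreviated, illustrative

def core_tools_routed_through_bridge(calls):
    """Return the core tool names the model tried to invoke via tool_call."""
    return [
        call["arguments"].get("name")
        for call in calls
        if call["name"] == "tool_call" and call["arguments"].get("name") in CORE_TOOLS
    ]

# Scenario D as a list of model-emitted calls (argument values are made up)
scenario_d = [
    {"name": "read_file", "arguments": {"path": "notes.txt"}},
    {"name": "tool_search", "arguments": {"query": "post message Slack channel"}},
    {"name": "tool_describe", "arguments": {"name": "slack_send_message"}},
    {"name": "tool_call", "arguments": {"name": "slack_send_message", "arguments": {}}},
]
violations = core_tools_routed_through_bridge(scenario_d)

# A trace that breaks the invariant, for contrast
bad_trace = [{"name": "tool_call", "arguments": {"name": "read_file", "arguments": {}}}]
bad_violations = core_tools_routed_through_bridge(bad_trace)
```

An empty `violations` list corresponds to the "zero attempts to route a core tool through tool_call" result reported for all ten runs.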

Observed costs

  • ON adds +3 to +4 model round trips per task with deferred tools
  • Single-tool tasks: ~16-20s vs ~10-11s direct (~2× wall time)
  • Multi-tool chains: ~20s vs ~14s (~1.4× wall time)
  • Pure-knowledge prompts: 0 extra round trips (no spurious tool_search)
  • Token savings on the static side are real and measurable; the cost is paid
    in latency and additional round trips on cold-cache tool invocations

Confidence to ship

Behavior matches design. The bridge tools are usable by a real model without
prompt-engineering tricks. The auto threshold (default 10% of context)
means small toolsets pay no overhead — the +3 round trip tax only applies
when the deferrable surface is large enough to justify it. Recommended:
ship with enabled: auto as the default (already the case in this PR).

Future work, separate from this PR:

  • A2/A3 prompts to test smaller models (Qwen, GPT-5.2 nano) — Haiku 4.5
    is a strong model and may not surface retrieval-quality failures that
    weaker models would hit
  • Larger toolset (50+ deferred tools) to stress retrieval ranking
  • Cost measurements with real cached/uncached pricing data


suffix = "enabled" if enabled else "disabled"
out_path = out_dir / f"{scenario['id']}__{suffix}.json"
out_path.write_text(json.dumps(record, indent=2, default=str))
Comment thread scripts/tool_search_livetest.py Dismissed
Comment thread scripts/tool_search_livetest.py Dismissed
})

summary_path = out_dir / "_summary.json"
summary_path.write_text(json.dumps(summary, indent=2))
teknium1 (Contributor, Author) commented

Merged via #34493 (rebased onto current main, your commit authorship preserved in git log).

The salvage carried the feature forward and closed a toolset-scoping hole found in review: the bridge read its catalog from the global registry, so a restricted-toolset session (subagent / kanban worker / curated gateway session) could tool_search the whole process registry and tool_call any plugin/MCP tool it was never granted. Now scoped to the session's own toolsets, with a defense-in-depth gate in both the bridge dispatch and the executor unwrap.
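
The scoping fix can be pictured as filtering the catalog by the session's granted toolsets and repeating that check at dispatch time. The registry layout, function names, and toolset names below are hypothetical and only illustrate the two gates described above.

```python
def scoped_catalog(registry, granted_toolsets):
    """Return only the deferred tools whose toolset the session was granted."""
    return {
        name: meta
        for name, meta in registry.items()
        if meta["toolset"] in granted_toolsets
    }

def bridge_dispatch(name, registry, granted_toolsets):
    """Second gate: refuse at dispatch time even if a name slipped past search."""
    if name not in scoped_catalog(registry, granted_toolsets):
        raise PermissionError(f"{name} is not available in this session")
    return registry[name]["handler"]()

# Hypothetical process-wide registry with two MCP toolsets
registry = {
    "mcp_github_create_issue": {"toolset": "mcp-github", "handler": lambda: "issue created"},
    "mcp_slack_send": {"toolset": "mcp-slack", "handler": lambda: "sent"},
}
visible = scoped_catalog(registry, {"mcp-github"})
result = bridge_dispatch("mcp_github_create_issue", registry, {"mcp-github"})
```

A session granted only `mcp-github` can neither find nor call the Slack tool, even though both live in the same process registry.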

Also: dropped the 11 checked-in scripts/out/*.json transcripts (kept the harness, gitignored the output dir), routed the harness's key loading through load_hermes_dotenv, added _redact_secrets() over transcript/console output, and encoding="utf-8" on all file I/O.
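
A rough illustration of the last two items: redacting key-like strings before a transcript is written, and passing an explicit encoding so the PLW1514 warning no longer applies. The regex and the `redact_secrets` helper are hypothetical stand-ins; the behavior of the real `_redact_secrets()` is not shown in this thread.

```python
import json
import re
import tempfile
from pathlib import Path

# Hypothetical pattern: tokens with an "sk-" prefix followed by 8+ key characters
_SECRET_RE = re.compile(r"\bsk-[A-Za-z0-9_-]{8,}")

def redact_secrets(text):
    """Replace anything that looks like an API key with a placeholder."""
    return _SECRET_RE.sub("[REDACTED]", text)

def write_record(path, record):
    """Serialize, redact, then write with an explicit encoding."""
    payload = redact_secrets(json.dumps(record, indent=2, default=str))
    path.write_text(payload, encoding="utf-8")

out_dir = Path(tempfile.mkdtemp())
out_path = out_dir / "A__enabled.json"
write_record(out_path, {"scenario": "A", "note": "key sk-abcdefgh12345678 leaked"})
saved = out_path.read_text(encoding="utf-8")
```

Redacting the serialized string rather than individual fields catches secrets wherever they appear in the record, at the cost of possible false positives on strings that merely resemble keys.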

On main as a87f0a8 (+ 369075d, 7427b9d, 1709776, 18c9e89).
