feat(tools): progressive tool disclosure for MCP and plugin tools (scoped)#34493

Merged
teknium1 merged 5 commits into main from hermes/hermes-ede5b5b2
May 29, 2026

@teknium1 teknium1 commented May 29, 2026

Summary

Salvages #31163 (Tool Search — progressive tool disclosure for MCP/plugin tools) onto current main and closes a toolset-scoping hole found in deep review.

Tool Search hides MCP + non-core plugin tools behind three bridge tools (tool_search / tool_describe / tool_call) when the deferrable surface exceeds ~10% of the active model's context window. Core Hermes tools are never deferred.
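
A minimal sketch of the threshold decision described above, assuming the char/4 token heuristic mentioned in the commit notes for this PR. The function names and the sample tool definitions are hypothetical and are not taken from tools/tool_search.py.

```python
import json
from typing import Any, Dict, List


def estimate_tokens(tool_defs: List[Dict[str, Any]]) -> int:
    """Rough token count: serialized characters divided by four."""
    return sum(len(json.dumps(d)) for d in tool_defs) // 4


def should_defer(tool_defs: List[Dict[str, Any]], context_window: int,
                 threshold: float = 0.10) -> bool:
    """True when the deferrable surface exceeds the given share of the context window."""
    return estimate_tokens(tool_defs) > threshold * context_window


# Hypothetical tool definitions, for illustration only.
small = [{"name": "mcp_ping", "description": "Ping a server."}]
large = [{"name": f"mcp_tool_{i}", "description": "x" * 400} for i in range(50)]

print(should_defer(small, context_window=8_000))  # False
print(should_defer(large, context_window=8_000))  # True
```

Precision matters little here because the estimate only feeds a coarse yes/no gate.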

What changed vs #31163

The original bridge dispatch read its catalog from the global registry — get_tool_definitions() with no toolset scope, whose else branch is "start with everything." In a restricted-toolset session (subagent, kanban worker, curated gateway session) that meant the model could:

  1. tool_search the entire process registry, not just its granted tools, and
  2. tool_call any registered plugin/MCP tool it was never given — registry.dispatch() has no enabled_tools gate for non-execute_code tools, so the out-of-scope tool actually ran.

It also widened the process-global _last_resolved_tool_names to the whole registry on every tool_search, leaking core/sandbox tools into execute_code's fallback set.

Confirmed by live E2E (pre-fix)

Session scoped to enabled_toolsets=['mcp-github'] with an out-of-scope dangerplugin tool registered:

  • tool_search reported total_available: 26 (whole registry)
  • tool_call("secret_plugin_danger", {}) returned {"ok": true} — it dispatched a tool the session was never granted
  • _last_resolved_tool_names went 20 → 51 (now including terminal)

Fix

  • handle_function_call gains enabled_toolsets / disabled_toolsets; the bridge dispatch scopes get_tool_definitions to them. This both scopes the searchable catalog and stops the global-pollution side effect.
  • Defense-in-depth gate rejects any tool_call'd name not in the scoped deferrable catalog.
  • tool_executor's unwrap (concurrent + sequential paths) enforces the same scope before dispatch — it unwraps tool_call → underlying name and bypasses the bridge branch, so the gate must live there too. New _tool_search_scoped_names() helper, cached per-agent on registry generation + toolset scope.
  • New scoped_deferrable_names() helper in tool_search.py shared by both sites.
  • get_tool_definitions / _compute_tool_definitions signatures annotated Optional[List[str]] (were List[str] = None).
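
The scope gate can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in model_tools.py or tool_search.py: the registry layout, the handlers, and the names sketch_scoped_names and sketch_tool_call are hypothetical. It keeps the behaviour described above, where an unset enabled_toolsets means "start with everything".

```python
from typing import Any, Callable, Dict, List, Optional, Set

# Hypothetical stand-in registry: tool name -> owning toolset.
REGISTRY: Dict[str, str] = {
    "github_list_issues": "mcp-github",
    "github_create_pr": "mcp-github",
    "secret_plugin_danger": "dangerplugin",
}

# Hypothetical handlers that simply report success.
HANDLERS: Dict[str, Callable[..., Dict[str, Any]]] = {
    name: (lambda **kwargs: {"ok": True}) for name in REGISTRY
}


def sketch_scoped_names(enabled_toolsets: Optional[List[str]],
                        disabled_toolsets: Optional[List[str]] = None) -> Set[str]:
    """Names visible to a session; enabled_toolsets=None means no restriction."""
    disabled = set(disabled_toolsets or [])
    names: Set[str] = set()
    for name, toolset in REGISTRY.items():
        if toolset in disabled:
            continue
        if enabled_toolsets is None or toolset in enabled_toolsets:
            names.add(name)
    return names


def sketch_tool_call(name: str, args: Dict[str, Any],
                     enabled_toolsets: Optional[List[str]],
                     disabled_toolsets: Optional[List[str]] = None) -> Dict[str, Any]:
    """Reject any name outside the scoped catalog before dispatching."""
    if name not in sketch_scoped_names(enabled_toolsets, disabled_toolsets):
        return {"ok": False, "error": f"{name} is not available in this session"}
    return HANDLERS[name](**args)


print(sketch_tool_call("secret_plugin_danger", {}, ["mcp-github"]))
print(sketch_tool_call("github_list_issues", {}, ["mcp-github"]))
```

Because the same membership check runs at every dispatch site, a path that bypasses the bridge branch still cannot reach an out-of-scope tool.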

Validation (post-fix, same E2E)

| Check | Before | After |
| --- | --- | --- |
| tool_search scoped to mcp-github | total_available: 26 | total_available: 20 |
| tool_call(out-of-scope plugin) | {"ok": true} (ran) | rejected: "not available in this session" |
| tool_call(in-scope tool) | ran | ran |
| _last_resolved_tool_names after tool_search | 20 → 51 (leaked terminal) | 20 → 20 |

Changes

  • tools/tool_search.py (new) — classification, threshold gate, BM25 retrieval, bridge dispatch, scoped_deferrable_names().
  • model_tools.py — assembly wired into _compute_tool_definitions; bridge dispatch in handle_function_call, now toolset-scoped.
  • agent/tool_executor.py — unwrap tool_call in both parsing paths with the scope gate; _tool_search_scoped_names() cache helper.
  • agent/agent_runtime_helpers.py — forwards toolset scope into the sequential dispatch.
  • hermes_cli/config.py — DEFAULT_CONFIG['tools']['tool_search'] block.
  • tests/tools/test_tool_search.py — 39 tests (35 original + 4-test TestRegression_ToolsetScoping).
  • website/docs/user-guide/features/tool-search.md — docs incl. the scoping guarantee.
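
For readers unfamiliar with BM25 retrieval, here is a small self-contained sketch of the scoring idea. It is a textbook formulation with common default parameters (k1=1.5, b=0.75) and whitespace tokenization, all of which are assumptions; the actual tokenizer and parameters in tools/tool_search.py may differ. The tool names and descriptions are made up.

```python
import math
from collections import Counter
from typing import Dict, List


def bm25_rank(query: str, docs: Dict[str, str],
              k1: float = 1.5, b: float = 0.75) -> List[str]:
    """Return document names ordered by descending BM25 score for the query."""
    if not docs:
        return []
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    n = len(tokenized)
    avgdl = sum(len(toks) for toks in tokenized.values()) / n
    # Document frequency: in how many documents each term appears.
    df: Counter = Counter()
    for toks in tokenized.values():
        df.update(set(toks))
    scores: Dict[str, float] = {}
    for name, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[name] = score
    return sorted(scores, key=lambda name: scores[name], reverse=True)


# Hypothetical catalog of tool descriptions.
catalog = {
    "github_list_issues": "list open issues in a github repository",
    "slack_post_message": "post a message to a slack channel",
    "github_create_pr": "create a pull request on github",
}
print(bm25_rank("github issues", catalog))
```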

Test plan

scripts/run_tests.sh tests/tools/test_tool_search.py        → 39/39
scripts/run_tests.sh tests/test_model_tools.py tests/test_toolsets.py \
  tests/tools/test_registry.py tests/hermes_cli/test_config.py \
  tests/run_agent/test_tool_arg_coercion.py                 → 269/269 (combined)
scripts/run_tests.sh tests/run_agent/test_agent_guardrails.py \
  tests/run_agent/test_concurrent_interrupt.py \
  tests/run_agent/test_tool_call_guardrail_runtime.py \
  tests/run_agent/test_tool_executor_contextvar_propagation.py → 52/52

Supersedes #31163.

Infographic

Tool Search — progressive tool disclosure

teknium1 added 2 commits May 29, 2026 00:38
Adds Tool Search, a structured-tools progressive-disclosure layer that
replaces MCP and non-core plugin tools in the model-visible tools array
with three bridge tools (tool_search / tool_describe / tool_call) when
the deferrable surface would consume more than a configurable percentage
of the active model's context window. Core Hermes tools are never deferred.

Default mode is 'auto' with a 10% context threshold, so small toolsets
pay no overhead. Set tools.tool_search.enabled to 'on' to force or 'off'
to disable.
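
A rough sketch of how the three modes could resolve to a yes/no decision. Only the 'auto' / 'on' / 'off' values and the 10% default come from the text above; the function name, its parameters, and the idea of passing a precomputed token estimate are assumptions for illustration.

```python
def tool_search_active(mode: str, deferrable_tokens: int, context_window: int,
                       threshold_pct: float = 10.0) -> bool:
    """Decide whether the bridge tools replace the deferrable tools."""
    if mode == "on":
        return True   # forced on, regardless of size
    if mode == "off":
        return False  # disabled entirely
    if mode != "auto":
        raise ValueError(f"unknown tool_search mode: {mode!r}")
    # 'auto': defer only when the surface exceeds the configured share.
    return deferrable_tokens > context_window * threshold_pct / 100.0


print(tool_search_active("auto", 500, 128_000))     # False
print(tool_search_active("auto", 20_000, 128_000))  # True
print(tool_search_active("on", 500, 128_000))       # True
```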

The design deliberately guards against the OpenClaw production failure
modes documented in the openclaw-tool-search-report:

  - Core tools never defer (toolsets._HERMES_CORE_TOOLS). Addresses the
    'tools silently missing from isolated cron turns' regression class
    (openclaw#84141) by construction: there is no code path that can
    drop a core tool.
  - Catalog is stateless across turns — rebuilt from the live tool-defs
    list on every assembly. No session-keyed Map that can drift out of
    sync with the registry.
  - tool_call unwraps the bridge call before any hook fires, so plugin
    pre/post hooks, guardrails, approval flows, and the activity feed
    all see the underlying tool name, not the bridge (addresses
    openclaw#85588 and the verbose-mode complaint on openclaw#79823).
  - The unwrap happens in both the parallel and sequential paths of
    agent/tool_executor.py and also in handle_function_call, so direct
    callers (sandboxed code, eval harnesses) are covered too.
  - Bridge tools cannot invoke each other (recursion guard) and cannot
    invoke core tools (those must be called directly).
  - Tools mode only — no JS-sandbox code-mode. Keeps the surface small.
  - Token estimation via cheap char/4 heuristic; precision isn't needed
    for the threshold decision.
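
The unwrap and its two guards can be sketched as below. The payload keys ("name", "arguments"), the core-tool stand-ins, and the function name are assumptions for illustration; the real unwrap helper in tools/tool_search.py may use a different shape.

```python
from typing import Any, Dict, Tuple

BRIDGE_TOOLS = frozenset({"tool_search", "tool_describe", "tool_call"})
# Hypothetical stand-ins for toolsets._HERMES_CORE_TOOLS.
CORE_TOOLS = frozenset({"terminal", "read_file"})


def unwrap_tool_call(name: str, args: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
    """Return the underlying (name, args) so later hooks never see the bridge."""
    if name != "tool_call":
        return name, args  # not a bridge call, pass through unchanged
    inner = args.get("name")
    inner_args = args.get("arguments") or {}
    if inner in BRIDGE_TOOLS:
        raise ValueError("bridge tools cannot invoke each other")
    if inner in CORE_TOOLS:
        raise ValueError("core tools must be called directly")
    return inner, inner_args


print(unwrap_tool_call("tool_call",
                       {"name": "github_list_issues", "arguments": {"repo": "x"}}))
```

Doing the unwrap before any hook fires is what lets guardrails, approval flows, and the activity feed report the underlying tool name.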

Files:
  - tools/tool_search.py — new module (BM25 retrieval, classification,
    threshold gate, bridge dispatch, unwrap helper).
  - tests/tools/test_tool_search.py — 35 tests including the OpenClaw
    #84141 regression guard.
  - model_tools.py — wires assembly into _compute_tool_definitions as the
    final step, adds skip_tool_search_assembly kwarg so the bridge can
    see the real catalog, dispatches the three bridge tools.
  - agent/tool_executor.py — unwraps tool_call in both parallel and
    sequential parsing loops so checkpointing, guardrails, plugin hooks,
    and tool-progress callbacks all observe the underlying tool name.
  - hermes_cli/config.py — DEFAULT_CONFIG['tools']['tool_search'] block.
  - website/docs/user-guide/features/tool-search.md — user docs.

Validation:
  - 35/35 new tests pass.
  - Existing tool/registry/model_tools/config/coercion/executor tests
    (82 + 74 + small adjacents) green.
  - Live E2E: 20 fake MCP tools registered, get_tool_definitions returns
    3 bridges, tool_search returns top 3 hits, tool_describe returns
    full schema, tool_call dispatches to the real underlying handler
    and the underlying result is what the model sees.
  - Reserved-name recursion guard verified live.
  - Core-tool refusal via tool_call verified live.
…olsets

Tool Search read its catalog from the global registry (get_tool_definitions
with no toolset scope = 'start with everything'), so a restricted-toolset
session — subagent, kanban worker, curated gateway session — could:

  1. tool_search the entire process registry, not just its granted tools, and
  2. tool_call any registered plugin/MCP tool it was never given, because
     registry.dispatch() has no enabled_tools gate for non-execute_code tools.

A scoped session (enabled_toolsets=['mcp-github']) reported total_available=26
and successfully invoked an out-of-scope plugin tool via tool_call.

Fix:
- handle_function_call gains enabled_toolsets/disabled_toolsets; the bridge
  dispatch scopes get_tool_definitions to them (also stops polluting the
  process-global _last_resolved_tool_names with out-of-scope tools, which
  leaked into execute_code's sandbox-tool fallback).
- A defense-in-depth gate rejects any tool_call'd name not in the scoped
  deferrable catalog.
- tool_executor's unwrap (both concurrent + sequential paths) enforces the
  same scope before dispatch, since it unwraps tool_call -> underlying name
  and bypasses the bridge branch. New _tool_search_scoped_names() helper,
  cached per-agent on registry generation + toolset scope.
- New scoped_deferrable_names() helper in tool_search.py shared by both sites.

Tests: 4 new regression tests in TestRegression_ToolsetScoping (scoped
catalog, out-of-scope tool_call rejection, no global pollution, helper).
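
One way to picture a cache keyed on registry generation plus toolset scope is the sketch below. The class, its method names, and the compute callback are hypothetical; the only detail taken from the commit message is the cache key.

```python
from typing import Callable, FrozenSet, Iterable, List, Optional, Tuple


def _freeze(toolsets: Optional[Iterable[str]]) -> Optional[Tuple[str, ...]]:
    """Hashable form of a toolset list; None stays distinct from an empty list."""
    return None if toolsets is None else tuple(sorted(toolsets))


class ScopedNamesCache:
    """Recompute scoped names only when the registry or the scope changes."""

    def __init__(self, compute: Callable[[Optional[List[str]], Optional[List[str]]],
                                         Iterable[str]]) -> None:
        self._compute = compute
        self._key = None
        self._value: FrozenSet[str] = frozenset()

    def get(self, generation: int, enabled: Optional[List[str]],
            disabled: Optional[List[str]]) -> FrozenSet[str]:
        key = (generation, _freeze(enabled), _freeze(disabled))
        if key != self._key:
            self._value = frozenset(self._compute(enabled, disabled))
            self._key = key
        return self._value


calls = []


def fake_compute(enabled, disabled):
    calls.append((enabled, disabled))
    return ["github_list_issues"]


cache = ScopedNamesCache(fake_compute)
cache.get(1, ["mcp-github"], None)
cache.get(1, ["mcp-github"], None)   # same key, served from cache
cache.get(2, ["mcp-github"], None)   # registry generation changed, recomputed
print(len(calls))  # 2
```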

github-actions Bot commented May 29, 2026

🔎 Lint report: hermes/hermes-ede5b5b2 vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 9436 on HEAD, 9439 on base (✅ -3)

🆕 New issues (5):

| Rule | Count |
| --- | --- |
| invalid-assignment | 3 |
| invalid-argument-type | 1 |
| unresolved-import | 1 |

First entries
scripts/tool_search_livetest.py:389: [invalid-argument-type] invalid-argument-type: Argument to `AIAgent.__init__` is incorrect: Expected `list[str]`, found `None`
model_tools.py:850: [invalid-assignment] invalid-assignment: Object of type `None` is not assignable to `<module 'tools.tool_search'>`
scripts/tool_search_livetest.py:377: [invalid-assignment] invalid-assignment: Object of type `def logging_dispatch(name, args, **kw) -> Unknown` is not assignable to attribute `dispatch` of type `def dispatch(self, name: str, args: dict[Unknown, Unknown], **kwargs) -> str`
tests/tools/test_tool_search.py:15: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
scripts/tool_search_livetest.py:412: [invalid-assignment] invalid-assignment: Object of type `bound method ToolRegistry.dispatch(name: str, args: dict[Unknown, Unknown], **kwargs) -> str` is not assignable to attribute `dispatch` of type `def dispatch(self, name: str, args: dict[Unknown, Unknown], **kwargs) -> str`

✅ Fixed issues (4):

| Rule | Count |
| --- | --- |
| invalid-argument-type | 3 |
| invalid-parameter-default | 1 |

First entries
model_tools.py:331: [invalid-parameter-default] invalid-parameter-default: Default value of type `None` is not assignable to annotated parameter type `list[str]`
acp_adapter/server.py:798: [invalid-argument-type] invalid-argument-type: Argument to function `get_tool_definitions` is incorrect: Expected `list[str]`, found `Any | None`
gateway/run.py:13573: [invalid-argument-type] invalid-argument-type: Argument to function `get_tool_definitions` is incorrect: Expected `list[str]`, found `Any | None`
tui_gateway/server.py:6700: [invalid-argument-type] invalid-argument-type: Argument to function `get_tool_definitions` is incorrect: Expected `list[str]`, found `Any | None | list[str]`

Unchanged: 4894 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

Brings in the tool_search live-test harness from the original PR but leaves
out the 11 checked-in scripts/out/*.json transcript files — those are
non-deterministic model output that goes stale the moment the model changes
and were the bulk of the diff. scripts/out/ is now gitignored so a harness
run never re-commits them.

Fixes on top:
- API-key loading goes through hermes_cli.env_loader.load_hermes_dotenv
  instead of hand-parsing ~/.hermes/.env and assigning the value to a local.
  The canonical loader never materializes the secret in a local variable in
  this module, which clears the four CodeQL high alerts
  (py/clear-text-storage / py/clear-text-logging-sensitive-data at the
  transcript write/print sites — they were tracing the key from the
  hand-rolled parser into the records) and removes a hand-rolled parser.
- encoding='utf-8' on every write_text/read_text in both harness scripts
  (Windows-footgun hygiene).
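
As a small illustration of the encoding point: without an explicit encoding, pathlib falls back to the platform's locale encoding, which on Windows is often not UTF-8. The file name and record below are made up.

```python
import tempfile
from pathlib import Path

# A record containing non-ASCII characters, as model output often does.
record = '{"final_response": "naïve café → ok"}'

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "transcript.json"
    # Explicit encoding on both sides keeps the round trip platform-independent.
    path.write_text(record, encoding="utf-8")
    restored = path.read_text(encoding="utf-8")

print(restored == record)  # True
```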

Co-authored-by: teknium1 <127238744+teknium1@users.noreply.github.com>
@alt-glitch alt-glitch added the type/security, P2, comp/tools, comp/agent, and tool/mcp labels May 29, 2026
…args

The scoping fix added enabled_toolsets/disabled_toolsets to the
agent_runtime_helpers sequential dispatch into handle_function_call, so
test_invoke_tool_dispatches_to_handle_function_call's assert_called_once_with
(exact match) needs the two new kwargs. Both are None for the default agent
fixture.
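
The exact-match behaviour of assert_called_once_with is easy to reproduce in isolation. In this sketch the mock, the wrapper function, and the argument values are hypothetical stand-ins; only the two keyword names come from the commit message.

```python
from unittest.mock import MagicMock

# Stand-in for handle_function_call.
handle = MagicMock(return_value="ok")


def invoke_tool(name, args):
    """Hypothetical dispatcher that now forwards the toolset scope."""
    return handle(name, args, enabled_toolsets=None, disabled_toolsets=None)


invoke_tool("read_file", {"path": "a.txt"})

# Updated assertion: passes because every argument matches exactly.
handle.assert_called_once_with(
    "read_file", {"path": "a.txt"}, enabled_toolsets=None, disabled_toolsets=None
)

# Stale assertion: fails even though the extra kwargs are only None.
try:
    handle.assert_called_once_with("read_file", {"path": "a.txt"})
    stale_assertion_passes = True
except AssertionError:
    stale_assertion_passes = False

print(stale_assertion_passes)  # False
```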
Comment thread scripts/tool_search_livetest.py Dismissed
Comment thread scripts/tool_search_livetest.py Dismissed
Comment thread scripts/tool_search_livetest.py Dismissed
Comment thread scripts/tool_search_livetest.py Dismissed
The live harness runs against a real OpenRouter key; record['error'] is a
full traceback that, on an auth failure, could echo a request header or URL
containing the key. _redact_secrets() now masks the live OPENROUTER_API_KEY,
any sk-/sk-or- bearer token, and Authorization/Bearer headers before
final_response and error enter the transcript or the console print. Addresses
the CodeQL clear-text-storage/logging findings at the source.
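
A minimal sketch of this kind of redaction, assuming simple regular expressions. It is not the harness's _redact_secrets(): the patterns, the placeholder text, and the live_key parameter are illustrative assumptions.

```python
import os
import re
from typing import Optional

# Value following a "Bearer" scheme, e.g. in an Authorization header.
_BEARER_RE = re.compile(r"(?i)\b(bearer\s+)[^\s'\"]+")
# Tokens that start with sk- (this also covers the sk-or- prefix).
_TOKEN_RE = re.compile(r"\bsk-[A-Za-z0-9_-]{8,}")
_MASK = "[REDACTED]"


def redact_secrets(text: str, live_key: Optional[str] = None) -> str:
    """Mask the live key, bearer values, and sk- style tokens in text."""
    key = live_key or os.environ.get("OPENROUTER_API_KEY")
    if key:
        text = text.replace(key, _MASK)
    text = _BEARER_RE.sub(lambda m: m.group(1) + _MASK, text)
    return _TOKEN_RE.sub(_MASK, text)


sample = (
    "401 Unauthorized; headers={'Authorization': 'Bearer sk-or-v1-abcdef1234567890'} "
    "url=https://example.invalid/v1?key=sk-abcdefgh12345678"
)
print(redact_secrets(sample, live_key="sk-or-v1-abcdef1234567890"))
```

Redacting at the point where text enters the transcript or the console, rather than at each call site, keeps the masking in one place.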
@teknium1 teknium1 merged commit a87f0a8 into main May 29, 2026
26 checks passed
@teknium1 teknium1 deleted the hermes/hermes-ede5b5b2 branch May 29, 2026 09:04
davidgut1982 added a commit to davidgut1982/hermes-agent that referenced this pull request May 30, 2026
…isclosure

Adds an optional, opt-in embedding reranker to the tool_search BM25 bridge
(PR NousResearch#34493). Default OFF — when disabled the BM25 path is byte-for-byte
identical to upstream. urllib-only (no new deps), task-prefixed, md5-cached
tool embeddings, full-catalog retrieve, rerank/RRF(k=10) modes, graceful
BM25 fallback on any endpoint failure. Backend is any OpenAI-compatible
/v1/embeddings endpoint (cloud, local CPU, or GPU).

Live-validated (194 tools / 98 labeled queries, nomic-embed-text-v2-moe):
overall Recall@5 0.617 -> 0.810, SEMANTIC 0.500 -> 0.849, LEXICAL preserved
at 1.000; warm per-query ~146ms, dead-endpoint fallback ~8ms.

Fulfills NousResearch#13332.
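
For context on the RRF(k=10) mode named above, reciprocal rank fusion is commonly defined as summing 1 / (k + rank) over each input ranking. The sketch below shows that standard formulation; it is not taken from the referenced commit, and the function name and sample rankings are made up.

```python
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 10) -> List[str]:
    """Fuse several ranked lists; items ranked high in many lists come first."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, name in enumerate(ranking, start=1):
            scores[name] = scores.get(name, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties by name for determinism.
    return sorted(scores, key=lambda name: (-scores[name], name))


bm25_order = ["a", "b", "c"]       # hypothetical lexical ranking
embedding_order = ["c", "a", "d"]  # hypothetical semantic ranking
print(rrf_fuse([bm25_order, embedding_order]))  # ['a', 'c', 'b', 'd']
```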

Labels

  • comp/agent: Core agent loop, run_agent.py, prompt builder
  • comp/tools: Tool registry, model_tools, toolsets
  • P2: Medium — degraded but workaround exists
  • tool/mcp: MCP client and OAuth
  • type/security: Security vulnerability or hardening
