
doctor: surface active HERMES_TOOLS_SUBSET at boot (#75/#87 follow-up)#96

Merged: PowerCreek merged 1 commit into main from doctor-tools-subset-probe on May 25, 2026

@PowerCreek


Summary

Companion to PR #95 (HERMES_DEFAULT_PROVIDER probe). Operators who narrow the tool surface via HERMES_TOOLS_SUBSET (#75/#87) now see at hermes doctor time exactly which tools the filter parsed to — catches two failure modes that previously required a separate hermes mcp list diff.

Behavior

  • Silent when env var is unset/empty/whitespace (silent-when-irrelevant pattern).
  • check_ok with the count + a sample of names (first 6, +N more suffix to keep the row readable). Operator confirms the filter parsed as expected.
  • check_info reminder when zero entries use the mcp_ prefix but some entries look structured (have an underscore). Most common failure mode: the operator forgot the mcp_<server>_<tool> prefix for MCP tools, so the value parses correctly but silently filters nothing.
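
The three behaviors above can be sketched as one small pure function. This is a minimal illustration, not the probe that landed: the name `summarize_tools_subset`, its `(ok_row, reminder)` return shape, and the message wording are assumptions. Only the comma-separated value, the first-6 sample, the `+N more` suffix, and the `mcp_` reminder condition come from this description.

```python
import os

SAMPLE_SIZE = 6  # the description says "first 6"


def summarize_tools_subset(raw):
    """Return (ok_row, reminder) for a HERMES_TOOLS_SUBSET value.

    Hypothetical helper: name, return shape, and wording are illustrative.
    Returns None when the variable is unset, empty, or whitespace.
    """
    if raw is None or not raw.strip():
        return None  # silent-when-irrelevant

    names = [part.strip() for part in raw.split(",") if part.strip()]
    if not names:
        return None  # assumption: a value like ",," is also treated as empty

    sample = ", ".join(names[:SAMPLE_SIZE])
    extra = len(names) - SAMPLE_SIZE
    if extra > 0:
        sample += f", +{extra} more"
    ok_row = f"{len(names)} tool(s): {sample}"

    # Reminder fires only when nothing uses mcp_ but something looks structured.
    has_mcp = any(name.startswith("mcp_") for name in names)
    looks_structured = any("_" in name for name in names)
    reminder = None
    if not has_mcp and looks_structured:
        reminder = "no entry uses the mcp_<server>_<tool> prefix"
    return ok_row, reminder


if __name__ == "__main__":
    print(summarize_tools_subset(os.environ.get("HERMES_TOOLS_SUBSET")))
```

With `HERMES_TOOLS_SUBSET=silo_query,confer_run`, this sketch yields both a count row and the reminder, matching the post-merge check in the test plan.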

Out of scope (for this PR)

Cross-checking the parsed names against the live MCP tool registry would catch typos directly — but requires spinning up create_mcp_server() at probe time. Operators who want the cross-check can run hermes mcp list separately. Filing as opportunistic follow-up if demand surfaces.

Test plan

  • 8 new tests pass (silent-when-unset/empty/whitespace, count+sample, long-list-truncated, mcp-prefix-reminder fires, doesn't fire when prefix present, doesn't fire on bare simple names)
  • 30 total green across affected suites — no regression
  • After merge: HERMES_TOOLS_SUBSET=silo_query,confer_run hermes doctor shows row + reminder

Operators who narrow the tool surface via HERMES_TOOLS_SUBSET can
now confirm at ``hermes doctor`` time exactly which tools the
filter parsed to. Catches two failure modes that previously
required a separate ``hermes mcp list`` diff:

1. Operator mistyped a tool name → the typo still appears in the
   parsed list (no cross-check), but diffing that list against
   ``hermes mcp list`` is now trivial.
2. Operator forgot the ``mcp_<server>_<tool>`` prefix for MCP
   tools → no entry uses ``mcp_`` prefix but entries look
   structured → info reminder fires.

Silent when env var is unset/empty (silent-when-irrelevant pattern
from the #88/#53/#54 doctor probes). When set, surfaces:

  * check_ok with the count + a sample of names (first 6, then
    ``+N more`` suffix to keep the row readable);
  * check_info reminder when zero entries use the mcp_ prefix but
    some look structured (the most common parse-correctly-but-
    filter-nothing failure mode).

Cross-check against the live MCP registry was considered + rejected
for this PR — it would require spinning up ``create_mcp_server()``
at probe time. Operators can ``hermes mcp list`` separately if they
want the full diff. Filing as opportunistic follow-up if demand
shows up.

## Tests

- 8 new tests in tests/hermes_cli/test_doctor_tools_subset_probe.py:
  silent-when-unset / silent-when-empty / silent-when-whitespace /
  count-and-sample-shown / long-list-truncated / mcp-prefix-reminder
  / no-reminder-when-mcp-present / no-reminder-when-only-simple-bare-
  names.
- 30 total green across affected suites (probe + provider-env-probe
  + mcp_subset_filter). No regression.
PowerCreek merged commit 0a84b64 into main on May 25, 2026
PowerCreek deleted the doctor-tools-subset-probe branch on May 25, 2026 04:25
PowerCreek added a commit that referenced this pull request May 25, 2026
…vances #89 Direction A) (#98)

Operator-supplied intent override (Option A3 from #97). When
``HERMES_INTENT_OVERRIDE=code`` is set, the system prompt's
``stable`` layer narrows for tool-call-heavy traffic — addresses
#89's prompt-saturation symptom on mid-tier coding models.

## What narrows under code intent

| Block | Action | Why |
|---|---|---|
| SOUL.md | Skip | Largest single contributor; falls back to short DEFAULT_AGENT_IDENTITY floor |
| HERMES_AGENT_HELP_GUIDANCE | Skip | Off-topic for tool-call traffic |
| SKILLS_GUIDANCE | Skip | Per-tool block, off-topic for code |
| KANBAN_GUIDANCE | Skip | Worker-lifecycle, off-topic for code |
| SESSION_SEARCH_GUIDANCE | Skip | Off-topic for code |
| skills_prompt (the big one) | Skip | Biggest contributor when many skills loaded |
| MEMORY_GUIDANCE | **Keep** | Small + sometimes useful even for code |
| TOOL_USE_ENFORCEMENT_GUIDANCE | **Keep** | Critical for tool emission |
| Per-model operational guidance | **Keep** | Model-quality-specific |
| Env / platform hints | **Keep** | Execution-environment essentials |
| nous-subscription + computer-use + alibaba | **Keep** | Operational invariants |
| ``context`` + ``volatile`` layers | **Untouched** | Out of scope per #97 |

Other intents (``confer`` / ``planning`` / ``exploration`` /
``refinement`` / ``generic``) are recognized as valid but pass
through without narrowing in v1 (keeps the door open for per-
intent shape later).
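
A minimal sketch of the narrowing rule, assuming the ``stable`` layer can be modeled as a list of ``(name, text)`` blocks. The function names and that data shape are hypothetical. The six intent keys and the skipped block names are taken from the text and table above. The SOUL.md fallback to ``DEFAULT_AGENT_IDENTITY`` is not modeled here.

```python
VALID_INTENTS = frozenset(
    {"code", "confer", "planning", "exploration", "refinement", "generic"}
)

# Blocks the table marks "Skip" under code intent.
SKIPPED_UNDER_CODE = frozenset(
    {
        "SOUL.md",
        "HERMES_AGENT_HELP_GUIDANCE",
        "SKILLS_GUIDANCE",
        "KANBAN_GUIDANCE",
        "SESSION_SEARCH_GUIDANCE",
        "skills_prompt",
    }
)


def resolve_intent_override(raw):
    """Normalize a HERMES_INTENT_OVERRIDE value; None when unset or unknown."""
    if raw is None:
        return None
    value = raw.strip().lower()
    return value if value in VALID_INTENTS else None


def narrow_stable_blocks(blocks, intent):
    """Drop the skipped blocks under code intent; pass everything else through."""
    if intent != "code":
        return list(blocks)
    return [(name, text) for name, text in blocks if name not in SKIPPED_UNDER_CODE]
```

Keeping the resolver separate from the filter means a typo degrades to the default (un-narrowed) prompt rather than to a partially narrowed one.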

## Intent vocabulary

Matches devagentic#240's ``intent_classifier`` 6-key enum exactly,
so the same operator-side classifier that's wired into devagentic's
R5 dispatch hook can also drive hermes-side prompt narrowing
without a second vocabulary.

## Doctor probe

New ``_check_intent_override_env`` probe surfaces the active
override at ``hermes doctor`` time — silent when unset, check_ok
when valid (with a narrowing-active note for ``code``), check_warn
with the full valid-keys list when typo'd. Mirrors the silent-
when-irrelevant pattern from PR #95 / #96.
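
The probe's three outcomes can be sketched as below. The function name ``probe_intent_override``, the ``(level, message)`` return shape, and the message wording are assumptions for illustration; the real ``_check_intent_override_env`` may report through different helpers.

```python
VALID_INTENTS = (
    "code", "confer", "planning", "exploration", "refinement", "generic",
)


def probe_intent_override(raw):
    """Return None (silent), ("ok", msg), or ("warn", msg). Hypothetical shape."""
    if raw is None or not raw.strip():
        return None  # silent when unset or empty

    value = raw.strip().lower()
    if value not in VALID_INTENTS:
        valid = ", ".join(VALID_INTENTS)
        return ("warn", f"unknown intent {raw.strip()!r}; valid: {valid}")

    # Only `code` narrows in v1, so only it gets the extra note.
    note = " (stable-layer narrowing active)" if value == "code" else ""
    return ("ok", f"HERMES_INTENT_OVERRIDE={value}{note}")
```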

## Tests

- 22 new prompt-narrowing tests in
  ``tests/agent/test_system_prompt_intent_override.py``: resolver
  enum + normalization (5), per-section drops under code (7),
  pass-through for non-code intents (5), typo falls back (1),
  byte-count regression (1), default-still-includes counter-case (1),
  case-insensitive (1), runtime-vs-doctor-config sanity (1).
- 6 new doctor-probe tests in
  ``tests/hermes_cli/test_doctor_intent_override_probe.py``:
  silent-when-unset / silent-when-empty / code-ok-with-narrowing-note /
  non-code-valid-pass-through / typo-warn-with-valid-sample /
  case-insensitive.
- 258 total green across affected suites (system-prompt + prompt-
  builder + restore + doctor + provider-env + tools-subset). No
  regression in the existing prompt-shape pins.

## Composition note

Option A1 (port classifier) + A2 (devagentic GraphQL surface) are
deferred per the #97 sequencing — A3 unblocks deployment-specific
narrowing immediately; A1/A2 only matter when dynamic per-turn
classification is needed on the hermes side. The classifier output
on the devagentic side (devagentic#240) drives R5 dispatch decisions there.
PowerCreek added a commit that referenced this pull request May 27, 2026
#115) (#116)

Companion to devagentic#315 (initiative preamble). When operator
sets ``HERMES_TOOL_USE_ENFORCEMENT=required``, the chat_completions
transport injects ``tool_choice: "required"`` on every dispatch
where tools are attached — the model-layer enforcement that closes
the gap devagentic#315's soft-signal preamble leaves open.

## Behavior

- Unset / empty / unknown value → default behavior unchanged (no
  ``tool_choice`` injected by hermes)
- ``HERMES_TOOL_USE_ENFORCEMENT=required`` + tools attached →
  ``tool_choice: "required"`` set on the API kwargs
- Tools NOT attached → no injection (sending ``tool_choice=required``
  with empty tools is a 400 on most providers)
- Caller-supplied ``tool_choice`` already on kwargs → no override
  (the dispatcher-tier signal wins; env is a session-tier default)

Per devagentic#203 §1.3 — hermes owns model-call-shape decisions
(per-call enforcement). Devagentic's models.json
``default_tool_choice`` is the dispatcher-tier default; this env is
the session-tier override.

## Where it fires

Both build_kwargs paths in ``chat_completions.py``:
- Legacy fallback path (unregistered providers)
- Provider-profile path (known providers via providers/ registry)

Shared helper ``_maybe_inject_required_tool_choice(api_kwargs,
tools)`` keeps the two sites in sync.
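
The four rules under "Behavior" reduce to three guard clauses. The sketch below is an illustration under assumptions: the explicit ``environ`` parameter, the separate resolver function, and mutating ``api_kwargs`` in place are all hypothetical choices, and the actual helper in ``chat_completions.py`` may differ.

```python
import os


def resolve_tool_use_enforcement(environ=None):
    """Return "required" or None. Unknown values behave like unset."""
    env = os.environ if environ is None else environ
    value = (env.get("HERMES_TOOL_USE_ENFORCEMENT") or "").strip().lower()
    return "required" if value == "required" else None


def maybe_inject_required_tool_choice(api_kwargs, tools, environ=None):
    """Set tool_choice="required" only when it is safe and requested."""
    if resolve_tool_use_enforcement(environ) != "required":
        return api_kwargs  # unset / empty / unknown: default behavior
    if not tools:
        return api_kwargs  # None or empty list: avoid provider-side 400s
    if "tool_choice" in api_kwargs:
        return api_kwargs  # caller-supplied value wins over the env default
    api_kwargs["tool_choice"] = "required"
    return api_kwargs
```

Routing both build paths through one function like this is what keeps the legacy and provider-profile sites from drifting apart.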

## Doctor probe

New ``_check_tool_use_enforcement_env`` surfaces the active setting
— silent when unset, ``check_ok`` on ``required``, ``check_warn``
with valid-values hint on typos. Mirrors the silent-when-irrelevant
pattern from #95 / #96 / persona-deferred.

## Tests

- 18 new tests in tests/agent/test_tool_use_enforcement.py:
  resolver returns None/required/case-insensitive/unknown (8
  parametrized), injection happy path (1), no-inject-when-unset (1),
  no-inject-when-no-tools (1 covering both None and empty list),
  does-not-clobber-existing-tool_choice (1), no-inject-on-unknown
  (1), doctor silent-when-unset (1), doctor check_ok on required
  (1), doctor check_warn on unknown (1).
- 128 total green across affected suites (new + doctor + provider/
  intent/persona/tools-subset probes). No regression.

## Sequencing per #115 body

The issue says "Land after devagentic#315 Phase 1 has deployed +
been observed. If the preamble alone closes the reliability gap to
operator satisfaction, this issue may not need to ship."

This PR ships the env-knob in opt-in OFF-by-default mode, so:
- Operators can enable it the moment they observe that devagentic#315's
  preamble is insufficient (no further hermes-side dev cycle needed)
- Default behavior unchanged → zero risk to non-client-tier sessions
- Doctor probe surfaces the active state so operators can confirm
  enablement at boot

This saves the round-trip of waiting for the signal and only then
starting development.
PowerCreek added a commit that referenced this pull request May 27, 2026
After v0.18.4's tool_call recovery (#124) landed, the next-level
bug surfaced in sandbox field-test: model calls a tool name
hermes' worker didn't register, the invalid_tool_call retry path
fires, but its verbose-only print is invisible in default runs.
Combined with model hallucination ("the file has been created..."
narration on the NEXT turn), the mismatch becomes invisible —
operators see model narration, not the underlying tool-name
mismatch.

## Fix

Upgrade conversation_loop.py:3219's verbose-only print to:

1. ``logger.warning`` with the invented name + count + first 10
   registered names + model + provider for cross-system log
   correlation
2. ``agent._emit_status`` surfacing the mismatch in the user-
   facing stream

Operator immediately sees:
- WHICH name the model invented
- HOW MANY tools the worker has registered
- WHICH tools (sample) ARE registered
- WHICH retry (out of 3) hit the mismatch

No behavior change — existing invalid_tool_call retry semantics
unchanged. Pure observability boost.
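
A rough sketch of the two-channel report, assuming a plain callable stands in for ``agent._emit_status``. The function name, parameter list, and message templates are hypothetical; only the reported fields (invented name, registered count, first 10 names, model, provider, retry out of 3) come from the description above.

```python
import logging

logger = logging.getLogger(__name__)


def report_invalid_tool_call(
    name, registered, model, provider, attempt, emit_status, max_retries=3
):
    """Log a warning and emit a user-facing status for an unknown tool name."""
    registered = list(registered)
    sample = ", ".join(registered[:10])  # first 10 registered names

    # Structured warning for cross-system log correlation.
    logger.warning(
        "invalid tool call: model=%s provider=%s invented=%r "
        "registered=%d sample=[%s] retry=%d/%d",
        model, provider, name, len(registered), sample, attempt, max_retries,
    )

    # Same facts, surfaced in the user-facing stream.
    message = (
        f"Model called unknown tool {name!r}; "
        f"{len(registered)} tools registered (e.g. {sample}). "
        f"Retry {attempt}/{max_retries}."
    )
    emit_status(message)
    return message
```

The retry decision itself stays with the caller, consistent with the "no behavior change" note.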

## Tests

- 3 new source-level tests in
  tests/agent/test_loud_invalid_tool_call.py: patch-landed,
  emit_status template includes name + count, warning includes
  model + provider for correlation.
- 20 total green across affected suites — no regression.

## Composition

Same observability family as the #95 / #96 doctor probes. Helps
operators distinguish "hermes ate the tool_call" from "sandbox
toolset doesn't expose what the model is calling".
