Skip to content

feat(middleware): add adaptive middleware to hermes-agent, consumed by NeMo-Relay#29724

Merged
kshitijk4poor merged 3 commits into
NousResearch:mainfrom
bbednarski9:bbednarski/nmf-41B-nemoflow-plugin
Jun 6, 2026
Merged

feat(middleware): add adaptive middleware to hermes-agent, consumed by NeMo-Relay#29724
kshitijk4poor merged 3 commits into
NousResearch:mainfrom
bbednarski9:bbednarski/nmf-41B-nemoflow-plugin

Conversation

@bbednarski9

@bbednarski9 bbednarski9 commented May 21, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

This PR adds the middleware half of the Hermes Agent execution-intercept work, stacked on top of the now-merged observability contract from #38232.

#38232 established low-overhead telemetry hooks for watching what Hermes does. This PR adds opt-in middleware intercepts for changing or owning execution:

  • llm_request and tool_request middleware can inspect and rewrite payloads before execution.
  • llm_execution and tool_execution middleware can wrap or replace the actual downstream LLM/tool call.
  • NeMo-Relay is now an in-repo consumer of the execution middleware contract, so the middleware layer is validated against a real adaptive execution use case rather than only test plugins.
  • The no-listener path is explicitly gated with has_middleware(kind) and returns by reference, so default uninstrumented Hermes runs do not pay request-copy or execution-chain overhead.

The intent is to keep the two layers distinct:

Related Issue

Type of Change

  • New feature
  • Refactor
  • Documentation
  • Tests
  • Bug fix
  • Breaking change

Changes Made

Middleware Registry And Contract

  • Added middleware registration and lookup alongside plugin hooks in hermes_cli/plugins.py.
  • Added has_middleware(kind) so hot-path call sites can cheaply skip middleware work when no consumer is registered.
  • Added middleware entry points in hermes_cli/middleware.py:
    • apply_llm_request_middleware
    • apply_tool_request_middleware
    • run_llm_execution_middleware
    • run_tool_execution_middleware
  • Kept backward-compatible aliases for the old API request naming where needed.

Request Middleware

  • llm_request middleware can rewrite provider request payloads before the API call.
  • tool_request middleware can rewrite tool arguments before tool execution.
  • No-listener request middleware returns by reference without deepcopy.
  • The changed flag is based on bool(trace) instead of structural deep-equality comparisons across copied payloads.

Execution Middleware

  • llm_execution middleware can wrap or replace the downstream LLM call.
  • tool_execution middleware can wrap or replace the downstream tool call.
  • The execution chain distinguishes downstream failures from middleware failures with _DownstreamExecutionError, so a failing tool/LLM call is not retried by middleware chaining.
  • Middleware that raises after a successful next_call() no longer causes the terminal LLM/tool call to execute twice.

Hermes Call Sites

NeMo-Relay Consumer

  • Extended plugins/observability/nemo_relay from a telemetry-only observer into a real adaptive execution consumer.
  • NeMo-Relay can now opt into managed LLM execution through llm_execution.
  • NeMo-Relay can now opt into managed tool execution through tool_execution.
  • When managed execution is disabled, the middleware falls through to Hermes with next_call(...).
  • The plugin still records observer telemetry for sessions, turns, API calls, tools, approvals, subagents, and final ATIF export.

Documentation

  • Added middleware documentation in docs/middleware/README.md.
  • Updated the NeMo-Relay plugin README to describe both telemetry hooks and execution middleware.

Tests

  • Added plugin registry and middleware contract tests.
  • Added no-listener by-reference fast-path tests.
  • Added request-change tracing tests.
  • Added downstream failure tests to verify middleware does not retry failing LLM/tool calls.
  • Added post-next_call middleware failure tests to verify terminal execution is not repeated.
  • Added NeMo-Relay plugin tests for the execution middleware paths.
  • Updated model tool and run-agent tests for the new middleware behavior.

Suggested Review Order

  1. hermes_cli/plugins.py and hermes_cli/middleware.py

    • Review the shared middleware contract, registration API, no-listener gates, request middleware result shape, and execution-chain semantics.
  2. Hermes hot-path call sites

    • agent/conversation_loop.py
    • model_tools.py
    • agent/tool_executor.py
    • agent/agent_runtime_helpers.py
    • Confirm middleware is gated before expensive work and that observability hooks remain separately gated.
  3. NeMo-Relay consumer

    • plugins/observability/nemo_relay/__init__.py
    • plugins/observability/nemo_relay/README.md
    • Confirm managed LLM/tool execution falls through correctly when disabled and captures adaptive execution when enabled.
  4. Tests and docs

    • tests/hermes_cli/test_plugins.py
    • tests/plugins/test_nemo_relay_plugin.py
    • tests/test_model_tools.py
    • tests/run_agent/test_run_agent.py
    • docs/middleware/README.md

How to Test

Targeted test run used locally:

uv run pytest tests/hermes_cli/test_plugins.py tests/test_model_tools.py tests/run_agent/test_run_agent.py tests/plugins/test_nemo_relay_plugin.py

Result:

491 passed in 90.41s

Useful manual checks:

# Confirm middleware registry / contract tests.
uv run pytest tests/hermes_cli/test_plugins.py

# Confirm model tool middleware behavior.
uv run pytest tests/test_model_tools.py

# Confirm run-agent integration behavior.
uv run pytest tests/run_agent/test_run_agent.py

# Confirm NeMo-Relay observer + execution middleware behavior.
uv run pytest tests/plugins/test_nemo_relay_plugin.py

Checklist

Code

  • Middleware registration is implemented through the plugin manager.
  • Middleware presence checks avoid hot-path work when no consumer is registered.
  • Request middleware returns by reference on the no-listener path.
  • Execution middleware returns directly to Hermes on the no-listener path.
  • Request changed state avoids structural deep-equality comparisons.
  • Downstream execution failures are not retried by the middleware chain.
  • Middleware failures after next_call() do not re-run the terminal LLM/tool call.
  • NeMo-Relay is wired as an in-repo execution middleware consumer.
  • Backward-compatible aliases are preserved where needed.

Documentation & Housekeeping

Screenshots / Logs

No UI screenshots.

Hot-path middleware overhead was checked after the no-listener gates were added:

Payload Objects Old modeled apply_llm Current no-listener apply_llm No-listener run_llm No-listener apply_tool No-listener run_tool
100 KB 100 0.553 ms 0.000458 ms 0.000250 ms 0.000417 ms 0.000250 ms
1 MB 1024 5.999 ms 0.000458 ms 0.000250 ms 0.000458 ms 0.000291 ms
5 MB 5120 31.002 ms 0.000458 ms 0.000291 ms 0.000458 ms 0.000271 ms
10 MB 10240 64.149 ms 0.000562 ms 0.000291 ms 0.000458 ms 0.000292 ms

Interpretation:

  • Default no-listener Hermes only pays cheap middleware presence checks.
  • Request copying and middleware-chain setup are only paid when a middleware consumer is registered.
  • NeMo-Relay managed execution is an explicit opt-in middleware cost.

@alt-glitch alt-glitch added type/feature New feature or request comp/agent Core agent loop, run_agent.py, prompt builder comp/cli CLI entry point, hermes_cli/, setup wizard comp/gateway Gateway runner, session dispatch, delivery comp/plugins Plugin system and bundled plugins P3 Low — cosmetic, nice to have telemetry Touches outbound telemetry, usage attribution, or analytics — needs opt-in gating before merge labels May 21, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

This is phase 2 of the NeMo-Flow stacked change, depending on the middleware hooks from PR #29722. Supersedes the combined PR #29551. Related: draft #28611, unified telemetry tracking #6642. P3 per plugin-provider-demotion taste rule.

@bbednarski9 bbednarski9 force-pushed the bbednarski/nmf-41B-nemoflow-plugin branch from 4f5be2f to d93e059 Compare May 21, 2026 18:57
@bbednarski9 bbednarski9 changed the title feat(observability): add optional NeMo-Flow telemetry plugin feat(observability): add optional NeMo-Relay telemetry plugin May 22, 2026
@bbednarski9 bbednarski9 changed the title feat(observability): add optional NeMo-Relay telemetry plugin feat(observability): add optional NeMo-Flow telemetry plugin May 22, 2026
@bbednarski9 bbednarski9 force-pushed the bbednarski/nmf-41B-nemoflow-plugin branch from 520bd9e to c595a54 Compare May 22, 2026 18:23
@bbednarski9 bbednarski9 marked this pull request as ready for review May 22, 2026 18:27
@kshitijk4poor

Copy link
Copy Markdown
Collaborator

Blocked on #29722 review — left detailed notes on the observer/middleware split there: #29722 (comment). This PR is unblocked once the observer half of #29722 lands.

@bbednarski9 bbednarski9 force-pushed the bbednarski/nmf-41B-nemoflow-plugin branch from c595a54 to d9ac183 Compare June 2, 2026 00:06
@bbednarski9 bbednarski9 requested a review from a team June 2, 2026 00:06
@bbednarski9 bbednarski9 changed the title feat(observability): add optional NeMo-Flow telemetry plugin feat(middleware): add adaptive middleware to hermes-agent, consumed by NeMo-Relay Jun 2, 2026
@bbednarski9 bbednarski9 force-pushed the bbednarski/nmf-41B-nemoflow-plugin branch from b24a720 to 0367582 Compare June 2, 2026 21:44

@kshitijk4poor kshitijk4poor left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the middleware half — the llm_execution / tool_execution wiring with NeMo-Relay as a real consumer is exactly what we wanted to see, and it's what closes the "no in-repo consumer" gap from the #29722 round. _DownstreamExecutionError is a good touch too — it correctly keeps a failing tool/LLM call from being retried by the chain.

A few things to sort before this lands, all on the hot path or the shared contract.

Critical

1. Middleware isn't gated on a registered consumer — re-introduces the per-call cost we just removed.

apply_llm_request_middleware runs deepcopy(request) twice on every API call before it ever checks whether any middleware is registered (hermes_cli/middleware.py:67-68), and it's called unconditionally at agent/conversation_loop.py:1231. That's the same per-call cost we gated out of the observability PR — and the pre_api_request hook ~25 lines below it (conversation_loop.py:1256) is already gated with has_hook(), with a comment about not wanting to walk every tool result + base64 image on every call. There's no has_middleware() yet, so all four apply_* / run_* entry points pay this unconditionally. Same on the tool path: apply_tool_request_middleware at model_tools.py:976 and agent/tool_executor.py:197 deepcopies tool args on every tool call.

Fix: add has_middleware(kind) mirroring has_hook(), gate the apply_* / run_* entry points, and return a by-reference no-op result when nothing's registered (same shape as the hook gating in the observability PR).

2. changed = current_request != original_request (hermes_cli/middleware.py:88 and :125) is a full structural deep-equality compare of two deepcopies of the same source — guaranteed False whenever no middleware rewrote the payload. Use changed = bool(trace).

Warning

3. Double-execution if a middleware raises after calling next_call.

_run_execution_chain (hermes_cli/middleware.py:195-236) handles a downstream failure correctly via _DownstreamExecutionError — verified, and test_execution_middleware_does_not_retry_downstream_failure covers it. But if a middleware calls next_call() successfully and then raises in its own post-processing, that exception falls into the generic except Exception at :227, which retries from call_at(index + 1) and re-runs the terminal call — so the tool or LLM call fires twice.

The bundled NeMo-Relay consumer is well-behaved so it won't hit this, but since this is the public contract for arbitrary third-party middleware it's worth closing now. A flag tracking whether next_call already fired in the frame (don't re-run the terminal if so) handles it, plus a test for the post-next_call-raise case alongside the existing downstream one.

Structural (not a code issue, but affects merge)

This PR currently carries the full observability diff layered under the middleware — it touches the same hot-path files as #29722 (conversation_loop.py, model_tools.py, agent_runtime_helpers.py). Once the observability half lands, this will conflict. Recommend rebasing down to just the middleware delta on top of the landed observability — it'll also make the actual change much easier to review (right now the real delta is buried under changes we've already reviewed).

Looks good

  • llm_execution / tool_execution wiring with a real consumer — validates the contract shape against an actual use case.
  • _DownstreamExecutionError correctly distinguishes downstream failures from middleware failures for the pre-next_call case.
  • Back-compat aliases (api_requestllm_request) are clean.
  • All targeted tests pass.

None of this is hard — items 1 and 2 mostly mirror the gating pattern already in the observability PR. Happy to re-review quickly once these are in.

Signed-off-by: Bryan Bednarski <bbednarski@nvidia.com>
@bbednarski9 bbednarski9 force-pushed the bbednarski/nmf-41B-nemoflow-plugin branch from ec1285f to 2e0c908 Compare June 3, 2026 18:27
@bbednarski9

Copy link
Copy Markdown
Contributor Author
  • Rebased the PR down to a single middleware/adaptive commit on current main, after the observer half landed in perf(observability): observer telemetry hooks + NeMo-Relay plugin, gated tool emit (salvage #38190/#29722) #38232
  • Added has_middleware(kind) mirroring has_hook().
  • Gated apply_* and run_* middleware entry points so the no-listener path returns by reference without deepcopy/callback setup.
  • Changed request middleware changed tracking to bool(trace).
  • Fixed the post-next_call middleware exception path so it does not re-run the downstream tool/LLM call.
  • Added regression coverage for no-listener fast path, changed = bool(trace), and post-next_call exceptions.

Responses to individual comments

1. Middleware isn't gated on a registered consumer

Thanks for pointing this out - I had forgotten to propagate these changes into this PR. Super helpful.

Resolved and measured on the restacked branch.

I added has_middleware(kind) mirroring has_hook() and moved the middleware checks ahead of any deepcopy() or execution-chain setup. With no middleware registered, the request middleware helpers now return by reference:

  • payload is request / original_payload is request
  • payload is args / original_payload is args
  • changed=False
  • trace=[]

The execution middleware helpers also bypass chain setup and call the downstream callback directly when no callbacks are registered.

I also benchmarked the no-consumer path with chat-shaped payloads. I used many message/tool-result dicts rather than one giant string, because deepcopy() reuses immutable string objects and under-represents traversal cost.

Payload Objects Old modeled apply_llm Current no-listener apply_llm No-listener run_llm No-listener apply_tool No-listener run_tool
100 KB 100 0.553 ms 0.000458 ms 0.000250 ms 0.000417 ms 0.000250 ms
1 MB 1024 5.999 ms 0.000458 ms 0.000250 ms 0.000458 ms 0.000291 ms
5 MB 5120 31.002 ms 0.000458 ms 0.000291 ms 0.000458 ms 0.000271 ms
10 MB 10240 64.149 ms 0.000562 ms 0.000291 ms 0.000458 ms 0.000292 ms

Interpretation: the default install/no-consumer path is effectively just the registry presence check and is flat with payload size. If request middleware is actually registered, that remains an explicit opt-in behavior-changing path and still pays the defensive copy/traversal cost. I added regression coverage for the no-listener by-reference behavior so this does not regress.

2. Deep equality compare

I changed request middleware result tracking from structural deep equality to trace-based tracking:

changed=bool(trace)

That applies to both LLM request middleware and tool request middleware. This avoids walking the full request/args payload a third time after the defensive copies, and it also reflects the actual middleware contract: a request is considered changed when a middleware callback returned a valid replacement payload and produced a trace entry.

I added regression coverage for the edge case where middleware returns an equal/same-shaped payload. That now reports changed=True because middleware did run and supplied a replacement, without relying on expensive structural comparison.

3. Double-execution if a middleware raises after calling next_call

I updated _run_execution_chain to track whether next_call() was already invoked in the current middleware frame. If the middleware raises after a successful downstream call, the generic middleware-error handler now returns the already-computed downstream result instead of retrying call_at(index + 1).

I also added regression coverage for this exact case:

def test_execution_middleware_post_next_call_error_does_not_retry(...)

The test verifies the terminal callback is called exactly once and the original downstream result is preserved.

@bbednarski9 bbednarski9 requested a review from kshitijk4poor June 4, 2026 05:51
@kshitijk4poor

Copy link
Copy Markdown
Collaborator

Reviewed the restacked commit 2e0c908 end-to-end (read the contract + all four hot-path call sites, ran the 4 targeted suites → 491 passed, plus an adversarial pass with a second reviewer + standalone repros). The three items from my earlier notes are genuinely resolved and each has a real regression test — nice. One new High-severity hole that the post-next_call fix opened up, with a small fix.

✅ Confirmed resolved

  • Consumer gatinghas_middleware(kind) mirrors has_hook(); no-listener path returns by reference (payload is request verified, locked by test_middleware_helpers_skip_no_listener_work). Benchmarks match.
  • changed = bool(trace) — correct and intentional; can't produce a false changed=False. (Minor: .changed currently has no runtime consumer — only asserted in tests. Fine to leave, just flagging.)
  • Post-next_call no automatic re-run — the terminal can't run twice on the automatic path; covered by test_execution_middleware_post_next_call_error_does_not_retry.
  • Also nice: invoke_tool → handle_function_call passes skip_tool_request_middleware=True + forwards the trace, so request middleware doesn't double-apply on the agent-loop path.

🔴 High — downstream failure silently masked as None (hermes_cli/middleware.py:244, 265-266)

The post-next_call fix opened the mirror-image bug. next_called is set True at line 244 before the recursion at line 246 can fail, but next_result is only assigned after a successful recursion. So when:

  1. middleware calls next_call(),
  2. the downstream LLM/tool call genuinely raisesnext_result stays None, next_called is already True,
  3. middleware catches the _DownstreamExecutionError and raises its own error (normal error-translation pattern),
  4. line 258 catches it, next_called is True → line 266 returns next_result = None.

A real provider/tool failure gets reported as a successful empty result. Reproduced standalone:

RESULT: None
terminal_runs: 1   # terminal ran once and failed; chain returned None instead of raising

This path isn't covered — test_execution_middleware_does_not_retry_downstream_failure only exercises a middleware that lets the error propagate; the catch-and-reraise case is the gap.

Minimal fix — track downstream success, not just invocation:

next_succeeded = False
def next_call(next_payload=None):
    nonlocal next_called, next_succeeded, next_result
    next_called = True
    next_result = call_at(index + 1, payload if next_payload is None else next_payload)
    next_succeeded = True
    return next_result
...
except Exception as exc:
    logger.warning(...)
    if next_succeeded:          # was: if next_called
        return next_result
    if next_called:             # downstream failed + middleware re-raised → propagate
        raise
    return call_at(index + 1, payload)

Worth a regression test where the downstream raises and the middleware catches-then-reraises (asserts the error propagates, not None).

🟡 Optional (non-blocking)

  • next_call isn't single-use — a middleware that deliberately calls next_call() twice runs the terminal twice. Doesn't violate the "no automatic retry" claim (that holds), but if the intended contract is "terminal executes at most once," it needs a single-use guard.
  • deepcopy of the LLM request (lines 75-76, 90) — unsafe if api_kwargs ever carries non-deepcopyable/stateful objects (clients, callbacks, file handles). Guarded by the caller try/except against hard failures, but a silent client-clone would slip through. Tool path is safe (plain JSON args). A deepcopy-safety comment, or shallow-copy of the known-mutated keys, would harden it.

Doesn't strictly block merge today — the High trigger needs a catch-and-reraise consumer, and the only in-repo consumer (NeMo-Relay) calls its own captured next_call rather than the chain wrapper. But it's a real latent hole in a brand-new control-plane contract and the fix is ~4 lines + a test, so I'd rather get it right now than paper over it later. Everything else looks good.

  Track successful next_call completion separately from invocation so execution
  middleware that catches and translates a downstream provider/tool failure does
  not accidentally convert that failure into a successful None result.

  Also avoid wrapping BaseException from downstream execution, and document the
  execution middleware error semantics.

  Tests cover:
  - pre-next_call middleware failures fail open to the remaining chain
  - post-next_call middleware failures preserve the downstream result
  - translated downstream failures propagate instead of returning None
  - downstream BaseException is not wrapped

Signed-off-by: Bryan Bednarski <bbednarski@nvidia.com>
@bbednarski9

Copy link
Copy Markdown
Contributor Author

@kshitijk4poor Thanks for the follow-up. I addressed your comments and added some additional test cases and documentation.

What changed:

  • _run_execution_chain now tracks next_succeeded separately from next_called.
  • next_succeeded is set only after downstream execution returns successfully.
  • If middleware calls next_call(), downstream fails, and the middleware catches/translates that failure, Hermes now propagates the translated error instead of returning None.
  • If middleware calls next_call() successfully and later raises during post-processing, Hermes still preserves the downstream result and does not re-run the provider/tool.
  • next_call() now wraps ordinary Exception downstream failures, not BaseException, so interrupts/cancellation-style base exceptions are not accidentally converted into normal middleware errors.

I added regression coverage for the edge cases around this contract:

  • test_execution_middleware_translated_downstream_failure_is_not_masked
  • test_execution_middleware_pre_next_call_error_fails_open_to_remaining_chain
  • test_execution_middleware_downstream_base_exception_is_not_wrapped

I also updated docs/middleware/README.md to spell out the execution middleware error semantics for plugin authors:

  • pre-next_call middleware failures fail open to the remaining chain/base execution;
  • post-success middleware failures preserve the downstream result and do not re-run execution;
  • downstream provider/tool failures can propagate or be deliberately translated, but are not converted into a successful None result.

Validation:

99 passed in 3.60s

Targeted command:

.venv/bin/python -m pytest tests/hermes_cli/test_plugins.py tests/plugins/test_nemo_relay_plugin.py -q

…opies

Address the two non-blocking follow-ups from review:

- next_call is now single-use per middleware frame. A second invocation
  raises instead of silently re-running the downstream provider/tool, so
  the terminal call cannot execute twice via the chain. The error surfaces
  through the existing handler, which preserves the first downstream result.
- Request-middleware payload copies go through _safe_copy(), which falls
  back to a shallow dict copy when deepcopy() fails on a non-deepcopyable
  member (clients, callbacks, file handles) instead of aborting the pass.

Adds regression coverage for both: double next_call() keeps the terminal
single-run, and a non-deepcopyable (threading.Lock) request payload still
runs middleware via the shallow fallback.

@kshitijk4poor kshitijk4poor left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Picked up the two non-blocking follow-ups myself and pushed them onto the branch (c4c5548, fast-forward on top of 5abe456):

  • next_call is now single-use per middleware frame — a second invocation raises instead of silently re-running the downstream provider/tool, so the terminal call can't execute twice via the chain. The error surfaces through the existing handler, which preserves the first (successful) downstream result.
  • Request-middleware copies go through _safe_copy() — falls back to a shallow dict copy when deepcopy() fails on a non-deepcopyable member (clients, callbacks, file handles) instead of aborting the request-middleware pass. Tool args stay plain JSON so they're unaffected; this just hardens the LLM-request path.

Added regression coverage for both:

  • test_execution_middleware_double_next_call_does_not_run_terminal_twice — double next_call() keeps the terminal single-run and returns the first result.
  • test_request_middleware_tolerates_non_deepcopyable_payload — a threading.Lock in the payload (which deepcopy can't pickle) still runs middleware via the shallow fallback, with the lock shared by reference.

Verified locally — full targeted suite green:

496 passed
(scripts/run_tests.sh tests/hermes_cli/test_plugins.py tests/test_model_tools.py tests/run_agent/test_run_agent.py tests/plugins/test_nemo_relay_plugin.py)

The earlier High (translated-downstream-failure masked as None) is fixed and verified across all 7 execution-chain paths, the new error semantics are documented, and the two optional items are now closed too. Approving. 🚢

@kshitijk4poor kshitijk4poor enabled auto-merge June 6, 2026 17:39
@bbednarski9

Copy link
Copy Markdown
Contributor Author

Thanks @kshitijk4poor

@kshitijk4poor kshitijk4poor merged commit d4a7bfd into NousResearch:main Jun 6, 2026
30 of 31 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder comp/cli CLI entry point, hermes_cli/, setup wizard comp/gateway Gateway runner, session dispatch, delivery comp/plugins Plugin system and bundled plugins P3 Low — cosmetic, nice to have telemetry Touches outbound telemetry, usage attribution, or analytics — needs opt-in gating before merge type/feature New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants