feat(middleware): add adaptive middleware to hermes-agent, consumed by NeMo-Relay#29724
Conversation
4f5be2f to
d93e059
Compare
520bd9e to
c595a54
Compare
|
Blocked on #29722 review — left detailed notes on the observer/middleware split there: #29722 (comment). This PR is unblocked once the observer half of #29722 lands. |
c595a54 to
d9ac183
Compare
b24a720 to
0367582
Compare
kshitijk4poor
left a comment
There was a problem hiding this comment.
Reviewed the middleware half — the llm_execution / tool_execution wiring with NeMo-Relay as a real consumer is exactly what we wanted to see, and it's what closes the "no in-repo consumer" gap from the #29722 round. _DownstreamExecutionError is a good touch too — it correctly keeps a failing tool/LLM call from being retried by the chain.
A few things to sort before this lands, all on the hot path or the shared contract.
Critical
1. Middleware isn't gated on a registered consumer — re-introduces the per-call cost we just removed.
apply_llm_request_middleware runs deepcopy(request) twice on every API call before it ever checks whether any middleware is registered (hermes_cli/middleware.py:67-68), and it's called unconditionally at agent/conversation_loop.py:1231. That's the same per-call cost we gated out of the observability PR — and the pre_api_request hook ~25 lines below it (conversation_loop.py:1256) is already gated with has_hook(), with a comment about not wanting to walk every tool result + base64 image on every call. There's no has_middleware() yet, so all four apply_* / run_* entry points pay this unconditionally. Same on the tool path: apply_tool_request_middleware at model_tools.py:976 and agent/tool_executor.py:197 deepcopies tool args on every tool call.
Fix: add has_middleware(kind) mirroring has_hook(), gate the apply_* / run_* entry points, and return a by-reference no-op result when nothing's registered (same shape as the hook gating in the observability PR).
2. changed = current_request != original_request (hermes_cli/middleware.py:88 and :125) is a full structural deep-equality compare of two deepcopies of the same source — guaranteed False whenever no middleware rewrote the payload. Use changed = bool(trace).
Warning
3. Double-execution if a middleware raises after calling next_call.
_run_execution_chain (hermes_cli/middleware.py:195-236) handles a downstream failure correctly via _DownstreamExecutionError — verified, and test_execution_middleware_does_not_retry_downstream_failure covers it. But if a middleware calls next_call() successfully and then raises in its own post-processing, that exception falls into the generic except Exception at :227, which retries from call_at(index + 1) and re-runs the terminal call — so the tool or LLM call fires twice.
The bundled NeMo-Relay consumer is well-behaved so it won't hit this, but since this is the public contract for arbitrary third-party middleware it's worth closing now. A flag tracking whether next_call already fired in the frame (don't re-run the terminal if so) handles it, plus a test for the post-next_call-raise case alongside the existing downstream one.
Structural (not a code issue, but affects merge)
This PR currently carries the full observability diff layered under the middleware — it touches the same hot-path files as #29722 (conversation_loop.py, model_tools.py, agent_runtime_helpers.py). Once the observability half lands, this will conflict. Recommend rebasing down to just the middleware delta on top of the landed observability — it'll also make the actual change much easier to review (right now the real delta is buried under changes we've already reviewed).
Looks good
llm_execution/tool_executionwiring with a real consumer — validates the contract shape against an actual use case._DownstreamExecutionErrorcorrectly distinguishes downstream failures from middleware failures for the pre-next_callcase.- Back-compat aliases (
api_request→llm_request) are clean. - All targeted tests pass.
None of this is hard — items 1 and 2 mostly mirror the gating pattern already in the observability PR. Happy to re-review quickly once these are in.
Signed-off-by: Bryan Bednarski <bbednarski@nvidia.com>
ec1285f to
2e0c908
Compare
Responses to individual comments1. Middleware isn't gated on a registered consumerThanks for pointing this out - I had forgotten to propagate these changes into this PR. Super helpful. Resolved and measured on the restacked branch. I added
The execution middleware helpers also bypass chain setup and call the downstream callback directly when no callbacks are registered. I also benchmarked the no-consumer path with chat-shaped payloads. I used many message/tool-result dicts rather than one giant string, because
Interpretation: the default install/no-consumer path is effectively just the registry presence check and is flat with payload size. If request middleware is actually registered, that remains an explicit opt-in behavior-changing path and still pays the defensive copy/traversal cost. I added regression coverage for the no-listener by-reference behavior so this does not regress. 2. Deep equality compareI changed request middleware result tracking from structural deep equality to trace-based tracking: That applies to both LLM request middleware and tool request middleware. This avoids walking the full request/args payload a third time after the defensive copies, and it also reflects the actual middleware contract: a request is considered changed when a middleware callback returned a valid replacement payload and produced a trace entry. I added regression coverage for the edge case where middleware returns an equal/same-shaped payload. That now reports changed=True because middleware did run and supplied a replacement, without relying on expensive structural comparison. 3. Double-execution if a middleware raises after calling
|
|
Reviewed the restacked commit ✅ Confirmed resolved
🔴 High — downstream failure silently masked as
|
Track successful next_call completion separately from invocation so execution middleware that catches and translates a downstream provider/tool failure does not accidentally convert that failure into a successful None result. Also avoid wrapping BaseException from downstream execution, and document the execution middleware error semantics. Tests cover: - pre-next_call middleware failures fail open to the remaining chain - post-next_call middleware failures preserve the downstream result - translated downstream failures propagate instead of returning None - downstream BaseException is not wrapped Signed-off-by: Bryan Bednarski <bbednarski@nvidia.com>
|
@kshitijk4poor Thanks for the follow-up. I addressed your comments and added some additional test cases and documentation. What changed:
I added regression coverage for the edge cases around this contract:
I also updated
Validation: Targeted command: .venv/bin/python -m pytest tests/hermes_cli/test_plugins.py tests/plugins/test_nemo_relay_plugin.py -q |
…opies Address the two non-blocking follow-ups from review: - next_call is now single-use per middleware frame. A second invocation raises instead of silently re-running the downstream provider/tool, so the terminal call cannot execute twice via the chain. The error surfaces through the existing handler, which preserves the first downstream result. - Request-middleware payload copies go through _safe_copy(), which falls back to a shallow dict copy when deepcopy() fails on a non-deepcopyable member (clients, callbacks, file handles) instead of aborting the pass. Adds regression coverage for both: double next_call() keeps the terminal single-run, and a non-deepcopyable (threading.Lock) request payload still runs middleware via the shallow fallback.
kshitijk4poor
left a comment
There was a problem hiding this comment.
Picked up the two non-blocking follow-ups myself and pushed them onto the branch (c4c5548, fast-forward on top of 5abe456):
next_callis now single-use per middleware frame — a second invocation raises instead of silently re-running the downstream provider/tool, so the terminal call can't execute twice via the chain. The error surfaces through the existing handler, which preserves the first (successful) downstream result.- Request-middleware copies go through
_safe_copy()— falls back to a shallow dict copy whendeepcopy()fails on a non-deepcopyable member (clients, callbacks, file handles) instead of aborting the request-middleware pass. Tool args stay plain JSON so they're unaffected; this just hardens the LLM-request path.
Added regression coverage for both:
test_execution_middleware_double_next_call_does_not_run_terminal_twice— doublenext_call()keeps the terminal single-run and returns the first result.test_request_middleware_tolerates_non_deepcopyable_payload— athreading.Lockin the payload (whichdeepcopycan't pickle) still runs middleware via the shallow fallback, with the lock shared by reference.
Verified locally — full targeted suite green:
496 passed
(scripts/run_tests.sh tests/hermes_cli/test_plugins.py tests/test_model_tools.py tests/run_agent/test_run_agent.py tests/plugins/test_nemo_relay_plugin.py)
The earlier High (translated-downstream-failure masked as None) is fixed and verified across all 7 execution-chain paths, the new error semantics are documented, and the two optional items are now closed too. Approving. 🚢
|
Thanks @kshitijk4poor |
What does this PR do?
This PR adds the middleware half of the Hermes Agent execution-intercept work, stacked on top of the now-merged observability contract from #38232.
#38232 established low-overhead telemetry hooks for watching what Hermes does. This PR adds opt-in middleware intercepts for changing or owning execution:
llm_requestandtool_requestmiddleware can inspect and rewrite payloads before execution.llm_executionandtool_executionmiddleware can wrap or replace the actual downstream LLM/tool call.has_middleware(kind)and returns by reference, so default uninstrumented Hermes runs do not pay request-copy or execution-chain overhead.The intent is to keep the two layers distinct:
Related Issue
main.Type of Change
Changes Made
Middleware Registry And Contract
hermes_cli/plugins.py.has_middleware(kind)so hot-path call sites can cheaply skip middleware work when no consumer is registered.hermes_cli/middleware.py:apply_llm_request_middlewareapply_tool_request_middlewarerun_llm_execution_middlewarerun_tool_execution_middlewareRequest Middleware
llm_requestmiddleware can rewrite provider request payloads before the API call.tool_requestmiddleware can rewrite tool arguments before tool execution.deepcopy.changedflag is based onbool(trace)instead of structural deep-equality comparisons across copied payloads.Execution Middleware
llm_executionmiddleware can wrap or replace the downstream LLM call.tool_executionmiddleware can wrap or replace the downstream tool call._DownstreamExecutionError, so a failing tool/LLM call is not retried by middleware chaining.next_call()no longer causes the terminal LLM/tool call to execute twice.Hermes Call Sites
agent/conversation_loop.py.model_tools.pyagent/tool_executor.pyagent/agent_runtime_helpers.pyNeMo-Relay Consumer
plugins/observability/nemo_relayfrom a telemetry-only observer into a real adaptive execution consumer.llm_execution.tool_execution.next_call(...).Documentation
docs/middleware/README.md.Tests
next_callmiddleware failure tests to verify terminal execution is not repeated.Suggested Review Order
hermes_cli/plugins.pyandhermes_cli/middleware.pyHermes hot-path call sites
agent/conversation_loop.pymodel_tools.pyagent/tool_executor.pyagent/agent_runtime_helpers.pyNeMo-Relay consumer
plugins/observability/nemo_relay/__init__.pyplugins/observability/nemo_relay/README.mdTests and docs
tests/hermes_cli/test_plugins.pytests/plugins/test_nemo_relay_plugin.pytests/test_model_tools.pytests/run_agent/test_run_agent.pydocs/middleware/README.mdHow to Test
Targeted test run used locally:
Result:
Useful manual checks:
Checklist
Code
changedstate avoids structural deep-equality comparisons.next_call()do not re-run the terminal LLM/tool call.Documentation & Housekeeping
Screenshots / Logs
No UI screenshots.
Hot-path middleware overhead was checked after the no-listener gates were added:
apply_llmapply_llmrun_llmapply_toolrun_toolInterpretation: