perf(observability): observer telemetry hooks + NeMo-Relay plugin, gated tool emit (salvage #38190/#29722)#38232
Merged
Merged
Conversation
Adds backend-neutral observer hooks for plugins: session, turn, API request, tool, approval, and subagent lifecycle events with stable correlation IDs (session_id, task_id, turn_id, api_request_id, tool_call_id, parent/child subagent ids). Extends VALID_HOOKS with api_request_error and subagent_start. Hot path is zero-cost when no plugin subscribes: has_hook()/presence checks gate all payload construction, request payloads are returned by reference when no middleware rewrites, and the sanitized response payload no longer embeds raw response objects. Bundles the optional NeMo-Relay observability plugin (plugins/observability/nemo_relay) as an in-repo consumer of the new hooks, peer to the existing langfuse plugin. Fails open when the optional nemo-relay package is not installed. Authored-by: Bryan Bednarski <bbednarski@nvidia.com> Salvaged from #29722 onto current main.
The salvaged PR incidentally stripped a trailing blank line from two unrelated test files (test_file_tools_cwd_resolution.py, test_tool_search.py). Restore them to keep the salvage diff scoped to the observability feature.
…ootprint
The salvaged observer contract gated the API-request hot path on has_hook()
but left the per-tool emit ungated: every tool call ran result-field
derivation + payload dict build + invoke_hook dispatch even with zero
plugins registered.
- _emit_post_tool_call_hook now short-circuits on has_hook("post_tool_call")
and derives status/error fields lazily (after the gate, only when a
listener will consume them). status defaults to None -> derived; explicit
blocked/cancelled callers still pass status through.
- transform_tool_result emit (pre-existing hook) likewise gated on
has_hook(); skips _tool_result_observer_fields when no listener.
- Removed the now-redundant _tool_result_observer_fields pre-computation at
the three ok-path call sites (model_tools, agent_runtime_helpers,
tool_executor) — the helper derives them, so the no-listener path costs
one dict lookup and the call sites shrink.
- Tests: stub has_hook=True where payload correctness is asserted; add a
no-listener regression proving post_tool_call/transform_tool_result emit
is skipped when nothing is registered.
Contributor
🔎 Lint report:
|
| Rule | Count |
|---|---|
unresolved-attribute |
8 |
unresolved-reference |
1 |
invalid-assignment |
1 |
First entries
run_agent.py:2054: [unresolved-attribute] unresolved-attribute: Object of type `Self@_invoke_api_request_error_hook` has no attribute `api_mode`
run_agent.py:2051: [unresolved-attribute] unresolved-attribute: Object of type `Self@_invoke_api_request_error_hook` has no attribute `model`
run_agent.py:2049: [unresolved-attribute] unresolved-attribute: Object of type `Self@_invoke_api_request_error_hook` has no attribute `session_id`
tests/run_agent/test_run_agent.py:2854: [unresolved-attribute] unresolved-attribute: Attribute `test` is not defined on `stmt` in union `stmt | (Unknown & ~None)`
run_agent.py:1927: [unresolved-reference] unresolved-reference: Name `SimpleNamespace` used when not defined
run_agent.py:2052: [unresolved-attribute] unresolved-attribute: Object of type `Self@_invoke_api_request_error_hook` has no attribute `provider`
model_tools.py:1066: [invalid-assignment] invalid-assignment: Object of type `None` is not assignable to `def reset_current_observability_context(tokens: tuple[Token[str], Token[str]]) -> None`
run_agent.py:2050: [unresolved-attribute] unresolved-attribute: Object of type `Self@_invoke_api_request_error_hook` has no attribute `platform`
tests/run_agent/test_run_agent.py:2858: [unresolved-attribute] unresolved-attribute: Attribute `orelse` is not defined on `stmt` in union `stmt | (Unknown & ~None)`
tools/delegate_tool.py:1149: [unresolved-attribute] unresolved-attribute: Unresolved attribute `_parent_turn_id` on type `AIAgent`
✅ Fixed issues: none
Unchanged: 5033 pre-existing issues carried over.
Diagnostics are surfaced as warnings — this check never fails the build.
CI slice 3 caught that tests/test_transform_tool_result_hook.py monkeypatches
invoke_hook but not has_hook, so the new has_hook("transform_tool_result")
gate skipped the emit and the transform never ran. Stub has_hook=True in the
shared _run_handle_function_call helper whenever a custom invoke_hook is
supplied (the test intends hooks to fire). The no-hook-registered test keeps
the real has_hook=False path — that's the gate's intended behavior.
7 tasks
19 tasks
1 task
8 tasks
19 tasks
2 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Re-salvages the observer-grade telemetry contract from #29722 (PR #38190) onto current
main, and closes the per-tool overhead gap that the prior salvage left open: the no-listener tool path is now as cheap as the API path.The contract is an in-process observer surface (session/turn/API/tool/approval/subagent lifecycle + correlation IDs). The bundled
nemo_relayconsumer is opt-in and writes local ATOF/ATIF trace files — no outbound network emission. Fails open when the optionalnemo-relaypackage is absent.Changes
model_tools.py—_emit_post_tool_call_hookshort-circuits onhas_hook("post_tool_call")and derivesstatus/error_type/error_messagelazily after the gate (only when a listener will consume them).transform_tool_resultemit (pre-existing hook) likewise gated onhas_hook(). Removed the redundant_tool_result_observer_fieldspre-compute at the ok-path call site.agent/agent_runtime_helpers.py,agent/tool_executor.py— dropped the duplicated field pre-computation at the ok-path emit sites; the helper derives them now. Explicit blocked/cancelled callers still passstatusthrough unchanged.has_hook=Truewhere payload correctness is asserted; added a no-listener regression proving thepost_tool_call/transform_tool_resultemit is skipped entirely when nothing is registered.Net effect: with zero plugins registered, the per-tool hot path costs one dict lookup — no result parse, no payload dict build, no dispatch.
Validation
post_tool_callemittransform_tool_resultemittest_model_tools.py,test_nemo_relay_plugin.py,test_plugins.py, session-boundary hooks (cli + gateway),tests/run_agent/test_run_agent.py(367),test_project_metadata.py. 0 failures.HERMES_HOME): (1)has_hookFalse → emit short-circuits; (2) registered listener → correctstatus/error_type/error_message/duration_ms; (3)nemo_relay.register()fails open without thenemo-relaypackage.Attribution
Cherry-picked from #29722 (feature commit by @bbednarski9, authorship preserved) and the trailing-newline restore by @kshitijk4poor (PR #38190). Slimming/gating follow-up commit is ours.
Closes #29722
Closes #38190
Infographic