fix(observability): flush plugin-config OpenInference when the final session closes#41551
Conversation
Clear NeMo Relay plugin-config observability only after the last active Hermes session finalizes. Use the plugin's async-safe awaitable helper for both initialize and clear so session rotation remains safe under active event loops. Disable the direct ATIF fallback when plugins.toml already owns the ATIF exporter lifecycle to avoid duplicate trajectory export on finalization.
… succeeds Signed-off-by: mnajafian-nv <mnajafian@nvidia.com>
kshitijk4poor
left a comment
There was a problem hiding this comment.
Reviewed end-to-end: checked out at head ecd4679d8, ran the suite, and verified every control-flow claim with standalone repros (module path confirmed against the PR worktree, not the installed copy).
Tests: 24/24 in tests/plugins/test_nemo_relay_plugin.py, 1086/1086 across tests/plugins/, 0 regressions, ruff clean. Scope is clean — only plugins/observability/nemo_relay/ + tests, no core files, no hardcoded secrets.
The fix does what it claims, all five behaviors verified: clear-only-after-last-finalize (pop-then-check not self.sessions; subagent sessions share the dict so it can't clear early), reinit on next start, multi-session keep-alive, ATIF ownership suppression, and ATOF fallback held until managed init succeeds then deregistered on reinit. Nice catch that _resolve_awaitable also fixes the init side — the old except RuntimeError: return False silently skipped plugins.toml init inside a running loop, likely part of the original never-flushed bug. I drove the (untested) clear-inside-running-loop branch: clear fires once, zero "coroutine was never awaited" warnings, clean reinit.
No Critical findings with a production trigger. Two latent warnings + a couple of suggestions inline. Net: correct fix targeting the real root cause — happy to see it merge with a tight follow-up for the clear-failure flag (or a quick amend here).
| return | ||
| _resolve_awaitable(clear()) | ||
| self._plugin_config_initialized = False | ||
| self._plugin_config_needs_reinit = bool(self.settings.plugins_config) |
There was a problem hiding this comment.
Warning (new in this PR): clear() failure strands the lifecycle state. The flags below only flip after this await returns, and close_session catches+logs if clear() raises. So on a clear() failure the runtime stays _plugin_config_initialized=True, _plugin_config_needs_reinit=False — the next session start then neither reinitializes nor activates the direct fallback, permanently stranded (no crash). Repro confirmed plugin.initialize count stays 1 across the next start. Minimal fix: set self._plugin_config_needs_reinit = bool(self.settings.plugins_config) in the failure path (or before the await) so a failed clear still re-arms reinit/fallback.
There was a problem hiding this comment.
Thanks, good catch. Fixed in 728612c29.
_clear_plugins_toml() now uses try/finally, so a plugin.clear() failure still clears Hermes-side managed state and re-arms reinitialization for the next session.
I also added a focused regression covering successful init -> failed final-session clear -> next-session recovery.
| def ensure_session(self, kwargs: dict[str, Any]) -> _SessionState: | ||
| self._maybe_reinitialize_plugins_toml() | ||
| session_id = _session_id(kwargs) | ||
| state = self.sessions.get(session_id) |
There was a problem hiding this comment.
Warning (latent, no current trigger): ensure_session has no post-finalize guard. If any ensure_session-calling hook fires after the terminal close_session, this re-initializes plugins.toml and the body below re-creates the session + pushes a scope that's never popped (repro: scope.push=2 vs scope.pop=1, initialized left True). Today nothing triggers it — the plugin on_session_end hook isn't invoked anywhere in cli.py/gateway/, and per-turn hooks fire during run_conversation (before finalize). Flagging only so a future hook addition doesn't silently re-arm an exporter for an ended session; a finalized-session-id guard (or skip auto-reinit/auto-create for an already-finalized id) would close it.
There was a problem hiding this comment.
Thanks for flagging this. I dug through the current CLI and gateway hook ordering and I agree this is a latent future-hardening concern, but not one I can trigger on the current finalized paths.
I kept it out of this PR to preserve scope around the concrete clear-failure lifecycle bug, but a finalized-session guard would be the right follow-up if Hermes adds a post-finalize ensure_session() path.
| logger.debug("NeMo Relay ATIF deregister failed", exc_info=True) | ||
| if self._plugin_config_initialized and not self.sessions: | ||
| try: | ||
| self._clear_plugins_toml() |
There was a problem hiding this comment.
Suggestion: this new clear-decision makes a pre-existing race observable. self.sessions and the lifecycle flags are mutated without a lock (only the module _LOCK guards the _Runtime singleton — pre-existing). This PR adds the not self.sessions check, so under concurrent subagent load a starting session could observe the dict empty here and clear a global exporter mid-session. Not a regression, but if this runs under heavy concurrency consider wrapping ensure_session/close_session/the flag transitions in _LOCK.
There was a problem hiding this comment.
Agreed this makes the pre-existing shared-state race easier to reason about. I left locking out of this PR to keep the fix scoped to the concrete clear-failure bug, but I agree ensure_session() / close_session() plus the lifecycle flags are the right place to tighten if we take a concurrency hardening follow-up.
| if not self.settings.atof_enabled: | ||
| if not self.settings.atof_enabled or self.atof_exporter is not None: | ||
| return | ||
| config = self.nemo_relay.AtofExporterConfig() |
There was a problem hiding this comment.
Suggestion (nit): direct-ATOF suppression relies on this only being reachable when plugins.toml isn't initialized (call-site discipline) rather than an explicit not self._plugins_toml_owns_exporter("atof") guard. Correct today; an explicit owns-check here would make it self-defending against future refactors.
There was a problem hiding this comment.
Good call. I left this one out to keep the PR focused on the concrete clear-failure bug, since the current path only reaches _configure_atof() when managed plugins.toml does not own ATOF.
If we do a small hardening follow-up here, adding the explicit owns-check would make sense.
Ensure failed plugin-config clear operations still re-arm managed reinitialization on the next Hermes session. Add focused regression coverage for successful init, failed final-session clear, and next-session recovery. Signed-off-by: mnajafian-nv <mnajafian@nvidia.com>
Unblocks #41551 (and any future mnajafian-nv contributions) from the contributor-attribution check. Maps mnajafian@nvidia.com -> mnajafian-nv.
Unblocks NousResearch#41551 (and any future mnajafian-nv contributions) from the contributor-attribution check. Maps mnajafian@nvidia.com -> mnajafian-nv.
Unblocks NousResearch#41551 (and any future mnajafian-nv contributions) from the contributor-attribution check. Maps mnajafian@nvidia.com -> mnajafian-nv.
Unblocks #41551 (and any future mnajafian-nv contributions) from the contributor-attribution check. Maps mnajafian@nvidia.com -> mnajafian-nv.
…erence-finalization fix(observability): flush plugin-config OpenInference when the final session closes
What does this PR do?
Fixes the bundled
observability/nemo_relayplugin so canonical plugin-config OpenInference is flushed when the final Hermes session closes.Before this change, Hermes initialized NeMo Relay plugin-config observability but did not reach the matching cleanup boundary when the final session finalized. That left plugin-config OpenInference buffered instead of flushed.
This update also fixes exporter ownership across failure and recovery paths:
ATOFandATIFfallbacks stay active untilplugins.tomlinitialization succeedsATOFfallback before continuingThis is a follow-up fix for the bundled NeMo Relay observability plugin introduced in
#38232.Related Issue
No GitHub issue filed. Follow-up to
#38232.Type of Change
Changes Made
plugins/observability/nemo_relay/__init__.pyplugins.tomlpresence aloneATOFfallback if managed init later takes ownershiptests/plugins/test_nemo_relay_plugin.pyplugins/observability/nemo_relay/README.mdplugins.tomlinitialization failsHow to Test
Run the targeted plugin regression suite with the repo-standard wrapper:
Verify the changed files are the only files in the PR diff:
Checklist
Code
fix(scope):,feat(scope):, etc.)pytest tests/ -qand all tests passDocumentation & Housekeeping
docs/, docstrings) — updatedplugins/observability/nemo_relay/README.mdcli-config.yaml.exampleif I added/changed config keys — N/A, no config keys changedCONTRIBUTING.mdorAGENTS.mdif I changed architecture or workflows — N/A, no architecture or workflow docs changedScreenshots / Logs