Skip to content

fix(observability): flush plugin-config OpenInference when the final session closes#41551

Merged
kshitijk4poor merged 4 commits into
NousResearch:mainfrom
mnajafian-nv:fix/hermes-plugin-openinference-finalization
Jun 8, 2026
Merged

fix(observability): flush plugin-config OpenInference when the final session closes#41551
kshitijk4poor merged 4 commits into
NousResearch:mainfrom
mnajafian-nv:fix/hermes-plugin-openinference-finalization

Conversation

@mnajafian-nv

@mnajafian-nv mnajafian-nv commented Jun 7, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

Fixes the bundled observability/nemo_relay plugin so canonical plugin-config OpenInference is flushed when the final Hermes session closes.

Before this change, Hermes initialized NeMo Relay plugin-config observability but did not reach the matching cleanup boundary when the final session finalized. That left plugin-config OpenInference buffered instead of flushed.

This update also fixes exporter ownership across failure and recovery paths:

  • plugin-config observability is cleared only after the last active Hermes session finalizes
  • plugin init and clear use the existing async-safe awaitable helper
  • direct ATOF and ATIF fallbacks stay active until plugins.toml initialization succeeds
  • if managed init later succeeds after a fallback-only run, Hermes clears the direct ATOF fallback before continuing

This is a follow-up fix for the bundled NeMo Relay observability plugin introduced in #38232.

Related Issue

No GitHub issue filed. Follow-up to #38232.

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Security fix
  • Documentation update
  • Tests (adding or improving test coverage)
  • Refactor (no behavior change)
  • New skill (bundled or hub)

Changes Made

  • plugins/observability/nemo_relay/__init__.py
    • clear plugin-config observability only after the last active Hermes session finalizes
    • reinitialize plugin-config observability on the next session start after final cleanup
    • route plugin init and clear through the existing async-safe awaitable helper
    • decide direct fallback ownership from successful plugin-config init instead of plugins.toml presence alone
    • clear direct ATOF fallback if managed init later takes ownership
  • tests/plugins/test_nemo_relay_plugin.py
    • add lifecycle and async-loop regression coverage
    • add failure-path regressions for managed init failure and retry-after-finalize recovery
  • plugins/observability/nemo_relay/README.md
    • document that direct env-var fallbacks stay active if plugins.toml initialization fails

How to Test

  1. Run the targeted plugin regression suite with the repo-standard wrapper:

    bash scripts/run_tests.sh tests/plugins/test_nemo_relay_plugin.py -- -q
  2. Verify the changed files are the only files in the PR diff:

    git diff --stat upstream/main...HEAD

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: macOS

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — updated plugins/observability/nemo_relay/README.md
  • I've updated cli-config.yaml.example if I added/changed config keys — N/A, no config keys changed
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — N/A, no architecture or workflow docs changed
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — no new platform-specific behavior added
  • I've updated tool descriptions/schemas if I changed tool behavior — N/A, no tool schemas changed

Screenshots / Logs

$ bash scripts/run_tests.sh tests/plugins/test_nemo_relay_plugin.py -- -q
▶ running per-file parallel test suite via run_tests_parallel.py
[  0.0% |     0/0 | ✓24 | ✗ 0] ✓ tests/plugins/test_nemo_relay_plugin.py (24✓, 2.0s)

Clear NeMo Relay plugin-config observability only after the last active Hermes session finalizes.

Use the plugin's async-safe awaitable helper for both initialize and clear so session rotation remains safe under active event loops.

Disable the direct ATIF fallback when plugins.toml already owns the ATIF exporter lifecycle to avoid duplicate trajectory export on finalization.
@alt-glitch alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/plugins Plugin system and bundled plugins labels Jun 7, 2026
… succeeds

Signed-off-by: mnajafian-nv <mnajafian@nvidia.com>

@kshitijk4poor kshitijk4poor left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed end-to-end: checked out at head ecd4679d8, ran the suite, and verified every control-flow claim with standalone repros (module path confirmed against the PR worktree, not the installed copy).

Tests: 24/24 in tests/plugins/test_nemo_relay_plugin.py, 1086/1086 across tests/plugins/, 0 regressions, ruff clean. Scope is clean — only plugins/observability/nemo_relay/ + tests, no core files, no hardcoded secrets.

The fix does what it claims, all five behaviors verified: clear-only-after-last-finalize (pop-then-check not self.sessions; subagent sessions share the dict so it can't clear early), reinit on next start, multi-session keep-alive, ATIF ownership suppression, and ATOF fallback held until managed init succeeds then deregistered on reinit. Nice catch that _resolve_awaitable also fixes the init side — the old except RuntimeError: return False silently skipped plugins.toml init inside a running loop, likely part of the original never-flushed bug. I drove the (untested) clear-inside-running-loop branch: clear fires once, zero "coroutine was never awaited" warnings, clean reinit.

No Critical findings with a production trigger. Two latent warnings + a couple of suggestions inline. Net: correct fix targeting the real root cause — happy to see it merge with a tight follow-up for the clear-failure flag (or a quick amend here).

return
_resolve_awaitable(clear())
self._plugin_config_initialized = False
self._plugin_config_needs_reinit = bool(self.settings.plugins_config)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning (new in this PR): clear() failure strands the lifecycle state. The flags below only flip after this await returns, and close_session catches+logs if clear() raises. So on a clear() failure the runtime stays _plugin_config_initialized=True, _plugin_config_needs_reinit=False — the next session start then neither reinitializes nor activates the direct fallback, permanently stranded (no crash). Repro confirmed plugin.initialize count stays 1 across the next start. Minimal fix: set self._plugin_config_needs_reinit = bool(self.settings.plugins_config) in the failure path (or before the await) so a failed clear still re-arms reinit/fallback.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, good catch. Fixed in 728612c29.

_clear_plugins_toml() now uses try/finally, so a plugin.clear() failure still clears Hermes-side managed state and re-arms reinitialization for the next session.

I also added a focused regression covering successful init -> failed final-session clear -> next-session recovery.

def ensure_session(self, kwargs: dict[str, Any]) -> _SessionState:
self._maybe_reinitialize_plugins_toml()
session_id = _session_id(kwargs)
state = self.sessions.get(session_id)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning (latent, no current trigger): ensure_session has no post-finalize guard. If any ensure_session-calling hook fires after the terminal close_session, this re-initializes plugins.toml and the body below re-creates the session + pushes a scope that's never popped (repro: scope.push=2 vs scope.pop=1, initialized left True). Today nothing triggers it — the plugin on_session_end hook isn't invoked anywhere in cli.py/gateway/, and per-turn hooks fire during run_conversation (before finalize). Flagging only so a future hook addition doesn't silently re-arm an exporter for an ended session; a finalized-session-id guard (or skip auto-reinit/auto-create for an already-finalized id) would close it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for flagging this. I dug through the current CLI and gateway hook ordering and I agree this is a latent future-hardening concern, but not one I can trigger on the current finalized paths.

I kept it out of this PR to preserve scope around the concrete clear-failure lifecycle bug, but a finalized-session guard would be the right follow-up if Hermes adds a post-finalize ensure_session() path.

logger.debug("NeMo Relay ATIF deregister failed", exc_info=True)
if self._plugin_config_initialized and not self.sessions:
try:
self._clear_plugins_toml()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: this new clear-decision makes a pre-existing race observable. self.sessions and the lifecycle flags are mutated without a lock (only the module _LOCK guards the _Runtime singleton — pre-existing). This PR adds the not self.sessions check, so under concurrent subagent load a starting session could observe the dict empty here and clear a global exporter mid-session. Not a regression, but if this runs under heavy concurrency consider wrapping ensure_session/close_session/the flag transitions in _LOCK.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed this makes the pre-existing shared-state race easier to reason about. I left locking out of this PR to keep the fix scoped to the concrete clear-failure bug, but I agree ensure_session() / close_session() plus the lifecycle flags are the right place to tighten if we take a concurrency hardening follow-up.

if not self.settings.atof_enabled:
if not self.settings.atof_enabled or self.atof_exporter is not None:
return
config = self.nemo_relay.AtofExporterConfig()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion (nit): direct-ATOF suppression relies on this only being reachable when plugins.toml isn't initialized (call-site discipline) rather than an explicit not self._plugins_toml_owns_exporter("atof") guard. Correct today; an explicit owns-check here would make it self-defending against future refactors.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. I left this one out to keep the PR focused on the concrete clear-failure bug, since the current path only reaches _configure_atof() when managed plugins.toml does not own ATOF.

If we do a small hardening follow-up here, adding the explicit owns-check would make sense.

Ensure failed plugin-config clear operations still re-arm managed reinitialization on the next Hermes session.

Add focused regression coverage for successful init, failed final-session clear, and next-session recovery.

Signed-off-by: mnajafian-nv <mnajafian@nvidia.com>
kshitijk4poor added a commit that referenced this pull request Jun 8, 2026
Unblocks #41551 (and any future mnajafian-nv contributions) from the
contributor-attribution check. Maps mnajafian@nvidia.com -> mnajafian-nv.
@kshitijk4poor kshitijk4poor merged commit 3f1758d into NousResearch:main Jun 8, 2026
22 of 23 checks passed
@mnajafian-nv mnajafian-nv deleted the fix/hermes-plugin-openinference-finalization branch June 8, 2026 21:56
wachoo pushed a commit to wachoo/hermes-agent that referenced this pull request Jun 10, 2026
Unblocks NousResearch#41551 (and any future mnajafian-nv contributions) from the
contributor-attribution check. Maps mnajafian@nvidia.com -> mnajafian-nv.
changman pushed a commit to changman/hermes-agent that referenced this pull request Jun 10, 2026
Unblocks NousResearch#41551 (and any future mnajafian-nv contributions) from the
contributor-attribution check. Maps mnajafian@nvidia.com -> mnajafian-nv.
alt-glitch pushed a commit that referenced this pull request Jun 14, 2026
Unblocks #41551 (and any future mnajafian-nv contributions) from the
contributor-attribution check. Maps mnajafian@nvidia.com -> mnajafian-nv.
alt-glitch pushed a commit that referenced this pull request Jun 14, 2026
…erence-finalization

fix(observability): flush plugin-config OpenInference when the final session closes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/plugins Plugin system and bundled plugins P3 Low — cosmetic, nice to have type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants