Skip to content

fix(tui-gateway): close sidecar sessions on disconnect#21401

Closed
qWaitCrypto wants to merge 1 commit into
NousResearch:mainfrom
qWaitCrypto:fix/tui-gateway-close-sidecar-sessions
Closed

fix(tui-gateway): close sidecar sessions on disconnect#21401
qWaitCrypto wants to merge 1 commit into
NousResearch:mainfrom
qWaitCrypto:fix/tui-gateway-close-sidecar-sessions

Conversation

@qWaitCrypto

Copy link
Copy Markdown
Contributor

What does this PR do?

Fixes the dashboard sidecar session lifecycle leak behind #21370.

The underlying issue is not that tui_gateway.slash_worker forgets to exit on its own. That subprocess is intentionally persistent for real TUI sessions so slash commands can reuse the same worker. The leak happens because dashboard sidecar sessions created over /api/ws can outlive their websocket connection: on disconnect, the server previously only detached the transport and kept the session record alive in tui_gateway.server._sessions, which also kept the session's slash_worker alive.

This PR adds an explicit close_on_disconnect session mode for short-lived sidecar sessions, closes those sessions server-side when their websocket disconnects, and leaves the historical reconnect behavior unchanged for normal TUI sessions.

Related Issue

Fixes #21370

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • Added close_on_disconnect parsing and storage in tui_gateway/server.py for session.create.
  • Added a shared _close_session_by_id() helper so session.close, websocket disconnect cleanup, and process shutdown all use the same teardown path for finalize hooks, agent cleanup, notify unregistration, and slash-worker shutdown.
  • Added _close_sessions_for_transport() so websocket disconnect cleanup only closes explicitly marked sidecar sessions and still falls back to stdio for normal sessions.
  • Updated tui_gateway/ws.py to use the shared transport/session cleanup helper.
  • Updated web/src/components/ChatSidebar.tsx to create its sidecar session with close_on_disconnect: true.
  • Added regression tests covering:
    • session.create storing the close-on-disconnect flag
    • websocket-disconnect cleanup closing marked sidecar sessions and their workers
    • websocket-disconnect cleanup preserving normal sessions

How to Test

  1. Run the targeted TUI gateway regression tests:
    python -m pytest tests/test_tui_gateway_server.py -q -k "close_on_disconnect or close_sessions_for_transport or session_close_commits_memory_and_fires_finalize_hook or session_create_close_race_does_not_orphan_worker or session_create_no_race_keeps_worker_alive"
  2. Verify the dashboard frontend still compiles:
    cd web && npm run build
  3. Optional manual repro for the original issue:
    start hermes dashboard, open the dashboard chat/sidebar, disconnect/close the websocket client, and confirm sidecar python -m tui_gateway.slash_worker processes do not accumulate for the disconnected sidecar session.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: Ubuntu 24.04 (WSL)

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

For New Skills

N/A

Screenshots / Logs

  • Targeted backend regression:
    6 passed in 20.47s
  • Frontend build:
    cd web && npm run build
    ✓ built in 51.47s
  • Note: I did not run a full dashboard/browser end-to-end repro locally for this PR.

@alt-glitch alt-glitch added type/bug Something isn't working comp/tui Terminal UI (ui-tui/ + tui_gateway/) comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists javascript Pull requests that update javascript code labels May 7, 2026
@austinpickett austinpickett requested a review from Copilot May 18, 2026 14:44

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a sidecar slash-worker process leak (#21370) by allowing session.create callers to opt into closing the session (and its slash_worker) when the websocket transport disconnects. Normal TUI sessions retain reconnect semantics.

Changes:

  • Add close_on_disconnect flag, coercion helper, and a shared _close_session_by_id / _close_sessions_for_transport teardown path in tui_gateway/server.py; refactor session.close and _shutdown_sessions to use it.
  • Wire WS disconnect cleanup in tui_gateway/ws.py to the new helper.
  • Have the dashboard ChatSidebar opt into close_on_disconnect: true when creating its sidecar session, and add regression tests for the new helpers.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File Description
tui_gateway/server.py Adds flag plumbing, shared close helper, and transport-scoped close that distinguishes sidecar vs. persistent sessions.
tui_gateway/ws.py Replaces inline transport-detach loop with the new _close_sessions_for_transport helper.
web/src/components/ChatSidebar.tsx Marks dashboard sidecar sessions as close_on_disconnect: true.
tests/test_tui_gateway_server.py Adds tests for flag storage and disconnect handling (sidecar closed, normal session preserved).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

sea-monsters pushed a commit to sea-monsters/hermes-agent that referenced this pull request May 24, 2026
Three-layer defence-in-depth against orphaned slash_worker subprocesses:

1. **Server-side cleanup (P0)** — when a WebSocket disconnects, sessions
   marked close_on_disconnect=true (sidecar/short-lived) are finalised
   and their worker is killed. Normal TUI sessions still fall back to the
   stdio transport for historical reconnect compatibility.
   (Design from NousResearch#21401, adapted with permission.)

2. **Parent watchdog (P1, psutil)** — a daemon thread monitors the parent's
   PID + create_time fingerprint every 10 s and exits if the parent
   disappears. Handles crashes, SIGKILL, and PID reuse (critical on
   Windows). (Design from NousResearch#22863, adapted with permission.)

3. **Idle timeout + getppid() poll (P1, no deps)** — the main stdin loop
   uses select.select() with a 60 s timeout so it can periodically check
   os.getppid() and a 30-minute idle deadline. Works without psutil,
   adding defence even when the watchdog thread is not available.

Also refactors session teardown into a shared _close_session_by_id()
helper so that session.close RPC, WebSocket-disconnect cleanup, and
server shutdown all use the identical code path.

Co-authored-by: Hermes Agent
@teknium1

teknium1 commented Jun 7, 2026

Copy link
Copy Markdown
Contributor

Superseded by main commit 96cd37e2 ("fix(dashboard): reap orphaned embedded-chat sessions to stop slash_worker leak"), which implements the same fix this PR proposes:

  • A grace-windowed orphan reaper (_WS_ORPHAN_REAP_GRACE_S, default 20s) for WS sessions left transport-detached and not mid-turn.
  • A shared _teardown_session() used by both session.close and the reaper, so the _SlashWorker subprocess is always closed via one idempotent path.
  • _ws_session_is_orphaned() to detect sessions parked on the stdio transport with no live peer.

That closes the _sessions-leak-on-disconnect → orphaned slash-worker / leaked-AIAgent problem this PR addressed. Closing as redundant — thanks for the work and the detailed leak measurements; they corroborated the root cause.

@teknium1 teknium1 closed this Jun 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery comp/tui Terminal UI (ui-tui/ + tui_gateway/) javascript Pull requests that update javascript code P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: dashboard chat handler leaks tui_gateway.slash_worker processes after each gateway turn

4 participants