fix(tui-gateway): reap leaked slash_worker processes/fds on disconnect, create-race, and parent-death#35626
Conversation
tonydwb
left a comment
There was a problem hiding this comment.
Code Review Summary
Verdict: Approved ✅
Reviewed Changes
Comprehensive TUI gateway slash_worker leak fix. Three independent producers: (C1) disconnect reaping, (C2) create-close orphan race guard, (C3) parent-death watchdog. Plus uvicorn ws_ping, idle-session reaper, PTY group-kill.
✅ Looks Good
- Correctness: Each producer is independently addressed. The
_attach_workeridentity check (is) correctly handles concurrent spawns. PTY group-kill is guarded with!= os.getpgrp(). Watchdog uses psutilcreate_timefor PID-reuse safety. Idempotent close via_closed/_finalizedflags prevents double-kill. - Off-loop teardown:
asyncio.to_threadprevents blocking the uvicorn loop during worker close + DB sync. - Well-documented: Root cause table, test coverage per commit, known limitations and startup-window race disclosure.
- Edge cases: Zombie reap after SIGKILL, master fd close (force=True), GC of task handles, concurrent disconnect+spawn.
- No security concerns: No credential exposure. PTY group-kill self-guard prevents signalling gateway's own group.
Reviewed by Hermes Agent
… idempotent) Task: swl-qrf.2
Adds a reentrant _sessions_lock and a single locked teardown helper. A new _attach_worker(sid, session, worker) stores the worker only while sid still maps to the session, else closes it — closing the create/close orphan race at the three spawn sites (_build, _init_session, on-demand slash.exec). _build's finally no longer double-closes the worker (attach owns that now). Task: swl-qrf.3
…session_by_id Task: swl-qrf.4
…nsport Task: swl-qrf.5
…ching transport) The disconnect finally only re-pointed each owned session's transport to stdio, so close_on_disconnect sidecar sessions (and their slash_worker subprocesses) survived every clean tab-close until atexit — the dominant leak. Now delegate to _close_sessions_for_transport off the event loop (to_thread) so the blocking worker.close()/DB write can't stall other live WS connections. Task: swl-qrf.6
Task: swl-qrf.7
…tchdog Daemon thread polls _is_orphaned (original ppid check + psutil create_time PID-reuse guard, no PR_SET_PDEATHSIG). On orphan, drains an in-flight command up to a grace window then os._exit(0). Started before the HermesCLI build to cover the spawn window. Task: swl-qrf.8
Spawns a gateway stand-in that forks a watchdog-armed worker; the worker announces its pid only after arming (so the kill can't precede arming and anchor the watchdog to init). SIGKILLs the parent and asserts the orphan self-exits within a generous multiple of the sub-second poll+grace. Marked integration + live_system_guard_bypass. Task: swl-qrf.9
…ousResearch#24135) close() escalated SIGHUP->SIGTERM->SIGKILL against the leader only, leaking any grandchildren it spawned (NousResearch#32377). Capture the child's pgid at construction (never the dashboard's own group) and signal the whole group via os.killpg, with a single-leader fallback when no pgid is available. Keeps the trailing proc.close(force=True) that closes the PTY master fd (the NousResearch#24775 fd-leak fix). Task: swl-qrf.10
…tions Daemon thread scans every 5min and evicts sessions only when ALL hold: not running, no pending input, agent finished starting, transport dead/detached, and both last_active and created_at older than the 6h TTL. Backstops any disconnect path that slips past the WS finally (NousResearch#21370/NousResearch#21467). Task: swl-qrf.12
The post-turn slash-worker restart stored the fresh worker with a bare session["slash_worker"] = ... assignment. That runs as a turn ends and `running` flips false — exactly when a close_on_disconnect reap can pop the session on another thread. The bare store then leaked the new worker (reaped only on gateway exit via the parent-death watchdog). Thread sid through _restart_slash_worker and route the store through the same _attach_worker store-iff-still-mapped guard as the three spawn sites, closing the worker instead of orphaning it when the session is gone. Refs NousResearch#21370
…module) The check-windows-footguns CI flags the two os.killpg calls in the PTY group-kill path. pty_bridge is already POSIX-only (it imports fcntl/ termios/ptyprocess at module top, and the SIGHUP/SIGTERM/SIGKILL loop already carries the same suppression), so these are intentionally platform-gated. Suppress per the checker's documented convention rather than swapping to psutil, which can't target the PTY process group the NousResearch#32377 grandchild-reaping fix relies on.
cfbe81c to
24000c4
Compare
|
Salvaged into #42132. Your cross-platform parent-death watchdog commit ( |
…ctive_list liveness (re-scoped onto current main) Salvaged from NousResearch#35626 (banditburai) and re-scoped after maintainers landed the parent-death watchdog (slash_worker.py) and PTY process-group teardown (pty_bridge.py) directly on main. Those pieces are intentionally NOT included here — this carries only what is still missing: - C1 disconnect reap: ws.py's `finally` only re-pointed the dead transport at stdio. `_close_sessions_for_transport` now reaps `close_on_disconnect` sessions and schedules the grace-reap for the rest, offloaded via `asyncio.to_thread` so the blocking worker.close() + DB write never stalls the uvicorn loop. - C2 create/close orphan race: `_attach_worker` stores the worker iff `_sessions.get(sid) is session` under the lock (else closes it), applied at every spawn site incl. the post-turn `_restart_slash_worker`. - Single idempotent teardown funnel: session.close, WS disconnect, the generous-TTL idle reaper, shutdown, and the WS grace-reap all reach `_close_session_by_id` → `_teardown_session`; `_finalized`/`_closed` flags make concurrent/double teardown a no-op. `_sessions_lock` upgraded to RLock. - uvicorn `ws_ping_interval/timeout=20s` so a half-open socket (reverse-proxy 524) becomes a `WebSocketDisconnect` and the C1 path runs. Plus two review-driven hardening fixes (mine): - `session.active_list` now skips `_finalized` sessions so the footer "N sessions" count reflects attachable sessions instead of only ever growing until restart (NousResearch#38950). Keys on `_finalized` only, NOT the stdio sentinel, so a standalone `hermes --tui` session stays visible. - `_schedule_ws_orphan_reap._reap` pops via `_close_session_by_id` (under `_sessions_lock`) instead of `_sessions.pop` under the unrelated `_session_resume_lock` (NousResearch#39591); the resume_lock now only guards the orphan re-check against `session.resume`. - Float env knobs (`HERMES_SLASH_WATCHDOG_*`, `HERMES_TUI_SESSION_TTL_S`) parse with a fallback helper so a malformed value can't crash the worker at import. Fixes NousResearch#32377 Fixes NousResearch#38950 Addresses NousResearch#22855 Co-authored-by: banditburai <123342691+banditburai@users.noreply.github.com> Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com>
…ctive_list liveness (re-scoped onto current main) Salvaged from NousResearch#35626 (banditburai) and re-scoped after maintainers landed the parent-death watchdog (slash_worker.py) and PTY process-group teardown (pty_bridge.py) directly on main. Those pieces are intentionally NOT included here — this carries only what is still missing: - C1 disconnect reap: ws.py's `finally` only re-pointed the dead transport at stdio. `_close_sessions_for_transport` now reaps `close_on_disconnect` sessions and schedules the grace-reap for the rest, offloaded via `asyncio.to_thread` so the blocking worker.close() + DB write never stalls the uvicorn loop. - C2 create/close orphan race: `_attach_worker` stores the worker iff `_sessions.get(sid) is session` under the lock (else closes it), applied at every spawn site incl. the post-turn `_restart_slash_worker`. - Single idempotent teardown funnel: session.close, WS disconnect, the generous-TTL idle reaper, shutdown, and the WS grace-reap all reach `_close_session_by_id` → `_teardown_session`; `_finalized`/`_closed` flags make concurrent/double teardown a no-op. `_sessions_lock` upgraded to RLock. - uvicorn `ws_ping_interval/timeout=20s` so a half-open socket (reverse-proxy 524) becomes a `WebSocketDisconnect` and the C1 path runs. Plus two review-driven hardening fixes (mine): - `session.active_list` now skips `_finalized` sessions so the footer "N sessions" count reflects attachable sessions instead of only ever growing until restart (NousResearch#38950). Keys on `_finalized` only, NOT the stdio sentinel, so a standalone `hermes --tui` session stays visible. - `_schedule_ws_orphan_reap._reap` pops via `_close_session_by_id` (under `_sessions_lock`) instead of `_sessions.pop` under the unrelated `_session_resume_lock` (NousResearch#39591); the resume_lock now only guards the orphan re-check against `session.resume`. - Float env knobs (`HERMES_SLASH_WATCHDOG_*`, `HERMES_TUI_SESSION_TTL_S`) parse with a fallback helper so a malformed value can't crash the worker at import. Fixes NousResearch#32377 Fixes NousResearch#38950 Addresses NousResearch#22855 Co-authored-by: banditburai <123342691+banditburai@users.noreply.github.com> Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com>
…ctive_list liveness (re-scoped onto current main) Salvaged from NousResearch#35626 (banditburai) and re-scoped after maintainers landed the parent-death watchdog (slash_worker.py) and PTY process-group teardown (pty_bridge.py) directly on main. Those pieces are intentionally NOT included here — this carries only what is still missing: - C1 disconnect reap: ws.py's `finally` only re-pointed the dead transport at stdio. `_close_sessions_for_transport` now reaps `close_on_disconnect` sessions and schedules the grace-reap for the rest, offloaded via `asyncio.to_thread` so the blocking worker.close() + DB write never stalls the uvicorn loop. - C2 create/close orphan race: `_attach_worker` stores the worker iff `_sessions.get(sid) is session` under the lock (else closes it), applied at every spawn site incl. the post-turn `_restart_slash_worker`. - Single idempotent teardown funnel: session.close, WS disconnect, the generous-TTL idle reaper, shutdown, and the WS grace-reap all reach `_close_session_by_id` → `_teardown_session`; `_finalized`/`_closed` flags make concurrent/double teardown a no-op. `_sessions_lock` upgraded to RLock. - uvicorn `ws_ping_interval/timeout=20s` so a half-open socket (reverse-proxy 524) becomes a `WebSocketDisconnect` and the C1 path runs. Plus two review-driven hardening fixes (mine): - `session.active_list` now skips `_finalized` sessions so the footer "N sessions" count reflects attachable sessions instead of only ever growing until restart (NousResearch#38950). Keys on `_finalized` only, NOT the stdio sentinel, so a standalone `hermes --tui` session stays visible. - `_schedule_ws_orphan_reap._reap` pops via `_close_session_by_id` (under `_sessions_lock`) instead of `_sessions.pop` under the unrelated `_session_resume_lock` (NousResearch#39591); the resume_lock now only guards the orphan re-check against `session.resume`. - Float env knobs (`HERMES_SLASH_WATCHDOG_*`, `HERMES_TUI_SESSION_TTL_S`) parse with a fallback helper so a malformed value can't crash the worker at import. Fixes NousResearch#32377 Fixes NousResearch#38950 Addresses NousResearch#22855 Co-authored-by: banditburai <123342691+banditburai@users.noreply.github.com> Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com>
…ctive_list liveness (re-scoped onto current main) Salvaged from NousResearch#35626 (banditburai) and re-scoped after maintainers landed the parent-death watchdog (slash_worker.py) and PTY process-group teardown (pty_bridge.py) directly on main. Those pieces are intentionally NOT included here — this carries only what is still missing: - C1 disconnect reap: ws.py's `finally` only re-pointed the dead transport at stdio. `_close_sessions_for_transport` now reaps `close_on_disconnect` sessions and schedules the grace-reap for the rest, offloaded via `asyncio.to_thread` so the blocking worker.close() + DB write never stalls the uvicorn loop. - C2 create/close orphan race: `_attach_worker` stores the worker iff `_sessions.get(sid) is session` under the lock (else closes it), applied at every spawn site incl. the post-turn `_restart_slash_worker`. - Single idempotent teardown funnel: session.close, WS disconnect, the generous-TTL idle reaper, shutdown, and the WS grace-reap all reach `_close_session_by_id` → `_teardown_session`; `_finalized`/`_closed` flags make concurrent/double teardown a no-op. `_sessions_lock` upgraded to RLock. - uvicorn `ws_ping_interval/timeout=20s` so a half-open socket (reverse-proxy 524) becomes a `WebSocketDisconnect` and the C1 path runs. Plus two review-driven hardening fixes (mine): - `session.active_list` now skips `_finalized` sessions so the footer "N sessions" count reflects attachable sessions instead of only ever growing until restart (NousResearch#38950). Keys on `_finalized` only, NOT the stdio sentinel, so a standalone `hermes --tui` session stays visible. - `_schedule_ws_orphan_reap._reap` pops via `_close_session_by_id` (under `_sessions_lock`) instead of `_sessions.pop` under the unrelated `_session_resume_lock` (NousResearch#39591); the resume_lock now only guards the orphan re-check against `session.resume`. - Float env knobs (`HERMES_SLASH_WATCHDOG_*`, `HERMES_TUI_SESSION_TTL_S`) parse with a fallback helper so a malformed value can't crash the worker at import. Fixes NousResearch#32377 Fixes NousResearch#38950 Addresses NousResearch#22855 Co-authored-by: banditburai <123342691+banditburai@users.noreply.github.com> Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com>
…ctive_list liveness (re-scoped onto current main) Salvaged from #35626 (banditburai) and re-scoped after maintainers landed the parent-death watchdog (slash_worker.py) and PTY process-group teardown (pty_bridge.py) directly on main. Those pieces are intentionally NOT included here — this carries only what is still missing: - C1 disconnect reap: ws.py's `finally` only re-pointed the dead transport at stdio. `_close_sessions_for_transport` now reaps `close_on_disconnect` sessions and schedules the grace-reap for the rest, offloaded via `asyncio.to_thread` so the blocking worker.close() + DB write never stalls the uvicorn loop. - C2 create/close orphan race: `_attach_worker` stores the worker iff `_sessions.get(sid) is session` under the lock (else closes it), applied at every spawn site incl. the post-turn `_restart_slash_worker`. - Single idempotent teardown funnel: session.close, WS disconnect, the generous-TTL idle reaper, shutdown, and the WS grace-reap all reach `_close_session_by_id` → `_teardown_session`; `_finalized`/`_closed` flags make concurrent/double teardown a no-op. `_sessions_lock` upgraded to RLock. - uvicorn `ws_ping_interval/timeout=20s` so a half-open socket (reverse-proxy 524) becomes a `WebSocketDisconnect` and the C1 path runs. Plus two review-driven hardening fixes (mine): - `session.active_list` now skips `_finalized` sessions so the footer "N sessions" count reflects attachable sessions instead of only ever growing until restart (#38950). Keys on `_finalized` only, NOT the stdio sentinel, so a standalone `hermes --tui` session stays visible. - `_schedule_ws_orphan_reap._reap` pops via `_close_session_by_id` (under `_sessions_lock`) instead of `_sessions.pop` under the unrelated `_session_resume_lock` (#39591); the resume_lock now only guards the orphan re-check against `session.resume`. - Float env knobs (`HERMES_SLASH_WATCHDOG_*`, `HERMES_TUI_SESSION_TTL_S`) parse with a fallback helper so a malformed value can't crash the worker at import. Fixes #32377 Fixes #38950 Addresses #22855 Co-authored-by: banditburai <123342691+banditburai@users.noreply.github.com> Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com>
Dashboard chat sessions leaked
tui_gateway.slash_workersubprocesses — plus their pipe fds and PTY master fds — whenever a session disconnected, was abandoned, or its parent gateway died. This PR adds a teardown mechanism for each of the three independent producers (one narrow C3 residual disclosed under Known limitations); it is the leak cause (close + reap). The launchd fd-ceiling raise + self-heal ships separately in the sibling PRfix/gateway-fd-limit-hardening(Refs #24775).Problem
Leaked workers accumulate without bound and drag the host down. Observed: 128 stale workers swap-pinning a 7.8 GB box (#21467); fd exhaustion on macOS launchd (#24775); a per-turn handler leak (#21370); a dashboard hang from PTY-disconnect leaks behind a reverse proxy returning 524 (#32377); general lifecycle-gap fragility under intensive use (#22855). There are three distinct producers, and fixing any one alone leaves the leak in place.
Root cause → mechanism → file → commit
ws.pyfinallyonly re-pointed the transport back to stdio; the session (and worker) lived on_close_sessions_for_transportreapsclose_on_disconnectsessions / re-points the rest, dispatched viaasyncio.to_thread(the teardown does a blockingworker.close()+ sync DB write — inline would stall the uvicorn loop for every other live connection); funnels through one idempotent_close_session_by_idws.py,server.py0431e38eb6ce86df308912e62b3b01b054d0_attach_workerstores the worker iff_sessions.get(sid) is session(identity) under_sessions_lock, else closes it — applied at all 4 spawn sites incl. the post-turn_restart_slash_workerserver.py8912e62b305007e68eatexitnever runs)close()signalled only the PTY leader; master fd was dropped withoutforcegetppid()+create_timePID-reuse guard, drains in-flight thenos._exit(0)); (ii) PTY group-kill viaos.killpgon the child'ssetsidgroup (guarded!= os.getpgrp()so it never signals the gateway's own group) +proc.close(force=True)to close the master fd; (iii) idempotent_SlashWorker.close()(_closedflag + post-SIGKILL zombie reap + stdio close)slash_worker.py,pty_bridge.py,server.pydc0add6ffe637c7b419339e47a4Two supporting changes (detectors / defense-in-depth, not root causes):
ws_ping_interval/ws_ping_timeout = 20s— server-initiated pings turn a half-open socket (reverse-proxy 524, dropped tunnel) into aWebSocketDisconnectwithin ~20–40 s (one ping interval + one timeout), so the C1 disconnect path can run instead of the socket lingering silently.web_server.py(8d36d42fd).last_activeandcreated_atolder than a 6 h TTL). A last-resort backstop for any disconnect that slips past the WSfinally.server.py(d2b4f7b50; see Known limitations).Cross-cutting note
All teardown paths —
session.closeRPC,atexitshutdown, WS disconnect, idle eviction — now funnel through the single idempotent_close_session_by_id. Idempotency (via_finalizedon the session and_closedon the worker) is the explicit safety contract: concurrent/double teardown is a no-op, never a double-kill.Resolves
Fixes #21370— the worker is per-session (reused across turns, not respawned per turn); the leak was that the session + worker were never reaped on disconnect (C1), and the post-turn_restart_slash_workercould orphan its replacement worker (C2). Both are now closed.Fixes #21467— worker accumulation / swap-pinning: the same reap path bounds steady-state count; the idle reaper backstops it.Fixes #24775— fd exhaustion. PR1 fixes the leak cause (PTY master-fd close + zombie reap + session/worker teardown). The launchd fd-cap raise is mitigation and ships in the sibling PR (Refs #24775).Fixes #32377— PTY chat disconnect / 524 behind reverse proxy:ws_pingdetects the half-open socket → C1 reap → PTY group-kill reaps the grandchildren.Addresses #22855— lifecycle gaps / fragility: the watchdog + C1 reap + C2 guard cover the common-case gaps. This PR does not add thetools.process_registryregistration the issue proposed, so global teardown still can't enumerate live workers, and the startup-window race noted below remains; henceAddressesrather thanFixes.Supersedes #21401— builds on its disconnect-close approach (same files:ws.py,server.py,ChatSidebar.tsx), adding off-loop teardown (asyncio.to_thread) and the_attach_workercreate-race guard.Supersedes #32483— carries forward its PTY process-group reaping and adds idempotentclose(). For parent-death detection it uses a cross-platform psutil watchdog (create_time-based, PID-reuse-safe) in place of [codex] fix(tui): reap dashboard slash workers #32483's Linux-onlyPR_SET_PDEATHSIG; the trade is slightly higher detection latency on Linux in exchange for covering macOS, the platform in these reports.Known limitations
getppid()import window, the worker anchors to init and never self-terminates. Narrow; unreached by the registry (see Bug: slash_worker lifecycle gaps and system fragility observed during intensive Dashboard usage (#21370) #22855). Principled fix (deferred): pass the gateway PID to the worker at spawn instead of relying ongetppid(). The other two C3 mechanisms (PTY group-kill, idempotent close) are unaffected._transport_is_deadtreats the stdio transport as always-dead, so a local stdio session idle > 6 h (not running, no pending) will be reclaimed. Bounded (local-only, 6 h TTL, multiple exemptions); accepted for now.Tests
Each commit ships its own tests (unit + a live parent-death integration test under
-m integration).Reviewer guidance
Hot spots worth the closest look: the off-loop
asyncio.to_threadteardown (no event-loop reentrancy; DB write off the loop thread);_attach_worker'sisidentity check (not==) covering the restart site; theos.killpg!= os.getpgrp()self-group guard; the watchdog's in-flight drain beforeos._exit(0). Sibling:fix/gateway-fd-limit-hardeningraises the macOS launchd fd ceiling (Refs #24775).