fix(gateway): implement robust lifecycle management for slash_worker (#21370, #22855) #22863
leether wants to merge 4 commits into
Technical Note regarding TUI Gateway tests on macOS:

During the final verification of this PR on macOS (Darwin), I observed one pre-existing test failure.

Investigation Results

To ensure this wasn't a regression introduced by my changes, I performed a clean baseline test run. All other 173 tests passed successfully, including the directly related ones.

The core fix has been verified through `kill -9` simulations, confirming orphan reaping and watchdog effectiveness.
e8d76e1 to 902d627
|
I pushed a small follow-up update after rebasing this PR.

What changed:
Why this matters:
Tradeoff / scope:
Validation:
|
Three-layer defence-in-depth against orphaned slash_worker subprocesses:

1. **Server-side cleanup (P0)**: when a WebSocket disconnects, sessions marked close_on_disconnect=true (sidecar/short-lived) are finalised and their worker is killed. Normal TUI sessions still fall back to the stdio transport for historical reconnect compatibility. (Design from NousResearch#21401, adapted with permission.)
2. **Parent watchdog (P1, psutil)**: a daemon thread monitors the parent's PID + create_time fingerprint every 10 s and exits if the parent disappears. Handles crashes, SIGKILL, and PID reuse (critical on Windows). (Design from NousResearch#22863, adapted with permission.)
3. **Idle timeout + getppid() poll (P1, no deps)**: the main stdin loop uses select.select() with a 60 s timeout so it can periodically check os.getppid() and a 30-minute idle deadline. Works without psutil, adding defence even when the watchdog thread is not available.

Also refactors session teardown into a shared _close_session_by_id() helper so that the session.close RPC, WebSocket-disconnect cleanup, and server shutdown all use the identical code path.

Co-authored-by: Hermes Agent
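The third layer described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: it assumes a POSIX platform (on Windows, `select.select()` only accepts sockets), and the names `exit_reason` and `serve` are hypothetical. Only the 60 s poll, the 30-minute idle deadline, and the `os.getppid()` check come from the commit message.

```python
import os
import select
import time

POLL_SECONDS = 60.0            # select() timeout mentioned in the commit message
IDLE_LIMIT_SECONDS = 30 * 60   # 30-minute idle deadline


def exit_reason(initial_ppid, current_ppid, last_activity, now,
                idle_limit=IDLE_LIMIT_SECONDS):
    """Return why the worker should stop, or None to keep running."""
    # On POSIX an orphaned child is reparented, so getppid() changes.
    if current_ppid != initial_ppid:
        return "parent-gone"
    if now - last_activity >= idle_limit:
        return "idle-timeout"
    return None


def serve(fd, handle_line, poll_seconds=POLL_SECONDS,
          idle_limit=IDLE_LIMIT_SECONDS):
    """Read newline-delimited input from fd until EOF, parent loss, or idleness."""
    initial_ppid = os.getppid()
    last_activity = time.monotonic()
    buf = b""
    while True:
        # Wake up at least every poll_seconds, even when no input arrives.
        ready, _, _ = select.select([fd], [], [], poll_seconds)
        if ready:
            chunk = os.read(fd, 4096)
            if not chunk:
                return "eof"
            buf += chunk
            # Everything before the last newline is complete; keep the rest.
            *lines, buf = buf.split(b"\n")
            for line in lines:
                handle_line(line.decode())
            last_activity = time.monotonic()
        reason = exit_reason(initial_ppid, os.getppid(), last_activity,
                             time.monotonic(), idle_limit)
        if reason:
            return reason
```

Reading from the raw file descriptor avoids a common pitfall: a buffered `readline()` can hold extra lines in user-space buffers that `select()` cannot see. A worker would presumably call something like `serve(sys.stdin.fileno(), dispatch)` and exit when it returns.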
|
Superseded by #42132 (merged), which closes the slash_worker subprocess leak via two guards: process-group kill on PTY teardown + a cross-platform parent-death watchdog in the worker. Closing as resolved — thanks for tackling this; the merged fix salvaged the process-group-kill and watchdog approaches with contributor authorship preserved. |
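For readers unfamiliar with the process-group kill mentioned above, here is a generic POSIX sketch of the technique. It does not reproduce the code merged in #42132; `spawn_in_own_group` and `kill_group` are hypothetical names, and the 2-second grace period is an arbitrary assumption.

```python
import os
import signal
import subprocess


def spawn_in_own_group(argv):
    """Start a child in a new session, so its process group id equals its pid."""
    return subprocess.Popen(argv, start_new_session=True)


def kill_group(proc, grace=2.0):
    """Terminate the child's whole process group; escalate to SIGKILL if needed."""
    try:
        # Signals every process in the group, including grandchildren.
        os.killpg(proc.pid, signal.SIGTERM)
    except ProcessLookupError:
        # The group is already gone; just collect the exit status.
        return proc.wait()
    try:
        return proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGKILL)
        return proc.wait()
```

Signalling the group rather than the single PID means that helpers spawned by the worker are taken down with it, which is the property that matters when the goal is to avoid orphans.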
Co-authored with Gemini CLI (Orchestrator)
Problem
In high-frequency Dashboard usage scenarios, `tui_gateway.slash_worker` subprocesses are leaked after each chat turn (#21370). Our forensics on macOS identified that these orphans, reparented to `init` (PID 1), can accumulate significantly, causing memory exhaustion and interfering with environment synchronization during `hermes update`.

Solution: Defense-in-Depth
This PR implements a "three-way closure" for the worker lifecycle, following the guidelines in AGENTS.md:

1. Extends `ProcessRegistry` with `register_host_process` to track externally spawned host processes with standard housekeeping (checkpointing/pruning).
2. Registers `slash_worker` in the global `ProcessRegistry`. Standard gateway teardowns (`GatewayRunner.stop()`) now explicitly reap these workers.
3. Adds a watchdog to `slash_worker.py` that monitors the parent PID + create_time fingerprint. This ensures the worker self-terminates even if the parent is SIGKILLed or if PID reuse occurs on Windows.

Verification & Testing
- Confirmed with `ps aux` that `hermes dashboard --stop` successfully reaps all associated workers through the registry.
- Ran `kill -9` on the main process and verified that orphans self-terminate within the watchdog window.
- Ran `tests/test_tui_gateway_server.py` (the single failure was confirmed as pre-existing on macOS).

Compliance
`os.kill(pid, 0)` usage).
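The parent PID + create_time fingerprint check described in the Solution section can be illustrated with a small sketch. This is not the code from `slash_worker.py`: the function names and the injectable `probe` parameter are hypothetical, and psutil is imported optionally on the assumption that the worker must still start without it. The idea is that a bare PID can be recycled by the OS, while the pair (PID, creation time) identifies one specific process.

```python
import threading
import time

try:
    import psutil  # third-party; assumed optional here
except ImportError:
    psutil = None


def fingerprint(pid):
    """Return (pid, create_time) for a live process, or None if it is gone."""
    if psutil is None:
        raise RuntimeError("psutil is required for fingerprinting")
    try:
        return (pid, psutil.Process(pid).create_time())
    except psutil.NoSuchProcess:
        return None


def parent_alive(expected, probe=fingerprint):
    """True only if the same process (same pid AND same start time) still exists."""
    return expected is not None and probe(expected[0]) == expected


def start_watchdog(expected, on_parent_death, interval=10.0, probe=fingerprint):
    """Poll the parent's fingerprint in a daemon thread; fire the callback once."""
    def loop():
        while parent_alive(expected, probe):
            time.sleep(interval)
        on_parent_death()

    thread = threading.Thread(target=loop, daemon=True, name="parent-watchdog")
    thread.start()
    return thread
```

A worker could capture `expected = fingerprint(os.getppid())` at startup and pass a callback that exits the process. If a new process later reuses the parent's PID, its creation time differs, so the comparison fails and the worker still shuts down.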