Skip to content

fix(gateway): implement robust lifecycle management for slash_worker (#21370, #22855)#22863

Closed
leether wants to merge 4 commits into
NousResearch:mainfrom
leether:fix/slash-worker-lifecycle-22855
Closed

fix(gateway): implement robust lifecycle management for slash_worker (#21370, #22855)#22863
leether wants to merge 4 commits into
NousResearch:mainfrom
leether:fix/slash-worker-lifecycle-22855

Conversation

@leether

@leether leether commented May 9, 2026

Copy link
Copy Markdown

Co-authored with Gemini CLI (Orchestrator)

Problem

In high-frequency Dashboard usage scenarios, tui_gateway.slash_worker subprocesses are leaked after each chat turn (#21370). Our forensics on macOS identified that these orphans parented to init (PID 1) can accumulate significantly, causing memory exhaustion and interfering with environment synchronization during hermes update.

Solution: Defense-in-Depth

This PR implements a "three-way closure" for the worker lifecycle following the guidelines in AGENTS.md:

  1. Architecture Alignment: Extended ProcessRegistry with register_host_process to track externally spawned host processes with standard housekeeping (checkpointing/pruning).
  2. Active Management: Integrated slash_worker into the global ProcessRegistry. Standard gateway teardowns (GatewayRunner.stop()) now explicitly reap these workers.
  3. Passive Safety (Orphan Watchdog): Injected a lightweight watchdog thread into slash_worker.py that monitors the parent PID + create_time fingerprint. This ensures the worker self-terminates even if the parent is SIGKILLed or if PID reuse occurs on Windows.

Verification & Testing

  • Active Teardown Test: Verified via ps aux that hermes dashboard --stop successfully reaps all associated workers through the registry.
  • Hard Crash Test: Simulated kill -9 on the main process and verified orphans self-terminate within the watchdog window.
  • Fingerprint Simulation: Mocked PID reuse (same PID, different create_time) and verified the watchdog correctly triggers exit.
  • Regression Check: Passed 173/174 tests in tests/test_tui_gateway_server.py (the single failure was confirmed as pre-existing on macOS).

Compliance

  • Strictly adheres to AGENTS.md profile-aware path standards.
  • Cross-platform compatible (no os.kill(pid, 0) usage).
  • Zero impact on Prompt Caching logic.

@alt-glitch alt-glitch added type/bug Something isn't working P1 High — major feature broken, no workaround comp/tui Terminal UI (ui-tui/ + tui_gateway/) comp/gateway Gateway runner, session dispatch, delivery labels May 9, 2026
@leether

leether commented May 10, 2026

Copy link
Copy Markdown
Author

Technical Note regarding TUI Gateway tests on macOS:

During the final verification of this PR on macOS (Darwin), I observed one pre-existing test failure in tests/test_tui_gateway_server.py:
test_browser_manage_connect_default_local_reports_launch_hint

Investigation Results

To ensure this wasn't a regression introduced by my changes, I performed a baseline test on a clean main worktree within the same environment. The failure persists on the unmodified upstream code, likely due to local Chrome/Chromium executable path mismatches or specific environment constraints on my machine.

All other 173 tests passed successfully, including those directly related to the ProcessRegistry and slash_worker lifecycle.

The core fix has been physically verified through "kill -9" simulations to confirm orphan reaping and watchdog effectiveness.

@leether leether force-pushed the fix/slash-worker-lifecycle-22855 branch from e8d76e1 to 902d627 Compare May 16, 2026 14:14
@leether

leether commented May 16, 2026

Copy link
Copy Markdown
Author

I pushed a small follow-up update after rebasing this PR onto the latest main.

What changed:

  • Added focused regression coverage for slash_worker lifecycle registration and the orphan watchdog.
  • Tightened one implementation detail in the registry integration: the original code passed the gateway session key only as task_id, but ProcessRegistry.has_active_for_session(...) checks ProcessSession.session_key. I updated register_host_process(...) to accept a session_key, changed _SlashWorker to pass it explicitly, and added a regression test for that contract.

Why this matters:
This keeps the original intent of the PR intact: slash_worker is not only present in the central registry, but can also be discovered by gateway-session lifecycle checks.

Tradeoff / scope:
This PR is complementary to the other related lifecycle fixes rather than a replacement for them:

Validation:

  • Focused regression tests pass locally: 4 passed.
  • There are still no reported GitHub checks on this branch, so I do not have a CI signal to wait on from the PR page.

sea-monsters pushed a commit to sea-monsters/hermes-agent that referenced this pull request May 24, 2026
Three-layer defence-in-depth against orphaned slash_worker subprocesses:

1. **Server-side cleanup (P0)** — when a WebSocket disconnects, sessions
   marked close_on_disconnect=true (sidecar/short-lived) are finalised
   and their worker is killed. Normal TUI sessions still fall back to the
   stdio transport for historical reconnect compatibility.
   (Design from NousResearch#21401, adapted with permission.)

2. **Parent watchdog (P1, psutil)** — a daemon thread monitors the parent's
   PID + create_time fingerprint every 10 s and exits if the parent
   disappears. Handles crashes, SIGKILL, and PID reuse (critical on
   Windows). (Design from NousResearch#22863, adapted with permission.)

3. **Idle timeout + getppid() poll (P1, no deps)** — the main stdin loop
   uses select.select() with a 60 s timeout so it can periodically check
   os.getppid() and a 30-minute idle deadline. Works without psutil,
   adding defence even when the watchdog thread is not available.

Also refactors session teardown into a shared _close_session_by_id()
helper so that session.close RPC, WebSocket-disconnect cleanup, and
server shutdown all use the identical code path.

Co-authored-by: Hermes Agent
@teknium1

teknium1 commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Superseded by #42132 (merged), which closes the slash_worker subprocess leak via two guards: process-group kill on PTY teardown + a cross-platform parent-death watchdog in the worker. Closing as resolved — thanks for tackling this; the merged fix salvaged the process-group-kill and watchdog approaches with contributor authorship preserved.

@teknium1 teknium1 closed this Jun 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery comp/tui Terminal UI (ui-tui/ + tui_gateway/) P1 High — major feature broken, no workaround type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants