Skip to content

fix(pty-bridge): terminate PTY process groups on teardown#24135

Closed
paulb26 wants to merge 1 commit into
NousResearch:mainfrom
paulb26:fix/pty-group-cleanup
Closed

fix(pty-bridge): terminate PTY process groups on teardown#24135
paulb26 wants to merge 1 commit into
NousResearch:mainfrom
paulb26:fix/pty-group-cleanup

Conversation

@paulb26

@paulb26 paulb26 commented May 12, 2026

Copy link
Copy Markdown
Contributor

Problem statement

The dashboard chat runs the real TUI through PtyBridge. After extended dashboard use, child helper processes such as tui_gateway.slash_worker can survive chat teardown. Those orphaned helpers accumulate and eventually exhaust available PTY devices, causing new dashboard chats to fail with:

Chat failed to start: out of pty devices

This primarily affects long-running local dashboard users who open/close or reconnect chat sessions over time.

Root cause

PtyBridge.close() only sent teardown signals to the PTY leader process via self._proc.kill(sig). The TUI can spawn helper children under the same process group, and killing only the leader does not reliably terminate those descendants. The process tree outlives the dashboard chat session, leaking PTY-backed resources.

Fix approach

  • Capture the child process group id with os.getpgid(self._proc.pid).
    • Identifies the PTY subtree that should receive terminal-close signals.
  • Send SIGHUP, then SIGTERM, then SIGKILL to the full process group with os.killpg(pgid, sig).
    • Preserves the existing escalation behavior while including helper children.
  • Fall back to the existing single-process kill path if the process group cannot be resolved.
    • Preserves previous behavior for races where the process is already gone or group lookup fails.
  • Add focused unit coverage using fake process objects.
    • Verifies group signaling without depending on timing-sensitive real subprocess trees.

Test coverage

Added tests in tests/hermes_cli/test_pty_bridge.py:

  • test_close_signals_child_process_group
    • Verifies close() uses os.killpg() when a process group is available and does not fall back to single-process kill().
  • test_close_falls_back_to_single_process_signal_when_group_unknown
    • Verifies the existing self._proc.kill() path is still used if os.getpgid() fails.

Targeted test command:

scripts/run_tests.sh tests/hermes_cli/test_pty_bridge.py

Risk assessment

Low-to-moderate risk. The PTY bridge is already POSIX-only, and process-group teardown is the correct lifecycle boundary for an embedded terminal session. The main compatibility risk is that any child process intentionally meant to survive the dashboard TUI will now be terminated; for dashboard chat teardown, that is the desired behavior because helper descendants are session-owned. Fallback preserves the previous behavior if process-group lookup is unavailable.

@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/cli CLI entry point, hermes_cli/, setup wizard labels May 12, 2026
@paulb26 paulb26 force-pushed the fix/pty-group-cleanup branch from 7b0af0d to 235895e Compare May 12, 2026 03:27
banditburai added a commit to banditburai/hermes-agent that referenced this pull request Jun 1, 2026
…ousResearch#24135)

close() escalated SIGHUP->SIGTERM->SIGKILL against the leader only, leaking any
grandchildren it spawned (NousResearch#32377). Capture the child's pgid at construction
(never the dashboard's own group) and signal the whole group via os.killpg, with
a single-leader fallback when no pgid is available. Keeps the trailing
proc.close(force=True) that closes the PTY master fd (the NousResearch#24775 fd-leak fix).

Task: swl-qrf.10
teknium1 added a commit that referenced this pull request Jun 8, 2026
@teknium1

teknium1 commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Salvaged into #42132 — your fix(pty-bridge): terminate PTY process groups on teardown commit was cherry-picked onto current main with your authorship preserved in git log (b31c6c3). It's now guard 1 of the two-guard slash-worker leak fix. Thanks!

@teknium1 teknium1 closed this Jun 8, 2026
jhjaggars-hermes pushed a commit to jhjaggars/hermes-agent that referenced this pull request Jun 8, 2026
changman pushed a commit to changman/hermes-agent that referenced this pull request Jun 10, 2026
alt-glitch pushed a commit that referenced this pull request Jun 14, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/cli CLI entry point, hermes_cli/, setup wizard P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants