Skip to content

[codex] fix(tui): reap dashboard slash workers#32483

Closed
stardust-mem wants to merge 1 commit into
NousResearch:mainfrom
stardust-mem:codex/tui-slash-worker-cleanup
Closed

[codex] fix(tui): reap dashboard slash workers#32483
stardust-mem wants to merge 1 commit into
NousResearch:mainfrom
stardust-mem:codex/tui-slash-worker-cleanup

Conversation

@stardust-mem

Copy link
Copy Markdown

Summary

Fixes #32377 by tightening dashboard PTY teardown so tui_gateway.slash_worker descendants do not survive repeated /api/pty disconnects.

Root Cause

The dashboard PTY close path signalled only the PTY root process. If that process exited while descendants such as the TUI gateway's slash worker were still alive, the worker could become orphaned and accumulate under the dashboard service cgroup. Separately, _SlashWorker.close() did not wait after SIGKILL, and the worker process had no Linux parent-death guard for abrupt gateway exits.

Changes

  • Track the PTY child process group in PtyBridge and send SIGHUP -> SIGTERM -> SIGKILL to that group during close.
  • Add Linux PR_SET_PDEATHSIG setup in tui_gateway.slash_worker itself, avoiding preexec_fn from the multithreaded gateway parent.
  • Make _SlashWorker.close() wait after SIGKILL and close stdio pipes in all exit paths.
  • Add regression tests for PTY grandchild cleanup, slash worker parent-death signal installation, and SIGKILL wait/reap behavior.

Validation

  • pytest tests/hermes_cli/test_pty_bridge.py::TestPtyBridgeClose::test_close_terminates_child_process_group -q
  • pytest tests/tui_gateway/test_protocol.py::test_slash_worker_child_installs_linux_parent_death_signal tests/tui_gateway/test_protocol.py::test_slash_worker_close_waits_after_kill -q
  • pytest tests/hermes_cli/test_pty_bridge.py tests/hermes_cli/test_web_server.py::TestPtyWebSocket -q
  • pytest tests/tui_gateway/test_protocol.py -q
  • pytest tests/test_tui_gateway_server.py::test_session_create_close_race_does_not_orphan_worker tests/test_tui_gateway_server.py::test_session_create_no_race_keeps_worker_alive -q
  • python -m py_compile hermes_cli/pty_bridge.py tui_gateway/server.py tui_gateway/slash_worker.py
  • git diff --check

Note: pytest tests/test_tui_gateway_server.py::test_browser_manage_connect_default_local_reports_launch_hint -q fails in this checkout independently of these changes; the assertion expects a browser-launch hint that the current local environment/code path does not emit.

@stardust-mem stardust-mem marked this pull request as ready for review May 26, 2026 12:38
@alt-glitch alt-glitch added type/bug Something isn't working P1 High — major feature broken, no workaround comp/tui Terminal UI (ui-tui/ + tui_gateway/) comp/cli CLI entry point, hermes_cli/, setup wizard comp/gateway Gateway runner, session dispatch, delivery labels May 26, 2026
@teknium1

teknium1 commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Superseded by #42132 (merged), which closes the slash_worker subprocess leak via two guards: process-group kill on PTY teardown + a cross-platform parent-death watchdog in the worker. Closing as resolved — thanks for tackling this; the merged fix salvaged the process-group-kill and watchdog approaches with contributor authorship preserved.

@teknium1 teknium1 closed this Jun 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/cli CLI entry point, hermes_cli/, setup wizard comp/gateway Gateway runner, session dispatch, delivery comp/tui Terminal UI (ui-tui/ + tui_gateway/) P1 High — major feature broken, no workaround type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Dashboard hangs: tui_gateway.slash_worker subprocesses leak on PTY chat disconnect (524 via reverse proxy)

3 participants