Dashboard hangs under PTY chat after stacked tui_gateway.slash_worker subprocess leak
Summary
When running hermes dashboard --tui behind a reverse proxy (Cloudflare Tunnel in my case), repeatedly opening and closing the in-browser Chat tab leaks tui_gateway.slash_worker Python subprocesses. They stack up inside the dashboard's systemd cgroup until the process saturates memory and stops responding: the dashboard hangs with dozens of pending connections, and the proxy returns 524 (Cloudflare) / 502 (nginx). SIGTERM does not reap the workers; only SIGKILL on the whole cgroup recovers the service.
Filed at the maintainers' request after a live incident on vm-hermes (Hermes Agent v0.14.0, commit cea87d913).
Environment
- Hermes Agent v0.14.0, main @ cea87d913
- Linux (Azure Ubuntu 24.04, kernel 6.17.0-1015-azure)
- Python from the bundled venv
- Launched via systemd --user unit:
ExecStart=/.../hermes dashboard --port 9119 --host 127.0.0.1 --no-open --tui --skip-build
- Reverse proxy: Cloudflare Tunnel → 127.0.0.1:9119
- Model: claude-opus-4.7 via Copilot OAuth (reasoning medium)
Reproduction
- Start hermes dashboard --tui bound to loopback.
- Open the dashboard in a browser and click Chat (this spawns a PTY plus a tui_gateway.slash_worker).
- Close the tab or browser without typing /quit, OR let the websocket drop through an upstream proxy timeout or a page refresh.
- Re-open Chat. Repeat 3–5 times.
- Run pgrep -af tui_gateway.slash_worker: workers accumulate, one per session, and are never reaped.
- After ~5 stacked workers the dashboard process climbs past ~800 MB RSS, the event loop starves, all HTTP requests stall (51 pending in my incident), and the proxy returns 524.
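To tell worker memory apart from the dashboard's own RSS while reproducing, the pgrep step can be extended into a small census. This is a minimal sketch, assuming a Linux /proc filesystem; the function name slash_worker_rss is made up for illustration and is not part of Hermes.

```python
from pathlib import Path

NEEDLE = "tui_gateway.slash_worker"  # module name as it appears in this report


def slash_worker_rss(proc_root="/proc", needle=NEEDLE):
    """Return {pid: rss_kib} for processes whose command line mentions needle."""
    found = {}
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # skip non-process entries such as /proc/self
        try:
            # cmdline holds NUL-separated argv; join with spaces for matching
            raw = (entry / "cmdline").read_bytes()
            cmdline = raw.replace(b"\0", b" ").decode(errors="replace")
            if needle not in cmdline:
                continue
            rss_kib = 0
            for line in (entry / "status").read_text().splitlines():
                if line.startswith("VmRSS:"):
                    rss_kib = int(line.split()[1])  # the kernel reports kB
                    break
            found[int(entry.name)] = rss_kib
        except (OSError, ValueError):
            continue  # process exited mid-scan or is unreadable
    return found


if __name__ == "__main__":
    workers = slash_worker_rss()
    total_mib = sum(workers.values()) / 1024
    print(f"{len(workers)} worker(s), {total_mib:.1f} MiB RSS total")
```

Running it after each open/close cycle should show whether the count grows by one per session, as described above.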
Observed behaviour
$ pgrep -af tui_gateway.slash_worker | wc -l
5
$ systemctl --user status hermes-dashboard.service
Memory: 781.4M (peak: 1.1G)
Tasks: 38
$ journalctl --user -u hermes-dashboard | grep -i "Incoming request ended abruptly: context canceled" | wc -l
51
SIGTERM to the unit was ignored. Recovery:
systemctl --user kill -s SIGKILL hermes-dashboard.service
systemctl --user reset-failed hermes-dashboard.service
systemctl --user start hermes-dashboard.service
Root cause (suspected)
Two related leaks in the PTY-bridge / slash-worker lifecycle:
- The /api/pty WebSocketDisconnect path closes the PtyBridge correctly, but the spawned hermes --tui child holds an open _SlashWorker (a subprocess.Popen of tui_gateway.slash_worker). The slash worker is a grandchild of the dashboard: it is spawned by the in-PTY agent process, not by the dashboard itself. When the agent dies via the SIGHUP→SIGTERM→SIGKILL escalation in pty_bridge.PtyBridge.close() and exits non-cleanly, the worker does not always see the TTY hangup. Result: an orphaned tui_gateway.slash_worker is reparented (to PID 1 or the nearest subreaper) and stays in the dashboard cgroup, because it was launched via Popen from an agent that itself started under the dashboard's user-unit cgroup.
- _SlashWorker registers no atexit or signal handler and sets no PR_SET_PDEATHSIG in tui_gateway/server.py (lines ~183–264). close() is only called when the in-agent code reaches _restart_slash_worker or session shutdown, and neither runs on an abrupt websocket disconnect.
So:
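- Every browser refresh or Cloudflare upstream timeout that drops the /api/pty WS leaves an orphan worker.
- Because the workers live in the same systemd user cgroup, they count against the dashboard service's Memory and Tasks figures. In my incident, SIGTERM to the unit did not reap them; they only died once the whole cgroup was SIGKILLed. Why they survive SIGTERM is still unconfirmed: a process with no SIGTERM handler normally terminates on it, and with the default KillMode=control-group a stop signals every process in the cgroup, so the workers may be blocking the signal or the stop sequence may have stalled on the hung main process.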
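The underlying mechanism can be reproduced without Hermes. The sketch below is an illustration under stated assumptions, not the actual agent code: a stand-in "agent" process spawns a stand-in "worker" and is then SIGKILLed. It involves no PTY, so it only covers the case where the agent dies without a chance to clean up. On a POSIX system the worker is expected to outlive its parent.

```python
import os
import signal
import subprocess
import sys
import time

# Stand-in for the agent: spawns a long-lived child, reports its PID, then idles.
AGENT_SRC = """
import subprocess, sys, time
worker = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
print(worker.pid, flush=True)
time.sleep(30)
"""


def orphan_survives_parent_kill():
    """SIGKILL the stand-in agent and report whether its child is still alive."""
    agent = subprocess.Popen(
        [sys.executable, "-c", AGENT_SRC], stdout=subprocess.PIPE, text=True
    )
    worker_pid = int(agent.stdout.readline())
    agent.kill()  # SIGKILL: the agent cannot run any cleanup code
    agent.wait()
    agent.stdout.close()
    time.sleep(0.2)  # give the kernel a moment; nothing signals the worker
    try:
        os.kill(worker_pid, 0)  # signal 0 only checks that the PID exists
        alive = True
    except ProcessLookupError:
        alive = False
    if alive:
        os.kill(worker_pid, signal.SIGKILL)  # clean up the demo's own orphan
    return alive


if __name__ == "__main__":
    print("worker outlived its parent:", orphan_survives_parent_kill())
```

Nothing in the kernel ties a child's lifetime to its parent by default, which is why the fix has to come from either the worker's side (a parent-death signal) or the dashboard's side (an explicit sweep).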
Suggested fix
Two independent guards, each cheap and useful on its own:
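- In _SlashWorker.__init__, set PR_SET_PDEATHSIG to SIGTERM on the child via a preexec_fn so the worker dies as soon as its parent agent exits, even if the parent crashes or is SIGKILLed. Per prctl(2), the signal fires when the thread that created the child exits, so the worker should be spawned from a thread that lives as long as the agent.
def _set_pdeathsig():
    # Runs in the child between fork() and exec(); keep it minimal.
    try:
        import ctypes, signal as _sig
        libc = ctypes.CDLL("libc.so.6", use_errno=True)
        PR_SET_PDEATHSIG = 1  # value from <linux/prctl.h>
        libc.prctl(PR_SET_PDEATHSIG, _sig.SIGTERM, 0, 0, 0)
    except Exception:
        pass  # best effort: never prevent the worker from starting
self.proc = subprocess.Popen(
    argv,
    ...,
    # prctl(2) is Linux-only, so install the hook only there (needs import sys)
    preexec_fn=_set_pdeathsig if sys.platform.startswith("linux") else None,
)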
- In /api/pty's finally block (hermes_cli/web_server.py ~3582–3588), after bridge.close(), also walk and SIGTERM any tui_gateway.slash_worker whose parent pid is the just-closed bridge's pid. The list of matching PIDs has to be collected before bridge.close() runs, because once the agent exits its children are reparented and no longer report it as their parent. This is defensive (the first guard makes it redundant on Linux), but it's necessary on macOS where PR_SET_PDEATHSIG doesn't exist (use proc_track / kqueue NOTE_EXIT if you want symmetry, or just accept the explicit sweep).
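The second guard could look roughly like the sketch below. It assumes a POSIX ps that accepts -A and -o pid=,ppid=,args= (both Linux and macOS do); the helper names and the agent_pid variable are hypothetical, and how PtyBridge exposes the child's PID is not something this report confirms.

```python
import os
import signal
import subprocess

NEEDLE = "tui_gateway.slash_worker"  # module name taken from this report


def parse_child_workers(ps_output, parent_pid, needle=NEEDLE):
    """Return sorted PIDs whose PPID is parent_pid and whose argv mentions needle.

    Expects lines of "pid ppid args", as printed by `ps -A -o pid=,ppid=,args=`.
    """
    pids = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # blank line, or a process with no visible arguments
        try:
            pid, ppid = int(parts[0]), int(parts[1])
        except ValueError:
            continue
        if ppid == parent_pid and needle in parts[2]:
            pids.append(pid)
    return sorted(pids)


def snapshot_child_workers(parent_pid):
    """Run ps and return the worker PIDs currently parented by parent_pid."""
    out = subprocess.run(
        ["ps", "-A", "-o", "pid=,ppid=,args="],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_child_workers(out, parent_pid)


def sweep(pids, sig=signal.SIGTERM):
    """Signal each PID, skipping ones that already exited; return those signalled."""
    signalled = []
    for pid in pids:
        try:
            os.kill(pid, sig)
            signalled.append(pid)
        except ProcessLookupError:
            pass  # already gone, nothing to do
    return signalled


# Hypothetical call order inside the /api/pty finally block:
#   workers = snapshot_child_workers(agent_pid)  # while the agent is still alive
#   bridge.close()
#   sweep(workers)
```

Taking the snapshot first avoids the reparenting problem, at the cost of a small PID-reuse window between snapshot and sweep; re-checking the command line just before signalling would narrow it.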
Optional third: document KillMode=control-group for systemd deployments (it's the default, but worth a note in docs/deployment.md) so operators don't override it to process and lose the cgroup-wide SIGKILL recovery path.
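Operators could also verify the setting on a running unit. The following is a rough sketch that parses the KEY=VALUE output of systemctl show; the function names are invented for illustration, and the unit name is the one used in this report.

```python
import subprocess


def parse_show(text):
    """Parse `systemctl show` output (one KEY=VALUE per line) into a dict."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def kill_mode_warning(props):
    """Return a warning string if KillMode leaves cgroup members alive on stop."""
    mode = props.get("KillMode", "control-group")  # systemd's default
    if mode in ("process", "none"):
        return (
            f"KillMode={mode}: stopping the unit will not kill every "
            "process in its cgroup, so leaked workers can survive a restart"
        )
    return None  # control-group and mixed both end with a cgroup-wide kill


def check_unit(unit="hermes-dashboard.service"):
    """Query the user manager for the unit's KillMode and evaluate it."""
    out = subprocess.run(
        ["systemctl", "--user", "show", "-p", "KillMode", unit],
        capture_output=True, text=True, check=True,
    ).stdout
    return kill_mode_warning(parse_show(out))
```

A deployment doc could point at a check like this rather than asking operators to read the unit file by hand.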
Reproducibility
Reliably reproduces on my box: 3 open/close cycles on a flaky upstream (I forced this by toggling Cloudflare cache rules), 5 stacked workers, hang within 2 minutes. Happy to provide systemd journal excerpts or strace if useful.
Related
- Host: header issue affecting the same proxy setup, addressed in "feat(dashboard): HERMES_DASHBOARD_ALLOWED_HOSTS env for reverse-proxy / tunnel deployments" #32362 (DNS-rebinding allowlist via env var). That patch is what surfaced this leak: without it the dashboard wasn't reachable over the tunnel at all, so the leak never had a chance to accumulate.