Dashboard hangs: tui_gateway.slash_worker subprocesses leak on PTY chat disconnect (524 via reverse proxy) #32377

# Dashboard hangs under PTY chat after stacked `tui_gateway.slash_worker` subprocess leak

## Summary

When running `hermes dashboard --tui` behind a reverse proxy (Cloudflare Tunnel in my case), repeatedly opening and closing the in-browser **Chat** tab leaks `tui_gateway.slash_worker` Python subprocesses. They stack up inside the dashboard's systemd cgroup until the service saturates memory and stops responding: the dashboard hangs with dozens of pending connections, and the proxy returns 524 (Cloudflare) / 502 (nginx). SIGTERM does not reap the leaked workers; only SIGKILL on the whole cgroup recovers the service.

Filed at the maintainers' request after a live incident on `vm-hermes` (Hermes Agent v0.14.0, commit `cea87d913`).

## Environment

- Hermes Agent v0.14.0, `main` @ `cea87d913`
- Linux (Azure Ubuntu 24.04, kernel 6.17.0-1015-azure)
- Python from the bundled venv
- Launched via systemd --user unit:
  ```
  ExecStart=/.../hermes dashboard --port 9119 --host 127.0.0.1 --no-open --tui --skip-build
  ```
- Reverse proxy: Cloudflare Tunnel → `127.0.0.1:9119`
- Model: `claude-opus-4.7` via Copilot OAuth (reasoning medium)

## Reproduction

1. Start `hermes dashboard --tui` bound to loopback.
2. Open the dashboard in a browser, click **Chat** (spawns a PTY + `tui_gateway.slash_worker`).
3. Close the tab / browser without typing `/quit`, OR let the websocket drop due to upstream proxy timeout / page refresh.
4. Re-open Chat. Repeat 3–5 times.
5. `pgrep -af tui_gateway.slash_worker` — workers accumulate, one per session, never reaped.
6. After ~5 stacked workers the service's memory climbs to roughly 800 MB (781.4M current, 1.1G peak per `systemctl status` in my incident), the event loop starves, all HTTP requests stall (51 pending in my incident), and the proxy returns 524.
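To track the leak across repro cycles without eyeballing `pgrep`, a small script can list each worker together with its parent PID; a worker whose parent is PID 1 or a subreaper has been orphaned. This is a minimal, Linux-only sketch. It assumes `/proc` is mounted and that the worker's command line contains the string `tui_gateway.slash_worker`; the helper names are illustrative and not part of Hermes.

```python
import os

MARKER = "tui_gateway.slash_worker"  # assumed to appear in the worker's argv


def parse_ppid(stat_text):
    """Return the parent PID from the contents of /proc/<pid>/stat.

    The comm field is wrapped in parentheses and may itself contain spaces
    or ')', so split after the LAST ')' rather than on plain whitespace.
    """
    after_comm = stat_text.rsplit(")", 1)[1].split()
    return int(after_comm[1])  # fields after comm: state, ppid, ...


def list_workers(proc_root="/proc"):
    """Yield (pid, ppid, cmdline) for every process matching MARKER."""
    if not os.path.isdir(proc_root):
        return  # no procfs here (e.g. macOS): nothing to report
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "cmdline"), "rb") as f:
                raw = f.read()
            cmdline = raw.replace(b"\0", b" ").decode(errors="replace").strip()
            if MARKER not in cmdline:
                continue
            with open(os.path.join(base, "stat")) as f:
                ppid = parse_ppid(f.read())
        except (OSError, IndexError, ValueError):
            continue  # process exited mid-scan or is not readable
        yield int(entry), ppid, cmdline


if __name__ == "__main__":
    for pid, ppid, cmdline in list_workers():
        print(f"pid={pid} ppid={ppid} {cmdline}")
```

Running it after each open/close cycle should show one more line per leaked session if the leak is present.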

## Observed behaviour

```
$ pgrep -af tui_gateway.slash_worker | wc -l
5

$ systemctl --user status hermes-dashboard.service
   Memory: 781.4M (peak: 1.1G)
   Tasks: 38

$ journalctl --user -u hermes-dashboard | grep -i "Incoming request ended abruptly: context canceled" | wc -l
51
```

SIGTERM to the unit was ignored. Recovery:

```
systemctl --user kill -s SIGKILL hermes-dashboard.service
systemctl --user reset-failed hermes-dashboard.service
systemctl --user start hermes-dashboard.service
```

## Root cause (suspected)

Two related leaks in the PTY-bridge / slash-worker lifecycle:

1. **The `/api/pty` WebSocketDisconnect path closes the `PtyBridge` correctly, but the spawned `hermes --tui` child holds an open `_SlashWorker` (a `subprocess.Popen` of `tui_gateway.slash_worker`).** The slash worker is a *grandchild*: it is spawned by the in-PTY agent process, not by the dashboard. When the agent dies via the SIGHUP→SIGTERM→SIGKILL escalation in `pty_bridge.PtyBridge.close()` and exits non-cleanly, the TTY hangup does not always propagate to the worker. Result: an orphan `tui_gateway.slash_worker` that is reparented (to PID 1 or the nearest subreaper) yet stays in the dashboard's cgroup, because the agent that launched it via `Popen` was itself started under the dashboard's user-unit cgroup.
2. **`_SlashWorker` registers no `atexit` / signal handler and no `PR_SET_PDEATHSIG`** in `tui_gateway/server.py` (lines ~183–264). `close()` is only called when the in-agent code reaches `_restart_slash_worker` or session shutdown — neither runs on abrupt websocket disconnect.

So:
- Every browser refresh / Cloudflare upstream timeout that drops the `/api/pty` WS leaves an orphan worker.
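- Because the workers live in the same systemd user cgroup, they count against the dashboard service's memory and task accounting. In my incident they did not exit on SIGTERM and only died when the whole cgroup was SIGKILLed. Why SIGTERM was not enough is unconfirmed: with the default `KillMode=control-group`, `systemctl --user stop` should signal every process in the cgroup, and a Python process with no SIGTERM handler would normally terminate on it, so the workers may have been blocked or the stop may never have reached them.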
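The orphaning mechanism in point 1 can be reproduced without Hermes. The sketch below is a stand-in, not Hermes code: a throwaway "agent" process spawns a sleeping "worker", the agent is SIGKILLed the way the last step of the bridge's escalation would do it, and the worker keeps running because nothing ties its lifetime to its parent. It is POSIX-only, and every name in it is hypothetical.

```python
import os
import signal
import subprocess
import sys
import time

# Stand-in "agent": spawns a long-lived "worker", reports its PID, then idles.
AGENT_SRC = (
    "import subprocess, sys, time\n"
    "w = subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(60)'])\n"
    "print(w.pid, flush=True)\n"
    "time.sleep(60)\n"
)


def is_alive(pid):
    """True if a process with this PID exists (signal 0 only probes)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user
    return True


def orphan_demo():
    """Kill the agent with SIGKILL and report whether its worker survived."""
    agent = subprocess.Popen(
        [sys.executable, "-c", AGENT_SRC],
        stdout=subprocess.PIPE,
        text=True,
    )
    worker_pid = int(agent.stdout.readline())
    agent.kill()  # SIGKILL: the agent gets no chance to clean up
    agent.wait()
    agent.stdout.close()
    time.sleep(0.2)  # give the kernel a moment to reparent the worker
    survived = is_alive(worker_pid)
    if survived:
        os.kill(worker_pid, signal.SIGKILL)  # clean up after the demo
    return survived


if __name__ == "__main__":
    print("worker survived agent SIGKILL:", orphan_demo())
```

The same structure with a `PR_SET_PDEATHSIG` hook on the worker is a quick way to check that a fix actually ties the worker's lifetime to the agent.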

## Suggested fix

Two independent guards, each cheap and useful on its own:

1. **In `_SlashWorker.__init__`**, set `PR_SET_PDEATHSIG` to SIGTERM on the child via a `preexec_fn` so the worker dies the moment its parent agent exits — even if the parent crashes or is SIGKILLed.

   ```python
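   import ctypes
   import signal
   import subprocess
   import sys

   PR_SET_PDEATHSIG = 1  # value from <linux/prctl.h>

   def _set_pdeathsig():
       # Runs in the child between fork() and exec(), so keep it minimal.
       try:
           libc = ctypes.CDLL("libc.so.6", use_errno=True)
           libc.prctl(PR_SET_PDEATHSIG, int(signal.SIGTERM), 0, 0, 0)
       except Exception:
           pass  # best effort: never prevent the worker from starting

   self.proc = subprocess.Popen(
       argv,
       ...,  # existing arguments unchanged
       # PR_SET_PDEATHSIG is Linux-only, so skip the hook elsewhere.
       preexec_fn=_set_pdeathsig if sys.platform.startswith("linux") else None,
   )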
   ```

2. **In `/api/pty`'s `finally` block** (`hermes_cli/web_server.py` ~3582–3588), also sweep any `tui_gateway.slash_worker` descended from the bridge's child process and SIGTERM it. The set of workers has to be collected before `bridge.close()` and signalled after it, because once the agent exits its workers are reparented and their parent pid no longer identifies them. This is defensive (guard (1) makes it redundant on Linux), but it is necessary on macOS, where `PR_SET_PDEATHSIG` doesn't exist; use `kqueue` with `EVFILT_PROC` / `NOTE_EXIT` if you want symmetry, or just accept the explicit sweep.
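A possible shape for the sweep in guard (2), written as a sketch under assumptions rather than a patch: it snapshots the process table with `ps -A -o pid=,ppid=` (available on both Linux and macOS), walks the tree below a given PID, and signals whatever it finds. The function names are hypothetical, and how the bridge exposes its child PID is not shown here.

```python
import os
import signal
import subprocess


def parse_ps(text):
    """Parse `ps -A -o pid=,ppid=` output into a {pid: ppid} dict."""
    table = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].isdigit() and parts[1].isdigit():
            table[int(parts[0])] = int(parts[1])
    return table


def snapshot_ppids():
    """Snapshot every visible process as {pid: ppid} using ps."""
    result = subprocess.run(
        ["ps", "-A", "-o", "pid=,ppid="],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_ps(result.stdout)


def descendants(root_pid, table):
    """All PIDs below root_pid in a {pid: ppid} snapshot, root excluded."""
    children = {}
    for pid, ppid in table.items():
        children.setdefault(ppid, []).append(pid)
    found, stack, seen = [], [root_pid], {root_pid}
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in seen:  # guards against malformed snapshots
                seen.add(child)
                found.append(child)
                stack.append(child)
    return found


def sweep(pids, sig=signal.SIGTERM):
    """Signal each PID, ignoring ones that have already exited."""
    for pid in pids:
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            pass
```

In the handler this would be roughly `victims = descendants(child_pid, snapshot_ppids())` before `bridge.close()`, then `sweep(victims)` after it, where `child_pid` is a placeholder for the bridge's child. A production version should also re-check each PID's command line for `tui_gateway.slash_worker` before signalling, since PIDs can be reused between the snapshot and the sweep.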

Optional third: note in `docs/deployment.md` that the systemd unit should keep `KillMode=control-group` (it's the default, but worth stating), so operators don't override it to `process` and lose the cgroup-wide SIGKILL recovery path.

## Reproducibility

Reliably reproduces on my box: 3 open/close cycles on a flaky upstream (I forced this by toggling Cloudflare cache rules), 5 stacked workers, hang within 2 minutes. Happy to provide systemd journal excerpts or strace if useful.

## Related

- I separately patched a `Host:` header issue affecting the same proxy setup in #32362 (DNS-rebinding allowlist via env var). That patch is what surfaced this leak — without it the dashboard wasn't reachable over the tunnel at all, so the leak never had a chance to accumulate.

