tui_gateway.slash_worker subprocesses leak under dashboard usage (swap-pinning a 7.8GB box at 128 workers) #21467

## Bug: `tui_gateway.slash_worker` subprocesses leak under dashboard usage

`hermes dashboard --tui` is documented to use a **persistent `_SlashWorker` subprocess** per session — singular, persistent across slash invocations (per `AGENTS.md` § "Slash Command Flow"). Observed behavior contradicts this: each `slash.exec` call appears to spawn a fresh `tui_gateway.slash_worker` subprocess and orphan it.

Workers from sessions that ended hours/days ago are still running. Each holds ~95 MB resident. On a busy multi-user dashboard box this accumulates fast enough to swap-pin the host.

### Reproducer

* `hermes dashboard --insecure --no-open --host 0.0.0.0 --port 9119 --tui`
* Multiple long-lived browser dashboard chat sessions (different users)
* Heavy use of slash commands via the embedded TUI in browser

### Observed accumulation (production, 7.8 GB box)

128 stale `tui_gateway.slash_worker` subprocesses across 5 dashboard chat sessions over ~48 h:

```
Session 20260505_104123_c8063a — 62 stale workers (oldest from 5/5 ~10:41 AM)
Session 20260506_072606_f58b62 — 30
Session 20260506_082737_2a52e2 — 29
Session 20260506_114839_5a88dd —  4
Session 20260505_163042_6b92fb —  3
                                ───
                         total 128
```
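For anyone wanting to reproduce a tally like the one above on their own box, here is a minimal sketch that groups worker processes by their `--session-key` argument. It assumes the worker command line has the shape shown in this report and that `ps -eo pid,etimes,args` is available (procps on Linux). The function names `tally_by_session` and `live_tally` are made up for illustration and are not part of hermes.

```python
import re
import subprocess
from collections import Counter

# Matches the worker command line reported in this issue and captures the session id.
_SESSION = re.compile(r"tui_gateway\.slash_worker\b.*?--session-key[ =](\S+)")


def tally_by_session(ps_output: str) -> Counter:
    """Count slash_worker processes per session id in `ps` text output."""
    counts: Counter = Counter()
    for line in ps_output.splitlines():
        match = _SESSION.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def live_tally() -> Counter:
    """Run ps on this host and tally the workers it currently shows."""
    result = subprocess.run(
        ["ps", "-eo", "pid,etimes,args"],
        capture_output=True,
        text=True,
        check=True,
    )
    return tally_by_session(result.stdout)
```

If the persistent-worker design were holding, `max(live_tally().values(), default=0)` should not exceed 1 outside an active turn.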

128 × ~95 MB ≈ 12 GB of resident demand on a 7.8 GB host. Result:

```
Before SIGTERM:   RAM free  187 MB,  swap used 3.9 GB / 4.0 GB,  slash_workers 128
After  SIGTERM:   RAM free  4.4 GB,  swap used  81 MB,           slash_workers  19
```

`kswapd0` was active. Every keystroke through the websocket → PTY bridge had to wait on kernel paging. Symptom for end users: dashboard text field laggy / unresponsive on a box that otherwise has no load.
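The per-worker memory figure can be cross-checked by summing `VmRSS` from `/proc/<pid>/status` for each worker PID. The sketch below is one way to do that on Linux; `vmrss_kb` and `total_rss_mb` are hypothetical helper names. Note that RSS counts shared pages once per process and excludes pages already swapped out, so the sum is only a rough estimate of real demand.

```python
import re
from pathlib import Path
from typing import Iterable, Optional

# /proc/<pid>/status reports resident set size as e.g. "VmRSS:\t   97280 kB".
_VMRSS = re.compile(r"^VmRSS:\s+(\d+)\s+kB$", re.MULTILINE)


def vmrss_kb(status_text: str) -> Optional[int]:
    """Extract VmRSS in kB from the text of /proc/<pid>/status, if present."""
    match = _VMRSS.search(status_text)
    return int(match.group(1)) if match else None


def total_rss_mb(pids: Iterable[int]) -> float:
    """Sum VmRSS over the given PIDs and return the total in MB (kB / 1024)."""
    total_kb = 0
    for pid in pids:
        try:
            text = Path(f"/proc/{pid}/status").read_text()
        except OSError:
            continue  # process exited between listing and reading
        total_kb += vmrss_kb(text) or 0
    return total_kb / 1024
```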

### Re-accumulation rate

Cleared 128 → 19 at ~10:30 AM ET. Re-checked ~4 hours later: **back to 46** (44 idle plus 2 in the active turn). None of them are reused across slash invocations within the same session; they accumulate rather than being deduplicated. This confirms that the persistent single-worker behavior is not happening.

```
Session 20260506_082737_2a52e2 — 15 workers
Session 20260506_072606_f58b62 — 14
Session 20260505_104123_c8063a — 14
Session 20260507_142806_ea1f49 —  1   (active session)
                                ───
                          total 44 (+ 2 in active turn)
```

Each worker is invoked as:
```
python3 -m tui_gateway.slash_worker --session-key <session_id> --model claude-opus-4-7
```

### Where to look (per `AGENTS.md` TUI architecture section)

* `tui_gateway/server.py` — slash worker lifecycle / spawn path
* `tui_gateway/__main__.py` (or wherever `_SlashWorker` is constructed)
* `hermes_cli/pty_bridge.py` + `hermes_cli/web_server.py` `/api/pty` — dashboard side
* `ui-tui/src/*` — `slash.exec` dispatch path

The most likely cause is one or both of:
* `_SlashWorker` is constructed per call instead of being looked up from a session-scoped registry
* an existing registry does not reap workers when the session disconnects (no graceful close on websocket teardown)
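To make the two hypotheses concrete, here is a minimal sketch of what a session-scoped registry with reaping could look like. This is not the hermes implementation: `SlashWorkerRegistry`, `get`, and `close` are hypothetical names, and the real `_SlashWorker` may wrap its subprocess differently. The point is only the shape: look up before spawning, and terminate plus `wait()` on session teardown.

```python
import subprocess
import sys
import threading
from typing import Callable, Dict, List

SpawnFn = Callable[[List[str]], subprocess.Popen]


class SlashWorkerRegistry:
    """Hypothetical registry holding at most one live worker per session."""

    def __init__(self, spawn: SpawnFn = subprocess.Popen) -> None:
        self._spawn = spawn
        self._workers: Dict[str, subprocess.Popen] = {}
        self._lock = threading.Lock()

    def get(self, session_id: str, argv: List[str]) -> subprocess.Popen:
        """Return the session's worker, spawning only if none is alive."""
        with self._lock:
            proc = self._workers.get(session_id)
            if proc is None or proc.poll() is not None:
                proc = self._spawn(argv)
                self._workers[session_id] = proc
            return proc

    def close(self, session_id: str, timeout: float = 5.0) -> None:
        """Terminate and reap the session's worker, e.g. on websocket teardown."""
        with self._lock:
            proc = self._workers.pop(session_id, None)
        if proc is None:
            return
        if proc.poll() is None:
            proc.terminate()  # SIGTERM first
            try:
                proc.wait(timeout=timeout)
            except subprocess.TimeoutExpired:
                proc.kill()  # escalate to SIGKILL if the worker ignores SIGTERM
                proc.wait()


if __name__ == "__main__":
    # Stand-in for the real worker command line; it sleeps instead of serving.
    argv = [sys.executable, "-c", "import time; time.sleep(60)"]
    registry = SlashWorkerRegistry()
    first = registry.get("demo-session", argv)
    second = registry.get("demo-session", argv)
    print("reused:", first.pid == second.pid)
    registry.close("demo-session")
    print("reaped:", first.poll() is not None)
```

With this shape, repeated `slash.exec` calls in one session would hit `get()` and share a process, and the disconnect handler would call `close()` so nothing outlives the session.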

### Cross-fleet observation

The same `hermes dashboard --tui` is running on 4 other VPSes in our fleet (Finn, Finn2, Jason, Sam; all on the same release). All 4 currently show **zero** `tui_gateway.slash_worker` accumulation **because their dashboards are open but barely used by humans**. So the code path is the same everywhere; only Hermie, the affected box described above, has the multi-user traffic to expose the bug. This is consistent with "every slash invocation spawns a fresh worker, and only when the session disconnects do orphans become visible."

### Workaround in production

Stopgap kill cron deployed across all 5 of our boxes:

```bash
# Kill any tui_gateway.slash_worker process older than 60 minutes (gracefully).
ps -eo pid,etimes,cmd \
  | awk '/tui_gateway\.slash_worker/ && !/awk/ && $2+0 > 3600 {print $1}' \
  | xargs -r kill -TERM
```

The cron job runs every 30 min. Workers exit cleanly on SIGTERM; in our testing this had no impact on live sessions.

I'd be happy to clear out our production state more often and capture additional snapshots if useful (process tables, `/proc/<pid>/status`, etc.). We will keep applying the stopgap until an upstream fix is available.

### Environment

* `hermes-agent` build: 2026.4.30 + 188 commits (`v2026.4.30-188-g5d3be898a`)
* Ubuntu 24.04 LTS
* Python 3.12, all default
* Anthropic API mode, model `claude-opus-4-7`
* 5 production VPSes, 7.8 GB RAM each

