Bug: tui_gateway.slash_worker subprocesses leak under dashboard usage
hermes dashboard --tui is documented to use a persistent _SlashWorker subprocess per session — singular, persistent across slash invocations (per AGENTS.md § "Slash Command Flow"). Observed behavior contradicts this: each slash.exec call appears to spawn a fresh tui_gateway.slash_worker subprocess and orphan it.
Workers from sessions that ended hours/days ago are still running. Each holds ~95 MB resident. On a busy multi-user dashboard box this accumulates fast enough to swap-pin the host.
Reproducer
hermes dashboard --insecure --no-open --host 0.0.0.0 --port 9119 --tui
- Multiple long-lived browser dashboard chat sessions (different users)
- Heavy use of slash commands via the embedded TUI in browser
Observed accumulation (production, 7.8 GB box)
128 stale tui_gateway.slash_worker subprocesses across 5 dashboard chat sessions over ~48 h:
Session 20260505_104123_c8063a — 62 stale workers (oldest from 5/5 ~10:41 AM)
Session 20260506_072606_f58b62 — 30
Session 20260506_082737_2a52e2 — 29
Session 20260506_114839_5a88dd — 4
Session 20260505_163042_6b92fb — 3
───
total 128
128 × ~95 MB ≈ 12 GB of resident demand on a 7.8 GB host. Result:
Before SIGTERM: RAM free 187 MB, swap used 3.9 GB / 4.0 GB, slash_workers 128
After SIGTERM: RAM free 4.4 GB, swap used 81 MB, slash_workers 19
kswapd0 was active. Every keystroke through the websocket → PTY bridge had to wait on kernel paging. Symptom for end users: dashboard text field laggy / unresponsive on a box that otherwise has no load.
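The back-of-envelope arithmetic behind the 12 GB figure, as a quick sketch. It treats the ~95 MB per-worker resident size as fully private memory, which is an assumption: pages shared between workers would make real demand somewhat lower.

```python
# Figures taken from the report above; per-worker size is approximate.
workers = 128
rss_mb = 95        # ~95 MB resident per worker
ram_gb = 7.8       # host RAM
swap_gb = 4.0      # host swap

demand_gb = workers * rss_mb / 1000      # total resident demand
overcommit_gb = demand_gb - ram_gb       # what cannot fit in RAM

print(f"{demand_gb:.1f} GB demand, {overcommit_gb:.1f} GB more than RAM")
```

Under that assumption the overflow beyond RAM is slightly larger than the 4.0 GB of swap, which matches the nearly full swap seen before the SIGTERM.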
Re-accumulation rate
Cleared 128 → 19 at ~10:30 AM ET. Re-checked ~4 hours later: back to 46, none of which are reused across slash invocations within the same session — they accumulate, not deduplicate. Confirms the persistent-singular-worker behavior is not happening.
Session 20260506_082737_2a52e2 — 15 workers
Session 20260506_072606_f58b62 — 14
Session 20260505_104123_c8063a — 14
Session 20260507_142806_ea1f49 — 1 (active session)
───
total 44 (+ 2 in active turn)
Each worker is invoked as:
python3 -m tui_gateway.slash_worker --session-key <session_id> --model claude-opus-4-7
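Since every worker carries its session in --session-key, the per-session counts above can be reproduced by grouping the process table on that flag. A minimal sketch, assuming input in the shape of `ps -eo pid,etimes,cmd`; the function name is hypothetical and the PIDs and ages in the sample are made up for illustration.

```python
import re
from collections import Counter

# Matches a slash_worker command line and captures its --session-key value.
SESSION_RE = re.compile(r"tui_gateway\.slash_worker\b.*?--session-key[ =](\S+)")


def count_workers_by_session(ps_output: str) -> Counter:
    """Count slash_worker processes per session key in ps-style text."""
    counts = Counter()
    for line in ps_output.splitlines():
        match = SESSION_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Illustrative sample only: PIDs and elapsed times are invented.
sample = """\
 1201  90211 python3 -m tui_gateway.slash_worker --session-key 20260505_104123_c8063a --model claude-opus-4-7
 1202  90100 python3 -m tui_gateway.slash_worker --session-key 20260505_104123_c8063a --model claude-opus-4-7
 1303    412 python3 -m tui_gateway.slash_worker --session-key 20260507_142806_ea1f49 --model claude-opus-4-7
 1404     10 bash
"""

for session, n in count_workers_by_session(sample).most_common():
    print(session, n)
```

A session with more than one worker in this output is direct evidence against the one-persistent-worker-per-session behavior.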
Where to look (per AGENTS.md TUI architecture section)
tui_gateway/server.py — slash worker lifecycle / spawn path
tui_gateway/__main__.py (or wherever _SlashWorker is constructed)
hermes_cli/pty_bridge.py + hermes_cli/web_server.py /api/pty — dashboard side
ui-tui/src/* — slash.exec dispatch path
The cause is most likely:
- _SlashWorker is being constructed per-call instead of looked up from a session-scoped registry, and/or
- an existing registry isn't reaping workers when the session disconnects (no graceful close on websocket teardown)
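To make the two suspected causes concrete, here is a rough sketch of what a session-scoped registry with reaping could look like. This is not the hermes implementation: SlashWorkerRegistry, spawn_stand_in, and the method names are all hypothetical, and the stand-in child just sleeps instead of running the real worker.

```python
import subprocess
import sys
import threading


class SlashWorkerRegistry:
    """Hypothetical registry: at most one live worker process per session."""

    def __init__(self, spawn):
        self._spawn = spawn      # callable: session_id -> subprocess.Popen
        self._workers = {}       # session_id -> Popen
        self._lock = threading.Lock()

    def get(self, session_id):
        """Return the session's worker, spawning only if none is alive."""
        with self._lock:
            proc = self._workers.get(session_id)
            if proc is None or proc.poll() is not None:
                proc = self._spawn(session_id)
                self._workers[session_id] = proc
            return proc

    def close(self, session_id, timeout=5.0):
        """Reap the session's worker; meant to run on websocket teardown."""
        with self._lock:
            proc = self._workers.pop(session_id, None)
        if proc is None or proc.poll() is not None:
            return
        proc.terminate()  # SIGTERM on POSIX
        try:
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()


def spawn_stand_in(session_id):
    # Stand-in child that just sleeps; the real command line would be the
    # tui_gateway.slash_worker invocation shown earlier.
    return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])


registry = SlashWorkerRegistry(spawn_stand_in)
first = registry.get("session-a")
second = registry.get("session-a")   # reused, no second spawn
other = registry.get("session-b")    # different session, separate worker
registry.close("session-a")
registry.close("session-b")
```

If slash.exec went through something like get() and the disconnect path called close(), repeated slash commands in one session would reuse a single process and nothing would outlive the session.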
Cross-fleet observation
The same hermes dashboard --tui is running on 4 other VPSes in our fleet (Finn, Finn2, Jason, Sam; same release). All 4 currently show zero tui_gateway.slash_worker accumulation because their dashboards are open but barely used by humans. So the bug is the same code path everywhere; only Hermie, the affected box, has the multi-user traffic to expose it. This is consistent with "every slash invocation spawns a fresh worker, and only when the session disconnects do orphans become visible."
Workaround in production
Stopgap kill cron deployed across all 5 of our boxes:
# Kill any tui_gateway.slash_worker process older than 60 minutes (gracefully).
ps -eo pid,etimes,cmd \
| awk '/tui_gateway\.slash_worker/ && !/awk/ && $2+0 > 3600 {print $1}' \
| xargs -r kill -TERM
Runs every 30 min. Workers exit cleanly on SIGTERM; in our testing there was no impact on live sessions.
I'd be happy to clear out our production state more often and capture additional snapshots if useful (process tables, /proc/<pid>/status, etc.). Will keep applying the stopgap until upstream fix is available.
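For anyone who wants to check the stopgap's selection logic before pointing it at a live box, the same age filter can be written as a pure function over `ps -eo pid,etimes,cmd` text. This is a sketch that mirrors the awk expression above; stale_worker_pids is a hypothetical name and the sample rows are invented.

```python
def stale_worker_pids(ps_output, max_age_s=3600):
    """Return PIDs of slash_worker processes older than max_age_s seconds."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)  # pid, etimes, rest of command line
        if len(fields) < 3 or not fields[0].isdigit() or not fields[1].isdigit():
            continue  # skips the ps header and malformed rows
        pid, age_s, cmd = int(fields[0]), int(fields[1]), fields[2]
        if "tui_gateway.slash_worker" in cmd and age_s > max_age_s:
            pids.append(pid)
    return pids


# Illustrative sample only: PIDs and elapsed times are invented.
sample_ps = """\
    PID ELAPSED CMD
   1201   90211 python3 -m tui_gateway.slash_worker --session-key a --model m
   1303     412 python3 -m tui_gateway.slash_worker --session-key b --model m
   1404   99999 bash
"""

print(stale_worker_pids(sample_ps))
```

Like the cron job, this selects on age alone, so a worker belonging to a session that has been open for more than an hour would also be picked.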
Environment
- hermes-agent build: 2026.4.30 + 188 commits (v2026.4.30-188-g5d3be898a)
- Ubuntu 24.04 LTS
- Python 3.12, all default
- Anthropic API mode, model claude-opus-4-7
- 5 production VPSes, 7.8 GB RAM each