tui_gateway.slash_worker subprocesses leak under dashboard usage (swap-pinning a 7.8GB box at 128 workers) #21467

@ccook1963

Bug: tui_gateway.slash_worker subprocesses leak under dashboard usage

hermes dashboard --tui is documented to use a persistent _SlashWorker subprocess per session — singular, persistent across slash invocations (per AGENTS.md § "Slash Command Flow"). Observed behavior contradicts this: each slash.exec call appears to spawn a fresh tui_gateway.slash_worker subprocess and orphan it.

Workers from sessions that ended hours/days ago are still running. Each holds ~95 MB resident. On a busy multi-user dashboard box this accumulates fast enough to swap-pin the host.

Reproducer

  • hermes dashboard --insecure --no-open --host 0.0.0.0 --port 9119 --tui
  • Multiple long-lived browser dashboard chat sessions (different users)
  • Heavy use of slash commands via the embedded TUI in browser

Observed accumulation (production, 7.8 GB box)

128 stale tui_gateway.slash_worker subprocesses across 5 dashboard chat sessions over ~48 h:

Session 20260505_104123_c8063a — 62 stale workers (oldest from 5/5 ~10:41 AM)
Session 20260506_072606_f58b62 — 30
Session 20260506_082737_2a52e2 — 29
Session 20260506_114839_5a88dd —  4
Session 20260505_163042_6b92fb —  3
                                ───
                         total 128

128 × ~95 MB ≈ 12 GB of resident demand on a 7.8 GB host. Result:

Before SIGTERM:   RAM free  187 MB,  swap used 3.9 GB / 4.0 GB,  slash_workers 128
After  SIGTERM:   RAM free  4.4 GB,  swap used  81 MB,           slash_workers  19
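
The memory arithmetic above, spelled out as a quick check. The ~95 MB per-worker figure and the 7.8 GB host size come from this report; treating 1 GB as 1000 MB is a simplifying assumption.

```python
# Figures taken from this report; 1 GB == 1000 MB is an assumption.
WORKERS = 128
MB_PER_WORKER = 95
HOST_RAM_MB = 7800

total_mb = WORKERS * MB_PER_WORKER          # aggregate demand from stale workers
oversubscription = total_mb / HOST_RAM_MB   # ratio of demand to physical RAM

print(f"{total_mb} MB demanded, {oversubscription:.2f}x physical RAM")
```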

kswapd0 was active, so every keystroke through the websocket → PTY bridge had to wait on kernel paging. Symptom for end users: the dashboard text field is laggy or unresponsive on a box that otherwise has no load.

Re-accumulation rate

Cleared 128 → 19 at ~10:30 AM ET. Re-checked ~4 hours later: back to 46. None of them are reused across slash invocations within the same session; they accumulate rather than deduplicate. This confirms that the documented single persistent worker per session is not what happens in practice.

Session 20260506_082737_2a52e2 — 15 workers
Session 20260506_072606_f58b62 — 14
Session 20260505_104123_c8063a — 14
Session 20260507_142806_ea1f49 —  1   (active session)
                                ───
                          total 44 (+ 2 in active turn)
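
A rough growth estimate from the two counts above (19 after cleanup, 46 about 4 hours later). This assumes linear accumulation and the ~95 MB per-worker figure from this report; real growth depends on how many slash commands users issue.

```python
def accumulation_rate(before, after, hours, mb_per_worker=95):
    """Return (workers per hour, MB per hour), assuming linear growth."""
    workers_per_hour = (after - before) / hours
    return workers_per_hour, workers_per_hour * mb_per_worker


rate, mb_per_hour = accumulation_rate(before=19, after=46, hours=4)
print(f"{rate} workers/hour, about {mb_per_hour} MB/hour")
```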

Each worker is invoked as:

python3 -m tui_gateway.slash_worker --session-key <session_id> --model claude-opus-4-7
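
Because every worker carries its session ID in `--session-key`, the per-session counts above can be reproduced from the process table. A minimal sketch: the helper names `count_workers_by_session` and `snapshot` are made up for illustration, and the regex assumes the command line has exactly the shape shown above.

```python
import re
import subprocess
from collections import Counter

# Matches the invocation shape shown in this report; other shapes are not handled.
SESSION_RE = re.compile(r"tui_gateway\.slash_worker\b.*?--session-key[ =](\S+)")


def count_workers_by_session(ps_output):
    """Count slash_worker processes per session key in `ps` text output."""
    counts = Counter()
    for line in ps_output.splitlines():
        match = SESSION_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def snapshot():
    """Run ps (same columns as the workaround below) and count live workers."""
    out = subprocess.run(
        ["ps", "-eo", "pid,etimes,cmd"],
        capture_output=True, text=True, check=True,
    ).stdout
    return count_workers_by_session(out)
```

Calling `snapshot().most_common()` on an affected box should list the sessions with the most workers first.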

Where to look (per AGENTS.md TUI architecture section)

  • tui_gateway/server.py — slash worker lifecycle / spawn path
  • tui_gateway/__main__.py (or wherever _SlashWorker is constructed)
  • hermes_cli/pty_bridge.py + hermes_cli/web_server.py /api/pty — dashboard side
  • ui-tui/src/* (slash.exec dispatch path)

The cause is most likely:

  • the _SlashWorker is being constructed per-call instead of looked up from a session-scoped registry, and/or
  • an existing registry isn't reaping workers when the session disconnects (no graceful close on websocket teardown)
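
To make the two hypotheses concrete, here is a sketch of what a session-scoped registry with teardown reaping could look like. This is not the actual tui_gateway code: `SlashWorkerRegistry`, `get`, `close`, and `spawn_placeholder` are hypothetical names, and the placeholder child is a sleeping Python process rather than a real slash worker.

```python
import subprocess
import sys
import threading


class SlashWorkerRegistry:
    """Hypothetical registry: at most one live worker process per session key."""

    def __init__(self, spawn):
        self._spawn = spawn      # callable(session_key) -> subprocess.Popen
        self._workers = {}
        self._lock = threading.Lock()

    def get(self, session_key):
        """Reuse the session's worker; spawn only if none exists or it has exited."""
        with self._lock:
            proc = self._workers.get(session_key)
            if proc is None or proc.poll() is not None:
                proc = self._spawn(session_key)
                self._workers[session_key] = proc
            return proc

    def close(self, session_key, timeout=5.0):
        """Reap the session's worker, e.g. from a websocket teardown handler."""
        with self._lock:
            proc = self._workers.pop(session_key, None)
        if proc is None or proc.poll() is not None:
            return
        proc.terminate()                     # SIGTERM first
        try:
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()                      # escalate if the worker ignores SIGTERM
            proc.wait()


def spawn_placeholder(session_key):
    # Stand-in child for illustration only; a real spawner would launch the worker.
    return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
```

With this shape, repeated `get()` calls for one session return the same process, and calling `close()` on disconnect leaves nothing behind. Whether the real code is missing the lookup, the reaping, or both is exactly what needs confirming.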

Cross-fleet observation

The same hermes dashboard --tui is running on 4 other VPSes in our fleet (Finn, Finn2, Jason, Sam; same release). All 4 currently show zero tui_gateway.slash_worker accumulation because their dashboards are open but barely used by humans. So the bug is in the same code path everywhere; only Hermie, the affected box, has the multi-user traffic to expose it. This is consistent with "every slash invocation spawns a fresh worker, and only when the session disconnects do orphans become visible."

Workaround in production

Stopgap kill cron deployed across all 5 of our boxes:

# Kill any tui_gateway.slash_worker process older than 60 minutes (gracefully).
ps -eo pid,etimes,cmd \
  | awk '/tui_gateway\.slash_worker/ && !/awk/ && $2+0 > 3600 {print $1}' \
  | xargs -r kill -TERM

Runs every 30 min. Workers exit cleanly on SIGTERM; in our testing this had no impact on live sessions.
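
For anyone who prefers to run the same stopgap from Python, a rough equivalent of the pipeline above. The function names `stale_worker_pids` and `reap` are made up for this sketch; the selection rule (older than 3600 seconds by `etimes`) mirrors the awk filter.

```python
import os
import signal
import subprocess


def stale_worker_pids(ps_output, max_age_s=3600):
    """Return PIDs of slash_worker processes older than max_age_s seconds."""
    pids = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)          # pid, etimes, full command line
        if len(parts) < 3 or not parts[0].isdigit() or not parts[1].isdigit():
            continue                         # skips the ps header line
        pid, etimes, cmd = int(parts[0]), int(parts[1]), parts[2]
        if "-m tui_gateway.slash_worker" in cmd and etimes > max_age_s:
            pids.append(pid)
    return pids


def reap(max_age_s=3600):
    """Send SIGTERM to every stale worker found in the current process table."""
    out = subprocess.run(
        ["ps", "-eo", "pid,etimes,cmd"],
        capture_output=True, text=True, check=True,
    ).stdout
    for pid in stale_worker_pids(out, max_age_s):
        try:
            os.kill(pid, signal.SIGTERM)
        except ProcessLookupError:
            pass                             # process exited between ps and kill
```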

I'd be happy to clear out our production state more often and capture additional snapshots if useful (process tables, /proc/<pid>/status, etc.). We will keep applying the stopgap until an upstream fix is available.

Environment

  • hermes-agent build: 2026.4.30 + 188 commits (v2026.4.30-188-g5d3be898a)
  • Ubuntu 24.04 LTS
  • Python 3.12, all default
  • Anthropic API mode, model claude-opus-4-7
  • 5 production VPSes, 7.8 GB RAM each

Metadata


Labels

  • P1: High (major feature broken, no workaround)
  • comp/gateway: Gateway runner, session dispatch, delivery
  • comp/tui: Terminal UI (ui-tui/ + tui_gateway/)
  • type/bug: Something isn't working
