TUI Native Memory Leak - RSS grows to 13+ GB after ~40 min active usage #15141

@eggressive

Summary

Since the Apr 23–24 update (commits bd929ea5 and 67bfd4b8), the TUI frontend (Node.js Ink renderer) leaks native memory at an alarming rate. The JS heap stays bounded at ~2.7 GB while RSS climbs to 13+ GB, causing the process to freeze and get killed via SIGTERM within ~1 hour of active streaming.

The gateway backend remains healthy (~98 MB) throughout. Auto-heap-dump triggers are miscalibrated because they measure JS heapUsed, which is much smaller than the actual RSS leak.

Environment

  • Hermes version: v0.11.0 / commit 34c3e671 (Apr 24 hotfix)
  • Base commit: bf196a3f (v0.11.0 tag)
  • OS: Fedora Linux 41 (Wayland, KDE)
  • Node.js: v22.22.0
  • Display: TUI (hermes --tui)
  • Config defaults: thinking: expanded, tools: expanded, activity: hidden (from 67bfd4b8)

Reproduction Steps

  1. Start Hermes TUI:
    hermes --tui
  2. Engage in normal streaming conversation with tool calls and reasoning blocks
  3. Leave thinking and tools sections expanded (default since Apr 24)
  4. Observe RSS every 5 s:
    watch -n 5 'ps -o pid,rss,vsz,comm -p $(pgrep -f "ui-tui/dist/entry.js")'

Expected Behavior

RSS should stay under ~1 GB regardless of session length. Occasional bumps during large streaming payloads are acceptable, but RSS should be stable between turns.

Actual Behavior

| Phase | Time | RSS (MB) | Notes |
|---|---|---|---|
| Start | t+0 | ~157 | Baseline |
| Idle/light | ~10 min | ~247 | Slow growth |
| Active streaming | ~20–40 min | ~525 → 6,066 | Accelerating |
| Peak | ~52 min | 13,978 | Process unresponsive |
| Crash | ~53 min | | SIGTERM, auto-restart with new PID |

Two confirmed crash cycles (same day)

| PID | Start | Peak RSS | Duration before crash |
|---|---|---|---|
| 58836 (morning) | | ~9.4 GB | ~20 min |
| 76262 (afternoon) | 14:03 | 13.978 GB | ~53 min |

Diagnostic Evidence

Heap dump .diagnostics.json at peak (auto-critical)

{
  "memoryUsage": {
    "arrayBuffers": 768803,
    "external": 21238571,
    "heapTotal": 2798071808,
    "heapUsed": 2728422704,
    "rss": 9572970496
  },
  "memoryGrowthRate": {
    "mbPerHour": 17684.5
  }
}

RSS (~9.6 GB) far exceeds heapUsed (~2.7 GB), so most of the resident memory sits outside the V8 heap. This points to a native leak rather than a JS one.
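To make the gap concrete, here is a rough sketch that subtracts the V8-tracked figures from RSS using the values in the dump above. The field names match Node's process.memoryUsage(); the helper name estimateUntrackedBytes is made up for illustration, and the result is only an approximation because not every committed heap page is necessarily resident.

```typescript
// Field names follow Node's process.memoryUsage(); all values are bytes.
interface MemoryUsage {
  rss: number;
  heapTotal: number;
  heapUsed: number;
  external: number;
  arrayBuffers: number;
}

// Hypothetical helper: resident bytes not explained by the V8 heap or by
// V8-tracked external allocations. `external` already includes arrayBuffers,
// so arrayBuffers is not subtracted a second time.
function estimateUntrackedBytes(m: MemoryUsage): number {
  return Math.max(0, m.rss - m.heapTotal - m.external);
}

// Values copied from the .diagnostics.json excerpt in this issue.
const dump: MemoryUsage = {
  arrayBuffers: 768803,
  external: 21238571,
  heapTotal: 2798071808,
  heapUsed: 2728422704,
  rss: 9572970496,
};

const untracked = estimateUntrackedBytes(dump); // 6753660117 bytes, about 6.75 GB
const shareOfRss = untracked / dump.rss;        // about 0.71
```

By this estimate roughly 70% of the resident set at dump time is invisible to heapUsed, heapTotal, and external.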

Process comparison at peak

PID     RSS      COMMAND
76262  13,978 MB  node /.../ui-tui/dist/entry.js   ← leaking
58744      98 MB  python -m hermes_cli.main gateway run  ← stable

Suspected Root Cause

Primary suspect: bd929ea5 — Ink text measurement cache

perf(ink): cache text measurements across yoga flex re-passes

File: ui-tui/packages/hermes-ink/src/ink/dom.ts

The commit added _textMeasureCache to ink-text DOM elements, keyed by ${width}|${widthMode}. The cache itself is bounded to 16 entries per node (FIFO eviction), but the underlying Yoga layout system keeps its state in C++ compiled to WASM. When the Ink reconciler tears down a subtree via freeRecursive() / clearYogaNodeReferences(), it nulls the JS references but may leave the following allocated:

  • WASM text measurement buffers
  • Yoga layout node C++ instances
  • Cache generation counter objects that hold references

Each streaming update triggers markDirty() on expanded sections (default since 67bfd4b8), causing Yoga to re-measure. With continuous thinking + tools streaming, this becomes a fast leak.
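For readers unfamiliar with the cache shape described above, the following is a minimal sketch of a 16-entry FIFO cache keyed by `${width}|${widthMode}`. It is not the code from bd929ea5: the class name, the Size type, and the numeric WidthMode are assumptions made for illustration.

```typescript
// Hypothetical stand-in for the per-node cache; not the actual dom.ts code.
type WidthMode = 0 | 1 | 2; // assumption: a small numeric measure-mode enum
interface Size {
  width: number;
  height: number;
}

class TextMeasureCache {
  private readonly entries = new Map<string, Size>();
  private readonly maxEntries: number;

  constructor(maxEntries = 16) {
    this.maxEntries = maxEntries;
  }

  private static key(width: number, mode: WidthMode): string {
    return `${width}|${mode}`;
  }

  get(width: number, mode: WidthMode): Size | undefined {
    return this.entries.get(TextMeasureCache.key(width, mode));
  }

  set(width: number, mode: WidthMode, size: Size): void {
    const k = TextMeasureCache.key(width, mode);
    if (!this.entries.has(k) && this.entries.size >= this.maxEntries) {
      // A Map iterates in insertion order, so the first key is the oldest.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(k, size);
  }

  get size(): number {
    return this.entries.size;
  }
}
```

A cache of this shape holds at most 16 small JS objects per node, so on its own it would show up in heapUsed rather than in RSS. That is consistent with the suspicion that the growth comes from native state that outlives the JS references, not from the Map itself.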

Amplifier: 67bfd4b8 - expanded sections by default

From Apr 24, thinking: expanded and tools: expanded dramatically increase the number of Yoga measure/re-layout cycles per frame compared to the previous collapsed-by-default UI.

Additional Context

Heap dump misfire

memoryMonitor.ts triggers on JS heapUsed (high = 1.5 GB, critical = 2.5 GB). Because this leak is native, the process can climb to 13+ GB RSS while the JS heap sits at 2.7 GB. The monitor then repeatedly writes 2.5 GB .heapsnapshot files that have no diagnostic value for this bug, and disk usage in ~/.hermes/heapdumps/ grows to 24+ GB.
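One way the monitor could tell the two situations apart is sketched below. This is not the existing memoryMonitor.ts logic. The heapUsed thresholds are the ones quoted above (interpreted as GiB, which is an assumption), the RSS thresholds are hypothetical (the lower one borrows the ~1 GB ceiling from Expected Behavior), and the "RSS more than twice heapTotal" heuristic is likewise an illustrative choice.

```typescript
type Level = "ok" | "high" | "critical";

interface Thresholds {
  high: number;
  critical: number;
}

const GiB = 1024 ** 3;
// heapUsed thresholds as quoted in this issue.
const HEAP_LIMITS: Thresholds = { high: 1.5 * GiB, critical: 2.5 * GiB };
// Hypothetical RSS thresholds; tune to the real workload.
const RSS_LIMITS: Thresholds = { high: 1 * GiB, critical: 4 * GiB };

function level(value: number, t: Thresholds): Level {
  if (value >= t.critical) return "critical";
  if (value >= t.high) return "high";
  return "ok";
}

interface Verdict {
  heap: Level;
  rss: Level;
  nativeSuspected: boolean;    // RSS is elevated and the heap cannot explain it
  heapSnapshotUseful: boolean; // a .heapsnapshot only helps for JS-side growth
}

function classify(m: { rss: number; heapUsed: number; heapTotal: number }): Verdict {
  const heap = level(m.heapUsed, HEAP_LIMITS);
  const rss = level(m.rss, RSS_LIMITS);
  // Illustrative heuristic: treat growth as native when RSS is more than
  // twice the committed V8 heap.
  const nativeSuspected = rss !== "ok" && m.rss > 2 * m.heapTotal;
  const heapSnapshotUseful = heap !== "ok" && !nativeSuspected;
  return { heap, rss, nativeSuspected, heapSnapshotUseful };
}
```

Calling classify(process.memoryUsage()) on the dump values from this issue yields a critical RSS level with nativeSuspected set, so a monitor built this way could skip the 2.5 GB snapshot and log the RSS figures instead.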

Gateway unaffected, crash log confirms TUI death

~/.hermes/logs/tui_gateway_crash.log shows the Python bridge still alive in its sys.stdin loop when SIGTERM was delivered. The Node parent dies first, leaving the Python subprocess orphaned.

Related PRs checked

Possible Fixes (for discussion)

  1. Investigate clearYogaNodeReferences — ensure all WASM nodes are explicitly freed before nulling. Check if yoga-layout WASM bindings need explicit free() calls.
  2. Invalidate _textMeasureCache before clearYogaNodeReferences — the cache is cleared in clearYogaNodeReferences via _textMeasureCache = undefined, but if the Map retains entries referenced by the WASM side, this doesn't help.
  3. Cap _textMeasureCache.entries growth — already 16 entries, but keyed by ${width}|${widthMode}. If width probes are sparse, the cache may churn without actually hitting. Consider a global/shared cache with TTL.
  4. Monitor RSS in memoryMonitor.ts — add rss alongside heapUsed to detect native leaks earlier.
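As a starting point for fixes 1 and 2, the teardown order could look roughly like the sketch below: clear the cache, detach the measure callback, free the native node, and only then drop the JS reference. The interfaces are hypothetical. The method names unsetMeasureFunc() and free() mirror the yoga-layout JavaScript bindings, but whether hermes-ink's DOM nodes expose Yoga this way is an assumption, and releaseSubtree is an invented name.

```typescript
// Hypothetical minimal view of a native layout node.
interface NativeLayoutNode {
  unsetMeasureFunc(): void;
  free(): void;
}

// Hypothetical DOM node shape; the real one lives in dom.ts.
interface DomNode {
  yogaNode: NativeLayoutNode | undefined;
  _textMeasureCache: Map<string, unknown> | undefined;
  childNodes: DomNode[];
}

// Releases native state bottom-up and returns how many native nodes were
// freed, which can be compared against the number created to spot leaks.
function releaseSubtree(node: DomNode): number {
  let freed = 0;
  for (const child of node.childNodes) {
    freed += releaseSubtree(child);
  }

  // Drop cached measurements before touching the native node.
  node._textMeasureCache?.clear();
  node._textMeasureCache = undefined;

  const native = node.yogaNode;
  if (native !== undefined) {
    native.unsetMeasureFunc(); // detach the JS callback held by the native side
    native.free();             // release the native allocation explicitly
    node.yogaNode = undefined; // null the JS reference last
    freed += 1;
  }
  return freed;
}
```

Returning a count is a deliberate choice here: logging created versus freed nodes per teardown would show quickly whether native nodes are outliving their DOM elements.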

Data Available

Full monitor log at ~/.hermes/logs/tui-rss-monitor.log:

  • 30-second RSS samples across two crash cycles
  • Format: pid,time,rss_kb,rss_mb,vsz_kb,command
  • Captures transition from PID 76262 (peak 13,978 MB) to new PID 81703
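For anyone who wants to reproduce the growth figures from this log, a small parser along these lines should be enough. It assumes one sample per line in the column order listed above, and it assumes the time column is a timestamp that Date.parse accepts (for example ISO 8601). The actual timestamp format in the log is not stated here, so that part may need adjusting. The function names are illustrative.

```typescript
interface Sample {
  pid: number;
  timeMs: number;
  rssKb: number;
}

// Parses "pid,time,rss_kb,rss_mb,vsz_kb,command"; returns null for headers
// or malformed lines. Only the first three columns are needed here.
function parseLine(line: string): Sample | null {
  const parts = line.trim().split(",");
  if (parts.length < 6) return null;
  const [pidStr = "", time = "", rssStr = ""] = parts;
  const pid = Number(pidStr);
  const timeMs = Date.parse(time); // assumption: a Date.parse-compatible timestamp
  const rssKb = Number(rssStr);
  if (!Number.isFinite(pid) || Number.isNaN(timeMs) || !Number.isFinite(rssKb)) {
    return null;
  }
  return { pid, timeMs, rssKb };
}

// Average RSS growth for one PID between its first and last sample, in MB/h.
function growthMbPerHour(lines: string[], pid: number): number | null {
  const samples = lines
    .map(parseLine)
    .filter((s): s is Sample => s !== null && s.pid === pid);
  const first = samples[0];
  const last = samples[samples.length - 1];
  if (samples.length < 2 || first === undefined || last === undefined) return null;
  const hours = (last.timeMs - first.timeMs) / 3_600_000;
  if (hours <= 0) return null;
  return (last.rssKb - first.rssKb) / 1024 / hours;
}
```

Filtering by PID matters because the log spans the restart from 76262 to 81703; mixing the two processes would produce a meaningless negative slope at the handover.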

Checklist

  • Confirmed on latest main (34c3e671)
  • Confirmed with default config (no custom sections overrides)
  • Reproduced in normal usage (not a stress test)
  • Gateway unaffected

Labels

  • P1: High, major feature broken, no workaround
  • comp/tui: Terminal UI (ui-tui/ + tui_gateway/)
  • type/bug: Something isn't working
