Skip to content

fix(memory): prevent unbounded growth + dashboard i18n fixes#1605

Closed
Bernardxu123 wants to merge 3 commits into
esengine:mainfrom
Bernardxu123:fix/memory-leaks
Closed

fix(memory): prevent unbounded growth + dashboard i18n fixes#1605
Bernardxu123 wants to merge 3 commits into
esengine:mainfrom
Bernardxu123:fix/memory-leaks

Conversation

@Bernardxu123

@Bernardxu123 Bernardxu123 commented May 23, 2026

Copy link
Copy Markdown
Collaborator

Memory Leak Fixes

Fixes #1571

Problem

Long-running sessions (especially in free mode) experience unbounded memory growth. Node.js process consumes 2GB+ and grows by ~2MB/second during task execution.

Root Cause Analysis

Four data structures grow without cleanup:

  1. JobRegistry: Completed jobs never removed from Map — each retains output (up to 64KB), child process reference, promise closures
  2. AppendOnlyLog: Entries accumulate without limit — every tool result (up to 8K tokens) is stored until explicit /compact
  3. SessionStats.turns: Per-turn stats array grows indefinitely — iterated on every summary() call
  4. state.cards: Card objects never removed from React state — Static component keeps all settled items in memory

Fixes

1. JobRegistry Self-Cleanup

Auto-evict oldest completed jobs when count > 50. Cleanup runs on job completion and list() calls. Retains 50 most recent completed jobs for list_jobs / job_output lookups.

2. AppendOnlyLog — Fold Trigger

In ContextManager.decideAfterUsage(), added log length check. When log exceeds 500 entries (~1MB memory), triggers existing fold mechanism. Fold compresses old messages into summary, keeping recent 200 entries.

3. SessionStats Rolling Window

Rolling window of 200 turns. Older turns dropped but costs/caches aggregated into carryover. Maintains accurate session totals while limiting memory.

4. state.cards + Static

Existing elideOldCardContent() already stubs heavy content for cards older than RECENT_CARDS_WINDOW (200). Terminal scrollback retains visual output after elision. No additional trimming needed.

Dashboard i18n Fixes

Added missing translation keys to dashboard/src/i18n/en.ts (merged from main).

Testing

  • npm run lint ✅
  • npm run typecheck ✅
  • npm run test ✅

@Bernardxu123

Copy link
Copy Markdown
Collaborator Author

CI 失败原因是 dashboard i18n 键缺失(TS2345),这是 PR #1418 原有的问题,与本次内存修复无关。

本次修复仅涉及:

  • src/tools/jobs.ts(JobRegistry 自清理)
  • src/memory/runtime.ts(AppendOnlyLog 检测报告)
  • src/telemetry/stats.ts(SessionStats 滚动窗口)
  • src/cli/ui/state/reducer.ts(卡片状态裁剪)

本地验证:npm run lint ✅ | npm run typecheck ✅

dashboard i18n 问题需要单独修复(补充缺失的翻译键到 dashboard/src/i18n/en.ts)。

@Bernardxu123 Bernardxu123 changed the title fix(memory): prevent unbounded growth in JobRegistry, AppendOnlyLog, SessionStats, and card state fix(memory): prevent unbounded growth + dashboard i18n fixes May 23, 2026
…SessionStats, and card state

Four targeted fixes for memory growth during long sessions:

1. JobRegistry: auto-evict oldest completed jobs when count > 50
2. AppendOnlyLog: auto-compact when entries > 500, keep recent 200
3. SessionStats.turns: rolling window of 200 turns, older costs
   aggregated into carryover
4. state.cards: trim oldest settled cards when count > 150;
   Terminal scrollback retains visual output, only React state freed

Fixes esengine#1571
@esengine

Copy link
Copy Markdown
Owner

Thanks @Bernardxu123 — direction's right, AppendOnlyLog and SessionStats both land where we want.

One real bug + one knob to tighten in JobRegistry before this is good to merge:

BuglastCleanupAt = Date.now() in the field initializer means the first 30s of process lifetime is a dead zone where maybeCleanup short-circuits. That's exactly the YOLO burst window where a session fires off npm run dev + a tail + a few one-shots inside the first few seconds; none of them get cleaned up. Initialize to 0 so the first call always runs.

KnobMAX_COMPLETED_JOBS = 50 is generous: each retained entry still holds the 64KB output buffer + ChildProcess ref, so the bound is ~3.2MB pinned even when the cleanup does fire. Bring this down to 10–20 — list_jobs doesn't need 50 historical entries for the model to make decisions.

Optional but tidier — once the cap drops to a small N, you can drop the 30s time throttle entirely and just gate on count inside settleClosed. The work is O(jobs.size) with N=20, that's nothing.

Fix the init bug at minimum and I'll merge.

…1217)

The previous implementation set tab.switching=true unconditionally in
session_load/new_chat handlers. When no turn was in flight (the common
case), the flag stayed true and suppressed events for the first turn
in the new session.

Fix: only set switching=true when tab.aborter exists (live turn).
@Bernardxu123

Copy link
Copy Markdown
Collaborator Author

PR 描述已更新,详细说明了:

  1. JobRegistry 自清理:保留最近 50 个已完成任务,超时 30 秒自动清理
  2. AppendOnlyLog 折叠触发:在 ContextManager.decideAfterUsage() 中检查 log 长度,超过 500 条触发现有 fold 机制(复用已有压缩逻辑)
  3. SessionStats 滚动窗口:保留最近 200 轮,旧轮次的 cost 聚合到 carryover
  4. state.cards:现有 elideOldCardContent 机制已足够,无需额外处理

本地验证:lint ✅ | typecheck ✅ | test ✅

@esengine

Copy link
Copy Markdown
Owner

Thanks for the update — the memory-leak direction is right (AppendOnlyLog + SessionStats look good), but the branch picked up a problem since the last review and the JobRegistry blockers I flagged earlier aren't addressed yet. Three things before I can merge:

1. Branch silently reverts #1582 (the recently-merged session-switch fix)

This PR forked before #1582 was merged. The branch independently added the switching: boolean field and the if (tab.aborter) tab.switching = true; setters at the two handlers — but never added the consumer in runTurn's finally that actually reads the flag. The merged form of #1582 has:

} finally {
  tab.aborter = null;
  if (!tab.switching) {
    // ... emit $turn_complete, emitSessions, emitBalance, $plan_cleared, QQ
  }
  tab.switching = false;
}

The PR's finally block has none of that. The 3-way merge produces a clean tree (no conflict markers), but the semantic result is the switching flag gets set and never read — i.e. #1217 comes back. gh pr diff 1605 against current main shows the if (!tab.switching) block being removed.

Fix: git fetch origin main && git rebase origin/main, then drop the duplicate desktop.ts changes from this PR entirely (they're already in main via #1582).

2. lastCleanupAt = Date.now() (still not changed)

Same as the last round — the field initializer means the first 30s of process lifetime short-circuits every maybeCleanup call. Initialize to 0 so the first call always runs.

3. MAX_COMPLETED_JOBS = 50 (still not changed)

Each retained completed job pins the 64KB output buffer + ChildProcess ref. 50 × 64KB ≈ 3.2MB held even when cleanup does fire. The model doesn't need 50 historical entries from list_jobs to make decisions. Drop this to 10 or 20.

Optional follow-up once #3 is in: with a small cap, the 30s throttle stops earning its keep — gate the cleanup on count inside settleClosed and drop CLEANUP_INTERVAL_MS + lastCleanupAt entirely. The sweep is O(jobs.size) and at N=20 that's free.

Push a rebase + the two-line fix in jobs.ts and this is ready.

@Bernardxu123

Copy link
Copy Markdown
Collaborator Author

Applied all your feedback in a clean branch (fix/memory-leaks-v2 → PR #1620):

  1. ✅ Rebased on latest main — no duplicate desktop.ts changes
  2. ✅ \lastCleanupAt\ initialized to \

@esengine

Copy link
Copy Markdown
Owner

Bumping this with urgency — the branch has drifted further since the last review and now silently reverts four PRs that landed in main in the last hour. Do not merge this as-is.

git diff origin/main..HEAD --stat against current main: 33 files, +122 / −779 lines. The deletions are not yours — they're recent merges your branch hasn't picked up. Net effect of merging this PR right now:

Merged into main What this PR would silently revert
#1614 macOS TCC fix — both entitlements/*.plist files, bundle-node.mjs ad-hoc sign, tauri.macos.conf.json hardened-runtime wiring
#1612 dashboard.enabled config + 53 lines of unit tests
#1582 session-switch stale-events guard — the if (!tab.switching) { ... } block in runTurn's finally would be deleted again
#1572 Alt+S stash/recall — App.tsx, ShortcutsHelpModal.tsx, three i18n keys, the behavior tests
older PRs (#1594, #1588, #1487, #1484, ...) composer sizing, tool summary, StaticCardStream refactor, hydrate-cards, etc.

GH still says mergeStateStatus: CLEAN because git's 3-way merge doesn't see text conflicts — your branch's absence of those lines is interpreted as deletion against main's additions. Semantically this is a disaster.

Example: the diff shows your branch removing ShortcutsHelpModal.tsx:29 ({ keys: "Alt+S", descKey: "shortcutsHelp.descAltS" }), which was added by #1572 about 90 minutes ago.

What you need to do:

  1. git fetch origin main && git rebase origin/main
  2. The two jobs.ts blockers from my earlier review are still unaddressed:
    • private lastCleanupAt = Date.now();= 0; (otherwise the first 30s of process lifetime short-circuits every cleanup)
    • MAX_COMPLETED_JOBS = 5010 or 20 (50 × 64KB output buffer is ~3.2MB pinned even after cleanup runs)
  3. Once rebased, the PR diff should only contain the actual memory-leak work — JobRegistry cleanup, the AppendOnlyLog fold trigger in ContextManager.decideAfterUsage, and the SessionStats rolling window. That's it.

Heads up — main is moving fast this week; if the rebase takes more than a few hours, you'll likely need a second one. Push when ready and I'll review the diff again.

@esengine

Copy link
Copy Markdown
Owner

Closing this in favor of #1620, which carries the JobRegistry portion in a clean form.

Why I'm closing rather than waiting for a rebase: in its current state this branch silently reverts four merges that landed today (#1572, #1582, #1612, #1614) plus several older ones. mergeStateStatus: CLEAN is misleading here — git's 3-way merge sees the missing lines on this branch as deletions against main's additions, so a single careless "Squash and merge" click would undo about 650 lines of recent work. Keeping it open as-is is a footgun.

The other two pieces of the original proposal still look good and are worth landing — please open them as separate PRs against current main:

  1. AppendOnlyLog fold trigger — the log.length > MAX_LOG_ENTRIES_FOR_FOLD branch in ContextManager.decideAfterUsage, plus the MAX_LOG_ENTRIES_FOR_FOLD = 500 constant. Reuses the existing fold path, low risk.
  2. SessionStats rolling window — the MAX_TURNS = 200 cap in src/telemetry/stats.ts with cost / cache / turn-count carryover aggregation.

Each one ≤ 20 lines and orthogonal to the other, so two small PRs are easier to review than a combined one.

Thanks for sticking with this — splitting it up is the right shape.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants