fix(core): prevent OOM by compacting API history, UI history, and triggering under memory pressure #4824
…l mode microcompactHistory was gated behind UserQuery || Cron message types. Goal-mode loops use SendMessageType.Hook, so tool outputs accumulated indefinitely without ever being truncated, causing old_space OOM. Moving the call outside the type guard lets it run for all message types including Hook, ensuring tool results are compacted based on the idle time threshold regardless of message type.
When V8 heap usage reaches the hard threshold (65% of heap limit), the MemoryPressureMonitor now calls microcompactHistory with a forced zero-minute idle threshold, replacing old tool-result content with [Old tool result content cleared]. This frees heap memory directly, unlike the existing evict_cold_cache step which only touches the FileReadCache. The compact_history step is added to the hard threshold cleanup pipeline, preserving recent tool results (configurable via toolResultsNumToKeep, default 5).
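The two commit messages above can be illustrated with a minimal sketch. It assumes a simplified history shape and ignores the idle-time threshold entirely, which matches the forced zero-minute case. `HistoryEntry`, `shouldCompact`, and `microcompact` are hypothetical names for illustration, not the PR's actual code, and the `ToolResult` exclusion reflects the later state of this PR rather than these first commits.

```typescript
// Hypothetical, simplified shapes; the real history types in the PR are richer.
type MessageType = 'UserQuery' | 'Cron' | 'Hook' | 'ToolResult';

interface HistoryEntry {
  role: 'user' | 'model' | 'tool';
  content: string;
}

const CLEARED_PLACEHOLDER = '[Old tool result content cleared]';

// Gate on message type: everything except ToolResult may compact, so
// goal-mode Hook messages are no longer skipped.
function shouldCompact(type: MessageType): boolean {
  return type !== 'ToolResult';
}

// Replace the content of all but the most recent `numToKeep` tool results.
// Returns the same array reference when nothing needs to change.
function microcompact(history: HistoryEntry[], numToKeep = 5): HistoryEntry[] {
  const toolIndexes = history
    .map((entry, index) => (entry.role === 'tool' ? index : -1))
    .filter((index) => index >= 0);
  const staleCount = Math.max(0, toolIndexes.length - numToKeep);
  const stale = new Set(toolIndexes.slice(0, staleCount));

  let changed = false;
  const next = history.map((entry, index) => {
    if (!stale.has(index) || entry.content === CLEARED_PLACEHOLDER) {
      return entry;
    }
    changed = true;
    return { ...entry, content: CLEARED_PLACEHOLDER };
  });
  return changed ? next : history;
}
```

Returning the original reference on a no-op keeps repeated calls cheap and makes idempotency easy to assert in tests.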
…eshold to prevent OOM
@yiliang114 If I can get through a 10-hour goal session without V8 heap OOM, then we can revisit the threshold. The 5GB |
…adCache after compact_history Co-authored-by: Shaojin Wen <shaojin.wensj@alibaba-inc.com>
…recency guard for tool_group Co-authored-by: Shaojin Wen <shaojin.wensj@alibaba-inc.com>
…nders and allow repeat triggers Co-authored-by: Shaojin Wen <shaojin.wensj@alibaba-inc.com>
…educe V8 heap pressure Co-authored-by: Shaojin Wen <shaojin.wensj@alibaba-inc.com>
…and type assertions
Thanks for the review! @wenshao All R1 comments addressed:
The original version hit OOM after 2h20m, UI compaction only fired once due to the one-shot |
wenshao left a comment
Test coverage gaps — the three new OOM-prevention mechanisms lack dedicated tests:
- `useMemoryMonitor`: no test that `compactOldItems` is called when `heapUsed` exceeds 5 GB, no test for the 5-minute cooldown, no test for the interaction between the warning interval and the compaction check (see finding below about `clearInterval` killing both).
- `memoryPressureMonitor.compact_history`: no test for the new cleanup step (client-not-initialized, successful compaction, nothing-to-compact, exception paths).
- `client.ts`: no test that `SendMessageType.Hook` triggers `microcompactHistory`, the primary fix for goal-mode OOM.
— qwen3.7-max via Qwen Code /review
…terval Co-authored-by: Shaojin Wen <shaojin.wensj@alibaba-inc.com>
…ToolResult to avoid O(history) per tool call Co-authored-by: Shaojin Wen <shaojin.wensj@alibaba-inc.com>
…instead of blanking fields Co-authored-by: Shaojin Wen <shaojin.wensj@alibaba-inc.com>
…ure steps, fix async test drain Co-authored-by: Shaojin Wen <shaojin.wensj@alibaba-inc.com>
… compact_history step, Hook/ToolResult gating) Co-authored-by: Shaojin Wen <shaojin.wensj@alibaba-inc.com>
Thanks for the review! @wenshao All R2 comments addressed:
One thing from a 2h test run under the R1 version: compaction was working, UI fired 4 times, API fired 90 times, objects were being freed. But V8 GC wasn't reclaiming the space fast enough, heap kept climbing from 5.4GB to 8GB. Turns out we were only cleaning up but not forcing GC. Set |
wenshao left a comment
All R1/R2 issues from prior rounds have been addressed. Build passes, 226 tests pass, tsc/eslint clean.
Needs Human Review (low-confidence findings — terminal only):
- `memoryPressureMonitor.ts:519`: `compact_history` calls `microcompactHistory` + `setHistory` but relies on the subsequent `clear_file_cache` step to blanket-wipe the fileReadCache. Works today due to step ordering, but the coupling is implicit and fragile.
- `useMemoryMonitor.ts:15`: `MEMORY_UI_COMPACT_THRESHOLD` is a fixed 5 GB absolute value. The default V8 heap limit is ~4 GB; on machines where the CLI relaunches with less than 10 GB heap, this threshold is unreachable. The server-side monitor uses ratio-based thresholds (0.5/0.65/0.8); consider consistency.
- Test coverage gaps:
  - `compactOldItems` lacks tests for mixed-type history (interleaved thoughts + tool_groups), non-compactable `resultDisplay` types (`TodoResultDisplay`, `AnsiOutputDisplay`), and `gemini_thought` type (only `gemini_thought_content` tested).
  - `shouldCompact` gate lacks `Retry`/`Notification` exclusion tests.
  - `compact_history` step never exercises the `meta`-present path (`setHistory` call unverified).
— qwen3.7-max via Qwen Code /review
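The ratio-based approach raised in the review above could look roughly like the sketch below. The ratios 0.5/0.65/0.8 come from the comment; the level names other than `hard`, and both function names, are assumptions for illustration. The heap limit is passed in as a parameter; in Node.js it would typically come from `v8.getHeapStatistics().heap_size_limit`.

```typescript
// Only the 'hard' level (65%) is named in this thread; 'soft' and 'critical'
// are hypothetical labels for the 0.5 and 0.8 ratios.
type PressureLevel = 'none' | 'soft' | 'hard' | 'critical';

const RATIOS = { soft: 0.5, hard: 0.65, critical: 0.8 } as const;
const GIB = 1024 * 1024 * 1024;

// Classify heap pressure relative to the current heap limit, so the
// thresholds scale with whatever limit the process was started with.
function classifyPressure(
  usedHeapBytes: number,
  heapLimitBytes: number,
): PressureLevel {
  if (heapLimitBytes <= 0) return 'none';
  const ratio = usedHeapBytes / heapLimitBytes;
  if (ratio >= RATIOS.critical) return 'critical';
  if (ratio >= RATIOS.hard) return 'hard';
  if (ratio >= RATIOS.soft) return 'soft';
  return 'none';
}

// A fixed absolute threshold can only fire if it is below the heap limit.
function isFixedThresholdReachable(
  fixedThresholdBytes: number,
  heapLimitBytes: number,
): boolean {
  return fixedThresholdBytes < heapLimitBytes;
}
```

With a ~4 GiB heap limit, `isFixedThresholdReachable(5 * GIB, 4 * GIB)` is false, which is the unreachable-threshold concern in the comment.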
… microcompact stats bug, and decouple fileReadCache clearing
- Broaden hasOldOutput check to `!= null` instead of only string/fileDiff
- Fix toolsKept/mediaKept counting already-cleared items as kept
- Add explicit fileReadCache.clear() in compact_history step
- Optimize useMemoryMonitor to reuse Date.now() call
- Add test for mixed-type history (interleaved thoughts + tool_groups)
- Add test for gemini_thought type (not just gemini_thought_content)
- Add test for non-string resultDisplay types (TodoResultDisplay, AnsiOutputDisplay, AgentResultDisplay)
- Add test for null resultDisplay (should not compact)
- Add test for non-compactable types (Retry, Notification, user, gemini)
Co-authored-by: Shaojin Wen <shaojin.wensj@alibaba-inc.com>
@wenshao Appreciate the follow-up review! R3 feedback addressed:
All of the above strategies help slow down heap growth. From our test runs, the compaction triggered multiple times and reduced heap by 40 to 150MB each time. RSS stays high because V8 doesn't immediately return memory to the OS, but the heap pressure is clearly relieved. |
…ING_THRESHOLD Hardcoded "10.50 GB" failed on macOS CI where system RAM differs, causing MEMORY_WARNING_THRESHOLD to be 85% of RAM instead of 7 GB. Compute the expected text from the actual threshold value.
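A minimal sketch of the idea in this commit: derive the expected text from the computed threshold rather than hardcoding it. It assumes the `min(7 GB, 85% of system RAM)` rule from the PR description and treats 1 GB as 1024³ bytes. `warningThresholdBytes` and `formatGb` are hypothetical helpers, and total RAM is passed in (in Node.js it would typically come from `os.totalmem()`).

```typescript
const GIB = 1024 * 1024 * 1024;

// min(7 GB, 85% of system RAM): low-RAM machines get a lower threshold.
function warningThresholdBytes(totalRamBytes: number): number {
  return Math.min(7 * GIB, 0.85 * totalRamBytes);
}

// Format a byte count the way a test expectation might render it.
// The two-decimal "GB" format is an assumption for illustration.
function formatGb(bytes: number): string {
  return `${(bytes / GIB).toFixed(2)} GB`;
}

// In a test, build the expected string from the same computation instead
// of a literal that only holds on one machine's RAM size.
function expectedThresholdText(totalRamBytes: number): string {
  return formatGb(warningThresholdBytes(totalRamBytes));
}
```

On a 16 GiB machine the 7 GB cap applies; on a 4 GiB machine the 85% branch wins, which is why a hardcoded expectation breaks across CI hosts.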
Thanks for the review! @wenshao R6/R7 feedback addressed:
wenshao left a comment
R8 review at 3a98d88 (qwen3.7-max --comment). tsc/eslint clean, 423 tests pass, CI all_pass. 0 Critical, 2 Suggestion inline, 2 low-confidence terminal.
Additional finding (overlap with existing R7 comment at memoryPressureMonitor.ts:519):
The toolResultsThresholdMinutes override logic (positive→0, negative→pass-through, undefined→0) has no dedicated test verifying each branch. The existing compact_history happy-path test exercises the default value (5→0), but doesn't test the explicit -1 pass-through or a positive-value override. Consider adding test cases for: (a) toolResultsThresholdMinutes: 60 → overridden to 0, (b) toolResultsThresholdMinutes: -1 → preserved as -1 (compaction skipped), (c) toolResultsThresholdMinutes: undefined → defaults to 0.
— qwen3.7-max via Qwen Code /review
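The three branches listed in the review above (positive→0, negative→pass-through, undefined→0) can be captured in one small function. This is a sketch of the described behaviour only; `resolveForcedIdleThreshold` is a hypothetical name and the real logic lives inside the `compact_history` step.

```typescript
// Under memory pressure the idle threshold is forced to 0 minutes so that
// compaction runs immediately. A negative configured value is preserved,
// which the review describes as "compaction skipped".
function resolveForcedIdleThreshold(configured: number | undefined): number {
  if (configured === undefined) return 0;
  return configured < 0 ? configured : 0;
}

// Table of the cases the review asks to cover.
const overrideCases: Array<{ input: number | undefined; expected: number }> = [
  { input: 60, expected: 0 }, // (a) positive value is overridden
  { input: -1, expected: -1 }, // (b) negative value passes through
  { input: undefined, expected: 0 }, // (c) missing value defaults to 0
];
```

A table like `overrideCases` maps directly onto a parameterized test, one row per branch.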
- Wrap compactOldItems() in try/catch inside setInterval callback to prevent uncaughtException from crashing the CLI.
- Wrap microcompactHistory() in try/catch inside sendMessageStream so a compaction failure degrades gracefully instead of aborting the agent loop (critical for goal-mode Hook messages).
- Add UI_COMPACT_CLEARED_MESSAGE guard to hasOldOutput check in compactOldItems, preventing spurious re-renders when re-compacting already-cleared tool groups.
Co-authored-by: Shaojin Wen <shaojin.wensj@alibaba-inc.com>
Thought subjects can contain sensitive inferences about the user's codebase. Log only the length to avoid persisting verbatim content to debug log files that may be attached to bug reports. Co-authored-by: Shaojin Wen <shaojin.wensj@alibaba-inc.com>
Add clearContextOnIdle.toolResultsThresholdMinutes to createMockConfig type definition to fix TS2353, and add 3 tests covering the override logic (positive→0, negative preserved, undefined→0).
Local Verification Report
PR: #4824
Test Results
Pre-existing Failure (not from this PR)
Verdict: ✅ PASS. All PR-specific tests pass. No new test failures, no new typecheck errors, no lint issues.
Verified locally by wenshao
…mpaction try/catch verify that microcompactHistory() and compactOldItems() exceptions are caught and logged without crashing the host loop.
- client.test.ts: mock microcompactHistory to throw, verify debugLogger.error is called and sendMessage completes normally.
- useMemoryMonitor.test.ts: mock compactOldItems to throw on first call, verify error is logged and subsequent interval ticks still trigger compaction after cooldown.
Co-authored-by: Shaojin Wen <shaojin.wensj@alibaba-inc.com>
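The behaviour these tests check (a throwing compaction is logged, and later ticks still compact once the cooldown has passed) can be sketched as a tick callback with injected dependencies. `MonitorDeps` and `createCompactionTick` are hypothetical names; the 5-minute cooldown is the figure mentioned earlier in the review, and the real hook wires this into `setInterval`.

```typescript
interface MonitorDeps {
  getHeapUsed: () => number;
  getThreshold: () => number;
  compactOldItems: () => void;
  logError: (message: string, error: unknown) => void;
  now: () => number;
}

const COOLDOWN_MS = 5 * 60 * 1000;

// Build the callback that an interval would invoke. Errors from
// compactOldItems are caught so they never escape the timer callback.
function createCompactionTick(deps: MonitorDeps): () => void {
  let lastAttemptAt = Number.NEGATIVE_INFINITY;
  return () => {
    const now = deps.now();
    if (deps.getHeapUsed() <= deps.getThreshold()) return;
    if (now - lastAttemptAt < COOLDOWN_MS) return;
    // The cooldown starts at the attempt, whether or not it succeeds.
    lastAttemptAt = now;
    try {
      deps.compactOldItems();
    } catch (error) {
      deps.logError('UI compaction failed', error);
    }
  };
}

// Usage (interval period is an assumption):
// const timer = setInterval(createCompactionTick(deps), 30_000);
```

Injecting `now` and the heap readers keeps the callback testable without fake timers.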
Thanks for the review and approval! @wenshao Summary of commits since last round: |
Co-authored-by: Shaojin Wen <shaojin.wensj@alibaba-inc.com>
@wenshao Appreciate the thorough review! R10 feedback addressed:
    // Should not compact non-compactable types
    expect(result.current.history).toBe(before);
  });
});
[Suggestion] The compactOldItems counting fix (commit 595701096) correctly aligns totalToolGroupsWithOutput with the map-phase predicate, but no regression test covers the specific scenario it fixes — a history containing already-compacted tool groups (where resultDisplay === UI_COMPACT_CLEARED_MESSAGE).
Without this test, a future refactor that drops the !== UI_COMPACT_CLEARED_MESSAGE guard would silently re-introduce the over-compaction bug from R10.
Suggested test:
it('should not re-compact already-compacted tool groups (idempotent)', () => {
  // 15 already-compacted + 15 fresh tool_groups = 30 total
  // → totalToolGroupsWithOutput = 15, toolGroupsToCompact = 0
  // → no additional compaction occurs
  // Call compactOldItems() a second time → same reference returned (no-op)
});
Two additional untested scenarios: (1) all tool groups already compacted, (2) tool group with mixed output (some tools real, some cleared).
— qwen3.7-max via Qwen Code /review
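To make the idempotency scenario in the suggestion above concrete, here is a simplified sketch of a compaction function that skips already-cleared tool groups when counting. The `HistoryItem` shape is reduced to a single `resultDisplay` per tool group, and `toolGroupsToKeep = 15` is an assumption chosen so the numbers match the suggested test; neither is taken from the PR's actual code.

```typescript
const UI_COMPACT_CLEARED_MESSAGE = '[Old tool result content cleared]';

// Simplified, hypothetical item shapes.
type HistoryItem =
  | { type: 'gemini_thought_content'; text: string }
  | { type: 'tool_group'; resultDisplay: string | null }
  | { type: 'user'; text: string };

const isThought = (item: HistoryItem): boolean =>
  item.type === 'gemini_thought_content';

// Already-cleared groups do not count as having output.
const hasLiveOutput = (item: HistoryItem): boolean =>
  item.type === 'tool_group' &&
  item.resultDisplay != null &&
  item.resultDisplay !== UI_COMPACT_CLEARED_MESSAGE;

function compactOldItems(
  history: HistoryItem[],
  thoughtsToKeep = 20,
  toolGroupsToKeep = 15,
): HistoryItem[] {
  let thoughtsToDrop = Math.max(
    0,
    history.filter(isThought).length - thoughtsToKeep,
  );
  let groupsToClear = Math.max(
    0,
    history.filter(hasLiveOutput).length - toolGroupsToKeep,
  );
  // Nothing to do: return the same reference so callers can skip re-renders.
  if (thoughtsToDrop === 0 && groupsToClear === 0) return history;

  const next: HistoryItem[] = [];
  for (const item of history) {
    if (isThought(item) && thoughtsToDrop > 0) {
      thoughtsToDrop--; // oldest thoughts are removed outright
      continue;
    }
    if (item.type === 'tool_group' && hasLiveOutput(item) && groupsToClear > 0) {
      groupsToClear--;
      next.push({ ...item, resultDisplay: UI_COMPACT_CLEARED_MESSAGE });
      continue;
    }
    next.push(item);
  }
  return next;
}
```

Because the count and the rewrite share the same `hasLiveOutput` predicate, a second call on compacted history returns the input unchanged.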
DragonnZhang left a comment
No new issues found. This PR has been through 10+ thorough review rounds with 31 inline comments, all of which have been addressed in the current code. The three core changes (Hook-message microcompaction, compact_history cleanup step, and UI history compaction) are correctly implemented with proper error handling, edge case coverage, and comprehensive tests. LGTM. -- Qwen Code /review
Thanks @DragonnZhang for the approve! |
@wenshao Thanks for the detailed review! Would it be okay if I address your R11 test suggestion in a separate follow-up PR? Would you mind reviewing this one again? |
What this PR does
Fixed: #4815
Three targeted fixes to prevent old-space exhaustion during long-running sessions, plus hardening from review feedback:
- Run microcompaction on Hook messages (goal-mode continuation): `microcompactHistory` was gated behind `UserQuery || Cron` message types. Goal-mode loops use `SendMessageType.Hook`, so tool outputs accumulated indefinitely without ever being truncated. Moving the call outside the type guard lets it run for all non-ToolResult message types (avoids O(history) overhead per tool call).
- Add `compact_history` step to moderate memory-pressure cleanup: When V8 heap usage reaches the `hard` threshold (65% of heap limit), the MemoryPressureMonitor now calls `microcompactHistory` with a forced zero-minute idle threshold, replacing old tool-result content with `[Old tool result content cleared]`. This frees heap memory directly, unlike the existing `evict_cold_cache` step which only touches the FileReadCache. Compaction errors are caught and logged without rethrowing, so subsequent cleanup steps (e.g. `trigger_gc`) still execute.
- Compact UI history at dynamic V8 heap threshold: The previous two fixes only address the API-side history (`GeminiChat.history`). The CLI-side UI history (`HistoryItem[]`), which stores thinking content (`gemini_thought_content`, 74% of items in production) and tool output (`tool_group.resultDisplay`), was never cleaned up. A new `compactOldItems()` method in `useHistoryManager` removes old thinking items (keeps last 20) and replaces old tool `resultDisplay` with `[Old tool result content cleared]`. Thresholds (`MEMORY_UI_COMPACT_THRESHOLD`, `MEMORY_PHYSICAL_DELETE_THRESHOLD`) are computed dynamically from V8's `heap_size_limit` on each check instead of at module load, since V8 grows the limit at runtime.
Hardening (from review feedback)
- `debugLogger.debug()` arguments are guarded with `isEnabled()` hot-path checks: string concatenation, `process.memoryUsage()`, and `.toFixed()` no longer execute when no debug session is active.
- `MEMORY_WARNING_THRESHOLD` uses `min(7 GB, 85% of system RAM)` to prevent OOM on low-RAM machines. UI compact and physical-delete thresholds are functions (not stale module-load constants) so they track V8's dynamically growing `heap_size_limit`.
- `compact_history` cleanup step catches exceptions and logs them without rethrowing, allowing subsequent steps (`clear_file_cache`, `trigger_gc`) to still run.
- `resetChat` uses `getHistoryLength()` instead of `getHistory().length` to avoid a `structuredClone` allocation spike under heap pressure.
- … `SendMessageType.Cron` (triggers microcompaction) and `SendMessageType.Retry` (skips), `compact_history` step paths (success, exception, client-not-initialized), and UI compaction with mixed-type history.
Why it's needed
Long-running sessions (goal mode or sustained interactive use) accumulate large amounts of data in both API history and UI history. The API side had partial coverage (microcompaction for tool outputs, tryCompress for full compaction), but the UI side had zero cleanup.
Production evidence: A goal-mode session accumulated 2,778 UI history items in ~1 hour, with 74% being `gemini_thought_content` (thinking chunks). Despite each item being only ~235 bytes of text, V8 heap grew from 106 MB to 8 GB due to GC inefficiency at scale: late-session items cost ~2.8 MB each vs ~0.8 MB early on. The session OOM'd at the 8 GB V8 heap limit.
The root cause: `gemini_thought_content` items are synthetic (designed to be disposable) but were never disposed. Tool output in `tool_group.resultDisplay` (file diffs, shell output, grep results) also accumulated indefinitely on the UI side.
Reviewer Test Plan
Evidence (Before & After)
Before:
- `microcompactHistory` only runs for `UserQuery` and `Cron` messages; goal-mode `Hook` messages are excluded
- UI history (`HistoryItem[]`) has no compaction; thinking items and tool outputs accumulate for the entire session lifetime
After:
- `microcompactHistory` runs for all message types except `ToolResult`
- … `heap_size_limit`), UI history is compacted: old thinking items removed (last 20 kept), old tool `resultDisplay` cleared
- … `heap_size_limit`
- … `isEnabled()`: zero cost when inactive
- Recent tool results (configurable via `toolResultsNumToKeep`, default 5) are preserved on the API side; recent 20 thinking items preserved on the UI side
How to Verify
- Build and run: `npm run build && npm run bundle`, then `npm start`.
- Start a goal-mode session and let it run for 30+ minutes with tool-heavy operations (file reads, shell commands). Monitor `process.memoryUsage().heapUsed`; it should plateau or decrease after microcompaction kicks in, instead of growing unbounded.
- In an interactive session, trigger moderate memory pressure (e.g., read many large files). Verify the debug log contains:
  - `[COMPACT_HISTORY] cleared N tool result(s)`: API-side compaction
  - `[UI_COMPACT] heapUsed=...MB exceeds ...MB threshold, compacting UI history`: UI-side compaction
  - `[COMPACT_UI_HISTORY] removed N thought item(s), compacted M tool group(s)`: UI compaction results
- Regression: scroll back through conversation history (Ctrl+O detailed mode). Recent tool outputs and thinking should be intact; older tool outputs will show `[Old tool result content cleared]` placeholders, and old thinking items will be removed. Verify the model can still re-read files and execute commands normally.
中文 (Chinese)
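What this PR does
Long-running sessions (goal mode or sustained interactive use) keep accumulating API history and UI history, eventually exhausting the V8 heap (OOM). This PR does three things to prevent that:
- Run microcompaction on Hook messages too: previously only `UserQuery` and `Cron` messages triggered history compaction; the `Hook` messages used by goal mode were missed, so tool output piled up without bound. Now every message type triggers it except `ToolResult` (which avoids walking the whole history on every tool call).
- Proactively compact API history under memory pressure: when V8 heap usage reaches the 65% threshold, MemoryPressureMonitor calls `microcompactHistory` and replaces old tool-call results with a placeholder, freeing heap memory directly. Exceptions during compaction are caught and logged, and do not interrupt the later cleanup steps (such as `trigger_gc`).
- Compact UI history as well: `useHistoryManager` gains `compactOldItems()`, which cleans up old thinking items (keeping the most recent 20) and old tool output. Thresholds are now computed dynamically on every check (V8's `heap_size_limit` grows at runtime) instead of being a fixed value computed once at module load.
Other improvements
- … `isEnabled()` guard: when no session is active, no string concatenation or `process.memoryUsage()` call is performed
- `MEMORY_WARNING_THRESHOLD` is changed to `min(7GB, 85% of system memory)`, so it is not set too high on low-memory machines
- `resetChat` uses the O(1) `getHistoryLength()` instead of `getHistory().length`, avoiding structuredClone overhead under heavy heap pressure
Why it's needed
Production measurement: a goal-mode session ran for about 1 hour and accumulated 2778 UI history items, 74% of which were `gemini_thought_content`. Although each item holds only ~235 bytes of text, the V8 heap grew from 106 MB to 8 GB and then OOM'd. Late-session items effectively cost ~2.8 MB each (vs ~0.8 MB early on), because V8 GC efficiency drops sharply with large numbers of small objects.
Root cause: thinking items are designed to be used once and discarded, but were never discarded. Tool output (file diffs, shell output, grep results) also only ever grew on the UI side.
How to verify
- `npm run build && npm run bundle && npm start`
- … `process.memoryUsage().heapUsed`: once compaction takes effect, it should level off to slow growth.
- … `[COMPACT_HISTORY]` and `[UI_COMPACT]` output.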