feat(core): add memory pressure monitor#4403
Conversation
49ba200 to
5be2c95
Compare
| */ | ||
| getPressureLevel(): 'normal' | 'soft' | 'hard' | 'critical' { | ||
| const mem = process.memoryUsage(); | ||
| const rssRatio = mem.rss / this.effectiveMemoryLimit; |
There was a problem hiding this comment.
[Critical] Division by zero when effectiveMemoryLimit is 0.
heapSizeLimit on the next line has an explicit > 0 guard, but effectiveMemoryLimit does not. If os.totalmem() returns 0 (possible in some containerized or unusual environments) and both cgroup reads return undefined, then computeEffectiveMemoryLimit() returns 0. This makes rssRatio = Infinity, which always evaluates to 'critical' — causing the monitor to aggressively clear FileReadCache every cleanupCooldownMs (default 5 s), degrading performance with no diagnostic output explaining why.
| const rssRatio = mem.rss / this.effectiveMemoryLimit; | |
| const rssRatio = | |
| this.effectiveMemoryLimit > 0 | |
| ? mem.rss / this.effectiveMemoryLimit | |
| : 0; |
— qwen-latest-series-invite-beta-v36 via Qwen Code /review
|
|
||
| if (freedRatio < 0.01) { | ||
| this.consecutiveIneffectiveCleanups++; | ||
| if (this.consecutiveIneffectiveCleanups === 3) { |
There was a problem hiding this comment.
[Suggestion] The memory-cleanup-ineffective event fires only at exactly === 3, never again. By contrast, recordCleanupFailure (line 320) uses >= 3 and keeps emitting on every subsequent failure. If cleanup remains ineffective for 10+ consecutive cycles, only the 3rd emits an event — cycles 4, 5, … are silent. Downstream consumers lose visibility into ongoing memory pressure after the initial alert.
Consider changing to >= 3 to match the failure-path pattern, or emit at periodic thresholds (3, 6, 12…) to limit noise while maintaining visibility:
| if (this.consecutiveIneffectiveCleanups === 3) { | |
| if (this.consecutiveIneffectiveCleanups >= 3) { |
— qwen-latest-series-invite-beta-v36 via Qwen Code /review
| if (limit > hostTotal) return undefined; | ||
|
|
||
| return limit; | ||
| } catch { |
There was a problem hiding this comment.
[Suggestion] All cgroup read errors are silently swallowed — permission denied, corrupt file content, unexpected format all fall back to hostTotal with no log. If a container has a misconfigured memory limit, the monitor falls back to the host total (potentially much larger), never triggers cleanup, and the container gets OOM-killed while the monitor logs show "normal pressure."
Consider adding a debugLogger.warn(...) when returning undefined for unexpected values, including the file path and raw content:
| } catch { | |
| } catch { | |
| debugLogger.debug(`Failed to read cgroup memory limit from ${filePath}`); | |
| return undefined; |
— qwen-latest-series-invite-beta-v36 via Qwen Code /review
| debugLogger.debug(`FileReadCache stale eviction: ${evicted} entries`); | ||
| break; | ||
| } | ||
| case 'trigger_gc': { |
There was a problem hiding this comment.
[Suggestion] When enableExplicitGC is true (via QWEN_MEMORY_ENABLE_GC=1) but the process wasn't started with --expose-gc, this step silently no-ops with no log or warning. An operator debugging critical-level OOM could see enableExplicitGC=true in config and assume GC is running when it isn't.
| case 'trigger_gc': { | |
| case 'trigger_gc': { | |
| if (typeof global.gc === 'function') { | |
| const before = process.memoryUsage().rss; | |
| global.gc(); | |
| const after = process.memoryUsage().rss; | |
| debugLogger.debug(`global.gc() freed ${before - after} bytes`); | |
| } else { | |
| debugLogger.warn( | |
| 'trigger_gc requested but global.gc is not available; ' + | |
| 'start Node.js with --expose-gc', | |
| ); | |
| } | |
| break; | |
| } |
— qwen-latest-series-invite-beta-v36 via Qwen Code /review
5be2c95 to
33a7353
Compare
|
Update pushed in 33a7353. Addressed the memory pressure monitor review feedback:
Verification:
|
|
Quick follow-up on this PR. Given the discussion in #3000 and the remaining #4181-#4184 roadmap, I wanted to check whether maintainers still want this PR reviewed as a standalone low-overhead runtime mitigation. This PR does not try to replace If the preferred direction is to fold this into #4184 or avoid a separate memory-pressure monitor, I am happy to close this PR to reduce maintainer overhead. Otherwise, the branch is currently green and ready for re-review. |
| ) { | ||
| this.queuedCleanupRecommendation = recommendation; | ||
| debugLogger.debug( | ||
| `Queued escalated cleanup "${recommendation.action}" while ` + |
There was a problem hiding this comment.
[Critical] Queued cleanup recommendation can be silently downgraded by a lower-ranked arrival.
The comparison on line 248 checks recommendation.action > this.activeCleanupAction but does NOT compare against the already-queued this.queuedCleanupRecommendation. Consider this sequence:
aggressive(rank 3) cleanup starts →activeCleanupAction = 'aggressive'- Between
await Promise.resolve()boundaries,light(rank 1) arrives →1 > 3is false → skipped ✅
But:
light(rank 1) cleanup starts →activeCleanupAction = 'light'aggressive(rank 3) arrives →3 > 1is true → queued ✅moderate(rank 2) arrives →2 > 1is true → overwrites the queued aggressive with moderate ❌
The most aggressive needed cleanup is silently replaced by a weaker one.
| `Queued escalated cleanup "${recommendation.action}" while ` + | |
| if (this.cleanupInProgress) { | |
| if ( | |
| cleanupActionRank(recommendation.action) > | |
| cleanupActionRank(this.activeCleanupAction) | |
| ) { | |
| const queuedRank = this.queuedCleanupRecommendation | |
| ? cleanupActionRank(this.queuedCleanupRecommendation.action) | |
| : -1; | |
| if (cleanupActionRank(recommendation.action) > queuedRank) { | |
| this.queuedCleanupRecommendation = recommendation; | |
| debugLogger.debug( | |
| `Queued escalated cleanup "${recommendation.action}" while ` + | |
| `"${this.activeCleanupAction}" is in progress`, | |
| ); | |
| } | |
| } else { | |
| debugLogger.debug('Cleanup already in progress, skipping'); | |
| } | |
| return; | |
| } |
— qwen3.7-max via Qwen Code /review
| this.config = { ...(pressureConfig ?? DEFAULT_PRESSURE_CONFIG) }; | ||
| validateMemoryPressureConfig(this.config); | ||
| this.effectiveMemoryLimit = this.computeEffectiveMemoryLimit(); | ||
| this.heapSizeLimit = getHeapStatistics().heap_size_limit; |
There was a problem hiding this comment.
[Critical] heapSizeLimit is cached once at construction and never refreshed.
V8 dynamically adjusts heap_size_limit over the lifetime of a long-running session (e.g., on a 16 GB machine the initial limit might be ~1.5 GB, growing to 4+ GB as the old generation expands). A stale, too-small denominator inflates heapRatio = mem.heapUsed / this.heapSizeLimit, causing early critical cleanup — even when V8 could grow the heap much further before actually running out. Since ratio = Math.max(rssRatio, heapRatio), the stale heap ratio becomes the dominant signal.
getHeapStatistics() is a cheap V8 intrinsic (no I/O, no GC trigger). Refresh on each call:
| this.heapSizeLimit = getHeapStatistics().heap_size_limit; | |
| this.effectiveMemoryLimit = this.computeEffectiveMemoryLimit(); | |
| debugLogger.info( |
Then in getPressureLevel():
const heapStats = getHeapStatistics();
const heapRatio = heapStats.heap_size_limit > 0
? mem.heapUsed / heapStats.heap_size_limit : 0;— qwen3.7-max via Qwen Code /review
| this.consecutiveCleanupFailures = 0; | ||
| const memAfter = process.memoryUsage().rss; | ||
| // RSS can lag behind object graph cleanup, so this is diagnostic | ||
| // logging only and does not drive failure detection. |
There was a problem hiding this comment.
[Critical] Comment and code contradict each other — RSS measurement drives failure detection.
The comment on lines 268–269 says: "RSS can lag behind object graph cleanup, so this is diagnostic logging only and does not drive failure detection."
This is factually wrong: logCleanupResult (invoked via setImmediate using the memAfter captured on line 267) increments consecutiveIneffectiveCleanups and emits memory-cleanup-ineffective events when freedRatio < 0.01. The measurement directly drives both counter state and user-visible events.
Additionally, memAfter is captured in .then() before setImmediate fires — but in the default path (enableExplicitGC: false), cleanup only drops FileReadCache Map references. V8 has not yet GC'd the orphaned entries, so RSS is essentially unchanged and freedRatio < 0.01 fires on virtually every cleanup, generating false memory-cleanup-ineffective events.
Move the RSS measurement inside setImmediate so it benefits from the deferred timing:
| // logging only and does not drive failure detection. | |
| .then(() => { | |
| this.consecutiveCleanupFailures = 0; | |
| setImmediate(() => { | |
| try { | |
| const memAfter = process.memoryUsage().rss; | |
| this.logCleanupResult(memBefore, memAfter, recommendation); | |
| } catch (err) { | |
| debugLogger.error( | |
| `Cleanup measurement failed: ${getErrorMessage(err)}`, | |
| ); | |
| } | |
| }); | |
| }) |
— qwen3.7-max via Qwen Code /review
|
|
||
| const limit = Number(raw); | ||
| if (!Number.isFinite(limit) || limit <= 0) { | ||
| debugLogger.warn( |
There was a problem hiding this comment.
[Critical] No lower bound on cgroup memory limit — a value of "1" (1 byte) passes all validation checks.
A cgroup value like "1" passes Number.isFinite, > 0, Number.isSafeInteger, and <= hostTotal — producing rssRatio = rss / 1 which is astronomically high (~10 billion), triggering aggressive cleanup after every single tool execution. If enableExplicitGC is on with --expose-gc, this becomes a synchronous GC after every tool call — complete denial of service.
No realistic container/VM has a memory limit below a few hundred MB. Add a sensible floor:
| debugLogger.warn( | |
| const limit = Number(raw); | |
| if (!Number.isFinite(limit) || limit <= 0) { | |
| debugLogger.warn( | |
| `Ignoring out-of-range cgroup memory limit from ${filePath}: ${raw}`, | |
| ); | |
| return undefined; | |
| } | |
| // Reject unrealistically small limits (below 64 MiB). | |
| const MIN_CGROUP_MEMORY_LIMIT = 64 * 1024 * 1024; | |
| if (limit < MIN_CGROUP_MEMORY_LIMIT) { | |
| debugLogger.warn( | |
| `Ignoring unrealistically small cgroup memory limit from ${filePath}: ${raw}`, | |
| ); | |
| return undefined; | |
| } |
— qwen3.7-max via Qwen Code /review
| cleanupActionRank(recommendation.action) > | ||
| cleanupActionRank(this.lastCleanupAction); | ||
| if ( | ||
| !isEscalation && |
There was a problem hiding this comment.
[Suggestion] Cooldown blocking branch has no test verifying it actually blocks.
All tests that set cleanupCooldownMs to a nonzero value (e.g., 60000) exercise the isEscalation bypass branch — they never test that a same-level or lower-level cleanup is actually suppressed within the cooldown window. All other tests set cleanupCooldownMs: 0, making the guard always pass.
The cooldown is the primary throttling mechanism. If the comparison logic were inverted, the test suite would not detect it. Consider adding:
it('blocks same-level cleanup within the cooldown window', async () => {
const evictSpy = vi.fn().mockReturnValue(0);
const monitor = new MemoryPressureMonitor(
createMockConfig({ fileReadCache: { evictNotAccessedSince: evictSpy } }),
{ ...DEFAULT_PRESSURE_CONFIG, cleanupCooldownMs: 60_000 },
);
setMemUsage(9 * 1024 * 1024 * 1024); // soft pressure
monitor.performCheck();
await drainCleanupMeasurement();
expect(evictSpy).toHaveBeenCalledTimes(1);
// Second check at same level within cooldown → should be a no-op
monitor.performCheck();
await drainCleanupMeasurement();
expect(evictSpy).toHaveBeenCalledTimes(1); // still 1
});— qwen3.7-max via Qwen Code /review
| this.consecutiveIneffectiveCleanups = 0; | ||
| } | ||
|
|
||
| private recordCleanupFailure( |
There was a problem hiding this comment.
[Suggestion] The consecutiveIneffectiveCleanups = 0 reset on the effective-cleanup branch is untested.
Tests verify the counter increments (when RSS drops < 1%) and can be reset via the public resetConsecutiveFailures(). But no test verifies that an effective cleanup (RSS drops ≥ 1%) automatically resets the counter. If this line were accidentally removed, the monitor would emit spurious memory-cleanup-ineffective events after just 3 total ineffective cleanups spread across effective ones — breaking the consecutive-invariant.
Consider adding a test that interleaves effective and ineffective cleanups and asserts the counter resets.
— qwen3.7-max via Qwen Code /review
|
|
||
| void this.runCleanupSteps(recommendation.steps) | ||
| .then(() => { | ||
| this.consecutiveCleanupFailures = 0; |
There was a problem hiding this comment.
[Suggestion] The consecutiveCleanupFailures = 0 auto-reset after successful cleanup is untested in a failure-then-success scenario.
Tests verify failure counting and the public resetConsecutiveFailures() method, but no test runs failures followed by a successful cleanup and asserts the counter returns to zero. If this reset were removed, the counter would monotonically increase, eventually emitting memory-cleanup-failed events even after recovery.
Consider adding:
it('resets consecutive failures after a successful cleanup', async () => {
const evictSpy = vi.fn().mockImplementation(() => { throw new Error('fail'); });
const monitor = new MemoryPressureMonitor(
createMockConfig({ fileReadCache: { evictNotAccessedSince: evictSpy } }),
{ ...DEFAULT_PRESSURE_CONFIG, cleanupCooldownMs: 0 },
);
setMemUsage(9 * 1024 * 1024 * 1024);
monitor.performCheck();
await drainCleanupMeasurement();
expect(monitor.getConsecutiveFailures()).toBe(1);
evictSpy.mockReturnValue(0); // now succeed
monitor.performCheck();
await drainCleanupMeasurement();
expect(monitor.getConsecutiveFailures()).toBe(0);
});— qwen3.7-max via Qwen Code /review
| return this.consecutiveCleanupFailures; | ||
| } | ||
|
|
||
| resetConsecutiveFailures(): void { |
There was a problem hiding this comment.
[Critical] resetConsecutiveFailures() only resets counters — it does not cancel in-flight cleanup. startNewSession() (config.ts:1928-1929) calls getFileReadCache().clear() then resetConsecutiveFailures(), but if cleanup is in-flight at that moment, the async resolution (.then → setImmediate → clear_file_cache) fires against the FileReadCache after startNewSession has cleared it and the new session may have started re-populating it.
The same applies to a queued escalation in queuedCleanupRecommendation — it is drained by finishCleanupAndRunQueued after the session boundary.
| resetConsecutiveFailures(): void { | |
| resetConsecutiveFailures(): void { | |
| this.consecutiveCleanupFailures = 0; | |
| this.consecutiveIneffectiveCleanups = 0; | |
| // Cancel any in-flight cleanup so its async tail cannot wipe the | |
| // new session's freshly-populated FileReadCache. | |
| this.cleanupInProgress = false; | |
| this.activeCleanupAction = 'none'; | |
| this.queuedCleanupRecommendation = undefined; | |
| } |
A generation counter (bumped here, checked in the .then/setImmediate chain) would be more robust if simply resetting state has side effects on the currently-running runCleanupSteps promise.
— qwen3.7-max via Qwen Code /review
| ); | ||
|
|
||
| if (freedRatio < 0.01) { | ||
| this.consecutiveIneffectiveCleanups++; |
There was a problem hiding this comment.
[Suggestion] No adaptive backoff when cleanup is repeatedly ineffective. When freedRatio < 0.01 under critical pressure, this counter increments and shouldEmitRepeatedDiagnostic throttles the event — but the next performCheckInternal still triggers executeCleanup({ action: 'aggressive', steps: ['clear_file_cache', 'trigger_gc'] }) at full cooldown cadence (5 s).
If memory pressure comes from non-cache sources (chat history accumulation, native allocations), every 5-second cycle runs a stop-the-world global.gc() (if enabled) plus a full cache clear — neither of which helps — degrading responsiveness without reducing RSS. The memory-cleanup-ineffective event is fire-and-forget; nothing subscribes with corrective action.
Consider applying exponential backoff when consecutiveIneffectiveCleanups exceeds a threshold (e.g., cleanupCooldownMs * min(2^k, 64)) or temporarily suspending the aggressive tier until pressure drops below hardPressureRatio.
— qwen3.7-max via Qwen Code /review
|
Follow-up pushed in 55fb90b. Addressed the latest review feedback from May 27:
Verification after the change:
|
wenshao
left a comment
There was a problem hiding this comment.
LGTM ✅ — Third commit comprehensively addresses all prior-round Critical and Suggestion findings. Generation-based cleanup isolation, adaptive backoff, and expanded test coverage (451 tests pass) are well-implemented. CI all_pass 8/8. — qwen3.7-max via Qwen Code /review
Maintainer Verification ReportPR: #4403 — feat(core): add memory pressure monitor Local Build & Test Results
GitHub Actions CI (on
|
| Job | Status |
|---|---|
| Classify PR | ✅ Pass |
| Lint | ✅ Pass |
| CodeQL | ✅ Pass |
| Test (macOS, Node 22.x) | ✅ Pass |
| Test (Ubuntu, Node 22.x) | ✅ Pass |
| Test (Windows, Node 22.x) | ✅ Pass |
Code Review Summary
Architecture & Design: Sound. The monitor fires reactively after tool execution (no background timer), uses dual pressure signals (RSS + V8 heap), and applies graduated escalation (soft → hard → critical). Generation-based cancellation correctly handles session resets. queueMicrotask coalescing for concurrent tool completions is the right approach.
Code Quality: High quality, no bugs found. Cleanup pipeline ordering is correct (executeCleanup → runCleanupSteps → setImmediate → finishCleanupAndRunQueued). Map deletion during iteration is spec-compliant. assertNever exhaustiveness guards on switch defaults. Error isolation via emitSafely prevents listener crashes from breaking the cleanup pipeline.
Integration: Minimal and clean.
coreToolScheduler.ts: 6 lines —scheduleCheck()in thefinallyblock ofexecuteSingleToolCallconfig.ts: 90 lines — mirrors existinggetFileReadCache()pattern with prototype inheritancefileReadCache.ts: 30 lines —evictNotAccessedSince()method
Configuration: Well-designed. Strict Number() parsing (rejects partial parses like "0.8extra"), empty-string guards, ordering validation (soft < hard < critical), range bounds [0.3, 0.98], visible stderr warnings on misconfiguration with safe fallback to defaults.
Test Coverage: Comprehensive — 1,233 lines of monitor tests + 197 config + 117 cache + 39 scheduler. Covers all pressure levels, cgroup v1/v2 parsing, escalation queuing, session-reset cancellation, cooldown blocking, failure accounting, ineffective cleanup diagnostics, GC available/unavailable paths, and env var edge cases.
Security: No concerns. Cgroup paths are hardcoded, no user-supplied paths. Memory info is not exposed externally. global.gc() gated behind explicit opt-in.
Performance: Minimal overhead. One process.memoryUsage() + getHeapStatistics() per tool call, cooldown-protected cleanup, exponential back-off for ineffective aggressive cleanups (max ~320s).
Minor Observations (non-blocking)
- P3:
evictNotAccessedSinceskips entries withlastReadAt === undefined— structurally prevented today but not type-enforced. Consider treatingundefinedas "always stale." - P3: Hard throw in
getMemoryPressureMonitorfor missing config is fail-fast but could returnundefinedfor defense-in-depth. - P4: No
setMaxListenerson EventEmitter (default 10 — fine for typical usage, but diagnostic tooling with many listeners would trigger a warning).
Verdict
✅ Ready to merge. Well-designed, well-implemented, thoroughly tested. All issues are P3/P4 non-blockers. The feature is opt-in and has zero impact on existing functionality when disabled.
Verified locally with tmux parallel execution (build + typecheck + lint + focused tests + full core test suite).
yiliang114
left a comment
There was a problem hiding this comment.
LGTM — confirmed no conflict with current main.
Context: the earlier heap-pressure compaction safety net (#4186) was removed in 924736d6b (2026-05-20), so there is currently no runtime memory pressure handling on main. This PR fills that gap with a conservative, side-effect-minimal approach:
- Integration surface is one optional-chaining call in
coreToolScheduler.tsfinallyblock — worst case = no-op. - Cleanup only evicts cold
FileReadCacheentries; does not touch chat history, compaction, or session state. - cgroup v1/v2 awareness is useful for container deployments.
- Session-boundary isolation + adaptive backoff are well-designed.
This aligns with the #3000 Memory Diagnostics roadmap (specifically the "check memory pressure" + "proactive memory management" goals) and partially addresses #4181 (interpretation layer) by exposing a getPressureLevel() API.
CI all green across 3 platforms. 237 focused tests + 451 related core tests pass.
Thanks @ZevGit for the thorough iteration across 3 review rounds.
#4654) * feat(core): auto-dump memory diagnostics to disk on pressure detection When the MemoryPressureMonitor (#4403) detects hard or critical pressure, write a lightweight diagnostics JSON to .qwen/<project>/diagnostics/ before running cleanup. The file survives even if a subsequent operation triggers OOM, giving maintainers actionable data from bug reports without requiring the user to manually run /doctor memory after a crash. Design follows Claude Code's heapDumpService approach: write the cheap JSON first (small write, won't OOM), heavy snapshot second. Diagnostics include process memory stats, V8 heap stats, session history size, and an actionable suggestion for the user. Per-session limits: max 3 dumps, 30s cooldown between dumps. Closes #4651 * ci: retrigger CI after Windows flaky failure * test(core): use path.join in memoryDiagnosticsDumper test for cross-platform The assertion hard-coded POSIX separators ('/tmp/test-project/diagnostics/'), which fails on Windows where path.join produces backslashes. Build the expected substring with path.join + path.sep so it matches the dumper's actual output on every platform. * fix(core): two-phase memory diagnostics write to survive OOM Two critical issues from review: 1. The async collectMemoryDiagnostics() runs before writeFileSync, but it spawns a `ps` subprocess and reads /proc — fork() under critical memory pressure can fail or be OOM-killed, leaving no file on disk despite the "cheap write first" design comment. 2. dumpCount and lastDumpTime were updated after the await, so concurrent dumps (e.g. hard→critical escalation) would both pass the cap/cooldown guards and overwrite each other. Fix: - Reserve the dump slot synchronously (++dumpCount, lastDumpTime) before any await, so concurrent calls correctly hit the cap. - Phase 1: synchronously write a minimal JSON (process.memoryUsage + v8.getHeapStatistics, no fork/exec) with collectionComplete=false. Because async functions execute synchronously up to the first await, this is guaranteed on disk before the caller's next statement runs. - Phase 2: enrich with full diagnostics asynchronously and overwrite the file with collectionComplete=true. If Phase 2 crashes, the minimal Phase 1 file still survives for debugging. Tests updated for the two-phase write and gain two new cases covering the sync-Phase-1 guarantee and the synchronous slot reservation. * fix(core): point memory diagnostics suggestion at /compress (the actual command) The suggestion text told users to run /compact, which does not exist in this repository — the actual command is /compress (see compressCommand.ts). Pointing users at a nonexistent slash command in a diagnostics report makes the suggestion unactionable.
Summary
Reviewer Focus
Review Follow-up
Latest update:
55fb90be9addresses the May 27 review feedback.resetForNewSession().Validation
Commands run locally:
Observed locally:
npm run buildstill prints existing VS Code companion curly-rule warnings and Browserslist freshness warnings; these are outside this PR.GitHub Actions on
55fb90be9:Quickest reviewer verification path:
Scope and Risk
Linked Issues