Skip to content

feat(core): add memory pressure monitor#4403

Merged
yiliang114 merged 3 commits into
QwenLM:mainfrom
ZevGit:feat/memory-pressure-monitor
May 31, 2026
Merged

feat(core): add memory pressure monitor#4403
yiliang114 merged 3 commits into
QwenLM:mainfrom
ZevGit:feat/memory-pressure-monitor

Conversation

@ZevGit

@ZevGit ZevGit commented May 21, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Added low-overhead runtime memory-pressure handling for long-running sessions.
  • The monitor classifies pressure using cgroup-aware RSS and V8 heap usage, then conservatively evicts FileReadCache metadata under soft, hard, or critical pressure.
  • The implementation preserves active conversation state and avoids automatic compaction or user-visible session interruption.
  • Configuration is opt-in/tunable through environment variables with strict parsing, validation, and diagnostics for invalid values.

Reviewer Focus

  • Cleanup policy and thresholds for soft, hard, and critical pressure.
  • Container memory-limit handling for cgroup v2 and cgroup v1.
  • Session-boundary behavior, especially avoiding stale async cleanup tails after a new session starts.
  • Ineffective-cleanup diagnostics and adaptive backoff for repeated aggressive cleanups.
  • The decision to limit automatic cleanup to cache metadata and optional explicit GC, rather than dropping conversation state.

Review Follow-up

Latest update: 55fb90be9 addresses the May 27 review feedback.

  • Added a session-generation guard so stale async cleanup tails and queued escalations cannot clear a newly-started session FileReadCache.
  • Added adaptive backoff for repeated ineffective aggressive cleanup, reducing the risk of repeated cache clears or explicit GC when memory pressure comes from non-cache sources.
  • Split diagnostic counter reset from session-scoped cleanup cancellation via resetForNewSession().
  • Added regression tests for queued cleanup cancellation on session reset, aggressive cleanup backoff, effective-cleanup counter reset, failure recovery, GC availability, and cooldown behavior.
  • Tightened test isolation by restoring Vitest spies after each memory-pressure test.

Validation

Commands run locally:

npx eslint packages/core/src/services/memoryPressureMonitor.ts packages/core/src/services/memoryPressureMonitor.test.ts packages/core/src/services/fileReadCache.ts packages/core/src/services/fileReadCache.test.ts packages/core/src/config/config.ts packages/core/src/config/config.test.ts packages/core/src/core/coreToolScheduler.ts packages/core/src/core/coreToolScheduler.test.ts
cd packages/core && npx vitest run src/services/memoryPressureMonitor.test.ts src/config/config.test.ts src/services/fileReadCache.test.ts src/core/coreToolScheduler.test.ts
npm run lint
npm run build
npm run typecheck
git diff --check

Observed locally:

  • Focused memory/config tests: 237/237 passed.
  • Related core tests: 451/451 passed.
  • Prettier, targeted ESLint, root lint, root typecheck, clean root build, and diff whitespace checks passed.
  • npm run build still prints existing VS Code companion curly-rule warnings and Browserslist freshness warnings; these are outside this PR.

GitHub Actions on 55fb90be9:

  • Classify PR: passed.
  • Lint: passed.
  • Tests on macOS, Ubuntu, and Windows with Node 22.x: passed.
  • CodeQL: passed.
  • Post Coverage Comment: skipped by workflow configuration.

Quickest reviewer verification path:

cd packages/core && npx vitest run src/services/memoryPressureMonitor.test.ts src/config/config.test.ts src/services/fileReadCache.test.ts src/core/coreToolScheduler.test.ts

Scope and Risk

  • The cleanup is intentionally limited to file-read cache metadata and optional explicit GC when configured.
  • Under sustained pressure, inactive file-read metadata may be evicted, which can reduce cache hit rate but should not drop conversation state or disconnect an active AI session.
  • Container behavior is covered by unit tests with mocked cgroup v1/v2 files. No live Docker, Podman, or Linux container smoke test was run locally.
  • No breaking changes. All new environment variables are optional and default to conservative behavior.

Linked Issues

@ZevGit ZevGit force-pushed the feat/memory-pressure-monitor branch 3 times, most recently from 49ba200 to 5be2c95 Compare May 21, 2026 12:10
*/
getPressureLevel(): 'normal' | 'soft' | 'hard' | 'critical' {
const mem = process.memoryUsage();
const rssRatio = mem.rss / this.effectiveMemoryLimit;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Critical] Division by zero when effectiveMemoryLimit is 0.

heapSizeLimit on the next line has an explicit > 0 guard, but effectiveMemoryLimit does not. If os.totalmem() returns 0 (possible in some containerized or unusual environments) and both cgroup reads return undefined, then computeEffectiveMemoryLimit() returns 0. This makes rssRatio = Infinity, which always evaluates to 'critical' — causing the monitor to aggressively clear FileReadCache every cleanupCooldownMs (default 5 s), degrading performance with no diagnostic output explaining why.

Suggested change
const rssRatio = mem.rss / this.effectiveMemoryLimit;
const rssRatio =
this.effectiveMemoryLimit > 0
? mem.rss / this.effectiveMemoryLimit
: 0;

— qwen-latest-series-invite-beta-v36 via Qwen Code /review


if (freedRatio < 0.01) {
this.consecutiveIneffectiveCleanups++;
if (this.consecutiveIneffectiveCleanups === 3) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] The memory-cleanup-ineffective event fires only at exactly === 3, never again. By contrast, recordCleanupFailure (line 320) uses >= 3 and keeps emitting on every subsequent failure. If cleanup remains ineffective for 10+ consecutive cycles, only the 3rd emits an event — cycles 4, 5, … are silent. Downstream consumers lose visibility into ongoing memory pressure after the initial alert.

Consider changing to >= 3 to match the failure-path pattern, or emit at periodic thresholds (3, 6, 12…) to limit noise while maintaining visibility:

Suggested change
if (this.consecutiveIneffectiveCleanups === 3) {
if (this.consecutiveIneffectiveCleanups >= 3) {

— qwen-latest-series-invite-beta-v36 via Qwen Code /review

if (limit > hostTotal) return undefined;

return limit;
} catch {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] All cgroup read errors are silently swallowed — permission denied, corrupt file content, unexpected format all fall back to hostTotal with no log. If a container has a misconfigured memory limit, the monitor falls back to the host total (potentially much larger), never triggers cleanup, and the container gets OOM-killed while the monitor logs show "normal pressure."

Consider adding a debugLogger.warn(...) when returning undefined for unexpected values, including the file path and raw content:

Suggested change
} catch {
} catch {
debugLogger.debug(`Failed to read cgroup memory limit from ${filePath}`);
return undefined;

— qwen-latest-series-invite-beta-v36 via Qwen Code /review

debugLogger.debug(`FileReadCache stale eviction: ${evicted} entries`);
break;
}
case 'trigger_gc': {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] When enableExplicitGC is true (via QWEN_MEMORY_ENABLE_GC=1) but the process wasn't started with --expose-gc, this step silently no-ops with no log or warning. An operator debugging critical-level OOM could see enableExplicitGC=true in config and assume GC is running when it isn't.

Suggested change
case 'trigger_gc': {
case 'trigger_gc': {
if (typeof global.gc === 'function') {
const before = process.memoryUsage().rss;
global.gc();
const after = process.memoryUsage().rss;
debugLogger.debug(`global.gc() freed ${before - after} bytes`);
} else {
debugLogger.warn(
'trigger_gc requested but global.gc is not available; ' +
'start Node.js with --expose-gc',
);
}
break;
}

— qwen-latest-series-invite-beta-v36 via Qwen Code /review

@ZevGit ZevGit force-pushed the feat/memory-pressure-monitor branch from 5be2c95 to 33a7353 Compare May 22, 2026 02:55
@ZevGit

ZevGit commented May 22, 2026

Copy link
Copy Markdown
Contributor Author

Update pushed in 33a7353.

Addressed the memory pressure monitor review feedback:

  • Hardened cgroup memory limit parsing and fallback observability.
  • Added defensive handling for memoryUsage failures and queued cleanup startup failures.
  • Added exhaustive switch guards for cleanup recommendations, cleanup steps, and action ranking.
  • Documented the Object.create child Config monitor isolation contract.

Verification:

  • Focused core tests: 443/443 passed
  • Prettier / ESLint / git diff --check passed
  • npm run typecheck passed
  • npm run lint passed
  • Clean npm run build passed
  • GitHub CI is green across lint, macOS/Ubuntu/Windows tests, and CodeQL.

@ZevGit

ZevGit commented May 27, 2026

Copy link
Copy Markdown
Contributor Author

Quick follow-up on this PR.

Given the discussion in #3000 and the remaining #4181-#4184 roadmap, I wanted to check whether maintainers still want this PR reviewed as a standalone low-overhead runtime mitigation.

This PR does not try to replace /doctor memory, heap snapshots, memory timeline diagnostics, or bounded tool-output offload. Its scope is narrower: detect RSS/heap pressure during normal tool execution and conservatively release FileReadCache metadata under pressure.

If the preferred direction is to fold this into #4184 or avoid a separate memory-pressure monitor, I am happy to close this PR to reduce maintainer overhead. Otherwise, the branch is currently green and ready for re-review.

) {
this.queuedCleanupRecommendation = recommendation;
debugLogger.debug(
`Queued escalated cleanup "${recommendation.action}" while ` +

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Critical] Queued cleanup recommendation can be silently downgraded by a lower-ranked arrival.

The comparison on line 248 checks recommendation.action > this.activeCleanupAction but does NOT compare against the already-queued this.queuedCleanupRecommendation. Consider this sequence:

  1. aggressive (rank 3) cleanup starts → activeCleanupAction = 'aggressive'
  2. Between await Promise.resolve() boundaries, light (rank 1) arrives → 1 > 3 is false → skipped ✅

But:

  1. light (rank 1) cleanup starts → activeCleanupAction = 'light'
  2. aggressive (rank 3) arrives → 3 > 1 is true → queued ✅
  3. moderate (rank 2) arrives → 2 > 1 is true → overwrites the queued aggressive with moderate

The most aggressive needed cleanup is silently replaced by a weaker one.

Suggested change
`Queued escalated cleanup "${recommendation.action}" while ` +
if (this.cleanupInProgress) {
if (
cleanupActionRank(recommendation.action) >
cleanupActionRank(this.activeCleanupAction)
) {
const queuedRank = this.queuedCleanupRecommendation
? cleanupActionRank(this.queuedCleanupRecommendation.action)
: -1;
if (cleanupActionRank(recommendation.action) > queuedRank) {
this.queuedCleanupRecommendation = recommendation;
debugLogger.debug(
`Queued escalated cleanup "${recommendation.action}" while ` +
`"${this.activeCleanupAction}" is in progress`,
);
}
} else {
debugLogger.debug('Cleanup already in progress, skipping');
}
return;
}

— qwen3.7-max via Qwen Code /review

this.config = { ...(pressureConfig ?? DEFAULT_PRESSURE_CONFIG) };
validateMemoryPressureConfig(this.config);
this.effectiveMemoryLimit = this.computeEffectiveMemoryLimit();
this.heapSizeLimit = getHeapStatistics().heap_size_limit;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Critical] heapSizeLimit is cached once at construction and never refreshed.

V8 dynamically adjusts heap_size_limit over the lifetime of a long-running session (e.g., on a 16 GB machine the initial limit might be ~1.5 GB, growing to 4+ GB as the old generation expands). A stale, too-small denominator inflates heapRatio = mem.heapUsed / this.heapSizeLimit, causing early critical cleanup — even when V8 could grow the heap much further before actually running out. Since ratio = Math.max(rssRatio, heapRatio), the stale heap ratio becomes the dominant signal.

getHeapStatistics() is a cheap V8 intrinsic (no I/O, no GC trigger). Refresh on each call:

Suggested change
this.heapSizeLimit = getHeapStatistics().heap_size_limit;
this.effectiveMemoryLimit = this.computeEffectiveMemoryLimit();
debugLogger.info(

Then in getPressureLevel():

const heapStats = getHeapStatistics();
const heapRatio = heapStats.heap_size_limit > 0
  ? mem.heapUsed / heapStats.heap_size_limit : 0;

— qwen3.7-max via Qwen Code /review

this.consecutiveCleanupFailures = 0;
const memAfter = process.memoryUsage().rss;
// RSS can lag behind object graph cleanup, so this is diagnostic
// logging only and does not drive failure detection.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Critical] Comment and code contradict each other — RSS measurement drives failure detection.

The comment on lines 268–269 says: "RSS can lag behind object graph cleanup, so this is diagnostic logging only and does not drive failure detection."

This is factually wrong: logCleanupResult (invoked via setImmediate using the memAfter captured on line 267) increments consecutiveIneffectiveCleanups and emits memory-cleanup-ineffective events when freedRatio < 0.01. The measurement directly drives both counter state and user-visible events.

Additionally, memAfter is captured in .then() before setImmediate fires — but in the default path (enableExplicitGC: false), cleanup only drops FileReadCache Map references. V8 has not yet GC'd the orphaned entries, so RSS is essentially unchanged and freedRatio < 0.01 fires on virtually every cleanup, generating false memory-cleanup-ineffective events.

Move the RSS measurement inside setImmediate so it benefits from the deferred timing:

Suggested change
// logging only and does not drive failure detection.
.then(() => {
this.consecutiveCleanupFailures = 0;
setImmediate(() => {
try {
const memAfter = process.memoryUsage().rss;
this.logCleanupResult(memBefore, memAfter, recommendation);
} catch (err) {
debugLogger.error(
`Cleanup measurement failed: ${getErrorMessage(err)}`,
);
}
});
})

— qwen3.7-max via Qwen Code /review


const limit = Number(raw);
if (!Number.isFinite(limit) || limit <= 0) {
debugLogger.warn(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Critical] No lower bound on cgroup memory limit — a value of "1" (1 byte) passes all validation checks.

A cgroup value like "1" passes Number.isFinite, > 0, Number.isSafeInteger, and <= hostTotal — producing rssRatio = rss / 1 which is astronomically high (~10 billion), triggering aggressive cleanup after every single tool execution. If enableExplicitGC is on with --expose-gc, this becomes a synchronous GC after every tool call — complete denial of service.

No realistic container/VM has a memory limit below a few hundred MB. Add a sensible floor:

Suggested change
debugLogger.warn(
const limit = Number(raw);
if (!Number.isFinite(limit) || limit <= 0) {
debugLogger.warn(
`Ignoring out-of-range cgroup memory limit from ${filePath}: ${raw}`,
);
return undefined;
}
// Reject unrealistically small limits (below 64 MiB).
const MIN_CGROUP_MEMORY_LIMIT = 64 * 1024 * 1024;
if (limit < MIN_CGROUP_MEMORY_LIMIT) {
debugLogger.warn(
`Ignoring unrealistically small cgroup memory limit from ${filePath}: ${raw}`,
);
return undefined;
}

— qwen3.7-max via Qwen Code /review

cleanupActionRank(recommendation.action) >
cleanupActionRank(this.lastCleanupAction);
if (
!isEscalation &&

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] Cooldown blocking branch has no test verifying it actually blocks.

All tests that set cleanupCooldownMs to a nonzero value (e.g., 60000) exercise the isEscalation bypass branch — they never test that a same-level or lower-level cleanup is actually suppressed within the cooldown window. All other tests set cleanupCooldownMs: 0, making the guard always pass.

The cooldown is the primary throttling mechanism. If the comparison logic were inverted, the test suite would not detect it. Consider adding:

it('blocks same-level cleanup within the cooldown window', async () => {
  const evictSpy = vi.fn().mockReturnValue(0);
  const monitor = new MemoryPressureMonitor(
    createMockConfig({ fileReadCache: { evictNotAccessedSince: evictSpy } }),
    { ...DEFAULT_PRESSURE_CONFIG, cleanupCooldownMs: 60_000 },
  );
  setMemUsage(9 * 1024 * 1024 * 1024); // soft pressure
  monitor.performCheck();
  await drainCleanupMeasurement();
  expect(evictSpy).toHaveBeenCalledTimes(1);

  // Second check at same level within cooldown → should be a no-op
  monitor.performCheck();
  await drainCleanupMeasurement();
  expect(evictSpy).toHaveBeenCalledTimes(1); // still 1
});

— qwen3.7-max via Qwen Code /review

this.consecutiveIneffectiveCleanups = 0;
}

private recordCleanupFailure(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] The consecutiveIneffectiveCleanups = 0 reset on the effective-cleanup branch is untested.

Tests verify the counter increments (when RSS drops < 1%) and can be reset via the public resetConsecutiveFailures(). But no test verifies that an effective cleanup (RSS drops ≥ 1%) automatically resets the counter. If this line were accidentally removed, the monitor would emit spurious memory-cleanup-ineffective events after just 3 total ineffective cleanups spread across effective ones — breaking the consecutive-invariant.

Consider adding a test that interleaves effective and ineffective cleanups and asserts the counter resets.

— qwen3.7-max via Qwen Code /review


void this.runCleanupSteps(recommendation.steps)
.then(() => {
this.consecutiveCleanupFailures = 0;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] The consecutiveCleanupFailures = 0 auto-reset after successful cleanup is untested in a failure-then-success scenario.

Tests verify failure counting and the public resetConsecutiveFailures() method, but no test runs failures followed by a successful cleanup and asserts the counter returns to zero. If this reset were removed, the counter would monotonically increase, eventually emitting memory-cleanup-failed events even after recovery.

Consider adding:

it('resets consecutive failures after a successful cleanup', async () => {
  const evictSpy = vi.fn().mockImplementation(() => { throw new Error('fail'); });
  const monitor = new MemoryPressureMonitor(
    createMockConfig({ fileReadCache: { evictNotAccessedSince: evictSpy } }),
    { ...DEFAULT_PRESSURE_CONFIG, cleanupCooldownMs: 0 },
  );
  setMemUsage(9 * 1024 * 1024 * 1024);
  monitor.performCheck();
  await drainCleanupMeasurement();
  expect(monitor.getConsecutiveFailures()).toBe(1);

  evictSpy.mockReturnValue(0); // now succeed
  monitor.performCheck();
  await drainCleanupMeasurement();
  expect(monitor.getConsecutiveFailures()).toBe(0);
});

— qwen3.7-max via Qwen Code /review

return this.consecutiveCleanupFailures;
}

resetConsecutiveFailures(): void {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Critical] resetConsecutiveFailures() only resets counters — it does not cancel in-flight cleanup. startNewSession() (config.ts:1928-1929) calls getFileReadCache().clear() then resetConsecutiveFailures(), but if cleanup is in-flight at that moment, the async resolution (.thensetImmediateclear_file_cache) fires against the FileReadCache after startNewSession has cleared it and the new session may have started re-populating it.

The same applies to a queued escalation in queuedCleanupRecommendation — it is drained by finishCleanupAndRunQueued after the session boundary.

Suggested change
resetConsecutiveFailures(): void {
resetConsecutiveFailures(): void {
this.consecutiveCleanupFailures = 0;
this.consecutiveIneffectiveCleanups = 0;
// Cancel any in-flight cleanup so its async tail cannot wipe the
// new session's freshly-populated FileReadCache.
this.cleanupInProgress = false;
this.activeCleanupAction = 'none';
this.queuedCleanupRecommendation = undefined;
}

A generation counter (bumped here, checked in the .then/setImmediate chain) would be more robust if simply resetting state has side effects on the currently-running runCleanupSteps promise.

— qwen3.7-max via Qwen Code /review

);

if (freedRatio < 0.01) {
this.consecutiveIneffectiveCleanups++;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Suggestion] No adaptive backoff when cleanup is repeatedly ineffective. When freedRatio < 0.01 under critical pressure, this counter increments and shouldEmitRepeatedDiagnostic throttles the event — but the next performCheckInternal still triggers executeCleanup({ action: 'aggressive', steps: ['clear_file_cache', 'trigger_gc'] }) at full cooldown cadence (5 s).

If memory pressure comes from non-cache sources (chat history accumulation, native allocations), every 5-second cycle runs a stop-the-world global.gc() (if enabled) plus a full cache clear — neither of which helps — degrading responsiveness without reducing RSS. The memory-cleanup-ineffective event is fire-and-forget; nothing subscribes with corrective action.

Consider applying exponential backoff when consecutiveIneffectiveCleanups exceeds a threshold (e.g., cleanupCooldownMs * min(2^k, 64)) or temporarily suspending the aggressive tier until pressure drops below hardPressureRatio.

— qwen3.7-max via Qwen Code /review

@ZevGit

ZevGit commented May 27, 2026

Copy link
Copy Markdown
Contributor Author

Follow-up pushed in 55fb90b.

Addressed the latest review feedback from May 27:

  • Added session-boundary cleanup isolation with a generation guard so stale async cleanup tails and queued escalations cannot clear a newly-started session's FileReadCache.
  • Added adaptive backoff for repeated ineffective aggressive cleanup so critical pressure from non-cache sources does not keep clearing cache / running explicit GC at the base cooldown cadence.
  • Split the public reset API so normal diagnostic counter reset remains separate from session-scoped cleanup cancellation.
  • Added regression coverage for queued cleanup cancellation on session reset and aggressive cleanup backoff, plus tightened test spy isolation.

Verification after the change:

  • Focused memory/config tests: 237/237 passed
  • Related core tests: 451/451 passed
  • Prettier / targeted ESLint / git diff --check passed
  • npm run lint passed
  • npm run typecheck passed
  • clean npm run build passed
  • GitHub CI is green across Lint, macOS/Ubuntu/Windows tests, and CodeQL.

@wenshao wenshao left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM ✅ — Third commit comprehensively addresses all prior-round Critical and Suggestion findings. Generation-based cleanup isolation, adaptive backoff, and expanded test coverage (451 tests pass) are well-implemented. CI all_pass 8/8. — qwen3.7-max via Qwen Code /review

@wenshao

wenshao commented May 27, 2026

Copy link
Copy Markdown
Collaborator

Maintainer Verification Report

PR: #4403 — feat(core): add memory pressure monitor
Commit: 55fb90be9 (HEAD of feat/memory-pressure-monitor)
Verified by: wenshao
Date: 2026-05-27


Local Build & Test Results

Check Result Notes
npm run build ✅ Pass Clean build after dist/ cleanup. Pre-existing vscode-ide-companion curly-brace warnings (not from this PR)
npm run lint ✅ Pass No lint errors
npm run typecheck ⚠️ Pre-existing failure TS2739 in packages/cli/src/serve/httpAcpBridge.tsnot touched by this PR
Targeted ESLint (PR files) ✅ Pass All 8 PR-changed files pass ESLint
git diff --check ✅ Pass No whitespace issues
Focused tests (4 suites) 451/451 passed memoryPressureMonitor, config, fileReadCache, coreToolScheduler
Full packages/core tests 9264/9265 passed (1 pre-existing failure) The single failure is in anthropicContentGenerator.test.ts (User-Agent string mismatch claude-cli vs QwenCode) — not touched by this PR

GitHub Actions CI (on 55fb90be9)

Job Status
Classify PR ✅ Pass
Lint ✅ Pass
CodeQL ✅ Pass
Test (macOS, Node 22.x) ✅ Pass
Test (Ubuntu, Node 22.x) ✅ Pass
Test (Windows, Node 22.x) ✅ Pass

Code Review Summary

Architecture & Design: Sound. The monitor fires reactively after tool execution (no background timer), uses dual pressure signals (RSS + V8 heap), and applies graduated escalation (soft → hard → critical). Generation-based cancellation correctly handles session resets. queueMicrotask coalescing for concurrent tool completions is the right approach.

Code Quality: High quality, no bugs found. Cleanup pipeline ordering is correct (executeCleanuprunCleanupStepssetImmediatefinishCleanupAndRunQueued). Map deletion during iteration is spec-compliant. assertNever exhaustiveness guards on switch defaults. Error isolation via emitSafely prevents listener crashes from breaking the cleanup pipeline.

Integration: Minimal and clean.

  • coreToolScheduler.ts: 6 lines — scheduleCheck() in the finally block of executeSingleToolCall
  • config.ts: 90 lines — mirrors existing getFileReadCache() pattern with prototype inheritance
  • fileReadCache.ts: 30 lines — evictNotAccessedSince() method

Configuration: Well-designed. Strict Number() parsing (rejects partial parses like "0.8extra"), empty-string guards, ordering validation (soft < hard < critical), range bounds [0.3, 0.98], visible stderr warnings on misconfiguration with safe fallback to defaults.

Test Coverage: Comprehensive — 1,233 lines of monitor tests + 197 config + 117 cache + 39 scheduler. Covers all pressure levels, cgroup v1/v2 parsing, escalation queuing, session-reset cancellation, cooldown blocking, failure accounting, ineffective cleanup diagnostics, GC available/unavailable paths, and env var edge cases.

Security: No concerns. Cgroup paths are hardcoded, no user-supplied paths. Memory info is not exposed externally. global.gc() gated behind explicit opt-in.

Performance: Minimal overhead. One process.memoryUsage() + getHeapStatistics() per tool call, cooldown-protected cleanup, exponential back-off for ineffective aggressive cleanups (max ~320s).

Minor Observations (non-blocking)

  1. P3: evictNotAccessedSince skips entries with lastReadAt === undefined — structurally prevented today but not type-enforced. Consider treating undefined as "always stale."
  2. P3: Hard throw in getMemoryPressureMonitor for missing config is fail-fast but could return undefined for defense-in-depth.
  3. P4: No setMaxListeners on EventEmitter (default 10 — fine for typical usage, but diagnostic tooling with many listeners would trigger a warning).

Verdict

Ready to merge. Well-designed, well-implemented, thoroughly tested. All issues are P3/P4 non-blockers. The feature is opt-in and has zero impact on existing functionality when disabled.


Verified locally with tmux parallel execution (build + typecheck + lint + focused tests + full core test suite).

@yiliang114 yiliang114 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — confirmed no conflict with current main.

Context: the earlier heap-pressure compaction safety net (#4186) was removed in 924736d6b (2026-05-20), so there is currently no runtime memory pressure handling on main. This PR fills that gap with a conservative, side-effect-minimal approach:

  • Integration surface is one optional-chaining call in coreToolScheduler.ts finally block — worst case = no-op.
  • Cleanup only evicts cold FileReadCache entries; does not touch chat history, compaction, or session state.
  • cgroup v1/v2 awareness is useful for container deployments.
  • Session-boundary isolation + adaptive backoff are well-designed.

This aligns with the #3000 Memory Diagnostics roadmap (specifically the "check memory pressure" + "proactive memory management" goals) and partially addresses #4181 (interpretation layer) by exposing a getPressureLevel() API.

CI all green across 3 platforms. 237 focused tests + 451 related core tests pass.

Thanks @ZevGit for the thorough iteration across 3 review rounds.

@yiliang114 yiliang114 merged commit 3fc1849 into QwenLM:main May 31, 2026
10 checks passed
yiliang114 added a commit that referenced this pull request Jun 2, 2026
#4654)

* feat(core): auto-dump memory diagnostics to disk on pressure detection

When the MemoryPressureMonitor (#4403) detects hard or critical pressure,
write a lightweight diagnostics JSON to .qwen/<project>/diagnostics/ before
running cleanup. The file survives even if a subsequent operation triggers
OOM, giving maintainers actionable data from bug reports without requiring
the user to manually run /doctor memory after a crash.

Design follows Claude Code's heapDumpService approach: write the cheap JSON
first (small write, won't OOM), heavy snapshot second. Diagnostics include
process memory stats, V8 heap stats, session history size, and an actionable
suggestion for the user.

Per-session limits: max 3 dumps, 30s cooldown between dumps.

Closes #4651

* ci: retrigger CI after Windows flaky failure

* test(core): use path.join in memoryDiagnosticsDumper test for cross-platform

The assertion hard-coded POSIX separators ('/tmp/test-project/diagnostics/'),
which fails on Windows where path.join produces backslashes. Build the
expected substring with path.join + path.sep so it matches the dumper's
actual output on every platform.

* fix(core): two-phase memory diagnostics write to survive OOM

Two critical issues from review:

1. The async collectMemoryDiagnostics() runs before writeFileSync, but it
   spawns a `ps` subprocess and reads /proc — fork() under critical memory
   pressure can fail or be OOM-killed, leaving no file on disk despite the
   "cheap write first" design comment.

2. dumpCount and lastDumpTime were updated after the await, so concurrent
   dumps (e.g. hard→critical escalation) would both pass the cap/cooldown
   guards and overwrite each other.

Fix:

- Reserve the dump slot synchronously (++dumpCount, lastDumpTime) before
  any await, so concurrent calls correctly hit the cap.
- Phase 1: synchronously write a minimal JSON (process.memoryUsage +
  v8.getHeapStatistics, no fork/exec) with collectionComplete=false.
  Because async functions execute synchronously up to the first await,
  this is guaranteed on disk before the caller's next statement runs.
- Phase 2: enrich with full diagnostics asynchronously and overwrite the
  file with collectionComplete=true. If Phase 2 crashes, the minimal
  Phase 1 file still survives for debugging.

Tests updated for the two-phase write and gain two new cases covering the
sync-Phase-1 guarantee and the synchronous slot reservation.

* fix(core): point memory diagnostics suggestion at /compress (the actual command)

The suggestion text told users to run /compact, which does not exist in
this repository — the actual command is /compress (see compressCommand.ts).
Pointing users at a nonexistent slash command in a diagnostics report
makes the suggestion unactionable.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants