feat(telemetry): add runtime memory/CPU sampling with OTel metric reporting #4868
…orting
Add a lightweight RuntimeSampleRing to the memory pressure monitor that captures memory (RSS, heap, external) and CPU (user/system/percent) on every pressure check. The ring buffer always records (for local crash diagnostics), while OTel metric reporting is gated behind isPerformanceMonitoringActive(), so reporting adds no overhead when telemetry is off.
When the monitor triggers a diagnostics dump (hard/critical pressure), the recent samples are included in the dump JSON as recentSamples[], giving maintainers a memory/CPU trend leading up to the event.
Closes #4183
Code Coverage Summary
CLI Package - Full Text Report
Core Package - Full Text Report
For detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run.
- Fix Critical: record() no longer calls process.memoryUsage() redundantly; reuses the snapshot from getPressureLevelWithMem()
- Fix JSDoc: class comment now correctly states ring buffer is always active
- Fix elapsed=0: return a shallow copy (not internal reference) to prevent caller mutation of ring buffer state
- Fix cpuPercent: normalize by os.cpus().length so values stay 0-100%
- Remove dead fields: cpuUserMs/cpuSystemMs had no downstream consumer
- Export RuntimeSampleRing class for direct unit testing
- Add runtimeSampleRing.test.ts with coverage for: normal record, elapsed=0 edge case, ring eviction, getAll copy semantics, and reset
- Add dumper test verifying recentSamples appears in serialized output
… mock
- performCheckInternal now calls the public getPressureLevel() (which tests spy on) instead of a private getPressureLevelWithMem(), restoring the test spy contract for 4 cleanup tests
- Remove unused getPressureLevelWithMem() — the mem snapshot for record() is fetched directly via process.memoryUsage() in the sampling try-catch block
- Add cpus() to the node:os mock in memoryPressureMonitor.test.ts to satisfy the new CPU_CORE_COUNT module-level constant
…-cpu-metrics
# Conflicts:
#   packages/core/src/services/memoryPressureMonitor.ts
…and test coverage
- performCheckInternal now reads process.memoryUsage() once per check cycle and shares the snapshot between getPressureLevel() and runtimeSamples.record(), removing the redundant syscall while keeping the public getPressureLevel() spy contract (it still gets called, with an optional pre-fetched snapshot)
- fix cpuPercent JSDoc: the value is a percentage of total system capacity (normalized by core count), not per-core
- add test: OTel reporting path (recordMemoryUsage ×2 + recordCpuUsage ×1 when isPerformanceMonitoringActive is true, nothing when false) via a partial mock of telemetry/metrics.js
- add test: diagnostics dumper receives non-empty runtime samples on hard pressure
- add test: CPU delta from a same-tick (elapsed <= 0) sample accumulates into the next sample instead of being lost
- adjust queued-cleanup-failure test mock sequence for the per-check sampling snapshot introduced by this branch
qwen-code-ci-bot
left a comment
Hey @yiliang114, thanks for the PR! The feature itself looks interesting, but the PR body doesn't follow our PR template.
The template requires these sections:
- What this PR does — describe the change in prose
- Why it's needed — motivation and problem being solved
- Reviewer Test Plan — with How to verify, Evidence (Before & After), Tested on, and optional Environment
- Risk & Scope — tradeoffs, out-of-scope items, breaking changes
- Linked Issues — closing keywords or references
- 中文说明 — full Chinese translation in a <details> block
Your current body uses ## Summary and ## Test plan, which don't match. Could you restructure it to follow the template? It helps reviewers (and future-you reading git blame) understand the why and how to verify at a glance.
Once that's updated I'll continue the review. 🙏
— Qwen Code · qwen3.7-max
Pull request overview
Adds a bounded runtime memory/CPU sampling mechanism to support memory-pressure diagnostics and (optionally) report performance metrics via OpenTelemetry. This extends the existing MemoryPressureMonitor/MemoryDiagnosticsDumper flow by capturing a short timeline leading up to hard/critical pressure events.
Changes:
- Introduces RuntimeSampleRing to record recent RSS/heap/external + normalized CPU usage samples during each pressure check.
- Gates OTel metric reporting for memory/CPU behind isPerformanceMonitoringActive() and wires reporting into the monitor’s check loop.
- Extends diagnostics dumps to include recentSamples (serialized in both “Phase 1” and “Phase 2” payloads) and adds/updates tests accordingly.
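To make the ring's behavior concrete, here is a minimal sketch of a bounded sample ring with core-normalized CPU percent. It is not the PR's actual implementation: the class name `SampleRing`, the injected `readCpu` and `now` callbacks, and the constructor shape are assumptions made for illustration. The sample fields mirror the dump JSON quoted in this thread, and the 60-entry default follows the capacity stated in the PR description.

```typescript
// Hypothetical sketch, not the code under review.
interface MemSnapshot {
  rss: number;
  heapUsed: number;
  heapTotal: number;
  external: number;
}

// CPU times in microseconds, the unit process.cpuUsage() reports.
interface CpuTimes {
  user: number;
  system: number;
}

interface RuntimeSample extends MemSnapshot {
  ts: number;
  cpuPercent: number;
}

class SampleRing {
  private readonly samples: RuntimeSample[] = [];
  private readonly readCpu: () => CpuTimes;
  private readonly now: () => number;
  private readonly coreCount: number;
  private readonly capacity: number;
  private lastCpu: CpuTimes;
  private lastTs: number;
  private lastPercent = 0;

  constructor(
    readCpu: () => CpuTimes,
    now: () => number,
    coreCount: number,
    capacity = 60,
  ) {
    this.readCpu = readCpu;
    this.now = now;
    this.coreCount = Math.max(1, coreCount);
    this.capacity = capacity;
    this.lastCpu = readCpu();
    this.lastTs = now();
  }

  record(mem: MemSnapshot): RuntimeSample {
    const ts = this.now();
    const elapsedMs = ts - this.lastTs;
    if (elapsedMs > 0) {
      const cpu = this.readCpu();
      const usedMs =
        (cpu.user - this.lastCpu.user + (cpu.system - this.lastCpu.system)) / 1000;
      const raw = (usedMs / (elapsedMs * this.coreCount)) * 100;
      // Clamp to 0..100: deltas can be negative or exceed wall-clock capacity.
      this.lastPercent = Math.min(100, Math.max(0, raw));
      this.lastCpu = cpu;
      this.lastTs = ts;
    }
    // On a same-tick call the CPU baseline is left untouched, so the
    // unconsumed delta rolls into the next sample; memory is always fresh.
    const sample: RuntimeSample = {
      ts,
      rss: mem.rss,
      heapUsed: mem.heapUsed,
      heapTotal: mem.heapTotal,
      external: mem.external,
      cpuPercent: this.lastPercent,
    };
    this.samples.push(sample);
    if (this.samples.length > this.capacity) {
      this.samples.shift(); // evict the oldest entry
    }
    return { ...sample };
  }

  // Returns copies so callers cannot mutate ring state.
  getAll(): RuntimeSample[] {
    return this.samples.map((s) => ({ ...s }));
  }
}
```

Injecting the clock and CPU reader keeps the sketch testable without mocking `process` or `Date`; in a real monitor they would presumably be `process.cpuUsage` and `Date.now`.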
Reviewed changes
Copilot reviewed 11 out of 12 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| packages/core/src/telemetry/trace-context.ts | Import formatting only. |
| packages/core/src/telemetry/trace-context.test.ts | Test formatting only. |
| packages/core/src/skills/bundled/qc-helper/SKILL.md | Markdown table spacing tweak. |
| packages/core/src/services/runtimeSampleRing.test.ts | Adds unit tests for the new runtime sampling ring. |
| packages/core/src/services/memoryPressureMonitor.ts | Adds runtime sampling ring + optional OTel reporting; passes samples into diagnostics dumps. |
| packages/core/src/services/memoryPressureMonitor.test.ts | Updates mocks and adds coverage for telemetry gating + dump sample passing. |
| packages/core/src/services/memoryDiagnosticsDumper.ts | Extends dump payload to include recentSamples. |
| packages/core/src/services/memoryDiagnosticsDumper.test.ts | Adds test validating recentSamples appears in both dump phases. |
| packages/core/src/config/config.test.ts | Formatting-only dynamic import layout. |
| packages/cli/src/ui/hooks/useGeminiStream.ts | Formatting-only (line breaks + dependency array). |
| packages/cli/src/commands/extensions/utils.ts | Formatting-only (multi-line if). |
| packages/cli/src/commands/extensions/consent.ts | Formatting-only (multi-line if). |
Runtime verification report (local, Linux, Node v22.22.2, head b689fef)
Built this branch in an isolated worktree and runtime-tested it end-to-end: a real interactive TUI session driven via tmux against a local mock OpenAI server that returns streamed
Verified working ✅
1. Ring records on every check; hard-pressure dump carries
3 model-driven tool calls → 3 checks. The first check saw heapUsed ≈104 MB against the ~150 MB heap cap →
{ "trigger": "hard", "dumpNumber": 1, "collectionComplete": true,
  "recentSamples": [{ "ts": 1781105802347, "rss": 219328512, "heapUsed": 109437688,
  "heapTotal": 118632448, "external": 21384918, "cpuPercent": 0.56 }] }
2. Ring accumulates across checks/turns; timestamps strictly monotonic
A second prompt after the 30 s dump cooldown produced dump #2 containing all 4 samples accumulated so far (3 from turn 1 + 1 from turn 2). Values are believable: CPU ~19 % during back-to-back tool rounds, 0.02 % after a long idle gap (delta-window semantics):
Dump cooldown and per-session numbering behaved as designed (checks 2–3 of turn 1 produced no extra dump; dumps are capped per session).
3. OTel reporting when telemetry is enabled
Same scenario +
4. Telemetry off → ring still records. The dumps in (1)/(2) were produced with telemetry off, confirming the always-on ring / gated-OTel split from the runtime side.
5. Default-config control run — default thresholds, unconstrained heap, telemetry off: 3 tool calls, no dump produced, no visible behavior change. Silent for normal users.
6. New unit tests pass locally:
❌ Merge blocker: this branch breaks 5 test suites in CI (all 3 platforms)
The red Test workflow on this PR is introduced by this branch, not pre-existing. Root cause is the new module-level constant:
// packages/core/src/services/memoryPressureMonitor.ts:37
const CPU_CORE_COUNT = os.cpus().length || 1;
It executes
Counterfactual verified locally: these suites fail on this branch and pass on current main (e.g. f665de1 already fixed this for
let cpuCoreCount: number | undefined;
function getCpuCoreCount(): number {
  if (cpuCoreCount === undefined) {
    try {
      cpuCoreCount = os.cpus().length || 1;
    } catch {
      cpuCoreCount = 1;
    }
  }
  return cpuCoreCount;
}
…and call
Minor, non-blocking
Verdict
Feature behavior is solid end-to-end on a real build: per-check sampling, ring accumulation, dump enrichment, OTel gating, and default-config silence all check out. Recommend holding the merge until the module-level
Repro setup
|
…mpling
Move the CPU core count off the module-level `os.cpus()` constant into a
lazy, memoized getCpuCoreCount(). Evaluating os.cpus() at import time broke
every test suite that vi.mock('node:os') without a cpus export, since this
module is transitively imported by config.ts — turning CI red on all three
platforms.
While there, harden RuntimeSampleRing.record():
- prefer os.availableParallelism() (honors cgroup quotas) over os.cpus().length
- clamp cpuPercent to >= 0 for non-monotonic process.cpuUsage() in containers/VMs
- on a same-ms tick, capture the fresh memory snapshot (only cpuPercent stays
stale) and still push the sample so the ring is never empty on the first call
Add tests covering fresh-memory-on-same-tick and first-call-same-ms recording.
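The lazy, memoized lookup described in this commit message could look roughly like the sketch below. It is an illustration under assumptions, not the committed code: `OsLike` and `makeCoreCounter` are hypothetical names, and the `os` module is passed in as a parameter so the example runs without Node typings. `os.availableParallelism()` is a real Node.js API, but it is absent on older runtimes, hence the feature check.

```typescript
// Hypothetical sketch: resolve the core count on first use, not at import time.
interface OsLike {
  availableParallelism?: () => number;
  cpus: () => unknown[];
}

function makeCoreCounter(os: OsLike): () => number {
  let cached: number | undefined;
  return () => {
    if (cached !== undefined) {
      return cached;
    }
    let count = 1;
    try {
      // Prefer availableParallelism() when the runtime provides it.
      count =
        typeof os.availableParallelism === "function"
          ? os.availableParallelism()
          : os.cpus().length;
    } catch {
      // A mocked or restricted os module may throw; fall back to one core.
      count = 1;
    }
    cached = count > 0 ? count : 1;
    return cached;
  };
}
```

Because nothing touches `os` until the first call, a test suite that mocks `node:os` without a `cpus` export would no longer fail at import time; the worst case is a fallback to a single core.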
qwen-code-ci-bot
left a comment
Downgraded from Approve to Comment: CI still running. No high-confidence review findings — 5 low-confidence items listed in terminal for human review. — qwen3.7-max via Qwen Code /review
…ailure
When process.memoryUsage() throws, readMemoryUsage() already logs and returns undefined. Passing that undefined into getPressureLevel() made it call readMemoryUsage() again via its memSnapshot fallback, firing a second failing syscall and logging the same error twice per check cycle. Guard the call with `mem ? getPressureLevel(mem) : 'normal'`.
Add a performCheck()-level test asserting it does not throw, skips OTel reporting, and logs the read failure exactly once.
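The guard in this commit can be pictured with a small sketch: read memory once per check, log a failure once, and skip classification when there is no snapshot. The function name `checkPressureOnce` and its injected callbacks are hypothetical stand-ins for the monitor's internals; only the `mem ? … : 'normal'` guard comes from the commit message.

```typescript
// Hypothetical sketch of a single-read, single-log pressure check.
type PressureLevel = "normal" | "soft" | "hard" | "critical";

interface MemoryReading {
  rss: number;
  heapUsed: number;
}

function checkPressureOnce(
  readMemory: () => MemoryReading,
  classify: (mem: MemoryReading) => PressureLevel,
  logError: (message: string) => void,
): { level: PressureLevel; mem: MemoryReading | undefined } {
  let mem: MemoryReading | undefined;
  try {
    mem = readMemory(); // the only memory read in this check cycle
  } catch (err) {
    logError(`memory read failed: ${String(err)}`);
  }
  // Without a snapshot, do not let the classifier retry the failing read.
  const level: PressureLevel = mem ? classify(mem) : "normal";
  return { level, mem };
}
```

Returning the snapshot alongside the level lets the caller hand the same reading to the sample ring, so a failed read costs one syscall and one log line per cycle.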
qwen-code-ci-bot
left a comment
No new issues found. All previously identified concerns (lazy os.cpus(), CPU normalization, shared memory snapshot, stale memory fields, test coverage, JSDoc accuracy) have been properly resolved. The implementation is solid: bounded ring buffer, correct elapsed <= 0 edge case handling, properly gated OTel reporting, and comprehensive test coverage. LGTM! ✅ — qwen3.7-max via Qwen Code /review
|
Thanks for the PR! Template looks good ✓
On direction: this is well-aligned. The sampling ring directly addresses the gap identified in #4183 — single-point snapshots can't distinguish leaks from transient spikes, and a bounded trend buffer is the right answer. The OTel metric reporting leverages existing
On approach: scope feels right for what it delivers. The core addition (ring buffer + integration with the existing monitor and dumper) is ~700 lines of real code plus ~280 lines of tests. One minor note: several files in the diff (consent.ts, useGeminiStream.ts, config.test.ts, trace-context.test.ts, SKILL.md) appear to be formatting-only changes, and NOTICES.txt has dependency version bumps — these are unrelated to the feature and worth flagging during review so they don't get lost or cause merge conflicts. The design itself is clean: ring is bounded at 60 entries, always-on for local diagnostics, OTel gated behind
Moving on to code review. 🔍
— Qwen Code · qwen3.7-max
…reuse ring in dumper
Address 4 review findings from CI pipeline review:
1. Critical — wrap process.cpuUsage() in safeCpuUsage() try/catch so restricted containers that lack /proc/self/stat don't crash the entire memory-pressure system (constructor, record, reset).
2. Clamp cpuPercent to Math.min(100, ...) to handle cgroup CPU bursting where reported CPU-time transiently exceeds wall-clock × core count.
3. Add hasLoggedSamplingError flag: first sampling failure logs at error level (operators see "metrics enabled but broken"), subsequent failures demote to debug (avoids log spam).
4. Phase 1 of memoryDiagnosticsDumper reuses the last ring sample for memoryUsage instead of a redundant process.memoryUsage() syscall that could throw under extreme memory pressure.
All 4 behaviors have dedicated test coverage (5 new tests, 90 total).
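The log-once behavior from item 3 can be sketched as below. The class `SamplingErrorReporter` and its logger callbacks are hypothetical; only the flag name `hasLoggedSamplingError` and the error-then-debug policy come from the commit message. The `resetForNewSession()` method is an assumption that follows the review suggestion raised in this thread, not something the commit itself states.

```typescript
// Hypothetical sketch of "first failure at error level, the rest at debug".
type LogFn = (message: string) => void;

class SamplingErrorReporter {
  private hasLoggedSamplingError = false;
  private readonly logError: LogFn;
  private readonly logDebug: LogFn;

  constructor(logError: LogFn, logDebug: LogFn) {
    this.logError = logError;
    this.logDebug = logDebug;
  }

  report(err: unknown): void {
    const message = `runtime sampling failed: ${String(err)}`;
    if (this.hasLoggedSamplingError) {
      this.logDebug(message); // demoted to avoid log spam
      return;
    }
    this.hasLoggedSamplingError = true;
    this.logError(message); // first failure stays visible to operators
  }

  // Assumed hook: re-arm the error-level signal for a new session.
  resetForNewSession(): void {
    this.hasLoggedSamplingError = false;
  }
}
```

Without the reset, a failure in one session would keep every later session's first failure at debug level, which is the gap the review comments on this flag point out.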
this.cleanupGeneration++;
this.resetConsecutiveFailures();
this.diagnosticsDumper.resetForNewSession();
this.runtimeSamples.reset();
[Suggestion] hasLoggedSamplingError is not reset in resetForNewSession(), while every other session-scoped diagnostic field is. After a session reset, if sampling starts failing again in the new session, all failures will be silently logged at debug level — the operator won't see the first-failure-at-error-level signal that the field's comment describes.
Suggested change:
- this.runtimeSamples.reset();
+ this.runtimeSamples.reset();
+ this.hasLoggedSamplingError = false;
— qwen3.7-max via Qwen Code /review
@@ -66,6 +67,7 @@ export class MemoryDiagnosticsDumper {
   */
  async dump(
    trigger: 'hard' | 'critical',
[Suggestion] The class-level JSDoc (line 58) still describes Phase 1 as using process.memoryUsage(), but this PR changed the code to prefer the last ring sample. The doc should be updated to match, e.g.:
Phase 1 (synchronous): writes a minimal JSON with the most recent ring-buffer sample (falling back to process.memoryUsage() if no samples exist) and v8.getHeapStatistics()
— qwen3.7-max via Qwen Code /review
|
Verdict: PASS ✅
Two tmux sessions executed against merge commit
Scenario 1 — Diagnostics Dump with
Scenario 2 — OTel Metric Reporting:
All three aspects of the PR verified working as described.
— Qwen Code · qwen3.7-max
|
This PR went through four rounds of review and came out stronger each time. The initial critical issues — test mock shifts from
— Qwen Code · qwen3.7-max
qwen-code-ci-bot
left a comment
Approved via triage pipeline. All critical review issues resolved, tmux E2E testing passed (90/90 unit tests, diagnostics dumps with recentSamples[] verified, OTel metrics verified). One minor follow-up: hasLoggedSamplingError could be reset in resetForNewSession(). macOS CI failure is unrelated (flaky AskUserQuestionDialog test).
|
Qwen triage pipeline did not complete successfully. See workflow logs. |
qwen-code-ci-bot
left a comment
    const sample = localRing.record(mem);

    expect(sample.cpuPercent).toBe(100);
  });
[Suggestion] The upper clamp (Math.min(100, ...)) has a dedicated test right here, but the lower clamp (Math.max(0, ...) at memoryPressureMonitor.ts:~131) has no counterpart. A regression removing the lower bound would produce negative cpuPercent values in containers where process.cpuUsage() returns non-monotonically — the exact scenario the source comment documents.
it('clamps cpuPercent at 0 when CPU usage delta is negative', () => {
  let mockCpu = { user: 10000, system: 10000 };
  vi.spyOn(process, 'cpuUsage').mockImplementation(() => ({ ...mockCpu }));
  let mockTime = 1000;
  vi.spyOn(Date, 'now').mockImplementation(() => mockTime);
  const localRing = new RuntimeSampleRing();
  mockTime = 1100;
  // Second reading is *less* than the first — simulates non-monotonic accounting
  mockCpu = { user: 5000, system: 5000 };
  const mem = { rss: 100, heapUsed: 50, heapTotal: 80, external: 10, arrayBuffers: 0 };
  const sample = localRing.record(mem);
  expect(sample.cpuPercent).toBe(0);
});
— qwen3.7-max via Qwen Code /review
Runtime verification report #2 (local, Linux, Node v22.22.2, head dac6187)
Follow-up to my report on b689fef, which recommended holding the merge over the import-time
Previous merge blocker: resolved ✅
Verified working on dac6187 ✅
1. Hard-pressure dump carries
{ "trigger": "hard", "dumpNumber": 1, "collectionComplete": true,
  "recentSamples": [{ "ts": 1781229271664, "rss": 217542656, "heapUsed": 109679288,
  "heapTotal": 117956608, "external": 21363189, "cpuPercent": 0.34 }] }
2. Ring accumulates across turns; timestamps strictly monotonic; clamp holds — a second prompt after the 30 s dump cooldown produced dump #2 with all 4 samples accumulated (3 from turn 1 + 1 from turn 2). Programmatic check over every sample in every dump: timestamps strictly increasing, all
3. New dumper behavior observed directly (dac6187) — Phase 1 now derives
So the crash-path write really does avoid the redundant syscall, and a completed dump still ends up with full fidelity.
4. OTel path — same scenario +
5. Dump ↔ OTel cross-check is byte-exact — the OTel session's own dump #1 holds one sample:
6. Telemetry off → ring still records — all dumps in (1)–(3) were produced with telemetry off.
7. Per-session isolation — each new session started its ring fresh (its dump #1 has exactly 1 sample) and dump numbering restarted per session, as designed.
8. Default-config control run — default thresholds, unconstrained heap, telemetry off: two turns of tool calls produced no dump and no visible behavior change. Silent for normal users.
macOS CI note (not this PR)
The red
Minor, non-blocking
Verdict
PASS — recommend merge. The sole blocker from my previous report is fixed exactly as suggested (plus
Repro setup
|
this.cleanupGeneration++;
this.resetConsecutiveFailures();
this.diagnosticsDumper.resetForNewSession();
this.runtimeSamples.reset();
[Suggestion] hasLoggedSamplingError is not reset in resetForNewSession(). Every other session-scoped flag (consecutiveCleanupFailures, runtimeSamples, cleanupInProgress, etc.) is reset, but this one leaks across sessions. Once a transient sampling error occurs in session N, every subsequent session permanently downgrades the first sampling failure to debug level — making it invisible to operators who restarted specifically to diagnose the problem.
Suggested change:
- this.runtimeSamples.reset();
+ this.runtimeSamples.reset();
+ this.hasLoggedSamplingError = false;
— qwen3.7-max via Qwen Code /review
|
Qwen triage pipeline did not complete successfully. See workflow logs. |
…r labels
The /review skill leaves git worktrees under .qwen/tmp/ in the persistent
self-hosted workspace. A leftover worktree pins its branch, so a later
actions/checkout cannot clean the repo ('cannot delete branch ... used by
worktree'), falls back to a full recreate, and that fails with
'upload-pack: not our ref'. This was cross-PR contamination: a prior
review of #5002 broke product-decision checkout for #4868.
- add a 'Clean stale review worktrees' step before each self-hosted
checkout (product-decision / review / tmux-testing) that prunes
.qwen/tmp worktrees and stale qwen-review/* branches; no-op on a
fresh runner
- simplify self-hosted runs-on to ['self-hosted', 'ecs-qwen'] to match
the unified runner label set
What this PR does
This PR adds a bounded runtime sampling ring to the memory pressure monitor. On every pressure check it captures RSS, heap (used/total), external memory, and normalized CPU usage into a 60-entry ring. The ring always records locally so a crash diagnostics dump can carry the memory/CPU trend leading up to the event; the same samples are optionally reported as OpenTelemetry metrics (qwen-code.memory.usage, qwen-code.cpu.usage).
When the monitor triggers a hard/critical diagnostics dump, the recent samples are now embedded in the dump JSON under recentSamples[], so maintainers can see how memory and CPU moved in the seconds before the dump fired.
Why it's needed
Memory pressure dumps previously captured a single point-in-time snapshot. When triaging OOM-adjacent reports there was no way to tell whether memory was climbing steadily, spiking, or already saturated — the trend that actually distinguishes a leak from a transient. The ring gives that trend for free on the next dump, and feeds the same data to OTel for anyone running telemetry.
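As an illustration of how a maintainer might read that trend out of a dump, the sketch below fits a least-squares line through `heapUsed` over `ts` and reports the slope in bytes per second. This helper is not part of the PR: `heapGrowthBytesPerSecond` is a hypothetical name, and a linear fit is just one simple heuristic among many.

```typescript
// Hypothetical helper for reading a trend out of recentSamples[].
interface TrendSample {
  ts: number; // epoch milliseconds, as in the dump JSON
  heapUsed: number; // bytes
}

function heapGrowthBytesPerSecond(
  samples: readonly TrendSample[],
): number | undefined {
  if (samples.length < 2) {
    return undefined; // a single point has no trend
  }
  let sumTs = 0;
  let sumHeap = 0;
  for (const s of samples) {
    sumTs += s.ts;
    sumHeap += s.heapUsed;
  }
  const meanTs = sumTs / samples.length;
  const meanHeap = sumHeap / samples.length;

  let covariance = 0;
  let variance = 0;
  for (const s of samples) {
    const dt = s.ts - meanTs;
    covariance += dt * (s.heapUsed - meanHeap);
    variance += dt * dt;
  }
  if (variance === 0) {
    return undefined; // all samples share one timestamp
  }
  // Slope is in bytes per millisecond; convert to bytes per second.
  return (covariance / variance) * 1000;
}
```

A slope that stays clearly positive across the whole window hints at steady growth, while a slope near zero with one outlier sample looks more like a transient spike. Any threshold for what counts as a leak is left to the reader.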
Reviewer Test Plan
How to verify
OTel reporting is gated behind isPerformanceMonitoringActive(), so it costs nothing when telemetry is off; the ring itself always records. To exercise both paths:
1. npx vitest run packages/core/src/services/runtimeSampleRing.test.ts packages/core/src/services/memoryPressureMonitor.test.ts packages/core/src/services/memoryDiagnosticsDumper.test.ts.
2. … NODE_OPTIONS=--max-old-space-size=150) with lowered thresholds (QWEN_MEMORY_PRESSURE_SOFT/HARD/CRITICAL=0.30/0.33/0.95), drive a few tool calls to fire scheduleCheck(), then inspect the dump in ~/.qwen/projects/<dir>/diagnostics/ for a populated recentSamples[].
3. … --telemetry --telemetry-target=local --telemetry-outfile=<file> and confirm one memory.usage (rss + heap_used) and one cpu.usage datapoint per check, with cpu.usage within 0–100.
Evidence (Before & After)
Non–user-visible (diagnostics + telemetry only): N/A for screenshots. Locally I ran the unit suites in step 1 (all green) plus the constrained-heap dump check in step 2. A reviewer also posted an independent runtime verification report on this PR covering the OTel path and default-config control run — see #4868 (comment)
Tested on
macOS: full unit suite locally. Linux: covered by the reviewer's runtime report linked above. Windows: via CI only.
Environment (optional)
Local: npm run build + node dist/cli.js, plus the constrained-heap interactive run described above.
Risk & Scope
- … isPerformanceMonitoringActive().
- … os.availableParallelism() (honors cgroup quotas) and falls back to os.cpus().length, resolved lazily on first sample rather than at import time.
- cpuPercent is clamped to ≥0 to survive non-monotonic process.cpuUsage() in some containers/VMs.
- cpuPercent is averaged over the window since the previous sample, so after a long idle gap it dilutes toward 0 — fine for crash-trend dumps (each sample carries its ts), but worth noting for anyone building dashboards on qwen-code.cpu.usage.
Linked Issues
Closes #4183
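The delta-window dilution noted under Risk & Scope follows from how the percentage is normalized. The formula below is inferred from the descriptions in this PR (CPU time delta, divided by wall-clock window times core count), not copied from the code, and the figures are made-up illustrative numbers.

```latex
\[
\text{cpuPercent} \;=\; 100 \times
\frac{\Delta t_{\text{user}} + \Delta t_{\text{system}}}
     {\Delta t_{\text{wall}} \times N_{\text{cores}}}
\]

% Illustrative: 2 s of CPU time on a 4-core machine.
\[
\Delta t_{\text{wall}} = 10\,\text{s}:\quad
100 \times \frac{2}{10 \times 4} = 5\%
\qquad
\Delta t_{\text{wall}} = 600\,\text{s}:\quad
100 \times \frac{2}{600 \times 4} \approx 0.083\%
\]
```

The same 2 s burst reads as 5% over a short window and about 0.08% after a ten-minute idle gap, so a dashboard built on qwen-code.cpu.usage should interpret each value against the spacing between samples.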