test(perf): add daemon baseline harness (#4175 Wave 1 PR 1) #4205
First implementation PR of the Mode B v0.16 rollout (issue #4175 Wave 1 PR 1). Captures reference performance metrics for the `qwen serve` daemon so subsequent Mode B PRs (M2 MCP shared pool, M3 architecture refactor, M4 multi-client safety) can be measured against a known baseline rather than guessed-at numbers.

## What it captures

The new `integration-tests/cli/qwen-serve-baseline.test.ts` runs five describe blocks against a real `qwen serve` daemon:

- RSS scaling across 1 / 5 / 10 same-workspace `createOrAttachSession` calls (sampled via `ps -o rss=`).
- Same-workspace attach latency for the 2nd and 5th attach.
- MCP child amplification with two configured idle-mcp servers, measured via a two-level `pgrep -P` walk (daemon → ACP child → MCP grandchildren).
- SSE backpressure invariants, exercised at the unit layer by instantiating `EventBus` directly: queue overflow → synthetic `client_evicted` frame; replay across reconnect honors `lastEventId` up to ring size.
- Prompt p50 / p99 (skipped when `QWEN_TEST_MODEL_KEY` is unset, with an explicit reason recorded in the snapshot).

Each run writes a structured JSON snapshot to `<INTEGRATION_TEST_FILE_DIR>/perf-baseline.json` plus a Markdown summary, with `gitCommit` / platform / config preserved for cross-PR correlation.

## Honest documentation of current limits

The captured snapshot includes a `notes` field flagging that with the default `sessionScope: 'single'`, N successive `createOrAttachSession` calls return the same sessionId, so the RSS and MCP metrics here measure "N attaches to one shared session", not "N distinct sessions". Once Wave 2 PR 5 lands the per-request `sessionScope: 'thread'` override, the harness will be updated to optionally force distinct sessions and surface the P1 MCP N×M amplification before M2 fixes it.
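The two-level process walk used for the MCP child count (daemon → ACP child → MCP grandchildren) can be sketched roughly as follows. This is an illustration, not the harness's `countDescendants`: the `ChildLookup` type and `countGrandchildren` name are hypothetical, and the lookup function stands in for a `pgrep -P <pid>` call so the walk can be exercised without spawning processes.

```typescript
// Hypothetical stand-in for `pgrep -P <pid>`: returns the direct child PIDs.
type ChildLookup = (pid: number) => number[];

// Count processes exactly two levels below the daemon.
function countGrandchildren(daemonPid: number, childrenOf: ChildLookup): number {
  let total = 0;
  // Level 1: daemon -> ACP child process(es).
  for (const acpPid of childrenOf(daemonPid)) {
    // Level 2: ACP child -> MCP server processes.
    total += childrenOf(acpPid).length;
  }
  return total;
}

// Made-up process table: daemon 100 -> ACP child 200 -> MCP servers 301 and 302.
const processTable: Record<number, number[]> = { 100: [200], 200: [301, 302] };
const lookupChildren: ChildLookup = (pid) => processTable[pid] ?? [];

console.log(countGrandchildren(100, lookupChildren)); // 2
```

Injecting the lookup also makes the failure mode easy to see: if the lookup cannot observe the children (for example because they live in another PID namespace), the walk simply returns zero.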
## Reused / new

Reused: existing daemon spawn pattern from `qwen-serve-routes.test.ts` (port-0 + stdout regex + SIGTERM teardown), `pgrep -P` pattern from `qwen-serve-streaming.test.ts:144`, `EventBus` invariants from `eventBus.test.ts`, `DaemonClient` SDK, integration-tests `globalSetup.ts` env var conventions.

New (this PR):

- `integration-tests/cli/_daemon-harness.ts` (~280 lines): extracts the inline daemon spawn pattern into a shared helper and adds `getRssMB`, `startRssPolling`, `countDescendants`, `percentiles`, `consumeSseEvents`, `writeWorkspaceSettings`. Future serve test files can import these instead of inlining them.
- `integration-tests/fixtures/idle-mcp/{server.mjs,package.json}`: a minimal stdio MCP fixture that responds to `initialize` / `tools/list` and idles. Lets the harness count real MCP children via `pgrep` without depending on a network npm package in CI.
- `integration-tests/baselines/baseline-stage-1.json`: the first captured baseline at this commit. Future Mode B PRs can diff their run against this file; updating it is a deliberate one-line change in a follow-up PR.

## Reference patterns from opencode

JSDoc on the main test file documents the shape borrowed from `opencode/test/memory/abort-leak.test.ts` (forced-GC heap-growth), `opencode/src/cli/heap.ts` (RSS poll + threshold-triggered `writeHeapSnapshot`, useful for Wave 6 production tooling), and `opencode/src/util/cpu-watchdog.ts` (event-loop lag drift sampling). The harness here is daemon-level and multi-session, a shape neither opencode nor qwen-code had before.
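For readers unfamiliar with how a p50 / p99 figure is derived from a handful of latency samples, here is a minimal nearest-rank sketch. It is an assumption about one reasonable way to compute percentiles, not the actual `percentiles` helper in `_daemon-harness.ts`, whose method the PR text does not describe.

```typescript
// Nearest-rank percentile over latency samples (milliseconds).
// Returns null for an empty sample, mirroring the `null` latency fields
// a snapshot would carry when nothing was measured.
function percentile(samples: readonly number[], p: number): number | null {
  if (samples.length === 0) return null;
  if (p < 0 || p > 100) throw new RangeError(`percentile out of range: ${p}`);
  const sorted = [...samples].sort((a, b) => a - b);
  // Nearest rank: smallest value with at least p% of samples at or below it.
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1] ?? null;
}

const latenciesMs = [5, 1, 3, 2, 4];
console.log(percentile(latenciesMs, 50)); // 3
console.log(percentile(latenciesMs, 99)); // 5
```

With few iterations, p99 under nearest-rank collapses to the maximum sample, which is worth keeping in mind when reading a baseline captured with a small iteration count.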
## Engineering principles checklist

- [x] Independently mergeable (test-only; no production code touched)
- [x] Backward compatible (no removed routes / event fields / CLI behavior)
- [x] Default off (PR CI does not run integration tests; baseline runs in release CI / nightly / manual)
- [x] `qwen serve` Stage 1 routes / SDK behavior preserved (no production code changed)
- [x] Gradual migration (no client adapter migration in this PR)
- [x] Reversible (revert = delete files, no other side effects)
- [x] Tests-first (this IS the test PR; harness exercises real daemon end-to-end; Windows skipped via existing `process.platform === 'win32'` precedent)

## Test plan

- [x] `KEEP_OUTPUT=true TEST_CLI_PATH=$(pwd)/packages/cli/dist/index.js QWEN_BASELINE_SKIP_PROMPT_LATENCY=1 QWEN_BASELINE_RSS_SAMPLE_DURATION_MS=2000 npx vitest run integration-tests/cli/qwen-serve-baseline.test.ts`: 6 passed / 1 skipped (prompt latency requires model key)
- [x] `npx tsc --noEmit -p integration-tests/tsconfig.json`: only the pre-existing tsconfig `paths` glob warning remains, no new errors

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
📋 Review Summary

This PR delivers a well-structured performance baseline harness for the […]
Pull request overview
Adds a POSIX-only integration baseline harness for qwen serve performance and resource metrics to support future Mode B rollout comparisons.
Changes:
- Adds daemon spawn/RSS/process/SSE helper utilities.
- Adds baseline tests for RSS, attach latency, MCP child count, SSE backpressure, and optional prompt latency.
- Adds an idle MCP fixture and a captured Stage 1 baseline JSON snapshot.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `integration-tests/cli/_daemon-harness.ts` | Shared daemon test helpers for spawning, RSS polling, process counting, percentiles, SSE consumption, and workspace settings. |
| `integration-tests/cli/qwen-serve-baseline.test.ts` | New baseline integration suite covering daemon performance/resource metrics. |
| `integration-tests/fixtures/idle-mcp/server.mjs` | Minimal long-running MCP stdio fixture for child-process counting. |
| `integration-tests/fixtures/idle-mcp/package.json` | Package metadata/bin entry for the idle MCP fixture. |
| `integration-tests/baselines/baseline-stage-1.json` | Committed reference baseline snapshot for Stage 1 daemon behavior. |
Code Coverage Summary

For detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run.
Fixes eslint no-undef error: 'process' is not defined. Replace process.exit(0) with exit(0) from node:process import.
wenshao left a comment

`idle-mcp/server.mjs` uses `process.exit()` without importing `process` from `node:process`. Autofix pushed in e045b2d.
Design note: The committed baseline (baseline-stage-1.json) is never loaded or compared against by the test harness. Only catastrophic threshold assertions are checked. If the intent is automated regression detection, consider adding a comparison step that diffs the new snapshot against the committed baseline.
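A comparison step along the lines this design note suggests could look like the sketch below. It is a hypothetical helper, not part of the PR: the `findRegressions` name, the flat metric map, and the relative-tolerance rule are all assumptions. The baseline numbers are taken from the committed Stage 1 snapshot; the "current" numbers are made up for illustration.

```typescript
interface Regression {
  metric: string;
  baseline: number;
  current: number;
}

// Flag every metric that grew by more than `tolerance` (0.2 = 20%) relative
// to the committed baseline. Metrics missing from the current run are skipped.
// Note: a baseline value of 0 makes any positive current value a regression.
function findRegressions(
  baseline: Record<string, number>,
  current: Record<string, number>,
  tolerance: number,
): Regression[] {
  const regressions: Regression[] = [];
  for (const [metric, base] of Object.entries(baseline)) {
    const cur = current[metric];
    if (cur === undefined) continue;
    if (cur > base * (1 + tolerance)) {
      regressions.push({ metric, baseline: base, current: cur });
    }
  }
  return regressions;
}

// Baseline values from baseline-stage-1.json; current values are invented.
const committed = { session1MB: 223.5, session10MB: 223.9, attachSession2Ms: 3 };
const latestRun = { session1MB: 225.0, session10MB: 300.0, attachSession2Ms: 3 };

console.log(findRegressions(committed, latestRun, 0.2).map((r) => r.metric)); // [ 'session10MB' ]
```

A relative tolerance is only one option; noisy metrics such as millisecond-scale attach latency would probably need an absolute floor as well.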

    '--hostname',
    '127.0.0.1',
    '--workspace',
    workspaceCwd,

[Suggestion] The daemon token is passed as a CLI argument (--token in spawn args), visible via ps aux to all users. The default token is a harmless test value, but the spawnDaemon interface accepts an arbitrary string — any real token passed by a future caller would leak into the process table. Consider passing credentials through environment variables instead.
— DeepSeek/deepseek-v4-pro via Qwen Code /review
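The shape of this suggestion can be sketched as below, with one large caveat: whether `qwen serve` can read its token from the environment is not stated anywhere in this thread, and the variable name `QWEN_SERVE_TOKEN` is purely hypothetical. The sketch only shows how a spawn helper could keep the secret out of the argument vector (and therefore out of `ps` output); `SpawnPlan` and `planDaemonSpawn` are illustrative names.

```typescript
interface SpawnPlan {
  args: string[];
  env: Record<string, string>;
}

// Build the argv and env for the daemon without placing the token in argv.
// Only the flags visible in the diff context are listed; other args are omitted.
function planDaemonSpawn(
  workspaceCwd: string,
  token: string,
  baseEnv: Record<string, string>,
): SpawnPlan {
  return {
    args: ['--hostname', '127.0.0.1', '--workspace', workspaceCwd],
    // QWEN_SERVE_TOKEN is a hypothetical name, not a documented variable.
    env: { ...baseEnv, QWEN_SERVE_TOKEN: token },
  };
}

const plan = planDaemonSpawn('/tmp/workspace', 'test-token', { PATH: '/usr/bin' });
console.log(plan.args.includes('test-token')); // false
```

The environment of a process is typically readable only by its owner (and root), whereas command-line arguments are visible to every user on most POSIX systems, which is the leak the reviewer describes.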

        stdio: ['ignore', 'pipe', 'pipe'],
        env: { ...process.env, ...opts.env },
      });

[Suggestion] { ...process.env, ...opts.env } propagates the entire parent environment — including QWEN_TEST_MODEL_KEY, CI secrets, etc. — to every spawned daemon. Tests that don't need the model key (RSS scaling, attach latency, MCP amplification) still receive it. If the daemon or an MCP child logs environment variables during a crash, real credentials could appear in CI logs. Consider filtering to only the env vars each test needs.
— DeepSeek/deepseek-v4-pro via Qwen Code /review
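One way to act on this suggestion is an allowlist filter applied before the spawn, sketched below. `pickEnv` and the example allowlist are hypothetical; which variables the daemon actually needs is not listed in the thread, so the allowlist here is illustrative only.

```typescript
type Env = Record<string, string | undefined>;

// Copy only allowlisted keys from the parent environment, then layer
// per-test overrides on top. Anything not listed is never forwarded.
function pickEnv(parent: Env, allow: readonly string[], overrides: Env = {}): Record<string, string> {
  const result: Record<string, string> = {};
  for (const key of allow) {
    const value = parent[key];
    if (value !== undefined) result[key] = value;
  }
  for (const [key, value] of Object.entries(overrides)) {
    if (value !== undefined) result[key] = value;
  }
  return result;
}

// Made-up parent environment containing a secret the RSS test does not need.
const parentEnv: Env = { PATH: '/usr/bin', HOME: '/home/ci', QWEN_TEST_MODEL_KEY: 'sk-secret' };
const childEnv = pickEnv(parentEnv, ['PATH', 'HOME'], { KEEP_OUTPUT: 'true' });

console.log(Object.keys(childEnv).sort()); // [ 'HOME', 'KEEP_OUTPUT', 'PATH' ]
```

A prompt-latency test would add the model credential to its own allowlist, so only the tests that need the key ever hand it to a child process.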

      elapsedMs: number;
    }

    export async function consumeSseEvents(

[Suggestion] consumeSseEvents is exported but never imported or called by any test file. The SSE backpressure tests use EventBus directly (unit-level), bypassing the HTTP integration path. Either remove the unused export, or add a test that exercises the daemon's /api/sse endpoint through this helper.
— DeepSeek/deepseek-v4-pro via Qwen Code /review

    describe('prompt latency', () => {
      it.skipIf(SKIP_PROMPT_LATENCY)(
        `p50 / p99 over ${PROMPT_ITERATIONS} prompts`,

[Suggestion] The SSE backpressure snapshot records hardcoded config literals (ringSize: 4000, heartbeatIntervalMs: 15000) rather than values observed from the running EventBus. If someone changes the EventBus defaults, the snapshot silently continues reporting the old values. Since EventBus fields are private readonly, either expose read-only accessors or annotate the snapshot to clarify these are design-time constants.
— DeepSeek/deepseek-v4-pro via Qwen Code /review
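The "expose read-only accessors" option can be illustrated with a toy ring-buffer bus. This is not the real `EventBus` from `packages/cli/src/serve`; `MiniEventBus`, its field names, and `snapshotConfig` are hypothetical. The defaults 4000 and 256 are the `ringSize` and `maxQueuedDefault` values recorded in the committed snapshot.

```typescript
interface BusEvent {
  id: number;
  type: string;
}

class MiniEventBus {
  private readonly ring: BusEvent[] = [];
  private readonly ringCapacity: number;
  private readonly queueLimit: number;
  private nextId = 1;

  constructor(ringCapacity = 4000, queueLimit = 256) {
    this.ringCapacity = ringCapacity;
    this.queueLimit = queueLimit;
  }

  // Read-only accessors: the effective configuration is observable but not mutable.
  get ringSize(): number {
    return this.ringCapacity;
  }
  get maxQueued(): number {
    return this.queueLimit;
  }

  publish(type: string): BusEvent {
    const event: BusEvent = { id: this.nextId++, type };
    this.ring.push(event);
    if (this.ring.length > this.ringCapacity) this.ring.shift(); // drop the oldest
    return event;
  }

  // Replay for a reconnecting client: events newer than lastEventId still in the ring.
  replaySince(lastEventId: number): BusEvent[] {
    return this.ring.filter((event) => event.id > lastEventId);
  }
}

// Build the snapshot section from observed values instead of literals.
function snapshotConfig(bus: MiniEventBus): { ringSize: number; maxQueuedDefault: number } {
  return { ringSize: bus.ringSize, maxQueuedDefault: bus.maxQueued };
}

const bus = new MiniEventBus();
console.log(snapshotConfig(bus)); // { ringSize: 4000, maxQueuedDefault: 256 }
```

With accessors in place, a change to the bus defaults shows up in the next snapshot automatically rather than silently diverging from hardcoded numbers.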
wenshao left a comment

Previous review (commit f70ad24) found 2 Critical + 8 Suggestion issues. The process.exit lint fix in this commit resolves the Critical on server.mjs:62. Two Critical issues remain unfixed: temp workspace leak on spawnDaemon failure (line 228) and firstByteP measuring any SSE event type instead of model response (line 375). Re-posting these on the new commit since the prior inline comments are stale.
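The first-byte issue can be made concrete with a small sketch: instead of timing the first SSE event of any kind, time the first event whose type indicates model output. The event type names used here (`heartbeat`, `agent_message_chunk`) and the `firstModelByteMs` helper are hypothetical; the thread does not list the daemon's actual event types or show how the fix was implemented.

```typescript
interface TimedEvent {
  type: string;
  atMs: number; // timestamp at which the event was received
}

// Latency from prompt start to the first event that carries model output.
// Returns null when no such event arrived.
function firstModelByteMs(
  events: readonly TimedEvent[],
  startMs: number,
  modelOutputTypes: ReadonlySet<string>,
): number | null {
  for (const event of events) {
    if (modelOutputTypes.has(event.type)) return event.atMs - startMs;
  }
  return null;
}

// Made-up stream: a heartbeat arrives long before the first model chunk.
const received: TimedEvent[] = [
  { type: 'heartbeat', atMs: 1005 },
  { type: 'agent_message_chunk', atMs: 1420 },
];
const modelTypes = new Set(['agent_message_chunk']);

console.log(firstModelByteMs(received, 1000, modelTypes)); // 420
```

Timing the first event of any type would have reported 5 ms for this stream, which is the kind of misleadingly low figure the review comment warns about.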

      elapsedMs: number;
    }

    export async function consumeSseEvents(

[Suggestion] consumeSseEvents is exported but never imported or called by any test file. Either remove the unused export or add a test that exercises it.
— glm-5.1 via Qwen Code /review
Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
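## Local real-scenario test report

Head:

### CI status change (the earlier stale speculative-merge problem is resolved)

After the rebase the main CI path turned green, which confirms the earlier diagnosis: "the failure was not a problem in the PR code; it was a stale speculative-merge ref spanning the window between PR #4082 and the PR #4211 fix".

### Real-scenario test = the new integration tests themselves

The PR body states "No production code touched. All changes live under integration-tests/", so TUI behavior is unchanged; the real test is simply running:

    QWEN_SANDBOX=false TEST_CLI_PATH=$DIST/cli.js \
    OPENAI_API_KEY=… OPENAI_BASE_URL=… OPENAI_MODEL=deepseek-chat \
    QWEN_BASELINE_ENABLE_PROMPT_LATENCY=1 \
    npx vitest run --root ./integration-tests qwen-serve-baseline

Result:

### Additional tmux daemon spawn smoke test

Launched directly: the daemon started healthy, as expected.

### Re-check of the 4 Critical issues

### Local tests & type check

LGTM ✅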
wenshao left a comment

Detailed test report posted (see the issue comment). After the rebase, the main CI path is green (Lint + ubuntu Test + CodeQL) and all 4 Critical issues are fixed. Locally, the baseline integration tests passed with a real daemon, real MCP servers, and 20 iterations against a real LLM (70.7 s); tsc reports 0 errors. LGTM ✅
Heads-up: the baseline harness added here broke the Integration Tests (Docker) release job; it was the sole failure in the 2026-05-17 scheduled release (run 25976629117). The […] Targeted fix in #4234: extends the existing Windows […]
@tanzhenxin Sorry, I missed this sandbox PID-namespace scenario in #4205, which left the Docker release job blocked by the baseline harness. I have reviewed and approved #4234. The targeted skip is an accurate fix: host-side […]
The qwen-serve-baseline harness walks the daemon process tree using host-side `pgrep -P`. Under the Docker/Podman sandbox the daemon's `qwen --acp` child and its MCP grandchildren run inside the container's PID namespace, which host `pgrep` cannot observe, so the MCP-grandchild descendant walk always sees zero and times out. The test passes in the no-sandbox job but failed every retry in the Docker release job. Extend the existing Windows `SKIP` gate to also skip when sandbox is enabled, matching the precedent in acp-integration.test.ts and cron-tools.test.ts. Refs #4205
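The extended skip gate described in this commit message might reduce to a predicate like the one below. This is a guess at the shape, not the code from #4234: it assumes sandbox mode is signaled through the `QWEN_SANDBOX` environment variable (which appears in the local test command earlier in this thread) and that unset, empty, or `false` means disabled. The `shouldSkipBaseline` name is hypothetical.

```typescript
// Decide whether the baseline suite should be skipped.
// Assumption: any QWEN_SANDBOX value other than unset, '' or 'false' enables the sandbox.
function shouldSkipBaseline(platform: string, sandboxSetting: string | undefined): boolean {
  const sandboxEnabled =
    sandboxSetting !== undefined && sandboxSetting !== '' && sandboxSetting !== 'false';
  // Windows: no `pgrep` / `ps`. Sandbox: host `pgrep` cannot see the container's PID namespace.
  return platform === 'win32' || sandboxEnabled;
}

console.log(shouldSkipBaseline('linux', undefined)); // false
console.log(shouldSkipBaseline('linux', 'docker')); // true
console.log(shouldSkipBaseline('win32', 'false')); // true
```

In a test file the result would typically feed a conditional such as Vitest's `describe.skipIf(...)`, so the suite is reported as skipped rather than timing out on an unobservable process tree.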
## Post-merge update (2026-05-17)

Merged on 2026-05-16 as 0788ed7. The final PR branch included the baseline harness plus review follow-ups:

- `e045b2dac`: imported `exit` from `node:process` in the idle MCP fixture.
- `28becf164`: removed the stale baseline lint disable.
- `acb6efeff`: hardened daemon baseline cleanup, RSS sampling validity, MCP fixture validation, prompt first-byte measurement, and `pgrep` handling.
- `998e3cc24`: synced latest `main` before merge, including #4209 ("feat(serve): per-request sessionScope override on POST /session (#4175 Wave 2 PR 5)", `session_scope_override`) and the #4211 test fix ("fix(cli): pass rewind selector test props").

Follow-up: #4214 tracks post-#4209 alignment for stale integration-test/user-doc expectations around `session_scope_override`.

## Summary

First implementation PR of the Mode B v0.16 rollout (issue #4175 Wave 1 PR 1, ack'd by @wenshao on 2026-05-16). Captures reference performance metrics for `qwen serve` so subsequent Mode B PRs (M2 MCP shared pool, M3 architecture refactor, M4 multi-client safety) can be measured against a known baseline rather than guessed-at numbers.

No production code touched. All changes live under `integration-tests/`.

## What this PR adds

- `integration-tests/cli/qwen-serve-baseline.test.ts`
- `integration-tests/cli/_daemon-harness.ts`: `spawnDaemon` extracts the inline daemon spawn pattern from `qwen-serve-routes.test.ts:52-103`; plus `getRssMB`, `startRssPolling`, `countDescendants`, `percentiles`, `consumeSseEvents`, `writeWorkspaceSettings`
- `integration-tests/fixtures/idle-mcp/{server.mjs,package.json}`: `initialize` + `tools/list` with one no-op tool, then idles. Used to count real MCP fixture processes through the daemon/ACP process tree without pulling a network npm package into CI
- `integration-tests/baselines/baseline-stage-1.json`

## Honest documentation of current limits

The captured snapshot's `notes` field flags that with the default `sessionScope: 'single'`, N successive `createOrAttachSession` calls return the same sessionId, so the RSS / MCP metrics measure "N attaches to one shared session", not "N distinct sessions". Wave 2 PR 5 (#4209) has now landed, so the next baseline follow-up can pass `sessionScope: 'thread'` to measure distinct-session cost and surface the P1 MCP N×M amplification before M2 fixes it. This PR intentionally kept the original single-scope baseline.

## Reused vs new
Reused:

- `qwen-serve-routes.test.ts:52-103` (daemon spawn pattern)
- `pgrep -P` pattern from `qwen-serve-streaming.test.ts:144`
- `EventBus` direct instantiation pattern from `packages/cli/src/serve/eventBus.test.ts`
- `DaemonClient` SDK
- `globalSetup.ts` env var conventions (`TEST_CLI_PATH`, `INTEGRATION_TEST_FILE_DIR`, `KEEP_OUTPUT`)

New (no equivalent existed in qwen-code or opencode):

JSDoc on `_daemon-harness.ts` references the opencode patterns the design borrows from (per the #4175 description's external implementation references):

- `test/memory/abort-leak.test.ts`: forced-GC heap-growth shape
- `src/cli/heap.ts`: periodic RSS poll + threshold-triggered `writeHeapSnapshot` (useful for Wave 6 production tooling)
- `src/util/cpu-watchdog.ts`: event-loop lag drift sampling

## Captured baseline (this PR's reference snapshot)

    {
      "version": 1,
      "capturedAt": "2026-05-16T08:36:52.441Z",
      "gitCommit": "df32345d0553560cf342c1b4e6ad44df5646111b",
      "platform": { "os": "darwin", "arch": "arm64", "nodeVersion": "v24.12.0" },
      "rssScaling": {
        "session1MB": 223.5,
        "session5MB": 224.2,
        "session10MB": 223.9,
        "sampleCount": 60,
        "droppedSampleCount": 0,
        "growthPerSessionMB": 0
      },
      "attachLatency": { "session2Ms": 3, "session5Ms": 1, "thresholdMs": 1000 },
      "mcpAmplification": {
        "mcpServersConfigured": 2,
        "childrenAt1Session": 4,
        "childrenAt3Sessions": 4,
        "childrenAt5Sessions": 4,
        "linearAmplification": false
      },
      "sseBackpressure": {
        "ringSize": 4000,
        "maxQueuedDefault": 256,
        "evictionAtOverflow": true,
        "replayUpToRing": true,
        "heartbeatIntervalMs": 15000
      },
      "promptLatency": {
        "iterations": 0,
        "firstByteMs": null,
        "totalMs": null,
        "skipped": true,
        "skipReason": "No recognized model credential env var is set; prompt latency requires real model access. Set QWEN_BASELINE_ENABLE_PROMPT_LATENCY=1 to force-run with non-env auth."
      }
    }

The flat RSS curve and constant MCP fixture process count are expected for default `single`-scope semantics; this is the reference data point before the future `sessionScope: 'thread'` follow-up measurement.

## Engineering principles checklist (per #4175 §PR-level acceptance checklist)
## Test plan

- `node scripts/lint.js --eslint`
- `npm run build`
- `npm run typecheck`
- `KEEP_OUTPUT=true QWEN_BASELINE_SKIP_PROMPT_LATENCY=1 QWEN_BASELINE_RSS_SAMPLE_DURATION_MS=500 npx vitest run integration-tests/cli/qwen-serve-baseline.test.ts`: prompt latency skipped without model credentials
- `process.platform === 'win32'` gate (matches `qwen-serve-streaming.test.ts:53` precedent)

## Configuration

- `QWEN_BASELINE_PROMPT_ITERATIONS`
- `QWEN_BASELINE_RSS_SAMPLE_INTERVAL_MS`
- `QWEN_BASELINE_RSS_SAMPLE_DURATION_MS`
- `QWEN_BASELINE_HEAVY`: `1`, increase iterations to 100 + sample longer
- `QWEN_BASELINE_SKIP_PROMPT_LATENCY`

## Stack context

This is sub-PR 1 of 25 in #4175 Wave 1. Adopted into the codeagents design docs at commit `d5dbbeb` per @wenshao's confirmation. The next PRs in Wave 1 (PR 2 capability registry / PR 3 DaemonSessionClient / PR 4 typed events) can run in parallel and don't depend on this one.

🤖 Generated with Qwen Code