feat(test): add daemon connection stress test + refactor perf harness#4862
Conversation
… (issue #4514 T3.4) Extract shared helpers from baseline/benchmark tests into dedicated modules and add a new mock-ACP connection stress test suite. Refactoring (PR1 scope): - _daemon-harness.ts: export gitHead(), makeTempWorkspace(), sleep(), ScenarioResult, lastSeenId tracking in consumeSseEvents - _daemon-benchmark-helpers.ts: extract /usr/bin/time wrappers (spawnDaemonWithTime, parseTimeOutput, measureProcessTreeRss, measureCliStartupWithProfiler) from benchmark test - _daemon-perf-report.ts: shared formatPercentiles, collectPlatformInfo, writeSnapshotArtifacts, resolveOutputDir - Slim baseline + benchmark tests to import from new modules New features (PR2 scope): - fixtures/mock-acp-child/agent.mjs: mock ACP agent using real AgentSideConnection SDK, env-controlled modes (echo/reject/ crash-on-prompt/hang) - mock-acp-typecheck.test.ts: compile-time Agent interface check - qwen-daemon-loadtest.test.ts: 5 scenarios gated by QWEN_LOADTEST_ENABLED=1 — rapid lifecycle, SSE slow-consumer eviction, Last-Event-ID reconnect, ACP crash recovery, burst concurrent sessions - vitest.loadtest.config.ts: isolated config with root anchoring 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
qwen-code-ci-bot
left a comment
There was a problem hiding this comment.
Hi @doudouOUC — thanks for the PR! The daemon loadtest work looks genuinely useful, but the PR body doesn't follow our pull request template. Please update it to include the required sections so reviewers can properly evaluate the change:
## What this PR does— currently## Summary, which is close but the template heading is required## Why it's needed— missing. What problem does the daemon stress test solve? Why now?## Reviewer Test Plan— currently## Test plan. Please rename and fill in### How to verify,### Evidence (Before & After)(N/A for test infra is fine), and the### Tested ontable## Risk & Scope— missing. Main risk, what's out of scope, breaking changes## Linked Issues—Issue: #4514 T3.4is referenced but doesn't use closing keywords (Closes #4514/Fixes #4514)
Please update the PR body to match the template and re-request review. Happy to take another look once it's updated.
中文说明
你好 @doudouOUC — 感谢提交 PR!daemon 压力测试的工作看起来很有价值,但 PR 正文没有遵循我们的 PR 模板。请更新为模板要求的各个部分,以便审查者能正确评估:
## What this PR does— 当前用的是## Summary,接近但需要改成模板标题## Why it's needed— 缺失。daemon 压力测试解决什么问题?为什么现在做?## Reviewer Test Plan— 当前用的是## Test plan。请改名并填写### How to verify、### Evidence (Before & After)(测试基础设施可以填 N/A)、### Tested on表格## Risk & Scope— 缺失。主要风险、不在范围内的内容、破坏性变更## Linked Issues— 引用了Issue: #4514 T3.4但没有使用关闭关键词(Closes #4514/Fixes #4514)
请按照模板更新 PR 正文后重新请求审查。更新后我很乐意再看看。
— Qwen Code · qwen3.7-max
There was a problem hiding this comment.
Pull request overview
This PR adds a gated daemon connection stress/load test suite (using a mock ACP child) and refactors the existing baseline/benchmark integration tests by extracting shared performance harness/reporting helpers into reusable modules.
Changes:
- Extract shared benchmark/perf reporting utilities into
_daemon-benchmark-helpers.tsand_daemon-perf-report.ts, and extend_daemon-harness.tswith reusable helpers. - Add a mock ACP child fixture (
agent.mjs) plus a type-shape compliance test for the mock agent interface. - Add a gated load test (
qwen-daemon-loadtest.test.ts) with an isolated Vitest config, and exclude it from the default integration test run.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| integration-tests/vitest.loadtest.config.ts | Adds an isolated Vitest config for running only the loadtest suite. |
| integration-tests/vitest.config.ts | Excludes the loadtest suite from the default integration test run. |
| integration-tests/fixtures/mock-acp-child/agent.mjs | Adds a mock ACP agent process used by stress/load tests. |
| integration-tests/cli/qwen-serve-baseline.test.ts | Refactors baseline test to use extracted helpers/reporting utilities. |
| integration-tests/cli/qwen-daemon-vs-cli-benchmark.test.ts | Refactors benchmark test to use extracted harness/reporting utilities. |
| integration-tests/cli/qwen-daemon-loadtest.test.ts | Adds gated daemon connection stress/load scenarios. |
| integration-tests/cli/mock-acp-typecheck.test.ts | Adds a compile-time shape check for the mock agent interface. |
| integration-tests/cli/_daemon-perf-report.ts | New shared report primitives (platform info, output dir, formatting, artifact writes). |
| integration-tests/cli/_daemon-harness.ts | Extends harness exports for reuse across baseline/benchmark/loadtest. |
| integration-tests/cli/_daemon-benchmark-helpers.ts | New benchmark-only helpers (notably /usr/bin/time wrappers) extracted from the benchmark test. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Copilot correctly noted that console.debug and console.dir also write to stdout in Node, which would corrupt the NDJSON pipe. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
Review feedback — round 1
Resolved 3/3 inline threads. |
… recovery - All 5 scenarios now use try/catch/finally so snapshot.status reflects actual test outcome - SSE eviction scenario asserts evicted === true (near-deterministic with maxQueued=16 + 80+ events) - Crash recovery verifies end-to-end by creating a fresh session post-crash, not just health check 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
Review feedback — round 2 (wenshao)
Resolved 3/3 threads. |
✅ Local verification — daemon connection stress test + perf-harness refactor (Linux)Verified on the true post-merge state (the PR is rebased on current Built the full daemon chain for the harness ( 1. New loadtest suite — 5/5 scenarios PASS ✅
2. Typecheck (
|
🔍 Local verification report — real build + daemon integration/load tests
Verdict: ✅ Mergeable as test-infra. The refactoring is faithful and the harness mostly works; zero production code is touched and the whole load-test suite is excluded from default CI. One new scenario (SSE eviction) needs hardening to be reliable cross-platform — details + a suggested fix below. Environment
Results
Loadtest scenario breakdown: Two failures that are environmental, not this PR (both proven)
The one real finding — SSE eviction scenario is environment-sensitive
Likely cause: the flood fires prompts sequentially awaited: for (let i = 0; i < promptCount; i++) {
await d.client.prompt(session.sessionId, { prompt: [{ type: 'text', text: `flood-${i}` }] });
}Awaiting each prompt throttles event production to roughly the slow consumer's drain rate ( Suggested fix: make the flood genuinely concurrent so it outpaces the consumer regardless of host timing, e.g. fire without awaiting and then settle: await Promise.allSettled(
Array.from({ length: promptCount }, (_, i) =>
d.client.prompt(session.sessionId, { prompt: [{ type: 'text', text: `flood-${i}` }] })),
);This isn't a CI blocker (the suite is gated behind Refactoring soundnessThe ~500-line extraction into Scope: build + refactor regression + live load-test on Linux/Node-22 against the real merge result. No production code is modified by this PR. Not covered: macOS/Node-24 (author's platform) where the SSE-eviction scenario reportedly passes. |
chiga0
left a comment
There was a problem hiding this comment.
Code Review Overview (AI Generated)
PR: #4862 — feat(test): add daemon connection stress test + refactor perf harness
Type: New Feature + Refactor (test infrastructure)
Change size: +1268/-505 across 10 files
HEAD: e1b0df4e03
Findings Summary
- Critical/Major: 0 items
- Minor: 0 items
- Nit: 0 items
Assessment
Refactoring quality: All 3 extracted modules (_daemon-harness.ts additions, _daemon-benchmark-helpers.ts, _daemon-perf-report.ts) preserve exact behavior — function signatures, regex patterns, spawn args, and logic flow are identical to the originals. makeTempWorkspace cleanly generalizes the prefix parameter. ~500 lines of duplication eliminated without any behavioral drift.
New stress tests: The 5 scenarios are well-structured with a clean withDaemon helper pattern (try/catch/finally + proper disposal). All scenarios correctly track snapshot status through assertion lifecycle (commit 3 addressed this well). SSE eviction assertion (evicted === true) is near-deterministic with the configured maxQueued=16 + 80 events. Crash recovery validates end-to-end recovery (new session creation, not just health check). The mock ACP agent using real AgentSideConnection SDK provides high-fidelity testing without real model latency.
Test gating: QWEN_LOADTEST_ENABLED=1 + platform check + sandbox check properly excludes the load test from default CI. Dedicated vitest.loadtest.config.ts with 10-min timeout and no retries is appropriate for stress testing.
wenshao follow-up Suggestions: Noted 7 additional Suggestions in review 4450487353 (post-APPROVE). All valid improvement opportunities (eviction detection decoupling, ScenarioResult.error population, reject/hang mode coverage, typecheck Pick → Required<Pick>, loadtest config typecheck include). None blocking.
Key Observations
Clean test infrastructure PR. The refactoring is a textbook extraction — no hidden behavior changes. The new stress tests follow established patterns from the existing baseline/benchmark suites and add genuine value for catching daemon HTTP/SSE lifecycle bugs. Approved.
This review was generated by QoderWork AI
Already have 2 approved, 3ks.
Review feedback — round 3 (wenshao R2)
Fixed: 2 | Pushed back: 5 | Commit: 44dcc98 |
What this PR does
Extract shared helpers from baseline/benchmark daemon tests into dedicated modules and add a new mock-ACP connection stress test suite gated by
QWEN_LOADTEST_ENABLED=1.Refactoring:
_daemon-benchmark-helpers.ts: extract/usr/bin/timewrappers from benchmark test_daemon-perf-report.ts: sharedformatPercentiles,collectPlatformInfo,writeSnapshotArtifacts,resolveOutputDir_daemon-harness.ts: exportgitHead(),makeTempWorkspace(),sleep(),ScenarioResult, addlastSeenIdtoConsumeSseResultNew features:
fixtures/mock-acp-child/agent.mjs: mock ACP agent using realAgentSideConnectionSDK with env-controlled error injection modes (echo/reject/crash-on-prompt/hang)mock-acp-typecheck.test.ts: compile-time Agent interface compliance checkqwen-daemon-loadtest.test.ts: 5 scenarios — rapid lifecycle, SSE slow-consumer eviction, Last-Event-ID reconnect, ACP crash recovery, burst concurrent sessionsvitest.loadtest.config.ts: isolated config withroot: __dirnameanchoringWhy it's needed
Issue #4514 T3.4 — the daemon's HTTP/SSE surface lacks automated stress testing. Without a mock-ACP-based harness, connection lifecycle bugs, SSE backpressure regressions, and crash recovery issues can only be caught by manual testing or real-model integration tests (which are slow and flaky). The refactoring reduces ~500 lines of duplication across baseline/benchmark tests, making the shared infrastructure maintainable as the test suite grows.
Reviewer Test Plan
How to verify
Evidence (Before & After)
N/A — test infrastructure only, no user-facing behavior change.
Tested on
Risk & Scope
QWEN_LOADTEST_ENABLED=1and excluded from default CI. Refactoring is pure import-replacement with no behavioral changes.🤖 Generated with Qwen Code