Skip to content

feat(test): add daemon connection stress test + refactor perf harness#4862

Merged
doudouOUC merged 3 commits into
daemon_mode_b_mainfrom
feat/daemon-loadtest-harness
Jun 8, 2026
Merged

feat(test): add daemon connection stress test + refactor perf harness#4862
doudouOUC merged 3 commits into
daemon_mode_b_mainfrom
feat/daemon-loadtest-harness

Conversation

@doudouOUC

@doudouOUC doudouOUC commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator

What this PR does

Extract shared helpers from baseline/benchmark daemon tests into dedicated modules and add a new mock-ACP connection stress test suite gated by QWEN_LOADTEST_ENABLED=1.

Refactoring:

  • _daemon-benchmark-helpers.ts: extract /usr/bin/time wrappers from benchmark test
  • _daemon-perf-report.ts: shared formatPercentiles, collectPlatformInfo, writeSnapshotArtifacts, resolveOutputDir
  • _daemon-harness.ts: export gitHead(), makeTempWorkspace(), sleep(), ScenarioResult, add lastSeenId to ConsumeSseResult
  • Slim baseline + benchmark tests by ~500 lines via imports

New features:

  • fixtures/mock-acp-child/agent.mjs: mock ACP agent using real AgentSideConnection SDK with env-controlled error injection modes (echo/reject/crash-on-prompt/hang)
  • mock-acp-typecheck.test.ts: compile-time Agent interface compliance check
  • qwen-daemon-loadtest.test.ts: 5 scenarios — rapid lifecycle, SSE slow-consumer eviction, Last-Event-ID reconnect, ACP crash recovery, burst concurrent sessions
  • vitest.loadtest.config.ts: isolated config with root: __dirname anchoring

Why it's needed

Issue #4514 T3.4 — the daemon's HTTP/SSE surface lacks automated stress testing. Without a mock-ACP-based harness, connection lifecycle bugs, SSE backpressure regressions, and crash recovery issues can only be caught by manual testing or real-model integration tests (which are slow and flaky). The refactoring reduces ~500 lines of duplication across baseline/benchmark tests, making the shared infrastructure maintainable as the test suite grows.

Reviewer Test Plan

How to verify

# Baseline still passes after refactoring
npx vitest run integration-tests/cli/qwen-serve-baseline.test.ts

# Benchmark still passes after helper extraction
QWEN_BENCHMARK_ENABLED=1 BENCHMARK_ITERATIONS=1 npx vitest run integration-tests/cli/qwen-daemon-vs-cli-benchmark.test.ts

# Type compliance test
npx vitest run integration-tests/cli/mock-acp-typecheck.test.ts

# Loadtest scenarios (requires daemon binary built)
QWEN_LOADTEST_ENABLED=1 npx vitest run --config integration-tests/vitest.loadtest.config.ts

# Default CI excludes loadtest
npx vitest run --config integration-tests/vitest.config.ts

Evidence (Before & After)

N/A — test infrastructure only, no user-facing behavior change.

Tested on

Platform Node Status
macOS (darwin arm64) v24.12.0 TypeScript + ESLint pass

Risk & Scope

  • Risk: Low. All new tests are gated behind QWEN_LOADTEST_ENABLED=1 and excluded from default CI. Refactoring is pure import-replacement with no behavioral changes.
  • Scope: Test infrastructure only. No production code changed.
  • Rollback: Revert this single commit.

🤖 Generated with Qwen Code

… (issue #4514 T3.4)

Extract shared helpers from baseline/benchmark tests into dedicated
modules and add a new mock-ACP connection stress test suite.

Refactoring (PR1 scope):
- _daemon-harness.ts: export gitHead(), makeTempWorkspace(), sleep(),
  ScenarioResult, lastSeenId tracking in consumeSseEvents
- _daemon-benchmark-helpers.ts: extract /usr/bin/time wrappers
  (spawnDaemonWithTime, parseTimeOutput, measureProcessTreeRss,
  measureCliStartupWithProfiler) from benchmark test
- _daemon-perf-report.ts: shared formatPercentiles, collectPlatformInfo,
  writeSnapshotArtifacts, resolveOutputDir
- Slim baseline + benchmark tests to import from new modules

New features (PR2 scope):
- fixtures/mock-acp-child/agent.mjs: mock ACP agent using real
  AgentSideConnection SDK, env-controlled modes (echo/reject/
  crash-on-prompt/hang)
- mock-acp-typecheck.test.ts: compile-time Agent interface check
- qwen-daemon-loadtest.test.ts: 5 scenarios gated by
  QWEN_LOADTEST_ENABLED=1 — rapid lifecycle, SSE slow-consumer
  eviction, Last-Event-ID reconnect, ACP crash recovery, burst
  concurrent sessions
- vitest.loadtest.config.ts: isolated config with root anchoring

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)

@qwen-code-ci-bot qwen-code-ci-bot left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @doudouOUC — thanks for the PR! The daemon loadtest work looks genuinely useful, but the PR body doesn't follow our pull request template. Please update it to include the required sections so reviewers can properly evaluate the change:

  • ## What this PR does — currently ## Summary, which is close but the template heading is required
  • ## Why it's needed — missing. What problem does the daemon stress test solve? Why now?
  • ## Reviewer Test Plan — currently ## Test plan. Please rename and fill in ### How to verify, ### Evidence (Before & After) (N/A for test infra is fine), and the ### Tested on table
  • ## Risk & Scope — missing. Main risk, what's out of scope, breaking changes
  • ## Linked IssuesIssue: #4514 T3.4 is referenced but doesn't use closing keywords (Closes #4514 / Fixes #4514)

Please update the PR body to match the template and re-request review. Happy to take another look once it's updated.

中文说明

你好 @doudouOUC — 感谢提交 PR!daemon 压力测试的工作看起来很有价值,但 PR 正文没有遵循我们的 PR 模板。请更新为模板要求的各个部分,以便审查者能正确评估:

  • ## What this PR does — 当前用的是 ## Summary,接近但需要改成模板标题
  • ## Why it's needed — 缺失。daemon 压力测试解决什么问题?为什么现在做?
  • ## Reviewer Test Plan — 当前用的是 ## Test plan。请改名并填写 ### How to verify### Evidence (Before & After)(测试基础设施可以填 N/A)、### Tested on 表格
  • ## Risk & Scope — 缺失。主要风险、不在范围内的内容、破坏性变更
  • ## Linked Issues — 引用了 Issue: #4514 T3.4 但没有使用关闭关键词(Closes #4514 / Fixes #4514

请按照模板更新 PR 正文后重新请求审查。更新后我很乐意再看看。

Qwen Code · qwen3.7-max

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a gated daemon connection stress/load test suite (using a mock ACP child) and refactors the existing baseline/benchmark integration tests by extracting shared performance harness/reporting helpers into reusable modules.

Changes:

  • Extract shared benchmark/perf reporting utilities into _daemon-benchmark-helpers.ts and _daemon-perf-report.ts, and extend _daemon-harness.ts with reusable helpers.
  • Add a mock ACP child fixture (agent.mjs) plus a type-shape compliance test for the mock agent interface.
  • Add a gated load test (qwen-daemon-loadtest.test.ts) with an isolated Vitest config, and exclude it from the default integration test run.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
integration-tests/vitest.loadtest.config.ts Adds an isolated Vitest config for running only the loadtest suite.
integration-tests/vitest.config.ts Excludes the loadtest suite from the default integration test run.
integration-tests/fixtures/mock-acp-child/agent.mjs Adds a mock ACP agent process used by stress/load tests.
integration-tests/cli/qwen-serve-baseline.test.ts Refactors baseline test to use extracted helpers/reporting utilities.
integration-tests/cli/qwen-daemon-vs-cli-benchmark.test.ts Refactors benchmark test to use extracted harness/reporting utilities.
integration-tests/cli/qwen-daemon-loadtest.test.ts Adds gated daemon connection stress/load scenarios.
integration-tests/cli/mock-acp-typecheck.test.ts Adds a compile-time shape check for the mock agent interface.
integration-tests/cli/_daemon-perf-report.ts New shared report primitives (platform info, output dir, formatting, artifact writes).
integration-tests/cli/_daemon-harness.ts Extends harness exports for reuse across baseline/benchmark/loadtest.
integration-tests/cli/_daemon-benchmark-helpers.ts New benchmark-only helpers (notably /usr/bin/time wrappers) extracted from the benchmark test.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread integration-tests/fixtures/mock-acp-child/agent.mjs
Comment thread integration-tests/cli/_daemon-benchmark-helpers.ts
Comment thread integration-tests/cli/_daemon-benchmark-helpers.ts
Copilot correctly noted that console.debug and console.dir also
write to stdout in Node, which would corrupt the NDJSON pipe.

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
@doudouOUC

Copy link
Copy Markdown
Collaborator Author

Review feedback — round 1

Thread Reviewer Action Detail
agent.mjs:34 console.debug/dir Copilot Fixed in 586860c Added console.debug = console.error and console.dir = console.error
_daemon-benchmark-helpers.ts:182 spawn error handler Copilot Deferred Pre-existing code extracted as-is; hardening is a separate PR
_daemon-benchmark-helpers.ts:380 spawn error handler Copilot Deferred Same rationale
PR description format qwen-code-ci-bot Fixed Updated to match PR template (What/Why/Test Plan/Risk sections)

Resolved 3/3 inline threads.

Comment thread integration-tests/cli/qwen-daemon-loadtest.test.ts Outdated
Comment thread integration-tests/cli/qwen-daemon-loadtest.test.ts Outdated
Comment thread integration-tests/cli/qwen-daemon-loadtest.test.ts Outdated
… recovery

- All 5 scenarios now use try/catch/finally so snapshot.status
  reflects actual test outcome
- SSE eviction scenario asserts evicted === true (near-deterministic
  with maxQueued=16 + 80+ events)
- Crash recovery verifies end-to-end by creating a fresh session
  post-crash, not just health check

🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
@doudouOUC

Copy link
Copy Markdown
Collaborator Author

Review feedback — round 2 (wenshao)

Thread Action Detail
loadtest.test.ts:211 snapshot status before expect Fixed in e1b0df4 All 5 scenarios use try/catch/finally; status reflects actual outcome
loadtest.test.ts:272 SSE eviction not asserted Fixed in e1b0df4 Added expect(evicted).toBe(true) with maxQueued=16
loadtest.test.ts:354 crash recovery only checks health Fixed in e1b0df4 Now creates fresh session post-crash via createOrAttachSession

Resolved 3/3 threads.

@wenshao

wenshao commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator

✅ Local verification — daemon connection stress test + perf-harness refactor (Linux)

Verified on the true post-merge state (the PR is rebased on current daemon_mode_b_maingit rev-list --left-right --count origin/daemon_mode_b_main...HEAD = 0 3, so the head already is the merge result). The PR touches only integration-tests/ (10 files); packages/{cli,core,sdk-typescript} source is unchanged, so the daemon under test is the existing one. The author's matrix lists macOS only — this run adds Linux (x64) coverage.

Built the full daemon chain for the harness (core → cli → sdk-typescript → web-templates → esbuild bundle): 0 TypeScript errors across the whole graph; the bundled daemon (dist/cli.js, which globalSetup.ts wires in via TEST_CLI_PATH) boots and listens.

1. New loadtest suite — 5/5 scenarios PASS ✅

QWEN_LOADTEST_ENABLED=1 vitest run --config integration-tests/vitest.loadtest.config.ts (real daemon + mock-ACP child, no model):

Scenario Result Time
rapid lifecycle (20× create→prompt→close) 3.06s
SSE slow-consumer eviction through HTTP 60.96s
Last-Event-ID reconnect under concurrent load 1.16s
ACP child crash → session error recovery 0.96s
burst: 10 concurrent sessions 1.10s

Test Files 1 passed · Tests 5 passed, 67.7s total. The daemon's HTTP/SSE surface survives the lifecycle churn, backpressure flood, reconnect, and a crash-on-prompt child without wedging.

2. Typecheck (tsc --noEmit over all integration-tests) — clean for the PR ✅

0 errors in all 10 changed files (incl. mock-acp-typecheck.test.ts, whose whole point is the compile-time Agent-interface assertion). The only diagnostic emitted is tsconfig.json(8,13) TS5063 for the "//" paths-comment — and that file is byte-identical to base (not in this PR's changeset), so it's pre-existing.

3. Refactor safety net — baseline test: 6 passed, 1 skipped, 1 pre-existing env failure ✅

vitest run --root ./integration-tests cli/qwen-serve-baseline.test.ts (the test slimmed by the helper extraction):

  • 6 passed — RSS scaling, attach latency, MCP grandchildren count as sessions grow, 2× SSE backpressure unit, etc. These exercise the extracted _daemon-perf-report.ts + _daemon-harness.ts helpers, confirming the extraction is behavior-preserving.
  • ↓ 1 skipped — prompt-latency (no model-credential env), expected.
  • × 1 failed — MCP child amplification > clientCount matches external pgrep observation (expected 4 MCP grandchildren, the sandbox's external pgrep saw 2). This is not a PR regression: the failing test is at line 480, which lies outside every diff hunk (hunks touch only lines 33–228 and 704), i.e. its code is byte-unchanged by this PR. It is an environment-sensitive process-tree observation test (pgrep visibility of spawned MCP children under a root/container sandbox); its sibling subtest in the same block passed.

Methodology note

act/real CI is not reproducible locally; this exercises the real built daemon + mock-ACP child end-to-end. The _daemon-benchmark-helpers.ts extraction (the /usr/bin/time wrappers used only by the benchmark suite) typechecks clean and follows the same extraction pattern as the two runtime-validated modules; the full benchmark run is cold-start-bound and did not complete within the time budget in this sandbox (a property of the benchmark, not the PR).

Verdict

Verified. The new mock-ACP loadtest harness works end-to-end (5/5 on Linux), the perf-harness refactor is behavior-preserving (6/6 runnable baseline assertions pass; the lone failure is a pre-existing, env-sensitive pgrep test byte-unchanged by this PR), and all 10 changed files typecheck clean. No blockers from a runtime-verification standpoint.

@wenshao

wenshao commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator

🔍 Local verification report — real build + daemon integration/load tests

中文 TL;DR: 在本地用 tmux,基于 PR 的真实合并结果(PR 已是 base 分支 daemon_mode_b_main 最新态,0 落后)做了「构建 + 重构回归 + 真实跑 loadtest」验证。结论:重构是忠实的(benchmark 装上 /usr/bin/time 后 10/10 全过、typecheck 过、baseline 除一个既有环境问题外全过),新的 loadtest 框架 5 个场景过了 4 个。唯一一个真问题:SSE 慢消费者驱逐场景在本 Linux 沙箱里确定性失败(连跑两次都 evicted=false)。 由于整套 loadtest 默认被 CI 排除、且本 PR 不碰任何生产代码,可以合并;但建议把那个 SSE 场景的“灌流”改成并发再依赖它。其余两个失败(benchmark 的 /usr/bin/time ENOENT、baseline 的 pgrep MCP 计数)都已证明是环境问题、与本 PR 无关。

Verdict: ✅ Mergeable as test-infra. The refactoring is faithful and the harness mostly works; zero production code is touched and the whole load-test suite is excluded from default CI. One new scenario (SSE eviction) needs hardening to be reliable cross-platform — details + a suggested fix below.

Environment

  • Base is daemon_mode_b_main. PR e1b0df4e0 is 0 commits behind base 82879b300 → PR head is the merge result.
  • Linux (kernel 6.12), Node v22.22.2 (author validated on macOS / Node v24). Built CI-style (npm install + npm run build, exit 0); daemon binary dist/cli.js produced. All via tmux.

Results

Check Result
Default CI config excludes the loadtest vitest list collects 0 qwen-daemon-loadtest files
mock-acp-typecheck.test.ts (Agent interface compliance) ✅ passed
qwen-daemon-vs-cli-benchmark.test.ts (refactored, QWEN_BENCHMARK_ENABLED=1) 10/10 (after installing /usr/bin/time — see below)
qwen-serve-baseline.test.ts (refactored) ✅ all pass except one pre-existing env test (below)
qwen-daemon-loadtest.test.ts — 5 mock-ACP scenarios 4 / 5 passed

Loadtest scenario breakdown:

✓ rapid lifecycle: create-prompt-close cycles        3171ms
× SSE slow consumer triggers eviction through HTTP   61002ms   ← deterministic fail here
✓ Last-Event-ID reconnect under concurrent load      1389ms
✓ ACP child crash → session error recovery           1081ms
✓ burst: concurrent sessions with mock prompts       1080ms

Two failures that are environmental, not this PR (both proven)

  1. Benchmark — spawn /usr/bin/time ENOENT. GNU time isn't installed in this sandbox; the benchmark (and its extracted _daemon-benchmark-helpers.ts) wraps /usr/bin/time -v. After apt-get install time, the benchmark passes 10/10 → the helper extraction is correct.
  2. Baseline — clientCount matches external pgrep observation. Timed out waiting for 4 MCP grandchildren … mcpGrandchildren=[2 pids]. The PR does not touch this test, and it fails identically on the clean base branch (no PR) — the sandbox only spawns 2 MCP grandchildren, not 4. Pre-existing/environmental.

The one real finding — SSE eviction scenario is environment-sensitive

SSE slow consumer triggers eviction through HTTP fails at expect(evicted).toBe(true) — the slow consumer is never evicted within 60s in this environment. I re-ran it twice in isolation; it failed both times → deterministic here, not flaky.

Likely cause: the flood fires prompts sequentially awaited:

for (let i = 0; i < promptCount; i++) {
  await d.client.prompt(session.sessionId, { prompt: [{ type: 'text', text: `flood-${i}` }] });
}

Awaiting each prompt throttles event production to roughly the slow consumer's drain rate (consumerDelayMs: 200), so the 16-deep queue doesn't reliably overflow → the daemon's queue-overflow eviction doesn't trip on this Linux/Node-22 host (it presumably did on the author's macOS/Node-24). The daemon eviction logic itself is base code and isn't exercised to failure here — the scenario just doesn't generate enough backpressure.

Suggested fix: make the flood genuinely concurrent so it outpaces the consumer regardless of host timing, e.g. fire without awaiting and then settle:

await Promise.allSettled(
  Array.from({ length: promptCount }, (_, i) =>
    d.client.prompt(session.sessionId, { prompt: [{ type: 'text', text: `flood-${i}` }] })),
);

This isn't a CI blocker (the suite is gated behind QWEN_LOADTEST_ENABLED=1 and excluded from default config), but worth hardening before this scenario is relied on in cron/CI.

Refactoring soundness

The ~500-line extraction into _daemon-benchmark-helpers.ts / _daemon-perf-report.ts / harness exports is faithful: benchmark passes 10/10 (with /usr/bin/time), mock-acp-typecheck passes, and the baseline test's only failure reproduces identically on base. The new vitest.loadtest.config.ts is correctly isolated (10-min timeout, fileParallelism:false, SDK alias) and the default config correctly excludes the loadtest.


Scope: build + refactor regression + live load-test on Linux/Node-22 against the real merge result. No production code is modified by this PR. Not covered: macOS/Node-24 (author's platform) where the SSE-eviction scenario reportedly passes.

@doudouOUC doudouOUC requested review from chiga0 and yiliang114 June 8, 2026 14:32
Comment thread integration-tests/cli/qwen-daemon-loadtest.test.ts
Comment thread integration-tests/cli/qwen-daemon-loadtest.test.ts
Comment thread integration-tests/cli/qwen-daemon-loadtest.test.ts
Comment thread integration-tests/fixtures/mock-acp-child/agent.mjs
Comment thread integration-tests/cli/mock-acp-typecheck.test.ts
Comment thread integration-tests/cli/_daemon-perf-report.ts
Comment thread integration-tests/vitest.loadtest.config.ts

@chiga0 chiga0 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Overview (AI Generated)

PR: #4862 — feat(test): add daemon connection stress test + refactor perf harness
Type: New Feature + Refactor (test infrastructure)
Change size: +1268/-505 across 10 files
HEAD: e1b0df4e03

Findings Summary

  • Critical/Major: 0 items
  • Minor: 0 items
  • Nit: 0 items

Assessment

Refactoring quality: All 3 extracted modules (_daemon-harness.ts additions, _daemon-benchmark-helpers.ts, _daemon-perf-report.ts) preserve exact behavior — function signatures, regex patterns, spawn args, and logic flow are identical to the originals. makeTempWorkspace cleanly generalizes the prefix parameter. ~500 lines of duplication eliminated without any behavioral drift.

New stress tests: The 5 scenarios are well-structured with a clean withDaemon helper pattern (try/catch/finally + proper disposal). All scenarios correctly track snapshot status through assertion lifecycle (commit 3 addressed this well). SSE eviction assertion (evicted === true) is near-deterministic with the configured maxQueued=16 + 80 events. Crash recovery validates end-to-end recovery (new session creation, not just health check). The mock ACP agent using real AgentSideConnection SDK provides high-fidelity testing without real model latency.

Test gating: QWEN_LOADTEST_ENABLED=1 + platform check + sandbox check properly excludes the load test from default CI. Dedicated vitest.loadtest.config.ts with 10-min timeout and no retries is appropriate for stress testing.

wenshao follow-up Suggestions: Noted 7 additional Suggestions in review 4450487353 (post-APPROVE). All valid improvement opportunities (eviction detection decoupling, ScenarioResult.error population, reject/hang mode coverage, typecheck PickRequired<Pick>, loadtest config typecheck include). None blocking.

Key Observations

Clean test infrastructure PR. The refactoring is a textbook extraction — no hidden behavior changes. The new stress tests follow established patterns from the existing baseline/benchmark suites and add genuine value for catching daemon HTTP/SSE lifecycle bugs. Approved.


This review was generated by QoderWork AI

@doudouOUC doudouOUC dismissed qwen-code-ci-bot’s stale review June 8, 2026 15:22

Already have 2 approved, 3ks.

@doudouOUC doudouOUC merged commit a680b4d into daemon_mode_b_main Jun 8, 2026
16 checks passed
@doudouOUC doudouOUC deleted the feat/daemon-loadtest-harness branch June 8, 2026 15:23
@doudouOUC

Copy link
Copy Markdown
Collaborator Author

Review feedback — round 3 (wenshao R2)

Thread Action Detail
loadtest:260 evictionReason fragility Fixed in 44dcc98 Added evicted: boolean to ConsumeSseResult
loadtest:371 crash recovery prompt Not taking Session creation already proves ACP child recovery; prompting in crash mode would re-crash
loadtest:215 ScenarioResult.error empty Fixed in 44dcc98 Error captured in catch blocks via pushScenario helper
agent.mjs:79 reject/hang unused modes Not taking Modes for future expansion; adding tests expands PR scope
typecheck:13 Pick vs full Agent Not taking AgentSideConnection handles missing methods via methodNotFound
perf-report:42 label discarded with env Not taking Inherited behavior, baseName prevents overwrites
loadtest.config:19 include typecheck Not taking Typecheck runs in default CI, not daemon-dependent

Fixed: 2 | Pushed back: 5 | Commit: 44dcc98

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants