test(integration): harden flaky sleep-interception e2e against skipped tool calls#4936
Conversation
…d tool calls The model sometimes answered the old prompts without ever calling the shell tool, timing out waitForToolCall and burning ~230s per CI run with retries. Prompts now explicitly require a run_shell_command call, assertions key off recorded telemetry (blocked call with success: false and Monitor guidance in the error) instead of model narration, and failures print debug info.
E2E Test ReportLocal validation against a real model —
For comparison, the flaky CI failure mode this PR addresses (run 27256480970) spent 230.9s on this suite failing one test through 3 attempts. |
|
Thanks for the PR! Template looks good ✓ On direction: well-aligned. Flaky e2e tests that burn ~230s per CI retry are a real productivity drain, and the root cause — the model answering from prediction instead of calling the tool — is a genuine reliability gap. Fixing the assertion strategy to key off telemetry rather than narration is the right move. No CHANGELOG reference needed; this is test infrastructure, not user-facing behavior. On approach: scope feels right. The changes are tightly scoped to the flaky test file plus additive telemetry fields in the test helper. The prompt rewrites, telemetry-based assertions, debug output on failure, and graceful degradation for the podman fallback all address distinct facets of the same problem without scope creep. Moving on to code review. 🔍 中文说明感谢贡献! 模板完整 ✓ 方向:对齐。不稳定的 e2e 测试每次 CI 重试浪费约 230 秒,是真实的生产力损耗;根本原因——模型凭预测直接回答而非调用工具——是可靠性缺口。将断言策略从自然语言回复迁移到遥测记录是正确的方向。无需 CHANGELOG 参考,这是测试基础设施,不涉及用户可见行为。 方案:范围合理。变更集中在不稳定的测试文件和测试辅助模块的增量遥测字段上。提示词重写、基于遥测的断言、失败时的调试输出、podman 回退路径的优雅降级,各自针对同一问题的不同方面,没有范围蔓延。 进入代码审查 🔍 — Qwen Code · qwen3.7-max |
Code ReviewIndependent proposal: given the flake (model answers without calling the tool → 60s timeout), I'd do the same three things: (1) make prompts explicitly require a tool call, (2) assert on telemetry tool calls instead of narration text, (3) add debug output on failure so future flakes are self-diagnosing. I'd also expose Comparison: the PR matches this proposal closely. The No blockers found. The changes are focused, well-tested, and follow project conventions. Test ResultsBuild: ✅ passes. Core typecheck: ✅ passes. Integration test run from the PR branch: Ran twice in tmux (two consecutive runs), both 4/4 pass. Total suite time ~38s, compared to the flaky CI failure at ~230s with retries. The telemetry-based assertions resolve quickly when the tool call happens, eliminating the 60s polling timeout that caused the original flake. 中文说明代码审查独立方案: 针对该偶发失败(模型不调用工具直接回答 → 60 秒超时),我会做同样的三件事:(1) 让提示词明确要求工具调用,(2) 基于遥测工具调用而非自然语言回复做断言,(3) 失败时添加调试输出以便未来偶发失败自行定位。同时会从遥测属性中暴露 对比: PR 与我的方案高度一致。 未发现阻塞性问题。变更集中、测试充分、遵循项目规范。 测试结果构建:✅ 通过。Core 类型检查:✅ 通过。在 PR 分支上运行集成测试,两次连续运行均 4/4 通过,总耗时约 38 秒(对比不稳定 CI 失败的约 230 秒)。基于遥测的断言在工具调用发生时快速解析,消除了导致原始偶发失败的 60 秒轮询超时。 — Qwen Code · qwen3.7-max |
ReflectionStepping back: this PR does exactly what a test-hardening PR should do. The motivation is concrete (real CI failures burning real time), the fix targets the actual root cause (model skipping tool calls) rather than papering over it with longer timeouts, and the implementation is minimal — two files, additive-only in the test helper, no production code touched. My independent proposal matched the PR's approach almost exactly, which is a good sign: the solution is the obvious one for this problem. The telemetry-based assertion strategy is strictly better than narration matching, the debug output on failure makes future flakes self-diagnosing, and the graceful degradation for the podman fallback shows attention to edge cases without overengineering. Both test runs from the PR branch passed 4/4 in ~38s — a 6× improvement over the flaky failure mode. No concerns worth flagging. Verdict: approve. ✅ 中文说明反思退后一步看:这个 PR 做了测试加固 PR 该做的事。动机具体(真实的 CI 失败浪费真实时间),修复针对实际根因(模型跳过工具调用)而非用更长超时掩盖问题,实现最小化——两个文件,测试辅助模块纯增量,不涉及生产代码。 我的独立方案与 PR 的方案几乎完全一致,这是个好信号:方案是该问题的显而易见解。基于遥测的断言策略严格优于自然语言匹配,失败时的调试输出让未来偶发失败可自诊断,podman 回退路径的优雅降级体现了对边缘情况的关注而没有过度工程。 PR 分支两次测试运行均 4/4 通过,耗时约 38 秒——相比不稳定失败模式提升约 6 倍。无需标记的顾虑。 结论:批准。 ✅ — Qwen Code · qwen3.7-max |
qwen-code-ci-bot
left a comment
There was a problem hiding this comment.
LGTM, looks ready to ship. ✅
Code Coverage Summary
CLI Package - Full Text ReportCore Package - Full Text ReportFor detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run. |
qwen-code-ci-bot
left a comment
There was a problem hiding this comment.
No issues found. LGTM! ✅ — qwen3.7-max via Qwen Code /review
DragonnZhang
left a comment
There was a problem hiding this comment.
Reviewed the diff (2 files, +141/-32). Changes are test-only and well-structured:\n\n- Prompts rewritten to require actual tool calls instead of model narration\n- Core assertions migrated to telemetry-based checks (blocked/allowed tool calls)\n- New Monitor-guidance check added with graceful degradation for stdout fallback\n- Narration checks downgraded to warn-only via validateModelOutput\n- Added status/error fields to test rig telemetry (additive, optional)\n\nCI passes on all platforms. No high-confidence bugs, security issues, or logic errors found.
What this PR does
sleeptool call, allowed and intentional-sleep runs must show a successful one.Why it's needed
Reviewer Test Plan
How to verify
From
integration-tests/, run the suite against a real model:Expected: all 4 tests pass. The blocked cases require telemetry to record a failed foreground
sleep 5call whose error message mentions the Monitor tool; the allow and intentional-sleep cases require a successful call. A regression would have looked like the old flake: a 60s hang waiting for a tool call that never happened, with no diagnostics.Evidence (Before & After)
Three consecutive local runs against a real model (deepseek-v4-flash via DashScope, throwaway QWEN_HOME):
Tested on
Environment (optional)
node dist/cli.js(fresh build + bundle), no sandbox, DashScope OpenAI-compatible endpoint.Risk & Scope
Linked Issues
Follow-up to #4878 (same test file, different flake source).
中文说明
本 PR 做了什么
sleep调用,放行和 intentional-sleep 场景必须记录到一次成功调用。为什么需要
评审验证方案
如何验证
在
integration-tests/目录下,用真实模型运行该套件(命令见上文英文部分)。预期:4 个测试全部通过。拦截场景要求遥测记录到一次失败的前台
sleep 5调用且错误信息提到 Monitor 工具;放行和 intentional-sleep 场景要求记录到成功调用。回归的表现会和旧的偶发失败一样:等待一次从未发生的工具调用而挂起 60 秒,且没有任何诊断信息。证据(前后对比)
连续三次本地真实模型运行(DashScope 的 deepseek-v4-flash,一次性 QWEN_HOME),结果见上文表格:3 次运行均 4/4 通过。
已测试平台
macOS ✅,Windows⚠️ ,Linux ⚠️ (CI 覆盖三平台)。
风险与范围
关联 Issue
#4878 的后续(同一测试文件,不同的不稳定来源)。