fix(core): give complete intentional-sleep guidance on first rejection for sleep chains #4948
When the model tries `sleep 5 && cmd`, the old error message said 'split follow-up commands into a separate invocation' but omitted the `# intentional-sleep: <reason>` syntax. The model would then try standalone `sleep 5`, get blocked a second time, and only then learn the escape hatch.

Now the first rejection for non-standalone sleep tells the model both steps: split into two calls and use the intentional-sleep comment. This reduces failures from 2 to 1.

Also reuses the already-computed `strippedCommand` variable instead of calling `stripShellWrapper` a second time.
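The two rejection paths described above can be sketched roughly as follows. This is a minimal illustration, not the project's validation code: `checkSleep`, `SleepCheck`, `COMMON_GUIDANCE`, and the regex-based parsing are all hypothetical, and the 10-minute cap is not modeled. Only the message wording is taken from the E2E report in this PR.

```typescript
// Hypothetical sketch: names and parsing rules are assumptions,
// not the project's actual validation code.
interface SleepCheck {
  blocked: boolean;
  message?: string;
}

const COMMON_GUIDANCE =
  'Run blocking commands in the background with is_background: true. ' +
  'For streaming events (watching logs, polling APIs), use the Monitor tool.';

function checkSleep(command: string): SleepCheck {
  // Match a leading `sleep N` and capture whatever follows it.
  const match = /^sleep\s+(\d+(?:\.\d+)?)\b([\s\S]*)$/.exec(command.trim());
  if (!match) {
    return { blocked: false }; // not a sleep command at all
  }
  const seconds = match[1];
  const rest = match[2].trim();

  // Escape hatch: a standalone sleep with a non-empty intentional-sleep reason.
  if (/^#\s*intentional-sleep:\s*\S/.test(rest)) {
    return { blocked: false };
  }

  // Standalone sleep without the escape-hatch comment.
  if (rest === '' || rest.charAt(0) === '#') {
    return {
      blocked: true,
      message:
        'Blocked: standalone sleep ' + seconds + '. ' + COMMON_GUIDANCE +
        ' If you genuinely need a standalone delay (rate limiting, deliberate pacing),' +
        ' add a trailing comment like `# intentional-sleep: wait for MCP rate limit reset`' +
        ' (up to 10 minutes).',
    };
  }

  // Sleep chain: the first rejection now names both steps.
  const followUp = rest.replace(/^(?:&&|\|\||;|\|)\s*/, '');
  return {
    blocked: true,
    message:
      'Blocked: sleep ' + seconds + ' followed by: ' + followUp + '. ' + COMMON_GUIDANCE +
      ' Split into two calls: first `sleep N # intentional-sleep: <reason>` (standalone),' +
      ' then the follow-up command.',
  };
}
```

The point of the sketch is that the chain branch carries the comment syntax itself, so the model never needs to hit the standalone branch just to discover it.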
E2E Test Report
Test: Sleep chain double-failure regression
Method: Headless CLI (
Prompt: "I just started a background server. Please wait 5 seconds then check if port 8080 is responding by running: sleep 5 && curl -s http://localhost:8080. If the sleep command is blocked, tell me exactly what error message you received, then try again following the guidance. Report every attempt and its result."

Before (main branch): 2 failures
{"is_error":true,"text":"Blocked: sleep 5 followed by: curl -s http://localhost:8080. Run blocking commands in the background with is_background: true. For streaming events (watching logs, polling APIs), use the Monitor tool. The intentional-sleep escape hatch only applies to standalone sleep commands; split follow-up commands into a separate invocation."}
{"is_error":true,"text":"Blocked: standalone sleep 5. Run blocking commands in the background with is_background: true. For streaming events (watching logs, polling APIs), use the Monitor tool. If you genuinely need a standalone delay (rate limiting, deliberate pacing), add a trailing comment like `# intentional-sleep: wait for MCP rate limit reset` (up to 10 minutes)."}
{"is_error":false,"text":"Command: sleep 5 # intentional-sleep: wait for server startup\nExit Code: 0"}
{"is_error":false,"text":"Command exited with code: 7"}
Model needed 3 tool calls (2 blocked, 1 sleep success) before reaching

After (this PR): 1 failure
{"is_error":true,"text":"Blocked: sleep 5 followed by: curl -s http://localhost:8080. Run blocking commands in the background with is_background: true. For streaming events (watching logs, polling APIs), use the Monitor tool. Split into two calls: first `sleep N # intentional-sleep: <reason>` (standalone), then the follow-up command."}
{"is_error":false,"text":"Command: sleep 5 # intentional-sleep: waiting for server on port 8080 to be ready\nExit Code: 0"}
{"is_error":false,"text":"Command exited with code: 7"}
Model needed 2 tool calls (1 blocked, 1 sleep success) before reaching

Unit tests
Result: ✅ PASS
Failures reduced from 2 → 1. Model learns the full escape hatch on first rejection.
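The before/after comparison in this report comes down to counting blocked tool results. A minimal sketch of that count is shown below, assuming each tool result is available as one JSON object per line in the shape shown above. `ToolResult` and `countBlocked` are hypothetical names, the real stream-json output may wrap results in larger events, and the `text` values of the blocked entries are abbreviated here.

```typescript
// Hypothetical sketch: assumes one tool result per JSON line, in the shape
// shown in the report. Blocked `text` values are abbreviated.
interface ToolResult {
  is_error: boolean;
  text: string;
}

function countBlocked(jsonLines: string[]): number {
  return jsonLines
    .map((line) => JSON.parse(line) as ToolResult)
    // A rejection is an error result whose text begins with "Blocked:".
    .filter((result) => result.is_error && result.text.indexOf('Blocked:') === 0)
    .length;
}

const beforeRun = [
  '{"is_error":true,"text":"Blocked: sleep 5 followed by: curl -s http://localhost:8080."}',
  '{"is_error":true,"text":"Blocked: standalone sleep 5."}',
  '{"is_error":false,"text":"Command: sleep 5 # intentional-sleep: wait for server startup\\nExit Code: 0"}',
  '{"is_error":false,"text":"Command exited with code: 7"}',
];

const afterRun = [
  '{"is_error":true,"text":"Blocked: sleep 5 followed by: curl -s http://localhost:8080."}',
  '{"is_error":false,"text":"Command: sleep 5 # intentional-sleep: waiting for server on port 8080 to be ready\\nExit Code: 0"}',
  '{"is_error":false,"text":"Command exited with code: 7"}',
];

console.log(countBlocked(beforeRun), countBlocked(afterRun));
```

Filtering on the `Blocked:` prefix rather than on `is_error` alone keeps unrelated tool errors out of the count.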
Thanks for the PR!

Template looks good ✓

On direction: clearly aligned. This is a straightforward DX improvement: the shell tool's error message spread its guidance across two rejections when it could give the full answer in one. Reducing wasted tool calls is exactly the kind of polish worth shipping.

On approach: scope is tight and correct. One message string change, one variable reuse, one test update. Nothing to cut.

Moving on to code review. 🔍

Qwen Code · qwen3.7-max
wenshao
left a comment
No review findings. Downgraded from Approve to Comment: CI still running. — qwen3.7-plus via Qwen Code /review
Code Coverage Summary
CLI Package - Full Text Report
Core Package - Full Text Report
For detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run.
Code Review
The change is minimal and correct. Two things worth noting:
No blockers, no convention violations.

Real-Scenario Testing
Prompt: "Run this exact command: sleep 5 && curl -s http://localhost:8080. If the command is blocked, report the exact error, then retry following the guidance."
Before (main branch —
This is a clean, focused PR that does one thing well.

The independent proposal I wrote before reading the diff was identical: change the non-standalone error message to include the full …

The before/after evidence is convincing: the model went from 2 rejections to 1, following the complete guidance on the first try. That's a real token and latency saving for every sleep-chain interaction. The unit tests are updated appropriately and pass on both branches.

Nothing to flag. Ship it. ✅

Qwen Code · qwen3.7-max
qwen-code-ci-bot
left a comment
LGTM, looks ready to ship. ✅
What this PR does
When the shell tool blocks a sleep chain like `sleep 5 && curl http://localhost`, the error message now tells the model the full solution in a single rejection (split into two calls and use `# intentional-sleep: <reason>`) instead of revealing the escape hatch only after a second failure.

Also reuses the already-computed `strippedCommand` variable instead of calling `stripShellWrapper` a second time in the same validation method.

Why it's needed
The old error message for `sleep N && cmd` said "split follow-up commands into a separate invocation" but omitted the `# intentional-sleep:` syntax. The model would split, try standalone `sleep 5`, get blocked again, and only then learn it needs the comment. This wasted two tool calls before the model could succeed.

E2E testing confirmed the model consistently fails twice before succeeding with the old message, and only once with the new message.
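For illustration, the two-call shape that the new message asks for could be derived from a chained command as in the sketch below. `splitSleepChain` is a hypothetical helper, not part of this PR; it handles only `&&` and `;` separators and takes the reason text as an argument supplied by the caller.

```typescript
// Hypothetical helper: turns `sleep N && cmd` into the two separate calls
// the new error message recommends. Only `&&` and `;` are handled.
function splitSleepChain(command: string, reason: string): [string, string] | null {
  const match = /^\s*(sleep\s+\d+(?:\.\d+)?)\s*(?:&&|;)\s*([\s\S]+?)\s*$/.exec(command);
  if (!match) {
    return null; // not a sleep chain this sketch understands
  }
  // First call: standalone sleep carrying the escape-hatch comment.
  // Second call: the follow-up command on its own.
  return [match[1] + ' # intentional-sleep: ' + reason, match[2]];
}

const calls = splitSleepChain(
  'sleep 5 && curl -s http://localhost:8080',
  'wait for server startup',
);
console.log(calls);
```

Note that `;` and `&&` differ when the sleep itself fails; the sketch ignores that distinction because each call is issued separately anyway.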
Reviewer Test Plan
How to verify
Build and bundle, then run a headless prompt that triggers the sleep chain path. Before: two `is_error: true` tool results before the model succeeds. After: only one.

Evidence (Before & After)
Before (2 failures):
After (1 failure):

Tested on
Environment (optional)
Headless CLI (`node dist/cli.js --yolo --output-format stream-json`), unit tests (`npx vitest run`).

Risk & Scope
`sleep 5` alone) is unchanged.