fix(acp): prevent session/prompt hang when client ignores mid-turn drain requests#4925
Conversation
…ain requests The mid-turn queue drain request sent after every tool batch awaited the client's response with no deadline. A client that silently drops unknown methods instead of rejecting with JSON-RPC -32601 never answers, wedging every tool-calling prompt turn. Race the drain against a 2s timeout and latch the feature off after three consecutive timeouts. Also make the integration-test ACP client spec-conforming: reply -32601 to unknown agent->client requests instead of ignoring them.
E2E Test ReportAll runs against the bundled CLI ( 1. Regression scenario — agent fix alone, non-conforming clientHarness fix stashed (client silently ignores unknown methods, exactly the pre-PR CI behavior), agent rebuilt with the timeout:
Before the fix this scenario hung 60s and failed all 3 retries in 3 consecutive Release runs. 2. Full run — both fixes (CI state), final bundle
12/12 passed, zero retries (the three previously-failing tests had needed |
|
Thanks for the PR! Template looks good ✓ — all required sections present, bilingual, evidence table filled. On direction: this is a straightforward regression fix. PR #4728 added a mid-turn drain request that awaits the client's reply with no deadline, and non-conforming ACP clients (including the integration-test harness itself) silently drop unknown methods, causing every tool-calling prompt to hang forever. Three consecutive Release pipeline failures confirm the real-world impact. This is squarely within scope — ACP protocol robustness is core to the product. Claude Code's CHANGELOG has analogous mid-turn message handling fixes (e.g., "silently discarding messages enqueued mid-turn"), confirming the area is relevant. On approach: the scope is tight and proportional. The 2-second timeout via No concerns. Moving on to code review. 🔍 中文说明感谢贡献! 模板完整 ✓ — 所有必要章节齐全,双语,证据表格已填写。 方向:这是一个直接的回归修复。#4728 引入了一个无限等待客户端回复的轮次中 drain 请求,而不符合规范的 ACP 客户端(包括集成测试自身的 harness)会静默丢弃未知方法,导致所有调用工具的 prompt 永久挂起。连续三次 Release 流水线失败证实了实际影响。这完全在项目范围内 — ACP 协议健壮性是产品核心。Claude Code 的 CHANGELOG 也有类似的轮次中消息处理修复(如"静默丢弃轮次中入队的消息"),确认该领域相关。 方案:范围紧凑且与问题成比例。2 秒超时 + 没有顾虑。进入代码审查 🔍 — Qwen Code · qwen3.7-max |
Code ReviewThe implementation is clean and correct. A few observations: The The strike-counter design is well thought through: reset on success ( The integration-test fix in both The two new unit tests cover the key paths: the latch trip after 3 consecutive timeouts, and the strike reset on a successful drain between timeouts. Both pass. No blockers, no AGENTS.md violations. Test Results109/109 Session tests pass (including 2 new), zero regressions. Full suite (all 109 tests, no filter): Note on tmux real-scenario testing: this PR fixes an ACP protocol-level hang that requires a non-conforming ACP client to reproduce. The meaningful verification is the integration-test suite (which the author's E2E report shows at 12/12 pass) and the unit tests above. A tmux-driven CLI smoke test wouldn't exercise the ACP drain path. 中文说明代码审查实现干净且正确。几个观察:
计数器设计经过深思熟虑:成功时重置( 集成测试修复在 两个新的单元测试覆盖了关键路径:3 次连续超时后的闩锁触发,以及超时之间成功 drain 时的计数器重置。两者都通过。 没有阻塞问题,没有违反 AGENTS.md。 测试结果109/109 Session 测试通过(包括 2 个新增),零回归。 关于 tmux 真实场景测试的说明: 此 PR 修复的是 ACP 协议级别的挂起问题,需要不合规的 ACP 客户端才能复现。有意义的验证是集成测试套件(作者的 E2E 报告显示 12/12 通过)和上述单元测试。tmux 驱动的 CLI 冒烟测试不会触发 ACP drain 路径。 — Qwen Code · qwen3.7-max |
|
This is a well-scoped regression fix with clear motivation, correct implementation, and thorough test coverage. Going back to my independent proposal: I would have added a simple timeout and disabled the drain on first timeout. The PR's 3-strike approach is strictly better — it tolerates transient client slowness without giving up the feature prematurely, while still bounding the worst-case stall to ~6 seconds (3 × 2s) before permanent disable. The strike-reset-on-success is the detail that makes this robust for conforming-but-busy clients. The code is straightforward — no abstractions for single-use patterns, no speculative error handling, constants are well-named with comments explaining the why. The two new unit tests cover both the latch trip and the reset path, and the full 109-test suite passes clean. The one acknowledged tradeoff — a conforming-but-slow client losing mid-turn message injection for the rest of the session after three slow answers — is documented in a debug log and is an acceptable degradation compared to the alternative (hanging forever). LGTM. Approving. 中文说明这是一个范围明确的回归修复,动机清晰、实现正确、测试覆盖充分。 回顾我的独立方案:我会添加一个简单的超时并在首次超时时禁用 drain。PR 的 3 次计数器方案严格更优 — 它容忍客户端的瞬时缓慢而不会过早放弃该功能,同时将最坏情况的阻塞限制在约 6 秒(3 × 2s)后永久禁用。成功时重置计数器这个细节使得对合规但繁忙的客户端更加健壮。 代码简洁 — 没有为一次性模式创建抽象,没有投机性的错误处理,常量命名恰当,注释解释了 为什么。两个新的单元测试覆盖了闩锁触发和重置路径,完整的 109 个测试套件全部通过。 已知的一个权衡 — 合规但慢的客户端在三次慢响应后会在该会话的剩余时间失去轮次中消息注入 — 已在 debug 日志中记录,与替代方案(永久挂起)相比是可接受的降级。 LGTM。批准。 — Qwen Code · qwen3.7-max |
qwen-code-ci-bot
left a comment
There was a problem hiding this comment.
LGTM, looks ready to ship. ✅
Code Coverage Summary
CLI Package - Full Text ReportCore Package - Full Text ReportFor detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run. |
What this PR does
-32601error instead of silently ignoring them.Why it's needed
session/promptin three consecutive Release runs (example), blocking releases.Reviewer Test Plan
How to verify
Create a quick note.Request 4 (session/prompt) timed outin CI). After: the turn completes; the first three tool batches pay at most a one-time 2s wait each, then the agent stops asking.Evidence (Before & After)
acp-integrationsuite +acp-crontest, real model, no sandboxFull E2E report in a separate comment below.
Tested on
Environment (optional)
Real model via OpenAI-compatible endpoint (deepseek-v4-flash on DashScope),
QWEN_SANDBOX=false, isolatedQWEN_HOME.Risk & Scope
Linked Issues
Regression introduced by #4728.
中文说明
本 PR 做了什么
-32601错误,而不是静默忽略。为什么需要
session/prompt超时,阻塞发版。验证方式
Create a quick note。风险与范围