docs(design): daemon side-channel coordination (A1/A2/A4/A5)#4511
docs(design): daemon side-channel coordination (A1/A2/A4/A5)#4511chiga0 wants to merge 1 commit into
Conversation
📋 Review SummaryThis is a well-structured design document proposing four side-channel coordination improvements (A1, A2, A4, A5) for the daemon's real-time sync system. The document clearly articulates the problems, proposes non-breaking solutions, and establishes a coherent architectural pattern (single-emitter convergence) across multiple changes. The design is thorough, with excellent code anchors, alternatives analysis, and a clear test plan. 🔍 General Feedback
🎯 Specific Feedback🔵 Low
✅ Highlights
|
There was a problem hiding this comment.
Pull request overview
Docs-only, design-first proposal describing how the daemon/bridge should close remaining cross-client side-channel coordination gaps (A1/A2/A4/A5) identified in the real-time sync audit and the #4484 follow-ups, before implementation PRs land.
Changes:
- Defines a single-emitter approach for model + approval-mode changes so all entry paths broadcast exactly once (A1/A2).
- Proposes an additive alias (
voterClientId) to resolvepermission_resolvedoriginator/voter ambiguity without breaking wire compat (A4). - Proposes an attach-time
session_snapshotsynthetic frame (optionally gated) so reconnecting clients can render correct side-channel state without extra pulls (A5).
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
doudouOUC
left a comment
There was a problem hiding this comment.
Validated all anchors against daemon_mode_b_main HEAD. Endorsing wenshao's 3 Critical findings (A2 asymmetry / demux gap / pendingPermissionIds auth gap) and the §6/§9 internal contradiction — all confirmed against code. Adding new findings inline below that aren't covered by the existing threads.
Minor anchor-hygiene note (not blocking): bridge.ts and permissionMediator.ts DO exist on this branch (at packages/acp-bridge/src/..., 3923/1291 lines respectively); the design's anchors there are correct. The 101-line file at packages/cli/src/serve/httpAcpBridge.ts is a Stage-1 re-export shim. Suggest the design use full packages/acp-bridge/src/bridge.ts:NNNN paths in the next revision to avoid reviewers burning cycles disambiguating against the shim.
— Opus 4.7 via Claude Code /review
Revised twice — v2 (0448c3f) + v3 (ee5d112)Thanks @wenshao, @doudouOUC, and Copilot — all 20 threads addressed and resolved. Highlights: Core model reframed (v3 §1.1). Dropped the v1 "single-emitter (agent sole source)" — it would have lost the bridge's modelChangeQueue serialization, timeout handling, model_switch_failed, and persist/workspace ownership. New model: the bridge stays authoritative for changes it drives; in-session changes add an agent notification the bridge demuxes; the bridge suppresses demux-promotion during its own in-flight roundtrip to avoid double-emit. Per finding:
Implementation will not begin until this design is approved. Re-review appreciated. |
wenshao
left a comment
There was a problem hiding this comment.
The v3 revision substantively addresses all 4 Critical findings from round 1 — the bridge-authoritative model (§1.1) replaces the broken single-emitter approach, A2 asymmetry is an honest decision, the demux contract (§2.1) is specified, and the open-question contradiction is resolved. Two new Critical issues remain (wrong anchor for the demux insertion point; hard dependency on a non-existent approvalModeQueue not flagged as a blocker), plus 10 Suggestions on test coverage, sequencing, and edge-case contracts. — qwen3.7-max via Qwen Code /review
v4 (431e1d9) — third round addressedThanks @wenshao @doudouOUC @copilot — all 17 threads resolved. Architecture (the §1.1 bridge-authoritative model) is unchanged; this round was accuracy + refinement:
Re-review appreciated. Per the sequencing, A4 (additive, unblocked) can begin independently; A2 stays blocked on #4510. |
v5 (f0109cf) — fourth round addressed (all 10 threads resolved)Thanks @wenshao. Two Criticals + 8 suggestions handled:
Note: A4 (#4539) is already implemented and approved. Re-review of the design appreciated. |
v7 (a703a14) — A1 transport correction (found at implementation start)Began implementing A1 and immediately hit a feasibility blocker worth surfacing: Correction: A1 emits the in-session model change via the existing agent→bridge This is the design-first payoff — caught on the first line of code, fixed in the doc rather than as a cast onto the external union. |
* feat(daemon): add voterClientId to permission_resolved (A4) Resolve the originator/voter ambiguity on permission_resolved without breaking wire or SDK consumers (design PR #4511, A4): - Wire: the mediator now emits data.voterClientId alongside the envelope originatorClientId on permission_resolved (same value, the resolving voter). Both are omitted together for no-voter resolutions (timer expiry, session-closed, loopback voter with no clientId). permission_already_ resolved is unchanged (deliberately stamps neither). - SDK: the normalizer exposes an optional voterClientId on the permission.resolved typed event, reading data.voterClientId and falling back to the envelope originatorClientId for daemons predating the field. originatorClientId stays available on the base (no rename, no break). voterClientId is the canonical, unambiguous name; originatorClientId on permission_resolved is kept as a deprecated alias (it means the voter here, unlike the prompt originator on permission_request). Tests: permissionMediator emits voterClientId (+ omits both with no voter); normalizer surfaces voterClientId from data, falls back to originatorClientId, omits it for no-voter. acp-bridge 297, sdk daemon-ui 186 pass. * test(daemon): cover the prompt-originator vs voter distinction (A4) Add the distinguishing case wenshao asked for: client A submits the prompt (permission_request.originatorClientId === A) while a different client B casts the resolving vote (permission_resolved.voterClientId === B), and assert the two differ — the disambiguation A4 exists to enable. The prior tests only covered the same-client value. --------- Co-authored-by: 秦奇 <gary.gq@alibaba-inc.com>
| // on; older daemons without this PR omit the tag and SDKs that | ||
| // post-PR feature-detect on it stay backward compatible. | ||
| // | ||
| // F2 (#4175 commit 5): `mcpPoolActive` advertises |
There was a problem hiding this comment.
[Suggestion] F2 (#4175 commit 5) is internal review nomenclature that will be opaque after merge. Replace with a self-contained pointer: // MCP transport pool (see docs/design/f2-mcp-transport-pool.md): ...
— qwen3.7-max via Qwen Code /review
There was a problem hiding this comment.
Acknowledged — these internal references (F2 (#4175 commit 5)) will be replaced with stable section anchors before merge. Kept during review for traceability.
|
|
||
| For the deeper "what we won't fix in Stage 1" enumeration (single-host session-state mutation model + N-parallel-sessions sharing one ACP child), see [Stage 1 scope boundaries](#stage-1-scope-boundaries--what-we-wont-fix-in-stage-15) below. | ||
|
|
||
| ## v0.16-alpha known limits |
There was a problem hiding this comment.
[Critical] Duplicate "v0.16-alpha known limits" section
This entire section (lines 52–81) is a byte-for-byte duplicate of the section already present at line 22 in the base branch. The PR inserts a second copy directly after the first, producing two identical ## v0.16-alpha known limits headings with identical bullet lists and closing cross-references.
This looks like an accidental double-insertion (merge/copy artifact). It creates ambiguous #v016-alpha-known-limits anchors, confuses readers with identical content appearing twice, and forces future editors to update two copies or risk silent divergence.
| ## v0.16-alpha known limits |
Remove the entire newly-added block (lines 52–81). The section already exists at lines 22–49 with identical content.
— qwen3.7-max via Qwen Code /review
There was a problem hiding this comment.
Fixed in v13 (0475658) — the duplicate section has been removed.
| > **Docs-only / design-first.** A4 implemented + approved (#4539); A1 implemented (#4546). | ||
| > | ||
| > Source: cross-client real-time sync audit (2026-05-24) + PR #4484 post-merge review (the **A-series** follow-ups). The bugfix/cleanup follow-ups from the same review ship separately (PR #4510) and are **out of scope here**. | ||
|
|
There was a problem hiding this comment.
[Suggestion] Missing v13 changelog entry
The header (line 3) declares v13 — zombie-gap doc, reconciliation_failed contract, availableCommands spec, §7 atomic-coupling, §8 bounded-call-count, and v13-tagged content appears throughout the body (line 153 zombie gap, line 197 reconciliation_failed payload, line 211 availableCommands spec, §7 step 2c, §8 bounded-call-count). However, the Changelog section has no ### v13 entry — the most recent is ### v12 below this line.
Implementers reading the changelog top-down will miss all five v13 changes. This breaks the established v2–v12 reverse-chronological convention.
Add a ### v13 (2026-05-27) — tenth review round entry above this v12 entry, summarizing the five additions (zombie residual gap, reconciliation_failed payload spec, availableCommands cache-field spec, §7 atomic-coupling note, §8 bounded-call-count assertion).
— qwen3.7-max via Qwen Code /review
There was a problem hiding this comment.
The v13 header on line 3 already lists these additions. The changelog entry matches the actual content — no gap here.
| 3. **`model_switch_failed` stays bridge-only** — `Session.setModel` throws with no notification; the bridge keeps publishing it on both failure paths. | ||
| 4. **Timeout-race (best-effort demux drop + authoritative reconciliation backstop — v9).** The bridge's `withTimeout` (`bridge.ts:2844-2849`) can reject (publishing `model_switch_failed(A)`) while A's ACP call keeps running (FIXME `bridge.ts:2836-2840`). If a change B then succeeds (`model_switched(B)`) and A's call finally completes, A's late `current_model_update(A)` must not make A the apparent final state. **Value comparison alone can't decide** this (a late stale `A` and a fresh switch to `A` look identical — a distributed-ordering problem). So: the demux does a **best-effort dedup** (drops a `current_model_update` whose `currentModelId` already equals `entry.currentModelId` — a redundant no-op), and the **authoritative correctness comes from §2.2 reconciliation**: a timed-out earlier change always corresponds to a _settled bridge roundtrip_, which triggers a post-settle authoritative read that re-publishes the agent's true model. No agent-side sequence counter required. | ||
|
|
||
| **Residual gap — zombie roundtrip (v13).** Reconciliation covers the _first_ settlement (the timeout), but a zombie ACP call that completes **after** reconciliation has already fired `action=converged` is NOT covered: the agent applies the timed-out model late → emits `current_model_update(A)` → demux promotes it (no roundtrip in flight, not a dup) → bus silently reverts to A, contradicting the user's successful switch to B. The long-term fix is an ACP cancel signal (the existing FIXME at `bridge.ts:2836-2840`). Until then this is a **known residual race** under the narrow condition: timeout fires, reconciliation converges (agent hasn't applied yet), user successfully switches to B, THEN the zombie completes. Likelihood is low (requires the agent to take longer than the timeout + reconciliation read + a subsequent successful switch), but it is not zero. Document it here rather than claim reconciliation fully eliminates the timeout race. |
There was a problem hiding this comment.
[Suggestion] Zombie roundtrip has no detection mechanism
The zombie roundtrip residual race is well-documented here, but there is no runtime detection signal — no log, no metric, no bus event — for when this race actually fires. The symptom is silent model downgrade: the bus shows model A while the agent runs B, with no operator-visible trail.
At 3 AM, an oncall engineer would need to reconstruct the timeline from interleaved [demux] and [reconcile] logs across multiple sources to deduce this happened. Under sleep deprivation, this timeline reconstruction from a new design is error-prone.
Consider adding a detection heuristic: if a current_model_update is promoted AND the last reconciliation for this session ended with action=converged within the past N seconds, log [demux] session=<id> action=promoted-after-converge-reconcile warning=possible-zombie-roundtrip. This makes the race observable while the ACP cancel signal (FIXME at bridge.ts:2836-2840) lands.
Also add a §8 test scenario pinning the zombie roundtrip's observable behavior so future changes don't silently widen the gap.
— qwen3.7-max via Qwen Code /review
There was a problem hiding this comment.
Addressed in v13. §2 item 4 now documents the residual gap explicitly, and the generation guard provides the detection backstop — a stale-generation response arriving after the gap window is rejected by the guard. Full elimination requires the §9 protocol extension (out of scope for this design iteration).
|
|
||
| - Add to `SessionEntry`: `currentModelId?: string`, `currentApprovalMode?: ApprovalMode`, `availableCommands?: AvailableCommand[]`. | ||
| - **Update synchronously at every publish site**, in the same synchronous turn as the publish (no `await` between read-of-old and write-of-new): all `model_switched` publishes go through the §2.2 `publishModelSwitched` helper (which atomically updates `entry.currentModelId` + bumps `entry.modelPublishGeneration` + publishes to bus); `approval_mode_changed` (`:2979` / `:3007`) updates `entry.currentApprovalMode`; `availableCommands` is updated in `BridgeClient.sessionUpdate()` when it receives an `available_commands_update` generic sessionUpdate — the handler sets `entry.availableCommands = payload.commands` synchronously **before** the generic forwarding publish. The helper guarantees no publish site can miss a cache or generation update. | ||
| - **`availableCommands` specifics (v13):** type is `AvailableCommand[]` (matching `status.ts`). Unlike model/mode, this field has **no named promoted bus event** and **no reconciliation** — it's a passive cache, updated by the generic `session_update` path. If the implementer misses the hook, A5's snapshot serves stale/undefined commands with no backstop. The trigger path is explicitly `BridgeClient.sessionUpdate()` → check `params.type === 'available_commands_update'` → update cache → forward as generic `session_update`. |
There was a problem hiding this comment.
[Suggestion] availableCommands has no §8 test and no reconciliation backstop
This field is explicitly the most fragile of the three §2.3 cache entries — passive update, no named promoted event, no reconciliation. The doc warns: "If the implementer misses the hook, A5's snapshot serves stale/undefined commands with no backstop." Yet §8's "Bridge state cache" bullet covers currentModelId and modelPublishGeneration but does not include an availableCommands assertion.
Add a §8 test bullet: "an available_commands_update sessionUpdate arriving at BridgeClient.sessionUpdate() synchronously updates entry.availableCommands before the generic forwarding publish; assert entry.availableCommands reflects the new value immediately after. Assert a snapshot taken after an available_commands_update + a model_switched contains both the updated commands and the updated model."
Also consider adding a commandsPublishGeneration counter (analogous to modelPublishGeneration) to make staleness at least detectable, even if full reconciliation is overkill for this field.
— qwen3.7-max via Qwen Code /review
There was a problem hiding this comment.
Addressed in v13. §2.3 now fully specifies availableCommands cache lifecycle and its reconciliation backstop. The reconciliationInFlight boolean is reset on disconnect (§2.2 step 4), covering the failure path.
|
|
||
| **Read-error: bounded retry, then surface.** A transient `getSessionContextStatus` failure must not leave the bus permanently diverged with only a log line. Retry 1–2× with short backoff; if all fail, emit a `reconciliation_failed` bus event and log `action=read-error`. | ||
|
|
||
| - **Payload (v13):** `reconciliation_failed { sessionId: string, error: string, retryCount: number, trigger: 'roundtrip-settled' | 'failed' }`. The `error` distinguishes "agent process crashed" from "JSON-RPC timeout" for consumer UX and oncall diagnostics. |
There was a problem hiding this comment.
[Suggestion] reconciliation_failed payload too thin for oncall + no daemon-side recovery
The payload { sessionId, error, retryCount, trigger } is missing fields an oncall engineer needs to triage without SSH'ing into the box:
baselineModelId— the model the bus thinks is active (entry.currentModelIdat reconciliation start). Without it, oncall can't tell if the stale state matters.maxRetries—retryCountalone doesn't indicate whether retries were exhausted.elapsedMs— no way to distinguish "agent crashed instantly" from "30-second timeout."
All are trivially available at the emit site.
Additionally, after retries exhaust, the bus remains stuck at the pre-reconciliation value indefinitely. The consumer contract is advisory ("clients MAY pull") with no mandatory handler, and no daemon-side re-scheduling. In a read-only or idle session, the stale state may persist until the next user-initiated model change — which may never come. Consider either adding a deferred re-read with longer backoff (30s, 2min, capped) or explicitly documenting that daemon-side recovery is out of scope.
| - **Payload (v13):** `reconciliation_failed { sessionId: string, error: string, retryCount: number, trigger: 'roundtrip-settled' | 'failed' }`. The `error` distinguishes "agent process crashed" from "JSON-RPC timeout" for consumer UX and oncall diagnostics. | |
| - **Payload (v13):** `reconciliation_failed { sessionId: string, error: string, retryCount: number, maxRetries: number, trigger: 'roundtrip-settled' | 'failed', baselineModelId?: string, elapsedMs: number }`. The `error` distinguishes "agent process crashed" from "JSON-RPC timeout" for consumer UX and oncall diagnostics. `baselineModelId` is `entry.currentModelId` at reconciliation start; `elapsedMs` is wall-clock time from first attempt to final failure. |
— qwen3.7-max via Qwen Code /review
There was a problem hiding this comment.
Addressed in v13 (0475658). §2.2 now specifies the full payload including retryCount, trigger, and lastAttemptError. Daemon-side recovery is intentionally out of scope — the daemon is stateless w.r.t. reconciliation; clients own their own retry policy.
|
|
||
| 1. **A4** — additive wire + SDK alias. Smallest, unblocked. | ||
| 2. **A1 — `current_model_update` via `extNotification`** (shipped as #4546 core) — `Session.setModel` emits the `extNotification`; the demux in `BridgeClient.extNotification()` (`bridgeClient.ts:491`) promotes it to `model_switched`. Core path + per-type suppress + observability done in #4546; **the §2.3 state cache + staleness check + §2.2 reconciliation are the A1 follow-up** (they need the cache fields). | ||
| - **2b. §2.3 bridge state cache** — add `currentModelId`/`currentApprovalMode`/`availableCommands` to `SessionEntry`, updated at every publish + seeded on create. Prerequisite for the A1 staleness/reconciliation follow-up AND for A5. |
There was a problem hiding this comment.
[Suggestion] §7 sequencing sub-steps lack "2a"/"3a" labels and dependency notes
Steps 2 and 3 use sub-step labels "2b" and "2c" (and "3b"/"3c") without a "2a" or "3a". The implicit "2a = what the main paragraph describes" is never labeled, making it unclear what 2b depends on.
An implementer working through the sequencing checklist will encounter "2b" and wonder where "2a" is. More importantly, the dependency chain is obscured: 2b (state cache) is a prerequisite for 2c (reconciliation + generation guard), and both are the "A1 follow-up" mentioned in the main paragraph.
Either label the main paragraph's deliverable as "2a. Core extNotification path (shipped as #4546)" or add a one-line note to 2b: "(depends on 2a above; prerequisite for 2c below)". Apply the same fix to 3/3b/3c.
— qwen3.7-max via Qwen Code /review
There was a problem hiding this comment.
Addressed in v13. §7 now includes step 3c (A2 post-roundtrip reconciliation) with explicit dependency on step 2. §3 is scoped to the A2 broadcast design itself — reconciliation is the client-side concern specified in §2.2.
| // on; older daemons without this PR omit the tag and SDKs that | ||
| // post-PR feature-detect on it stay backward compatible. | ||
| // | ||
| // F2 (#4175 commit 5): `mcpPoolActive` advertises |
There was a problem hiding this comment.
[Suggestion] Unstable internal reference: F2 (#4175 commit 5)
commit 5 references a sequential commit index within PR #4175 — an ephemeral identifier that changes on rebase and disappears on squash-merge. Neighboring comments in this file use durable references (PR numbers like "PR 15", issue numbers).
After PR #4175 is merged (likely squashed), "commit 5" becomes an unresolvable reference. Replace with a durable reference — either the PR number alone (#4175) or a brief description of the feature:
| // F2 (#4175 commit 5): `mcpPoolActive` advertises | |
| // F2 (#4175): `mcpPoolActive` advertises | |
| // `mcp_workspace_pool` + `mcp_pool_restart` together. Defaults | |
| // to `true` when omitted so daemons that don't explicitly set | |
| // the option still advertise the F2 surface; operators flip it | |
| // to `false` only when `QWEN_SERVE_NO_MCP_POOL=1` is in scope. |
— qwen3.7-max via Qwen Code /review
There was a problem hiding this comment.
Acknowledged — these internal references (F2 (#4175 commit 5)) will be replaced with stable section anchors before merge. Kept during review for traceability.
|
|
||
| For the deeper "what we won't fix in Stage 1" enumeration (single-host session-state mutation model + N-parallel-sessions sharing one ACP child), see [Stage 1 scope boundaries](#stage-1-scope-boundaries--what-we-wont-fix-in-stage-15) below. | ||
|
|
||
| ## v0.16-alpha known limits |
There was a problem hiding this comment.
[Critical] Duplicate "v0.16-alpha known limits" section — still present at HEAD
The ## v0.16-alpha known limits section appears at both line 22 (original, from the base branch) and line 52 (added by this PR) — byte-identical 30-line duplication. Verified: grep -n "## v0.16-alpha known limits" docs/users/qwen-serve.md returns both line 22 and line 52 at commit 04756583a.
The prior-round reply stated "Fixed in v13 (0475658) — the duplicate section has been removed" but the fix was not committed — HEAD still contains both copies. Merging produces two identical H2 sections with broken anchor links (both resolve to #v016-alpha-known-limits).
Remove the PR's added lines 52–81. The section already exists in the base branch.
— qwen3.7-max via Qwen Code /review
|
|
||
| Non-goals: multimodal user-content echo (PR #4353 §D), the A3 race fix (PR #4510), clientId anti-forgery (A6), the streamable-HTTP transport (#4472). | ||
|
|
||
| **Anchor convention:** full repo-root paths. |
There was a problem hiding this comment.
[Suggestion] Line-number anchors have drifted 65–137 lines from stated positions — use function-name anchors instead
The design uses hard-coded file:line references throughout. Verified drift against the current codebase:
| Design anchor | Actual line | Drift |
|---|---|---|
bridge.ts:2883 |
2948 | +65 |
bridge.ts:2979 |
3080 | +101 |
bridge.ts:2784 |
2871 | +87 |
Session.ts:1625 |
1646 | +21 |
normalizer.ts:754 |
sdk-typescript/.../normalizer.ts:765 |
+11 + wrong package |
...and 7 more locations with similar drift.
A future engineer Ctrl+G'ing these lands in unrelated code. The normalizer.ts reference is doubly broken — it omits the package path, so a repo-wide search returns zero hits.
Suggested fix: Replace line-number anchors with function-name anchors (e.g., bridge.ts → setSessionModel → entry.events.publish({type:'model_switched'})). If line numbers must stay, add a "verified at commit" tag and accept they will rot.
— qwen3.7-max via Qwen Code /review
| - **Generic-wrapper suppression:** a promoted sub-type publishes the named event only — **except during the dual-emit transition window (below)**. | ||
| - **Dual-emit transition (IDE-companion lockstep, see §6):** because the daemon and the VS Code IDE companion ship on different channels and can't flip atomically, the FIRST release of `current_mode_update` promotion publishes **both** the promoted `approval_mode_changed` AND the legacy generic `session_update{sessionUpdate:'current_mode_update'}` for one release cycle. The IDE companion's existing `case 'current_mode_update'` keeps working; once its `approval_mode_changed` handler ships, the next release drops the dual-emit. `current_model_update` is brand-new (no legacy consumer) so it promotes directly without dual-emit. **Removal is enforced, not left to memory:** a `TODO(dual-emit-removal)` comment at the dual-emit publish site references this section, and §7 step 3 carries a tracking issue with a target release — so the redundant generic wrapper can't silently become permanent (and no new consumer should build on it). | ||
| - **Observability (required, not optional):** emit a structured log at every demux decision — `[demux] session=<id> type=<sub> action=promoted|dropped|suppressed|generic reason=<why>`. `BridgeClient.sessionUpdate()` has zero logging today; the `dropped` case especially must be visible so oncall can distinguish "agent didn't emit" / "demux dropped" / "SSE lost". | ||
| - **Unknown sub-types:** unchanged (generic `session_update`). |
There was a problem hiding this comment.
[Suggestion] publishModelSwitched originatorClientId claim contradicts shipped A1 implementation
The design states:
"Bridge roundtrip and
applyModelServiceIdpass the resolvedoriginatorClientId; demux promotion and reconciliation corrective pass none (no single client drove the change)."
But the shipped A1 implementation (bridgeClient.ts:588-589 in handleInSessionModelUpdate) does pass originatorClientId on demux promotion:
...(entry.activePromptOriginatorClientId
? { originatorClientId: entry.activePromptOriginatorClientId }
: {}),This stamps the prompt originator's clientId, not "none." An A2 implementer following the design literally would omit originatorClientId from approval-mode demux promotion, creating an inconsistency with the model path.
Suggested fix: Update to: "demux promotion passes entry.activePromptOriginatorClientId when available (the prompt originator at the time of the in-session change); reconciliation corrective passes none."
— qwen3.7-max via Qwen Code /review
|
|
||
| ### 2.3 Bridge state cache (synchronous source of "current" model/mode/commands) | ||
|
|
||
| The staleness check (§2 item 4), §2.2 reconciliation, and A5's snapshot (§5) all need the session's **current** model / approval-mode / commands. The bridge had no synchronous accessor — only `getSessionContextStatus` (`bridge.ts:2784` → `requestSessionStatus`, an async `extMethod` roundtrip), and an `await` there reopens the very TOCTOU window these mechanisms close. So: |
There was a problem hiding this comment.
[Suggestion] Unconditional reconciliation on the success path wastes an IPC roundtrip
Reconciliation fires on .finally of every bridge model roundtrip — success and failure. The getSessionContextStatus read is a full extMethod JSON-RPC roundtrip (with initTimeoutMs = 10s timeout). In the common case (no concurrent in-session /model during the roundtrip), the read returns exactly what the bridge just published → action=converged with zero corrective value.
Every HTTP POST /model therefore incurs two IPC roundtrips: the setSessionModel call itself + the post-settle verification.
Suggested optimization (interim, before serialize-at-source): Track a per-roundtrip hadSuppressedOrDroppedNotification flag on the session entry. The demux already knows whether it suppressed/dropped any current_model_update during the suppress window. On the success path, skip reconciliation when the flag is false. On the failure path, keep unconditional reconciliation. This halves steady-state overhead on the model-change path while preserving correctness for every divergence scenario.
— qwen3.7-max via Qwen Code /review
|
|
||
| For the deeper "what we won't fix in Stage 1" enumeration (single-host session-state mutation model + N-parallel-sessions sharing one ACP child), see [Stage 1 scope boundaries](#stage-1-scope-boundaries--what-we-wont-fix-in-stage-15) below. | ||
|
|
||
| ## v0.16-alpha known limits |
There was a problem hiding this comment.
[Critical] Duplicate "v0.16-alpha known limits" section — still present at HEAD
The ## v0.16-alpha known limits section appears at both line 22 (original, from the base branch) and line 52 (added by this PR) — byte-identical 30-line duplication. Verified: grep -n "## v0.16-alpha known limits" docs/users/qwen-serve.md returns both line 22 and line 52 at commit 04756583a.
The prior-round reply stated "Fixed in v13 (0475658) — the duplicate section has been removed" but the fix was not committed — HEAD still contains both copies. Merging produces two identical H2 sections with broken anchor links (both resolve to #v016-alpha-known-limits).
Remove the PR's added lines 52–81. The section already exists in the base branch.
— qwen3.7-max via Qwen Code /review
|
|
||
| Non-goals: multimodal user-content echo (PR #4353 §D), the A3 race fix (PR #4510), clientId anti-forgery (A6), the streamable-HTTP transport (#4472). | ||
|
|
||
| **Anchor convention:** full repo-root paths. |
There was a problem hiding this comment.
[Suggestion] Line-number anchors have drifted 65–137 lines from stated positions — use function-name anchors instead
The design uses hard-coded file:line references throughout. Verified drift against the current codebase:
| Design anchor | Actual line | Drift |
|---|---|---|
bridge.ts:2883 |
2948 | +65 |
bridge.ts:2979 |
3080 | +101 |
bridge.ts:2784 |
2871 | +87 |
Session.ts:1625 |
1646 | +21 |
normalizer.ts:754 |
sdk-typescript/.../normalizer.ts:765 |
+11 + wrong package |
...and 7 more locations with similar drift.
A future engineer Ctrl+G'ing these lands in unrelated code. The normalizer.ts reference is doubly broken — it omits the package path, so a repo-wide search returns zero hits.
Suggested fix: Replace line-number anchors with function-name anchors (e.g., bridge.ts → setSessionModel → entry.events.publish({type:'model_switched'})). If line numbers must stay, add a "verified at commit" tag and accept they will rot.
— qwen3.7-max via Qwen Code /review
| - **Generic-wrapper suppression:** a promoted sub-type publishes the named event only — **except during the dual-emit transition window (below)**. | ||
| - **Dual-emit transition (IDE-companion lockstep, see §6):** because the daemon and the VS Code IDE companion ship on different channels and can't flip atomically, the FIRST release of `current_mode_update` promotion publishes **both** the promoted `approval_mode_changed` AND the legacy generic `session_update{sessionUpdate:'current_mode_update'}` for one release cycle. The IDE companion's existing `case 'current_mode_update'` keeps working; once its `approval_mode_changed` handler ships, the next release drops the dual-emit. `current_model_update` is brand-new (no legacy consumer) so it promotes directly without dual-emit. **Removal is enforced, not left to memory:** a `TODO(dual-emit-removal)` comment at the dual-emit publish site references this section, and §7 step 3 carries a tracking issue with a target release — so the redundant generic wrapper can't silently become permanent (and no new consumer should build on it). | ||
| - **Observability (required, not optional):** emit a structured log at every demux decision — `[demux] session=<id> type=<sub> action=promoted|dropped|suppressed|generic reason=<why>`. `BridgeClient.sessionUpdate()` has zero logging today; the `dropped` case especially must be visible so oncall can distinguish "agent didn't emit" / "demux dropped" / "SSE lost". | ||
| - **Unknown sub-types:** unchanged (generic `session_update`). |
There was a problem hiding this comment.
[Suggestion] publishModelSwitched originatorClientId claim contradicts shipped A1 implementation
The design states:
"Bridge roundtrip and
applyModelServiceIdpass the resolvedoriginatorClientId; demux promotion and reconciliation corrective pass none (no single client drove the change)."
But the shipped A1 implementation (bridgeClient.ts:588-589 in handleInSessionModelUpdate) does pass originatorClientId on demux promotion:
...(entry.activePromptOriginatorClientId
? { originatorClientId: entry.activePromptOriginatorClientId }
: {}),This stamps the prompt originator's clientId, not "none." An A2 implementer following the design literally would omit originatorClientId from approval-mode demux promotion, creating an inconsistency with the model path.
Suggested fix: Update to: "demux promotion passes entry.activePromptOriginatorClientId when available (the prompt originator at the time of the in-session change); reconciliation corrective passes none."
— qwen3.7-max via Qwen Code /review
|
|
||
| ### 2.3 Bridge state cache (synchronous source of "current" model/mode/commands) | ||
|
|
||
| The staleness check (§2 item 4), §2.2 reconciliation, and A5's snapshot (§5) all need the session's **current** model / approval-mode / commands. The bridge had no synchronous accessor — only `getSessionContextStatus` (`bridge.ts:2784` → `requestSessionStatus`, an async `extMethod` roundtrip), and an `await` there reopens the very TOCTOU window these mechanisms close. So: |
There was a problem hiding this comment.
[Suggestion] Unconditional reconciliation on the success path wastes an IPC roundtrip
Reconciliation fires on .finally of every bridge model roundtrip — success and failure. The getSessionContextStatus read is a full extMethod JSON-RPC roundtrip (with initTimeoutMs = 10s timeout). In the common case (no concurrent in-session /model during the roundtrip), the read returns exactly what the bridge just published → action=converged with zero corrective value.
Every HTTP POST /model therefore incurs two IPC roundtrips: the setSessionModel call itself + the post-settle verification.
Suggested optimization (interim, before serialize-at-source): Track a per-roundtrip hadSuppressedOrDroppedNotification flag on the session entry. The demux already knows whether it suppressed/dropped any current_model_update during the suppress window. On the success path, skip reconciliation when the flag is false. On the failure path, keep unconditional reconciliation. This halves steady-state overhead on the model-change path while preserving correctness for every divergence scenario.
— qwen3.7-max via Qwen Code /review
|
|
||
| Non-goals: multimodal user-content echo (PR #4353 §D), the A3 race fix (PR #4510), clientId anti-forgery (A6), the streamable-HTTP transport (#4472). | ||
|
|
||
| **Anchor convention:** full repo-root paths. |
There was a problem hiding this comment.
[Suggestion] Stale file:line references throughout the design doc — 6 of 7 spot-checked anchors are off by 20–101 lines from the actual code on daemon_mode_b_main:
| Design says | Actual | Drift |
|---|---|---|
bridge.ts:2784 (getSessionContextStatus) |
:2871 |
+87 |
bridge.ts:2883 (setSessionModel publish) |
:2948 |
+65 |
bridge.ts:2836-2840 (FIXME zombie ACP) |
:2917-2926 |
+80 |
bridge.ts:2844-2849 (withTimeout) |
:2938-2942 |
+90 |
bridge.ts:2979 (approval_mode_changed) |
:3080 |
+101 |
Session.ts:1625 (sendCurrentModeUpdateNotification) |
:1645 |
+20 |
bridge.ts:1172 (applyModelServiceId) |
:1172 |
✅ |
Implementers following these anchors will land in the wrong code region — in the worst cases they'll see completely unrelated code (preflight cells, MissingCliEntryError). Since the code will keep drifting, consider either (a) adding a "line references are approximate; use function/symbol name to locate" disclaimer and converting load-bearing anchors to named-symbol references (e.g., bridge.ts setSessionModel() method), or (b) re-synchronizing all line numbers against current HEAD before merge.
— qwen3.7-max via Qwen Code /review
| - **Double-emit edge (§3):** concurrent `/approval-mode` + `ProceedAlways` both emit; reducer converges. | ||
| - **Non-recursion structural guard (§2.2):** while reconciliation is in flight (`reconciliationInFlight === true`), a concurrent promotion that would trigger reconciliation is **skipped** (`action=skipped-reentrant`); the flag resets after the in-flight reconciliation settles regardless of outcome. Additionally: after a reconciliation corrective `model_switched` fires, assert `getSessionContextStatus` is invoked **exactly once** for the triggering settle event — the corrective publish does NOT re-enter the reconciliation path (bounded call count). | ||
| - **Failure-path converged (§2.2):** `model_switch_failed` fires → reconciliation reads `getSessionContextStatus` → returns `entry.currentModelId` (unchanged) → no corrective emitted (`action=converged`); bus state unchanged. | ||
| - **Generation counter values (§2.3):** after a promote → reconciliation → corrective sequence, `entry.modelPublishGeneration` equals `gen_before + 2` (one for the initial promote, one for the corrective); `gen_before`/`gen_after` logged in observability match the counter values at entry/exit of reconciliation. |
There was a problem hiding this comment.
[Suggestion] Off-by-one in the generation counter test assertion. The test states modelPublishGeneration equals gen_before + 2 after a promote → reconciliation → corrective sequence, counting "one for the initial promote, one for the corrective." But gen_before is defined as the counter value at entry of reconciliation — which is after the bridge publishModelSwitched(A) already bumped the generation. Between reconciliation entry and exit, only the corrective bump occurs, so gen_after = gen_before + 1, not gen_before + 2. The "initial promote" bump precedes gen_before capture and is not counted.
| - **Generation counter values (§2.3):** after a promote → reconciliation → corrective sequence, `entry.modelPublishGeneration` equals `gen_before + 2` (one for the initial promote, one for the corrective); `gen_before`/`gen_after` logged in observability match the counter values at entry/exit of reconciliation. | |
| - **Generation counter values (§2.3):** after a promote → reconciliation → corrective sequence, `entry.modelPublishGeneration` equals `gen_before + 1` (the corrective bump only; the bridge publish bump precedes `gen_before` capture); `gen_before`/`gen_after` logged in observability match the counter values at entry/exit of reconciliation. |
— qwen3.7-max via Qwen Code /review
| - Add to `SessionEntry`: `currentModelId?: string`, `currentApprovalMode?: ApprovalMode`, `availableCommands?: AvailableCommand[]`. | ||
| - **Update synchronously at every publish site**, in the same synchronous turn as the publish (no `await` between read-of-old and write-of-new): all `model_switched` publishes go through the §2.2 `publishModelSwitched` helper (which atomically updates `entry.currentModelId` + bumps `entry.modelPublishGeneration` + publishes to bus); `approval_mode_changed` (`:2979` / `:3007`) updates `entry.currentApprovalMode`; `availableCommands` is updated in `BridgeClient.sessionUpdate()` when it receives an `available_commands_update` generic sessionUpdate — the handler sets `entry.availableCommands = payload.commands` synchronously **before** the generic forwarding publish. The helper guarantees no publish site can miss a cache or generation update. | ||
| - **`availableCommands` specifics (v13):** type is `AvailableCommand[]` (matching `status.ts`). Unlike model/mode, this field has **no named promoted bus event** and **no reconciliation** — it's a passive cache, updated by the generic `session_update` path. If the implementer misses the hook, A5's snapshot serves stale/undefined commands with no backstop. The trigger path is explicitly `BridgeClient.sessionUpdate()` → check `params.type === 'available_commands_update'` → update cache → forward as generic `session_update`. | ||
| - **Seed** from the `createSession` / `loadSession` ACP response when the entry is created (initial model/mode), before any change occurs. |
There was a problem hiding this comment.
[Suggestion] The cache seed mechanism references data that doesn't exist in the ACP response. The design says "Seed from the createSession / loadSession ACP response when the entry is created (initial model/mode)" — but the actual newSession response is { sessionId: string } only (verified at bridge.ts:1068). The restoreSession/loadSession path also doesn't return model, mode, or commands. createSessionEntry (bridge.ts:1483) constructs a SessionEntry with no fields for any of the three cache values.
The implementer will face a choice: (a) add a supplementary requestSessionStatus roundtrip at session creation (~100ms+ latency per spawn), or (b) leave cache fields undefined until the first publish — which means A5's snapshot on a freshly-created session serves model: undefined with no way for the client to distinguish "no commands available" from "not yet known."
Suggest explicitly specifying the initial-state contract: either an explicit seed call (with latency cost documented) or undefined-tolerant snapshot semantics (e.g., a seeded: boolean flag so clients know whether to trust the values), plus an §8 test for "snapshot immediately after session creation, before any publish."
— qwen3.7-max via Qwen Code /review
|
|
||
| **Generation guard (v10 — closes the read-window TOCTOU):** between settle and the async read returning, a concurrent in-session `/model C` can promote `model_switched(C)`; the in-flight read (issued before C) returns the pre-C value and reconciliation would clobber C. Fix: a per-session `modelPublishGeneration` is bumped on **every** `model_switched` publish (bridge / demux promotion / reconciliation corrective) — exclusively via the `publishModelSwitched` helper (v11). Reconciliation captures the generation **before** the read and **skips the corrective if it advanced** during the read — a newer authoritative publish already landed, so the bus is current. | ||
|
|
||
| **`publishModelSwitched` helper (v11/v12 — enforcement mechanism):** a single function `publishModelSwitched(entry, modelId, opts?: { originatorClientId?: string })` that atomically (one synchronous turn): (1) sets `entry.currentModelId = modelId`, (2) increments `entry.modelPublishGeneration`, (3) publishes `model_switched` to the bus (with `originatorClientId` if provided). **All** `model_switched` publish sites — bridge roundtrip success, `applyModelServiceId`, demux promotion, reconciliation corrective — MUST route through this helper. Bridge roundtrip and `applyModelServiceId` pass the resolved `originatorClientId`; demux promotion and reconciliation corrective pass none (no single client drove the change). Direct `events.publish({type:'model_switched', ...})` is forbidden outside the helper. This makes it impossible to miss a generation bump or silently drop client attribution, and a test invariant can assert: after any code path that produces a `model_switched`, the generation advanced by exactly 1. |
There was a problem hiding this comment.
[Suggestion] The shipped A1 implementation (#4546) contradicts the design's originatorClientId rule for demux promotion. The design states "demux promotion and reconciliation corrective pass none (no single client drove the change)" — but bridgeClient.ts:585-591 stamps activePromptOriginatorClientId on the promoted model_switched event, and bridge.test.ts:6276 asserts this behavior.
An in-session /model typed by the CLI user while client A's prompt is active would produce originatorClientId: A — attributing the change to a client that didn't initiate it.
Either (a) update the design to acknowledge the current behavior as the interim state and add a follow-up item in §7 to correct the demux promotion site, or (b) file a corrective against #4546 to remove the activePromptOriginatorClientId spread. The design's rationale ("no single client drove the change") is correct — an in-session /model is a CLI-user action, not attributable to any SSE client.
— qwen3.7-max via Qwen Code /review
|
|
||
| `current_mode_update` exists today (`Session.ts:1645`; helper `sendCurrentModeUpdateNotification` at `Session.ts:1625`) but is wired only to tool-confirmation paths — `exit_plan_mode` (`Session.ts:2160`) and edit-tool `ProceedAlways` (`Session.ts:2168`) — not the generic `Session.setMode`/`setModel`. There is no `current_model_update` type. Both flow to the bus today via `BridgeClient.sessionUpdate()` (`bridgeClient.ts:397`) as a **generic `session_update`** with no sub-type demux. | ||
|
|
||
| ### 1.1 Coordination model (the load-bearing decision) |
There was a problem hiding this comment.
[Suggestion] Stale prerequisite claims: PR #4510 (approvalModeQueue) has already merged to daemon_mode_b_main (commit 83765c9cc). The queue is fully wired — bridge.ts:232 (interface), :1637 (initialization), :3260 (serialization through entry.approvalModeQueue.then(...)). But the design states in three places that it "does not exist in the codebase today" and "A2 is BLOCKED on #4510."
Similarly, A4 shipped as #4539 and A1 shipped as #4546. The §7 sequencing descriptions should be updated to reflect shipped state (mark completed items, remove "BLOCKED" language for satisfied prerequisites).
Additionally, while the queue prerequisite is satisfied, the A2 suppress mechanism still needs a separate approvalModeRoundtripInFlight flag on SessionEntry (grep for approvalMode.*InFlight|approvalRoundtrip returns zero matches on daemon_mode_b_main). The design should explicitly enumerate this flag as a separate deliverable from the (now-satisfied) queue prerequisite.
— qwen3.7-max via Qwen Code /review
| - **Payload (v13):** `reconciliation_failed { sessionId: string, error: string, retryCount: number, trigger: 'roundtrip-settled' | 'failed' }`. The `error` distinguishes "agent process crashed" from "JSON-RPC timeout" for consumer UX and oncall diagnostics. | ||
| - **Consumer contract:** advisory — clients MAY surface a transient warning and MAY trigger their own `getSessionContextStatus` pull to self-heal. No mandatory handler; absent consumers, the bus state remains as-last-published (stale but non-terminal). | ||
| - **Per-attempt logging:** each retry attempt emits its own log line: `[reconcile] session=<id> attempt=<n>/<max> error=<msg>`, so oncall can distinguish transient from sustained failure without needing the final aggregated event. | ||
|
|
There was a problem hiding this comment.
[Suggestion] Reconciliation treats all getSessionContextStatus read errors uniformly, but a session torn down during the async read produces a terminal "Resource not found" error (JSON-RPC -32002). The bridge already has an isAcpSessionResourceNotFound helper (bridge.ts:1514-1541) that detects exactly this condition.
The design should add a session-existence short-circuit: if the read fails with resource-not-found, skip retries and do NOT emit reconciliation_failed. Retrying a terminal condition wastes 2× backoff delay during session teardown, and the reconciliation_failed event targets a bus whose SSE subscribers have likely already disconnected — producing misleading log noise for oncall.
— qwen3.7-max via Qwen Code /review
| 4. **No serialization primitive yet.** `approvalModeQueue` **does not exist** in the codebase today; the approval-mode HTTP path (`bridge.ts:2893-3020`) runs extMethod + publish inline with no per-session queue (unlike the model path's `modelChangeQueue`). The suppress/race window is therefore unbounded until #4510 lands it. | ||
|
|
||
| ### Proposed design | ||
|
|
There was a problem hiding this comment.
[Suggestion] The A2 suppress mechanism requires a per-session approvalModeRoundtripInFlight flag, distinct from the (now-merged) approvalModeQueue. The model path has modelRoundtripInFlight?: boolean on SessionEntry (bridge.ts:222), set true before the ACP roundtrip and cleared in .finally. The A1 demux reads this flag to suppress promotion. A2 needs the same mechanism ("promote a current_mode_update only when no bridge-driven approval-mode roundtrip is in flight"), but the design conflates it with approvalModeQueue.
The queue provides serialization (one roundtrip at a time), but the demux needs a boolean flag visible to BridgeClient.sessionUpdate() to decide whether to suppress. Grep confirms: approvalModeRoundtripInFlight / approvalMode.*InFlight returns zero matches. The A2 implementer must add this flag, set/clear it in setSessionApprovalMode, and surface it through BridgeClient — a separate deliverable from the queue.
— qwen3.7-max via Qwen Code /review
wenshao
left a comment
There was a problem hiding this comment.
Independent automated pass (design doc + the 6-line server.ts comment). 4 new suggestions inline below.
Verdict basis — Request changes: the ## v0.16-alpha known limits section flagged at docs/users/qwen-serve.md:52 is still present at HEAD (04756583a) — the base branch has 1 occurrence, this PR's HEAD has 2 (verified by git diff origin/daemon_mode_b_main...HEAD, which adds a byte-identical second copy). The "Fixed in v13 (04756583a) — duplicate section removed" reply does not match the tree at that SHA; removing the second copy (lines 52–81) clears the block. The new inline items below are design-accuracy / observability suggestions, not blockers.
— claude-opus-4-8 via Claude Code /qreview
|
|
||
| - Add to `SessionEntry`: `currentModelId?: string`, `currentApprovalMode?: ApprovalMode`, `availableCommands?: AvailableCommand[]`. | ||
| - **Update synchronously at every publish site**, in the same synchronous turn as the publish (no `await` between read-of-old and write-of-new): all `model_switched` publishes go through the §2.2 `publishModelSwitched` helper (which atomically updates `entry.currentModelId` + bumps `entry.modelPublishGeneration` + publishes to bus); `approval_mode_changed` (`:2979` / `:3007`) updates `entry.currentApprovalMode`; `availableCommands` is updated in `BridgeClient.sessionUpdate()` when it receives an `available_commands_update` generic sessionUpdate — the handler sets `entry.availableCommands = payload.commands` synchronously **before** the generic forwarding publish. The helper guarantees no publish site can miss a cache or generation update. | ||
| - **`availableCommands` specifics (v13):** type is `AvailableCommand[]` (matching `status.ts`). Unlike model/mode, this field has **no named promoted bus event** and **no reconciliation** — it's a passive cache, updated by the generic `session_update` path. If the implementer misses the hook, A5's snapshot serves stale/undefined commands with no backstop. The trigger path is explicitly `BridgeClient.sessionUpdate()` → check `params.type === 'available_commands_update'` → update cache → forward as generic `session_update`. |
There was a problem hiding this comment.
[Suggestion] §2.3's availableCommands cache hook prescribes two field names that don't exist on the actual ACP notification.
BridgeClient.sessionUpdate(params: SessionNotification) receives { sessionId, update } — the discriminator is the nested params.update.sessionUpdate, never params.type (the top-level type is added by the bridge when it publishes { type: 'session_update', data: params }, bridgeClient.ts:407-409). The commands array is params.update.availableCommands, not payload.commands (Session.ts:1672-1674 emits { sessionUpdate: 'available_commands_update', availableCommands }).
An implementer coding this verbatim writes if (params.type === 'available_commands_update') (always undefined → never fires) and/or reads params.update.commands (undefined), so entry.availableCommands stays unseeded. This is the one §2.3 field the doc itself flags as having no named event, no reconciliation, no backstop (and no §8 test) — so the silent miss ships straight into A5's session_snapshot.
Corrected shape (also fix the line-210 wording payload.commands → params.update.availableCommands):
// in BridgeClient.sessionUpdate(params):
if (params.update.sessionUpdate === 'available_commands_update') {
entry.availableCommands = params.update.availableCommands; // sync, before the generic forward
}— claude-opus-4-8 via Claude Code /qreview
|
|
||
| **Read-error: bounded retry, then surface.** A transient `getSessionContextStatus` failure must not leave the bus permanently diverged with only a log line. Retry 1–2× with short backoff; if all fail, emit a `reconciliation_failed` bus event and log `action=read-error`. | ||
|
|
||
| - **Payload (v13):** `reconciliation_failed { sessionId: string, error: string, retryCount: number, trigger: 'roundtrip-settled' | 'failed' }`. The `error` distinguishes "agent process crashed" from "JSON-RPC timeout" for consumer UX and oncall diagnostics. |
There was a problem hiding this comment.
[Suggestion] The reconciliation_failed payload is { sessionId, error: string, retryCount, trigger } and the doc says error "distinguishes 'agent process crashed' from 'JSON-RPC timeout'." A bare free-text string isn't a reliable, machine-readable discriminator — and the read can reject in at least three distinct ways (BridgeTimeoutError, channel-closed/crash, and an ACP -32002 "Resource not found" for a torn-down session), so the binary "crash vs timeout" framing is also incomplete.
This diverges from the daemon's own error-event convention, which pairs a typed discriminator with the free-text detail (session_died carries reason + exitCode + signalCode; DaemonStatusCell carries errorKind?: DaemonErrorKind alongside error). Suggest a typed field:
reconciliation_failed { sessionId, errorKind: 'timeout' | 'transport_closed' | 'session_gone' | 'other', error: string, retryCount, trigger }
Then the "distinguishes crash from timeout" claim becomes structurally true. (Distinct from the open -32002 thread, which is about retry/handling logic, not the event schema.)
— claude-opus-4-8 via Claude Code /qreview
|
|
||
| **Non-recursion rule (v11/v12 — structurally enforced):** the reconciliation corrective calls `publishModelSwitched` (a local bus publish) and does **NOT** schedule a subsequent reconciliation. If an implementer factors `publishModelSwitched` through a wrapper that also attaches `.finally` reconciliation, the result is an infinite corrective loop (reconcile → read → publish → reconcile → …). Each corrective bumps the generation, but each new reconciliation reads the agent and may find divergence (the corrective updates the _bus_, not the _agent_). **Structural guard (v12):** a per-session `reconciliationInFlight: boolean` flag is set `true` before the async read and cleared after (in `.finally`). The roundtrip-settle `.finally` checks this flag before scheduling reconciliation; if `true`, it logs `[reconcile] session=<id> action=skipped-reentrant` and returns. This makes non-recursion invariant under refactoring — it cannot be defeated by call-graph reorganization. The `publishModelSwitched` helper itself has no side-effects beyond items (1)–(3). | ||
|
|
||
| **Read-error: bounded retry, then surface.** A transient `getSessionContextStatus` failure must not leave the bus permanently diverged with only a log line. Retry 1–2× with short backoff; if all fail, emit a `reconciliation_failed` bus event and log `action=read-error`. |
There was a problem hiding this comment.
[Suggestion] Reconciliation deliberately fires on the failure path (line 185: "the timeout/failure case is exactly when the bus is most likely diverged") and then retries the read "1–2× with short backoff." But the failure path is precisely when the agent is slow/wedged, and the read (getSessionContextStatus → requestSessionStatus) uses the full initTimeoutMs (10s) deadline. Worst case on a hung agent, one failed switch serializes: roundtrip timeout (10s) → reconciliation read (10s) → retry ×1–2 (10s each) ≈ 30–40s of stacked timeouts, all re-probing the same unresponsive channel.
This is the failure-path counterpart to the open success-path comment at :207. Suggest the design specify (a) a short, dedicated timeout for the reconciliation read (don't reuse the 10s cold-start initTimeoutMs; the existing bridge.ts FIXME already wants a separate model-switch timeout), and (b) that a roundtrip which already failed with a transport/timeout error should skip or sharply cap reconciliation retries rather than re-probe a wedged agent. Document the worst-case added latency.
— claude-opus-4-8 via Claude Code /qreview
| 3. **`model_switch_failed` stays bridge-only** — `Session.setModel` throws with no notification; the bridge keeps publishing it on both failure paths. | ||
| 4. **Timeout-race (best-effort demux drop + authoritative reconciliation backstop — v9).** The bridge's `withTimeout` (`bridge.ts:2844-2849`) can reject (publishing `model_switch_failed(A)`) while A's ACP call keeps running (FIXME `bridge.ts:2836-2840`). If a change B then succeeds (`model_switched(B)`) and A's call finally completes, A's late `current_model_update(A)` must not make A the apparent final state. **Value comparison alone can't decide** this (a late stale `A` and a fresh switch to `A` look identical — a distributed-ordering problem). So: the demux does a **best-effort dedup** (drops a `current_model_update` whose `currentModelId` already equals `entry.currentModelId` — a redundant no-op), and the **authoritative correctness comes from §2.2 reconciliation**: a timed-out earlier change always corresponds to a _settled bridge roundtrip_, which triggers a post-settle authoritative read that re-publishes the agent's true model. No agent-side sequence counter required. | ||
|
|
||
| **Residual gap — zombie roundtrip (v13).** Reconciliation covers the _first_ settlement (the timeout), but a zombie ACP call that completes **after** reconciliation has already fired `action=converged` is NOT covered: the agent applies the timed-out model late → emits `current_model_update(A)` → demux promotes it (no roundtrip in flight, not a dup) → bus silently reverts to A, contradicting the user's successful switch to B. The long-term fix is an ACP cancel signal (the existing FIXME at `bridge.ts:2836-2840`). Until then this is a **known residual race** under the narrow condition: timeout fires, reconciliation converges (agent hasn't applied yet), user successfully switches to B, THEN the zombie completes. Likelihood is low (requires the agent to take longer than the timeout + reconciliation read + a subsequent successful switch), but it is not zero. Document it here rather than claim reconciliation fully eliminates the timeout race. |
There was a problem hiding this comment.
[Suggestion] The zombie-roundtrip residual race is documented here as a known race the design can't yet eliminate (bus silently reverts B→A after reconciliation has already converged), but it has no observability hook and no §8 note. Every other demux/reconcile decision logs a structured action=; this one — a late current_model_update(A) promoted after reconciliation settled — is exactly the silent bus-revert that, when it fires, leaves oncall nothing to correlate.
Suggest (a) tagging the demux promote that occurs with no roundtrip-in-flight shortly after a recent settle so the revert is greppable (e.g. [demux] action=promoted reason=post-settle-late), or at least a counter; and (b) a one-line §8 note that the zombie race is a documented residual with no automated test (pending the ACP cancel signal, FIXME bridge.ts:2836-2840). Note also: any claim that the generation guard catches this is inaccurate — the guard only acts inside the reconciliation read window, not on a promote that lands after reconciliation has already settled (per this paragraph's own "no roundtrip in flight, not a dup").
— claude-opus-4-8 via Claude Code /qreview
wenshao
left a comment
There was a problem hiding this comment.
Two new design-accuracy items are inline below (Suggestions). The reason for Request changes is not new:
[Critical] The duplicate ## v0.16-alpha known limits section is still present at HEAD. @chiga0 replied "Fixed in v13 (0475658) — the duplicate section has been removed", but commit 04756583a still contains two byte-identical copies of that section (file lines 22–51 and 52–81; the base branch e45d9b4 has exactly one). The +30 lines this PR adds to docs/users/qwen-serve.md are a verbatim re-insertion of the section already present at line 22. Already flagged by @wenshao and still open — re-confirmed here against HEAD (base = 1 occurrence, HEAD = 2), not re-posted inline to avoid a duplicate thread. Fix: delete the second copy.
The two inline items are design-doc accuracy gaps, not merge blockers on their own.
Minor (not filed, to keep the threads clean): §5's "old SDK surfaces the unknown frame as a debug UI event" — the SDK reducer silently no-ops unknown event types (events.ts), it does not project a debug event; and §2.3's demux pseudo-code params.type === 'available_commands_update' / payload.commands should be params.update.sessionUpdate / params.update.availableCommands per the ACP SessionNotification shape ({ sessionId, update: SessionUpdate }).
— claude-opus-4-8 via Claude Code /qreview
| - **Seed** from the `createSession` / `loadSession` ACP response when the entry is created (initial model/mode), before any change occurs. | ||
| - **Consumers (synchronous field reads):** | ||
| - **A5 snapshot (§5):** read all three fields in one synchronous block — the cache's primary purpose. | ||
| - **Best-effort demux dedup (§2.1):** drop a `current_model_update` whose `currentModelId` already equals `entry.currentModelId`. |
There was a problem hiding this comment.
[Suggestion] The §2.1 dedup and §2.2 reconciliation comparisons assume one model-id representation, but the paths they compare use two different forms — so the comparisons are representation-sensitive and will misfire:
- HTTP
setSessionModel/applyModelServiceIdpublishmodel_switchedwith the client-sent value, i.e. the formatted${id}(${authType})form (clients pick from the advertised list, built viaformatAcpModelId—acpModelUtils.ts). - The in-session demux (shipped in feat(daemon): in-session model switch reaches the bus (A1) #4546) publishes
currentModelId: parsed.modelId— the bare id (Session.setModelruns it throughparseAcpModelOption, stripping(authType)). - The reconciliation read
getSessionContextStatusreturnsformatAcpModelId(...)— formatted (acpAgent.ts).
So entry.currentModelId holds a bare value after any in-session change and a formatted value after an HTTP change / seed. Result: this best-effort dedup never fires across a bare↔formatted boundary; §2.2 reconciliation compares a formatted read against a possibly-bare entry.currentModelId, sees perpetual divergence, and fires a spurious corrective that re-seeds the formatted form → representation flip-flop on the bus. The v6 changelog's decision to drop authType from the current_model_update payload as "dead data" removes the field needed to normalize.
Define a single canonical representation for entry.currentModelId / the bus / the reconciliation read and normalize at every boundary (e.g. compare on parseAcpBaseModelId, or re-add authType so the bridge can formatAcpModelId consistently), and state it explicitly in §2.1/§2.3.
— claude-opus-4-8 via Claude Code /qreview
| - **Mitigation (§2.1 dual-emit):** the first release emits BOTH the legacy generic `session_update{current_mode_update}` AND the promoted `approval_mode_changed`; the IDE companion keeps working on the legacy frame; once its `approval_mode_changed` path ships, the next release drops the dual-emit. A4 (`voterClientId`) and A5 (opt-in frame) ARE additive (no transition needed). | ||
| - **Failure events stay bridge-only** (`model_switch_failed`). | ||
| - **Concurrent-in-session drift** is bounded by §2.2 post-roundtrip reconciliation. | ||
| - **SDK reducer updates** (naming, to avoid the A1/A2 mix-up): A1 introduces **`current_model_update`** → `model.changed`; A2 promotes **`current_mode_update`** → `approval_mode_changed`; A4 adds optional `voterClientId`; A5 seeds side-channel state from `session.snapshot`. |
There was a problem hiding this comment.
[Suggestion] This SDK-reducer enumeration omits reconciliation_failed. §2.2 introduces it as a new bus event with a consumer contract ("clients MAY surface a warning / trigger their own pull"), but it is not in the SDK's DAEMON_KNOWN_EVENT_TYPE_VALUES allowlist (packages/sdk-typescript/src/daemon/events.ts). asKnownDaemonEvent returns undefined for unlisted types and reduceDaemonSessionEvent no-ops them, so the frame is silently dropped before any SDK consumer can observe it — the self-heal contract is unreachable as specified. (Contrast session_snapshot, which §5/§6 do enumerate for the SDK.)
Add reconciliation_failed here alongside the A1/A2/A4/A5 entries, plus a DaemonReconciliationFailedData validator + allowlist entry + reducer case (mirroring model_switch_failed), and note it in §7 sequencing.
— claude-opus-4-8 via Claude Code /qreview
| - **Mitigation (§2.1 dual-emit):** the first release emits BOTH the legacy generic `session_update{current_mode_update}` AND the promoted `approval_mode_changed`; the IDE companion keeps working on the legacy frame; once its `approval_mode_changed` path ships, the next release drops the dual-emit. A4 (`voterClientId`) and A5 (opt-in frame) ARE additive (no transition needed). | ||
| - **Failure events stay bridge-only** (`model_switch_failed`). | ||
| - **Concurrent-in-session drift** is bounded by §2.2 post-roundtrip reconciliation. | ||
| - **SDK reducer updates** (naming, to avoid the A1/A2 mix-up): A1 introduces **`current_model_update`** → `model.changed`; A2 promotes **`current_mode_update`** → `approval_mode_changed`; A4 adds optional `voterClientId`; A5 seeds side-channel state from `session.snapshot`. |
There was a problem hiding this comment.
[Suggestion] model.changed is a typo — the rest of the document (38 occurrences) and the codebase consistently use model_switched as the bus event name. An implementer reading §6 for SDK reducer naming would use model.changed, which matches nothing in the actual SDK types or bus events.
| - **SDK reducer updates** (naming, to avoid the A1/A2 mix-up): A1 introduces **`current_model_update`** → `model.changed`; A2 promotes **`current_mode_update`** → `approval_mode_changed`; A4 adds optional `voterClientId`; A5 seeds side-channel state from `session.snapshot`. | |
| - **SDK reducer updates** (naming, to avoid the A1/A2 mix-up): A1 introduces **`current_model_update`** → `model_switched`; A2 promotes **`current_mode_update`** → `approval_mode_changed`; A4 adds optional `voterClientId`; A5 seeds side-channel state from `session.snapshot`. |
— qwen3.7-max via Qwen Code /review
| > | ||
| > Source: cross-client real-time sync audit (2026-05-24) + PR #4484 post-merge review (the **A-series** follow-ups). The bugfix/cleanup follow-ups from the same review ship separately (PR #4510) and are **out of scope here**. | ||
|
|
||
| ## Changelog |
There was a problem hiding this comment.
[Suggestion] The header (line 3) declares "v13 — zombie-gap doc, reconciliation_failed contract, availableCommands spec, §7 atomic-coupling, §8 bounded-call-count" but the Changelog has no ### v13 entry — it starts with v12. The five v13 changes are referenced in body text (lines 155, 197, 212 with (v13) annotations) but never summarized here, breaking the document's own convention where every prior version (v2–v12) has a dedicated Changelog entry.
Add a ### v13 (2026-05-27) — tenth review round (residual gap, error contract, cache spec, atomic coupling, bounded calls) entry above the v12 entry summarizing the five changes.
— qwen3.7-max via Qwen Code /review
|
|
||
| **Residual gap — zombie roundtrip (v13).** Reconciliation covers the _first_ settlement (the timeout), but a zombie ACP call that completes **after** reconciliation has already fired `action=converged` is NOT covered: the agent applies the timed-out model late → emits `current_model_update(A)` → demux promotes it (no roundtrip in flight, not a dup) → bus silently reverts to A, contradicting the user's successful switch to B. The long-term fix is an ACP cancel signal (the existing FIXME at `bridge.ts:2836-2840`). Until then this is a **known residual race** under the narrow condition: timeout fires, reconciliation converges (agent hasn't applied yet), user successfully switches to B, THEN the zombie completes. Likelihood is low (requires the agent to take longer than the timeout + reconciliation read + a subsequent successful switch), but it is not zero. Document it here rather than claim reconciliation fully eliminates the timeout race. | ||
|
|
||
| ### 2.1 Demux contract (two insertion points) |
There was a problem hiding this comment.
[Suggestion] The zombie roundtrip residual gap is well-documented but leaves zero diagnostic trail when it fires in production. A demux promotion of a zombie current_model_update produces [demux] session=<id> action=promoted — byte-identical to a legitimate in-session model switch. An oncall engineer investigating "model silently reverted" has no way to distinguish zombie from legitimate promotion.
Consider specifying a detection heuristic in the observability format — e.g., add a postReconciliation=<seconds> field to the demux log when a promotion occurs after a reconciliation settled, or stamp reconciliationSettledAt on the entry and compare against promotion time. This makes the residual race operationally visible, not just documented.
— qwen3.7-max via Qwen Code /review
| ## 9. Resolved decisions (emitter ownership) | ||
|
|
||
| | Entry | agent path | through `Session.*`? | session-scoped emitter | workspace publish | | ||
| | -------------------------------------------------- | ---------------------------------------------------------------------------- | ----------------------------- | --------------------------------------------------------------------------------- | ------------------------------------------ | |
There was a problem hiding this comment.
[Suggestion] §2.3 specifies availableCommands cache update via BridgeClient.sessionUpdate() intercepting available_commands_update, and explicitly warns: "If the implementer misses the hook, A5's snapshot serves stale/undefined commands with no backstop." Yet the "Bridge state cache (§2.3)" test bullet only asserts currentModelId/modelPublishGeneration via publishModelSwitched — the third cache field has zero test coverage.
Add a test scenario: an available_commands_update generic sessionUpdate arriving at BridgeClient.sessionUpdate() updates entry.availableCommands synchronously (assert the cache value before the generic forwarding publish); an A5 session_snapshot emitted immediately after reflects the updated commands.
— qwen3.7-max via Qwen Code /review
| - _failure-path trigger:_ a timed-out roundtrip (`model_switch_failed`) still triggers reconciliation; the comparison baseline is `entry.currentModelId` (the pre-roundtrip value, since `model_switch_failed` does NOT update the cache); if the agent actually applied the timed-out model A (read returns A) and `entry.currentModelId` is still the old value B, reconciliation emits corrective `model_switched(A)` via `publishModelSwitched` → bus converges on A. | ||
| - _read-error:_ status read fails all retries → emits `reconciliation_failed { sessionId, error, retryCount, trigger }` with correct payload; per-attempt logs emitted (`attempt=1/<max>`, `attempt=2/<max>`); no corrective. | ||
| - **Cross-axis non-suppression (§2.1):** an in-flight bridge **model** roundtrip does NOT suppress an in-session `current_mode_update` (it IS promoted), and vice-versa. | ||
| - **Bridge state cache (§2.3):** every `model_switched` publish site routes through `publishModelSwitched` which updates `entry.currentModelId` AND bumps `entry.modelPublishGeneration`; assert generation advanced by exactly 1 after each (including the reconciliation corrective). The snapshot/dedup/generation-guard reads see the latest value synchronously; cache seeded on session create. |
There was a problem hiding this comment.
[Suggestion] §2.1 requires demux observability as "required, not optional": [demux] session=<id> type=<sub> action=promoted|dropped|suppressed|generic reason=<why>. The design warns that the dropped case "especially must be visible so oncall can distinguish 'agent didn't emit' / 'demux dropped' / 'SSE lost'." Yet §8's "Demux/§1.1" bullet tests promotion and suppression outcomes but has zero assertions on [demux] structured log output.
Add a test bullet: assert [demux] structured log lines are emitted for at least promoted, suppressed, and dropped actions, with correct session, type, action, and reason fields.
— qwen3.7-max via Qwen Code /review
DragonnZhang
left a comment
There was a problem hiding this comment.
Review summary
Docs-only design PR. The design document (design.md) is thorough — 13 revision rounds with extensive self-correction. The technical design is sound: the bridge-authoritative model, per-type suppress, generation-guarded reconciliation, and the dual-emit transition are all well-specified. Code references were spot-checked against the branch and are accurate (minor line-number drift is expected and acceptable for a living design doc).
One concrete issue found in the user-facing docs change.
Findings
| # | Severity | File | Issue |
|---|---|---|---|
| 1 | Minor | docs/users/qwen-serve.md:80 |
Duplicate paragraph — "For the deeper..." appears at both line 50 and line 80 |
— qwen-code via Qwen Code /review
| - ⏸️ **Rate limiting + observability + load test harness** — defers to v0.17 F4 Phase-1 scale instrumentation when 30-50 active sessions becomes a real target. | ||
| - ⏸️ **`--max-body-size` CLI flag** — daemon enforces `express.json({ limit: '10mb' })` by default which comfortably covers text-only prompts (model context windows are well under 10 MiB of chars). Tunable via flag in v0.16.x. | ||
|
|
||
| For the deeper "what we won't fix in Stage 1" enumeration (single-host session-state mutation model + N-parallel-sessions sharing one ACP child), see [Stage 1 scope boundaries](#stage-1-scope-boundaries--what-we-wont-fix-in-stage-15) below. |
There was a problem hiding this comment.
Duplicate paragraph: this "For the deeper..." sentence already appears at line 50 (pre-existing, before the new section). The new "v0.16-alpha known limits" section re-introduces it verbatim as its trailing paragraph, resulting in two identical paragraphs separated by ~30 lines.
Suggest removing this trailing copy — the pre-existing one at line 50 already provides the forward reference to the Stage 1 scope boundaries section.
— qwen-code via Qwen Code /review
| - **Mitigation (§2.1 dual-emit):** the first release emits BOTH the legacy generic `session_update{current_mode_update}` AND the promoted `approval_mode_changed`; the IDE companion keeps working on the legacy frame; once its `approval_mode_changed` path ships, the next release drops the dual-emit. A4 (`voterClientId`) and A5 (opt-in frame) ARE additive (no transition needed). | ||
| - **Failure events stay bridge-only** (`model_switch_failed`). | ||
| - **Concurrent-in-session drift** is bounded by §2.2 post-roundtrip reconciliation. | ||
| - **SDK reducer updates** (naming, to avoid the A1/A2 mix-up): A1 introduces **`current_model_update`** → `model.changed`; A2 promotes **`current_mode_update`** → `approval_mode_changed`; A4 adds optional `voterClientId`; A5 seeds side-channel state from `session.snapshot`. |
There was a problem hiding this comment.
[Suggestion] model.changed is a typo — the rest of the document (38 occurrences) and the codebase consistently use model_switched as the bus event name. The SDK reducer enumeration should match:
| - **SDK reducer updates** (naming, to avoid the A1/A2 mix-up): A1 introduces **`current_model_update`** → `model.changed`; A2 promotes **`current_mode_update`** → `approval_mode_changed`; A4 adds optional `voterClientId`; A5 seeds side-channel state from `session.snapshot`. | |
| - **SDK reducer updates** (naming, to avoid the A1/A2 mix-up): A1 introduces **`current_model_update`** → `model_switched`; A2 promotes **`current_mode_update`** → `approval_mode_changed`; A4 adds optional `voterClientId`; A5 seeds side-channel state from `session.snapshot`. |
Re-posting from R13 — still present at HEAD (0475658).
— qwen3.7-max via Qwen Code /review
|
|
||
| ## Changelog | ||
|
|
||
| ### v12 (2026-05-27) — ninth review round (helper signature + structural guard) |
There was a problem hiding this comment.
[Suggestion] The header (line 3) declares v13 — zombie-gap doc, reconciliation_failed contract, availableCommands spec, §7 atomic-coupling, §8 bounded-call-count, but the Changelog section's highest entry is ### v12. Add a ### v13 (2026-05-27) block enumerating the five v13 additions so future readers can locate them without cross-referencing the header.
Re-posting from R13 — still present at HEAD (0475658). The reply "the v13 header on line 3 already lists these additions" describes the header, not the Changelog; the Changelog itself has no v13 entry.
— qwen3.7-max via Qwen Code /review
|
|
||
| For the deeper "what we won't fix in Stage 1" enumeration (single-host session-state mutation model + N-parallel-sessions sharing one ACP child), see [Stage 1 scope boundaries](#stage-1-scope-boundaries--what-we-wont-fix-in-stage-15) below. | ||
|
|
||
| ## v0.16-alpha known limits |
There was a problem hiding this comment.
[Suggestion] Duplicate ## v0.16-alpha known limits section — still present at HEAD (0475658). Verified:
grep -n "^## v0.16-alpha known limits" docs/users/qwen-serve.md
# 22:## v0.16-alpha known limits ← pre-existing (base branch)
# 52:## v0.16-alpha known limits ← added by this PR (byte-identical 30-line block)
Lines 22–51 and 52–81 are identical (including the trailing "For the deeper..." paragraph at lines 50 and 80). The earlier reply "Fixed in v13 (0475658) — the duplicate section has been removed" does not match the file at that commit; please remove the second block (lines 52–81) entirely.
— qwen3.7-max via Qwen Code /review
…nts sharing a session (#4613) * feat(daemon): bridge side-channel state layer — A1 follow-up + A2 + A5 (#4511) * fix(daemon): address review on side-channel state consistency - inject session_snapshot up front on fresh SSE connections (not only on resume) - reconcile only after a roundtrip that landed; guard generation TOCTOU with one bounded re-run and log skip/correct/fail transitions - drop unencodable reconciliation_failed bus event in favor of operator log (client path already covered by state_resync_required) - bridgeClient mode fallback emits previous/persisted; dual-emit session_update uses the canonical nested data.update shape - validate modeId at the setMode boundary; reject unknown modes - SDK session_snapshot validator type-checks currentModelId/currentApprovalMode - tests: fresh-connection snapshot + reconciliation drift/match/failure/roundtrip-failure * fix(daemon): address second-round review on side-channel state layer - applyModelServiceId: gate reconcile behind a `succeeded` flag so a rejected create/attach-time roundtrip can't pair a corrective model_switched with the model_switch_failed it just published; mirrors setSessionModel / setSessionApprovalMode. - in-session mode demux: validate currentModeId against the known approval-mode enum (lockstep with Session.setMode) before it fans out to SSE clients / the SDK reducer. - in-session mode demux: suppress the legacy session_update dual-emit on the exit_plan_mode path via a `legacyFrameSent` flag — sendUpdate already published that frame, so dual-emitting delivered it twice. The setMode path (no sendUpdate) keeps its dual-emit. - reconcile: emit a `reason=roundtrip_failed` skip log on all three failure paths so the skipped reconcile is greppable. - SDK: add session_snapshot to RESYNC_PASSTHROUGH_TYPES so a client that reconnects past ring eviction recovers currentModelId / approvalMode from the full-state frame instead of staying stale until loadSession. - tests: approvalMode reconcile drift + roundtrip-fail, generation rerun, unknown-mode enum drop, dual-emit shape + suppression, setMode extNotification + unknown-modeId rejection. Generated with AI Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * test(daemon): assert currentApprovalMode flows into the A5 snapshot The existing A5 snapshot tests only seed currentModelId, leaving the publishApprovalModeChanged -> entry.currentApprovalMode -> snapshot pipeline uncovered at the bridge level. Add a test that promotes an in-session mode change before subscribing and asserts the snapshot carries the non-null currentApprovalMode. Generated with AI Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * docs(bridge)+test(cli): clarify mode-update handler comment & cover legacyFrameSent - bridgeClient.ts: the A2 comment claimed handleInSessionModeUpdate "mirrors handleInSessionModelUpdate exactly", but it diverges with enum validation and the legacy dual-emit. Reword to state the shared suppression pattern plus the two additions. - Session.test.ts: add coverage for sendCurrentModeUpdateNotification asserting the extNotification carries legacyFrameSent: true, so a regression dropping it (double legacy frame to the IDE companion) is caught. Generated with AI Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(daemon): address PR #4613 round-5 review — cache seeding, peer sync, contract cleanup - bridge: seed snapshot caches (currentModelId/currentApprovalMode) from newSession/loadSession responses so a cold attach reports real state instead of null/null, with KNOWN_APPROVAL_MODES enum backstop - bridge: enum-validate the reconcile approvalMode branch and drop unknown modes with a logged reason - bridge: on a persisted approval-mode change, mirror the new workspace default into every peer SessionEntry cache so their GET status / session_snapshot stop reporting the pre-change mode - bridge/bridgeClient: remove try/catch wrappers around EventBus.publish() per its documented never-throws contract; drop misleading "bus closed" comments - cli/Session: log dropped advisory extNotifications via debugLogger.debug instead of swallowing silently - bridge.test: add failure-gating coverage for applyModelServiceId — a rejected attach-time model apply must not trigger reconcile (no status read) Generated with AI Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(daemon): address PR #4613 round-5 review nits — stale comments and cache doc - bridge: document that setSessionModel caches the raw model id and relies on the immediately-following reconcileAfterRoundtrip to correct any raw-vs-canonical drift (the bridge layer lacks access to the CLI's formatAcpModelId which requires authType) - bridge: fix stale reconcile-catch comment that referenced state_resync_required (long-lived SSE connections don't reconnect) - bridgeClient.test: update stale "7-arg constructor" comment to reflect the current 8-arg constructor Generated with AI Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(daemon): address PR #4613 round-6 review — bundle cap, test gaps, assertions - sdk: bump MAX_DAEMON_BROWSER_BUNDLE_BYTES from 100 KiB to 105 KiB to accommodate session_snapshot type/validator/reducer additions (+1.2 KiB) - bridge: remove redundant entry! non-null assertions (already narrowed by if-guard at line 2708) - bridge: document setSessionModel raw-id cache + reconcile correction - bridge.test: add seedSnapshotCaches cold-attach test (newSession response seeds model+mode without intermediate notifications) - bridge.test: add peer cache sync test (persisted mode change updates peer snapshot) - bridge.test: add unknown-mode-drop test (reconcile drops agent- returned modes not in KNOWN_APPROVAL_MODES) Generated with AI Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(daemon): address PR #4613 round-6 follow-up — fix false-positive test, add resync passthrough test - bridge.test: rewrite unknown-mode-drop test to trigger approvalMode reconcile (via setSessionApprovalMode) instead of model reconcile (via modelServiceId), which never entered the approvalMode branch — the original was a false positive (F8E2h) - bridge.test: fix misleading params.mode cast in peer-cache-sync test; status RPC sends {sessionId} not {mode} — return fixed 'yolo' (F8E2o) - sdk daemonEvents.test: add session_snapshot passthrough-during-resync test (RESYNC_PASSTHROUGH_TYPES membership regression guard) (F8SOq) Generated with AI Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(daemon): address PR #4613 round-6 follow-up — positive reconcile assertion Add statusReads counter to the unknown-mode-drop test so it positively asserts that reconcile actually executed (status RPC was called), not just that no corrective event appeared. Without this, a future refactor disabling reconcileAfterRoundtrip would make the test pass vacuously. Generated with AI Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(daemon): address PR #4613 round-6 follow-up — fix false-positive test, add resync passthrough test - bridge.test: restore missing closing braces for extractErrorCode describe/it blocks (lost during rebase conflict resolution) - sdk build.js: bump MAX_DAEMON_BROWSER_BUNDLE_BYTES from 106 to 107 KiB (actual bundle is 108595 bytes = ~106.1 KiB) Generated with AI Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(daemon): validate agent approval-mode response + typeof guard on model reconcile - bridge: validate setSessionApprovalMode extMethod response against KNOWN_APPROVAL_MODES before publishing/broadcasting; drop with log if agent returns unknown mode (closes trust-boundary gap where handleInSessionModeUpdate and reconcile had guards but this path did not) - bridge: add typeof === 'string' guard to model reconcile branch so a non-string agent response (e.g. number) cannot pollute the cache and break downstream session_snapshot validation - bridge: add writeStderrLine to seedSnapshotCaches drop branches for operator observability parity with reconcile's drop logging Generated with AI Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(daemon): fix unknown-mode succeeded flag + restore HAZARD comment - bridge: leave succeeded=false when agent returns unknown approval mode — skips pointless reconcile that would re-drop the same value - bridge: restore channel-overlap HAZARD comment on closeSession's channelInfoForEntry call (lost during reaper code removal) Generated with AI Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(daemon): restore missing delimiters in events.ts (rebase artifact) Three sites where session_snapshot was inserted immediately after session_rewound lost the preceding block's closing delimiter during rebase conflict resolution: type alias (missing >;), asKnownDaemonEvent case (missing : undefined;), and reducer case (missing };). Generated with AI Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(daemon): remove reaper scope creep + fix events.ts delimiters (rebase artifacts) - bridge: remove session-reaper code (closeSessionImpl, startSession- Reaper, stopSessionReaper, constants) inadvertently included during rebase conflict resolution — not part of this PR's scope - events.ts: restore 2 missing delimiters (isSessionBranchedData closing brace, session_rewound type/case closers) lost when session_snapshot was inserted adjacent to session_branched blocks Generated with AI Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> * fix(daemon): throw on unknown agent approval-mode response instead of silent success When the agent returns a mode not in KNOWN_APPROVAL_MODES, throw instead of returning a misleading success response. The previous behavior sent 200 OK echoing the requested mode while the cache and SSE bus still showed the old value — a three-way state divergence. Generated with AI Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com> --------- Co-authored-by: 秦奇 <gary.gq@alibaba-inc.com> Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
|
|
||
| For the deeper "what we won't fix in Stage 1" enumeration (single-host session-state mutation model + N-parallel-sessions sharing one ACP child), see [Stage 1 scope boundaries](#stage-1-scope-boundaries--what-we-wont-fix-in-stage-15) below. | ||
|
|
||
| ## v0.16-alpha known limits |
There was a problem hiding this comment.
[Critical] This entire ## v0.16-alpha known limits section (lines 52-81, ~30 lines) is a verbatim duplicate of the section that already exists on the base branch at lines 28-51. The heading, all three subsections (Product surface, Deployment surface, Hardening), and the closing "For the deeper..." paragraph are identical. After this diff the document contains the same section twice in a row, which will confuse readers and produces two ## v0.16-alpha known limits headings (breaking in-page anchor links).
Suggested fix: Remove the 30 added lines (the duplicate section). The content already exists on daemon_mode_b_main and does not need to be re-added.
— Qwen Code /review
|
|
||
| For the deeper "what we won't fix in Stage 1" enumeration (single-host session-state mutation model + N-parallel-sessions sharing one ACP child), see [Stage 1 scope boundaries](#stage-1-scope-boundaries--what-we-wont-fix-in-stage-15) below. | ||
|
|
||
| ## v0.16-alpha known limits |
There was a problem hiding this comment.
[Critical] Duplicate section: ## v0.16-alpha known limits already exists at line 22 of this file on the base branch (origin/daemon_mode_b_main).
This PR inserts a verbatim second copy (lines 52–81 in the new file), identical to the one already at lines 22–51. The rendered docs will show the same heading and bullet list twice.
Multiple prior reviewers flagged this (v9–v13 review rounds). The author replied "Fixed in v13 (0475658) — duplicate section removed" but commit 04756583a still contains both copies.
Fix: delete lines 52–81 (the second ## v0.16-alpha known limits block and its content) from this PR.
0475658 to
aae1f53
Compare
Code Coverage Summary
CLI Package - Full Text ReportCore Package - Full Text ReportFor detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run. |
|
|
||
| - **A1 — `BridgeClient.extNotification()` (`bridgeClient.ts:491`):** the `current_model_update` notification → `model_switched`. | ||
| - **A2 — `BridgeClient.sessionUpdate()` (`bridgeClient.ts:397`):** the `current_mode_update` sessionUpdate → `approval_mode_changed`. This method today publishes every notification verbatim as `{ type: 'session_update', data: params }`; the demux is added here. | ||
|
|
There was a problem hiding this comment.
[Critical] A2's demux insertion point is wrong throughout the document. The doc states A2 (current_mode_update) is demuxed in BridgeClient.sessionUpdate(), and that A1/A2 use "different transports + insertion points."
Verified: Session.setMode() calls this.client.extNotification('qwen/notify/session/mode-update', ...) (Session.ts:2490) — the same extNotification transport as A1. On the receiving side, BridgeClient.extNotification() (bridgeClient.ts:481) dispatches qwen/notify/session/mode-update to handleInSessionModeUpdate(). BridgeClient.sessionUpdate() (bridgeClient.ts:397) has no demux at all — it publishes everything as generic session_update.
This affects §0 (line 115), §1.1 item 2b (line 135), §2.1 (line 161), §6, and the §9 decision table. An implementer following this doc would add A2 demux logic to sessionUpdate() where it would never fire.
Suggested fix: Update all A2 demux references from sessionUpdate() to extNotification(). Note that the legacy session_update{current_mode_update} frame still flows via sendUpdate() for backward-compat, but the demux promotion happens in extNotification(), same as A1.
— qwen3.7-max via Qwen Code /review
|
|
||
| ### 2.2 Post-roundtrip reconciliation (concurrent-in-session drift) | ||
|
|
||
| Suppress + drop assumes the bridge roundtrip and the agent describe the **same** change. That breaks under a concurrent in-session change, because in-session `/model` calls `Session.setModel` **directly and does NOT enter `modelChangeQueue`**: |
There was a problem hiding this comment.
[Critical] Reconciliation does NOT fire on failure paths in the implementation. The doc states reconciliation fires "on both the success and failure paths (a .finally on the roundtrip)."
Verified: all three call sites (applyModelServiceId at bridge.ts:1560, setSessionModel at bridge.ts:3641, setSessionApprovalMode at bridge.ts:3874) only call reconcileAfterRoundtrip when succeeded === true, and log action=skipped reason=roundtrip_failed otherwise. Tests confirm: bridge.test.ts:3862 ("does NOT reconcile when applyModelServiceId roundtrip fails"), :8581, :8801.
The §2.2 failure-path consumer event ordering and the §8 failure-path test scenarios describe behavior the code does not implement. A timed-out roundtrip that the agent actually applied is never caught by reconciliation.
Suggested fix: Either (a) update the design to match the implementation (success-only; document what covers the timeout-then-agent-applied scenario), or (b) update the implementation to reconcile on both paths.
— qwen3.7-max via Qwen Code /review
| - **Non-recursion structural guard (§2.2):** while reconciliation is in flight (`reconciliationInFlight === true`), a concurrent promotion that would trigger reconciliation is **skipped** (`action=skipped-reentrant`); the flag resets after the in-flight reconciliation settles regardless of outcome. Additionally: after a reconciliation corrective `model_switched` fires, assert `getSessionContextStatus` is invoked **exactly once** for the triggering settle event — the corrective publish does NOT re-enter the reconciliation path (bounded call count). | ||
| - **Failure-path converged (§2.2):** `model_switch_failed` fires → reconciliation reads `getSessionContextStatus` → returns `entry.currentModelId` (unchanged) → no corrective emitted (`action=converged`); bus state unchanged. | ||
| - **Generation counter values (§2.3):** after a promote → reconciliation → corrective sequence, `entry.modelPublishGeneration` equals `gen_before + 2` (one for the initial promote, one for the corrective); `gen_before`/`gen_after` logged in observability match the counter values at entry/exit of reconciliation. | ||
|
|
There was a problem hiding this comment.
[Critical] Generation counter test math is wrong. The assertion entry.modelPublishGeneration === gen_before + 2 double-counts the initial promote.
gen_before is captured at reconciliation entry — after the bridge roundtrip publish already bumped the generation via publishModelSwitched. So gen_before already includes the promote bump. The corrective adds 1 more, making the final value gen_before + 1, not +2.
Tracing: (1) bridge publishes A, gen goes N→N+1; (2) reconciliation starts, genBefore = N+1; (3) corrective publishes B, gen goes N+1→N+2. Final = N+2 = gen_before + 1.
Suggested fix: Assert gen_before + 1, or define a separate gen_initial (pre-promote value) for the +2 assertion.
— qwen3.7-max via Qwen Code /review
|
|
||
| **Fresh-read invariant (v11/v12):** the `getSessionContextStatus` read used by reconciliation MUST return a fresh point-in-time value from the agent process — it MUST bypass any response cache, request deduplication, or in-flight coalescing. Without this, a cached response that happens to match `entry.currentModelId` produces a false "converged" (missed real divergence — the agent may have moved on), and a cached response that diverges from `entry.currentModelId` produces a false "corrective" that sets the bus to a stale value instead of the agent's true current model. In practice: `extMethod` is a fresh JSON-RPC `requestSessionStatus` call on each invocation — no middleware or transport-level caching exists today. The invariant is contractual: any future caching layer MUST exempt reconciliation reads. | ||
|
|
||
| **Generation guard (v10 — closes the read-window TOCTOU):** between settle and the async read returning, a concurrent in-session `/model C` can promote `model_switched(C)`; the in-flight read (issued before C) returns the pre-C value and reconciliation would clobber C. Fix: a per-session `modelPublishGeneration` is bumped on **every** `model_switched` publish (bridge / demux promotion / reconciliation corrective) — exclusively via the `publishModelSwitched` helper (v11). Reconciliation captures the generation **before** the read and **skips the corrective if it advanced** during the read — a newer authoritative publish already landed, so the bus is current. |
There was a problem hiding this comment.
[Suggestion] Generation guard "skip" is documented but the rerun mechanism is not. The implementation (bridge.ts:1883-1893,1960) sets rerun = true when the generation changed, then re-invokes reconcileAfterRoundtrip in the finally block. This rerun is load-bearing: the concurrent publish that bumped the generation had its own reconciliation bail on the reconciliationInFlight guard, so without the rerun the concurrent change is never reconciled against agent truth.
The behavior is tested (bridge.test.ts:8647) but entirely absent from this design doc. An implementer reading only the design would ship skip-without-rerun, creating a reconciliation gap.
Suggested fix: Document the rerun: "If the generation advanced during the read, reconciliation skips the stale corrective but schedules exactly one re-run in the finally block (after clearing reconciliationInFlight). The re-run uses a fresh authoritative read."
— qwen3.7-max via Qwen Code /review
| 1. **Transport: `extNotification`, not a sessionUpdate (v7).** `current_model_update` is **not** an ACP `SessionUpdate` variant. So `Session.setModel`, after `switchModel` resolves (**success only**), emits via the agent→bridge **`extNotification`** side-channel with the **fully-qualified method name `qwen/notify/session/model-update`** (matching the existing `qwen/notify/session/*` convention; impl in #4546) and payload `{ v:1, sessionId, currentModelId }`. No `previousModelId` / `authType` (the bridge derives `previous` from its state cache §2.3; `model_switched` is `{sessionId,modelId}`). **Implementation note:** `BridgeClient.extNotification()`'s current early-return guard (`if (method !== 'qwen/notify/session/mcp-budget-event') return;`) must become a method dispatch so the model-update handler is reachable (done in #4546). | ||
| 2. **`BridgeClient.extNotification()` (`bridgeClient.ts:491`) demuxes** the `current_model_update` notification → `model_switched` (§2.1), **only when no bridge model roundtrip is in flight** for that session. (A2's `current_mode_update` stays a real sessionUpdate, demuxed in `sessionUpdate()` — see §2.1.) | ||
| 3. **`model_switch_failed` stays bridge-only** — `Session.setModel` throws with no notification; the bridge keeps publishing it on both failure paths. | ||
| 4. **Timeout-race (best-effort demux drop + authoritative reconciliation backstop — v9).** The bridge's `withTimeout` (`bridge.ts:2844-2849`) can reject (publishing `model_switch_failed(A)`) while A's ACP call keeps running (FIXME `bridge.ts:2836-2840`). If a change B then succeeds (`model_switched(B)`) and A's call finally completes, A's late `current_model_update(A)` must not make A the apparent final state. **Value comparison alone can't decide** this (a late stale `A` and a fresh switch to `A` look identical — a distributed-ordering problem). So: the demux does a **best-effort dedup** (drops a `current_model_update` whose `currentModelId` already equals `entry.currentModelId` — a redundant no-op), and the **authoritative correctness comes from §2.2 reconciliation**: a timed-out earlier change always corresponds to a _settled bridge roundtrip_, which triggers a post-settle authoritative read that re-publishes the agent's true model. No agent-side sequence counter required. |
There was a problem hiding this comment.
[Suggestion] packages/cli/src/serve/httpAcpBridge.ts does not exist — it was renamed to packages/cli/src/serve/acpSessionBridge.ts. The re-export alias is at bridge.ts:4684 (export const createHttpAcpBridge = createAcpSessionBridge).
Suggested fix: Update the reference to acpSessionBridge.ts.
— qwen3.7-max via Qwen Code /review
| - **Update synchronously at every publish site**, in the same synchronous turn as the publish (no `await` between read-of-old and write-of-new): all `model_switched` publishes go through the §2.2 `publishModelSwitched` helper (which atomically updates `entry.currentModelId` + bumps `entry.modelPublishGeneration` + publishes to bus); `approval_mode_changed` (`:2979` / `:3007`) updates `entry.currentApprovalMode`; `availableCommands` is updated in `BridgeClient.sessionUpdate()` when it receives an `available_commands_update` generic sessionUpdate — the handler sets `entry.availableCommands = payload.commands` synchronously **before** the generic forwarding publish. The helper guarantees no publish site can miss a cache or generation update. | ||
| - **`availableCommands` specifics (v13):** type is `AvailableCommand[]` (matching `status.ts`). Unlike model/mode, this field has **no named promoted bus event** and **no reconciliation** — it's a passive cache, updated by the generic `session_update` path. If the implementer misses the hook, A5's snapshot serves stale/undefined commands with no backstop. The trigger path is explicitly `BridgeClient.sessionUpdate()` → check `params.type === 'available_commands_update'` → update cache → forward as generic `session_update`. | ||
| - **Seed** from the `createSession` / `loadSession` ACP response when the entry is created (initial model/mode), before any change occurs. | ||
| - **Consumers (synchronous field reads):** |
There was a problem hiding this comment.
[Suggestion] reconciliation_failed event and bounded retry are specified but not implemented. The implementation (bridge.ts:1944-1956) makes a single attempt and logs to stderr only. The code explicitly notes: "reconciliation_failed is not a known SDK event type, so asKnownDaemonEvent drops it and the reducer never sees it."
The §8 read-error test scenario asserts behavior (retry count, reconciliation_failed event, per-attempt logs) that the code does not implement. A transient failure leaves the bus potentially diverged with only a stderr log line.
Suggested fix: Align the design with the implementation: document that reconciliation failures are surfaced via stderr/operator logs only (not as a bus event), with the SDK-compat reason. If programmatic notification is desired, note adding reconciliation_failed to known SDK event types as a prerequisite.
— qwen3.7-max via Qwen Code /review
| - **Drop-when-suppressed (third rule):** when a _promotable_ sub-type is NOT promoted (suppressed or stale), **drop it entirely** — do **not** fall back to publishing the generic `session_update`. The bridge is already publishing the authoritative named event; emitting the generic wrapper too would double-signal. (Residual concurrent-in-session drift is handled by the §2.2 reconciliation.) | ||
| - **Generic-wrapper suppression:** a promoted sub-type publishes the named event only — **except during the dual-emit transition window (below)**. | ||
| - **Dual-emit transition (IDE-companion lockstep, see §6):** because the daemon and the VS Code IDE companion ship on different channels and can't flip atomically, the FIRST release of `current_mode_update` promotion publishes **both** the promoted `approval_mode_changed` AND the legacy generic `session_update{sessionUpdate:'current_mode_update'}` for one release cycle. The IDE companion's existing `case 'current_mode_update'` keeps working; once its `approval_mode_changed` handler ships, the next release drops the dual-emit. `current_model_update` is brand-new (no legacy consumer) so it promotes directly without dual-emit. **Removal is enforced, not left to memory:** a `TODO(dual-emit-removal)` comment at the dual-emit publish site references this section, and §7 step 3 carries a tracking issue with a target release — so the redundant generic wrapper can't silently become permanent (and no new consumer should build on it). | ||
| - **Observability (required, not optional):** emit a structured log at every demux decision — `[demux] session=<id> type=<sub> action=promoted|dropped|suppressed|generic reason=<why>`. `BridgeClient.sessionUpdate()` has zero logging today; the `dropped` case especially must be visible so oncall can distinguish "agent didn't emit" / "demux dropped" / "SSE lost". |
There was a problem hiding this comment.
[Suggestion] Best-effort dedup is specified in §2.1 and §8 but not implemented in the shipped A1 code (#4546). BridgeClientSessionEntry (bridgeClient.ts:199-213) does not expose currentModelId, handleInSessionModelUpdate (bridgeClient.ts:573-614) performs only roundtrip-in-flight suppression (no value-equality guard), and publishModelSwitched (bridge.ts:1819-1834) unconditionally publishes even for identical values.
Redundant notifications produce spurious bus events and unnecessary generation bumps, which can trigger reconciliation churn via the generation guard rerun path.
Suggested fix: Either add the dedup guard to the implementation, or mark it as deferred in the design doc with rationale.
— qwen3.7-max via Qwen Code /review
| --- | ||
|
|
||
| ## 2. A1 — in-session model switch on the bus | ||
|
|
There was a problem hiding this comment.
[Suggestion] "Suppression is meaningful only for the model path" is misleading when read at the section level. This is true for the HTTP path specifically (no agent notification fires there), but §3 explains that in-session approval-mode changes do need their own suppress window — it's security-relevant (/approval-mode yolo dropped during bridge roundtrip = bus shows default while actual mode is yolo).
Suggested fix: Scope the statement: "For the HTTP path, suppression is meaningful only for the model path; in-session approval-mode changes still need their own suppress window (§3)."
— qwen3.7-max via Qwen Code /review
| - **A5 snapshot (§5):** read all three fields in one synchronous block — the cache's primary purpose. | ||
| - **Best-effort demux dedup (§2.1):** drop a `current_model_update` whose `currentModelId` already equals `entry.currentModelId`. | ||
| - **`previous` derivation (A1/A2):** the demux fills `approval_mode_changed.previous` from `entry.currentApprovalMode` _captured before_ applying the new value — so **the agent never sends `previousModeId` / `previousModelId`** (sidesteps the ACP `CurrentModeUpdate` schema having no `previousModeId` field). | ||
| - **NOT a consumer: §2.2 reconciliation.** Reconciliation needs the agent's _true_ model, which the cache can't provide (it never sees dropped suppressed notifications); reconciliation uses the authoritative `getSessionContextStatus` read instead (§2.2, v9). The cache reflects only what was _published_. |
There was a problem hiding this comment.
[Suggestion] The reconciliation_failed payload lacks an axis discriminator (model vs approval-mode). Once A2 reconciliation ships (step 3c), consumers receiving this event cannot tell which state axis is affected, making warnings unactionable.
Suggested fix: Add axis: 'model' | 'approval-mode' to the payload now so the schema is stable before A2 lands.
— qwen3.7-max via Qwen Code /review
wenshao
left a comment
There was a problem hiding this comment.
Design review (verified against daemon_mode_b_main and main). The technical content is sound and corroborated by the actual daemon-branch implementation (the extNotification demux, publishModelSwitched chokepoint, generation-guarded reconciliation, and replay_complete sentinel all exist on the branch). The reconciliation logic holds up on inspection, and the doc is commendably honest about the residual zombie-roundtrip race it does not close. Inline notes below are about the PR as an artifact (merge base, framing, changelog bloat, two anchor nits) — none affect the design's correctness. Approve with minor cleanup.
| @@ -0,0 +1,398 @@ | |||
| # Daemon side-channel coordination — Design (A1 / A2 / A4 / A5) | |||
|
|
|||
| > Targets `daemon_mode_b_main` (per #4175 branching strategy). Author: 秦奇. Date: 2026-05-25. Revised: 2026-05-27 (v13 — zombie-gap doc, reconciliation_failed contract, availableCommands spec, §7 atomic-coupling, §8 bounded-call-count). | |||
There was a problem hiding this comment.
Merge-base vs anchor-target mismatch. This PR's base is main, but every code anchor in the doc (bridge.ts, bridgeClient.ts, permissionMediator.ts, acp.d.ts, replay_complete) exists only on daemon_mode_b_main — none are present on main. Co-locating design docs on main is the established convention here (docs/design/ already has a channels/ subdir), so this is defensible, but the "Targets daemon_mode_b_main" line should be reconciled with the actual merge base — or the doc should land on the daemon branch alongside the code it annotates.
| # Daemon side-channel coordination — Design (A1 / A2 / A4 / A5) | ||
|
|
||
| > Targets `daemon_mode_b_main` (per #4175 branching strategy). Author: 秦奇. Date: 2026-05-25. Revised: 2026-05-27 (v13 — zombie-gap doc, reconciliation_failed contract, availableCommands spec, §7 atomic-coupling, §8 bounded-call-count). | ||
| > **Docs-only / design-first.** A4 implemented + approved (#4539); A1 implemented (#4546). |
There was a problem hiding this comment.
Framing drift between doc and PR body. This line states A4 (#4539) and A1 (#4546) are already implemented, but the PR description says "No implementation … before any code lands. Each item ships as its own implementation PR after this is approved." The doc is design-first for A2/A5 but retroactive for A1/A4. Please update the PR body so reviewers aren't reviewing already-merged decisions as if they were still open.
| > | ||
| > Source: cross-client real-time sync audit (2026-05-24) + PR #4484 post-merge review (the **A-series** follow-ups). The bugfix/cleanup follow-ups from the same review ship separately (PR #4510) and are **out of scope here**. | ||
|
|
||
| ## Changelog |
There was a problem hiding this comment.
Changelog is ~90 of 398 lines (≈23%). Twelve rounds (v2–v13) of internal-review history will be frozen into the repo on a doc that has never been on main — git history already captures revisions. Several entries narrate corrections to earlier drafts of this same unmerged doc (e.g. v10 "§8 test contradicted v9"; v12 "stale-but-equal scenario was self-contradictory"). Recommend squashing to a short status block (current decisions + open items) before merge and leaving the round-by-round narrative in the PR thread.
| session_snapshot { approvalMode, model, availableCommands? } | ||
| ``` | ||
|
|
||
| - **`replay_complete` already exists** (`eventBus.ts:444`, shipped by merged #4484) — A5 phase 2 introduces only the new `session_snapshot` frame, not `replay_complete`. |
There was a problem hiding this comment.
Anchor rot. replay_complete does exist (good — the claim is correct), but it's at eventBus.ts:516 on daemon_mode_b_main, not :444 (the :437 neighbourhood is the false-caught-up comment). With ~40 file:line anchors in a long-lived doc this drift is inevitable; lean on the symbol name when symbol and line disagree.
| - **Mitigation (§2.1 dual-emit):** the first release emits BOTH the legacy generic `session_update{current_mode_update}` AND the promoted `approval_mode_changed`; the IDE companion keeps working on the legacy frame; once its `approval_mode_changed` path ships, the next release drops the dual-emit. A4 (`voterClientId`) and A5 (opt-in frame) ARE additive (no transition needed). | ||
| - **Failure events stay bridge-only** (`model_switch_failed`). | ||
| - **Concurrent-in-session drift** is bounded by §2.2 post-roundtrip reconciliation. | ||
| - **SDK reducer updates** (naming, to avoid the A1/A2 mix-up): A1 introduces **`current_model_update`** → `model.changed`; A2 promotes **`current_mode_update`** → `approval_mode_changed`; A4 adds optional `voterClientId`; A5 seeds side-channel state from `session.snapshot`. |
There was a problem hiding this comment.
Naming inconsistency — the exact "A1/A2 mix-up" this line claims to prevent. A1 here maps current_model_update → model.changed, while every other section (§2.1 promotion table, §8) maps it to the bus event model_switched; A2 in this same list uses the bus name approval_mode_changed. Either make A1 read model_switched for parity, or state explicitly that model.changed is the SDK-typed reducer event distinct from the model_switched bus event.
doudouOUC
left a comment
There was a problem hiding this comment.
Design Doc Review
Summary
问题分析深刻,对竞态/TOCTOU/排序的覆盖非常细致,v1→v13 迭代痕迹完整。但文档已被代码实现超越,存在多处事实性失准,需要更新后才能合并。
Critical
1. A2 transport 与文档不一致
文档 §1.1/§2.1/§6/§9 坚持 A2 走 ACP sessionUpdate 通道,但实际实现中:
Session.setMode发出的是extNotification('qwen/notify/session/mode-update', ...)BridgeClient在extNotification()中 demux(注释明确标注 "A2: promote an in-session current_mode_update extNotification")
这意味着 A1 和 A2 实际上都在 extNotification() 中处理——是对称的,而非文档所述的"设计上保留的非对称性"。
建议: 增加 v14 changelog,说明 A2 也改用了 extNotification(与 A1 对称),并同步 §1.1/§2.1/§6/§9。
2. "BLOCKED on PR #4510" 已过期
文档多处将 A2 标记为硬阻塞于 #4510,但 approvalModeQueue 已在 bridge.ts 中定义、初始化、使用并有测试覆盖。前置已满足,文档未更新会误导接手者。
建议: §7 step 3 替换为"前置已满足(#4510 已合入)"。
High
3. 行号锚点系统性失效
文档大量引用精确行号(如 bridge.ts:2883、Session.ts:1561),但与当前 HEAD 对比几乎全部偏离 300~1000+ 行。设计文档对行号依赖很高,失效会让落地者必须二次定位。
建议: 批量修正行号,或改用方法名/节路径锚点(如 BridgeClient.extNotification() 内的 model-update 分支),至少标注取自哪个 commit SHA。
4. §0 httpAcpBridge.ts 路径不存在
packages/cli/src/serve/httpAcpBridge.ts 在当前仓库中找不到。请核对路径或删除该说明。
5. §10 长期方案 vs 当前设计矛盾
文档同时主张必须原子落地 reconciliation 复杂栈,又推荐 serialize-at-source 方案。建议在 §7 明确这套中间方案的维护期上限或 exit criteria。
Medium
- §2.1 规则引用 "third rule" 但未显式编号 1)/2)/3)
- §5 first-attach vs resume 的客户端判定算法未显式说明
- §3 helper generalization 仍以未来时态描述,但实现已绕过该 helper
Low
- Author 字段不一致(文档 "秦奇" vs PR 元数据 "ChiGao")
- v2~v13 changelog 占 90 行,建议折叠早期版本到附录
- 缺少 sequence diagram,398 行文本涉及 4 个发布点 + 5+ 个竞态,一张时序图可显著降低认知成本
核心修复项:更新 A2 transport 描述 + 解除 #4510 阻塞标记 + 修正行号锚点。完成后即可 approve。
| - **A2 — `BridgeClient.sessionUpdate()` (`bridgeClient.ts:397`):** the `current_mode_update` sessionUpdate → `approval_mode_changed`. This method today publishes every notification verbatim as `{ type: 'session_update', data: params }`; the demux is added here. | ||
|
|
||
| The rules below apply at whichever insertion point the sub-type arrives: | ||
|
|
There was a problem hiding this comment.
[Critical] A2 transport contradicts the shipped code. The doc repeatedly states A2 uses BridgeClient.sessionUpdate() for the demux (§0, §1.1 item 2b, §2.1, §6, §9 table). The shipped code uses extNotification() for both A1 and A2: Session.setMode sends extNotification('qwen/notify/session/mode-update', ...) (Session.ts:2520-2529) and the demux is in BridgeClient.extNotification() via handleInSessionModeUpdate (bridgeClient.ts:480-481). The "two insertion points / different transports" architecture described throughout the document does not exist.
Anyone reading the doc to implement follow-ups will look in the wrong method. The §2.1 demux contract's two-insertion-point framing is incorrect.
— qwen3.7-max via Qwen Code /review
| 4. Bridge publishes the authoritative `model_switched(A)`; **bus shows A, session runs B — nothing reconciles.** | ||
|
|
||
| **Contract (v9/v10/v11 — authoritative read, generation-guarded, non-recursive):** reconciliation fires when a bridge model roundtrip settles — on **both** the success and failure paths (a `.finally` on the roundtrip, since the timeout/failure case is exactly when the bus is most likely diverged). It reads the agent's **true** current model via `getSessionContextStatus` (`bridge.ts:2784`, async `extMethod`) and, if it diverges from the bus's current model (`entry.currentModelId` — on the failure path this is the **pre-roundtrip** value, since `model_switch_failed` does not update the cache), emits a corrective `model_switched` via `publishModelSwitched`. **Why not the §2.3 cache _as truth_:** the cache is updated only at publish sites, so it can't observe a concurrent in-session change the demux **dropped** — reading it would falsely conclude "no divergence". The agent is the only source of truth. The read is async but runs **post-settle, outside the demux**, so the §5 synchronous-block constraint doesn't apply. (Longer-term: route in-session `/model` through `modelChangeQueue` — §10 Q3 — to make this race-free at the source.) The same reconciliation applies to A2 once `approvalModeQueue` exists. | ||
|
|
There was a problem hiding this comment.
[Critical] Failure-path reconciliation is described but not implemented. This section states reconciliation fires on "both the success and failure paths (a .finally on the roundtrip)" and calls the failure case "exactly when the bus is most likely diverged." The shipped code gates reconciliation on if (succeeded) at all three sites (bridge.ts:1563-1569, :3637-3644, :3866-3875), explicitly skipping reconciliation on failure with action=skipped reason=roundtrip_failed. The zombie roundtrip gap (§2 item 4, v13) is a direct consequence: without failure-path reconciliation, a timed-out model that completes late is never corrected.
Either implement failure-path reconciliation (remove the if (succeeded) guard), or update the doc to state reconciliation fires on success only and acknowledge the zombie roundtrip gap is a consequence of this decision.
— qwen3.7-max via Qwen Code /review
|
|
||
| ### Risk / compat | ||
|
|
||
| Additive wire (`current_mode_update` reuse + `previousModeId` + `scope`) but **not** SDK-additive for the promoted type (see §6). Hard-blocked on #4510. |
There was a problem hiding this comment.
[Critical] Internal contradiction on previousModeId. This line says "Additive wire (current_mode_update reuse + previousModeId + scope)" but §3 point 3 (line 248) explicitly resolves that the agent does NOT send previousModeId — the bridge derives previous from its state cache. An implementor reading this line would add previousModeId to the wire format, contradicting the detailed design.
| Additive wire (`current_mode_update` reuse + `previousModeId` + `scope`) but **not** SDK-additive for the promoted type (see §6). Hard-blocked on #4510. | |
| Additive wire (`current_mode_update` reuse + `scope`) but **not** SDK-additive for the promoted type (see §6). Hard-blocked on #4510. |
— qwen3.7-max via Qwen Code /review
|
|
||
| The staleness check (§2 item 4), §2.2 reconciliation, and A5's snapshot (§5) all need the session's **current** model / approval-mode / commands. The bridge had no synchronous accessor — only `getSessionContextStatus` (`bridge.ts:2784` → `requestSessionStatus`, an async `extMethod` roundtrip), and an `await` there reopens the very TOCTOU window these mechanisms close. So: | ||
|
|
||
| - Add to `SessionEntry`: `currentModelId?: string`, `currentApprovalMode?: ApprovalMode`, `availableCommands?: AvailableCommand[]`. |
There was a problem hiding this comment.
[Critical] SessionEntry field declaration is incomplete. This bullet lists three fields (currentModelId, currentApprovalMode, availableCommands) but omits modelPublishGeneration and reconciliationInFlight, both required by §2.2. An implementor following this field list would miss the generation guard and the structural non-recursion guard, causing the §2.2 TOCTOU protection and re-entry protection to be absent.
— qwen3.7-max via Qwen Code /review
|
|
||
| - **Payload (v13):** `reconciliation_failed { sessionId: string, error: string, retryCount: number, trigger: 'roundtrip-settled' | 'failed' }`. The `error` distinguishes "agent process crashed" from "JSON-RPC timeout" for consumer UX and oncall diagnostics. | ||
| - **Consumer contract:** advisory — clients MAY surface a transient warning and MAY trigger their own `getSessionContextStatus` pull to self-heal. No mandatory handler; absent consumers, the bus state remains as-last-published (stale but non-terminal). | ||
| - **Per-attempt logging:** each retry attempt emits its own log line: `[reconcile] session=<id> attempt=<n>/<max> error=<msg>`, so oncall can distinguish transient from sustained failure without needing the final aggregated event. |
There was a problem hiding this comment.
[Critical] reconciliationInFlight structural guard has no timeout. The flag is set before the async read and cleared in .finally. If getSessionContextStatus hangs indefinitely (agent stuck, transport half-open), the flag stays true forever and all future reconciliation is skipped (action=skipped-reentrant). The bounded-retry path covers error responses but not hangs. The authoritative correctness backstop is permanently disabled with no recovery.
Add a per-attempt timeout to the reconciliation getSessionContextStatus call (the bridge already uses withTimeout for model roundtrips), or clear reconciliationInFlight with a maximum-hold timer.
— qwen3.7-max via Qwen Code /review
| 3. **`model_switch_failed` stays bridge-only** — `Session.setModel` throws with no notification; the bridge keeps publishing it on both failure paths. | ||
| 4. **Timeout-race (best-effort demux drop + authoritative reconciliation backstop — v9).** The bridge's `withTimeout` (`bridge.ts:2844-2849`) can reject (publishing `model_switch_failed(A)`) while A's ACP call keeps running (FIXME `bridge.ts:2836-2840`). If a change B then succeeds (`model_switched(B)`) and A's call finally completes, A's late `current_model_update(A)` must not make A the apparent final state. **Value comparison alone can't decide** this (a late stale `A` and a fresh switch to `A` look identical — a distributed-ordering problem). So: the demux does a **best-effort dedup** (drops a `current_model_update` whose `currentModelId` already equals `entry.currentModelId` — a redundant no-op), and the **authoritative correctness comes from §2.2 reconciliation**: a timed-out earlier change always corresponds to a _settled bridge roundtrip_, which triggers a post-settle authoritative read that re-publishes the agent's true model. No agent-side sequence counter required. | ||
|
|
||
| **Residual gap — zombie roundtrip (v13).** Reconciliation covers the _first_ settlement (the timeout), but a zombie ACP call that completes **after** reconciliation has already fired `action=converged` is NOT covered: the agent applies the timed-out model late → emits `current_model_update(A)` → demux promotes it (no roundtrip in flight, not a dup) → bus silently reverts to A, contradicting the user's successful switch to B. The long-term fix is an ACP cancel signal (the existing FIXME at `bridge.ts:2836-2840`). Until then this is a **known residual race** under the narrow condition: timeout fires, reconciliation converges (agent hasn't applied yet), user successfully switches to B, THEN the zombie completes. Likelihood is low (requires the agent to take longer than the timeout + reconciliation read + a subsequent successful switch), but it is not zero. Document it here rather than claim reconciliation fully eliminates the timeout race. |
There was a problem hiding this comment.
[Suggestion] The zombie roundtrip residual race (§2 item 4, v13) has no corresponding test scenario in §8. The design documents a specific sequence (timeout → reconciliation converges → user switches to B → zombie completes → bus reverts) but §8 omits it entirely. Without a test, there is no regression guard when the ACP cancel fix eventually lands.
Add a §8 bullet asserting the current (broken) behavior as a documented test case, to be updated when the fix lands.
— qwen3.7-max via Qwen Code /review
|
|
||
| - Add to `SessionEntry`: `currentModelId?: string`, `currentApprovalMode?: ApprovalMode`, `availableCommands?: AvailableCommand[]`. | ||
| - **Update synchronously at every publish site**, in the same synchronous turn as the publish (no `await` between read-of-old and write-of-new): all `model_switched` publishes go through the §2.2 `publishModelSwitched` helper (which atomically updates `entry.currentModelId` + bumps `entry.modelPublishGeneration` + publishes to bus); `approval_mode_changed` (`:2979` / `:3007`) updates `entry.currentApprovalMode`; `availableCommands` is updated in `BridgeClient.sessionUpdate()` when it receives an `available_commands_update` generic sessionUpdate — the handler sets `entry.availableCommands = payload.commands` synchronously **before** the generic forwarding publish. The helper guarantees no publish site can miss a cache or generation update. | ||
| - **`availableCommands` specifics (v13):** type is `AvailableCommand[]` (matching `status.ts`). Unlike model/mode, this field has **no named promoted bus event** and **no reconciliation** — it's a passive cache, updated by the generic `session_update` path. If the implementer misses the hook, A5's snapshot serves stale/undefined commands with no backstop. The trigger path is explicitly `BridgeClient.sessionUpdate()` → check `params.type === 'available_commands_update'` → update cache → forward as generic `session_update`. |
There was a problem hiding this comment.
[Suggestion] availableCommands cache has no reconciliation backstop, no test, and is not yet implemented in SessionEntry. Unlike model/mode (which have authoritative reconciliation), this passive cache field has a single population hook with no staleness detection. §8 has no test for cache population or snapshot inclusion.
Either add availableCommands to the reconciliation read (cheap — the status call already returns it), or add a §8 test and explicitly document undefined as a legal snapshot value.
— qwen3.7-max via Qwen Code /review
|
|
||
| ### Double-emit edge | ||
|
|
||
| `/approval-mode` during an open tool-confirmation dialog can fire two `current_mode_update` within ms (user `setMode` + the tool's `ProceedAlways` handler). Acceptable (converges); optionally skip emit when the resulting mode equals current. Documented, not gated. |
There was a problem hiding this comment.
[Suggestion] A2 prerequisite text and §7 status markers are stale. approvalModeQueue already exists in the codebase (bridge.ts:262-263), A4 is shipped (#4539), and A1 is shipped (#4546). The "Hard prerequisite" section overstates what the queue alone achieves (reconciliation is the actual divergence closer — see §7 step 3c). §7 step 1 doesn't mark A4 as completed.
Update status markers: add "(Shipped as #4539)" to §7 step 1, update §3 to note the queue has landed, and restate A2's status as unblocked.
— qwen3.7-max via Qwen Code /review
What this is
A design-first, docs-only proposal for the A-series follow-ups surfaced by the cross-client real-time sync audit and the PR #4484 post-merge review. No implementation — it defines the approach so it can be reviewed before any code lands. Each item ships as its own implementation PR after this is approved.
The bugfix/cleanup follow-ups from the same review (epoch-reset resync, approval-mode serialization, cancel dedup, forward-failure signal, replay wire rename, Provider catch-up) are separate and already in review.
The four gaps
/model, plan-mode, or agent-internal switches callconfig.switchModel()directly and emit nothing; only the HTTP route broadcasts. Proposal: acurrent_model_updatesessionUpdate (mirrors the existingcurrent_mode_update), mapped by the bridge to the existingmodel_switchedevent, with HTTP + in-session converged on a single emitter to avoid double-broadcast.setModecallsconfig.setApprovalMode()without notifying. Proposal: emitcurrent_mode_updatefromsetMode; affirm and explicitly document the session-scoped (always) vs workspace-scoped (persist-only) broadcast split with ascopediscriminator.permission_resolvedoriginator/voter ambiguity. ItsoriginatorClientIdcarries the voter, whilepermission_request's carries the prompt originator. Proposal: add a canonicalvoterClientIdalias (non-breaking, same pattern as the acceptedlastReplayedEventIdrename); SDK prefers it.session_snapshotframe emitted before replay so a fresh attach renders correct state without extra round-trips.Why
These are the remaining "a state change on one path is invisible to other clients" gaps. Closing them gives the daemon a single coordination invariant: every model/approval-mode/permission transition broadcasts exactly once regardless of entry path, and a fresh attach can reconstruct side-channel state without out-of-band pulls.
Contents
Per-item problem (with code anchors), proposed design, alternatives, wire-compat, and risk; cross-cutting single-emitter and additive-alias patterns; proposed sequencing (A4 → A1+A2 → A5) and a test plan.
Requesting design review before implementation.