feat(cli,sdk): qwen serve daemon (Stage 1)#3889
Conversation
Adds a `serve` subcommand that boots an Express 5 listener with bearer auth, host allowlist, and CORS modeled on `vscode-ide-companion/src/ ide-server.ts`. Ships only `/health` and `/capabilities` to begin with; session/prompt/event routes will land in follow-up PRs once the per- session ACP child-process bridge in `httpAcpBridge.ts` is wired. Defaults to 127.0.0.1 with auth disabled so local development needs no configuration. Binding beyond loopback (e.g. `--hostname 0.0.0.0`) refuses to start without a token (`--token` or `QWEN_SERVER_TOKEN`). Capabilities envelope versioned at v=1 with a `features` array — clients should gate UI off `features`, never off `mode`, so subsequent PRs can add capability tags without breaking older clients. Per design issue's Stage 1 scope (~700-1000 LOC). Adds ~430 LOC of implementation + tests in this scaffold; the remaining budget belongs to the route wiring + bridge implementation in follow-ups.
Stage 1 follow-up to the scaffold. Implements the bridge between the
HTTP daemon and the existing ACP child agent, plus the first session
endpoint.
`HttpAcpBridge.spawnOrAttach`:
- Spawns `node $cliEntry --acp` per workspace via an injectable
`ChannelFactory` (default uses `process.argv[1]`; tests use an
in-memory `TransformStream` pair so they don't fork real processes).
- Drives the ACP `initialize` + `newSession` handshake via the SDK's
`ClientSideConnection`, with a 10s timeout that kills the channel.
- Under `sessionScope: 'single'` (default), reuses the live session
when the same canonical workspace cwd is requested again — backs
the `attached: true` flag.
- The `Client` impl on the bridge side proxies file reads/writes to
local fs (daemon and agent share the host) and buffers
`sessionUpdate` notifications for the SSE wiring in the next PR.
`requestPermission` returns `cancelled` until the
`/permission/:requestId` route lands.
`POST /session`:
- 400 on missing or relative `cwd`.
- 200 with `{sessionId, workspaceCwd, attached}` on success.
- 500 on bridge failure (the failing channel is killed, not leaked).
`runQwenServe` constructs the bridge and ties `bridge.shutdown()` into
the listener-close path so SIGINT/SIGTERM drain children before the
socket closes.
Tests (14 new, 0 regressions in the 4967-test baseline):
- 9 bridge cases over an in-memory channel — fresh spawn, single-scope
reuse, cross-workspace isolation, thread-scope independence, path
canonicalization, relative-path rejection, init failure cleanup,
init timeout, multi-channel shutdown.
- 4 route cases for /session (missing/relative/200/500).
- 1 lifecycle case asserting `runQwenServe.close()` calls
`bridge.shutdown()` before closing the listener.
Verified end-to-end: `qwen serve` boots, `POST /session` spawns a real
`qwen --acp` child and returns the SDK-assigned `sessionId`, repeat
calls under the same cwd return `attached: true`, `SIGTERM` reaps the
child along with the listener.
…3803) Stage 1 follow-up after the bridge scaffold. Adds the two routes a client needs to actually run a turn against the daemon. Bridge: - `sendPrompt(sessionId, req)` looks up the session, FIFO-queues the call against the per-session prompt queue, and forwards through the SDK `ClientSideConnection.prompt`. Concurrent calls observe ACP's "one active prompt per session" invariant — second waits for first. - A failed prompt does NOT poison the queue; the tail catches and keeps draining so the next caller still runs (the original caller still sees its own rejection). - `cancelSession(sessionId, req?)` bypasses the queue and forwards the ACP notification immediately. ACP semantics: the agent winds down the *currently active* prompt; queued work is unaffected. - Both methods throw `SessionNotFoundError` (a typed Error subclass) when the id is unknown so route handlers can map cleanly to 404 without brittle message matching. - Both methods overwrite the `sessionId` field in the request body with the routing id — a stale or spoofed body would otherwise be dispatched to the wrong agent process. Routes: - `POST /session/:id/prompt` → 200 with PromptResponse, 400 on missing/non-array prompt, 404 on unknown session, 500 on agent error. - `POST /session/:id/cancel` → 204 always (cancel is a notification), 404 on unknown session. Tests (14 new — 7 bridge + 7 route, 0 regressions in the 4981 baseline): - sendPrompt: success forwards & returns response · routing-id overrides body sessionId · concurrent prompts FIFO-serialize (verified via per-prompt start/end ordering with a release latch) · failed prompt doesn't block subsequent prompts · 404 for unknown id. - cancelSession: forwards with routing id · 404 for unknown id. - Routes: 200/400/404/500 paths for prompt; 204 with body or empty + 404 for cancel. Verified end-to-end against a real `qwen --acp` child: - POST /session/:id/prompt with `[{type:'text',text:'hi'}]` → 200 `{"stopReason":"end_turn"}` in ~3.4s. - POST /session/:id/cancel → 204. - POST /session/does-not-exist/prompt → 404 with the unknown id surfaced in the body.
Stage 1 follow-up that turns prompt into a real streaming experience.
Replaces the in-memory `notifications: SessionNotification[]` buffer
on each session with a per-session EventBus and exposes it through
`GET /session/:id/events` as an `text/event-stream` SSE feed.
EventBus (`packages/cli/src/serve/eventBus.ts`):
- Monotonic per-session ids (`v: 1` schema). Each `publish` chains an
id, returning the materialized BridgeEvent.
- Bounded ring (default 1000) backs `Last-Event-ID` reconnect — a
consumer that drops can resume from `lastEventId` and replay any
still-buffered events before live events flow.
- Per-subscriber bounded queue (default 256). When a slow consumer
overruns its queue, the bus appends a synthetic `client_evicted`
terminal frame and closes that subscription so it can't hold the
daemon hostage. Other subscribers are unaffected.
- `subscribe()` returns an AsyncIterable — registration is synchronous
so events `publish`ed immediately after the subscribe land in the
queue (a generator-style implementation deferred registration to
first `next()` and raced with publishes).
- AbortSignal-aware: aborting the signal closes the iterator promptly.
Bridge (`httpAcpBridge.ts`):
- `BridgeClient.sessionUpdate` now publishes onto the session's
EventBus instead of pushing to a plain array — every ACP
notification the agent emits becomes a stream event automatically.
- New `subscribeEvents(sessionId, opts?)` returns the bus's
AsyncIterable; throws `SessionNotFoundError` for unknown ids.
- Shutdown closes every live event bus before killing channels so
pending consumers unwind cleanly.
Route (`server.ts`):
- `GET /session/:id/events` sets the SSE content type, advertises a
3s reconnect hint, and writes a 15s heartbeat comment frame to
keep proxy/NAT connections alive.
- Forwards the `Last-Event-ID` header to the bus.
- `req.on('close')` triggers an AbortController that propagates into
the bridge subscription so disconnects don't leak subscribers.
- 404 when the bridge can't find the session.
Capabilities envelope: `STAGE1_FEATURES` now advertises
`session_create`, `session_prompt`, `session_cancel`, `session_events`
in addition to `health`/`capabilities` so clients can light up UI for
the routes that have actually shipped.
Tests (16 new, 0 regressions in the 4995 baseline):
- 9 EventBus unit cases — id sequencing, live delivery, replay,
replay+live splice, fan-out to N subscribers, eviction on
overflow, abort-signal unsubscribe, bus.close() drains
subscribers, ring-size eviction.
- 4 bridge subscribe cases — 404, sessionUpdate→event publishing
via real ACP fake-agent, shutdown closes live subscriptions.
- 4 SSE route cases against a live HTTP listener — frame format,
Last-Event-ID forwarding, 404, abort propagation on disconnect.
Verified end-to-end against a real `qwen --acp` child:
- Subscribed to `/session/$SID/events`, fired `POST /session/$SID/prompt`
with text content. Captured 13 distinct `event: session_update`
SSE frames in real time during the model's response — `available_
commands_update` metadata, 9 `agent_thought_chunk` frames carrying
the model's chain-of-thought, 3 `agent_message_chunk` frames with
the actual reply, and a final usage frame with token totals.
- Frames carry monotonic ids 1..13, the daemon-side counter, and
are valid SSE per the EventSource spec.
Stage 1 follow-up that turns `BridgeClient.requestPermission` from a
hardcoded `cancelled` placeholder into a real first-responder vote
loop, and ships the HTTP route any attached client uses to cast the
deciding vote.
Bridge:
- `requestPermission` generates a UUID requestId, registers a
pending entry on a daemon-wide map (and the owning session's
`pendingPermissionIds` set), publishes a `permission_request`
event onto the session's EventBus (so SSE subscribers see it),
and awaits the resolution.
- New `respondToPermission(requestId, response)` resolves the
pending promise with the supplied outcome. First call wins —
subsequent calls return false. On success the bridge publishes a
`permission_resolved` event so other attached clients can update
their UI when the race is decided.
- `cancelSession` and `shutdown` both resolve every still-pending
permission for the affected session(s) as
`{ outcome: { outcome: 'cancelled' } }` per the ACP spec
requirement that a cancelled prompt MUST resolve outstanding
requestPermission calls with cancelled.
- New `pendingPermissionCount` getter exposes inflight count for
inspection / tests.
Route (`server.ts`):
- `POST /permission/:requestId` validates the body's `outcome` is
either `{ outcome: 'cancelled' }` or `{ outcome: 'selected',
optionId: string }`, then forwards to `bridge.respondToPermission`.
- 200 on accepted vote, 404 when the requestId is unknown or
already resolved (Stage 1 doesn't differentiate), 400 on a
malformed outcome.
Capabilities envelope: STAGE1_FEATURES gains `permission_vote`.
Tests (14 new — 9 bridge + 5 route, 0 regressions in the 5011 baseline):
- Bridge: publishes permission_request with a generated requestId
and waits; respondToPermission first-responder wins; publishes
permission_resolved on vote; respondToPermission false for
unknown requestId; cancelSession resolves outstanding as
cancelled; shutdown resolves outstanding as cancelled.
- Route: 200 on selected outcome; 200 on cancelled outcome; 404 on
unknown requestId; 400 on malformed outcome; 400 on missing
outcome.
Verified end-to-end against a real `qwen --acp` child:
- Subscribed to /session/$SID/events, sent a prompt asking the
agent to write a file at /tmp/qwen-serve-permission-e2e-test.txt.
- The agent triggered a permission_request via the bus, surfacing
the three options Qwen Code presents (Allow Always / Allow /
Reject) with their option ids.
- POSTed `{outcome:{outcome:"selected",optionId:"proceed_once"}}`
to /permission/$requestId — got HTTP 200.
- Bus published the matching permission_resolved event.
- Agent proceeded with the writeTextFile tool call; file was
actually created on disk with the expected content.
Stage 1 follow-up that proves the cross-mode protocol-isomorphism design
assumption: an SDK client can drive the daemon's HTTP routes end-to-end
without going through ProcessTransport's stdio + stream-json path.
DaemonClient is a sibling of ProcessTransport, not a replacement. The two
speak different protocols (ACP NDJSON over HTTP vs stream-json over
stdio). Existing `query()` users keep getting subprocess-mode unchanged;
applications that want daemon-mode (cross-client attach, shared MCP
pool, network reachability, first-responder permissions) opt in by
constructing a DaemonClient against a running `qwen serve`.
API surface (`packages/sdk-typescript/src/daemon/`):
- `new DaemonClient({ baseUrl, token?, fetch? })`. The `fetch` override
is for tests; defaults to `globalThis.fetch`. Trailing slashes on
`baseUrl` are stripped.
- `health()`, `capabilities()` — discovery.
- `createOrAttachSession({ workspaceCwd, modelServiceId? })` — `attached:
true` on the response indicates a session was reused under
sessionScope:single.
- `prompt(sessionId, { prompt: ContentBlock[] })` — returns
PromptResult with stopReason.
- `cancel(sessionId)` — tolerates 204; throws on 404.
- `subscribeEvents(sessionId, { lastEventId?, signal? })` — async
iterator over parsed SSE frames; AbortSignal-aware. Native Node
AbortController only — jsdom polyfills are incompatible with undici.
- `respondToPermission(requestId, response)` — first-responder vote;
returns true on 200, false on 404 (lost the race or unknown id),
throws on 400/500.
`DaemonHttpError` is thrown for any non-2xx (besides the 404
"already-resolved" case on permission votes); carries `status` and
`body` so callers can branch on standard daemon HTTP semantics.
`parseSseStream(body)` is the underlying SSE parser; exported separately
so applications can consume daemon SSE outside the DaemonClient surface.
Handles split-chunk frames, comment/retry directives, malformed JSON
(skip), trailing frame without final newline.
Wire types live SDK-side (no SDK→CLI dep); the capabilities envelope's
`v` field signals breaking changes.
Tests (26 new, 0 regressions in the 201 baseline):
- 7 SSE parser cases — single frame, multiple frames, comments,
chunked-split frame, malformed JSON skip, trailing frame on close,
empty stream.
- 19 DaemonClient cases — health success/error, capabilities, bearer
auth presence/absence, createOrAttachSession success/400, prompt
body shape + sessionId url-encoding, cancel 204/404, permission
200/400/404, subscribeEvents header forwarding + 404, baseUrl
normalization.
Verified end-to-end against a real `qwen serve` daemon driving a real
`qwen --acp` child:
- `client.capabilities()` returned `{v:1, mode:"http-bridge", features:
[...7 tags]}`.
- First `createOrAttachSession` returned `attached:false`; second
returned `attached:true` with the same sessionId.
- `client.prompt(...)` with text content yielded `{stopReason:
"end_turn"}` while the parallel `subscribeEvents` iterator streamed
10 distinct frames during the same turn.
- AbortController on the events iterator cleanly severed the SSE
connection.
Closes the §04 Stage-1 routes table for `qwen serve` with the two
remaining endpoints, plus matching SDK methods.
`GET /workspace/:id/sessions`
- `:id` is the URL-encoded canonical absolute workspace path
(Express decodes path params automatically; clients pass
`encodeURIComponent(cwd)`).
- Returns `{ sessions: [{ sessionId, workspaceCwd }, ...] }` for live
sessions whose canonical workspace matches.
- Empty array (not 404) when the workspace is idle so picker UIs
don't have to special-case "no sessions yet".
- 400 when the decoded path isn't absolute.
`POST /session/:id/model`
- Body: `{ modelId: string, ... }`. The route's `:id` overrides any
spoofed sessionId in the body.
- Forwards to ACP's `unstable_setSessionModel` and publishes a
`model_switched` event onto the session bus so cross-client UIs
update.
- 200 with the agent's response on success, 400 on missing/empty
modelId, 404 on unknown session.
- The SDK method is currently unstable; documented in the bridge
comment in case the spec renames the method when it stabilizes.
Bridge:
- New `listWorkspaceSessions(workspaceCwd)` iterates `byId.values()`
and filters by canonical workspace path; works for both `single`
and `thread` session scopes.
- New `setSessionModel(sessionId, req)` forwards through
`connection.unstable_setSessionModel`, normalizes sessionId,
publishes `model_switched`, throws SessionNotFoundError on
unknown ids.
`STAGE1_FEATURES` capabilities envelope grows to 9 tags, adding
`session_list` and `session_set_model`.
SDK (`DaemonClient`):
- `listWorkspaceSessions(workspaceCwd)` URL-encodes the cwd and
returns the parsed `sessions` array directly.
- `setSessionModel(sessionId, modelId)` POSTs the body and returns
the agent response (currently opaque per ACP unstable spec).
- Wire types `DaemonSessionSummary` and `SetModelResult` exported
from the SDK barrel.
Tangential cleanup: `sendBridgeError` now extracts a useful message
from non-Error values via a small `errorMessage` helper. JSON-RPC
errors from the agent (`{code, message, data}`) used to surface as
`"[object Object]"` in the 500 response body; they now show the
inner `message` field. Caught while running the model-set e2e.
Tests (17 new — 9 bridge + 7 route + 4 SDK, 0 regressions in the
5022 + 227 baselines):
- Bridge listWorkspaceSessions: matching cwd returns the live
sessions; canonicalizes the lookup; empty for relative paths.
- Bridge setSessionModel: forwards modelId + overrides body
sessionId; publishes model_switched event; 404 unknown session.
- Route /workspace/:id/sessions: returns the bridge list; empty for
idle workspace; 400 for relative path.
- Route /session/:id/model: 200 success; 400 missing modelId; 400
empty modelId; 404 unknown session.
- SDK listWorkspaceSessions: URL-encodes the cwd; throws on 400.
- SDK setSessionModel: posts body; throws on 404.
Verified end-to-end against a real `qwen serve`:
- SDK reports 9 capability features, list returns the existing
session, attached:true on repeat create, and `setSessionModel`
rejects with HTTP 500 when the modelId isn't registered (with the
daemon now surfacing "Internal error" instead of "[object Object]").
- 404 path through SDK on unknown sessionId works.
Self-review pass on PR #3889. Two real correctness bugs and an ergonomics gap, plus the test-coverage holes the audit surfaced. The loudest finding ("host allowlist no-op when bind=localhost") was a false positive — the conditional was misread; existing tests already prove the validator is active on `localhost` binds. Real fixes: - Bearer-auth timing-attack: `parts[1] !== token` short-circuits per byte, leaking which prefix is correct via response latency. Replace with SHA-256 of both sides + `crypto.timingSafeEqual` so comparison is constant-time regardless of token length. - Concurrent `spawnOrAttach` race in single-scope: two parallel callers for the same workspace both passed the `byWorkspace.get` check, both spawned, and one entry ended up orphaned in `byId` while the other won `byWorkspace`. Violates the "at most one session per workspace" invariant. Coalesce via an `inFlightSpawns` map: parallel callers attach to the in-flight promise and report `attached: true`. The slot is cleared on both success and rejection so a failed spawn doesn't poison the workspace forever. New test asserts ONE channel spawns under parallel calls and that retry works after rejection. - `Number.parseInt('1.5e10z', 10)` returns 1, so a malformed `Last-Event-ID` header silently passes through. Tighten `parseLastEventId` to `^\d+$` so anything not a pure decimal integer is dropped. New test exercises 'abc', '-1', '1.5e10z'. Ergonomics: - `LOOPBACK_BINDS` and `LOOPBACK_HOST_BINDS` now include `::1` and `[::1]`. IPv6 loopback users no longer have to set a token. Host-allowlist allows `[::1]:port` Host headers. Documentation: - `BridgeClient` doc-comment now states the Stage 1 trust model explicitly: agent runs as the same UID, the file-proxy methods are NOT a workspace-cwd sandbox, restricting them would be theatre. The audit flagged this as a "design gap" but the daemon-and-agent-on-same-host posture makes a sandbox here redundant — Stage 4+ remote-sandbox swaps the Client for a sandbox-aware variant. SDK fix: - `DaemonClient.failOnError` previously called `res.json()`, which consumes the body even on parse-failure; the subsequent `res.text()` returned empty. New impl reads once as text and attempts JSON-parse; raw text is the fallback. New test asserts a `text/plain` 502 surfaces the body verbatim. Test gap fills (audit-flagged): - Bridge: in-memory file-proxy tests for `BridgeClient.{read,write} TextFile` including line/limit slicing. - SSE route: `stream_error` synthetic frame on iterator throw mid-stream; numeric Last-Event-ID forwarded; malformed Last-Event-ID dropped. - DaemonClient: text/plain error body coerced to `body` field; `respondToPermission` 5xx throws; `subscribeEvents` null-body throws; `cancel`/`respondToPermission` URL-encode session/request ids that contain slashes. Verified end-to-end with a token-required daemon: right token → 200, wrong/missing/malformed → 401. All paths return uniform 401 messages so a side-channel can't distinguish between "no header", "bad scheme", and "wrong token". Test counts: cli serve **89** (was 81, +8), sdk daemon **35** (was 30, +5). Full suites still green.
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Implements Stage 1 of qwen serve: an HTTP daemon that bridges ACP NDJSON over HTTP + SSE, plus an SDK-side DaemonClient to drive Stage 1 routes for health/capabilities, session management, prompt/cancel, SSE events, set-model, and permission voting.
Changes:
- Added CLI daemon server (
qwen serve) with auth controls (bearer + Host allowlist + Origin denial) and Stage 1 HTTP/SSE routes. - Implemented
HttpAcpBridge+ per-sessionEventBusfor subprocess lifecycle, event fan-out/replay, prompt FIFO, and permission vote resolution. - Added TypeScript SDK
DaemonClient+ SSE parser with unit test coverage for daemon interactions.
Reviewed changes
Copilot reviewed 20 out of 21 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/sdk-typescript/test/unit/daemon-sse.test.ts | Adds unit tests for SSE parsing behavior (chunking, invalid JSON, trailing frames). |
| packages/sdk-typescript/test/unit/DaemonClient.test.ts | Adds unit tests validating HTTP semantics, auth headers, SSE subscription, and error coercion. |
| packages/sdk-typescript/src/index.ts | Exposes daemon client + types from the SDK public entrypoint. |
| packages/sdk-typescript/src/daemon/types.ts | Defines SDK-side wire types for the daemon HTTP API. |
| packages/sdk-typescript/src/daemon/sse.ts | Implements SSE-to-DaemonEvent async iterator parser. |
| packages/sdk-typescript/src/daemon/index.ts | Adds a barrel export for daemon client, SSE parser, and wire types. |
| packages/sdk-typescript/src/daemon/DaemonClient.ts | Implements the SDK HTTP client for the Stage 1 daemon routes. |
| packages/cli/src/serve/types.ts | Defines daemon mode/options + capabilities envelope and Stage 1 feature tags. |
| packages/cli/src/serve/server.ts | Implements Express app with routes for health/capabilities/sessions/prompt/cancel/events/model/permission. |
| packages/cli/src/serve/server.test.ts | Adds extensive route-level tests for auth, host allowlist, sessions, SSE behavior, and shutdown. |
| packages/cli/src/serve/runQwenServe.ts | Adds listener startup + token enforcement + shutdown draining + signal handling. |
| packages/cli/src/serve/index.ts | Exports serve-related APIs (app builder, runner, bridge, event bus). |
| packages/cli/src/serve/httpAcpBridge.ts | Implements per-session subprocess spawning, ACP handshake, FIFO prompt queue, permission voting, and event publishing. |
| packages/cli/src/serve/httpAcpBridge.test.ts | Adds unit tests for spawn coalescing, prompt FIFO, permissions, events, fs proxying, and shutdown. |
| packages/cli/src/serve/eventBus.ts | Implements bounded subscriber queues, replay ring buffer, and eviction semantics. |
| packages/cli/src/serve/eventBus.test.ts | Adds unit tests for replay, fan-out, eviction, abort behavior, and ring-size behavior. |
| packages/cli/src/serve/auth.ts | Adds bearer auth, Origin denial via CORS, and Host allowlisting for loopback binds. |
| packages/cli/src/config/config.ts | Registers the new serve command and adjusts command exit behavior to allow daemon blocking. |
| packages/cli/src/commands/serve.ts | Adds qwen serve command surface and --http-bridge flag gating Stage 1. |
| packages/cli/package.json | Adds Express/CORS/Supertest and related typings. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Second self-review pass on PR #3889. Three real bugs (one correctness, one resource-cleanup, one cosmetic) plus consolidation of the loopback bindings into a single source of truth. Real fixes: - Shutdown could hang forever on a long-lived SSE consumer: `server.close` waits for every in-flight connection to drain, and a paused EventSource client never disconnects. Added a `SHUTDOWN_FORCE_CLOSE_MS` (5s) timer that calls `server.closeAllConnections()` to force-destroy stuck sockets, then resolves so `process.exit(0)` can run. New test asserts close completes well under 5.5s even when an SSE GET is in flight. - Signal-handler race during shutdown: round 1 detached the SIGINT/SIGTERM listeners *up front* in `handle.close()`. If a second SIGTERM arrived during the drain, no handler existed and Node's default termination ran, orphaning agent children. Switch to detaching at the *end* of the close path (in `finish()`): during the drain window the handler is still attached and the `if (shuttingDown) return` guard makes a second signal a no-op; after drain completes we can safely remove the listeners (this also fixes a test-suite MaxListenersExceededWarning that fired once we ran the runQwenServe tests >10 times in a single process). - SSE response had no `error` listener. When the underlying TCP socket died (RST, kill -9 on the client), the next `res.write` threw EPIPE and Express forwarded it to the default error handler, logging noisily. Added `res.on('error', cleanup)` so the failure is absorbed and triggers the same teardown path the `req.on('close')` handler uses. Validation: - `createHttpAcpBridge` now throws on invalid `sessionScope` (anything other than `'single'` or `'thread'`) and on `initializeTimeoutMs <= 0`. Misconfigured callers used to silently degrade to thread behavior; now they fail loudly. Cleanup: - The `LOOPBACK_BINDS` set was duplicated between `auth.ts` and `runQwenServe.ts` (round 1 missed this). Extracted into `packages/cli/src/serve/loopbackBinds.ts` with a single `isLoopbackBind(hostname)` helper. Both files now import; drift is impossible. - `res.flushHeaders?.()` lost the optional chaining. The method is on `http.ServerResponse` since Node 1.6; our `engines` floor is 20. Tests added: - bridge: `sessionScope` validation, `initializeTimeoutMs` validation. - server: shutdown force-close timeout, SIGINT/SIGTERM listener detach-after-drain. False positives from the round 2 audit (verified and dismissed): - "EventBus nextId overflow at 2^53" — theoretical only (would require ~9 quadrillion publishes per session). No code change. - "Subscribe-during-close race" — JS is single-threaded; the close() flag is set synchronously before the loop touches state. - "Queued prompts on shutdown" — by design; documented via the promptQueue tail comment. - "10MB body parser limit" — design choice for Stage 1's in-memory buffering model; revisit if ACP streaming lands in Stage 2. - "Unbounded body read in DaemonClient.failOnError" — daemon is local in Stage 1; the threat surface for adversarial-large error bodies is the same as the daemon's other unbounded buffers. Test counts: cli serve **93** (was 89, +4), full cli **5047** (no regressions), sdk **236** (no regressions).
Two more self-review passes on PR #3889. No correctness bugs surfaced this time — round 3 found a HIGH-severity Windows-path claim that turned out to be a false positive (`path.win32.isAbsolute('/foo/bar')` returns true; verified against Node 20). Round 4 confirmed every prior decision and surfaced one latent-but-not-currently-triggered concurrency note. Changes are pure documentation + a tiny optional-chain cleanup: - Drop `?.` on `server.closeAllConnections()` in runQwenServe.ts — the method exists since Node 18.2 and our `engines` floor is 20. The optional chain dated from before round 2's force-close timer landed; clean it up. - Help text for `qwen serve --port` now documents that port 0 means "OS-assigned ephemeral port" (which the implementation has always supported but never advertised). - `defaultSpawnChannelFactory` gains a comment near the spawn site documenting the FD-budget implication (~3 FDs per session, bump `ulimit -n` for many concurrent sessions) and the `stdio: ['pipe', 'pipe', 'inherit']` choice (child stderr lands in the daemon's stderr, interleaved across sessions). Both are Stage-1-accepted; Stage 2/4+ revisit each. - Comment on the bridge's `byWorkspace`/`byId` Maps documenting the known gap that a child crashing between requests leaves a garbage SessionEntry until daemon shutdown — surfaced as a per-prompt failure when the dead session is touched, not a hang. Stage 2's in-process bridge eliminates the spawned-child failure mode entirely so this gap goes away naturally. - `EventBus.subscribe` doc-comment now states explicitly that the returned iterator is NOT safe to drive from concurrent `.next()` callers — the underlying queue isn't atomic. Daemon usage is the sequential `for await ... of` inside the SSE route, so this is safe in production. Documented so a future fan-out consumer doesn't accidentally rely on undefined behavior. False positives verified and dismissed (round 3 + 4 combined): - `path.isAbsolute('/foo/bar')` Windows breakage — `path.win32. isAbsolute('/foo/bar')` is true; verified empirically. - "Windows drive divergence" causing duplicate sessions — different drives are different on-disk paths; sessions intentionally differ. - "parseSseStream early-break leaks reader" — `for await ... break` triggers `iterator.return()` which runs the generator's `finally` that calls `releaseLock`. Standard JS semantics. - "Promise executor sync-throw fragility in requestPermission" — sync throws inside `new Promise(executor)` reject the outer promise; functionally correct, just stylistic. - "Force-close timeout test elapsed assertion flakiness" — assertion is `< 5500ms` but the natural happy-path is sub-100ms. Generous headroom; not flake-prone in practice. - "fetch reference stale after polyfill" — `globalThis.fetch.bind` captures at construction; tests inject `opts.fetch` instead of polyfilling, which is the correct pattern. Test counts unchanged (cli serve **93**, sdk **236**); typecheck + lint clean. STAGE1_FEATURES still matches every implemented route 1:1, fakeBridge in tests implements every HttpAcpBridge method.
wenshao
left a comment
There was a problem hiding this comment.
Addresses the four critical findings from the PR #3889 reviewer pass: 1. ACP `ReadTextFileRequest.line` is 1-based per spec, but the bridge's `BridgeClient.readTextFile` was treating it as a 0-based slice index. A client asking for `{line:1, limit:2}` ("first two lines") was getting lines 2-3 — a sign-off-by-one bug that breaks every editor / SDK client following the ACP schema. Convert to 0-based via `Math.max(0, line - 1)`. The existing slice test was asserting the wrong behavior; updated to expect the spec-correct result and added a second `line:3, limit:2` case to lock in the offset. 2. `modelServiceId` was accepted by the SDK + server `POST /session` path, forwarded into `bridge.spawnOrAttach`, and then silently dropped: `doSpawn` never wired it into the agent. Callers requesting a specific model got the agent's default and no indication anything was wrong. Now `doSpawn` issues `unstable_setSessionModel` immediately after `newSession`. If the agent rejects the model id, the half-initialized session is torn down and the spawn rejects so the caller can retry cleanly instead of inheriting silent drift. Three new bridge tests: happy path, omit-when-undefined, agent-rejection cleanup. 3. The CORS middleware used `cors({ origin: (o, cb) => cb(new CORSError(...), false) })` for browser-Origin requests. `cors` flows the Error into Express's error chain; without an explicit error handler that produces a 500 + HTML body, which is misleading for what is really a deterministic 403 denial. Replace with a tiny `RequestHandler` that checks `req.headers.origin` directly and returns `403 { error: 'Request denied by CORS policy' }` JSON. Drops the `cors` and `@types/cors` dependencies — there's no other consumer in the cli package. 4. The SSE `stream_error` synthetic frame hard-coded `id: 0`, which would regress the client's `Last-Event-ID` tracker and trigger duplicate replays on reconnect. The frame is terminal and daemon-emitted — it has no place in the per-session monotonic sequence. Refactor `formatSseFrame` to omit the `id:` line when the input event has no id field, and emit `stream_error` without one. Test updated to assert `frames[1].id === undefined` while the preceding `session_update` still carries its monotonic id. Tangential cleanup: `errorMessage` now formats the SSE error body (was `err.message` only — would have shown `[object Object]` for JSON-RPC errors mid-stream, mirroring the round-1 SDK fix). Test counts: cli serve **96** (was 93, +3 modelServiceId cases); existing readTextFile slice test rewritten in place. Full typecheck + lint + suite green.
…ish (#3803) Second batch of reviewer-flagged fixes for PR #3889. Addresses 7 robustness issues across the daemon's SSE pipeline + the bus + the SDK's stream parser. Daemon SSE (`server.ts`): - SSE writes now respect backpressure. `res.write` returns false when the kernel send buffer is full; the previous code ignored that and Node accumulated payloads in user-space memory unboundedly. A slow consumer on a chatty session could balloon daemon RSS. New `writeWithBackpressure` helper awaits `drain` (or `close`/`error`) before scheduling the next write — for both per-frame writes and heartbeats. - `parseLastEventId` rejects values > `Number.MAX_SAFE_INTEGER`. With the prior `^\d+$` regex a malicious 25-digit value would parse to a number that loses precision and confuses replay comparisons. EventBus (`eventBus.ts`): - `Last-Event-ID` replay events now `forcePush` past `maxQueued`. A client reconnecting with a 1000-event gap on a subscriber whose cap is 256 was silently losing entries 257-1000 — a sign-off-by- nothing breakage of the resume contract. Live publishes still go through the normal cap (slow live consumer must be evictable); historical replay is bypassed. - `onAbort` now disposes the subscription immediately instead of only closing the queue. An aborted-but-never-iterated subscriber used to linger in `bus.subs` until the consumer drove `next()` / `return()`. New tests cover both abort-after-subscribe and already-aborted-at-subscribe paths. - `BoundedAsyncQueue.next` now checks `buf.length > 0` before shifting instead of `buf.shift() !== undefined`. The bus never pushes `undefined` today but the queue is generic — the prior pattern would mis-handle a queue whose element type legitimately includes undefined. SDK SSE parser (`sse.ts`): - Now flushes the TextDecoder on stream close. Without the final `decoder.decode()`, an incomplete multi-byte UTF-8 sequence at the tail of the last chunk was silently dropped — corrupting any frame whose JSON ended mid-character. New test feeds a stream split mid-byte through "中" (3-byte UTF-8) and asserts the character round-trips. - Frame separators now accept both `\n\n` and `\r\n\r\n`. SSE spec allows CRLF, and intermediaries (corporate proxies, some Node http servers) sometimes normalize. Frame field splitter also accepts `\r?\n`. Two new tests cover pure CRLF + mixed-LF/CRLF. Test counts: cli serve **99** (was 96, +3 EventBus); sdk daemon-sse **10** (was 7, +3). Full typecheck + lint + suite green.
Last batch from the PR #3889 reviewer pass: mostly docs + a ReDoS-tooling-silencing rewrite + a yargs-key cleanup. - `commands/serve.ts` ServeArgs interface dropped the camelCase `httpBridge` mirror; the handler now reads `argv['http-bridge']` matching the declared option name. The dual surface relied on yargs's camelCase expansion behavior — fragile if yargs config ever changes. - `DaemonClient` constructor's `baseUrl.replace(/\/+$/, '')` (which is end-anchored and linear, but CodeQL's polynomial-regex detector flags any `\/+$` pattern on attacker-controlled input) swapped for a hand-rolled `stripTrailingSlashes` loop. Same behavior, no rule trigger. - `defaultSpawnChannelFactory`'s `cwd: workspaceCwd` flow into `spawn` is the second CodeQL finding ("uncontrolled data used in path expression"). It IS user-controlled, by design — that's the Stage 1 trust model. Added a `// lgtm[js/shell-command- constructed-from-input]` suppression with a comment explaining the model and pointing at issue #3803 §11 for the Stage 4+ remote- sandbox replacement. - Stale doc comment on `createServeApp` that still listed only `/health`, `/capabilities`, `POST /session` as shipped — now enumerates all 9 routes that match §04 of the design. - Stale doc comment on `HttpAcpBridge` saying "Stage 1 buffers them in-memory; SSE wiring lands in the next PR" — SSE wiring landed in commit 41aa950. Replaced with a description of the actual flow through EventBus + SSE. No behavior change; tests + lint + typecheck still green. cli serve still **99**, sdk **38** (was 30 before this batch — daemon-sse +3, DaemonClient +5 from rounds 1+2). Full e2e against built daemon re-verified: CORS denial returns 403 JSON (was 500 HTML), bad `modelServiceId` now causes spawn to fail with HTTP 500 (was: silent default-model substitution), `POST /session` without modelServiceId unaffected.
…vent.id optional (#3803) Two more fixes from a final post-review-comment audit pass on PR #3889. Both are subtle correctness gaps that fell out of the round-1 critical fixes (modelServiceId apply + SSE id-less stream_error). - In `httpAcpBridge.ts:doSpawn`, when `unstable_setSessionModel` rejects after `newSession` succeeded, we tear down the entry from `byWorkspace` + `byId` (round 1 fix) but did NOT close the EventBus we'd just constructed for that entry. The agent could have published a session_update notification during init that queued in the (now unreachable) bus's ring buffer; without an explicit close the bus + buffer linger until the next GC cycle. Bounded leak (1 bus per failed spawn × 1000-event ring) but cleaner to close it. New regression test exercises the retry path after a model-rejection failure to lock in that we don't reuse the orphan and that subscribers on the fresh session see an empty iterator on immediate abort. - SDK `DaemonEvent.id` is now `id?: number` instead of `id: number`. The round-1 SSE fix made the daemon emit `stream_error` frames *without* an `id:` line so they don't pollute the per-session monotonic sequence. The SDK parser correctly returns `undefined` for the missing field, but the type still advertised `id: number` — TypeScript consumers persisting `lastSeenId = event.id` would accidentally store `undefined`. Made the field optional and added a doc comment instructing consumers to skip frames without an id. Plus one more false-positive verified and dismissed: - "writeWithBackpressure Promise double-settle race": the auditor flagged that `res.write(chunk, callback)` could fire its callback after the synchronous `ok=true` resolve. Verified harmless — Promise double-settle is a no-op, the callback only rejects on error (caught separately by `res.on('error', cleanup)`), and multiple parallel writes register independent listener sets that each remove their own pair after firing. Test counts: cli serve **100** (was 99, +1 retry-after-model-rejection regression). SDK unchanged at 239. Full typecheck + lint + suites green; flow re-verified end-to-end.
Critical fixes: - server.ts safeBody() helper (BZ9uv/va/vs/wD + Bd10m + Bd1zz): prototype-pollution sanitization at the body-spread boundary. `__proto__` / `constructor` / `prototype` keys are stripped and the result is an Object.create(null) target. Replaces 5 sites of copy-pasted `typeof req.body === 'object'...` preamble + makes the `...(body as object)` spread sites safe. - httpAcpBridge requestPermission (Bd1yh): per-request wall-clock deadline (default 5 min, configurable via `BridgeOptions.permissionResponseTimeoutMs`). Without this, an agent calling requestPermission with no SSE subscriber connected would hang the per-session FIFO forever. After deadline, resolve as cancelled + log stderr warning. - httpAcpBridge requestPermission (Bd1z5): per-session pending permissions cap (default 64, configurable via `BridgeOptions.maxPendingPermissionsPerSession`). New requests past the cap resolve as cancelled with stderr warning. Prevents a chatty agent from growing pendingPermissions unboundedly. - runQwenServe onSignal double-signal force-exit (Bd1y6): new `bridge.killAllSync()` + `AcpChannel.killSync()` method synchronously SIGKILLs every live qwen --acp child BEFORE `process.exit(1)`. Previously double-Ctrl+C bypassed the async bridge.shutdown() and left children running as orphans. - server.ts SSE subscriber-limit response (Bd1zJ): 429 + Retry-After instead of 200 + stream_error frame. EventSource treats 4xx as terminal (no auto-reconnect); the previous 200+close-stream triggered EventSource's reconnect loop, amplifying the load the limit existed to prevent. - doSpawn ghost sessionId guard (Bd1zc): re-check byId.has() after applyModelServiceId(). The model-switch yields and can race channel.exited; without this, caller got HTTP 200 with a sessionId that 404s on every subsequent request. Follow-ups in the same diff: - sse.ts consumeFrames CRLF scan comment (BcRh_): the comment claimed the CRLF scan was bounded to `[cursor, lf)`, but Node's `indexOf` has no upper bound. Rewrote to describe what the code actually does (scan full remainder; only USE the result if it falls before `lf`). - sse.ts SseFramingError export (Bd10T): typed error class for framing-level failures so SDK consumers can distinguish "upstream isn't SSE" from generic network errors via instanceof check. Re-exported from @qwen-code/sdk. - protocol doc /health auth (Bctum): document the loopback exemption — `/health` doesn't require Authorization on loopback binds even when a token is configured. Matches `createServeApp`'s registration order. Bd1xz (cross-session permission escalation) acknowledged as duplicate of BUy4H — already documented as a known Stage 1 gap under the single-user / small-team trust model; fix is Stage 1.5 must-have #3 (per-client identity + per-session permission scope). Tests: - New: prototype-pollution test verifies `__proto__` spread doesn't pollute `Object.prototype`. - All 70 server + 55 bridge + 16 daemon-sse + 60 DaemonClient tests pass (203 total). `killSync()` stubbed on every inline test channel fake; fake bridge has `killAllSync()`.
…ly bounded (BeFHR + BeFId) Previous attempt at the BX9_a perf optimization left the CRLF scan running over the full remainder of `buf` on every loop iteration where an LF separator existed — only the LF-not-found fallback path was actually bounded. Comments claimed the CRLF scan was restricted to `[cursor, lf)` or "only fires when needed", but Node's `String.indexOf` doesn't accept an end index. Bound the scan via a `buf.slice(cursor, lf)` window before `indexOf` so the assertion is now true: in the common LF-only case we pay one full scan (for LF) plus one bounded scan over the matched frame's bytes (small).
…link, no-sessionId throw - httpAcpBridge.writeTextFile BfFvO: dangling-symlink case. `fs.realpath` throws ENOENT for a symlink whose target doesn't exist, and the blanket catch silently fell back to writing through the symlink itself — `rename(tmp, params.path)` then replaced the symlink with a regular file, exactly the bug BX8Yw was supposed to fix. Use `fs.readlink` to disambiguate "truly non-existent" from "dangling symlink"; resolve the dangling target manually and write through to it so the symlink stays a symlink. Regression test added. - httpAcpBridge BridgeClient resolveEntry BfFut: defensive throw on no-sessionId ACP call against a multi-session channel. ACP today carries sessionId on every per-session call, but if a future no-sessionId call lands, silently dropping it on a multi-session channel would be invisible. - httpAcpBridge.test.ts BX8YO Windows skip: hard-skip via `process.platform === 'win32'`. Git-Bash etc. ship a `mkfifo` binary that degenerates on Windows (creates a regular file or silently no-ops), making the assertion match the wrong error shape. Linux + macOS coverage is sufficient for a platform- agnostic `!stats.isFile()` check. BfFvW (CRLF scan comment) was already addressed in 0a4146a — the reviewer's diff was against the pre-fix version.
wenshao
left a comment
There was a problem hiding this comment.
Critical fixes:
- httpAcpBridge.doSpawn newSession-failure cleanup (BkwQA): if
`connection.newSession()` throws on a freshly-created channel
whose sessionIds set is empty, tear the channel down rather than
leaking the empty `qwen --acp` child in `byWorkspaceChannel`
(invisible to `sessionCount` / `maxSessions`). Channels with
other live sessions still survive — only the truly-empty case
reaps.
- httpAcpBridge.detachClient + killSession tombstone (BkwQP):
detachClient no longer reaps live sessions. Scenario: A spawns
(attached: false, hasn't opened SSE yet), B attaches
(attachCount: 1), B disconnects → previous code reaped A's
still-valid session. New behavior:
* killSession({ requireZeroAttaches: true }) sets
`entry.spawnOwnerWantedKill = true` when it bails on
attachCount > 0 (instead of just returning).
* detachClient ONLY decrements attachCount. It completes the
deferred reap only when (spawnOwnerWantedKill && attachCount
=== 0 && subscriberCount === 0).
* Both-disconnected case still works (reap completes via B's
detachClient seeing the tombstone). Spawn-owner-alive case
no longer reaps. Existing tanzhenxin-issue-2 test rewritten;
new test pins the spawn-owner-alive case.
- httpAcpBridge.writeTextFile mode preservation (BkwQW): stat the
target before writing; if it exists, chmod the tmp file to the
preserved mode (and chown owner/group — best-effort, EPERM
ignored for non-root). Previously a 0600 secret/config edit
would downgrade to umask-default 0644, exposing contents to
other local users.
- bridge.respondToPermission option-ID validation (BkwQI): new
`InvalidPermissionOptionError` thrown when the voter's `optionId`
isn't in the set of options the agent originally offered in the
`permission_request` event. PendingPermission now carries
`allowedOptionIds`. Server route catches the error → 400 (vs.
404 for unknown requestId). Prevents authenticated clients from
forging hidden outcomes like `ProceedAlways*` when the prompt's
`hideAlwaysAllow` policy intentionally suppressed them.
Doc fixes:
- httpAcpBridge top-of-file (BkdCg) + types.ts ServeMode (BkdC8):
rewrite the "each session spawns its own qwen --acp child"
framing to match the actual Stage 1.5 multi-session-per-channel
architecture (one child per workspace, sessions multiplex via
`connection.newSession()`).
wenshao
left a comment
There was a problem hiding this comment.
…g tests + 1 doc - writeTextFile mode-bits race (Blehd): the BkwQW fix preserved mode via `chmod` AFTER `fs.writeFile`, leaving a brief window where a `0600` secret-edit was readable at the directory's umask default (commonly `0644`). Now pass `mode` to writeFile directly so the file is CREATED with the preserved mode atomically via the `open(O_CREAT, mode)` syscall. The post-write `chmod` remains as belt-and-suspenders against a tight operator umask (POSIX `mode & ~umask` could drop bits we wanted preserved). - httpAcpBridge.test.ts: new bridge-level test for the BkwQI `InvalidPermissionOptionError` path (Blehk). Forge a vote with an `optionId` not in the agent-offered set; assert the throw AND that the pending permission survives so a valid vote can still resolve it. - server.test.ts: new route-level test for the BkwQI 400 mapping (Blehl). Fake bridge throws `InvalidPermissionOptionError`; assert response is 400 with `code: 'invalid_option_id'`, `requestId`, and `optionId` in the body. - commands/serve --http-bridge help text (Bk59I): updated to reflect Stage 1.5 multi-session — "one `qwen --acp` child per workspace, with multiple sessions multiplexed via the agent's native `newSession()`" (was: "per-session child").
…ody-read rejection (BlqF_) Some fetch impls (undici on abort) reject the in-flight `reader.read()` with an AbortError after `reader.cancel()` fires. Pre-fix that rejection bubbled to the consumer's `for await`, contradicting the "abort cancels cleanly" public contract — code that called `controller.abort()` to wind a subscription down saw an unexpected throw on the next iteration. Wrap `reader.read()` in try/catch: - if `signal?.aborted` is true → treat the rejection as clean completion (return from the generator) - otherwise re-throw, so real upstream failures (network drop, unexpected close, malformed body) still reach the consumer Two regression tests pin the guard's scope: signal-aborted mid-stream returns cleanly with the frames received so far; a non-abort `streamController.error(...)` still bubbles via `rejects.toThrow`.
…listener (BmJT1) Pre-fix: `publish()`'s eviction path deleted the sub from `this.subs` but never invoked `dispose()`, leaving the AbortSignal abort-listener registered in `subscribe()` attached. Because the consumer is by definition stalled (that's what caused the overflow), `next()` / `return()` never fire to detach the listener through the iterator path. Closures over the queue + sub stayed live until the AbortSignal itself went out of scope. Under attack (thousands of opened-then-stalled SSE clients), this amplified into significant heap retention. Fix: store `dispose` on `InternalSub` and invoke `sub.dispose()` from the eviction path. The same closure used by the abort listener / the iterator's `next()`/`return()` cleanup now runs through the eviction path too — idempotent through `disposed` so a post-eviction abort or iterator-return is still safe. Regression test pins the post-eviction abort + publish path producing zero side effects.
| | `--token <str>` | — | Bearer token. Falls back to `QWEN_SERVER_TOKEN` env var (with leading/trailing whitespace stripped — handy for `$(cat token.txt)`). | | ||
| | `--max-sessions <n>` | `20` | Cap on concurrent live sessions. New `POST /session` requests that would spawn a fresh child return `503` (with `Retry-After: 5`) when the cap is hit; attaches to existing sessions are NOT counted. Set to `0` to disable. Sized for single-user / small-team usage; raise it if your deployment has the RAM/FD headroom (~30–50 MB per session). | | ||
| | `--max-connections <n>` | `256` | Listener-level TCP connection cap (`server.maxConnections`). Bounds raw socket count irrespective of session count — slow / phantom SSE clients get rejected at accept time once full. Raise alongside `--max-sessions` if your deployment expects many SSE subscribers per session. | | ||
| | `--http-bridge` | `true` | Stage 1 mode: per-session `qwen --acp` child process. Stage 2 native in-process becomes available later. | |
| - **`--hostname 0.0.0.0` requires a token** — boot refuses without one. | ||
| - **`LOOPBACK_BINDS` includes IPv6** — `::1` and `[::1]` count as loopback for the no-token rule. | ||
| - **Host header allowlist** — on **loopback** binds the daemon checks `Host:` matches `localhost:port` / `127.0.0.1:port` / `[::1]:port` / `host.docker.internal:port` (case-insensitive per RFC 7230 §5.4) to defend against DNS rebinding. **Non-loopback binds (`--hostname 0.0.0.0`) intentionally bypass the Host allowlist** — the operator has chosen the surface area, so the bearer-token gate is the sole authentication layer; reverse proxies / SNI / client cert pinning are the operator's responsibility, not the daemon's. If you need Host-based isolation on a non-loopback bind, terminate TLS + check Host at a front proxy. | ||
| - **CORS denies any browser Origin** — returns `403` JSON. **Implication for browser-served webuis** (BUy4e): any `packages/webui`-style frontend that lives on a separate origin will get 403 at the wire. Stage 1 options for browser-style consumption: (a) package the webui as a native shell (Electron/Tauri) so no `Origin` header is sent, or (b) front the daemon with a same-origin reverse proxy that strips/rewrites `Origin` for a known frontend. Stage 1.5 will add `--allow-origin <pattern>` for opt-in named frontends. |
| - `{ "outcome": "selected", "optionId": "<one-of-the-options>" }` — accept / reject / proceed-once / etc, per the agent's offered choices | ||
| - `{ "outcome": "cancelled" }` — drop the request (matches what `cancelSession` / `shutdown` do internally) |
tanzhenxin
left a comment
There was a problem hiding this comment.
Review
The prior two critical findings (--max-connections 0 Node 22 break, orphan-on-coalesce-double-disconnect) are properly closed with regression tests, and the per-workspace multiplex architectural change is sound — careful session lifecycle, deduplicated channel shutdown, correctly-scoped permission/event routing per sessionId. The 16 follow-up commits address a broad sweep of code-review feedback with care. Two observations worth raising before ship: a scope/architecture note about the new bridge living alongside the existing one in the channels package, and one regression in the double-signal force-kill path.
1. Parallel ACP-bridge implementation duplicates the one in the channels package (severity: medium · confidence: very high)
The PR adds a 2400-line ACP bridge under serve/ but touches nothing in packages/channels/. The repo now hosts two independent ACP-bridge implementations that both spawn a qwen --acp child, run the ACP init handshake, multiplex multiple sessions onto one child, and dispatch session updates back to clients — with zero shared code. The new implementation is the stricter of the two: real init-handshake await instead of a 1s setTimeout, FIFO prompt queue per session, bounded per-subscriber event queues with replay, deadline-bounded permission requests with first-responder voting. The channels-base bridge has the corresponding gaps (timing-based init, no prompt serialization, EventEmitter fan-out, auto-approving permissions). Those gaps now have a correct reference implementation, but the channels adapters (dingtalk, weixin, telegram) continue to use the weaker bridge. As Stage 2's in-process refactor reshapes the new bridge's internals, the channels-side bridge will either need a parallel migration or fall further behind. Worth a follow-up to extract a shared multiplex-ACP base both implementations layer onto.
2. Double-tap Ctrl-C no longer force-kills wedged agent children during graceful drain (severity: medium · confidence: very high)
The Bd1y6 design promises that a second SIGINT/SIGTERM during the graceful drain synchronously SIGKILLs every live agent child before process.exit(1) — the operator-visible "I want this NOW" path for a wedged child that's ignoring SIGTERM. The multi-session refactor inadvertently broke that guarantee. The first signal's drain path snapshots the channel map and then clears it before awaiting the per-child SIGTERM-grace window (up to 10 seconds each). If the operator double-taps Ctrl-C during that window, the force-kill routine snapshots the (now empty) channel map and does nothing; the subsequent process.exit(1) then tears down the in-flight SIGKILL fallback timers. Any child still inside its SIGTERM grace window is left orphaned with dangling pipes. The exact scenario this force-kill path was added to handle is the one it no longer handles. Mechanical fix: hold the channel snapshot in a closure the force-kill routine can also see, or defer clearing the map until after the kill awaits resolve.
Verdict
COMMENT — both prior criticals are closed, the per-workspace multiplex refactor is sound, and the 16 follow-up commits show real care. The channels-package duplication is a scope/architecture observation worth a follow-up. The force-kill regression should be fixed before ship since it silently breaks an operator-visible guarantee.
LaZzyMan
left a comment
There was a problem hiding this comment.
Review
Stage 1 of the qwen serve daemon — HTTP/SSE bridge in front of ACP, an in-memory event bus with replay, a TypeScript daemon SDK, and the per-workspace multiplexing rework that landed mid-review. I traced the multi-client race edges on the multiplexing path (channel teardown fan-out, last-session ownership in killSession, the deferred-kill tombstone, the late-shutdown re-check inside session creation, and the single-listener transportClosedReject ordering) plus the bus, SSE, and auth surfaces end-to-end. No live correctness or security issue at HEAD. The long stretch of closed review threads has already worked through what would otherwise have been the bug-finding territory; the documented scope boundaries (TUI as super-client, multiplexing within a workspace, MCP refcount deferred) read as honest and complete.
Verdict
APPROVE — Stage 1 is ready to land; the experimental label fits because the scope is intentionally narrow, not because the implementation needs more time.
…broken by multi-session refactor (BkUyD) The Bd1y6 design promised a second SIGINT/SIGTERM during graceful drain synchronously SIGKILLs every live agent child via `bridge.killAllSync()` before `process.exit(1)` — the operator- visible "kill it now" path for a wedged child ignoring SIGTERM. The Stage 1.5 multi-session refactor (commit 6a170ef) inadvertently broke this. `shutdown()` snapshots `byWorkspaceChannel` then CLEARS the map BEFORE awaiting the per-child SIGTERM-grace kills (up to ~10s each). If the operator double-taps mid-window, `killAllSync()` snapshotted from the now-empty `byWorkspaceChannel.values()` and silently no-op'd — the for-loop iterated nothing, `process.exit(1)` fired, and any child still inside its SIGTERM grace window was left orphaned with dangling pipes. Exactly the scenario the force-kill path was added to handle. Fix: introduce a separate `liveChannels: Set<ChannelInfo>` as the source of truth for "channels with potentially-alive child processes". Added in `getOrCreateChannel` alongside `byWorkspaceChannel.set(...)`; removed only when `channel.exited` fires (the OS-level "really dead" signal). `killAllSync()` now iterates `liveChannels`, so a mid-shutdown second signal still sees every still-alive child regardless of where the graceful drain currently is. Other paths (`killSession` last-session reap, `channel.exited` crash handler) automatically remove via the same exit-handler hook. Regression test: - Builds two sessions on different workspaces - Replaces each channel's `kill()` with a never-resolving Promise (simulating stuck SIGTERM grace) - Calls `bridge.shutdown()` to enter mid-drain state - Yields twice so shutdown's sync prefix runs (clears byWorkspaceChannel, starts the never-resolving awaits) - Calls `bridge.killAllSync()` — pre-fix this saw an empty `byWorkspaceChannel` and the spy array would have been empty; post-fix both channels' `killSync` is invoked. (tanzhenxin's other observation — channels-package duplicate ACP bridge — is the same architectural concern as chiga0 finding 1+5, already tracked under existing FIXME(stage-1.5) markers. No code change in this commit for that.)
Both findings addressed — force-kill regression fixed in
|
BkUyD force-kill fix — E2E verification at three layersFollowing up on the tanzhenxin review reply with concrete OS-level evidence that 1. Unit test (
|
tanzhenxin
left a comment
There was a problem hiding this comment.
Review
The one new commit since the last round closes the BkUyD double-signal force-kill regression cleanly. The fix introduces a separate liveChannels set that's tied to the OS-level channel.exited signal rather than to the in-memory bookkeeping shutdown() clears early — structurally cleaner than holding a snapshot in a closure, because it covers the channel-crash cleanup path symmetrically and not just the mid-drain double-Ctrl+C path. The triple-layer verification (unit test + SIGTERM-ignoring child reproducer + real-daemon double-Ctrl+C with kill -0 confirmation) is stronger evidence than the spec required.
The remaining channels-package duplicate-bridge concern is fairly deferred to Stage 1.5 with an inline FIXME tracker and the lift reordered ahead of the per-request sessionScope override so downstream Stage 1.5 work benefits from a single bridge implementation.
Verdict
APPROVE — all four critical findings across the three review passes are now closed.
Summary
qwen serveHTTP daemon that bridges ACP NDJSON over HTTP + SSE, plus an SDK-sideDaemonClientthat drives every Stage 1 route. Closes the §04 routes table for Stage 1 (health, capabilities, session create/list, prompt, cancel, events SSE, set-model, permission vote).qwen --acpchild-process per session — startup cost is not amortized across sessions, but no business-logic in core changes. Stage 2 in-process is the follow-up.packages/cli/src/serve/server.tsagainst §04; (2)HttpAcpBridgelifecycle — spawn/init/teardown, FIFO prompt queue, cancel→permission resolution per ACP spec, shutdown drain; (3)EventBuscorrectness — bounded subscriber queues withclient_evictedoverflow, replay ring, AsyncIterable abort handling; (4) auth + threat model — bearer + Host allowlist, refusing 0.0.0.0 without a token.Architecture
This PR ships Mode B headless of the "1 Daemon Instance = 1 Session" architecture (see issue #3803):
qwen --acpchild is a Daemon Instance bound to one sessionqwen serveHTTP front in this PR is the daemon HTTP server itself; in multi-session deployments an external orchestrator routes to multiple daemon instances (see External Reference Architecture)Designed against §03 §2 process model, §03 §7 dual deployment modes, and §22 single-vs-multi-session comparison.
Architecture diagram
This PR — Mode B headless daemon (single session):
Bigger picture — multi-session via external orchestrator (out of this PR's scope):
qwen-code main roadmap (in-project follow-ups)
qwen --serveflag — TUI co-hosts the same Express HTTP server + TUI as in-process EventBus subscriber (decision 6 fan-out)HttpTransportSDK adapterAfter Stage 2 the daemon protocol surface is feature complete (~3 weeks total from Stage 1).
External Reference Architecture (out of project scope)
Designed but not on qwen-code's roadmap; meant for external integrators (commercial platforms / k8s operators / cloud vendors):
qwen-coordinator, multi-daemon spawn / route / cleanup) — see §04 §8.2 orchestrator API and §22What's in (8 commits, ~5100 LOC)
qwen servesubcommand, Express + auth + Host allowlist + 0.0.0.0+token refusal, /health + /capabilities, HttpAcpBridge interface stubHttpAcpBridge: spawnqwen --acpper workspace via injectable ChannelFactory, ACP handshake with 10s init timeout, sessionScope:single reuse, BridgeClient fs proxy + sessionUpdate buffer;POST /sessionroutePOST /session/:id/prompt(per-session FIFO queue, no-poison on failure, routing-id overrides body) +POST /session/:id/cancel;SessionNotFoundErrortyped for clean 404 mappingEventBus;GET /session/:id/eventsSSE withLast-Event-IDreconnect, 15s heartbeat, AbortController onreq.close, ring-backed replay, bounded subscriber queues withclient_evictedoverflowPOST /permission/:requestIdfirst-responder vote:requestPermissionpublishespermission_requestevent, awaits HTTP vote,cancelSession/shutdownresolve outstanding ascancelledper ACP spec;permission_resolvedevent fan-outDaemonClientSDK-side HTTP client (sibling to ProcessTransport, not a Transport replacement — different protocols);DaemonHttpError;parseSseStreamGET /workspace/:id/sessions(URL-encoded cwd → list) +POST /session/:id/model(publishesmodel_switched); matching SDK methods;errorMessagehelper for JSON-RPC error visibilitycrypto.timingSafeEqual), coalesce concurrentspawnOrAttachfor same workspace under single-scope, tightenparseLastEventIdto pure decimal digits, IPv6 loopback ergonomics (::1/[::1]in LOOPBACK_BINDS), SDKfailOnErrorreads body once as text then JSON-parses (was:res.json()consumed body so the text fallback was empty), document Stage 1 trust model on BridgeClient, +13 test-gap fillsCapabilities envelope
STAGE1_FEATURESadvertises 9 tags so clients gate UI offfeatures, never offmode(per §10):Threat model (default deployment)
--hostname 0.0.0.0requires--tokenorQWEN_SERVER_TOKEN— boot refuses without one.LOOPBACK_BINDSincludes IPv6::1/[::1]so IPv6-loopback users don't need a token either.localhost:port,127.0.0.1:port,[::1]:port,host.docker.internal:port.timingSafeEqual); 401 responses are uniform across "no header"/"bad scheme"/"wrong token" so a side-channel can't distinguish.client_evictedoverflow keep a slow client from holding the daemon hostage.readTextFile/writeTextFile) do NOT enforce a workspace-cwd sandbox in Stage 1: agent runs as the same UID with shell-tool access, so a sandbox here would be theatre. Stage 4+ remote-sandbox swaps the Client for a sandbox-aware variant — see issue Daemon mode (qwen serve): proposal & open decisions #3803 §11.Out of scope for this PR (deferred follow-ups)
WS /session/:id— Stage 1.5; SSE covers the primary streaming use case.POST /file/read,/file/write— agent has direct fs; useful only for daemon-only file API to remote clients.HttpTransportadapter that translates ACP↔stream-json so the existingquery()flow gets daemon-mode for free. DaemonClient is a sibling of ProcessTransport in this PR, not a replacement.HttpAcpBridgeinterface is shaped so route handlers don't change when the bridge stops spawning subprocesses.qwen --acpchild currently forks itself once internally (sandbox / sub-process), so each session costs twonodeprocesses in the tree. Worth a follow-up to suppress when launched by the daemon (~150MB RSS per session).Test plan
npm run typecheck -w packages/cli✅npm run typecheck -w packages/sdk-typescript✅npm run lint(eslint clean across new files) ✅npm test -w packages/cli— 5043 pass / 7 skipped (was 4967 baseline, +89 serve tests, 0 regressions; 1 pre-existingdetect-terminal-themeflake unrelated)npm test -w packages/sdk-typescript— 236 pass (was 201 baseline, +35 daemon tests, 0 regressions)qwen --acpchild:POST /sessionreturns sessionId; repeat call same workspace returnsattached: truePOST /sessionfor the same workspace coalesces — ONE child spawned, both callers get the same sessionId, exactly one reportsattached: falsePOST /session/:id/promptreturns{stopReason: "end_turn"}after a short turnGET /session/:id/eventsstreamed 13 distinct frames (available_commands_update, agent_thought_chunk × 9, agent_message_chunk × 3, usage) during a single promptpermission_requestevent, votedproceed_onceviaPOST /permission/$requestId, captured matchingpermission_resolved, verified the file was actually created on disk with the requested contentPOST /session/:id/cancel→ 204Design references
sessionScope: 'single'for cross-client live-collaborationspawnOrAttachcoalesced to one channel under single-scope), events fan out to all subscribers, cancel bypasses the queue中文版本(点击展开)
摘要
qwen serveHTTP daemon,通过 HTTP + SSE 桥接 ACP NDJSON 协议;外加 SDK 侧的DaemonClient驱动每一个 Stage 1 路由。完整覆盖 §04 路由表的 Stage 1 部分(health、capabilities、session create/list、prompt、cancel、events SSE、set-model、permission vote)。qwen --acp子进程"——启动成本不能跨 session 摊销,但 core 业务逻辑零变更。packages/cli/src/serve/server.ts协议表面对照 §04;(2)HttpAcpBridge生命周期 —— spawn/init/teardown、FIFO prompt 队列、cancel→permission 按 ACP spec 应答 cancelled、shutdown drain;(3)EventBus正确性 —— 有界 subscriber 队列 +client_evicted溢出、replay ring、AsyncIterable abort 处理;(4) 认证 + 威胁模型 —— bearer + Host allowlist,0.0.0.0 无 token 拒启动。架构
本 PR 实现 Mode B headless —— "1 Daemon Instance = 1 Session" 架构(见 issue #3803):
qwen --acpchild = Daemon Instance,绑定一个 sessionqwen serveHTTP front 即是 daemon HTTP server;多 session 部署下由外部 orchestrator 路由到多个 daemon instance(见 External Reference Architecture)对应文档:§03 §2 状态进程模型、§03 §7 双部署模式、§22 单 vs 多 Session 对比。
架构图
参见英文版本中的 ASCII 图("This PR — Mode B headless daemon" + "Bigger picture — multi-session via external orchestrator")。
qwen-code 主线路线图(项目内后续 PR)
qwen --serveflag —— TUI 与 HTTP server 同进程 + TUI 作为 in-process EventBus subscriber(决策 §6 fan-out)HttpTransportSDK 适配器Stage 2 后 daemon protocol surface feature complete(从 Stage 1 起约 3 周)。
External Reference Architecture(项目范围外)
设计已完成但不在 qwen-code 主线路线图,给外部集成方(商业平台 / k8s operator / 云厂商)参考:
qwen-coordinator,multi-daemon spawn / route / cleanup)—— 见 §04 §8.2 orchestrator API 与 §22Capabilities envelope
STAGE1_FEATURES共 9 个 feature tag。客户端按featuresgate UI(不要按mode):威胁模型(默认部署)
--hostname 0.0.0.0必须配合--token或QWEN_SERVER_TOKEN,否则拒启动LOOPBACK_BINDS包含 IPv6::1/[::1]timingSafeEqual);401 在"缺 header"/"协议错"/"token 错"三种情况下完全一致client_evicted溢出机制设计参考
Issue #3803 —— Stage 1 完整实现 §04 中除延后项之外的全部内容。§03 中的架构决策 §1 / §4 / §5 / §6 完全对齐:
sessionScope: 'single'实现跨 client live-collaboration