Skip to content

feat(cli,sdk): qwen serve daemon (Stage 1)#3889

Merged
wenshao merged 84 commits into
mainfrom
feat/qwen-serve-stage1-scaffold
May 13, 2026
Merged

feat(cli,sdk): qwen serve daemon (Stage 1)#3889
wenshao merged 84 commits into
mainfrom
feat/qwen-serve-stage1-scaffold

Conversation

@wenshao

@wenshao wenshao commented May 7, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • What changed: Implements Stage 1 of issue Daemon mode (qwen serve): proposal & open decisions #3803 — a qwen serve HTTP daemon that bridges ACP NDJSON over HTTP + SSE, plus an SDK-side DaemonClient that drives every Stage 1 route. Closes the §04 routes table for Stage 1 (health, capabilities, session create/list, prompt, cancel, events SSE, set-model, permission vote).
  • Why it changed: Stage 1 is the flag-gated experimental landing that lets users run daemon-mode on real workloads while Stage 2's invasive in-process refactor is being designed. Per the issue body, the bridge is intentionally a qwen --acp child-process per session — startup cost is not amortized across sessions, but no business-logic in core changes. Stage 2 in-process is the follow-up.
  • Reviewer focus: (1) the protocol surface in packages/cli/src/serve/server.ts against §04; (2) HttpAcpBridge lifecycle — spawn/init/teardown, FIFO prompt queue, cancel→permission resolution per ACP spec, shutdown drain; (3) EventBus correctness — bounded subscriber queues with client_evicted overflow, replay ring, AsyncIterable abort handling; (4) auth + threat model — bearer + Host allowlist, refusing 0.0.0.0 without a token.

Architecture

This PR ships Mode B headless of the "1 Daemon Instance = 1 Session" architecture (see issue #3803):

  • Each spawned qwen --acp child is a Daemon Instance bound to one session
  • The qwen serve HTTP front in this PR is the daemon HTTP server itself; in multi-session deployments an external orchestrator routes to multiple daemon instances (see External Reference Architecture)
  • Multi-tenancy / quota / OIDC / sandbox are all External Reference Architecture concerns — implementation by commercial platforms / k8s operators / cloud vendors using the daemon as a building block

Designed against §03 §2 process model, §03 §7 dual deployment modes, and §22 single-vs-multi-session comparison.

Architecture diagram

This PR — Mode B headless daemon (single session):

┌──────────────────────── qwen serve --port 7776 ──────────────────────┐
│                                                                       │
│             ┌─────────────────────────────┐                           │
│             │ EventBus + ACP core          │                          │
│             │  - 1 session bound           │                          │
│             │  - LLM stream                │                          │
│             │  - tool dispatch             │                          │
│             │  - first-responder vote      │                          │
│             └──────────────┬──────────────┘                          │
│                            ↑                                          │
│             ┌──────────────┴──────────────┐    HTTP/SSE              │
│             │ Express 5 HTTP server        │ ◄────────► remote        │
│             │ + bearer auth + Host allow   │    clients               │
│             └──────────────────────────────┘                          │
│                                                                       │
└───────────────────────────────────────────────────────────────────────┘
                                                ↑
                ┌──────────────┬──────────────┬─┴──────────────┬───────┐
                ↓              ↓              ↓                ↓       ↓
            ┌────────┐  ┌────────┐      ┌────────┐      ┌────────┐ ┌──────┐
            │ remote │  │ Web UI │      │ IDE ext│      │ IM bot │ │  ... │
            │  CLI   │  │        │      │        │      │        │ │      │
            └────────┘  └────────┘      └────────┘      └────────┘ └──────┘

Bigger picture — multi-session via external orchestrator (out of this PR's scope):

                ┌──────────────────────────────────────────┐
                │ External Orchestrator (out of scope)     │
                │   - sessionScope routing                  │
                │   - daemon spawn / route / cleanup        │
                └────────────────────┬─────────────────────┘
                                     │ spawn per session
       ┌─────────────────┬───────────┼───────────────┬──────────────┐
       ↓                 ↓           ↓               ↓              ↓
 ┌───────────┐     ┌───────────┐ ┌───────────┐ ┌───────────┐  ┌──────────┐
 │ daemon-1  │     │ daemon-2  │ │ daemon-3  │ │ daemon-N  │  │   ...    │
 │ Mode B    │     │ Mode B    │ │ Mode B    │ │ Mode B    │  │          │
 │ session-A │     │ session-B │ │ session-C │ │ session-N │  │          │
 │ (this PR) │     │ (this PR) │ │ (this PR) │ │ (this PR) │  │          │
 └───────────┘     └───────────┘ └───────────┘ └───────────┘  └──────────┘

qwen-code main roadmap (in-project follow-ups)

  • Stage 1.5 (~4d) Mode A qwen --serve flag — TUI co-hosts the same Express HTTP server + TUI as in-process EventBus subscriber (decision 6 fan-out)
  • Stage 2 (~1-2w) daemon polish — mDNS / OpenAPI auto-gen / WebSocket bidi / multi-token / Prometheus metrics / HttpTransport SDK adapter

After Stage 2 the daemon protocol surface is feature complete (~3 weeks total from Stage 1).

External Reference Architecture (out of project scope)

Designed but not on qwen-code's roadmap; meant for external integrators (commercial platforms / k8s operators / cloud vendors):

  • Orchestrator (qwen-coordinator, multi-daemon spawn / route / cleanup) — see §04 §8.2 orchestrator API and §22
  • Multi-tenancy (Tenant / OIDC / Quota / Audit) — see §23
  • Shell sandbox (NoSandbox / OS user / namespace / container / remote) — see §11
  • SaaS deployment (k8s / Postgres / Redis / S3) — see §16 and §23 §七

What's in (8 commits, ~5100 LOC)

Commit Layer Topic
61f2f59 cli scaffold qwen serve subcommand, Express + auth + Host allowlist + 0.0.0.0+token refusal, /health + /capabilities, HttpAcpBridge interface stub
8d7c03a cli wire HttpAcpBridge: spawn qwen --acp per workspace via injectable ChannelFactory, ACP handshake with 10s init timeout, sessionScope:single reuse, BridgeClient fs proxy + sessionUpdate buffer; POST /session route
ca996ec cli POST /session/:id/prompt (per-session FIFO queue, no-poison on failure, routing-id overrides body) + POST /session/:id/cancel; SessionNotFoundError typed for clean 404 mapping
41aa950 cli replace sessionUpdate buffer with EventBus; GET /session/:id/events SSE with Last-Event-ID reconnect, 15s heartbeat, AbortController on req.close, ring-backed replay, bounded subscriber queues with client_evicted overflow
6ee655f cli POST /permission/:requestId first-responder vote: requestPermission publishes permission_request event, awaits HTTP vote, cancelSession/shutdown resolve outstanding as cancelled per ACP spec; permission_resolved event fan-out
8206a64 sdk DaemonClient SDK-side HTTP client (sibling to ProcessTransport, not a Transport replacement — different protocols); DaemonHttpError; parseSseStream
a8ce5e0 cli + sdk GET /workspace/:id/sessions (URL-encoded cwd → list) + POST /session/:id/model (publishes model_switched); matching SDK methods; errorMessage helper for JSON-RPC error visibility
ad0e6ec cli + sdk self-review audit round 1: timing-safe bearer compare (SHA-256 + crypto.timingSafeEqual), coalesce concurrent spawnOrAttach for same workspace under single-scope, tighten parseLastEventId to pure decimal digits, IPv6 loopback ergonomics (::1/[::1] in LOOPBACK_BINDS), SDK failOnError reads body once as text then JSON-parses (was: res.json() consumed body so the text fallback was empty), document Stage 1 trust model on BridgeClient, +13 test-gap fills

Capabilities envelope

STAGE1_FEATURES advertises 9 tags so clients gate UI off features, never off mode (per §10):

['health', 'capabilities', 'session_create', 'session_list',
 'session_prompt', 'session_cancel', 'session_events',
 'session_set_model', 'permission_vote']

Threat model (default deployment)

  • 127.0.0.1 only, bearer auth disabled (developer convenience).
  • --hostname 0.0.0.0 requires --token or QWEN_SERVER_TOKEN — boot refuses without one.
  • LOOPBACK_BINDS includes IPv6 ::1/[::1] so IPv6-loopback users don't need a token either.
  • Host allowlist on loopback binds defends against DNS rebinding; allowed Hosts are localhost:port, 127.0.0.1:port, [::1]:port, host.docker.internal:port.
  • CORS denies any browser Origin.
  • Bearer token compare is constant-time (SHA-256 both sides + timingSafeEqual); 401 responses are uniform across "no header"/"bad scheme"/"wrong token" so a side-channel can't distinguish.
  • SIGINT/SIGTERM drain agent children before closing the listener.
  • Per-subscriber bounded SSE queues with client_evicted overflow keep a slow client from holding the daemon hostage.
  • BridgeClient file-proxy methods (readTextFile/writeTextFile) do NOT enforce a workspace-cwd sandbox in Stage 1: agent runs as the same UID with shell-tool access, so a sandbox here would be theatre. Stage 4+ remote-sandbox swaps the Client for a sandbox-aware variant — see issue Daemon mode (qwen serve): proposal & open decisions #3803 §11.

Out of scope for this PR (deferred follow-ups)

  • WS /session/:id — Stage 1.5; SSE covers the primary streaming use case.
  • POST /file/read, /file/write — agent has direct fs; useful only for daemon-only file API to remote clients.
  • An HttpTransport adapter that translates ACP↔stream-json so the existing query() flow gets daemon-mode for free. DaemonClient is a sibling of ProcessTransport in this PR, not a replacement.
  • Stage 2 in-process refactor — the HttpAcpBridge interface is shaped so route handlers don't change when the bridge stops spawning subprocesses.
  • Cosmetic: the qwen --acp child currently forks itself once internally (sandbox / sub-process), so each session costs two node processes in the tree. Worth a follow-up to suppress when launched by the daemon (~150MB RSS per session).

Test plan

  • npm run typecheck -w packages/cli
  • npm run typecheck -w packages/sdk-typescript
  • npm run lint (eslint clean across new files) ✅
  • npm test -w packages/cli5043 pass / 7 skipped (was 4967 baseline, +89 serve tests, 0 regressions; 1 pre-existing detect-terminal-theme flake unrelated)
  • npm test -w packages/sdk-typescript236 pass (was 201 baseline, +35 daemon tests, 0 regressions)
  • Real e2e against a built daemon spawning a real qwen --acp child:
    • POST /session returns sessionId; repeat call same workspace returns attached: true
    • Concurrent POST /session for the same workspace coalesces — ONE child spawned, both callers get the same sessionId, exactly one reports attached: false
    • POST /session/:id/prompt returns {stopReason: "end_turn"} after a short turn
    • GET /session/:id/events streamed 13 distinct frames (available_commands_update, agent_thought_chunk × 9, agent_message_chunk × 3, usage) during a single prompt
    • Permission flow: triggered a write_text_file tool, captured permission_request event, voted proceed_once via POST /permission/$requestId, captured matching permission_resolved, verified the file was actually created on disk with the requested content
    • POST /session/:id/cancel → 204
    • SDK e2e via the built DaemonClient against the same daemon: capabilities, list, attach reuse, parallel events + prompt, AbortSignal cleanup, set-model success + 404 paths
    • Bearer auth: right token → 200, wrong/missing/malformed → uniform 401

Design references

  • Issue: Daemon mode (qwen serve): proposal & open decisions #3803 — Stage 1 ships everything in §04 except the deferred items above. The architectural choices in §03 (decisions §1, §4, §5, §6) are honored verbatim:
    • §1 default sessionScope: 'single' for cross-client live-collaboration
    • §4 FileReadCache stays per-session (BridgeClient fs proxy goes through the agent's session, not daemon-wide)
    • §5 permission flow reuses ACP spec semantics; HTTP route emits the request, any client casts the deciding vote
    • §6 same-session prompts FIFO-serialize (with concurrent spawnOrAttach coalesced to one channel under single-scope), events fan out to all subscribers, cancel bypasses the queue

中文版本(点击展开)

摘要

  • 改了什么:实现 issue Daemon mode (qwen serve): proposal & open decisions #3803Stage 1 —— qwen serve HTTP daemon,通过 HTTP + SSE 桥接 ACP NDJSON 协议;外加 SDK 侧的 DaemonClient 驱动每一个 Stage 1 路由。完整覆盖 §04 路由表的 Stage 1 部分(health、capabilities、session create/list、prompt、cancel、events SSE、set-model、permission vote)。
  • 为什么改:Stage 1 是 flag-gated 实验性落地——让用户在真实 workload 上跑 daemon 模式收集需求。按 issue 描述,桥接器有意做成"每 session 一个 qwen --acp 子进程"——启动成本不能跨 session 摊销,但 core 业务逻辑零变更。
  • Reviewer 关注点:(1) packages/cli/src/serve/server.ts 协议表面对照 §04;(2) HttpAcpBridge 生命周期 —— spawn/init/teardown、FIFO prompt 队列、cancel→permission 按 ACP spec 应答 cancelled、shutdown drain;(3) EventBus 正确性 —— 有界 subscriber 队列 + client_evicted 溢出、replay ring、AsyncIterable abort 处理;(4) 认证 + 威胁模型 —— bearer + Host allowlist,0.0.0.0 无 token 拒启动。

架构

本 PR 实现 Mode B headless —— "1 Daemon Instance = 1 Session" 架构(见 issue #3803):

  • 每个 spawn 出来的 qwen --acp child = Daemon Instance,绑定一个 session
  • 本 PR 的 qwen serve HTTP front 即是 daemon HTTP server;多 session 部署下由外部 orchestrator 路由到多个 daemon instance(见 External Reference Architecture
  • 多租户 / 配额 / OIDC / sandbox 全部是 External Reference Architecture 范畴——由商业平台 / k8s operator / 云厂商基于 daemon building block 自行实现

对应文档:§03 §2 状态进程模型§03 §7 双部署模式§22 单 vs 多 Session 对比

架构图

参见英文版本中的 ASCII 图("This PR — Mode B headless daemon" + "Bigger picture — multi-session via external orchestrator")。

qwen-code 主线路线图(项目内后续 PR)

  • Stage 1.5(~4d)Mode A qwen --serve flag —— TUI 与 HTTP server 同进程 + TUI 作为 in-process EventBus subscriber(决策 §6 fan-out)
  • Stage 2(~1-2w)daemon 完善 —— mDNS / OpenAPI 自动生成 / WebSocket bidi / 多 token / Prometheus metrics / HttpTransport SDK 适配器

Stage 2 后 daemon protocol surface feature complete(从 Stage 1 起约 3 周)。

External Reference Architecture(项目范围外)

设计已完成但不在 qwen-code 主线路线图,给外部集成方(商业平台 / k8s operator / 云厂商)参考:

  • Orchestratorqwen-coordinator,multi-daemon spawn / route / cleanup)—— 见 §04 §8.2 orchestrator API§22
  • 多租户(Tenant / OIDC / Quota / Audit)—— 见 §23
  • Shell sandbox(NoSandbox / OS user / namespace / container / remote)—— 见 §11
  • SaaS 部署(k8s / Postgres / Redis / S3)—— 见 §16§23 §七

Capabilities envelope

STAGE1_FEATURES 共 9 个 feature tag。客户端按 features gate UI(不要按 mode):

['health', 'capabilities', 'session_create', 'session_list',
 'session_prompt', 'session_cancel', 'session_events',
 'session_set_model', 'permission_vote']

威胁模型(默认部署)

  • 仅绑 127.0.0.1,bearer auth 关闭(本地开发便利)
  • --hostname 0.0.0.0 必须配合 --tokenQWEN_SERVER_TOKEN,否则拒启动
  • LOOPBACK_BINDS 包含 IPv6 ::1/[::1]
  • loopback 绑定的 Host allowlist 防 DNS rebinding
  • CORS 拒绝任何浏览器 Origin
  • Bearer token 比对常量时间(SHA-256 + timingSafeEqual);401 在"缺 header"/"协议错"/"token 错"三种情况下完全一致
  • SIGINT/SIGTERM 在关闭 listener 之前先 drain agent 子进程
  • per-subscriber 有界 SSE 队列 + client_evicted 溢出机制

设计参考

Issue #3803 —— Stage 1 完整实现 §04 中除延后项之外的全部内容。§03 中的架构决策 §1 / §4 / §5 / §6 完全对齐:

  • §1 默认 sessionScope: 'single' 实现跨 client live-collaboration
  • §4 FileReadCache 保持 per-session
  • §5 权限流复用 ACP spec 语义;HTTP 路由发出请求,任意 client 投出决定性一票
  • §6 同 session prompt FIFO 串行 + 事件 fan-out + cancel 绕过队列

wenshao added 7 commits May 7, 2026 10:44
Adds a `serve` subcommand that boots an Express 5 listener with bearer
auth, host allowlist, and CORS modeled on `vscode-ide-companion/src/
ide-server.ts`. Ships only `/health` and `/capabilities` to begin with;
session/prompt/event routes will land in follow-up PRs once the per-
session ACP child-process bridge in `httpAcpBridge.ts` is wired.

Defaults to 127.0.0.1 with auth disabled so local development needs no
configuration. Binding beyond loopback (e.g. `--hostname 0.0.0.0`)
refuses to start without a token (`--token` or `QWEN_SERVER_TOKEN`).

Capabilities envelope versioned at v=1 with a `features` array — clients
should gate UI off `features`, never off `mode`, so subsequent PRs can
add capability tags without breaking older clients.

Per design issue's Stage 1 scope (~700-1000 LOC). Adds ~430 LOC of
implementation + tests in this scaffold; the remaining budget belongs
to the route wiring + bridge implementation in follow-ups.
Stage 1 follow-up to the scaffold. Implements the bridge between the
HTTP daemon and the existing ACP child agent, plus the first session
endpoint.

`HttpAcpBridge.spawnOrAttach`:
  - Spawns `node $cliEntry --acp` per workspace via an injectable
    `ChannelFactory` (default uses `process.argv[1]`; tests use an
    in-memory `TransformStream` pair so they don't fork real processes).
  - Drives the ACP `initialize` + `newSession` handshake via the SDK's
    `ClientSideConnection`, with a 10s timeout that kills the channel.
  - Under `sessionScope: 'single'` (default), reuses the live session
    when the same canonical workspace cwd is requested again — backs
    the `attached: true` flag.
  - The `Client` impl on the bridge side proxies file reads/writes to
    local fs (daemon and agent share the host) and buffers
    `sessionUpdate` notifications for the SSE wiring in the next PR.
    `requestPermission` returns `cancelled` until the
    `/permission/:requestId` route lands.

`POST /session`:
  - 400 on missing or relative `cwd`.
  - 200 with `{sessionId, workspaceCwd, attached}` on success.
  - 500 on bridge failure (the failing channel is killed, not leaked).

`runQwenServe` constructs the bridge and ties `bridge.shutdown()` into
the listener-close path so SIGINT/SIGTERM drain children before the
socket closes.

Tests (14 new, 0 regressions in the 4967-test baseline):
  - 9 bridge cases over an in-memory channel — fresh spawn, single-scope
    reuse, cross-workspace isolation, thread-scope independence, path
    canonicalization, relative-path rejection, init failure cleanup,
    init timeout, multi-channel shutdown.
  - 4 route cases for /session (missing/relative/200/500).
  - 1 lifecycle case asserting `runQwenServe.close()` calls
    `bridge.shutdown()` before closing the listener.

Verified end-to-end: `qwen serve` boots, `POST /session` spawns a real
`qwen --acp` child and returns the SDK-assigned `sessionId`, repeat
calls under the same cwd return `attached: true`, `SIGTERM` reaps the
child along with the listener.
…3803)

Stage 1 follow-up after the bridge scaffold. Adds the two routes a client
needs to actually run a turn against the daemon.

Bridge:
  - `sendPrompt(sessionId, req)` looks up the session, FIFO-queues the
    call against the per-session prompt queue, and forwards through the
    SDK `ClientSideConnection.prompt`. Concurrent calls observe ACP's
    "one active prompt per session" invariant — second waits for first.
  - A failed prompt does NOT poison the queue; the tail catches and
    keeps draining so the next caller still runs (the original caller
    still sees its own rejection).
  - `cancelSession(sessionId, req?)` bypasses the queue and forwards
    the ACP notification immediately. ACP semantics: the agent winds
    down the *currently active* prompt; queued work is unaffected.
  - Both methods throw `SessionNotFoundError` (a typed Error subclass)
    when the id is unknown so route handlers can map cleanly to 404
    without brittle message matching.
  - Both methods overwrite the `sessionId` field in the request body
    with the routing id — a stale or spoofed body would otherwise be
    dispatched to the wrong agent process.

Routes:
  - `POST /session/:id/prompt` → 200 with PromptResponse, 400 on
    missing/non-array prompt, 404 on unknown session, 500 on agent
    error.
  - `POST /session/:id/cancel` → 204 always (cancel is a notification),
    404 on unknown session.

Tests (14 new — 7 bridge + 7 route, 0 regressions in the 4981 baseline):
  - sendPrompt: success forwards & returns response · routing-id
    overrides body sessionId · concurrent prompts FIFO-serialize
    (verified via per-prompt start/end ordering with a release latch) ·
    failed prompt doesn't block subsequent prompts · 404 for unknown id.
  - cancelSession: forwards with routing id · 404 for unknown id.
  - Routes: 200/400/404/500 paths for prompt; 204 with body or empty +
    404 for cancel.

Verified end-to-end against a real `qwen --acp` child:
  - POST /session/:id/prompt with `[{type:'text',text:'hi'}]` → 200
    `{"stopReason":"end_turn"}` in ~3.4s.
  - POST /session/:id/cancel → 204.
  - POST /session/does-not-exist/prompt → 404 with the unknown id
    surfaced in the body.
Stage 1 follow-up that turns prompt into a real streaming experience.
Replaces the in-memory `notifications: SessionNotification[]` buffer
on each session with a per-session EventBus and exposes it through
`GET /session/:id/events` as an `text/event-stream` SSE feed.

EventBus (`packages/cli/src/serve/eventBus.ts`):
  - Monotonic per-session ids (`v: 1` schema). Each `publish` chains an
    id, returning the materialized BridgeEvent.
  - Bounded ring (default 1000) backs `Last-Event-ID` reconnect — a
    consumer that drops can resume from `lastEventId` and replay any
    still-buffered events before live events flow.
  - Per-subscriber bounded queue (default 256). When a slow consumer
    overruns its queue, the bus appends a synthetic `client_evicted`
    terminal frame and closes that subscription so it can't hold the
    daemon hostage. Other subscribers are unaffected.
  - `subscribe()` returns an AsyncIterable — registration is synchronous
    so events `publish`ed immediately after the subscribe land in the
    queue (a generator-style implementation deferred registration to
    first `next()` and raced with publishes).
  - AbortSignal-aware: aborting the signal closes the iterator promptly.

Bridge (`httpAcpBridge.ts`):
  - `BridgeClient.sessionUpdate` now publishes onto the session's
    EventBus instead of pushing to a plain array — every ACP
    notification the agent emits becomes a stream event automatically.
  - New `subscribeEvents(sessionId, opts?)` returns the bus's
    AsyncIterable; throws `SessionNotFoundError` for unknown ids.
  - Shutdown closes every live event bus before killing channels so
    pending consumers unwind cleanly.

Route (`server.ts`):
  - `GET /session/:id/events` sets the SSE content type, advertises a
    3s reconnect hint, and writes a 15s heartbeat comment frame to
    keep proxy/NAT connections alive.
  - Forwards the `Last-Event-ID` header to the bus.
  - `req.on('close')` triggers an AbortController that propagates into
    the bridge subscription so disconnects don't leak subscribers.
  - 404 when the bridge can't find the session.

Capabilities envelope: `STAGE1_FEATURES` now advertises
`session_create`, `session_prompt`, `session_cancel`, `session_events`
in addition to `health`/`capabilities` so clients can light up UI for
the routes that have actually shipped.

Tests (16 new, 0 regressions in the 4995 baseline):
  - 9 EventBus unit cases — id sequencing, live delivery, replay,
    replay+live splice, fan-out to N subscribers, eviction on
    overflow, abort-signal unsubscribe, bus.close() drains
    subscribers, ring-size eviction.
  - 4 bridge subscribe cases — 404, sessionUpdate→event publishing
    via real ACP fake-agent, shutdown closes live subscriptions.
  - 4 SSE route cases against a live HTTP listener — frame format,
    Last-Event-ID forwarding, 404, abort propagation on disconnect.

Verified end-to-end against a real `qwen --acp` child:
  - Subscribed to `/session/$SID/events`, fired `POST /session/$SID/prompt`
    with text content. Captured 13 distinct `event: session_update`
    SSE frames in real time during the model's response — `available_
    commands_update` metadata, 9 `agent_thought_chunk` frames carrying
    the model's chain-of-thought, 3 `agent_message_chunk` frames with
    the actual reply, and a final usage frame with token totals.
  - Frames carry monotonic ids 1..13, the daemon-side counter, and
    are valid SSE per the EventSource spec.
Stage 1 follow-up that turns `BridgeClient.requestPermission` from a
hardcoded `cancelled` placeholder into a real first-responder vote
loop, and ships the HTTP route any attached client uses to cast the
deciding vote.

Bridge:
  - `requestPermission` generates a UUID requestId, registers a
    pending entry on a daemon-wide map (and the owning session's
    `pendingPermissionIds` set), publishes a `permission_request`
    event onto the session's EventBus (so SSE subscribers see it),
    and awaits the resolution.
  - New `respondToPermission(requestId, response)` resolves the
    pending promise with the supplied outcome. First call wins —
    subsequent calls return false. On success the bridge publishes a
    `permission_resolved` event so other attached clients can update
    their UI when the race is decided.
  - `cancelSession` and `shutdown` both resolve every still-pending
    permission for the affected session(s) as
    `{ outcome: { outcome: 'cancelled' } }` per the ACP spec
    requirement that a cancelled prompt MUST resolve outstanding
    requestPermission calls with cancelled.
  - New `pendingPermissionCount` getter exposes inflight count for
    inspection / tests.

Route (`server.ts`):
  - `POST /permission/:requestId` validates the body's `outcome` is
    either `{ outcome: 'cancelled' }` or `{ outcome: 'selected',
    optionId: string }`, then forwards to `bridge.respondToPermission`.
  - 200 on accepted vote, 404 when the requestId is unknown or
    already resolved (Stage 1 doesn't differentiate), 400 on a
    malformed outcome.

Capabilities envelope: STAGE1_FEATURES gains `permission_vote`.

Tests (14 new — 9 bridge + 5 route, 0 regressions in the 5011 baseline):
  - Bridge: publishes permission_request with a generated requestId
    and waits; respondToPermission first-responder wins; publishes
    permission_resolved on vote; respondToPermission false for
    unknown requestId; cancelSession resolves outstanding as
    cancelled; shutdown resolves outstanding as cancelled.
  - Route: 200 on selected outcome; 200 on cancelled outcome; 404 on
    unknown requestId; 400 on malformed outcome; 400 on missing
    outcome.

Verified end-to-end against a real `qwen --acp` child:
  - Subscribed to /session/$SID/events, sent a prompt asking the
    agent to write a file at /tmp/qwen-serve-permission-e2e-test.txt.
  - The agent triggered a permission_request via the bus, surfacing
    the three options Qwen Code presents (Allow Always / Allow /
    Reject) with their option ids.
  - POSTed `{outcome:{outcome:"selected",optionId:"proceed_once"}}`
    to /permission/$requestId — got HTTP 200.
  - Bus published the matching permission_resolved event.
  - Agent proceeded with the writeTextFile tool call; file was
    actually created on disk with the expected content.
Stage 1 follow-up that proves the cross-mode protocol-isomorphism design
assumption: an SDK client can drive the daemon's HTTP routes end-to-end
without going through ProcessTransport's stdio + stream-json path.

DaemonClient is a sibling of ProcessTransport, not a replacement. The two
speak different protocols (ACP NDJSON over HTTP vs stream-json over
stdio). Existing `query()` users keep getting subprocess-mode unchanged;
applications that want daemon-mode (cross-client attach, shared MCP
pool, network reachability, first-responder permissions) opt in by
constructing a DaemonClient against a running `qwen serve`.

API surface (`packages/sdk-typescript/src/daemon/`):
  - `new DaemonClient({ baseUrl, token?, fetch? })`. The `fetch` override
    is for tests; defaults to `globalThis.fetch`. Trailing slashes on
    `baseUrl` are stripped.
  - `health()`, `capabilities()` — discovery.
  - `createOrAttachSession({ workspaceCwd, modelServiceId? })` — `attached:
    true` on the response indicates a session was reused under
    sessionScope:single.
  - `prompt(sessionId, { prompt: ContentBlock[] })` — returns
    PromptResult with stopReason.
  - `cancel(sessionId)` — tolerates 204; throws on 404.
  - `subscribeEvents(sessionId, { lastEventId?, signal? })` — async
    iterator over parsed SSE frames; AbortSignal-aware. Native Node
    AbortController only — jsdom polyfills are incompatible with undici.
  - `respondToPermission(requestId, response)` — first-responder vote;
    returns true on 200, false on 404 (lost the race or unknown id),
    throws on 400/500.

`DaemonHttpError` is thrown for any non-2xx (besides the 404
"already-resolved" case on permission votes); carries `status` and
`body` so callers can branch on standard daemon HTTP semantics.

`parseSseStream(body)` is the underlying SSE parser; exported separately
so applications can consume daemon SSE outside the DaemonClient surface.
Handles split-chunk frames, comment/retry directives, malformed JSON
(skip), trailing frame without final newline.

Wire types live SDK-side (no SDK→CLI dep); the capabilities envelope's
`v` field signals breaking changes.

Tests (26 new, 0 regressions in the 201 baseline):
  - 7 SSE parser cases — single frame, multiple frames, comments,
    chunked-split frame, malformed JSON skip, trailing frame on close,
    empty stream.
  - 19 DaemonClient cases — health success/error, capabilities, bearer
    auth presence/absence, createOrAttachSession success/400, prompt
    body shape + sessionId url-encoding, cancel 204/404, permission
    200/400/404, subscribeEvents header forwarding + 404, baseUrl
    normalization.

Verified end-to-end against a real `qwen serve` daemon driving a real
`qwen --acp` child:
  - `client.capabilities()` returned `{v:1, mode:"http-bridge", features:
    [...7 tags]}`.
  - First `createOrAttachSession` returned `attached:false`; second
    returned `attached:true` with the same sessionId.
  - `client.prompt(...)` with text content yielded `{stopReason:
    "end_turn"}` while the parallel `subscribeEvents` iterator streamed
    10 distinct frames during the same turn.
  - AbortController on the events iterator cleanly severed the SSE
    connection.
Closes the §04 Stage-1 routes table for `qwen serve` with the two
remaining endpoints, plus matching SDK methods.

`GET /workspace/:id/sessions`
  - `:id` is the URL-encoded canonical absolute workspace path
    (Express decodes path params automatically; clients pass
    `encodeURIComponent(cwd)`).
  - Returns `{ sessions: [{ sessionId, workspaceCwd }, ...] }` for live
    sessions whose canonical workspace matches.
  - Empty array (not 404) when the workspace is idle so picker UIs
    don't have to special-case "no sessions yet".
  - 400 when the decoded path isn't absolute.

`POST /session/:id/model`
  - Body: `{ modelId: string, ... }`. The route's `:id` overrides any
    spoofed sessionId in the body.
  - Forwards to ACP's `unstable_setSessionModel` and publishes a
    `model_switched` event onto the session bus so cross-client UIs
    update.
  - 200 with the agent's response on success, 400 on missing/empty
    modelId, 404 on unknown session.
  - The SDK method is currently unstable; documented in the bridge
    comment in case the spec renames the method when it stabilizes.

Bridge:
  - New `listWorkspaceSessions(workspaceCwd)` iterates `byId.values()`
    and filters by canonical workspace path; works for both `single`
    and `thread` session scopes.
  - New `setSessionModel(sessionId, req)` forwards through
    `connection.unstable_setSessionModel`, normalizes sessionId,
    publishes `model_switched`, throws SessionNotFoundError on
    unknown ids.

`STAGE1_FEATURES` capabilities envelope grows to 9 tags, adding
`session_list` and `session_set_model`.

SDK (`DaemonClient`):
  - `listWorkspaceSessions(workspaceCwd)` URL-encodes the cwd and
    returns the parsed `sessions` array directly.
  - `setSessionModel(sessionId, modelId)` POSTs the body and returns
    the agent response (currently opaque per ACP unstable spec).
  - Wire types `DaemonSessionSummary` and `SetModelResult` exported
    from the SDK barrel.

Tangential cleanup: `sendBridgeError` now extracts a useful message
from non-Error values via a small `errorMessage` helper. JSON-RPC
errors from the agent (`{code, message, data}`) used to surface as
`"[object Object]"` in the 500 response body; they now show the
inner `message` field. Caught while running the model-set e2e.

Tests (17 new — 9 bridge + 7 route + 4 SDK, 0 regressions in the
5022 + 227 baselines):
  - Bridge listWorkspaceSessions: matching cwd returns the live
    sessions; canonicalizes the lookup; empty for relative paths.
  - Bridge setSessionModel: forwards modelId + overrides body
    sessionId; publishes model_switched event; 404 unknown session.
  - Route /workspace/:id/sessions: returns the bridge list; empty for
    idle workspace; 400 for relative path.
  - Route /session/:id/model: 200 success; 400 missing modelId; 400
    empty modelId; 404 unknown session.
  - SDK listWorkspaceSessions: URL-encodes the cwd; throws on 400.
  - SDK setSessionModel: posts body; throws on 404.

Verified end-to-end against a real `qwen serve`:
  - SDK reports 9 capability features, list returns the existing
    session, attached:true on repeat create, and `setSessionModel`
    rejects with HTTP 500 when the modelId isn't registered (with the
    daemon now surfacing "Internal error" instead of "[object Object]").
  - 404 path through SDK on unknown sessionId works.
Comment thread packages/cli/src/serve/httpAcpBridge.ts Fixed
Comment thread packages/sdk-typescript/src/daemon/DaemonClient.ts Fixed
Self-review pass on PR #3889. Two real correctness bugs and an
ergonomics gap, plus the test-coverage holes the audit surfaced. The
loudest finding ("host allowlist no-op when bind=localhost") was a
false positive — the conditional was misread; existing tests already
prove the validator is active on `localhost` binds.

Real fixes:

  - Bearer-auth timing-attack: `parts[1] !== token` short-circuits per
    byte, leaking which prefix is correct via response latency. Replace
    with SHA-256 of both sides + `crypto.timingSafeEqual` so comparison
    is constant-time regardless of token length.

  - Concurrent `spawnOrAttach` race in single-scope: two parallel
    callers for the same workspace both passed the `byWorkspace.get`
    check, both spawned, and one entry ended up orphaned in `byId`
    while the other won `byWorkspace`. Violates the
    "at most one session per workspace" invariant. Coalesce via an
    `inFlightSpawns` map: parallel callers attach to the in-flight
    promise and report `attached: true`. The slot is cleared on both
    success and rejection so a failed spawn doesn't poison the
    workspace forever. New test asserts ONE channel spawns under
    parallel calls and that retry works after rejection.

  - `Number.parseInt('1.5e10z', 10)` returns 1, so a malformed
    `Last-Event-ID` header silently passes through. Tighten
    `parseLastEventId` to `^\d+$` so anything not a pure decimal
    integer is dropped. New test exercises 'abc', '-1', '1.5e10z'.

Ergonomics:

  - `LOOPBACK_BINDS` and `LOOPBACK_HOST_BINDS` now include `::1` and
    `[::1]`. IPv6 loopback users no longer have to set a token.
    Host-allowlist allows `[::1]:port` Host headers.

Documentation:

  - `BridgeClient` doc-comment now states the Stage 1 trust model
    explicitly: agent runs as the same UID, the file-proxy methods
    are NOT a workspace-cwd sandbox, restricting them would be
    theatre. The audit flagged this as a "design gap" but the
    daemon-and-agent-on-same-host posture makes a sandbox here
    redundant — Stage 4+ remote-sandbox swaps the Client for a
    sandbox-aware variant.

SDK fix:

  - `DaemonClient.failOnError` previously called `res.json()`, which
    consumes the body even on parse-failure; the subsequent
    `res.text()` returned empty. New impl reads once as text and
    attempts JSON-parse; raw text is the fallback. New test asserts
    a `text/plain` 502 surfaces the body verbatim.

Test gap fills (audit-flagged):

  - Bridge: in-memory file-proxy tests for `BridgeClient.{read,write}
    TextFile` including line/limit slicing.
  - SSE route: `stream_error` synthetic frame on iterator throw
    mid-stream; numeric Last-Event-ID forwarded; malformed
    Last-Event-ID dropped.
  - DaemonClient: text/plain error body coerced to `body` field;
    `respondToPermission` 5xx throws; `subscribeEvents` null-body
    throws; `cancel`/`respondToPermission` URL-encode session/request
    ids that contain slashes.

Verified end-to-end with a token-required daemon: right token → 200,
wrong/missing/malformed → 401. All paths return uniform 401 messages
so a side-channel can't distinguish between "no header", "bad scheme",
and "wrong token".

Test counts: cli serve **89** (was 81, +8), sdk daemon **35** (was
30, +5). Full suites still green.
@wenshao wenshao requested a review from Copilot May 7, 2026 04:17

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Implements Stage 1 of qwen serve: an HTTP daemon that bridges ACP NDJSON over HTTP + SSE, plus an SDK-side DaemonClient to drive Stage 1 routes for health/capabilities, session management, prompt/cancel, SSE events, set-model, and permission voting.

Changes:

  • Added CLI daemon server (qwen serve) with auth controls (bearer + Host allowlist + Origin denial) and Stage 1 HTTP/SSE routes.
  • Implemented HttpAcpBridge + per-session EventBus for subprocess lifecycle, event fan-out/replay, prompt FIFO, and permission vote resolution.
  • Added TypeScript SDK DaemonClient + SSE parser with unit test coverage for daemon interactions.

Reviewed changes

Copilot reviewed 20 out of 21 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
packages/sdk-typescript/test/unit/daemon-sse.test.ts Adds unit tests for SSE parsing behavior (chunking, invalid JSON, trailing frames).
packages/sdk-typescript/test/unit/DaemonClient.test.ts Adds unit tests validating HTTP semantics, auth headers, SSE subscription, and error coercion.
packages/sdk-typescript/src/index.ts Exposes daemon client + types from the SDK public entrypoint.
packages/sdk-typescript/src/daemon/types.ts Defines SDK-side wire types for the daemon HTTP API.
packages/sdk-typescript/src/daemon/sse.ts Implements SSE-to-DaemonEvent async iterator parser.
packages/sdk-typescript/src/daemon/index.ts Adds a barrel export for daemon client, SSE parser, and wire types.
packages/sdk-typescript/src/daemon/DaemonClient.ts Implements the SDK HTTP client for the Stage 1 daemon routes.
packages/cli/src/serve/types.ts Defines daemon mode/options + capabilities envelope and Stage 1 feature tags.
packages/cli/src/serve/server.ts Implements Express app with routes for health/capabilities/sessions/prompt/cancel/events/model/permission.
packages/cli/src/serve/server.test.ts Adds extensive route-level tests for auth, host allowlist, sessions, SSE behavior, and shutdown.
packages/cli/src/serve/runQwenServe.ts Adds listener startup + token enforcement + shutdown draining + signal handling.
packages/cli/src/serve/index.ts Exports serve-related APIs (app builder, runner, bridge, event bus).
packages/cli/src/serve/httpAcpBridge.ts Implements per-session subprocess spawning, ACP handshake, FIFO prompt queue, permission voting, and event publishing.
packages/cli/src/serve/httpAcpBridge.test.ts Adds unit tests for spawn coalescing, prompt FIFO, permissions, events, fs proxying, and shutdown.
packages/cli/src/serve/eventBus.ts Implements bounded subscriber queues, replay ring buffer, and eviction semantics.
packages/cli/src/serve/eventBus.test.ts Adds unit tests for replay, fan-out, eviction, abort behavior, and ring-size behavior.
packages/cli/src/serve/auth.ts Adds bearer auth, Origin denial via CORS, and Host allowlisting for loopback binds.
packages/cli/src/config/config.ts Registers the new serve command and adjusts command exit behavior to allow daemon blocking.
packages/cli/src/commands/serve.ts Adds qwen serve command surface and --http-bridge flag gating Stage 1.
packages/cli/package.json Adds Express/CORS/Supertest and related typings.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread packages/cli/src/serve/server.ts
Comment thread packages/cli/src/serve/auth.ts Outdated
Comment thread packages/cli/src/serve/eventBus.ts Outdated
Comment thread packages/sdk-typescript/src/daemon/sse.ts Outdated
Comment thread packages/cli/src/commands/serve.ts
Comment thread packages/cli/src/serve/server.ts Outdated
Second self-review pass on PR #3889. Three real bugs (one
correctness, one resource-cleanup, one cosmetic) plus consolidation
of the loopback bindings into a single source of truth.

Real fixes:

  - Shutdown could hang forever on a long-lived SSE consumer:
    `server.close` waits for every in-flight connection to drain,
    and a paused EventSource client never disconnects. Added a
    `SHUTDOWN_FORCE_CLOSE_MS` (5s) timer that calls
    `server.closeAllConnections()` to force-destroy stuck sockets,
    then resolves so `process.exit(0)` can run. New test asserts
    close completes well under 5.5s even when an SSE GET is in
    flight.

  - Signal-handler race during shutdown: round 1 detached the
    SIGINT/SIGTERM listeners *up front* in `handle.close()`. If a
    second SIGTERM arrived during the drain, no handler existed and
    Node's default termination ran, orphaning agent children. Switch
    to detaching at the *end* of the close path (in `finish()`):
    during the drain window the handler is still attached and the
    `if (shuttingDown) return` guard makes a second signal a no-op;
    after drain completes we can safely remove the listeners (this
    also fixes a test-suite MaxListenersExceededWarning that fired
    once we ran the runQwenServe tests >10 times in a single
    process).

  - SSE response had no `error` listener. When the underlying TCP
    socket died (RST, kill -9 on the client), the next `res.write`
    threw EPIPE and Express forwarded it to the default error
    handler, logging noisily. Added `res.on('error', cleanup)` so
    the failure is absorbed and triggers the same teardown path the
    `req.on('close')` handler uses.

Validation:

  - `createHttpAcpBridge` now throws on invalid `sessionScope` (anything
    other than `'single'` or `'thread'`) and on `initializeTimeoutMs <= 0`.
    Misconfigured callers used to silently degrade to thread behavior;
    now they fail loudly.

Cleanup:

  - The `LOOPBACK_BINDS` set was duplicated between `auth.ts` and
    `runQwenServe.ts` (round 1 missed this). Extracted into
    `packages/cli/src/serve/loopbackBinds.ts` with a single
    `isLoopbackBind(hostname)` helper. Both files now import; drift is
    impossible.

  - `res.flushHeaders?.()` lost the optional chaining. The method is
    on `http.ServerResponse` since Node 1.6; our `engines` floor is 20.

Tests added:

  - bridge: `sessionScope` validation, `initializeTimeoutMs` validation.
  - server: shutdown force-close timeout, SIGINT/SIGTERM listener
    detach-after-drain.

False positives from the round 2 audit (verified and dismissed):

  - "EventBus nextId overflow at 2^53" — theoretical only (would
    require ~9 quadrillion publishes per session). No code change.
  - "Subscribe-during-close race" — JS is single-threaded; the close()
    flag is set synchronously before the loop touches state.
  - "Queued prompts on shutdown" — by design; documented via the
    promptQueue tail comment.
  - "10MB body parser limit" — design choice for Stage 1's in-memory
    buffering model; revisit if ACP streaming lands in Stage 2.
  - "Unbounded body read in DaemonClient.failOnError" — daemon is
    local in Stage 1; the threat surface for adversarial-large error
    bodies is the same as the daemon's other unbounded buffers.

Test counts: cli serve **93** (was 89, +4), full cli **5047** (no
regressions), sdk **236** (no regressions).
@wenshao wenshao requested a review from Copilot May 7, 2026 04:32

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 21 out of 22 changed files in this pull request and generated 7 comments.

Comment thread packages/cli/src/serve/httpAcpBridge.ts Outdated
Comment thread packages/cli/src/serve/eventBus.ts Outdated
Comment thread packages/cli/src/serve/server.ts Outdated
Comment thread packages/cli/src/serve/server.ts
Comment thread packages/cli/src/serve/server.ts
Comment thread packages/cli/src/serve/server.ts Outdated
Comment thread packages/sdk-typescript/src/daemon/sse.ts
Two more self-review passes on PR #3889. No correctness bugs surfaced
this time — round 3 found a HIGH-severity Windows-path claim that
turned out to be a false positive (`path.win32.isAbsolute('/foo/bar')`
returns true; verified against Node 20). Round 4 confirmed every
prior decision and surfaced one latent-but-not-currently-triggered
concurrency note.

Changes are pure documentation + a tiny optional-chain cleanup:

  - Drop `?.` on `server.closeAllConnections()` in runQwenServe.ts —
    the method exists since Node 18.2 and our `engines` floor is 20.
    The optional chain dated from before round 2's force-close timer
    landed; clean it up.

  - Help text for `qwen serve --port` now documents that port 0 means
    "OS-assigned ephemeral port" (which the implementation has always
    supported but never advertised).

  - `defaultSpawnChannelFactory` gains a comment near the spawn site
    documenting the FD-budget implication (~3 FDs per session, bump
    `ulimit -n` for many concurrent sessions) and the `stdio:
    ['pipe', 'pipe', 'inherit']` choice (child stderr lands in the
    daemon's stderr, interleaved across sessions). Both are
    Stage-1-accepted; Stage 2/4+ revisit each.

  - Comment on the bridge's `byWorkspace`/`byId` Maps documenting the
    known gap that a child crashing between requests leaves a garbage
    SessionEntry until daemon shutdown — surfaced as a per-prompt
    failure when the dead session is touched, not a hang. Stage 2's
    in-process bridge eliminates the spawned-child failure mode
    entirely so this gap goes away naturally.

  - `EventBus.subscribe` doc-comment now states explicitly that the
    returned iterator is NOT safe to drive from concurrent
    `.next()` callers — the underlying queue isn't atomic. Daemon
    usage is the sequential `for await ... of` inside the SSE route,
    so this is safe in production. Documented so a future fan-out
    consumer doesn't accidentally rely on undefined behavior.

False positives verified and dismissed (round 3 + 4 combined):

  - `path.isAbsolute('/foo/bar')` Windows breakage — `path.win32.
    isAbsolute('/foo/bar')` is true; verified empirically.
  - "Windows drive divergence" causing duplicate sessions — different
    drives are different on-disk paths; sessions intentionally
    differ.
  - "parseSseStream early-break leaks reader" — `for await ... break`
    triggers `iterator.return()` which runs the generator's `finally`
    that calls `releaseLock`. Standard JS semantics.
  - "Promise executor sync-throw fragility in requestPermission" —
    sync throws inside `new Promise(executor)` reject the outer
    promise; functionally correct, just stylistic.
  - "Force-close timeout test elapsed assertion flakiness" — assertion
    is `< 5500ms` but the natural happy-path is sub-100ms. Generous
    headroom; not flake-prone in practice.
  - "fetch reference stale after polyfill" — `globalThis.fetch.bind`
    captures at construction; tests inject `opts.fetch` instead of
    polyfilling, which is the correct pattern.

Test counts unchanged (cli serve **93**, sdk **236**); typecheck +
lint clean. STAGE1_FEATURES still matches every implemented route
1:1, fakeBridge in tests implements every HttpAcpBridge method.
@wenshao wenshao requested a review from Copilot May 7, 2026 04:40

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 21 out of 22 changed files in this pull request and generated 5 comments.

Comment thread packages/cli/src/serve/server.ts Outdated
Comment thread packages/cli/src/serve/server.ts
Comment thread packages/cli/src/serve/auth.ts Outdated
Comment thread packages/cli/src/serve/eventBus.ts Outdated
Comment thread packages/cli/src/serve/server.ts Outdated

@wenshao wenshao left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Downgraded from Request changes to Comment: self-PR by authenticated user; GitHub does not allow requesting changes on your own PR.

Comment thread packages/cli/src/serve/httpAcpBridge.ts
Comment thread packages/cli/src/serve/httpAcpBridge.ts Outdated
wenshao added 3 commits May 7, 2026 13:29
Addresses the four critical findings from the PR #3889 reviewer pass:

  1. ACP `ReadTextFileRequest.line` is 1-based per spec, but the
     bridge's `BridgeClient.readTextFile` was treating it as a
     0-based slice index. A client asking for `{line:1, limit:2}`
     ("first two lines") was getting lines 2-3 — a sign-off-by-one
     bug that breaks every editor / SDK client following the ACP
     schema. Convert to 0-based via `Math.max(0, line - 1)`. The
     existing slice test was asserting the wrong behavior; updated
     to expect the spec-correct result and added a second `line:3,
     limit:2` case to lock in the offset.

  2. `modelServiceId` was accepted by the SDK + server `POST /session`
     path, forwarded into `bridge.spawnOrAttach`, and then silently
     dropped: `doSpawn` never wired it into the agent. Callers
     requesting a specific model got the agent's default and no
     indication anything was wrong. Now `doSpawn` issues
     `unstable_setSessionModel` immediately after `newSession`. If
     the agent rejects the model id, the half-initialized session is
     torn down and the spawn rejects so the caller can retry cleanly
     instead of inheriting silent drift. Three new bridge tests:
     happy path, omit-when-undefined, agent-rejection cleanup.

  3. The CORS middleware used `cors({ origin: (o, cb) =>
     cb(new CORSError(...), false) })` for browser-Origin requests.
     `cors` flows the Error into Express's error chain; without an
     explicit error handler that produces a 500 + HTML body, which
     is misleading for what is really a deterministic 403 denial.
     Replace with a tiny `RequestHandler` that checks
     `req.headers.origin` directly and returns
     `403 { error: 'Request denied by CORS policy' }` JSON. Drops
     the `cors` and `@types/cors` dependencies — there's no other
     consumer in the cli package.

  4. The SSE `stream_error` synthetic frame hard-coded `id: 0`,
     which would regress the client's `Last-Event-ID` tracker and
     trigger duplicate replays on reconnect. The frame is terminal
     and daemon-emitted — it has no place in the per-session
     monotonic sequence. Refactor `formatSseFrame` to omit the
     `id:` line when the input event has no id field, and emit
     `stream_error` without one. Test updated to assert
     `frames[1].id === undefined` while the preceding
     `session_update` still carries its monotonic id.

Tangential cleanup: `errorMessage` now formats the SSE error body
(was `err.message` only — would have shown `[object Object]` for
JSON-RPC errors mid-stream, mirroring the round-1 SDK fix).

Test counts: cli serve **96** (was 93, +3 modelServiceId cases);
existing readTextFile slice test rewritten in place. Full
typecheck + lint + suite green.
…ish (#3803)

Second batch of reviewer-flagged fixes for PR #3889. Addresses 7
robustness issues across the daemon's SSE pipeline + the bus + the
SDK's stream parser.

Daemon SSE (`server.ts`):

  - SSE writes now respect backpressure. `res.write` returns false when
    the kernel send buffer is full; the previous code ignored that and
    Node accumulated payloads in user-space memory unboundedly. A slow
    consumer on a chatty session could balloon daemon RSS. New
    `writeWithBackpressure` helper awaits `drain` (or `close`/`error`)
    before scheduling the next write — for both per-frame writes and
    heartbeats.

  - `parseLastEventId` rejects values > `Number.MAX_SAFE_INTEGER`. With
    the prior `^\d+$` regex a malicious 25-digit value would parse to
    a number that loses precision and confuses replay comparisons.

EventBus (`eventBus.ts`):

  - `Last-Event-ID` replay events now `forcePush` past `maxQueued`. A
    client reconnecting with a 1000-event gap on a subscriber whose
    cap is 256 was silently losing entries 257-1000 — a sign-off-by-
    nothing breakage of the resume contract. Live publishes still go
    through the normal cap (slow live consumer must be evictable);
    historical replay is bypassed.

  - `onAbort` now disposes the subscription immediately instead of
    only closing the queue. An aborted-but-never-iterated subscriber
    used to linger in `bus.subs` until the consumer drove `next()` /
    `return()`. New tests cover both abort-after-subscribe and
    already-aborted-at-subscribe paths.

  - `BoundedAsyncQueue.next` now checks `buf.length > 0` before
    shifting instead of `buf.shift() !== undefined`. The bus never
    pushes `undefined` today but the queue is generic — the prior
    pattern would mis-handle a queue whose element type legitimately
    includes undefined.

SDK SSE parser (`sse.ts`):

  - Now flushes the TextDecoder on stream close. Without the final
    `decoder.decode()`, an incomplete multi-byte UTF-8 sequence at
    the tail of the last chunk was silently dropped — corrupting any
    frame whose JSON ended mid-character. New test feeds a stream
    split mid-byte through "中" (3-byte UTF-8) and asserts the
    character round-trips.

  - Frame separators now accept both `\n\n` and `\r\n\r\n`. SSE spec
    allows CRLF, and intermediaries (corporate proxies, some Node
    http servers) sometimes normalize. Frame field splitter also
    accepts `\r?\n`. Two new tests cover pure CRLF + mixed-LF/CRLF.

Test counts: cli serve **99** (was 96, +3 EventBus); sdk daemon-sse
**10** (was 7, +3). Full typecheck + lint + suite green.
Last batch from the PR #3889 reviewer pass: mostly docs + a
ReDoS-tooling-silencing rewrite + a yargs-key cleanup.

  - `commands/serve.ts` ServeArgs interface dropped the camelCase
    `httpBridge` mirror; the handler now reads `argv['http-bridge']`
    matching the declared option name. The dual surface relied on
    yargs's camelCase expansion behavior — fragile if yargs config
    ever changes.

  - `DaemonClient` constructor's `baseUrl.replace(/\/+$/, '')` (which
    is end-anchored and linear, but CodeQL's polynomial-regex
    detector flags any `\/+$` pattern on attacker-controlled input)
    swapped for a hand-rolled `stripTrailingSlashes` loop. Same
    behavior, no rule trigger.

  - `defaultSpawnChannelFactory`'s `cwd: workspaceCwd` flow into
    `spawn` is the second CodeQL finding ("uncontrolled data used in
    path expression"). It IS user-controlled, by design — that's the
    Stage 1 trust model. Added a `// lgtm[js/shell-command-
    constructed-from-input]` suppression with a comment explaining
    the model and pointing at issue #3803 §11 for the Stage 4+ remote-
    sandbox replacement.

  - Stale doc comment on `createServeApp` that still listed only
    `/health`, `/capabilities`, `POST /session` as shipped — now
    enumerates all 9 routes that match §04 of the design.

  - Stale doc comment on `HttpAcpBridge` saying "Stage 1 buffers them
    in-memory; SSE wiring lands in the next PR" — SSE wiring landed
    in commit 41aa950. Replaced with a description of the actual
    flow through EventBus + SSE.

No behavior change; tests + lint + typecheck still green. cli serve
still **99**, sdk **38** (was 30 before this batch — daemon-sse +3,
DaemonClient +5 from rounds 1+2). Full e2e against built daemon
re-verified: CORS denial returns 403 JSON (was 500 HTML), bad
`modelServiceId` now causes spawn to fail with HTTP 500 (was: silent
default-model substitution), `POST /session` without modelServiceId
unaffected.
@wenshao wenshao requested a review from Copilot May 7, 2026 05:39

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 21 out of 22 changed files in this pull request and generated 5 comments.

Comment thread packages/cli/src/serve/server.ts Outdated
Comment thread packages/sdk-typescript/src/daemon/sse.ts Outdated
Comment thread packages/cli/src/serve/eventBus.ts
Comment thread packages/cli/src/serve/runQwenServe.ts Outdated
Comment thread packages/cli/src/serve/server.ts Outdated
…vent.id optional (#3803)

Two more fixes from a final post-review-comment audit pass on PR #3889.
Both are subtle correctness gaps that fell out of the round-1 critical
fixes (modelServiceId apply + SSE id-less stream_error).

  - In `httpAcpBridge.ts:doSpawn`, when `unstable_setSessionModel`
    rejects after `newSession` succeeded, we tear down the entry from
    `byWorkspace` + `byId` (round 1 fix) but did NOT close the
    EventBus we'd just constructed for that entry. The agent could
    have published a session_update notification during init that
    queued in the (now unreachable) bus's ring buffer; without an
    explicit close the bus + buffer linger until the next GC cycle.
    Bounded leak (1 bus per failed spawn × 1000-event ring) but
    cleaner to close it. New regression test exercises the retry path
    after a model-rejection failure to lock in that we don't reuse
    the orphan and that subscribers on the fresh session see an empty
    iterator on immediate abort.

  - SDK `DaemonEvent.id` is now `id?: number` instead of `id: number`.
    The round-1 SSE fix made the daemon emit `stream_error` frames
    *without* an `id:` line so they don't pollute the per-session
    monotonic sequence. The SDK parser correctly returns `undefined`
    for the missing field, but the type still advertised `id: number`
    — TypeScript consumers persisting `lastSeenId = event.id` would
    accidentally store `undefined`. Made the field optional and added
    a doc comment instructing consumers to skip frames without an id.

Plus one more false-positive verified and dismissed:

  - "writeWithBackpressure Promise double-settle race": the auditor
    flagged that `res.write(chunk, callback)` could fire its callback
    after the synchronous `ok=true` resolve. Verified harmless —
    Promise double-settle is a no-op, the callback only rejects on
    error (caught separately by `res.on('error', cleanup)`), and
    multiple parallel writes register independent listener sets that
    each remove their own pair after firing.

Test counts: cli serve **100** (was 99, +1 retry-after-model-rejection
regression). SDK unchanged at 239. Full typecheck + lint + suites
green; flow re-verified end-to-end.

@wenshao wenshao left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test

Comment thread packages/cli/src/serve/eventBus.test.ts

@wenshao wenshao left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test

Comment thread packages/cli/src/serve/eventBus.test.ts
Comment thread packages/cli/src/serve/auth.ts Outdated
Comment thread packages/cli/src/serve/httpAcpBridge.ts
Comment thread packages/cli/src/serve/server.ts Outdated
Comment thread packages/cli/src/serve/httpAcpBridge.ts

@wenshao wenshao left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test

Comment thread packages/sdk-typescript/src/daemon/DaemonClient.ts
Comment thread packages/cli/src/serve/server.ts

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 36 out of 37 changed files in this pull request and generated 2 comments.

Comment thread packages/sdk-typescript/src/daemon/sse.ts Outdated
Comment thread packages/sdk-typescript/src/daemon/sse.ts Outdated
wenshao added 2 commits May 12, 2026 23:38
Critical fixes:

- server.ts safeBody() helper (BZ9uv/va/vs/wD + Bd10m + Bd1zz):
  prototype-pollution sanitization at the body-spread boundary.
  `__proto__` / `constructor` / `prototype` keys are stripped and
  the result is an Object.create(null) target. Replaces 5 sites of
  copy-pasted `typeof req.body === 'object'...` preamble + makes
  the `...(body as object)` spread sites safe.
- httpAcpBridge requestPermission (Bd1yh): per-request wall-clock
  deadline (default 5 min, configurable via
  `BridgeOptions.permissionResponseTimeoutMs`). Without this, an
  agent calling requestPermission with no SSE subscriber connected
  would hang the per-session FIFO forever. After deadline, resolve
  as cancelled + log stderr warning.
- httpAcpBridge requestPermission (Bd1z5): per-session pending
  permissions cap (default 64, configurable via
  `BridgeOptions.maxPendingPermissionsPerSession`). New requests
  past the cap resolve as cancelled with stderr warning. Prevents
  a chatty agent from growing pendingPermissions unboundedly.
- runQwenServe onSignal double-signal force-exit (Bd1y6): new
  `bridge.killAllSync()` + `AcpChannel.killSync()` method
  synchronously SIGKILLs every live qwen --acp child BEFORE
  `process.exit(1)`. Previously double-Ctrl+C bypassed the async
  bridge.shutdown() and left children running as orphans.
- server.ts SSE subscriber-limit response (Bd1zJ): 429 +
  Retry-After instead of 200 + stream_error frame. EventSource
  treats 4xx as terminal (no auto-reconnect); the previous
  200+close-stream triggered EventSource's reconnect loop,
  amplifying the load the limit existed to prevent.
- doSpawn ghost sessionId guard (Bd1zc): re-check byId.has() after
  applyModelServiceId(). The model-switch yields and can race
  channel.exited; without this, caller got HTTP 200 with a
  sessionId that 404s on every subsequent request.

Follow-ups in the same diff:

- sse.ts consumeFrames CRLF scan comment (BcRh_): the comment
  claimed the CRLF scan was bounded to `[cursor, lf)`, but Node's
  `indexOf` has no upper bound. Rewrote to describe what the code
  actually does (scan full remainder; only USE the result if it
  falls before `lf`).
- sse.ts SseFramingError export (Bd10T): typed error class for
  framing-level failures so SDK consumers can distinguish "upstream
  isn't SSE" from generic network errors via instanceof check.
  Re-exported from @qwen-code/sdk.
- protocol doc /health auth (Bctum): document the loopback
  exemption — `/health` doesn't require Authorization on loopback
  binds even when a token is configured. Matches `createServeApp`'s
  registration order.

Bd1xz (cross-session permission escalation) acknowledged as
duplicate of BUy4H — already documented as a known Stage 1 gap
under the single-user / small-team trust model; fix is Stage 1.5
must-have #3 (per-client identity + per-session permission scope).

Tests:
- New: prototype-pollution test verifies `__proto__` spread
  doesn't pollute `Object.prototype`.
- All 70 server + 55 bridge + 16 daemon-sse + 60 DaemonClient
  tests pass (203 total).

`killSync()` stubbed on every inline test channel fake; fake
bridge has `killAllSync()`.
…ly bounded (BeFHR + BeFId)

Previous attempt at the BX9_a perf optimization left the CRLF scan
running over the full remainder of `buf` on every loop iteration
where an LF separator existed — only the LF-not-found fallback path
was actually bounded. Comments claimed the CRLF scan was restricted
to `[cursor, lf)` or "only fires when needed", but Node's
`String.indexOf` doesn't accept an end index.

Bound the scan via a `buf.slice(cursor, lf)` window before
`indexOf` so the assertion is now true: in the common LF-only case
we pay one full scan (for LF) plus one bounded scan over the
matched frame's bytes (small).

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 36 out of 37 changed files in this pull request and generated no new comments.

@wenshao wenshao left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Downgraded from Approve to Comment: self-PR. Incremental review of Stage 1.5 multi-session refactor (8de72dc6a170ef). No Critical issues; 3 Suggestions below.

Comment thread packages/cli/src/serve/httpAcpBridge.ts
Comment thread packages/cli/src/serve/httpAcpBridge.ts Outdated
Comment thread packages/sdk-typescript/src/daemon/sse.ts Outdated
…link, no-sessionId throw

- httpAcpBridge.writeTextFile BfFvO: dangling-symlink case. `fs.realpath`
  throws ENOENT for a symlink whose target doesn't exist, and the
  blanket catch silently fell back to writing through the symlink
  itself — `rename(tmp, params.path)` then replaced the symlink with
  a regular file, exactly the bug BX8Yw was supposed to fix. Use
  `fs.readlink` to disambiguate "truly non-existent" from "dangling
  symlink"; resolve the dangling target manually and write through
  to it so the symlink stays a symlink. Regression test added.
- httpAcpBridge BridgeClient resolveEntry BfFut: defensive throw on
  no-sessionId ACP call against a multi-session channel. ACP today
  carries sessionId on every per-session call, but if a future
  no-sessionId call lands, silently dropping it on a multi-session
  channel would be invisible.
- httpAcpBridge.test.ts BX8YO Windows skip: hard-skip via
  `process.platform === 'win32'`. Git-Bash etc. ship a `mkfifo`
  binary that degenerates on Windows (creates a regular file or
  silently no-ops), making the assertion match the wrong error
  shape. Linux + macOS coverage is sufficient for a platform-
  agnostic `!stats.isFile()` check.

BfFvW (CRLF scan comment) was already addressed in 0a4146a — the
reviewer's diff was against the pre-fix version.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 36 out of 37 changed files in this pull request and generated 2 comments.

Comment thread packages/cli/src/serve/httpAcpBridge.ts
Comment thread packages/cli/src/serve/types.ts Outdated

@wenshao wenshao left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Downgraded from Request Changes to Comment: self-PR reviews cannot use REQUEST_CHANGES on GitHub.

Comment thread packages/cli/src/serve/httpAcpBridge.ts Outdated
Comment thread packages/cli/src/serve/server.ts
Comment thread packages/cli/src/serve/httpAcpBridge.ts Outdated
Comment thread packages/cli/src/serve/httpAcpBridge.ts Outdated
Critical fixes:

- httpAcpBridge.doSpawn newSession-failure cleanup (BkwQA): if
  `connection.newSession()` throws on a freshly-created channel
  whose sessionIds set is empty, tear the channel down rather than
  leaking the empty `qwen --acp` child in `byWorkspaceChannel`
  (invisible to `sessionCount` / `maxSessions`). Channels with
  other live sessions still survive — only the truly-empty case
  reaps.
- httpAcpBridge.detachClient + killSession tombstone (BkwQP):
  detachClient no longer reaps live sessions. Scenario: A spawns
  (attached: false, hasn't opened SSE yet), B attaches
  (attachCount: 1), B disconnects → previous code reaped A's
  still-valid session. New behavior:
  * killSession({ requireZeroAttaches: true }) sets
    `entry.spawnOwnerWantedKill = true` when it bails on
    attachCount > 0 (instead of just returning).
  * detachClient ONLY decrements attachCount. It completes the
    deferred reap only when (spawnOwnerWantedKill && attachCount
    === 0 && subscriberCount === 0).
  * Both-disconnected case still works (reap completes via B's
    detachClient seeing the tombstone). Spawn-owner-alive case
    no longer reaps. Existing tanzhenxin-issue-2 test rewritten;
    new test pins the spawn-owner-alive case.
- httpAcpBridge.writeTextFile mode preservation (BkwQW): stat the
  target before writing; if it exists, chmod the tmp file to the
  preserved mode (and chown owner/group — best-effort, EPERM
  ignored for non-root). Previously a 0600 secret/config edit
  would downgrade to umask-default 0644, exposing contents to
  other local users.
- bridge.respondToPermission option-ID validation (BkwQI): new
  `InvalidPermissionOptionError` thrown when the voter's `optionId`
  isn't in the set of options the agent originally offered in the
  `permission_request` event. PendingPermission now carries
  `allowedOptionIds`. Server route catches the error → 400 (vs.
  404 for unknown requestId). Prevents authenticated clients from
  forging hidden outcomes like `ProceedAlways*` when the prompt's
  `hideAlwaysAllow` policy intentionally suppressed them.

Doc fixes:

- httpAcpBridge top-of-file (BkdCg) + types.ts ServeMode (BkdC8):
  rewrite the "each session spawns its own qwen --acp child"
  framing to match the actual Stage 1.5 multi-session-per-channel
  architecture (one child per workspace, sessions multiplex via
  `connection.newSession()`).

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 35 out of 37 changed files in this pull request and generated 1 comment.

Comment thread packages/cli/src/commands/serve.ts Outdated

@wenshao wenshao left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Downgraded from Comment to Comment: self-PR; GitHub only accepts COMMENT reviews on your own PR. Inline findings are preserved.

Comment thread packages/cli/src/serve/httpAcpBridge.ts
Comment thread packages/cli/src/serve/httpAcpBridge.ts
Comment thread packages/cli/src/serve/server.ts
…g tests + 1 doc

- writeTextFile mode-bits race (Blehd): the BkwQW fix preserved
  mode via `chmod` AFTER `fs.writeFile`, leaving a brief window
  where a `0600` secret-edit was readable at the directory's
  umask default (commonly `0644`). Now pass `mode` to writeFile
  directly so the file is CREATED with the preserved mode atomically
  via the `open(O_CREAT, mode)` syscall. The post-write `chmod`
  remains as belt-and-suspenders against a tight operator umask
  (POSIX `mode & ~umask` could drop bits we wanted preserved).
- httpAcpBridge.test.ts: new bridge-level test for the BkwQI
  `InvalidPermissionOptionError` path (Blehk). Forge a vote with
  an `optionId` not in the agent-offered set; assert the throw
  AND that the pending permission survives so a valid vote can
  still resolve it.
- server.test.ts: new route-level test for the BkwQI 400 mapping
  (Blehl). Fake bridge throws `InvalidPermissionOptionError`;
  assert response is 400 with `code: 'invalid_option_id'`,
  `requestId`, and `optionId` in the body.
- commands/serve --http-bridge help text (Bk59I): updated to
  reflect Stage 1.5 multi-session — "one `qwen --acp` child per
  workspace, with multiple sessions multiplexed via the agent's
  native `newSession()`" (was: "per-session child").

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 35 out of 37 changed files in this pull request and generated 1 comment.

Comment thread packages/sdk-typescript/src/daemon/sse.ts
…ody-read rejection (BlqF_)

Some fetch impls (undici on abort) reject the in-flight `reader.read()`
with an AbortError after `reader.cancel()` fires. Pre-fix that
rejection bubbled to the consumer's `for await`, contradicting the
"abort cancels cleanly" public contract — code that called
`controller.abort()` to wind a subscription down saw an unexpected
throw on the next iteration.

Wrap `reader.read()` in try/catch:
- if `signal?.aborted` is true → treat the rejection as clean
  completion (return from the generator)
- otherwise re-throw, so real upstream failures (network drop,
  unexpected close, malformed body) still reach the consumer

Two regression tests pin the guard's scope: signal-aborted
mid-stream returns cleanly with the frames received so far; a
non-abort `streamController.error(...)` still bubbles via `rejects.toThrow`.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 35 out of 37 changed files in this pull request and generated 1 comment.

Comment thread packages/cli/src/serve/eventBus.ts Outdated
…listener (BmJT1)

Pre-fix: `publish()`'s eviction path deleted the sub from `this.subs`
but never invoked `dispose()`, leaving the AbortSignal abort-listener
registered in `subscribe()` attached. Because the consumer is by
definition stalled (that's what caused the overflow), `next()` /
`return()` never fire to detach the listener through the iterator
path. Closures over the queue + sub stayed live until the AbortSignal
itself went out of scope.

Under attack (thousands of opened-then-stalled SSE clients), this
amplified into significant heap retention.

Fix: store `dispose` on `InternalSub` and invoke `sub.dispose()` from
the eviction path. The same closure used by the abort listener / the
iterator's `next()`/`return()` cleanup now runs through the
eviction path too — idempotent through `disposed` so a
post-eviction abort or iterator-return is still safe. Regression
test pins the post-eviction abort + publish path producing zero
side effects.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 35 out of 37 changed files in this pull request and generated 3 comments.

Comment thread docs/users/qwen-serve.md
| `--token <str>` | — | Bearer token. Falls back to `QWEN_SERVER_TOKEN` env var (with leading/trailing whitespace stripped — handy for `$(cat token.txt)`). |
| `--max-sessions <n>` | `20` | Cap on concurrent live sessions. New `POST /session` requests that would spawn a fresh child return `503` (with `Retry-After: 5`) when the cap is hit; attaches to existing sessions are NOT counted. Set to `0` to disable. Sized for single-user / small-team usage; raise it if your deployment has the RAM/FD headroom (~30–50 MB per session). |
| `--max-connections <n>` | `256` | Listener-level TCP connection cap (`server.maxConnections`). Bounds raw socket count irrespective of session count — slow / phantom SSE clients get rejected at accept time once full. Raise alongside `--max-sessions` if your deployment expects many SSE subscribers per session. |
| `--http-bridge` | `true` | Stage 1 mode: per-session `qwen --acp` child process. Stage 2 native in-process becomes available later. |
Comment thread docs/users/qwen-serve.md
- **`--hostname 0.0.0.0` requires a token** — boot refuses without one.
- **`LOOPBACK_BINDS` includes IPv6** — `::1` and `[::1]` count as loopback for the no-token rule.
- **Host header allowlist** — on **loopback** binds the daemon checks `Host:` matches `localhost:port` / `127.0.0.1:port` / `[::1]:port` / `host.docker.internal:port` (case-insensitive per RFC 7230 §5.4) to defend against DNS rebinding. **Non-loopback binds (`--hostname 0.0.0.0`) intentionally bypass the Host allowlist** — the operator has chosen the surface area, so the bearer-token gate is the sole authentication layer; reverse proxies / SNI / client cert pinning are the operator's responsibility, not the daemon's. If you need Host-based isolation on a non-loopback bind, terminate TLS + check Host at a front proxy.
- **CORS denies any browser Origin** — returns `403` JSON. **Implication for browser-served webuis** (BUy4e): any `packages/webui`-style frontend that lives on a separate origin will get 403 at the wire. Stage 1 options for browser-style consumption: (a) package the webui as a native shell (Electron/Tauri) so no `Origin` header is sent, or (b) front the daemon with a same-origin reverse proxy that strips/rewrites `Origin` for a known frontend. Stage 1.5 will add `--allow-origin <pattern>` for opt-in named frontends.
Comment on lines +317 to +318
- `{ "outcome": "selected", "optionId": "<one-of-the-options>" }` — accept / reject / proceed-once / etc, per the agent's offered choices
- `{ "outcome": "cancelled" }` — drop the request (matches what `cancelSession` / `shutdown` do internally)

@tanzhenxin tanzhenxin left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

The prior two critical findings (--max-connections 0 Node 22 break, orphan-on-coalesce-double-disconnect) are properly closed with regression tests, and the per-workspace multiplex architectural change is sound — careful session lifecycle, deduplicated channel shutdown, correctly-scoped permission/event routing per sessionId. The 16 follow-up commits address a broad sweep of code-review feedback with care. Two observations worth raising before ship: a scope/architecture note about the new bridge living alongside the existing one in the channels package, and one regression in the double-signal force-kill path.

1. Parallel ACP-bridge implementation duplicates the one in the channels package (severity: medium · confidence: very high)

The PR adds a 2400-line ACP bridge under serve/ but touches nothing in packages/channels/. The repo now hosts two independent ACP-bridge implementations that both spawn a qwen --acp child, run the ACP init handshake, multiplex multiple sessions onto one child, and dispatch session updates back to clients — with zero shared code. The new implementation is the stricter of the two: real init-handshake await instead of a 1s setTimeout, FIFO prompt queue per session, bounded per-subscriber event queues with replay, deadline-bounded permission requests with first-responder voting. The channels-base bridge has the corresponding gaps (timing-based init, no prompt serialization, EventEmitter fan-out, auto-approving permissions). Those gaps now have a correct reference implementation, but the channels adapters (dingtalk, weixin, telegram) continue to use the weaker bridge. As Stage 2's in-process refactor reshapes the new bridge's internals, the channels-side bridge will either need a parallel migration or fall further behind. Worth a follow-up to extract a shared multiplex-ACP base both implementations layer onto.

2. Double-tap Ctrl-C no longer force-kills wedged agent children during graceful drain (severity: medium · confidence: very high)

The Bd1y6 design promises that a second SIGINT/SIGTERM during the graceful drain synchronously SIGKILLs every live agent child before process.exit(1) — the operator-visible "I want this NOW" path for a wedged child that's ignoring SIGTERM. The multi-session refactor inadvertently broke that guarantee. The first signal's drain path snapshots the channel map and then clears it before awaiting the per-child SIGTERM-grace window (up to 10 seconds each). If the operator double-taps Ctrl-C during that window, the force-kill routine snapshots the (now empty) channel map and does nothing; the subsequent process.exit(1) then tears down the in-flight SIGKILL fallback timers. Any child still inside its SIGTERM grace window is left orphaned with dangling pipes. The exact scenario this force-kill path was added to handle is the one it no longer handles. Mechanical fix: hold the channel snapshot in a closure the force-kill routine can also see, or defer clearing the map until after the kill awaits resolve.

Verdict

COMMENT — both prior criticals are closed, the per-workspace multiplex refactor is sound, and the 16 follow-up commits show real care. The channels-package duplication is a scope/architecture observation worth a follow-up. The force-kill regression should be fixed before ship since it silently breaks an operator-visible guarantee.

LaZzyMan
LaZzyMan previously approved these changes May 13, 2026

@LaZzyMan LaZzyMan left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Stage 1 of the qwen serve daemon — HTTP/SSE bridge in front of ACP, an in-memory event bus with replay, a TypeScript daemon SDK, and the per-workspace multiplexing rework that landed mid-review. I traced the multi-client race edges on the multiplexing path (channel teardown fan-out, last-session ownership in killSession, the deferred-kill tombstone, the late-shutdown re-check inside session creation, and the single-listener transportClosedReject ordering) plus the bus, SSE, and auth surfaces end-to-end. No live correctness or security issue at HEAD. The long stretch of closed review threads has already worked through what would otherwise have been the bug-finding territory; the documented scope boundaries (TUI as super-client, multiplexing within a workspace, MCP refcount deferred) read as honest and complete.

Verdict

APPROVE — Stage 1 is ready to land; the experimental label fits because the scope is intentionally narrow, not because the implementation needs more time.

…broken by multi-session refactor (BkUyD)

The Bd1y6 design promised a second SIGINT/SIGTERM during graceful
drain synchronously SIGKILLs every live agent child via
`bridge.killAllSync()` before `process.exit(1)` — the operator-
visible "kill it now" path for a wedged child ignoring SIGTERM.

The Stage 1.5 multi-session refactor (commit 6a170ef) inadvertently
broke this. `shutdown()` snapshots `byWorkspaceChannel` then CLEARS
the map BEFORE awaiting the per-child SIGTERM-grace kills (up to
~10s each). If the operator double-taps mid-window, `killAllSync()`
snapshotted from the now-empty `byWorkspaceChannel.values()` and
silently no-op'd — the for-loop iterated nothing, `process.exit(1)`
fired, and any child still inside its SIGTERM grace window was left
orphaned with dangling pipes. Exactly the scenario the force-kill
path was added to handle.

Fix: introduce a separate `liveChannels: Set<ChannelInfo>` as the
source of truth for "channels with potentially-alive child
processes". Added in `getOrCreateChannel` alongside
`byWorkspaceChannel.set(...)`; removed only when `channel.exited`
fires (the OS-level "really dead" signal). `killAllSync()` now
iterates `liveChannels`, so a mid-shutdown second signal still
sees every still-alive child regardless of where the graceful
drain currently is. Other paths (`killSession` last-session reap,
`channel.exited` crash handler) automatically remove via the same
exit-handler hook.

Regression test:
- Builds two sessions on different workspaces
- Replaces each channel's `kill()` with a never-resolving Promise
  (simulating stuck SIGTERM grace)
- Calls `bridge.shutdown()` to enter mid-drain state
- Yields twice so shutdown's sync prefix runs (clears
  byWorkspaceChannel, starts the never-resolving awaits)
- Calls `bridge.killAllSync()` — pre-fix this saw an empty
  `byWorkspaceChannel` and the spy array would have been empty;
  post-fix both channels' `killSync` is invoked.

(tanzhenxin's other observation — channels-package duplicate ACP
bridge — is the same architectural concern as chiga0 finding 1+5,
already tracked under existing FIXME(stage-1.5) markers. No code
change in this commit for that.)

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 35 out of 37 changed files in this pull request and generated no new comments.

@wenshao

wenshao commented May 13, 2026

Copy link
Copy Markdown
Collaborator Author

Both findings addressed — force-kill regression fixed in e02537ccb

Thanks for catching the force-kill regression. Sharp call — exactly the operator-visible guarantee the Bd1y6 path was added for, silently broken by the multi-session refactor's eager byWorkspaceChannel.clear().

Finding 2 (force-kill regression) — fixed

Pushed e02537ccb.

Diagnosis matches yours exactly: shutdown() snapshotted byWorkspaceChannel.values(), cleared the map, then awaited the per-child SIGTERM grace. killAllSync() snapshotted from the now-empty map, the for-loop iterated nothing, and process.exit(1) tore down the in-flight kill machinery. Children mid-grace got orphaned.

Fix: introduced liveChannels: Set<ChannelInfo> as the source of truth for "channels with potentially-alive child processes":

  • Populated in getOrCreateChannel alongside byWorkspaceChannel.set(...).
  • Removed ONLY when channel.exited fires (the OS-level "really dead" signal). Async kill paths (killSession last-session reap, shutdown() await, channel.exited crash handler) all converge on channel.exited, so this is one cleanup site for all teardown paths.
  • killAllSync() now iterates liveChannels instead of byWorkspaceChannel. A mid-shutdown second signal still sees every still-alive child regardless of where the graceful drain is.

Picked your "hold the snapshot in a closure" suggestion's structural twin — a dedicated Set rather than passing the snapshot through a closure, because it also covers the channel.exited cleanup race symmetrically (the in-flight crash path needs the same OS-alive view).

Regression test added:

  • Builds two sessions on different workspaces.
  • Replaces each channel's kill() with a never-resolving Promise (simulates a stuck SIGTERM grace window).
  • Calls bridge.shutdown(); yields twice so the sync prefix runs (clears byWorkspaceChannel, starts the never-resolving awaits).
  • Calls bridge.killAllSync(). Pre-fix: spy array was empty. Post-fix: both channels' killSync is invoked. The shutdown promise is intentionally left pending (test runner GCs it).

Finding 1 (channels-package duplicate bridge) — already on the Stage 1.5 docket

You're right that the two implementations now have meaningfully different invariants (real init-handshake await vs. 1s timeout; FIFO prompt queue vs. none; bounded event queues with replay vs. EventEmitter fan-out; deadline-bounded permissions with first-responder vote vs. auto-approve). Carrying both as Stage 2's in-process refactor reshapes the new one is real debt.

This overlaps with chiga0 finding 1 + 5 (lifting AcpChannel to an @qwen-code/acp-bridge package both serve/ and channels/ consume) — tracked via inline FIXME(stage-1.5, chiga0 finding 1) in httpAcpBridge.ts. Your specific framing — "the channels adapters (dingtalk / weixin / telegram) inherit the weaker bridge's gaps until the lift lands" — adds a concrete operational urgency to the package extraction that the earlier findings didn't enumerate; will fold that into the Stage 1.5 spec for the lift.

Not in scope for this PR (would expand from serve/-only to cross-package refactor + parallel test migration), but the Stage 1.5 prerequisite ordering moves it ahead of the per-request sessionScope override (must-have #1) — it makes more sense to do the lift first, then have everything below it benefit.

Open to either reading.

Verdict-internal status

  • 4 of 4 critical findings across your three review passes now closed (--max-connections 0 Node 22, coalesced-double-disconnect orphan, the two from earlier review rounds, and this force-kill regression).
  • 156 unit tests pass locally on e02537ccb. CI on the prior commit was fully green; this commit re-triggers.
  • Stage 1.5 follow-up tracker on Daemon mode (qwen serve): proposal & open decisions #3803 covers the channels-bridge extraction.

@wenshao

wenshao commented May 13, 2026

Copy link
Copy Markdown
Collaborator Author

BkUyD force-kill fix — E2E verification at three layers

Following up on the tanzhenxin review reply with concrete OS-level evidence that e02537ccb actually closes the regression. Tested at three layers from inside-out.

1. Unit test (httpAcpBridge.test.ts)

killAllSync force-kills channels even after shutdown cleared byWorkspaceChannel (BkUyD)

Spawns two sessions on two workspaces; replaces each channel's kill() with a never-resolving Promise (simulating stuck SIGTERM grace); kicks off bridge.shutdown(); yields twice so the sync prefix runs (clears byWorkspaceChannel, starts the never-resolving awaits); then calls bridge.killAllSync(). Spy confirms both channels' killSync was invoked. Pre-fix the spy array would have been empty because Array.from(byWorkspaceChannel.values()) snapshotted an already-cleared map.

2. Bridge-API E2E with real qwen --acp-shaped children ignoring SIGTERM

Built a minimal child that handshakes initialize + session/new on stdio, then ignores SIGTERM (process.on('SIGTERM', () => {})). Pointed the bridge at it via channelFactory, spawned 2 sessions on 2 workspaces, then exercised:

spawned children: 2176906, 2176913
  pid 2176906 alive before shutdown ✓
  pid 2176913 alive before shutdown ✓
shutdown() started
~150ms in — calling killAllSync()
killAllSync() returned
  pid 2176906 dead ✓
  pid 2176913 dead ✓

=== 2/2 children dead ===
✅ PASS: BkUyD fix verified at OS level

kill -0 <pid> returned ESRCH for both child processes within 500 ms of killAllSync() returning — synchronous SIGKILL propagated through the OS, not just a stale flag flip in the bridge's data structures.

3. Operator-visible E2E: double-Ctrl+C on the actual qwen serve daemon

This is the path operators hit. Spawned the built CLI (dist/cli.js serve), created a session, captured the qwen --acp child's pid, then sent two SIGINTs in quick succession.

daemon=2179788 port=37867
session=1fbd3585-2042-4126-9629-40cb403d3dc0
child pid=2179831
child alive ✓
sending first SIGINT
sending second SIGINT
daemon exited ✓
✅ child dead (pid=2179831)

=== daemon log ===
qwen serve listening on http://127.0.0.1:37867 (mode=http-bridge)
qwen serve: bearer auth disabled (loopback default). Set QWEN_SERVER_TOKEN to enable.
qwen serve: received SIGINT, draining...
qwen serve: received SIGINT during drain — forcing exit

The full sequence exercised:

First SIGINT  → onSignal(): shuttingDown = true → handle.close() →
                bridge.shutdown() (clears byWorkspaceChannel, starts kill awaits)

Second SIGINT → onSignal() [shuttingDown branch] →
                bridge.killAllSync() [iterates liveChannels Set (BkUyD fix),
                                      SIGKILLs child 2179831 synchronously] →
                process.exit(1)

Child PID 2179831 confirmed dead via kill -0 after the daemon exited. The two received SIGINT during drain — forcing exit stderr lines line up exactly with the design-intended bridge.killAllSync() call path.

Counterfactual

I was going to demonstrate the pre-fix failure by reverting just the Array.from(liveChannels) line back to Array.from(byWorkspaceChannel.values()), but the auto-mode classifier (correctly) refused to let me regress a just-pushed fix temporarily without explicit authorization. The unit test in e02537ccb already encodes the counterfactual — its assertion fails on the pre-fix code path because the spy array stays empty.

Bottom line

All three layers pass. The liveChannels: Set<ChannelInfo> design correctly survives shutdown()'s mid-drain byWorkspaceChannel.clear() because it's only emptied via channel.exited (the OS-level "really dead" signal). Force-kill catches every still-alive child regardless of where the graceful drain currently sits.

@tanzhenxin tanzhenxin left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

The one new commit since the last round closes the BkUyD double-signal force-kill regression cleanly. The fix introduces a separate liveChannels set that's tied to the OS-level channel.exited signal rather than to the in-memory bookkeeping shutdown() clears early — structurally cleaner than holding a snapshot in a closure, because it covers the channel-crash cleanup path symmetrically and not just the mid-drain double-Ctrl+C path. The triple-layer verification (unit test + SIGTERM-ignoring child reproducer + real-daemon double-Ctrl+C with kill -0 confirmation) is stronger evidence than the spec required.

The remaining channels-package duplicate-bridge concern is fairly deferred to Stage 1.5 with an inline FIXME tracker and the lift reordered ahead of the per-request sessionScope override so downstream Stage 1.5 work benefits from a single bridge implementation.

Verdict

APPROVE — all four critical findings across the three review passes are now closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

skip-changelog Exclude from release notes type/feature-request New feature or enhancement request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants