[codex] Add daemon session load/resume#4222
Conversation
Adds HTTP and SDK support for restoring persisted daemon sessions through load/resume routes, including replay buffering for load and guarded concurrent restore handling. Refs #4175 Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
📋 Review SummaryThis PR adds daemon HTTP and TypeScript SDK support for loading and resuming persisted sessions through explicit restore routes ( 🔍 General Feedback
🎯 Specific Feedback🟡 High
🟢 Medium
🔵 Low
✅ Highlights
|
Code Coverage Summary
CLI Package - Full Text ReportCore Package - Full Text ReportFor detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run. |
- Gate `defaultEntry` claim in `restoreSession` on `defaultSessionScope === 'single'`, mirroring `doSpawn`. Without the gate, a restored session silently became the omitted-scope attach target on `'thread'`-default daemons. - Rename advertised capability `session_resume` to `unstable_session_resume` to match the underlying ACP method (`connection.unstable_resumeSession`). `session_load` stays stable. - Seed `lastEventId: 0` in `DaemonSessionClient.resume`, symmetric with `load`. The agent's `unstable_resumeSession` schedules an `available_commands_update` via `setTimeout(0)`; without the seed the SDK consumer would miss that frame. - Add HTTP-level test for the `RestoreInProgressError → 409` envelope. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
wenshao
left a comment
There was a problem hiding this comment.
整体设计扎实——in-flight restore 合并、replay 帧用临时 EventBus 暂存再 promote 给注册后的 entry、cap 计数把 inFlightRestores.size 也算进去、shutdown 等待在飞的 restore——都处理得很干净。下面是几条具体建议,按严重程度排序;Critical 0 条,主要是 1 条未声明的行为变更 + 文档同步缺口 + 一些覆盖率和打磨。
- Cross-reference the `POST /session` disconnect-cleanup rationale
from `restoreSessionHandler`'s `!res.writable` branch so future
maintainers find the BQ9tV race + tanzhenxin attach-rollback
context without grep.
- Document `DaemonSessionState.{models, modes, configOptions}` in
the SDK so callers can narrow to the ACP `SessionModelState` /
`SessionModeState` / `SessionConfigOption` shapes.
- Add JSDoc on `DaemonClient.restoreSession` explaining why
`loadSession` and `resumeSession` collapse into one transport.
🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
Address the four Critical findings from PR #4222 review (wenshao): - Coalesced restore waiters now observe the same ACP state the original restore caller did. `state: {}` in `restoreSession`'s coalesce branch was clobbering the spread `restored.state`, so concurrent callers got different payloads based purely on timing. Cache the load/resume response on `SessionEntry.restoreState` and return it from both the existing-byId early return and the coalesce branch. - Drop the `defaultEntry` promotion on restore. Explicit `session/load` / `session/resume` is "give me THIS id"; it must not become the implicit attach target for subsequent omitted-id `POST /session` callers under `single` scope. Reserves `defaultEntry` for sessions created through `doSpawn` only. - Reserve coalesced attaches synchronously via `InFlightRestore.coalesceState.count` so the spawn owner's `requireZeroAttaches` disconnect-reaper sees a non-zero `attachCount` on the freshly registered entry and skips the kill. Without this, B's `attachCount++` happened after `await inFlight.promise`, leaving a window where A's HTTP-disconnect cleanup could reap the session out from under B. - Include `pendingRestoreIds` in the `killSession` channel-teardown decision. The last live session leaving while a restore is in-flight on the same channel would otherwise SIGTERM the channel mid-restore. - Bump `RestoreInProgressError`'s `Retry-After` from 1s to 5s (matches `SessionLimitExceededError`); under the default `initTimeoutMs` of 10s, 1s pushed clients into tight loops. Tests: new bridge cases covering state propagation through coalesce, the spawn-owner-disconnect race, the pendingRestoreIds-aware channel teardown, and the no-promote- on-restore invariant. Existing "attaches twice" test rewritten to assert the cached restore state propagates. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
Close the test-coverage gaps wenshao called out in PR #4222 review: - acpAgent.test.ts gains a `QwenAgent loadSession / unstable_resumeSession` block that locks down the new contract end-to-end at the agent layer: * `loadSession` missing persisted session → throws `RequestError.resourceNotFound("session:<id>")` (code -32002 + `data.uri`). * `loadSession` existing session → returns LoadSessionResponse AND triggers `session.replayHistory(messages)` so SSE subscribers see the persisted turns. * `unstable_resumeSession` missing session → same resourceNotFound contract. * `unstable_resumeSession` existing session → returns the response WITHOUT replaying history (resume restores model context internally; UI replay is intentionally suppressed). Required extending the mocked `RequestError` with `resourceNotFound`, and mocking `SessionService` per case. - server.test.ts adds the missing restore-route wire mappings: `WorkspaceMismatchError → 400 workspace_mismatch` and `SessionLimitExceededError → 503 + Retry-After: 5`. Combined with the existing 409 case for `RestoreInProgressError`, the route layer now has full structured-error coverage. - Updated the 409 test's `Retry-After` expectation from `1` to `5` to match the bumped retry hint. Disconnect-cleanup tests for the restore route were intentionally not added — the cleanup branch is line-for-line identical to `POST /session`'s handler (which itself ships without route-level disconnect tests due to flaky supertest + Node http close-event timing). 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
Sync the docs to the routes that landed via PR #4222: - `docs/developers/qwen-serve-protocol.md`: * Add `session_load` and `unstable_session_resume` to the advertised features list, with a note on the `unstable_` prefix mirroring ACP's underlying method name. * Document `POST /session/:id/load` and `POST /session/:id/resume` — request body, response shape (including the cached `state` field that late attachers observe), and the full error envelope: 404 unknown id, 400 workspace_mismatch, 503 session_limit_exceeded (counts in-flight restores), 409 restore_in_progress (cross-action race). * Note the SSE replay ring bound (4000 frames default) and the "subscribe immediately after load" guidance for long histories. - `docs/users/qwen-serve.md`: * Add a "Loading and resuming a persisted session" section with the SDK example (`DaemonSessionClient.load` / `DaemonSessionClient.resume`) and the load-vs-resume decision table. * Update the durability model — sessions are still ephemeral across daemon restarts in Stage 1, but persisted sessions on disk can now be reloaded. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
There was a problem hiding this comment.
Pull request overview
PR6 of the daemon session lifecycle work. It introduces explicit HTTP routes (POST /session/:id/load, POST /session/:id/resume) and matching TypeScript SDK methods so clients can reconnect to persisted ACP sessions. The bridge gains a restoreSession path that coalesces concurrent restores, buffers session/load replay frames in a temporary EventBus until the restored session entry is registered, accounts in-flight restores against maxSessions, and protects shared agent channels from being killed while restores are pending. Capabilities session_load and unstable_session_resume are advertised, and docs are updated.
Changes:
- Add ACP bridge
loadSession/resumeSession,RestoreInProgressError, pending-restore event buffering, and channel-kill guards keyed off a newpendingRestoreIdsset. - Add HTTP routes plus error mapping (404/400/409/503 +
Retry-After) and SDKDaemonClient.loadSession/resumeSession,DaemonSessionClient.load/resume, with shared restore types. - Implement
QwenAgent.unstable_resumeSession, add aresourceNotFoundguard inloadSession, refresh the qwen-serve user/protocol docs, and extend tests across bridge/server/SDK/agent.
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| packages/cli/src/serve/httpAcpBridge.ts | Core restore implementation, coalescing, replay buffering, channel-kill guard. |
| packages/cli/src/serve/server.ts | New /session/:id/{load,resume} routes + RestoreInProgressError → 409 mapping. |
| packages/cli/src/serve/capabilities.ts | Adds session_load and unstable_session_resume tags. |
| packages/cli/src/acp-integration/acpAgent.ts | Adds resource-not-found guard and unstable_resumeSession impl. |
| packages/sdk-typescript/src/daemon/DaemonClient.ts | Adds loadSession/resumeSession and RestoreSessionRequest. |
| packages/sdk-typescript/src/daemon/DaemonSessionClient.ts | Adds load/resume static factories, exposes state. |
| packages/sdk-typescript/src/daemon/types.ts | New DaemonSessionState and DaemonRestoredSession types. |
| packages/sdk-typescript/src/daemon/index.ts, packages/sdk-typescript/src/index.ts | Re-export new types and RestoreSessionRequest. |
| packages/cli/src/serve/{server,httpAcpBridge}.test.ts | Coverage for routes, coalescing, error paths, channel safety. |
| packages/cli/src/acp-integration/acpAgent.test.ts | Coverage for load/resume agent semantics. |
| packages/sdk-typescript/test/unit/{DaemonClient,DaemonSessionClient}.test.ts | SDK-level wire/seeding coverage. |
| docs/users/qwen-serve.md, docs/developers/qwen-serve-protocol.md | User and protocol documentation for restore routes. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
The two new state-propagation tests in `httpAcpBridge.test.ts` used
`{ id, name, value }` as a `SessionConfigOption`, but ACP's actual
`SessionConfigSelect` shape requires `currentValue` + `options`. vitest
runs through esbuild and skips strict typechecking, so the local
`vitest run` passed; CI's `tsc --build` (run during `npm run prepare`)
caught it.
Switch the fixture to `_meta: { tag: '...' }` instead — `_meta` is
typed as `Record<string, unknown> | null` on the ACP response shapes,
so any payload survives. The assertions only need the bridge to
forward the state object intact, which `_meta` proves equally well
without committing the test to the full SessionConfigOption union.
🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
wenshao
left a comment
There was a problem hiding this comment.
[Suggestion] Restore route tests for parseOptionalWorkspaceCwd only cover the !path.isAbsolute branch. The non-string type check (typeof body['cwd'] !== 'string') and max-length check (> MAX_WORKSPACE_PATH_LENGTH) branches of the shared validation function are untested for restore routes — only POST /session has coverage for those. Consider adding { cwd: 123 } and >4096-char test cases to the POST /session/:id/load and /resume describe block.
— DeepSeek/deepseek-v4-pro via Qwen Code /review
wenshao
left a comment
There was a problem hiding this comment.
Additional test coverage gaps:
-
isAcpSessionResourceNotFoundmessage fallback path (httpAcpBridge.ts:1835) is never exercised — all tests useRequestError.resourceNotFound()which setsdata.uri. The fallback for ACP servers that return the URI inmessageinstead ofdata.urihas zero coverage. -
loadSessionwithgetResumedSessionData()returningundefinedis untested (acpAgent.test.ts). The existing loadSession test always provides conversation data, butcreateAndStoreSession(config, undefined)is a distinct code path.
— mimo-v2.5-pro via Qwen Code /review
|
Supplement to my approval — local verification transcript from tmux. 1. PR-listed validation commandssdk-typescript typecheck: cli serve tests:
SDK daemon tests: Matches the PR body's "58 SDK daemon tests" exactly. git diff --check: 2. acpAgent tests (also PR-touched)3. Broader stress runsFull (The 1 failed file is the same Full sdk-typescript suite: 4. cli full
|
| Check | Result |
|---|---|
mergeable |
MERGEABLE |
mergeStateStatus |
CLEAN |
reviewDecision |
APPROVED |
| Lint / CodeQL / Test mac · ubuntu · windows | all SUCCESS |
| Local test totals | 852+ passed across cli serve, acp-integration, and full SDK suite |
Environment
- Linux x86_64, Node v20.19.2 (project engines require ≥22; only the
doctorChecksNode-version assertion correctly reflects this in remote CI's matrix) - PR head
a3f38da3aagainstorigin/mainmerge-baseef29700bc - Worktree at
/tmp/pr4222withnode_modulesandcore/distsymlinked from a parent install
wenshao
left a comment
There was a problem hiding this comment.
Flipping back to changes-requested — sorry for the noise. After my approve at 03:33Z I ran another /review pass at 03:44Z that surfaced two more Critical issues that aren't covered by the existing fix commits. Both are still present in the current head (a3f38da3a):
@doudouOUC could you take a look at both before we merge?
Critical 1 — Asymmetric coalesce guard
packages/cli/src/serve/httpAcpBridge.ts#L1863
if (action === 'load' && inFlight.action === 'resume') {
throw new RestoreInProgressError(req.sessionId, inFlight.action, action);
}Only load-on-resume is blocked; resume-on-load silently coalesces. This contradicts the protocol doc you added:
qwen-serve-protocol.md:82— "Fired when asession/loadis issued for an id that already has asession/resumein flight (or vice versa)."
qwen-serve-protocol.md:~140— "Same-action races (two concurrentsession/loadfor the same id) coalesce..."
Effect at runtime: a resume caller coalescing onto an in-flight load shares the load's EventBus and receives history replay frames — which directly violates the resume contract ("restore WITHOUT replaying history through SSE"). DaemonSessionClient.resume() further seeds lastEventId: 0, so the replayed frames are guaranteed to be delivered to the resume client.
Suggested fix (1 line):
if (action !== inFlight.action) {
throw new RestoreInProgressError(req.sessionId, inFlight.action, action);
}Also needs a test for resume-on-load returning 409 alongside the existing load-on-resume test.
Critical 2 — transportClosed dangling rejection
packages/cli/src/serve/httpAcpBridge.ts#L1928
const transportClosed = ci.channel.exited.then(() => {
throw new Error(`agent channel closed during session/${action}`);
});When Promise.race([withTimeout(...), transportClosed]) resolves via withTimeout, transportClosed stays pending. Whenever the channel later exits (next session boundary, daemon shutdown, agent crash), the .then produces a rejected promise with no attached handler.
- Node 22 default:
unhandledRejectionwarning. - Under
--unhandled-rejections=throw(common in container deployments): daemon process crash.
Suggested fix (1 line):
const transportClosed = ci.channel.exited.then(() => {
throw new Error(`agent channel closed during session/${action}`);
});
transportClosed.catch(() => {}); // suppress dangling rejection after the race settlesWorth a regression test where the channel exits after the restore promise has already resolved, asserting no unhandled rejection is emitted.
Context
The earlier 4 Critical comments from my 02:16Z review (state propagation on coalesce, no-defaultEntry-promotion, synchronous attachCount reservation, pendingRestoreIds in killSession teardown) were all addressed — confirmed by re-reading the code at c78331296 + 7a704fe65. Nice work on those.
These last two are small (1 line each + tests), so this should be a quick turnaround. CI was green and my local verification was clean apart from these; once these land I'm happy to re-approve.
… defensive cleanup Address the two new Critical findings + the test/cosmetic gaps from wenshao's second review pass on PR #4222 (`a3f38da3a`): - **[Critical] Symmetric coalesce guard.** The previous guard only rejected `load`-on-`resume`; `resume` arriving while a `load` was in flight silently coalesced and inherited the load's history- replay frames over SSE — directly violating resume's "no UI replay" contract (made worse by `DaemonSessionClient.resume()` seeding `lastEventId: 0`). Tighten the guard to `action !== inFlight.action` so any cross-action race throws `RestoreInProgressError`. Same-action coalescing is unaffected. - **[Critical] `transportClosed` dangling rejection.** When `withTimeout` wins the `Promise.race` against `channel.exited`, the `.then(throw)` chain on `channel.exited` stays pending. A later channel exit (next session boundary, daemon shutdown, agent crash) fires the `throw` with no observer attached — Node 22 logs `unhandledRejection`, and `--unhandled-rejections=throw` deployments crash the daemon. Add `transportClosed.catch(() => {})` to suppress the dangling rejection after the race settles. - **`isAcpSessionResourceNotFound` exact-match fallback.** The message-fallback path used `message.includes(expectedUri)`, which would falsely match a sessionId of `"a"` against a message containing `"session:abc"`. Tighten to exact equality on the canonical `Resource not found: <uri>` form. The primary `data.uri` path remains the dominant code path. - **`loadSession` mcpServers default symmetry.** `loadSession` now uses `params.mcpServers ?? []` to mirror `unstable_resumeSession`. Defends against a future ACP schema loosening that makes `LoadSessionRequest.mcpServers` optional — without the null-coalesce, `newSessionConfig` would `TypeError` on iteration. Tests added: - `httpAcpBridge.test.ts`: `resume`-on-`load` rejection (mirror of the existing `load`-on-`resume` test); regression for the dangling `unhandledRejection` (resolves `channel.exited` after the restore promise has already settled and asserts no `unhandledRejection` event); shutdown-awaits-restore via `Promise.race`-based ordering. - `server.test.ts`: 400 for non-string and over-length `cwd` on the restore routes (mirroring the equivalent `POST /session` cases for `parseOptionalWorkspaceCwd`). - `acpAgent.test.ts`: load with `getResumedSessionData()` returning `undefined` — distinct code path that does NOT call `replayHistory`. 🤖 Generated with [Qwen Code](https://github.com/QwenLM/qwen-code)
wenshao
left a comment
There was a problem hiding this comment.
Both Critical issues from my 03:44Z review are addressed in db55b1c8b. Verified locally on top of main.
Critical 1 — symmetric coalesce guard: confirmed at httpAcpBridge.ts:1879 — guard now reads if (action !== inFlight.action). Comment explicitly covers the bidirectional rationale (resume-on-load can't quietly coalesce because DaemonSessionClient.resume() seeds lastEventId: 0 and would receive replayed history frames; load-on-resume is rejected for the symmetric reason). Regression test rejects resume while a load for the same session is in flight (mirror of load-on-resume) mirrors the existing load-on-resume case.
Critical 2 — transportClosed dangling rejection: confirmed at httpAcpBridge.ts:1956 — transportClosed.catch(() => {}) suppresses the race-loser rejection. Regression test does not surface an unhandledRejection when the channel exits after a successful restore registers a real process.on('unhandledRejection') listener and asserts the array stays empty after handle.crash(), which is the strongest possible coverage for this class of bug.
Two bonus tightenings beyond what I asked for:
isAcpSessionResourceNotFoundmessage-fallback switched frommessage.includes(expectedUri)to exact equality onResource not found: ${expectedUri}— neutralizes the substring false-positive where a sessionId of"a"would match a message containing"session:abc".acpAgent.loadSessiongot theparams.mcpServers ?? []defensive guard to matchunstable_resumeSession— picks up one of my earlier Suggestion items at no extra cost.
Local re-verification: npx vitest run src/serve/httpAcpBridge.test.ts src/acp-integration/acpAgent.test.ts: 142 passed (97 + 45; 3 + 1 new tests over the prior commit). Remote CI all green: Lint / CodeQL / Classify PR / Test (mac · ubuntu · windows on Node 22.x).
LGTM, re-approving.
Summary
Validation
npm run buildcompleted successfully; it still reports the existing vscode companion lint warning atpackages/vscode-ide-companion/src/utils/editorGroupUtils.ts:32.Scope / Risk
Testing Matrix
Testing matrix notes:
Linked Issues / Bugs
Refs #4175