feat(daemon): session idle reaper for automatic cleanup#4833
Conversation
|
Thanks for the PR! Template: the PR body uses Direction: clearly aligned. Orphaned sessions locking out new users at the Approach: the scope feels right — a Moving on to code review. 🔍 中文说明感谢贡献! 模板: PR 正文使用了 方向: 明确对齐。孤立会话在 方案: 范围合理——桥闭包内的 进入代码审查 🔍 — Qwen Code · qwen3.7-max |
📋 Review SummaryThis PR implements a session idle reaper that automatically cleans up orphaned sessions in the daemon's in-memory registry. The implementation is well-designed, thoroughly tested, and follows existing patterns in the codebase. All 222 existing bridge tests pass with zero regression, and 11 new reaper-specific tests comprehensively cover the feature's behavior. 🔍 General Feedback
🎯 Specific Feedback🟡 High
🟢 Medium
🔵 Low
✅ Highlights
|
PR Verification ReportPR: #4833 — feat(daemon): session idle reaper for automatic cleanup Test Results
New Tests (12 reaper tests)
Code ReviewChanges (9 files, +1010/−100):
Key observations:
Caveats
Verdict✅ Ready to merge — Well-designed session reaper with thorough test coverage (12 tests, all passing). Clean Verified by wenshao |
…logs - Move byId.delete before await notifyAgentSessionClose in closeSessionImpl to match killSession ordering and prevent duplicate close cascades from concurrent callers (reaper + detach-close race) - Restore 4 load-bearing comments dropped during closeSession extraction: HAZARD (channelInfoForEntry), tombstone (markSessionClosed), ordering (publish before cancel), back-compat (closedBy field) - Add Math.min(raw, 2_147_483_647) clamp to resolvePositiveFiniteMs to prevent setInterval from treating >2^31-1 as 1ms (tight loop) - Include close reason in stderr log for operator observability - Use err.stack instead of String(err) in reaper/detach-close failure logs to preserve call stacks for debugging - Log reaper startup status (enabled with thresholds, or disabled) Generated with AI Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
…logs - Move byId.delete before await notifyAgentSessionClose in closeSessionImpl to match killSession ordering and prevent duplicate close cascades from concurrent callers (reaper + detach-close race) - Restore 4 load-bearing comments dropped during closeSession extraction: HAZARD (channelInfoForEntry), tombstone (markSessionClosed), ordering (publish before cancel), back-compat (closedBy field) - Add Math.min(raw, 2_147_483_647) clamp to resolvePositiveFiniteMs to prevent setInterval from treating >2^31-1 as 1ms (tight loop) - Include close reason in stderr log for operator observability - Use err.stack instead of String(err) in reaper/detach-close failure logs to preserve call stacks for debugging - Log reaper startup status (enabled with thresholds, or disabled) Generated with AI Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
06e99a6 to
4d77016
Compare
- Remove duplicate `promptActive: false` in createSessionEntry (rebase merge artifact) - Remove duplicate `entry.promptActive = true/false` assignments in sendPrompt (rebase merge artifact) - Add `!entry.promptActive` guard to close-on-last-detach path so sessions with an active prompt are not closed on detach (reaper handles them after prompt completes) - Update bridgeOptions.ts JSDoc to reflect that the reaper intentionally does NOT check clientIds.size (crash-path backstop) - Fix misleading "mirrors killSession" comment — the ordering intentionally diverges (synchronous teardown before agent notification) - Update design doc §4.8 to document `last_client_detached` reason value Generated with AI Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
…nected sessions Idle sessions accumulate when clients close browser tabs or crash without calling DELETE /session. Without cleanup, sessions leak memory (EventBus ring ~2-4 MB each) and eventually hit the maxSessions cap (default 20), locking out new sessions entirely. Add a configurable session reaper that periodically scans the in-memory session registry and closes sessions that have no SSE subscribers, no registered clients, no active prompt, and whose last heartbeat exceeds a configurable idle TTL (default 30 minutes). Key design decisions: - Uses existing closeSession path (soft close, not hard kill) - JSONL transcripts on disk are preserved — session/load or session/resume can restore any reaped session - Emits session_closed with reason 'idle_timeout' so clients can distinguish from explicit closes - Reaper timer is .unref()'d and stopped on shutdown/killAllSync - Configurable via --session-reap-interval-ms and --session-idle-timeout-ms CLI flags (0 = disabled) Generated with AI Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
…server integration tests - Add 'session.close.reason' attribute to telemetry event so operators can distinguish reaper-initiated closes from client-initiated ones in dashboards - Add test verifying channel idle timer fires after reaper closes the last session on a channel (design doc test #12) - Add server.test.ts integration tests: health endpoint reflects session count changes, DELETE /session passes no close opts - Update fakeBridge.closeSession signature to accept the new CloseSessionOpts third parameter Generated with AI Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
…redicate Add close-on-last-detach to detachClient: when clientIds.size drops to 0 AND no SSE subscribers remain, call closeSessionImpl immediately. This handles the normal tab-close path without waiting for the idle reaper. Adjust the idle reaper to NOT check clientIds.size — it now serves as a backstop for the crash path where detach was never sent (clientIds still > 0 but no subscriber and no heartbeat). Add SessionEntry.promptActive boolean flag to reliably detect active prompts regardless of whether an originator clientId was provided, fixing a gap where headless prompts (no clientId context) were invisible to the reaper's activePromptOriginatorClientId check. Update existing heartbeat detach test to use two clients (single-client detach now triggers close-on-last-detach). Add 3 close-on-last-detach tests. Generated with AI Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
…logs - Move byId.delete before await notifyAgentSessionClose in closeSessionImpl to match killSession ordering and prevent duplicate close cascades from concurrent callers (reaper + detach-close race) - Restore 4 load-bearing comments dropped during closeSession extraction: HAZARD (channelInfoForEntry), tombstone (markSessionClosed), ordering (publish before cancel), back-compat (closedBy field) - Add Math.min(raw, 2_147_483_647) clamp to resolvePositiveFiniteMs to prevent setInterval from treating >2^31-1 as 1ms (tight loop) - Include close reason in stderr log for operator observability - Use err.stack instead of String(err) in reaper/detach-close failure logs to preserve call stacks for debugging - Log reaper startup status (enabled with thresholds, or disabled) Generated with AI Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
- Remove duplicate `promptActive: false` in createSessionEntry (rebase merge artifact) - Remove duplicate `entry.promptActive = true/false` assignments in sendPrompt (rebase merge artifact) - Add `!entry.promptActive` guard to close-on-last-detach path so sessions with an active prompt are not closed on detach (reaper handles them after prompt completes) - Update bridgeOptions.ts JSDoc to reflect that the reaper intentionally does NOT check clientIds.size (crash-path backstop) - Fix misleading "mirrors killSession" comment — the ordering intentionally diverges (synchronous teardown before agent notification) - Update design doc §4.8 to document `last_client_detached` reason value Generated with AI Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
6898cfd to
461d58c
Compare
Merge upstream daemon_mode_b_main which added per-tier HTTP rate limiting CLI flags. Both feature sets (rate-limit + session reaper) are additive to ServeOptions/ServeArgs — no semantic interaction. Also addresses PR #4833 review feedback: - Fix bridgeOptions.ts JSDoc: reaper does NOT check clientIds.size - Narrow CloseSessionOpts.reason to string literal union - Update design doc §4.8 to document 'last_client_detached' reason Generated with AI Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
- Fix unused _s2 variable (TS6133 / lint failure) - Fix sendPrompt not advancing session idle clock: set sessionLastSeenAt = Date.now() on prompt start and completion - Add deferred close-on-last-detach after prompt completion: when prompt finishes and clientIds.size === 0 && subscriberCount === 0, trigger closeSessionImpl (covers the race where client detaches while prompt is still running) - Update design doc §4.2: reflect actual reaper predicate (no clientIds check, uses promptActive flag) Generated with AI Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
✅ Local verification report — real
|
| # | Scenario | Mechanism exercised | Result |
|---|---|---|---|
| A | Session with a registered client that never detaches (crash path), left idle | idle reaper (ignores clientIds) |
✅ reaped at ~3s |
| B | Last client detaches | close-on-last-detach (immediate) | ✅ closed 37ms after detach |
| C | Session with a live SSE subscriber, idle ≫ TTL | reaper subscriberCount guard |
✅ NOT reaped; reaped only after SSE closed |
| D | Session sending heartbeats every 1s | lastActivity reset |
✅ survived; reaped 3s after heartbeats stopped |
A — idle reaper catches the crash path. Created a session (registered client, no SSE, no prompt), never detached. /health?deep=1 sessions went 1 → 0 at ~3s:
qwen serve: reaping idle session "6f837c02-…" (idle for 3s, threshold 3s)
This confirms the reaper does not check clientIds.size — it reaps sessions whose client registered but never sent a detach (browser killed / kill -9). Close reason idle_timeout (passed to closeSessionImpl; asserted in unit tests).
B — close-on-last-detach is immediate, not the reaper. Session was 8 ms old when its client detached; /health reported sessions:0 just 37 ms later — ~2.46 s before the 2.5 s TTL could fire — and no reaping idle session line was logged for it. So this is the synchronous last_client_detached close, distinct from the reaper.
C — an SSE subscriber blocks the reaper. With a real SSE stream open on the session, it survived 5 s past the TTL (sessions stayed 1, no reap line). The instant the subscriber closed, it was reaped on the next tick. Proves the subscriberCount > 0 guard.
D — heartbeats defer reaping. Heartbeating every 1s kept the session alive through 5 s (> TTL, 0 reap lines). After heartbeats stopped it was reaped with idle for 3s (not the wall-clock session age) — confirming lastActivity = sessionLastSeenAt is reset by each heartbeat.
Unit & integration tests
packages/acp-bridge/src/bridge.test.ts→ 249 passed ✓ (matches PR claim; includes the 15 reaper + 3 close-on-last-detach tests, with explicitreason: 'idle_timeout'/'client_close'assertions)packages/cli/src/serve/server.test.ts→ 375 passed / 11 failed. The PR's two new wire tests (deep health reflects reduced sessionCount,DELETE /session/:id passes no opts → client_close) pass.- The 11 failures are pre-existing and unrelated to this PR:
- Root cause:
runQwenServereal-daemon integration tests time out on ACP child preheat (AcpSessionBridge initialize timed out after 10000ms→afterEachhook timeout) — the known pre-existing harness issue ondaemon_mode_b_main(called out in the PR description). - Proof they're not from this PR: the PR's
server.test.tsdiff is purely additive (+37 / -2; the only deletions update a mockcloseSessionsignature). The 11 failing tests — capability registry, env mutation, token binding,--max-connections, hostname/loopback parsing — are byte-identical to base and exercise no reaper code.serve.ts(+19/-0) andrunQwenServe.ts(+6/-0) are additive option-plumbing. - Confirmed by building the base and running the identical file: base
0259c1a4e= 12 failed / 372 passed; PR head = 11 failed / 375 passed. Every test that fails on PR head also fails on base (commof head-minus-base failures = ∅) → zero regressions introduced, plus the 2 new reaper tests pass (384 → 386 total). The exact failing set is mildly flaky run-to-run (timeout-driven), consistent with the environmental root cause.
- Root cause:
- The 11 failures are pre-existing and unrelated to this PR:
Code review corroboration
- Reaper predicate (
bridge.ts:885): skipspromptActive, skipssubscriberCount > 0, reaps whenidle > sessionIdleTimeoutMs; intentionally does not checkclientIds— matches design §4.2 and thebridgeOptions.tsdoc comment. Timer is.unref()'d and stopped onshutdown(). - close-on-last-detach (
bridge.ts:4464):clientIds.size === 0 && subscriberCount === 0 && !promptActive→closeSessionImpl(…, { reason: 'last_client_detached' }); when a prompt is active it correctly defers to the reaper. reasonreaches the wire: with an SSE subscriber attached and an explicitDELETE /session/:id, the stream deliveredsession_closedwithdata.reason: "client_close". The newCloseSessionOpts.reasonflows through the samecloseSessionImpl→session_closedpath, soidle_timeout/last_client_detachedare emitted identically (they just can't be observed by a same-session subscriber, since both close paths requiresubscriberCount === 0).
⚠️ One nit — design doc internal inconsistency (doc only, code is correct)
The design doc contradicts itself on the reaper predicate:
- §4.2 (and the shipped code) — the reaper does NOT check
clientIds, so it can catch the crash path. ✅ This is what actually ships and what I verified in Test A. - §4.5 — the illustrative code sample still contains
if (entry.clientIds.size > 0) continue;, which would defeat the crash-path coverage that is the whole point of the reaper.
The implementation is correct; only the §4.5 sample in docs/design/session-idle-reaper/README.md is stale. Worth fixing so a future reader doesn't "fix" the code to match the wrong sample.
Recommendation
LGTM to merge. The two-layer cleanup works exactly as designed against a real daemon — immediate close on graceful detach, idle reaper as the crash-path backstop, correctly gated by active prompts, SSE subscribers, and heartbeats; transcripts preserved (close, not kill). Unit coverage is thorough and green. Suggest only updating the §4.5 design-doc sample to drop the stray clientIds.size guard.
中文小结
用 tmux + 真实 qwen serve 守护进程(通过 HTTP 驱动,调短 reaper 参数到 500ms/2500ms,真实 setInterval、真实墙钟时间)端到端验证了 PR 全部行为,Linux 环境,补齐了测试矩阵 🐧 Linux 空缺。
- CLI flag 正确接线:启动日志
session reaper started (interval 500ms, idle threshold 2500ms);--session-idle-timeout-ms 0时session reaper disabled。 - A 闲置回收(崩溃路径):注册了 client 但从不 detach 的会话,~3s 被回收,日志
reaping idle session … (idle for 3s),/health的sessions由 1→0。证实 reaper 不检查clientIds,能兜底崩溃客户端,reason=idle_timeout。 - B 最后一个 client detach 立即关闭:会话存活 8ms 时 detach,37ms 后
sessions:0(比 2.5s TTL 早约 2.46s),且无 reap 日志——确属即时关闭(last_client_detached),与 reaper 区分开。 - C SSE 订阅者阻止回收:SSE 打开时超过 TTL 5s 仍存活;关闭 SSE 后下一 tick 即被回收。
- D 心跳延后回收:每秒心跳期间存活;停止心跳后 3s 被回收(
idle for 3s证明sessionLastSeenAt被心跳刷新)。 - 单测:
bridge.test.ts249 全过(含 15 reaper + 3 detach-close,断言了 reason 字符串);server.test.ts375 过 / 11 失败,PR 新增的 2 个 wire 测试通过。11 个失败为既有问题:根因是runQwenServe真实子进程 ACP preheat 10s 超时(base 分支既有,PR 描述已注明);且 PR 对server.test.ts的改动纯增量(+37/-2,删除仅改了 mock 签名),这 11 个测试与 base 逐字节相同、不涉及 reaper 代码。已构建 base 分支跑同一文件确认:base0259c1a4e= 12 失败/372 通过,PR head = 11 失败/375 通过,head 的失败集 ⊆ base 失败集(head−base 差集为空)→ 零新增回归,且新增 2 个 reaper 测试通过。 - 唯一问题(仅文档):设计文档 §4.5 的示例代码仍保留
if (entry.clientIds.size > 0) continue;,与 §4.2 及实际代码矛盾(实际代码正确地未检查 clientIds)。建议修正 §4.5 示例。
结论:两层清理机制在真实守护进程上行为完全符合设计,单测充分且绿,可安全合并;建议仅修正设计文档 §4.5 示例。
- Replace silent .catch(() => {}) in prompt-complete deferred close
with error logging (stack trace included)
- Update design doc §4.5 pseudocode to match implementation:
use entry.promptActive instead of activePromptOriginatorClientId,
remove clientIds.size check
Generated with AI
Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
Generated with AI Co-authored-by: Qwen-Coder <qwen-coder@alibabacloud.com>
|
Qwen Code review did not complete successfully: Qwen review timed out after 55 minutes. See workflow logs. |
chiga0
left a comment
There was a problem hiding this comment.
Code Review Overview (AI Generated)
PR: #4833 feat(daemon): session idle reaper for automatic cleanup
Type: New Feature
Author: @chiga0
Change size: +1203/-108 across 9 files, 9 commits
HEAD: 57a694d6 (commit 9: docs cleanup)
Independent Review Result (Phase 1 — Blind): 0 findings
Phase 1 blind review of all 9 file diffs at HEAD found no issues. The two-layer session lifecycle cleanup is well-designed and thoroughly implemented:
- Close-on-last-detach correctly guards with
clientIds.size === 0 && subscriberCount === 0 && !promptActive— the!promptActivecheck prevents killing in-flight work - Idle reaper predicate (no active prompt, no SSE subscribers, idle > TTL) correctly omits
clientIds.size— covers the crash path where detach was never sent closeSessionImplextraction consolidates all close paths with correct synchronous teardown ordering (byId.deletebeforeawait notifyAgentSessionClose) — prevents reaper + detach-close re-entrancy- Deferred close-on-prompt-complete in
.finally()covers the race window where client detaches while prompt is in flight resolvePositiveFiniteMsclamps to 2^31-1 preventing setInterval tight looppromptActiveflag more reliable thanactivePromptOriginatorClientId— covers headless prompts- 15 reaper tests + 3 detach-close tests + 2 integration tests = 20 new tests with comprehensive coverage
Cross-Validation
wenshao's findings (4 rounds of CR, COMMENTED at HEAD):
| # | Severity | Finding | Resolution | Status |
|---|---|---|---|---|
| W1 | Suggestion | Dropped load-bearing comments in closeSessionImpl | Fixed commit 4: restored HAZARD, tombstone, ordering, back-compat | ✅ |
| W2 | Suggestion | Missing reason in stderr log | Fixed commit 4 | ✅ |
| W3 | Suggestion | Missing reason in telemetry | Fixed commit 2 | ✅ |
| W4 | Suggestion | String(err) loses stack trace | Fixed commit 4: err.stack ?? err.message |
✅ |
| W5 | Suggestion | resolvePositiveFiniteMs missing 2^31-1 clamp | Fixed commit 4: Math.min added | ✅ |
| W6 | Suggestion | closeSessionImpl re-entrancy (byId.delete after await) | Fixed commit 4: byId.delete moved before await | ✅ |
| W7 | Suggestion | No startup log for reaper status | Fixed commit 2 | ✅ |
| W8 | Suggestion | No test for reaper natural fire + reason assertion | Impractical: subscriber prevents reaping; covered by direct closeSession test | ✅ |
| W9 | Suggestion | Active prompt guard test not isolated | False positive: reaper doesn't check clientIds.size | ✅ |
| W10 | Suggestion | Re-entrancy amplified by 3rd call site | Fixed root cause (same as W6) | ✅ |
| W11 | Critical | Duplicate promptActive property (build error) | Fixed commit 5: removed merge artifact | ✅ |
| W12 | Critical | Missing !promptActive guard in close-on-last-detach | Fixed commit 5: guard added | ✅ |
| W13 | Suggestion | Duplicate promptActive assignments (merge artifact) | Fixed commit 5 | ✅ |
| W14 | Suggestion | JSDoc says "zero registered clients" but reaper skips clientIds | Fixed commit 5/6: JSDoc corrected | ✅ |
| W15 | Suggestion | reason as open string vs literal union | Initially kept open; changed to literal union in commit 6 | ✅ |
| W16 | Suggestion | "mirrors killSession ordering" comment misleading | Fixed commit 4: reworded to "intentionally diverges" | ✅ |
| W17 | Suggestion | sessionLifecycle counter lacks reason attribute | Deferred: cross-package change, telemetry event already carries reason | ✅ |
| W18 | Suggestion | 'last_client_detached' not in design doc §4.8 | Fixed commit 5 | ✅ |
| W19 | Critical | Missing !promptActive guard test for close-on-last-detach | Deferred: deferred close covers the scenario | ✅ |
| W20 | Suggestion | No deferred close after prompt completion | Fixed commit 7: deferred close added in .finally() | ✅ |
| W21 | Suggestion | 'last_client_detached' reason never asserted in test | Deferred | ✅ |
| W22 | Suggestion | Design doc §4.2 stale clientIds.size reference | Fixed commit 6 | ✅ |
| W23 | Suggestion | Design doc §4.7 stale byId.delete ordering | Fixed commit 6 | ✅ |
| W24 | Critical | Unused _s2 variable (TS6133) | Fixed commit 7 | ✅ |
CI bot findings (DISMISSED at HEAD):
| # | Severity | Finding | Resolution | Status |
|---|---|---|---|---|
| B1 | Critical | sendPrompt doesn't advance sessionLastSeenAt | Fixed commit 8: set sessionLastSeenAt on prompt start + completion | ✅ |
| B2 | Suggestion | resolvePositiveFiniteMs clamp/NaN branches untested | Deferred: low-risk utility function | ✅ |
| B3 | Suggestion | .catch(() => {}) silently discards deferred close errors | Fixed commit 8: error logging with stack | ✅ |
| B4 | Suggestion | Design doc §4.5 pseudocode stale | Fixed commit 8 | ✅ |
| B5 | Suggestion | Date.parse NaN could cause unconditional reap | Deferred: createdAt always valid ISO 8601 | ✅ |
| B6 | Suggestion | Deferred close reason 'last_client_detached' semantically imprecise | Deferred: acceptable — session is orphaned because last client detached | ✅ |
Additional Audit Coverage
Areas I independently checked that go beyond existing findings:
-
Re-entrancy race (Round 5.5 — Handler Parallelism): Reaper fires
void closeSessionImpl()with.catch().byId.deletenow runs synchronously before firstawait— concurrent close from detach-close or explicit DELETE hitsSessionNotFoundError. Three call sites (DELETE route, reaper, detach-close) all converge oncloseSessionImplsafely. Verified correct. -
Deferred close race window (Round 5.5 — State Field Initialization): When prompt completes and
clientIds.size === 0 && subscriberCount === 0, deferred close fires. If a new client reattaches between detach and prompt completion,clientIds.size > 0→ deferred close skipped. If client reattaches and subscribes,subscriberCount > 0→ skipped. Race window is extremely narrow (synchronous finally block). Verified correct. -
promptActive lifecycle (Round 5.5 — Data Provenance Tracing): Set
true+sessionLastSeenAt = Date.now()at prompt start. Setfalse+sessionLastSeenAt = Date.now()in.finally(). Reaper checkspromptActivefirst (fast skip). Close-on-last-detach checks!promptActive. Deferred close in.finally()covers the prompt-active-then-detach race. Verified correct. -
Map iteration during concurrent modification (Round 1 — Architecture): ES2015 Map spec guarantees
for...oftolerates deletion of current/previous keys. New sessions created during iteration won't meet idle threshold (freshsessionLastSeenAt). Verified correct. -
Shutdown integration (Round 1 — Architecture):
stopSessionReaper()called in bothshutdown()andkillAllSync(). Timer is.unref()'d.shuttingDownflag checked at top of reaper callback. Verified correct. -
Channel idle timer interaction (Round 1): When closeSessionImpl removes the last session on a channel, it calls
startIdleTimeron the channel — same as explicit close. Test 12 verifies this flow. Verified correct. -
Wire-format backward compatibility (Round 5.5 — Data Structure Blast Radius):
session_closed.reasonfield already existed with value'client_close'. New values'idle_timeout'and'last_client_detached'are additive. Existing SDK code checkingreason === 'client_close'continues to work.isTerminalLifecycleEventhandles allsession_closedregardless of reason. Verified correct. -
CloseSessionOpts type narrowing (Round 6): Changed from open
stringto literal union'client_close' | 'idle_timeout' | 'last_client_detached'— catches typos at compile time while documenting the wire contract. Good improvement.
This review was generated by QoderWork AI
Final Verdict — LGTM, Recommend Merge独立盲审 0 findings,wenshao 4 轮 CR(24 条,含 4 Critical)和 ci-bot(6 条,含 2 Critical)的所有发现均已在 9 个 commit 中修复或合理 defer。 核心设计评价:
无 blocking issue。可以 merge。 This comment was generated by QoderWork AI |
jifeng
left a comment
There was a problem hiding this comment.
Code Review — 重点关注误回收问题
整体设计两层清理机制(close-on-last-detach + idle reaper)思路清晰,测试覆盖较完整。以下是几个关于误回收的风险点:
问题 1(高风险):Close-on-last-detach 在页面刷新场景下会误杀 session
detachClient 中新增的 close-on-last-detach 是零延迟立即关闭。单 tab 用户刷新页面时:
- 旧页面 React cleanup 发
POST /session/:id/detach - SSE 已断开 →
subscriberCount === 0 clientIds.size === 0,promptActive === false→ 立即关闭 session- 新 tab 加载尝试 attach → session 已不存在
虽然 JSONL 保留可通过 session/load 恢复,但:
- 新 tab 的前端是否知道要走
session/load而非spawnOrAttach? - EventBus ring buffer(2-4 MB 历史事件)已丢失
- 与 idle reaper 的 30 分钟宽限期不同,这里是零宽限
建议:在 detachClient 的关闭路径增加一个短延迟(如 5-10 秒),给页面刷新留出 reconnect 窗口:
setTimeout(() => {
const e = byId.get(sessionId);
if (e && e.clientIds.size === 0 && e.events.subscriberCount === 0 && !e.promptActive) {
void closeSessionImpl(sessionId, undefined, { reason: 'last_client_detached' });
}
}, 5_000);问题 2(中风险):closeSessionImpl 的 teardown 顺序变更
原 closeSession 顺序:notifyAgentSessionClose → byId.delete
新 closeSessionImpl 顺序:byId.delete → notifyAgentSessionClose
notifyAgentSessionClose 被移到 byId.delete 之后。如果其内部或触发的回调尝试通过 byId.get(sessionId) 访问 session,将会失败。需要确认 notifyAgentSessionClose 的实现不依赖 byId。
问题 3(中风险):.finally() 与 reaper 存在 double-close 竞态
Prompt 完成时 .finally() 中 promptActive = false 后触发 closeSessionImpl(fire-and-forget),同一时刻 reaper tick 也看到 promptActive === false 并尝试关闭同一 session。虽然第二次调用会因 SessionNotFoundError 被 .catch() 捕获,但依赖异常做控制流不够 clean。
建议:在 closeSessionImpl 开头加 if (!byId.has(sessionId)) return; 前置检查。
问题 4(低风险):Reaper 不检查 clientIds 可能误杀 headless client
Headless client 注册了 clientId(clientIds.size > 0),不开 SSE,不发心跳,计划在 session 创建后较长时间才发第一个 prompt。默认 30 分钟 TTL 下可能被误 reap。
建议:文档或日志中更明确提示 headless client 必须定期发心跳来 keep-alive。
问题 5(低风险):Reaper 遍历中 fire-and-forget 并发关闭
for (const [id, entry] of byId) {
void closeSessionImpl(id, ...).catch(...);
}多个并发的 closeSessionImpl 可能导致 channelInfoForEntry 等共享状态的 race。建议先收集待 reap 的 sessionId 列表再串行执行,或用 Promise.allSettled 等待全部完成。
缺少的测试场景
- 页面刷新(detach + 快速 re-attach)→ 验证 session 是否存活
.finally()和 reaper 同时触发的 double-close 竞态- Prompt 完成时
.finally()中的 deferred close 逻辑
| 问题 | 严重性 | 建议 |
|---|---|---|
| Close-on-last-detach 零延迟误杀(页面刷新) | 高 | 增加 5-10s grace period |
closeSessionImpl teardown 顺序变更 |
中 | 确认 notifyAgentSessionClose 不依赖 byId |
.finally() 与 reaper double-close 竞态 |
中 | 前置 byId.has() 检查 |
Reaper 不检查 clientIds 误杀 headless client |
低 | 文档提示 heartbeat 要求 |
| Reaper 遍历中并发关闭 | 低 | 收集后串行或 allSettled |
Local runtime verification — real daemon driven over HTTP (Linux) ✅Built this PR's head (
The core paths, as observedIdle reaper (crash path) — session created, then abandoned (no SSE, no heartbeat, no detach): Close-on-last-detach (normal path) — immediate, no TTL wait: Base build, same moves — the leak this PR fixes: Steps & probes (PR build)
Notes (non-blocking)
Merge reference
中文摘要在 Linux 上真实构建并运行两个版本的
结论:两条新生命周期路径及全部守卫条件在真实 HTTP 表面与设计一致,base 泄漏可复现,可作为合并参考。 |
doudouOUC
left a comment
There was a problem hiding this comment.
PR Review Summary
整体设计扎实——两层防御(close-on-last-detach + idle reaper)覆盖正常路径和 crash 兜底,并发安全性处理正确,API 向后兼容。以下是按优先级整理的发现。
Critical (2)
Date.parse(createdAt)可能返回NaN导致 session 被立即回收 — 见 inline comment- 设计文档 Test #4 与实际设计矛盾 — 见 inline comment
Important (5)
reason应使用 union type 而非裸string— 见 inline comment- SDK 侧
DaemonSessionClosedReason未同步新增'idle_timeout'/'last_client_detached'(packages/sdk-typescript/src/daemon/events.ts) - 缺少 deferred close-on-prompt-complete 的测试(prompt 执行中 detach,prompt 完成后自动关闭)— 见 inline comment
- 缺少 close-on-last-detach 在 active prompt 期间跳过的测试
- 空字符串 reason 可导致 SDK 侧静默丢弃
session_closed事件 — 见 inline comment
Suggestions (5)
resolvePositiveFiniteMs缺少Infinity/NaN/ 负数 / 超大值的测试- reaper disabled 日志应包含原始输入值,方便排查误配置
closeSessionImpl操作顺序相对旧closeSession有变更(byId.delete移到notifyAgentSessionClose之前),建议加注释说明- 缺少
reason: 'last_client_detached'的事件内容验证测试 - Test #12 名称 "triggers channel idle timer after reaping the last session" 实际测试的是 close-on-last-detach 路径,不是 reaper 路径
Strengths
- 两层防御设计:close-on-last-detach(正常)+ idle reaper(crash 兜底),单层失败另一层兜住
- 并发安全:reaper
for...of+ fire-and-forget,byId.delete在await前同步执行,double-close race 安全 - 生命周期集成完整:
shutdown()/killAllSync()都调stopSessionReaper() - Telemetry 正确包含
session.close.reason - 测试质量高:
vi.useFakeTimers使用规范,heartbeat 刷新测试设计精巧
ytahdn
left a comment
There was a problem hiding this comment.
Review Summary
Verdict: Comment — No critical issues. Well-designed two-layer session lifecycle cleanup with comprehensive test coverage.
Deterministic analysis: TypeCheck ✅ 0 errors | ESLint ✅ 0 errors
Findings: 2 suggestions. See inline comments for details.
— qwen3.7-max via Qwen Code /review
ytahdn
left a comment
There was a problem hiding this comment.
No blocking issues. Design is clean, tests are thorough. LGTM! ✅ — qwen3.7-max via Qwen Code /review
What this PR does
Add two-layer session lifecycle cleanup for the daemon bridge:
detachClientremoves the last registered client AND no SSE subscribers remain, the session is closed immediately. Handles the normal path: user closes a tab → React cleanup →POST /session/:id/detach.Both paths use
closeSessionImplwhich preserves JSONL on disk —session/loadorsession/resumecan restore any closed session.Why it's needed
When clients close browser tabs, crash, or lose network, sessions accumulate in the bridge's
byIdMap. Each orphaned session holds an EventBus ring (~2-4 MB). Without cleanup, the daemon eventually hitsmaxSessions=20and rejects all new sessions — a hard availability failure requiring restart.Reviewer Test Plan
How to verify
npx vitest run packages/acp-bridge/src/bridge.test.ts— 249 tests pass (includes 15 reaper + 3 detach-close tests)promptActive), no SSE subscribers, idle > TTLclientIds.size === 0 && subscriberCount === 0 && !promptActiveEvidence (Before & After)
Non-UI change — unit tests only. Design document at
docs/design/session-idle-reaper/README.md.Tested on
Risk & Scope
undicidependency issue ondaemon_mode_b_main— not introduced by this PR.session_closedreasons (idle_timeout,last_client_detached) are additive.Linked Issues
None — discovered during daemon architecture analysis.
中文说明
本 PR 做了什么
为 daemon bridge 添加两层 session 生命周期清理机制:
detachClient移除最后一个注册客户端且无 SSE 订阅者时,立即关闭 session。覆盖正常路径:用户关闭 tab → React cleanup →POST /session/:id/detach。两条路径都使用
closeSessionImpl,保留磁盘 JSONL——session/load或session/resume可以恢复任何已关闭的 session。为什么需要
客户端关闭浏览器 tab、崩溃或断网时,session 在 bridge 的
byIdMap 中累积。每个孤儿 session 持有一个 EventBus 重放环(~2-4 MB)。如果不清理,daemon 最终会达到maxSessions=20上限并拒绝所有新 session——需要重启才能恢复的可用性故障。风险与范围
session_closedreason(idle_timeout、last_client_detached)是增量添加的。🤖 Generated with Qwen Code