feat(serve): shared MCP transport pool [F2]#4336
Conversation
📋 Review SummaryThis PR (#4336) implements the F2 shared MCP transport pool foundation (commits 1-3 of 6), enabling multiple ACP sessions to share a single MCP child process per unique server configuration. The implementation is well-architected with strong documentation, comprehensive tests (161/161 passing), and careful attention to edge cases like spawn deduplication, drain grace periods, and cross-platform process cleanup. The code demonstrates solid engineering practices with clear separation of concerns between the pool, entry, and session view layers. 🔍 General FeedbackPositive aspects:
Areas of note:
🎯 Specific Feedback🟡 High
🟢 Medium
🔵 Low
✅ Highlights
Overall assessment: Strong foundational implementation for the MCP transport pool. The code is production-ready for commits 1-3, with the suggested improvements being refinements rather than blockers. Eager to review commits 4-6 (daemon wiring, status routes, budget guardrails) once landed. |
There was a problem hiding this comment.
Pull request overview
This draft checkpoint lays groundwork for the upcoming shared MCP transport pool by refactoring MCP discovery into snapshot-producing helpers and adding supporting primitives (pool keying, per-session registry projection, process cleanup). It also includes several broader refactors and worktree Phase C functionality (session-persisted worktree state + UI affordances) that materially expand the PR’s scope.
Changes:
- MCP pooling foundations: pure MCP discovery snapshots, per-session
SessionMcpView, pool key fingerprinting, and cross-platform descendant PID enumeration for shutdown. - Model selector/fast-model routing refactor: richer
resolveModelIdcontext (configured model ownership), cross-auth fast model support across subagents/forked agents/BaseLlmClient, and related API adjustments. - Worktree Phase C: persist/restore worktree session sidecar, add Footer/statusline surfacing, and introduce a WorktreeExitDialog + headless
--resumereminder injection.
Reviewed changes
Copilot reviewed 91 out of 91 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/vscode-ide-companion/schemas/settings.schema.json | Adds hideBuiltinWorktreeIndicator setting to IDE companion schema. |
| packages/core/src/utils/sideQuery.ts | Updates documentation/behavior around fast model selection for side queries. |
| packages/core/src/utils/modelId.ts | Expands model selector resolution with configured-model context + helper to build context from Config. |
| packages/core/src/utils/modelId.test.ts | Adds coverage for bare model resolution across auth types + fast selector behavior. |
| packages/core/src/utils/forkedAgent.ts | Resolves model selectors via resolveModelId and runs forked work under a runtime content-generator view when needed. |
| packages/core/src/utils/forkedAgent.agent.test.ts | Tests runtime view creation/propagation for cross-auth fast models in forked agent execution. |
| packages/core/src/tools/skill.test.ts | Broadens skill modelOverride tests to accept selector syntax (fast, authType:model). |
| packages/core/src/tools/session-mcp-view.ts | New: per-session projection of pooled MCP tool/prompt snapshots into session registries. |
| packages/core/src/tools/session-mcp-view.test.ts | New: unit tests for filtering + trust-clone behavior + teardown. |
| packages/core/src/tools/pid-descendants.ts | New: cross-platform descendant PID enumeration + best-effort SIGTERM helper. |
| packages/core/src/tools/pid-descendants.test.ts | New: validation tests + POSIX-gated integration test for descendant enumeration. |
| packages/core/src/tools/mcp-tool.ts | Adds DiscoveredMCPTool.withTrust() to support per-session trust cloning for pooled snapshots. |
| packages/core/src/tools/mcp-pool-key.ts | New: pool fingerprinting + transport classification + poolability gating. |
| packages/core/src/tools/mcp-pool-key.test.ts | New: tests for fingerprint stability/divergence + OAuth canonicalization + parsing. |
| packages/core/src/tools/mcp-pool-events.ts | New: pool event types + MCPCallInterruptedError for interrupted in-flight calls. |
| packages/core/src/tools/mcp-client.ts | Splits discover() into discoverAndReturn() + adds listMcpPrompts() for pure prompt listing; adds getTransportPid(). |
| packages/core/src/tools/exit-worktree.ts | Preserves sidecar on keep; clears sidecar on remove paths (best-effort, slug-checked). |
| packages/core/src/tools/exit-worktree.test.ts | Increases test timeouts to reduce flakiness under slow git/hook environments. |
| packages/core/src/tools/exit-worktree.session.integ.test.ts | New: integration coverage for WorktreeSession sidecar cleanup semantics on exit. |
| packages/core/src/tools/enter-worktree.ts | Persists WorktreeSession sidecar on enter and captures original HEAD commit hash. |
| packages/core/src/tools/enter-worktree.session.integ.test.ts | New: integration coverage for sidecar persistence + overwrite semantics. |
| packages/core/src/subagents/validation.ts | Switches subagent model validation to resolveModelId. |
| packages/core/src/subagents/types.ts | Updates docs for fast selector semantics (cross-auth fast model support). |
| packages/core/src/subagents/subagent-manager.ts | Uses resolveModelId + configured-model context; builds runtime view for resolved overrides. |
| packages/core/src/subagents/subagent-manager.test.ts | Adds tests for authType-qualified fastModel and cross-auth generator creation. |
| packages/core/src/subagents/model-selection.ts | Removes legacy subagent-only model selector parser. |
| packages/core/src/subagents/model-selection.test.ts | Removes tests for legacy subagent-only parser. |
| packages/core/src/skills/types.ts | Documents skill model selector to include fast. |
| packages/core/src/services/worktreeSessionService.test.ts | New: tests for WorktreeSession sidecar read/write/clear/restore behaviors. |
| packages/core/src/services/toolUseSummary.ts | Removes explicit model override parameter; uses configured fast model only. |
| packages/core/src/services/toolUseSummary.test.ts | Drops tests for explicit model override. |
| packages/core/src/services/sessionTitle.ts | Uses getFastModel() directly (removes getFastModelForSideQuery usage). |
| packages/core/src/services/sessionService.ts | Adds getWorktreeSessionPath() for session-bound sidecar location. |
| packages/core/src/services/sessionRecap.ts | Changes recap return type to `string |
| packages/core/src/services/gitWorktreeService.ts | Adds best-effort core.hooksPath setup during worktree creation. |
| packages/core/src/services/gitWorktreeService.hooks.integ.test.ts | New: integration tests for core.hooksPath configuration behavior. |
| packages/core/src/services/chatRecordingService.ts | Updates fast-model selection comment/logic to rely on getFastModel(). |
| packages/core/src/models/modelConfigResolver.ts | Fixes env-var-only path: auto-detect modalities via defaultModalities(); uses fallback model id for source attribution. |
| packages/core/src/models/modelConfigResolver.test.ts | Adds regression tests for issue #4219 (modalities auto-detection). |
| packages/core/src/memory/relevanceSelector.ts | Raises side-query timeout safety net from 1s to 30s with clearer intent. |
| packages/core/src/memory/recall.ts | Improves abort vs timeout vs error logging and heuristic fallback semantics. |
| packages/core/src/index.ts | Re-exports worktreeSessionService. |
| packages/core/src/followup/suggestionGenerator.ts | Prefers fast model (or cache-safe) for suggestions; adjusts override behavior. |
| packages/core/src/followup/suggestionGenerator.test.ts | Adds tests ensuring correct model passed in cache mode with/without fast model. |
| packages/core/src/core/openaiContentGenerator/provider/default.ts | Mirrors reasoning_content into reasoning for Qwen3 models when needed. |
| packages/core/src/core/openaiContentGenerator/provider/default.test.ts | Adds tests for Qwen3 reasoning field mirroring + non-overwrite behavior. |
| packages/core/src/core/loggingContentGenerator/loggingContentGenerator.ts | Logs auth type from generator config (not from mutable Config auth type). |
| packages/core/src/core/loggingContentGenerator/loggingContentGenerator.test.ts | Updates expectations to reflect generator-auth-type logging. |
| packages/core/src/core/baseLlmClient.ts | Builds model-id context from Config; uses runtime generator when active view differs; avoids caching fallback across runtime view changes. |
| packages/core/src/core/baseLlmClient.test.ts | Adds tests for runtime generator selection + no caching of fallback across view changes. |
| packages/core/src/config/config.ts | Redefines getFastModel() to return selectors (possibly authType-qualified) and removes prior split responsibility. |
| packages/core/src/config/config.test.ts | Updates tests to match new getFastModel() selector semantics. |
| packages/cli/src/ui/hooks/useWorktreeSession.ts | New: watches WorktreeSession sidecar file changes and exposes session worktree state to UI. |
| packages/cli/src/ui/hooks/useWorktreeSession.test.tsx | New: tests for creation/deletion watching behavior. |
| packages/cli/src/ui/hooks/useStatusLine.ts | Adds worktree fields to statusline stdin payload and avoids over-triggering on object identity. |
| packages/cli/src/ui/hooks/useDialogClose.ts | Allows global close (Ctrl+C/Esc) to dismiss WorktreeExitDialog. |
| packages/cli/src/ui/hooks/useAwaySummary.ts | Updates to new recap return type (string). |
| packages/cli/src/ui/hooks/useAwaySummary.test.ts | Updates mocks/expectations for recap return type change. |
| packages/cli/src/ui/contexts/UIStateContext.tsx | Adds activeWorktree + WorktreeExitDialog visibility state. |
| packages/cli/src/ui/contexts/UIActionsContext.tsx | Renames suggestion visibility action to tab-consumer state + adds worktree exit handler. |
| packages/cli/src/ui/components/WorktreeExitDialog.tsx | New: UI dialog that probes git dirty state + commit delta and offers keep/remove/cancel. |
| packages/cli/src/ui/components/WorktreeExitDialog.test.tsx | New: basic render test for initial loading frame. |
| packages/cli/src/ui/components/InputPrompt.tsx | Splits narrow “autocomplete dropdown visible” from broad “Tab consumer active” signal. |
| packages/cli/src/ui/components/InputPrompt.test.tsx | Adds regression tests for Tab-consumer reporting and narrow suggestion-visibility semantics. |
| packages/cli/src/ui/components/Footer.tsx | Adds built-in worktree indicator line (hideable via new setting). |
| packages/cli/src/ui/components/DialogManager.tsx | Wires WorktreeExitDialog into dialog selection. |
| packages/cli/src/ui/components/Composer.tsx | Uses narrow dropdown visibility locally; forwards Tab-consumer state upward. |
| packages/cli/src/ui/commands/recapCommand.ts | Adapts to recap return type change. |
| packages/cli/src/nonInteractiveCli.ts | On --resume, restores worktree context and injects a <system-reminder> + emits a system message. |
| packages/cli/src/nonInteractiveCli.test.ts | Adds tests for headless resume worktree reminder injection and stale sidecar cleanup. |
| packages/cli/src/config/settingsSchema.ts | Adds ui.hideBuiltinWorktreeIndicator setting definition. |
| packages/cli/src/acp-integration/session/Session.ts | Adds one-shot pendingWorktreeNotice injection into next prompt as <system-reminder>. |
| packages/cli/src/acp-integration/acpAgent.ts | Restores worktree context on ACP load/resume and queues the session notice. |
| docs/users/features/sub-agents.md | Updates subagent model selector docs (adds fast, cross-auth resolution guidance). |
| docs/design/worktree.md | Updates worktree design doc to phase-based plan (but currently contradicts implemented Phase C features). |
| docs/design/2026-05-15-async-memory-recall-design.md | New design doc for async memory recall approach and rationale. |
Comments suppressed due to low confidence (1)
packages/core/src/tools/session-mcp-view.ts:111
- This debug log message is misleading: it prints
${snapshot.length} toolsand then says “filtered to N registered” without actually computing or logging the registered count. Consider tracking the number of tools that pass the filter and logging that value (or removing the “filtered to …” clause) so the log is actionable during pool debugging.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Design document for F2 shared MCP transport pool — workspace-scoped pool that replaces today's per-session McpClient spawning so N sessions in one workspace share one process per unique server config. v2.1 folds in 12 review corrections on top of v2: - single-PR delivery per #4175 branching strategy (commit-by-commit review) - sessionToEntries reverse index for O(refs) releaseSession - ?entryIndex= selective restart route - spawn-failure slot leak fix - in-flight tool call during reconnect semantics (MCPCallInterruptedError) - /mcp disable triggers SessionMcpView re-apply - entryIndex exposure instead of raw fingerprint (avoid token-rotation side-channel) - reconnect backoff spec (stdio 5s x3, HTTP exponential 1/2/4/8/16s x5) - canonicalOAuth normalization - legacyInProcessAcquire renamed to createUnpooledConnection - drainAll(opts?) signature with timeoutMs - locked SDK reducer field names (no public API rename) - extension uninstall orphan entries deferred to MAX_IDLE_MS natural reap Refs: #3803, #4175 F2 Generated with Qwen Code
Replace-all regression from prior commit: both sides of the rename arrow ended up as createUnpooledConnection. Restore the meaning (old name was descriptive, not a literal symbol). Generated with Qwen Code
…4175 F2 commit 1) Foundation for the F2 shared MCP transport pool. Splits the existing side-effecting discovery API into a pure version that returns a {tools, prompts} snapshot, so the upcoming pool (#4175 F2 commit 2) can let a single shared McpClient produce one snapshot and have N per-session SessionMcpView instances each register a filtered copy into their own ToolRegistry / PromptRegistry. Changes: - Extract listMcpPrompts(serverName, mcpClient) — pure version of discoverPrompts that returns DiscoveredMCPPrompt[] (with serverName and bound invoke) WITHOUT touching any PromptRegistry. - Refactor discoverPrompts(name, client, registry) to wrap listMcpPrompts + register; preserves historical Promise<Prompt[]> return type (strips serverName / invoke from returned plain Prompt objects so existing callsites are unaffected). - Add McpClient.discoverAndReturn(cliConfig) — pure method returning {tools, prompts}. Same error semantics as discover(): flips status to DISCONNECTED on any failure and re-throws; "No prompts or tools found on the server." sentinel preserved so wrapping managers / pools can distinguish "server up but empty" from "server down". - Refactor McpClient.discover(cliConfig) to delegate: calls discoverAndReturn then explicitly registers BOTH tools and prompts into the per-instance registries. Pre-F2 prompts were registered as a side effect inside discoverPrompts; post-F2-1 registration happens in discover() after the pure call returns. Observable side effects identical (both registries populated by end of call); the order flip (tools first, then prompts vs. prompts first as side effect, then tools) has no observable race because discover() is awaited as a unit by connectAndDiscover and the two registries are independent maps. - Remove dead private methods McpClient.discoverTools and McpClient.discoverPrompts that delegated to the exported functions. Tests: - 7 new tests covering discoverAndReturn (snapshot purity, no registration, no-prompts-or-tools rejection with DISCONNECTED status flip, unconnected-state guard) and listMcpPrompts (enriched return type with invoke, no-prompts-capability fallback, protocol error swallow). - 1 new backward-compat test asserting discoverPrompts wrapper still registers prompts AND strips enrichment fields from return value. - 1 forward-defense assertion: the no-prompts-or-tools throw path verifies registries were strictly untouched, catching future regressions in commits 2-6 that might register a partial batch before the guard fires. Backward compatibility: - McpClient.discover() signature and side-effect contract unchanged for all standalone qwen callers + existing tests (44/44 pass). - discoverPrompts() exported signature unchanged. - No new public exports from packages other than listMcpPrompts + McpClient.discoverAndReturn (additive). - All 36 pre-existing tests in mcp-client.test.ts pass; all 71 tests in mcp-client-manager.test.ts pass. - packages/core typecheck clean; lint clean on touched files. Refs: #3803, #4175 F2; design doc docs/design/f2-mcp-transport-pool.md §7 Generated with Qwen Code
Core implementation of the F2 shared MCP transport pool. Workspace-
scoped pool that lets N ACP sessions share one MCP client per unique
(serverName, fingerprint) tuple instead of each session spawning
its own MCP child process.
New files:
- mcp-pool-events.ts: PoolEvent discriminated union, PoolEntryState
enum, MCPCallInterruptedError class (§13.4), type guards.
- mcp-pool-key.ts: fingerprint() with sorted canonical form for
stable hashing across env-key permutations; canonicalOAuth()
collapses {enabled:false}/undefined/null/{} to null (V21-9);
mcpTransportOf() classification; isPoolable() opt-in gate;
POOLED_TRANSPORTS_DEFAULT = {stdio, websocket} (V21 C8);
connectionIdOf / parseConnectionId.
- session-mcp-view.ts: per-session, per-server projection of the
pool's snapshot into a session's own ToolRegistry +
PromptRegistry. passesSessionFilter() preserves pre-F2
include/exclude semantics. applyTools clones each tool via
withTrust() so per-session trust never cross-contaminates the
shared snapshot (V21 C7). teardown() drops all this view's
registrations.
- mcp-pool-entry.ts: PoolEntry class with refcount, drain state
machine (spawning -> active <-> draining -> closed | failed),
generation counter for stale-handler guard (§7.3), snapshot
replay on attach (§7.2 / V21 C4), restart() with in-flight
coalescing (§13.2), forceShutdown() with idempotency,
MAX_IDLE_MS hard cap that survives drain/attach flap.
defaultPoolEntryOptions() returns transport-keyed defaults
(stdio: 5s fixed x3, http: 1/2/4/8/16s exponential x5 per §6.6).
- mcp-transport-pool.ts: top-level McpTransportPool class.
- acquire(name, cfg, sid, toolReg, promptReg): pool lookup,
spawnInFlight dedup for concurrent acquires, slot reservation
released on spawn failure (V21-4), sessionToEntries reverse
index for O(refs) releaseSession (V21-2).
- release(id, sid) / releaseSession(sid).
- restartByName(name, {entryIndex?}): V21-3 selective restart
via opaque entryIndex; returns RestartResult[].
- getSnapshot(): includes entryCount + entrySummary (with
opaque entryIndex, NOT raw fingerprint per V21-7) for the
pool-aware status route in commit 5.
- aggregateStatusByName(): "any-CONNECTED wins" across
multi-entry name collisions (§8.1).
- drainAll({force?, timeoutMs?}): wall-clock bounded graceful
shutdown for QwenAgent.close (§17 + V21-11).
- createUnpooledConnection(): SDK MCP + HTTP-no-opt-in path
constructs a per-session McpClient and uses the legacy
discover() (which writes to session registries directly).
- poisonedToolRegistry/PromptRegistry: stub passed to pool's
own McpClient instances; throws on any registration to catch
regressions where a pool path accidentally fell back to
side-effecting discover() instead of discoverAndReturn().
Changes:
- mcp-tool.ts: added DiscoveredMCPTool.withTrust(trust) clone
method (analogue of asFullyQualifiedTool but only updates trust;
returns this when trust unchanged to skip allocation in the
common case).
Tests (40 new):
- mcp-pool-key.test.ts (18 tests): fingerprint stability across
env permutations, divergence on auth byte changes, exclusion
of per-session filters from key, canonicalOAuth collapse,
transport classification, isPoolable gate, connectionId
round-trip with :: in server names.
- session-mcp-view.test.ts (11 tests): filter semantics, trust
copy invariant (snapshot tool NOT mutated), allocation pin
when trust unchanged, include/exclude precedence, prompt
fan-out, updateConfig + re-apply, idempotent teardown.
- mcp-transport-pool.test.ts (11 tests): 3-session sharing
with 1 spawn, credential isolation via env divergence,
drain timer cancellation by re-attach, drain timer expiry,
spawnInFlight dedup of 5 concurrent acquires, reverse-index
releaseSession, restartByName + entryIndex selectivity,
subprocessCount in snapshot, drainAll teardown.
No integration with daemon yet (acpAgent / Config / ToolRegistry
wiring lands in commit 4). Pool currently constructible in
isolation; existing standalone qwen + per-session McpClient path
untouched and all 71 mcp-client-manager + 44 mcp-client tests
pass unchanged.
Refs: #3803, #4175 F2; design doc docs/design/f2-mcp-transport-pool.md
§4 architecture, §5 fingerprint, §6 lifecycle, §7 SessionMcpView
Generated with Qwen Code
…2 commit 3) Two adjacent concerns in one commit: 1. Cross-platform descendant pid sweep (new file pid-descendants.ts) 2. Two P1 bug fixes folded back from commit-2 self-review == Pid descendant enumeration == `listDescendantPids(rootPid)` walks the process tree below the MCP child's root pid and returns all descendant pids in BFS order. `sigtermPids(pids)` sends SIGTERM tolerantly (ESRCH swallowed). Both are platform-aware: - Linux/macOS: `pgrep -P <pid>` recursion (pgrep exit code 1 means no children, NOT an error — special-cased) - Windows: PowerShell `Get-CimInstance Win32_Process` filtered by `ParentProcessId` (CIM replaces deprecated wmic on Win10 21H1+) Bounded by `QUERY_TIMEOUT_MS=2000`, `MAX_DESCENDANTS=256`, `MAX_DEPTH=8` so a runaway process tree can't stall daemon shutdown. Graceful degradation: tool missing or timeout returns `[]` and logs warn; OS will eventually reap the orphans (Linux init / Windows job objects). `PoolEntry.forceShutdown` now calls `getTransportPid()` → `listDescendantPids` → `sigtermPids` BEFORE `client.disconnect()`. Closes the leaked-wrapper-process gap that pre-F2 per-session McpClient teardown also had — wrappers like `npx`, `uvx`, `pnpm dlx` spawn the actual server as a grandchild; killing only the wrapper leaves the real server hanging. New `McpClient.getTransportPid()` public getter that introspects `StdioClientTransport.pid` (returns undefined for non-stdio transports + already-exited children). Optional-chained call site in PoolEntry tolerates older mock McpClient stubs in tests. == P1 fixes folded back from commit-2 review == P1 #1: PooledConnection.release() was a documented no-op that leaked refs until releaseSession bulk-cleanup. Wired `PooledConnectionImpl.releaseCallback` to the pool-supplied `pool.release(id, sessionId)`. Pool's `acquire` (both fast-path existing-entry and post-spawn paths) passes the callback through `PoolEntry.attach`'s new `opts.release` parameter. P1 #2: createUnpooledConnection double-teardown. Path: client.discover() registers tools/prompts into session registries → entry.markActive([], []) → entry.attach(sid, view) which synchronously called view.applyTools([]) → removeMcpToolsByServer(serverName) wiping the registrations discover() just made. Fix: PoolEntry.attach now accepts `opts.skipReplay?: boolean`. createUnpooledConnection passes `skipReplay: true` AND a release callback that calls forceShutdown directly (per-session lifetime, no pool refcount). Existing pool paths pass `release` but NOT `skipReplay`, preserving snapshot replay for the late-attach race. Tests (6 new on pid-descendants.test.ts): - input validation (non-positive, NaN, no-children) - sigtermPids empty input + ESRCH tolerance - integration: spawn shell that spawns node grandchild, verify listDescendantPids finds at least one descendant (POSIX-only, CI-skip gated) Verification: - 161/161 MCP-related tests pass (44 mcp-client + 71 mcp-client-manager + 18 mcp-pool-key + 11 session-mcp-view + 11 mcp-transport-pool + 6 pid-descendants) - packages/core typecheck clean - lint clean on touched files Not included (deferred to later commits): - Health monitor / auto-reconnect inside PoolEntry. Existing per-server reconnect logic lives in McpClientManager (consecutiveFailures + isReconnecting + reconnectDelayMs); pool doesn't yet have its own monitor. PoolEntry.restart() works for manual restart; future commit will plumb `client.onerror` → pool's reconnect path with §6.6 backoff strategy. Refs: #3803, #4175 F2; design doc §6.4 pid sweep, §6.5/§6.6 spawn failure + reconnect backoff, §7.2 snapshot replay Generated with Qwen Code
…F2 commit 4)
Daemon-mode integration of the F2 shared MCP transport pool. Sessions
running in the same workspace now share one MCP transport per unique
server config, instead of each session spawning its own child process.
Touches:
- packages/core/src/config/config.ts: setMcpTransportPool /
getMcpTransportPool. Pool reference stored on Config so
ToolRegistry's nested McpClientManager construction can pick it
up at config.initialize() time. Forward-declared via inline
`import('...').McpTransportPool` to avoid a circular import
between config.ts and tools/.
- packages/core/src/tools/tool-registry.ts: forwards
config.getMcpTransportPool() into the McpClientManager ctor.
When undefined, manager keeps its pre-F2 behavior (71/71
existing manager tests pass unchanged).
- packages/core/src/tools/mcp-client-manager.ts: new optional
`pool?` ctor param + new `discoverAllMcpToolsViaPool` branch in
discoverAllMcpTools. Gated on pool presence so standalone qwen
is unaffected. Pool path:
* Iterates servers with disable check
* Calls pool.acquire(name, cfg, sessionId, toolReg, promptReg)
* Tracks returned PooledConnection in `pooledConnections` map
* On disconnectServer: pooled.release() + map delete
* On stop(): releaseAllPooledConnections + existing flow
SDK MCP servers stay on the legacy path inside the pool itself
(createUnpooledConnection); manager doesn't need a parallel
SDK code path.
- packages/cli/src/acp-integration/acpAgent.ts: QwenAgent.mcpPool
field, eager construction in ctor (V21-13 Q6 resolved). Reads
options from env vars set by runQwenServe:
* QWEN_SERVE_NO_MCP_POOL=1 → kill switch (mcpPool stays
undefined; sessions fall back to per-session spawn)
* QWEN_SERVE_MCP_POOL_TRANSPORTS=stdio,websocket,http,sse →
operator opt-in for HTTP/SSE pooling (V21 C8); default
keeps stdio + websocket only
* QWEN_SERVE_MCP_POOL_DRAIN_MS=N → drain grace override
(default 30s; bounded [1s, 10min])
newSessionConfig calls config.setMcpTransportPool(this.mcpPool)
BEFORE config.initialize() so the ToolRegistry that initialize
constructs picks up the pool reference.
New `shutdownMcpPool(timeoutMs)` method called from the
SIGTERM/SIGINT handler in runAcpAgent before runExitCleanup
so the pool's descendant pid sweep (commit 3) catches npx/uvx
wrapper grandchildren.
- packages/core/src/index.ts: barrel exports for the pool
primitives (McpTransportPool, POOLED_TRANSPORTS_DEFAULT, types,
helpers).
- packages/core/src/tools/mcp-pool-key.ts: dedupe — removed local
McpTransportKind / mcpTransportOf definitions and re-export from
mcp-client-manager.ts (avoids name collision in the index.ts
barrel).
Tests:
- mcp-client-manager.test.ts: 2 new tests
* "routes discovery through the pool when one is injected" —
asserts pool.acquire called with (name, cfg, sessionId,
toolReg, promptReg); inverse invariant that McpClient is NOT
constructed by the manager when pool present (catches a
regression where the pool branch silently bypasses).
* "falls back to per-session McpClient spawn when no pool
injected" — explicit backward-compat assertion.
- All 73/73 mcp-client-manager tests pass (71 existing + 2 new)
- All 161/161 MCP-related tests pass (44 + 73 + 18 + 11 + 11 + 6
— incremented manager count)
- packages/core typecheck clean
- packages/cli typecheck: pool-related imports resolve;
pre-existing serve/status.ts + @google/genai issues unrelated
to F2 unchanged
Backward compatibility:
- Standalone qwen (non-daemon): QwenAgent not constructed; pool
not constructed; behavior identical to pre-F2
- QWEN_SERVE_NO_MCP_POOL=1: kill switch falls back to per-session
spawn even in daemon mode
- ACP child invoked with no pool env vars: defaults activate
(pool on, stdio+websocket transports, 30s drain)
- Existing McpClientManager construction sites (ToolRegistry,
test fixtures with the older 1-6 arg signatures) unchanged
because new pool param is optional and trailing
- McpTransportKind / mcpTransportOf still exported from the
same module path consumers used pre-F2
Not included (deferred to commits 5-6):
- Pool-aware GET /workspace/mcp snapshot (commit 5) —
buildWorkspaceMcpStatus still reads from bootstrap session's
manager; pool snapshot integration via QwenAgent extMethod is
next commit
- Pool-aware POST /workspace/mcp/:server/restart route with
?entryIndex= (commit 5)
- Budget guardrails graduation to workspace scope (commit 6) —
pool currently has no `--mcp-client-budget` integration, so
per-session budget enforcement still applies in pool mode (each
session's manager state machine is independent). PR 14b push
events still fire per session.
Refs: #3803, #4175 F2; design doc §2 current state, §10
per-session injection, §17 shutdown ordering
Generated with Qwen Code
…at (#4175 F2 commit 4 follow-up) The pre-commit eslint --fix in the previous commit (3dcdddf) merged the value imports into the type-only import block, which yielded `import type { ... type McpTransportKind, ... }` — TypeScript rejects nested `type` modifier inside `import type`. Restore the original two-block layout: value imports for runtime symbols (McpTransportPool, POOLED_TRANSPORTS_DEFAULT, etc.) and a separate `import type { ... }` for types only (McpTransportKind, ApprovalMode, Config, ConversationRecord, DeviceAuthorizationData). Pre-existing unrelated issues (ServeMcpTransport / @google/genai in cli/) are not addressed here. Generated with Qwen Code
… F2 commit 4 follow-up 2) Self-review found a regression: pool mode would route SDK MCP servers through pool.acquire which delegates to createUnpooledConnection. createUnpooledConnection constructs an McpClient with the pool's `sendSdkMcpMessage` callback — but the pool was constructed in QwenAgent ctor with no callback, so SDK MCP server tool calls would fail in daemon mode. Fix: discoverAllMcpToolsViaPool checks isSdkMcpServerConfig per server and routes SDK servers to the legacy discoverMcpToolsForServer path which preserves the per-session sendSdkMcpMessage wiring from McpClientManager's ctor. Non-SDK servers continue through pool.acquire. Bypass is per-server, not per-manager, so a workspace mixing SDK and non-SDK servers gets both pool-shared transports for the non-SDK ones AND working SDK MCP for the rest. Generated with Qwen Code
b580550 to
5a0a59c
Compare
…s + 4 suggestions (#4175 F2 PR #4336) Folds in @wenshao's first review pass on PR #4336. 7 critical bugs in pool lifecycle / race handling, 4 smaller suggestion fixes. Each issue keyed by its label in the PR comment thread for back-reference. == Critical fixes == C1 (acpAgent.ts:269) — Normal IDE close path missing pool drain. `await connection.closed` returned without calling `shutdownMcpPool`, leaking shared MCP entries (subprocess + wrappers) until OS reaped them — a real regression vs pre-F2 where each session's manager torn down its own clients on disconnect. Mirror SIGTERM handler's pool drain on the normal-close branch too. C2 (mcp-pool-entry.ts:291 area) — `attach()` ref ordering broke max-idle hard cap. Pre-fix, `attach` added the ref before calling `cancelDrainTimer`, so the `refs.size > 0` check inside cancelDrainTimer was always true and the maxIdle timer + firstIdleAt got reset on every attach — completely defeating its purpose (per design §6.3: "started at first idle and NEVER reset"). Fix: cancelDrainTimer now only cancels the drain grace timer; maxIdle survives the entire entry lifetime, cleared only by forceShutdown. C3 (mcp-pool-entry.ts:401) — `doRestart()` zombie state on reconnect failure. Pre-fix, a thrown `client.connect()` / `client.discoverAndReturn()` propagated up but left the entry with `localStatus = CONNECTED`, `state = 'active'`, stale snapshot — pool snapshot lies, subsequent acquires reuse the broken entry. Fix: try/catch wraps connect + discover; on failure transitions to terminal `'failed'` state, sets DISCONNECTED status, emits `failed` event, detaches subscribers via SessionMcpView.teardown, calls onClosed so pool drops the entry from its map. C4 (mcp-pool-entry.ts:361) — `forceShutdown`/`attach` race creates zombie connections. Pre-fix, `state = 'closed'` was assigned AFTER two async yields (`await listDescendantPids`, `await client.disconnect()`). During those yields, a concurrent `acquire` calling `attach` only rejected `'closed'`/`'failed'` states — got a handle to an entry mid-teardown. Fix: flip state to `'closed'` synchronously at the top of forceShutdown, before any await. Concurrent attach now sees 'closed' immediately and rejects. C5 (mcp-transport-pool.ts:399) — `drainAll` race with in-flight spawns. Pre-fix, after Promise.race resolved, `entries.clear()` + `spawnInFlight.clear()` ran synchronously. But in-flight spawn promises continued executing and called `entries.set(id, entry)` AFTER the clear — orphan entries leaking subprocesses past pool shutdown. Fix: introduce `draining` mutex flag (acquire rejects when set), and `await Promise.allSettled` on in-flight spawns BEFORE taking the entry snapshot. Spawn completion before clear is now ordered correctly. C6 (mcp-pool-entry.ts:155) — PoolEntry ignored transport- level errors. Pre-fix, McpClient.onerror writes DISCONNECTED to the global `serverStatuses` map on transport drop, but PoolEntry's `localStatus` stayed CONNECTED — pool's `aggregateStatusByName` then read the stale localStatus and "any-CONNECTED-wins" overwrote the correct DISCONNECTED back into the global map. Fix: PoolEntry registers a module-level status change listener filtered by serverName, mirrors the GLOBAL value into localStatus on every change. `suppressNextStatusEcho` flag guards against listener loops when the entry's own updateGlobalStatus writes to the global map. Listener detached on forceShutdown / failed-state transition. Sub-fix in spawnEntry: order is now `entries.set(id, entry)` BEFORE `entry.markActive(...)`. Pre-fix, markActive ran updateGlobalStatus before entries.set, so aggregateStatusByName couldn't find the just-spawned entry, returned DISCONNECTED, wrote that to the global map, the new status listener echoed it back as `localStatus = DISCONNECTED` — defeating the CONNECTED state markActive had just set. Reorder + idempotent `entries.delete(id)` in catch covers the race. C7 (mcp-client-manager.ts:966) — `discoverAllMcpToolsIncremental` bypassed pool. The pool gate in `discoverAllMcpTools` correctly routed the bulk path through `discoverAllMcpToolsViaPool`, but `discoverAllMcpToolsIncremental` (called from `Config.startMcpDiscoveryInBackground` during boot's default progressive mode) had no such guard — silently reverting to per-session McpClient spawning during the exact path most daemon sessions take. Fix: same `if (this.pool) return discoverAllMcpToolsViaPool(cliConfig)` gate at the top of discoverAllMcpToolsIncremental. == Suggestions == S1 (session-mcp-view.ts:38) — Docstring claimed both includeTools and excludeTools support `<name>(<args>)` parens form, but only includeTools strips parens. excludeTools uses direct equality (matches pre-F2 `mcp-client.ts:isEnabled` history). Doc fixed to reflect actual behavior. S2 (pid-descendants.ts:166) — `sigtermPids` docstring claimed it used `taskkill /F` on Windows, but the implementation always calls `process.kill(pid, 'SIGTERM')` regardless of platform. On Windows, Node polyfills SIGTERM to TerminateProcess (similar effect, no shell-out needed). Doc fixed; implementation unchanged. S3 (session-mcp-view.ts:110) — Debug log contained literal "N" instead of `${count}` interpolation. Operators enabling debug logging saw a meaningless placeholder. Track actual `registered` count and interpolate. S4 (mcp-transport-pool.ts:545) — `createUnpooledConnection` passed `() => MCPServerStatus.CONNECTED` as the status aggregator callback. After forceShutdown, this would write CONNECTED to the global serverStatuses map even though the transport was dead. Fix: aggregator now delegates to `client.getStatus()` so the global map reflects the actual McpClient state. == Verification == - 163/163 MCP-related tests pass (44 + 71 + 18 + 11 + 11 + 6 + 2) - packages/core typecheck clean - All fixes folded into the commit-where-the-bug-lived (commit 2 / commit 3 / commit 4) via fix-up commit on top — preserves bisectability of the buggy state for future forensics Refs: PR #4336 review by @wenshao (commit 4 round 1) Generated with Qwen Code
Wire the F2 transport pool into the daemon's `GET /workspace/mcp` and
`POST /workspace/mcp/:server/restart` surfaces, plus advertise two new
conditional capability tags.
Status route enrichment (`buildWorkspaceMcpStatus`):
- pool snapshot taken once outside the per-server loop (avoids N walks)
- per-server cells gain `entryCount` + `entrySummary` (V21-7 opaque
`entryIndex`, NOT raw fingerprint) when the pool holds at least one
matching entry
- pool snapshot failure is a stderr-loud non-fatal — the legacy
budget-accounting cells still render
Restart route routing (`workspaceMcpRestart` ext method):
- new `?entryIndex=N` query param (or `*` / omitted) on
`/workspace/mcp/:server/restart` — bounded non-negative integer or
the literal `*`; bad inputs return `400 invalid_entry_index`
- ACP child routes through `pool.restartByName(name, {entryIndex})`
when the pool holds entries; falls back to the legacy
`discoverToolsForServer` path otherwise (`--no-mcp-pool` daemons,
unpooled HTTP/SSE/SDK transports, or names that drained out)
- legacy single-entry response shape `{restarted, durationMs}`
preserved; multi-entry responses use the new
`{entries: RestartResult[]}` shape — clients gated on the
`mcp_pool_restart` capability tag are the only senders of
`entryIndex`
- pool-mode hard restart failure fans out one
`mcp_server_restart_refused` event per failed entry with
`reason: 'restart_failed'` (additive enum value) plus `details`
carrying the underlying error text; soft-skip pre-flight checks
(`disabled` / `in_flight` / `budget_would_exceed`) still run
BEFORE the pool branch
Capability advertisement:
- `mcp_workspace_pool` + `mcp_pool_restart` both gated on a new
`mcpPoolActive` toggle in `AdvertiseFeatureToggles`
- conditional predicate is default-OFF (matches `require_auth`
pattern); server.ts call site flips to default-ON via
`opts.mcpPoolActive !== false`, so a daemon booted without the
kill switch advertises both tags by default
- `runQwenServe.ts` infers `mcpPoolActive: false` when the parent
process has `QWEN_SERVE_NO_MCP_POOL=1` so the envelope tracks the
ACP child's actual feature set
SDK type extensions (additive only):
- `ServeWorkspaceMcpServerStatus.entryCount` + `entrySummary`
- `DaemonMcpServerRestartedData.entryIndex?`
- `DaemonMcpServerRestartRefusedData.{reason: 'restart_failed', entryIndex?, details?}`
- `MCP_RESTART_REFUSED_REASONS` widened to include `restart_failed`
Tests:
- `EXPECTED_REGISTERED_FEATURES` gains the two pool tags; conditional-
features drift test asserts `mcpPoolActive` predicate behavior
- `daemonEvents.test.ts` exercises the new `restart_failed` reason
through the reducer
163 F2 tests + 62 acp-bridge tests + 46 daemon events tests pass.
…SDK doc Two findings from the code-reviewer pass on `edeb0a5cf`: R1 (critical): the `/capabilities` v1-envelope test was asserting `features` against `getAdvertisedServeFeatures()` (no toggles → both new pool tags filtered out by the default-OFF predicate), but the actual response uses `mcpPoolActive: opts.mcpPoolActive !== false` (default-ON at the call site). Anchored the assertion against the same toggle the route uses, plus added a separate test that explicitly boots with `mcpPoolActive: false` and verifies both pool tags drop out (mirrors the `QWEN_SERVE_NO_MCP_POOL=1` kill-switch path). R3 (doc clarity): the `restart_failed` reason's jsdoc claimed old SDK reducers "see the new value as `unknown` (TS structural widening) and surface it generically rather than crashing." That described the type system but mis-stated the runtime: `isMcpServerRestartRefusedData` calls `MCP_RESTART_REFUSED_REASONS.has(...)` and returns false for unknown reasons, so `parseDaemonEvent` silently DROPS the event. New text explains the closed-set predicate + how the additive-protocol contract still holds (pre-PR SDKs gate on `mcp_pool_restart` before sending `entryIndex`, so they shouldn't be observing pool-mode multi-entry restarts).
Eight findings from wenshao's review of commit 5; six adopted as real
bug fixes / encapsulation wins, two with partial / declined replies.
R1 (critical): `maxIdleTimer` force-closed actively-used pool entries.
The C2 fix intentionally let the timer survive attach/detach flap, but
the fire-action didn't re-check `refs.size`. A session that re-attached
inside the 30s drain grace and stayed busy for 4+ minutes would lose
the entry permanently when `maxIdleTimer` (started at the earlier
detach) fired. Now: if active refs exist at fire time, log + reset
`firstIdleAt` so the next idle window gets a fresh hard cap.
R2 (critical): incremental discovery released ALL pooled connections
then re-acquired everything. Pre-fix every progressive-mode boot pass
or `/mcp refresh` produced a brief window with zero MCP tools
registered AND bounced every entry's drain timer. Now: diff
`pooledConnections` against the desired (name, fingerprint) set and
release only stale entries; survivors stay attached, no tool registry
churn. SDK MCP servers still re-run via the legacy path
(idempotent re-call).
R3 (correctness): `doRestart` updated `toolsSnapshot`/`promptsSnapshot`
and emitted typed events but no `SessionMcpView` instance subscribed
to that event stream — so session ToolRegistry instances kept stale
pre-restart registrations. Latent until commit 5 landed the restart
HTTP route; now a real correctness bug. Iterate `subscribers` directly
after snapshot update so views actually pick up the new tools/prompts.
R4 (cosmetic→correctness): `getSnapshot()` counted websocket toward
`subprocessCount`, but websocket transports dial a (potentially
remote) server and don't spawn a local OS child — inflated the
operator-facing capacity-planning metric. Restricted to `stdio` only.
R5 (defense-in-depth): the Windows `Get-CimInstance` PowerShell
script interpolated `${pid}` directly into the `-Filter` string. The
entry-point integer guard makes injection impossible today, but
binding the pid to a `$p` variable up front makes the integer-only
contract robust against future relaxations of the guard.
R6 (encapsulation): `PoolEntry.cfg` was readonly-public, exposing
secrets (env API keys, header auth tokens, OAuth fields) to anyone
holding an entry reference. Made private; added `transportKind`
getter for the only external reader (subprocessCount classification
in `getSnapshot`).
R7 (partial): removed five PoolEvent type guards, the `Prompt`
re-export, and `PoolEntryConnectionStatus` — all premature public
API with zero callers in source or tests. Kept
`MCPCallInterruptedError` because design §13.4 declares it as the
user-facing contract for the V21-5 in-flight call interruption
follow-up; removing it would lose the invariant carrier.
R8 (cleanup): SIGTERM handler and IDE-initiated close path had
identical `if (agentInstance) { try { await shutdownMcpPool(8_000) }
catch ... }` blocks. Extracted into `drainPoolBeforeExit(label)` so
both paths share the timeout + log labels and future drain-semantic
changes happen in one place.
R9 / R10 deferred: the McpClientManager 7th-arg sentinel pattern
(R9) and per-PID-per-level pgrep cost (R10) work correctly today;
both are refactoring/perf optimizations for a later cleanup PR
rather than F2 correctness blockers.
Tests:
- All 163 F2 tests pass; all 73 mcp-client-manager tests pass
- No new tests added; the existing R3 fix was caught only because
commit 5's restart route activated the latent path. Adding a unit
test for the snapshot fan-out would require wiring a mock
SessionMcpView; deferred to commit 6's test harness expansion.
wenshao
left a comment
There was a problem hiding this comment.
Round 3 review — 1 Critical, 3 Suggestions beyond the 10 previously-reported issues.
General note: discoverMcpToolsForServer (mcp-client-manager.ts) lacks a pool-mode gate. When the pool is active, /mcp reconnect or OAuth re-discovery would create a second McpClient + child process, bypassing SessionMcpView filtering/trust decoration. This path is reachable but may be intentionally deferred to commit 5 — worth addressing before wiring.
— DeepSeek/deepseek-v4-pro via Qwen Code /review
…F2 commit 6) Move slot reservation + 75% hysteresis + refused-batch coalescing from per-session McpClientManager copies onto a single workspace-scoped controller owned by the pool. 4 sessions × budget=2 now caps the workspace at 2, not 8. Core class (`packages/core/src/tools/mcp-workspace-budget.ts`): - New `WorkspaceMcpBudget` mirrors the manager's state machine (`tryReserve` / `release` / `recordRefusal` / hysteresis at `MCP_BUDGET_WARN_FRACTION`/`MCP_BUDGET_REARM_FRACTION` / bulk-pass coalescing) but is constructed once per workspace. - Reservation key is server NAME (matches PR 14 v1 contract; two pool entries with same name but divergent fingerprints share one slot). - `recordRefusal` flushes inline as a length-1 batch when called out-of-bulk-pass; bulk passes accumulate and `endBulkPass` does the coalesced emit (mirrors `McpClientManager.refuseAndLog → emitRefusedBatchIfAny`). Pool integration (`mcp-transport-pool.ts`): - New optional `budget?: WorkspaceMcpBudget` ctor option + `getBudget()` accessor for snapshot builders. - `acquire()` calls `tryReserve` pre-spawn; `'refused'` returns `BudgetExhaustedError` after `recordRefusal`. Spawn-failure path rolls back the slot (V21-4) when no sibling entry holds the name. - Entry close callback releases the slot if no other entry shares the same `serverName` (multi-fingerprint preservation). Manager integration (`mcp-client-manager.ts`): - `discoverAllMcpToolsViaPool` brackets the pass with `beginBulkPass`/`endBulkPass` so per-server BudgetExhaustedError refusals coalesce into ONE `refused_batch` event at end of pass. - `BudgetExhaustedError` from pool is logged at debug (deliberate refusal, not a failure); other errors stay at `error`. Daemon wiring (`acpAgent.ts`): - `QwenAgent` ctor reads `QWEN_SERVE_MCP_CLIENT_BUDGET` / `QWEN_SERVE_MCP_BUDGET_MODE` env vars (same path as per-session manager) and constructs `WorkspaceMcpBudget` when budget > 0, passes it to the pool. - `broadcastBudgetEvent(event)` fans workspace-scoped events to every attached session via per-sid `extNotification`s on the shared connection — replaces N per-session callbacks with one pool callback fanning out N times. - `newSessionConfig` skips the per-session `setMcpBudgetEventCallback` wiring when the workspace budget is active (prevents double-firing). - `buildWorkspaceMcpStatus` reads pool budget when active, marks the cell `scope: 'workspace'`. Per-session fallback unchanged. - `buildBudgetCells` accepts optional `scope` parameter; pre-F2 daemons / `--no-mcp-pool` keep `'session'` for back-compat. SDK additive surface (`sdk-typescript/src/daemon/events.ts`): - `DaemonMcpBudgetWarningData.scope?: 'workspace' | 'session'` - `DaemonMcpChildRefusedBatchData.scope?: 'workspace' | 'session'` - New helper `isWorkspaceScopedBudgetEvent(data)` for SDK consumers branching on scope. Type predicates unchanged (scope is optional). - Reducer counters (`mcpBudgetWarningCount` / `mcpChildRefusedBatchCount`) increment regardless of scope per V21-12 — workspace events fan to all sessions so counters move in lockstep. Tests: - 17 new `WorkspaceMcpBudget` tests covering tryReserve, release, hysteresis state machine, refused-batch coalescing, getters - 3 new pool integration tests covering acquire-refused-on-cap, slot release on entry close, slot rollback on spawn failure - All 163 pre-existing F2 tests pass; 229 total core+SDK tests Total: 1 new core class, ~600 LOC production + ~270 LOC tests.
… iter safety
Three findings from the code-reviewer pass on `ef2974b85`; one real
race fix + two clarity/defensive improvements.
R1 (race, important — 86): close-callback released the budget slot
prematurely when a same-name in-flight spawn was still running. The
sibling check inspected only `this.entries`, missing entries that
hadn't yet completed `markActive`. Sequence: entry A for 'srvA'
finishes spawn → registers in `entries`. Entry B (different
fingerprint, same name) starts spawning. Entry A drains; close-
callback finds no siblings in `entries` (B not yet registered) →
releases the slot. B finishes; slot is unreserved while B occupies
capacity. A subsequent acquire for a third name slips past the cap.
Fix: new `hasNameSibling(name)` helper checks BOTH `this.entries`
and `this.spawnInFlight.keys` (form `${name}::${fingerprint}`, so a
`startsWith(`${name}::`)` test isolates same-name in-flight spawns).
Used by the close-callback AND the spawn-failure rollback. Order of
catch/finally chained on the spawn promise is also fixed: `finally`
removes from `spawnInFlight` BEFORE the `catch` runs the rollback,
so `hasNameSibling` sees the post-cleanup state. Pre-fix the catch
ran first while the in-flight entry was still in the Map — masked
the rollback's release decision.
New test: `preserves slot when entry closes during a same-name
in-flight spawn (R1 race fix)` exercises exactly this sequence.
R2 (docs): SDK reducer counter docstrings updated to call out the
N× workspace fan-out multiplier explicitly. A workspace-scoped
`mcp_budget_warning` event fires once at the budget but produces N
reducer increments across N attached sessions on the daemon's
connection. Pre-fix the docstring didn't mention this and consumers
aggregating `mcpBudgetWarningCount` across sessions would
double-count silently. Now both `mcpBudgetWarningCount` and
`mcpChildRefusedBatchCount` docstrings have a "workspace-scope
multiplier" paragraph pointing consumers at
`isWorkspaceScopedBudgetEvent` for branching.
R3 (defense): `broadcastBudgetEvent` snapshots `this.sessions.keys`
into `Array.from(...)` BEFORE the per-id async fan-out so a
concurrent `killSession` (which mutates `this.sessions`
synchronously inside its handler) can't corrupt the iterator. No
known reproducer in the current code paths but cheap defensive
hardening — matches the same pattern used by the bridge's
`broadcastWorkspaceEvent`.
R2 of the original review (V21-12 reducer scope-blindness) is by-
design per design §11.4: SDK consumers wanting a deduplicated
"workspace events fired" tally use `lastMcpBudgetWarning?.scope`
to gate. The docstring fix (above) closes the documentation gap
that made this contract invisible.
Tests: 151 pool + workspace-budget + manager + SDK events tests
pass (3 new pool integration tests including the R1 regression).
Lint clean.
Twelve real fixes (7 critical + 5 minor) + 3 declined-with-reply.
W1 (critical): pool spawn-failure leaked `statusChangeListener` —
catch only ran `entries.delete` + `client.disconnect`, never
`forceShutdown` (the sole removal path). Each failure leaked one
listener permanently. Fix: call `entry.forceShutdown('manual')`
before disconnect; wrap in try/catch since the entry never reached
`active`.
W2 (critical): `statusChangeListener` corrupted sibling entries'
`localStatus` for multi-fingerprint name collisions. Module-level
`serverStatuses` is shared across all entries with the same
`serverName`; entry A's transport error wrote DISCONNECTED, B's
listener fired with that status, and the `if (status !==
this.localStatus)` guard didn't catch it because B was CONNECTED.
Fix: cross-check `this.client.getStatus() !== status` (per-entry
truth) before mirroring — sibling writes are now ignored.
W3 (critical): `doRestart()` skipped the `listDescendantPids` +
`sigtermPids` sweep that `forceShutdown` performs. For stdio MCP
servers wrapped by `npx`/`uvx`/`pnpm dlx`, every restart-via-HTTP
left the actual server grandchild as an orphan. Fix: mirror the
sweep BEFORE `client.disconnect`; per-pid failures tolerated.
W4 (critical): `doRestart()` didn't `cancelDrainTimer` or transition
`'draining' → 'active'`. An entry in drain grace whose restart
arrived would yield to the drain timer mid-disconnect, get
force-closed, then `client.connect` would spawn an orphan that the
pool no longer tracks. Fix: cancel drain + transition state at the
top of `doRestart`.
W5 (critical): `McpClientManager.pooledConnections` held dead
handles after a pool entry transitioned to `'failed'` (entry
removed from `pool.entries`, manager never learned). Subsequent
discovery passes saw `pooledConnections.has(name)` and skipped
re-acquiring → server's tools permanently lost for the session
until full `stop` + rediscovery. Fix: subscribe to entry events on
`acquire`; evict on `'failed'` (idempotent via `get(name) === conn`
guard).
W6 (critical): `discoverAllMcpToolsViaPool` was not re-entrant. Two
concurrent passes (full + incremental, or two incrementals) could
both see `pooledConnections.has(name) === false` before either
called `.set()` → second `.set` overwrote first → conn1 leaked
forever. Fix: per-manager `discoveryInFlight` mutex; second caller
awaits the same promise.
W14 (critical): `createUnpooledConnection`'s catch path had the same
`statusChangeListener` leak as W1 (different code path, same root
cause — only `forceShutdown` removes the listener). Fix: same
mirror in the unpooled catch.
W9 (minor): `parsePoolDrainMs` accepted `'30000ms'` / `'30000abc'`
silently via `Number.parseInt` truncation. Fix: strict `^\d+$`
regex; reject with stderr warning + default fallback.
W10 (minor): pool's `acquire` called `indexAttach(sessionId, id)`
BEFORE `entry.attach()`. If `attach` threw (e.g., entry transitioned
to `closed`/`failed` between the existence check and the call), the
reverse index retained a stale mapping. Fix: index AFTER
`attach` succeeds (both fast path + in-flight path).
W13 (doc): `subprocessCount` JSDoc still claimed `stdio + websocket`
after R4 restricted it to stdio in commit 5. Fix: doc updated.
W15 (defensive): bridge's pool-mode response handler cast
`response as PoolEntries` and iterated `response.entries` without
runtime shape validation. A buggy/out-of-sync ACP child returning a
malformed shape would crash the route with TypeError. Fix:
`Array.isArray` check + per-entry shape guard; malformed entries
skipped with stderr warning.
W7 (test gaps, partial): added regression test `serializes
concurrent discovery passes via mutex` for W6. Other coverage gaps
(drain mutex, spawnEntry failure, restart failure,
createUnpooledConnection) are deferred — better addressed via a
focused test-coverage commit after F2 series merges.
Declined (with reply on PR):
- W8 (`maxReconnectAttempts`/`reconnectStrategy` unused) — health
monitor reconnect is a deferred F2 follow-up per design §6.6;
the fields stay as forward-compat placeholders.
- W11 (duplicate fast-path/in-flight-path attach blocks) — accepted
refactor opportunity; not blocking F2 series merge.
- W12 (passesSessionFilter O(M×N)) — micro-perf optimization;
measurable only with hundreds of tools / large filter lists.
Tests: 231 F2/SDK tests pass (1 new mutex regression test); 62
acp-bridge tests pass. Lint clean.
…ed identity)
Round 22 closes 3 new threads wenshao posted at 12:34Z (after my
Round 21 push) on the W125 fix from Round 20. Two are real Critical
bugs I introduced; the third is a partial accept on test coverage +
observability.
W125-followup A — budget leak (Critical, transport-pool.ts:266):
Pre-R20 my W125 catch + else-if branches called
`this.entries.delete(id)` directly — bypassing the pool's onClosed
callback which is the only place budget slots were released. Real
race window:
1. `forceShutdown('drain_timer')` runs sync portion (state='closed',
listener detach, emit, subscriber detach), then
`await sweepAndDisconnect`. pool.entries STILL holds the
terminal entry; onClosed has NOT fired yet.
2. Concurrent `pool.acquire` for the same id hits W125's else-if
(existing && existing.isTerminated()) → entries.delete(id).
Budget slot leaked: never released here, and the pending
onClosed will see `entries.get(id) === undefined` and skip
its budget.release call.
3. Spawn path's `tryReserve(serverName)` consumes ANOTHER slot.
Net: one slot permanently leaked per occurrence. After
enough silent-drop + reacquire cycles, the workspace budget
fills with phantom reservations and legitimate acquires
refuse with `BudgetExhaustedError`.
W125-followup B — onClosed identity corruption (Critical,
transport-pool.ts:825):
Same race window, sibling failure mode:
1. forceShutdown sync done; await sweepAndDisconnect pending.
2. Concurrent acquire's W125 else-if evicts old entry; spawn
path inserts NEW entry under same id via `entries.set`.
3. forceShutdown's await finishes; calls `this.onClosed(id)`.
4. Pre-fix the callback did `entries.get(closedId)` and
unconditionally `entries.delete(closedId)` — silently evicting
the NEW entry from the pool, AND (worse) running its
`hasNameSibling` + `budget.release` against the new entry's
serverName, prematurely freeing the slot the new entry just
reserved.
Symptom: a brand-new pool entry could be evicted milliseconds
after construction, while the budget slot it nominally held was
released to other servers — the new session would later see
"Cannot attach in state failed" or attach succeed against a
zombie no longer in the map. Highly racy; only reproduces under
specific await ordering, but trivially possible in production.
Combined fix:
- New private `evictEntry(id, entry)` helper in
`mcp-transport-pool.ts:781-820`. Single source of truth for
eviction: identity-checks `entries.get(id) === entry` before
deleting, then mirrors the existing `hasNameSibling` budget-
release semantics.
- Pool-managed onClosed callback in `spawnEntry` now captures
its `entry` via a mutable holder (`entryRef`) and routes
through `evictEntry`. Stale onClosed calls (the fired-after-
respawn variant) no-op via the identity check.
- Both W125 self-heal branches (catch at line ~272 and else-if
at line ~287) now route through `evictEntry`. Pre-fix the
bare delete leaked budget; post-fix the slot is correctly
released and the spawn path's tryReserve consumes a fresh
one (net 1 slot, not 0 (leak) or 2 (double-reserve)).
W125-followup C — partial accept (Suggestion, transport-pool.ts:275):
- debugLogger.warn added at both W125 self-heal branches with
distinct messages ("evicted terminal entry mid-attach race"
vs "evicted stale terminal entry before fast-path attach").
The pool self-heal IS an operator-actionable degraded-state
event (silent transport drop or in-flight forceShutdown race)
— distinct from W128/W129 which were declined as pure noise.
- Else-if path regression test
(`mcp-transport-pool.test.ts:476-543`): reproduces the exact
forceShutdown-yield + concurrent-acquire race, asserts:
• Pre-acquire pool.entries STILL holds the old terminal entry
(confirms the race window IS reachable, addressing the
reviewer's point about reachability).
• Post-acquire conn2.entryIndex === 1 (fresh spawn).
• Post-acquire entries.get(targetId) !== oldEntry (eviction
worked).
• budget.getReservedSlots() === ['srv'] before + after acquire
+ after timer drain (confirms NO leak — pre-R22 the slot
leaked when onClosed's late fire saw entries.get returning
undefined).
• After `vi.runAllTimersAsync()` the new entry survives
intact (confirms onClosed identity check made the late
eviction a no-op — pre-R22 the new entry would have been
silently evicted here).
- Catch-block test deferred: deterministic reproduction requires
spying on `entry.attach` to flip state mid-call which is
infrastructure-heavy for a microsecond race that the same
`evictEntry` helper covers via the else-if path. Filed as
F2 follow-up.
Test sweep: 239 F2 + cli + 69 SDK tests pass; ESLint clean;
typecheck clean (core + cli).
…ll fixes
Round 23 closes 8 unique threads from wenshao's 13:07Z + 13:24Z
batches (the 13:09Z batch was an exact duplicate; resolve-as-
duplicate). One Critical, one real ordering bug, four cleanups,
one investigation outcome. Two declined.
T1 Critical (mcp-client.ts:179 — discover() filter regression):
Pre-PR `discover()` called `discoverTools(cliConfig)` with no
opts → `applyConfigFilters` defaulted to `true` → trust set from
`mcpServerConfig.trust`, `includeTools`/`excludeTools` filtering
applied. The F2 refactor routed `discover()` through
`discoverAndReturn` which hardcoded `{ applyConfigFilters: false }`
(correct for the pool path where `SessionMcpView.applyTools` is
the authoritative filter, but WRONG for legacy non-pool callers).
Net: every operator who configured `trust: true` saw unexpected
permission prompts (tools' trust silently became `undefined`),
and `excludeTools` was ignored.
Fix: `discoverAndReturn` now accepts an `opts.applyConfigFilters`
parameter that defaults to `true`. Legacy `discover()` doesn't
pass it → filters applied. Pool callers (3 sites: `pool.spawnEntry`,
`pool.createUnpooledConnection` snapshot path, `PoolEntry.doRestart`)
explicitly pass `{ applyConfigFilters: false }` because per-session
views handle filtering.
Test: existing pool-snapshot test renamed +
`discoverAndReturn (default) applies config filters and trust —
legacy non-pool callers (R23 T1)` regression test added.
Pre-fix: snapshot returns both tools, `trust === undefined`. Post-
fix: `excludeTools: ['filtered_out']` excludes second tool, first
carries `trust: true`.
T15 Suggestion (mcp-pool-entry.ts:354 — sweep ordering undid R21
updateGlobalStatus):
R21's W122 followup added `void sweepAndDisconnect('silent_drop')`
+ sync `updateGlobalStatus()`. But `sweepAndDisconnect`'s tail
awaits `client.disconnect()` which unconditionally writes
`updateMCPServerStatus(name, DISCONNECTED)` at `mcp-client.ts:250`,
OVERWRITING the aggregate the sync `updateGlobalStatus()` just
computed. For multi-fingerprint servers with one alive sibling,
global map flapped CONNECTED → DISCONNECTED on every silent drop.
Fix: chain a SECOND `updateGlobalStatus()` inside the sweep's
`.then()` so it lands AFTER `client.disconnect()`'s stale write.
Keep the synchronous call (best-effort coverage for any reader
racing between now and sweep settle). `aggregateStatusByName` is
idempotent + reads only `localStatus` of remaining entries; this
entry is removed from `pool.entries` by `onClosed` before the
chained call runs the second time.
T2 Suggestion (mcp-pool-entry.ts:1112 — PooledConnectionImpl.on
double-attach):
Local `Set<>` deduped at the public-API level but
`entry.internalOn` is a thin wrapper over `EventEmitter.on` which
does NOT dedup. Calling `on(cb)` twice with the same listener
added 2 registrations on the entry's emitter while appearing as
1 entry in `this.listeners`. On `release()`, `internalOff(cb)`
was called once → 1 registration leaked, fired once per future
event for the entry's lifetime. Fix: detect the duplicate
pre-attach and short-circuit.
T5 Suggestion (mcp-client.ts:802 — isEnabled side-effect call):
Pre-fix `isEnabled(funcDecl, ...)` was invoked solely to trigger
its embedded warn log when `funcDecl.name` was missing (return
value ignored). Refactored to inline the warning directly,
letting `isEnabled` stay a pure predicate. Same warning text;
no behavior change.
T6 Suggestion (mcp-pool-entry.ts:1086 — _view dead parameter):
`_view: SessionMcpView` accepted but never stored or referenced.
The underscore prefix signaled "kept for parity / future use"
but the deferred-need never materialized; per-subscriber filters
live on `SessionMcpView` itself, not the connection handle.
Removed entirely; updated the single call site.
T16 Suggestion (mcp-client-manager.ts:1678 — W115 JSDoc rewrite):
Pre-fix comment said "set BEFORE the race" — misleading because
the flag is actually set INSIDE the timeout callback (at the
instant the grace timer fires). Rewrote to describe the actual
mechanism + corrected stale line citation (was `~1539`, actual
consumer guard is at `~1572`).
T4 investigation (mcp-pool-entry.ts:65 — websocket classification):
Reviewer flagged that `defaultPoolEntryOptions` classified
websocket as LOCAL (3 attempts, fixed delay) while
`discoveryTimeoutFor` classified it as REMOTE (5s discovery
timeout, via `cfg.tcp` / `cfg.url` predicate). Aligned the
former by including `'websocket'` in the remote set.
NOTE: `maxReconnectAttempts` and `reconnectStrategy` are
currently unconsumed by any pool code path (no health monitor in
pool mode), so the classification fix is forward-looking for
when the health monitor lands. Documented this in-line.
Declined:
T3 (acpAgent.ts:173 — bootstrapSkipsMcpDiscovery dead const):
deliberate documentation-as-code from W132. The const-then-
`if (!flag)` pattern documents the contract that bootstrap-
level discovery is bypassed AND keeps a clearly-marked
re-enable point if a future code path needs to restore it.
Removing the flag would erase that signal.
T7 (pid-descendants.ts:69 — BFS pgrep O(B^D)):
real perf concern but not a correctness bug; depth on stdio
wrapper trees is bounded (npx → tool, uvx → tool, pnpm dlx
→ tool) so B^D stays small. A genuine fix (single `ps`
snapshot + in-process tree walk) is its own perf PR. Filed
as F2 follow-up.
Test sweep: 230 F2 + cli acpAgent + 69 SDK tests pass; ESLint
clean; typecheck clean (core + cli).
wenshao
left a comment
There was a problem hiding this comment.
No issues found. LGTM! ✅ — gpt-5.5 via Qwen Code /review
…self-heal
Round 24 closes 2 of 3 new threads from wenshao's 14:32Z batch.
T18 (3 untested doRestart failure paths) declined as F2 hardening
follow-up — pure test-coverage suggestion for stable code paths,
~100 LOC of mocking infrastructure not commensurate with regression
risk on currently-stable code.
T17 Critical (transport-pool.ts:381 — phantom budget release on
'already_held' + sibling eviction race):
Pre-fix the spawn-failure catches in `pool.acquire` (both unpooled
at line ~343 and pooled at line ~377) called
`budget.release(serverName)` whenever `!hasNameSibling(serverName)`
was true — regardless of whether THIS acquire actually reserved a
new slot. When `tryReserve` returned `'already_held'` (a same-
name sibling held the slot), and the sibling was concurrently
evicted between this acquire's `tryReserve` and its catch, the
catch released a slot this acquire never reserved.
Practical impact today is masked by `WorkspaceMcpBudget.release`'s
`Set.delete` idempotency (phantom release on a non-held key is a
no-op for the Set). But the contract was wrong: the catch's job
is to roll back THIS acquire's reservation, not to re-attempt
cleanup that already happened. If a future refactor makes
`release` non-idempotent (e.g. adds a counter that decrements
unconditionally), the latent bug would surface as budget drift.
Fix: hoist `reservationResult` (`'reserved' | 'already_held' |
undefined`) into outer scope so both catches can distinguish.
Catches now check `reservationResult === 'reserved'` before
releasing — `'already_held'` skips the release entirely
(sibling owns it; not ours to roll back).
Test: `mcp-transport-pool.test.ts` adds "does NOT phantom-release
when 'already_held' spawn fails (R24 T17)" — sets up budget
cap=2, acquires A successfully (`'reserved'`), kicks off acquire
B for same-name different-fingerprint (`'already_held'`) with
spawn failure, asserts `budget.release` was NOT called and
`getReservedSlots()` still shows A's slot intact.
T19 Suggestion (mcp-client-manager.ts:2374 — pooled readResource
skips health check):
Pre-fix the pooled fast-path
(`pooledConnections.get` → `pooled.client.readResource`)
delegated without checking `pooled.client.getStatus()`. There is
a narrow window between a silent transport drop (W120/W131 flips
entry to `'failed'` + emits `'failed'` event) and the manager-
side `onFailed` listener evicting the handle from
`pooledConnections`. A `readResource` landing in that window hit
a dead transport and surfaced an opaque MCP `"Transport is
closed"` error instead of a clear server-unavailable signal.
Fix: pre-call check `pooled.client.getStatus() !==
MCPServerStatus.CONNECTED`. On dead handle, evict inline (so the
next call re-acquires through the legacy spawn path below) and
throw `"MCP server '<name>' pool entry disconnected; retry after
discovery."`. Mirrors the W122 self-heal philosophy: pool already
owns the eviction, this closes the observability gap on the read
path.
Test: `mcp-client-manager.test.ts` adds "readResource self-heals
when pooled handle is dead (R24 T19)" — fakes pool entry whose
`client.getStatus()` flips from CONNECTED → DISCONNECTED mid-
test; asserts the second readResource throws clear error AND
`pooledConnections.has('srv') === false`. The pre-existing
pooled-fast-path test was extended to provide `getStatus: () =>
CONNECTED` on its mock client (the new code path requires it).
Test sweep: 287 F2 + cli + 69 SDK tests pass; ESLint clean;
typecheck clean (core + cli).
| if ( | ||
| this.opts.budget !== undefined && | ||
| reservationResult === 'reserved' && | ||
| !this.hasNameSibling(serverName) |
There was a problem hiding this comment.
[Suggestion] Pooled catch: reservationResult === 'reserved' guard can cause permanent budget-slot leak in a narrow race.
Consider this interleaving:
- Acquire A (fingerprint X):
tryReserve→'reserved', spawn succeeds → entry inthis.entries - Acquire B (same name, fingerprint Y):
tryReserve→'already_held', spawns its ownspawnInFlight - A's entry fails →
evictEntry→hasNameSiblingsees B inspawnInFlight→ true → slot NOT released (correct — defers to B) - B's spawn fails →
.finally()removes B fromspawnInFlight→.catch()runs →reservationResult === 'reserved'is false ('already_held') → slot NOT released
Net: reservedSlots holds serverName forever. Neither caller releases it. Under enforce mode with a capped budget, each occurrence permanently consumes one slot until daemon restart.
The existing test ("does NOT phantom-release when 'already_held' spawn fails") covers the case where A's entry survives — in that case hasNameSibling returns true regardless, so both old and new code correctly skip release. But the eviction-during-spawn interleaving above is untested.
For the pooled catch specifically, !hasNameSibling alone is the correct condition (matching evictEntry at line 868). When reservationResult === 'already_held' AND !hasNameSibling, it means A's eviction deferred the release to B because B was in spawnInFlight, so B must release on failure:
| !this.hasNameSibling(serverName) | |
| if ( | |
| this.opts.budget !== undefined && | |
| !this.hasNameSibling(serverName) | |
| ) { |
The unpooled catch (line 362) can keep the reservationResult guard — the unpooled path has no spawnInFlight mechanism, so !hasNameSibling alone would be a true phantom release there.
— qwen-latest-series-invite-beta-v36 via Qwen Code /review
| // Mirrors the W122 self-heal philosophy: the pool already | ||
| // owns the eviction, we just close the observability gap on | ||
| // the read path. | ||
| if (pooled.client.getStatus() !== MCPServerStatus.CONNECTED) { |
There was a problem hiding this comment.
[Suggestion] Health-check eviction deletes from pooledConnections without calling pooled.release(), leaving a dangling session ref in the pool entry's refs set.
Every other eviction site in this file calls release() before map deletion:
disconnectServer(line 1769):pooled.release(); this.pooledConnections.delete(serverName)releaseAllPooledConnections(line 1622):conn.release()per entrystopTimedOutguard (line 1574):conn.release()
The existing onFailed listener (line 1556-1557) also skips release(), but relies on releaseAllPooledConnections / stop as a backstop during shutdown or next discovery pass. The health-check eviction has no such backstop — it's a new synchronous trigger path.
Each leak leaves refs.size > 0, so the dead entry's drain timer never fires and the stdio subprocess persists until daemon shutdown. In a long-running daemon with a flaky MCP server, this accumulates zombie processes.
| if (pooled.client.getStatus() !== MCPServerStatus.CONNECTED) { | |
| if (pooled.client.getStatus() !== MCPServerStatus.CONNECTED) { | |
| try { pooled.release(); } catch { /* best-effort — entry may be terminal */ } | |
| this.pooledConnections.delete(serverName); | |
| throw new Error( | |
| `MCP server '${serverName}' pool entry disconnected; retry after discovery.`, | |
| ); | |
| } |
And add to the test:
expect(pooled.release).toHaveBeenCalledTimes(1);— qwen-latest-series-invite-beta-v36 via Qwen Code /review
Local verification report — F2 shared MCP transport poolVerified PR HEAD Summary
Test counts
Real-subprocess functional tests
Test 5 targets the W3 / W31 / W32 orphan-grandchild bug class. To make it a true test of Two investigated non-issues
Notes for merge
Build / typecheck / lint / unit + full regression / real-subprocess functional tests are all green — no engineering blockers to merge. |
…4411) * refactor(core): F2 PR A R9 — McpClientManager options-object ctor R9 (filed as F2 follow-up from #4336 review): 7 positional ctor args collapse to (config, toolRegistry, options?: McpClientManagerOptions). The trailing 5 (eventEmitter, sendSdkMcpMessage, healthConfig, budgetConfig, pool) become named fields on `McpClientManagerOptions`. Test factory `mkManager(overrides?)` introduced at the top of `mcp-client-manager.test.ts` so each of the prior 80 inline constructions becomes a single line naming only the field(s) the test overrides; the 4 `undefined` sentinels each test threaded through to reach the trailing `pool` arg are gone. Net: 113 LOC removed (test) + 35 LOC added (src exposes interface + mkManager factory + tool-registry call site update). Behavior unchanged — same field assignments, same downgrade-enforce-without- budget breadcrumb, same budget event wiring. Filed bucket: F2 perf / cleanup PR A (R9 + W11 + W12 + R10/R23 T7), see issue #4175 item 7 "F2 post-merge cleanup PRs". This is the first of the 4 fixes in PR A; W11/W12/R10 follow as separate commits. Test sweep: 84/84 mcp-client-manager.test.ts pass; typecheck clean. * refactor(core): F2 PR A W11 — extract attachPooledSession + rollbackReservationOnSpawnFailure W11 (filed as F2 follow-up from #4336 review): two private helpers on `McpTransportPool` to eliminate inline duplication in `acquire()`: - `attachPooledSession(entry, id, serverName, cfg, sessionId, toolReg, promptReg)`: builds `SessionMcpView` + `entry.attach` with the standard pool release callback. Used by both the fast-path attach (existing entry) and the post-spawn attach (after `await inFlight`). NOT used by `createUnpooledConnection` — its release callback runs `entry.forceShutdown('manual')` + `indexDetach` directly (no pool refcount accounting since unpooled entries are per-session). - `rollbackReservationOnSpawnFailure(reservationResult, serverName)`: R24 T17 contract — only release the budget slot if THIS acquire actually reserved a new slot (`'reserved'`); `'already_held'` skips because the sibling owns it. Used by both the unpooled catch and the pooled spawn-in-flight catch. Race-window invariants (W10 / W77 / W90 / W111 / W125 / R24 T17) stay at the call sites because they describe the SURROUNDING ordering, not the helpers themselves. Helpers are documented to defer those decisions back to callers. Behavior unchanged. Filed bucket: F2 perf cleanup PR A (R9 done / W11 this commit / W12 + R10 to follow). Test sweep: 28/28 mcp-transport-pool.test.ts pass; typecheck clean. * refactor(core): F2 PR A W12 — SessionMcpView precompute filter Sets W12 (filed as F2 follow-up from #4336 review): `applyTools` / `applyPrompts` precompute `excludeSet` + `includeSet` once per pass instead of scanning `cfg.includeTools` / `cfg.excludeTools` arrays inside every per-tool iteration. Pre-fix the per-tool predicate (`passesSessionFilter`) walked both arrays for every snapshot entry → O(M × N) per `applyTools` call. With M tools × N filter entries, typical M=5-20 / N=2-5 case finishes in microseconds either way; the win is data-structure correctness and code clarity, not perceived perf. `passesSessionFilter` / `passesSessionPromptFilter` (the array- based predicates) stay exported and unchanged for unit tests + any caller wanting to test a single name without paying Set construction. The bulk path uses two new private helpers `compileNameFilter` + `compiledFilterAccepts` whose Sets live on the `applyTools` / `applyPrompts` stack frame. Same semantics: `excludeTools` is direct-equality match (no parens strip — pre-F2 behavior preserved); `includeTools` strips the first `(...)` suffix so `toolName(args)` matches `toolName`. Filed bucket: F2 perf cleanup PR A (R9 + W11 done / W12 this commit / R10 to follow). Test sweep: 13/13 session-mcp-view.test.ts pass; typecheck clean. * perf(core): F2 PR A R10 / R23 T7 — pid-descendants ps snapshot + pgrep fallback R10 / R23 T7 (filed as F2 follow-up from #4336 review): the Linux / macOS pid-descendant enumeration moves from per-pid `pgrep -P <pid>` BFS (one subprocess fork per node visited) to a single `ps -A -o pid=,ppid=` snapshot followed by an in-memory tree walk over `Map<ppid, pid[]>`. Windows analog: single `Get-CimInstance Win32_Process | ConvertTo-Csv` snapshot of all `(ProcessId, ParentProcessId)` rows replaces per-pid `Get-CimInstance -Filter "ParentProcessId=$p"` BFS. Two motivations: 1. **Fork count**: typical `npx → tool` / `uvx → tool` wrapper trees are 2-3 levels deep with B=1-3 children per node → pre-fix BFS forked ~5-10 subprocesses per pool-shutdown call. Post-fix: exactly 1 fork regardless of tree depth. 2. **Snapshot consistency**: pre-fix BFS walked the table level by level; a child that forked between two adjacent BFS levels could be missed (we'd see the child but query its descendants AFTER the new fork). The snapshot path captures the table at one instant; new descendants forked after the snapshot are tolerated by the existing ESRCH-tolerant SIGTERM loop. Caveats: - `ps -A -o pid=,ppid=` is POSIX standard (macOS / Linux / *BSD), but BusyBox `ps` <v1.28 (2018) doesn't support `-o`. Distroless containers may not have `ps` at all. To preserve behavior on those edge platforms, the legacy per-pid `pgrep` BFS is retained as a fallback (`listDescendantPidsUnixPgrepFallback`). Same retention on Windows for the per-pid filter path. - Snapshot path uses `maxBuffer: 8MB` to cover ~250k-process pathological hosts. Default 1MB would clip at ~30k processes. - `MAX_DESCENDANTS = 256` / `MAX_DEPTH = 8` caps preserved on both snapshot + fallback paths. - Snapshot scans the entire host process table (not just the target subtree). On the typical 200-500 process developer machine this parses in <10ms; the win over BFS is real but not order-of-magnitude — ~2x improvement, not 100x. PR A's motivation framing is "fork hygiene + consistency", not raw perf. Empty-result detection: snapshot path tracks `parsedRows`. If the ps/CIM tool runs successfully but produces 0 parseable rows (BusyBox without `-o` echoing usage, AppLocker truncating CIM output, etc.), we throw — the outer catch falls back to the per-pid path. A genuine "root has no children" case parses many rows and just returns empty from the walk. So the "no-children-found" semantics are preserved across both paths. Test gate update: pre-fix `integration: spawn-and-enumerate` test skipped on `CI === '1'` because pgrep wasn't available on minimal CI runners. Post-fix `ps -A` is universally available on non-distroless Linux/macOS — only the Windows skip remains. 6/6 pid-descendants tests pass including the now-active integration spawn test. Design doc (`docs/design/f2-mcp-transport-pool.md` §6.4 + the F2 follow-up table at lines 82-85) updated to reflect the snapshot + fallback shape, and to mark W11 / W12 / R9 / R10 as ✅ Done in PR A with the per-fix commit refs. This commit completes F2 cleanup PR A. Filed bucket order: R9 (commit 0cb1eaa) → W11 (commit 2d546ef) → W12 (commit a4a855a) → R10 (this commit). Issue #4175 item 7 "F2 post- merge cleanup PRs": PR A done; PR B (W93 + W133-a + W134) and PR C (W133-c SDK breaking) to follow as separate clusters. Test sweep: 287/287 F2 + cli pass; ESLint clean; typecheck clean (core + cli). Integration test on macOS local runs the new snapshot path successfully. * refactor(core): F2 PR A R2 — wenshao followup (visited set + dedup predicate) Two Suggestions from wenshao's first PR #4411 review pass (07:15Z), both small and worth folding before merge: PR-A-R2 #1 (pid-descendants.ts:309 — walkDescendants visited set): `walkDescendants`'s BFS lacked a `visited` set. If the snapshot captures a PID-reuse cycle — rare but possible on busy hosts with rapid pid churn between `ps -A`'s start and parse, where Linux wraparound can show a freed pid in a different parent's children list creating an A→B / B→A cycle — pre-fix BFS would revisit nodes and fill the MAX_DESCENDANTS=256 quota with duplicate entries, starving legitimate descendants. Pre-PR-A the per-pid `pgrep` BFS had the same theoretical issue but was less exposed (each `pgrep -P pid` call returns only DIRECT children; snapshot captures the whole tree at once, making cycles instantly visible). Fix: 3-LOC `Set<number>` add. `root` seeded into `visited` so a malformed snapshot listing root as a descendant of its own child doesn't re-enqueue root either. PR-A-R2 #2 (session-mcp-view.ts:117 — predicate dedup): After W12, the exported `passesSessionFilter` / `passesSessionPromptFilter` still called `passesNameFilter` (the pre-W12 array-based implementation), while `applyTools` / `applyPrompts` used `compiledFilterAccepts(compileNameFilter(...))`. Two parallel implementations of the same predicate — future change to one without the other would silently diverge: - the exported function's tests (passesSessionFilter unit tests) would still pass - the production filter path in applyTools/applyPrompts would behave differently Reviewer also noted `passesSessionPromptFilter` had zero callers in production code or tests after W12 — `applyPrompts` no longer references it. Kept the export rather than deleting it (matches the `passesSessionFilter` shape for symmetry + the F3 audit-path comment block earmarks both as the replay predicates), but routed both through `compiledFilterAccepts(compileNameFilter(...))` so there is a single source of truth. Set construction is per-call for these exports (negligible for unit-test / one-off probes); the bulk paths in `applyTools` / `applyPrompts` still construct ONE filter per pass via the original W12 code path. `passesNameFilter` (the standalone array-based helper) deleted — its only callers were the two exports, which now use the compiled path. Public-API surface unchanged: the two exported functions keep their signatures and semantics. Test sweep: 19/19 pid-descendants + session-mcp-view tests pass; typecheck + ESLint clean. Continues commit chain: f059170 (R9) → 20d2f1b (W11) → 6cf18f6 (W12) → 2a41c6f (R10) → this (R2 followups). * fix(core): F2 PR A R3 T3 — Windows CSV delimiter locale fix `ConvertTo-Csv -NoTypeInformation` honors the system locale's list separator on PowerShell 5.1. On German / French / Dutch / Italian / ... locales the separator is `;` not `,`, so the regex `^"(\d+)","(\d+)"$` in `snapshotProcessTreeWin` never matched → `parsedRows === 0` → snapshot threw → fell back to the per-pid CIM filter path with ~0.5-1s extra PowerShell startup latency per descendant on every pool shutdown. Fix: 1-LOC `-Delimiter ","` on `ConvertTo-Csv`. Forces comma regardless of locale or PowerShell version. PowerShell 7+ defaults to comma already; 5.1 (the Windows-bundled version most users have without explicit upgrade) honored locale. The explicit delimiter makes both consistent. Skipped wenshao's companion Suggestion T4 (test coverage for walkDescendants MAX_DESCENDANTS / MAX_DEPTH caps) as F2 hardening follow-up — the caps are simple 2-line guards exercisable by inspection; ~50 LOC of mock infrastructure isn't commensurate with the regression risk on currently-stable defensive code, and (per the issue #4175 follow-up bucket) we keep dedicated test-coverage work out of perf-cleanup PRs. Continues commit chain: f059170 (R9) → 20d2f1b (W11) → 6cf18f6 (W12) → 2a41c6f (R10) → ced5d62 (R2) → this (R3 T3). Test sweep: 6/6 pid-descendants tests pass; typecheck + ESLint clean.
…4460) * fix(core): F2 cleanup PR B — self-heal observability (W133-a + W134) W93 declined as already satisfied by W1 fix in #4336 commit 6 (spawnEntry's catch already calls forceShutdown which runs the full cleanup table — listener removal, timer clear, subscriber detach, sweep+disconnect, onClosed eviction). Source-verified non-repro. W133-a: McpClient.onerror now captures the error in a private `lastTransportError` field (reset at each connect()); the W120 silent-drop block at mcp-pool-entry.ts:346 reads it via the new `getLastTransportError()` getter and appends `: <error.message>` to the lastError string on the emitted 'failed' event. Preserves the literal "silent transport drop" prefix invariant for log-grep backward compat — pre-fix marker stays a substring. W134: sweepAndDisconnect now returns SweepResult instead of void — { pidSweepError?, disconnectError?, descendantsFound?, descendantsSignaled? }. The silent-drop fire-and-forget caller chains to inspect the result and emits a structured warn log when either pid-sweep threw OR sigtermPids partially signaled (signaled < found) — surfaces orphan-process pressure without inflating PR scope (no new SSE event or SDK reducer state; deferred to W134-followup if maintainers want metrics). forceShutdown / doRestart sweep callers ignore the return value (JS implicit-void at await sites preserves behavior). 4 new tests in mcp-transport-pool.test.ts covering W133-a happy path + fallback (no prior onerror) + W134 pidSweepError + W134 partial-signal failure modes. Module-mocks pid-descendants.js for controllable sweep behavior, and debugLogger.js to observe warn calls (production logger is session-gated and a no-op in tests). Singleton-stub debugLogger mock so production module-load `createDebugLogger('McpPool:Entry')` and the test's retrieval get the same vi.fn instances. Verification: - tsc clean: packages/core, packages/cli (server.ts pre-existing errors unchanged) - F2 transport-pool: 32/32 pass (28 pre-existing + 4 new) - mcp-client: 46/46 pass - eslint --max-warnings 0 clean on 3 touched files Part of #4175 #4336 follow-up bucket. * fix(core): #4460 round 1 fold-in — 4 copilot doc/comment threads adopted T1 [copilot mcp-pool-entry.ts:116 — stale line ref in SweepResult JSDoc]: replaced `mcp-pool-entry.ts:383` with stable method-anchor reference to the W120 silent-drop block inside `statusChangeListener`. Line numbers drift on every edit; method names don't. T2 [copilot mcp-pool-entry.ts:453 — `?? 0` ambiguous in warn payload]: silent-drop warn log now prints `descendantsFound=unknown` and `descendantsSignaled=unknown` when the values are undefined (only reachable in the pidSweepError branch — sweep threw before assignment). Operators triaging the warn can now distinguish "sweep succeeded but found 0 descendants" from "sweep itself threw, count is genuinely unmeasured". Locked in via a new assertion in the W134 pidSweepError test. T3 [copilot mcp-client.ts:116 — brittle line refs in lastTransportError JSDoc]: replaced `mcp-pool-entry.ts:346` and `mcp-client.ts:130` with stable method/block names (the `statusChangeListener` silent- drop block; the `client.onerror` arrow inside connect()). Same fix applied to the parallel comment in mcp-transport-pool.test.ts:730 for consistency. T4 [copilot mcp-transport-pool.test.ts:797 — singleton-stub mock comment contradictory]: rewrote the comment to unambiguously describe what the mock DOES (factory body runs once; inner arrow returns the same object on every call) instead of the prior hypothetical phrasing ("Returning a fresh object would have...") which read as a description of current behavior at first glance. All 4 are doc/comment fixes — zero behavior change apart from the T2 string format ('unknown' instead of '0'). Verified: - 32/32 mcp-transport-pool.test.ts pass - tsc clean on packages/core - eslint --max-warnings 0 clean on 3 touched files * fix(core): #4460 round 2 fold-in — remove dead SweepResult.disconnectError field T5 [wenshao mcp-pool-entry.ts:134 — `disconnectError` is dead data]: glm-5.1 review caught that the field was populated when `client.disconnect()` threw (line 844) but no consumer ever read it — the silent-drop `.then()` handler gated only on `pidSweepError` and partial-signal; `forceShutdown` and `doRestart` ignore the return; no test asserted on it. Removed the field from `SweepResult` and the assignment in the disconnect catch. The pre-existing `debugLogger.error(`client.disconnect failed for ...`)` inside `sweepAndDisconnect` already gives operators the signal — adding it to the outer silent-drop warn would have been duplicate noise. If a future consumer needs to gate logic on disconnect failures, re-add the field + reader at that point. Verification: 32/32 mcp-transport-pool.test.ts pass; tsc + eslint clean on the touched file.
…4411) * refactor(core): F2 PR A R9 — McpClientManager options-object ctor R9 (filed as F2 follow-up from #4336 review): 7 positional ctor args collapse to (config, toolRegistry, options?: McpClientManagerOptions). The trailing 5 (eventEmitter, sendSdkMcpMessage, healthConfig, budgetConfig, pool) become named fields on `McpClientManagerOptions`. Test factory `mkManager(overrides?)` introduced at the top of `mcp-client-manager.test.ts` so each of the prior 80 inline constructions becomes a single line naming only the field(s) the test overrides; the 4 `undefined` sentinels each test threaded through to reach the trailing `pool` arg are gone. Net: 113 LOC removed (test) + 35 LOC added (src exposes interface + mkManager factory + tool-registry call site update). Behavior unchanged — same field assignments, same downgrade-enforce-without- budget breadcrumb, same budget event wiring. Filed bucket: F2 perf / cleanup PR A (R9 + W11 + W12 + R10/R23 T7), see issue #4175 item 7 "F2 post-merge cleanup PRs". This is the first of the 4 fixes in PR A; W11/W12/R10 follow as separate commits. Test sweep: 84/84 mcp-client-manager.test.ts pass; typecheck clean. * refactor(core): F2 PR A W11 — extract attachPooledSession + rollbackReservationOnSpawnFailure W11 (filed as F2 follow-up from #4336 review): two private helpers on `McpTransportPool` to eliminate inline duplication in `acquire()`: - `attachPooledSession(entry, id, serverName, cfg, sessionId, toolReg, promptReg)`: builds `SessionMcpView` + `entry.attach` with the standard pool release callback. Used by both the fast-path attach (existing entry) and the post-spawn attach (after `await inFlight`). NOT used by `createUnpooledConnection` — its release callback runs `entry.forceShutdown('manual')` + `indexDetach` directly (no pool refcount accounting since unpooled entries are per-session). - `rollbackReservationOnSpawnFailure(reservationResult, serverName)`: R24 T17 contract — only release the budget slot if THIS acquire actually reserved a new slot (`'reserved'`); `'already_held'` skips because the sibling owns it. Used by both the unpooled catch and the pooled spawn-in-flight catch. Race-window invariants (W10 / W77 / W90 / W111 / W125 / R24 T17) stay at the call sites because they describe the SURROUNDING ordering, not the helpers themselves. Helpers are documented to defer those decisions back to callers. Behavior unchanged. Filed bucket: F2 perf cleanup PR A (R9 done / W11 this commit / W12 + R10 to follow). Test sweep: 28/28 mcp-transport-pool.test.ts pass; typecheck clean. * refactor(core): F2 PR A W12 — SessionMcpView precompute filter Sets W12 (filed as F2 follow-up from #4336 review): `applyTools` / `applyPrompts` precompute `excludeSet` + `includeSet` once per pass instead of scanning `cfg.includeTools` / `cfg.excludeTools` arrays inside every per-tool iteration. Pre-fix the per-tool predicate (`passesSessionFilter`) walked both arrays for every snapshot entry → O(M × N) per `applyTools` call. With M tools × N filter entries, typical M=5-20 / N=2-5 case finishes in microseconds either way; the win is data-structure correctness and code clarity, not perceived perf. `passesSessionFilter` / `passesSessionPromptFilter` (the array- based predicates) stay exported and unchanged for unit tests + any caller wanting to test a single name without paying Set construction. The bulk path uses two new private helpers `compileNameFilter` + `compiledFilterAccepts` whose Sets live on the `applyTools` / `applyPrompts` stack frame. Same semantics: `excludeTools` is direct-equality match (no parens strip — pre-F2 behavior preserved); `includeTools` strips the first `(...)` suffix so `toolName(args)` matches `toolName`. Filed bucket: F2 perf cleanup PR A (R9 + W11 done / W12 this commit / R10 to follow). Test sweep: 13/13 session-mcp-view.test.ts pass; typecheck clean. * perf(core): F2 PR A R10 / R23 T7 — pid-descendants ps snapshot + pgrep fallback R10 / R23 T7 (filed as F2 follow-up from #4336 review): the Linux / macOS pid-descendant enumeration moves from per-pid `pgrep -P <pid>` BFS (one subprocess fork per node visited) to a single `ps -A -o pid=,ppid=` snapshot followed by an in-memory tree walk over `Map<ppid, pid[]>`. Windows analog: single `Get-CimInstance Win32_Process | ConvertTo-Csv` snapshot of all `(ProcessId, ParentProcessId)` rows replaces per-pid `Get-CimInstance -Filter "ParentProcessId=$p"` BFS. Two motivations: 1. **Fork count**: typical `npx → tool` / `uvx → tool` wrapper trees are 2-3 levels deep with B=1-3 children per node → pre-fix BFS forked ~5-10 subprocesses per pool-shutdown call. Post-fix: exactly 1 fork regardless of tree depth. 2. **Snapshot consistency**: pre-fix BFS walked the table level by level; a child that forked between two adjacent BFS levels could be missed (we'd see the child but query its descendants AFTER the new fork). The snapshot path captures the table at one instant; new descendants forked after the snapshot are tolerated by the existing ESRCH-tolerant SIGTERM loop. Caveats: - `ps -A -o pid=,ppid=` is POSIX standard (macOS / Linux / *BSD), but BusyBox `ps` <v1.28 (2018) doesn't support `-o`. Distroless containers may not have `ps` at all. To preserve behavior on those edge platforms, the legacy per-pid `pgrep` BFS is retained as a fallback (`listDescendantPidsUnixPgrepFallback`). Same retention on Windows for the per-pid filter path. - Snapshot path uses `maxBuffer: 8MB` to cover ~250k-process pathological hosts. Default 1MB would clip at ~30k processes. - `MAX_DESCENDANTS = 256` / `MAX_DEPTH = 8` caps preserved on both snapshot + fallback paths. - Snapshot scans the entire host process table (not just the target subtree). On the typical 200-500 process developer machine this parses in <10ms; the win over BFS is real but not order-of-magnitude — ~2x improvement, not 100x. PR A's motivation framing is "fork hygiene + consistency", not raw perf. Empty-result detection: snapshot path tracks `parsedRows`. If the ps/CIM tool runs successfully but produces 0 parseable rows (BusyBox without `-o` echoing usage, AppLocker truncating CIM output, etc.), we throw — the outer catch falls back to the per-pid path. A genuine "root has no children" case parses many rows and just returns empty from the walk. So the "no-children-found" semantics are preserved across both paths. Test gate update: pre-fix `integration: spawn-and-enumerate` test skipped on `CI === '1'` because pgrep wasn't available on minimal CI runners. Post-fix `ps -A` is universally available on non-distroless Linux/macOS — only the Windows skip remains. 6/6 pid-descendants tests pass including the now-active integration spawn test. Design doc (`docs/design/f2-mcp-transport-pool.md` §6.4 + the F2 follow-up table at lines 82-85) updated to reflect the snapshot + fallback shape, and to mark W11 / W12 / R9 / R10 as ✅ Done in PR A with the per-fix commit refs. This commit completes F2 cleanup PR A. Filed bucket order: R9 (commit 0cb1eaa) → W11 (commit 2d546ef) → W12 (commit a4a855a) → R10 (this commit). Issue #4175 item 7 "F2 post- merge cleanup PRs": PR A done; PR B (W93 + W133-a + W134) and PR C (W133-c SDK breaking) to follow as separate clusters. Test sweep: 287/287 F2 + cli pass; ESLint clean; typecheck clean (core + cli). Integration test on macOS local runs the new snapshot path successfully. * refactor(core): F2 PR A R2 — wenshao followup (visited set + dedup predicate) Two Suggestions from wenshao's first PR #4411 review pass (07:15Z), both small and worth folding before merge: PR-A-R2 #1 (pid-descendants.ts:309 — walkDescendants visited set): `walkDescendants`'s BFS lacked a `visited` set. If the snapshot captures a PID-reuse cycle — rare but possible on busy hosts with rapid pid churn between `ps -A`'s start and parse, where Linux wraparound can show a freed pid in a different parent's children list creating an A→B / B→A cycle — pre-fix BFS would revisit nodes and fill the MAX_DESCENDANTS=256 quota with duplicate entries, starving legitimate descendants. Pre-PR-A the per-pid `pgrep` BFS had the same theoretical issue but was less exposed (each `pgrep -P pid` call returns only DIRECT children; snapshot captures the whole tree at once, making cycles instantly visible). Fix: 3-LOC `Set<number>` add. `root` seeded into `visited` so a malformed snapshot listing root as a descendant of its own child doesn't re-enqueue root either. PR-A-R2 #2 (session-mcp-view.ts:117 — predicate dedup): After W12, the exported `passesSessionFilter` / `passesSessionPromptFilter` still called `passesNameFilter` (the pre-W12 array-based implementation), while `applyTools` / `applyPrompts` used `compiledFilterAccepts(compileNameFilter(...))`. Two parallel implementations of the same predicate — future change to one without the other would silently diverge: - the exported function's tests (passesSessionFilter unit tests) would still pass - the production filter path in applyTools/applyPrompts would behave differently Reviewer also noted `passesSessionPromptFilter` had zero callers in production code or tests after W12 — `applyPrompts` no longer references it. Kept the export rather than deleting it (matches the `passesSessionFilter` shape for symmetry + the F3 audit-path comment block earmarks both as the replay predicates), but routed both through `compiledFilterAccepts(compileNameFilter(...))` so there is a single source of truth. Set construction is per-call for these exports (negligible for unit-test / one-off probes); the bulk paths in `applyTools` / `applyPrompts` still construct ONE filter per pass via the original W12 code path. `passesNameFilter` (the standalone array-based helper) deleted — its only callers were the two exports, which now use the compiled path. Public-API surface unchanged: the two exported functions keep their signatures and semantics. Test sweep: 19/19 pid-descendants + session-mcp-view tests pass; typecheck + ESLint clean. Continues commit chain: f059170 (R9) → 20d2f1b (W11) → 6cf18f6 (W12) → 2a41c6f (R10) → this (R2 followups). * fix(core): F2 PR A R3 T3 — Windows CSV delimiter locale fix `ConvertTo-Csv -NoTypeInformation` honors the system locale's list separator on PowerShell 5.1. On German / French / Dutch / Italian / ... locales the separator is `;` not `,`, so the regex `^"(\d+)","(\d+)"$` in `snapshotProcessTreeWin` never matched → `parsedRows === 0` → snapshot threw → fell back to the per-pid CIM filter path with ~0.5-1s extra PowerShell startup latency per descendant on every pool shutdown. Fix: 1-LOC `-Delimiter ","` on `ConvertTo-Csv`. Forces comma regardless of locale or PowerShell version. PowerShell 7+ defaults to comma already; 5.1 (the Windows-bundled version most users have without explicit upgrade) honored locale. The explicit delimiter makes both consistent. Skipped wenshao's companion Suggestion T4 (test coverage for walkDescendants MAX_DESCENDANTS / MAX_DEPTH caps) as F2 hardening follow-up — the caps are simple 2-line guards exercisable by inspection; ~50 LOC of mock infrastructure isn't commensurate with the regression risk on currently-stable defensive code, and (per the issue #4175 follow-up bucket) we keep dedicated test-coverage work out of perf-cleanup PRs. Continues commit chain: f059170 (R9) → 20d2f1b (W11) → 6cf18f6 (W12) → 2a41c6f (R10) → ced5d62 (R2) → this (R3 T3). Test sweep: 6/6 pid-descendants tests pass; typecheck + ESLint clean.
…4460) * fix(core): F2 cleanup PR B — self-heal observability (W133-a + W134) W93 declined as already satisfied by W1 fix in #4336 commit 6 (spawnEntry's catch already calls forceShutdown which runs the full cleanup table — listener removal, timer clear, subscriber detach, sweep+disconnect, onClosed eviction). Source-verified non-repro. W133-a: McpClient.onerror now captures the error in a private `lastTransportError` field (reset at each connect()); the W120 silent-drop block at mcp-pool-entry.ts:346 reads it via the new `getLastTransportError()` getter and appends `: <error.message>` to the lastError string on the emitted 'failed' event. Preserves the literal "silent transport drop" prefix invariant for log-grep backward compat — pre-fix marker stays a substring. W134: sweepAndDisconnect now returns SweepResult instead of void — { pidSweepError?, disconnectError?, descendantsFound?, descendantsSignaled? }. The silent-drop fire-and-forget caller chains to inspect the result and emits a structured warn log when either pid-sweep threw OR sigtermPids partially signaled (signaled < found) — surfaces orphan-process pressure without inflating PR scope (no new SSE event or SDK reducer state; deferred to W134-followup if maintainers want metrics). forceShutdown / doRestart sweep callers ignore the return value (JS implicit-void at await sites preserves behavior). 4 new tests in mcp-transport-pool.test.ts covering W133-a happy path + fallback (no prior onerror) + W134 pidSweepError + W134 partial-signal failure modes. Module-mocks pid-descendants.js for controllable sweep behavior, and debugLogger.js to observe warn calls (production logger is session-gated and a no-op in tests). Singleton-stub debugLogger mock so production module-load `createDebugLogger('McpPool:Entry')` and the test's retrieval get the same vi.fn instances. Verification: - tsc clean: packages/core, packages/cli (server.ts pre-existing errors unchanged) - F2 transport-pool: 32/32 pass (28 pre-existing + 4 new) - mcp-client: 46/46 pass - eslint --max-warnings 0 clean on 3 touched files Part of #4175 #4336 follow-up bucket. * fix(core): #4460 round 1 fold-in — 4 copilot doc/comment threads adopted T1 [copilot mcp-pool-entry.ts:116 — stale line ref in SweepResult JSDoc]: replaced `mcp-pool-entry.ts:383` with stable method-anchor reference to the W120 silent-drop block inside `statusChangeListener`. Line numbers drift on every edit; method names don't. T2 [copilot mcp-pool-entry.ts:453 — `?? 0` ambiguous in warn payload]: silent-drop warn log now prints `descendantsFound=unknown` and `descendantsSignaled=unknown` when the values are undefined (only reachable in the pidSweepError branch — sweep threw before assignment). Operators triaging the warn can now distinguish "sweep succeeded but found 0 descendants" from "sweep itself threw, count is genuinely unmeasured". Locked in via a new assertion in the W134 pidSweepError test. T3 [copilot mcp-client.ts:116 — brittle line refs in lastTransportError JSDoc]: replaced `mcp-pool-entry.ts:346` and `mcp-client.ts:130` with stable method/block names (the `statusChangeListener` silent- drop block; the `client.onerror` arrow inside connect()). Same fix applied to the parallel comment in mcp-transport-pool.test.ts:730 for consistency. T4 [copilot mcp-transport-pool.test.ts:797 — singleton-stub mock comment contradictory]: rewrote the comment to unambiguously describe what the mock DOES (factory body runs once; inner arrow returns the same object on every call) instead of the prior hypothetical phrasing ("Returning a fresh object would have...") which read as a description of current behavior at first glance. All 4 are doc/comment fixes — zero behavior change apart from the T2 string format ('unknown' instead of '0'). Verified: - 32/32 mcp-transport-pool.test.ts pass - tsc clean on packages/core - eslint --max-warnings 0 clean on 3 touched files * fix(core): #4460 round 2 fold-in — remove dead SweepResult.disconnectError field T5 [wenshao mcp-pool-entry.ts:134 — `disconnectError` is dead data]: glm-5.1 review caught that the field was populated when `client.disconnect()` threw (line 844) but no consumer ever read it — the silent-drop `.then()` handler gated only on `pidSweepError` and partial-signal; `forceShutdown` and `doRestart` ignore the return; no test asserted on it. Removed the field from `SweepResult` and the assignment in the disconnect catch. The pre-existing `debugLogger.error(`client.disconnect failed for ...`)` inside `sweepAndDisconnect` already gives operators the signal — adding it to the outer silent-drop warn would have been duplicate noise. If a future consumer needs to gate logic on disconnect failures, re-add the field + reader at that point. Verification: 32/32 mcp-transport-pool.test.ts pass; tsc + eslint clean on the touched file.
* perf(core): F2 cleanup PR A — R9/W11/W12/R10 (post-merge follow-ups) (#4411)
* refactor(core): F2 PR A R9 — McpClientManager options-object ctor
R9 (filed as F2 follow-up from #4336 review): 7 positional ctor args
collapse to (config, toolRegistry, options?: McpClientManagerOptions).
The trailing 5 (eventEmitter, sendSdkMcpMessage, healthConfig,
budgetConfig, pool) become named fields on `McpClientManagerOptions`.
Test factory `mkManager(overrides?)` introduced at the top of
`mcp-client-manager.test.ts` so each of the prior 80 inline
constructions becomes a single line naming only the field(s) the test
overrides; the 4 `undefined` sentinels each test threaded through to
reach the trailing `pool` arg are gone.
Net: 113 LOC removed (test) + 35 LOC added (src exposes interface +
mkManager factory + tool-registry call site update). Behavior
unchanged — same field assignments, same downgrade-enforce-without-
budget breadcrumb, same budget event wiring.
Filed bucket: F2 perf / cleanup PR A (R9 + W11 + W12 + R10/R23 T7),
see issue #4175 item 7 "F2 post-merge cleanup PRs". This is the first
of the 4 fixes in PR A; W11/W12/R10 follow as separate commits.
Test sweep: 84/84 mcp-client-manager.test.ts pass; typecheck clean.
* refactor(core): F2 PR A W11 — extract attachPooledSession + rollbackReservationOnSpawnFailure
W11 (filed as F2 follow-up from #4336 review): two private helpers
on `McpTransportPool` to eliminate inline duplication in `acquire()`:
- `attachPooledSession(entry, id, serverName, cfg, sessionId,
toolReg, promptReg)`: builds `SessionMcpView` + `entry.attach`
with the standard pool release callback. Used by both the
fast-path attach (existing entry) and the post-spawn attach
(after `await inFlight`). NOT used by `createUnpooledConnection`
— its release callback runs `entry.forceShutdown('manual')` +
`indexDetach` directly (no pool refcount accounting since
unpooled entries are per-session).
- `rollbackReservationOnSpawnFailure(reservationResult, serverName)`:
R24 T17 contract — only release the budget slot if THIS acquire
actually reserved a new slot (`'reserved'`); `'already_held'`
skips because the sibling owns it. Used by both the unpooled
catch and the pooled spawn-in-flight catch.
Race-window invariants (W10 / W77 / W90 / W111 / W125 / R24 T17)
stay at the call sites because they describe the SURROUNDING
ordering, not the helpers themselves. Helpers are documented to
defer those decisions back to callers.
Behavior unchanged. Filed bucket: F2 perf cleanup PR A (R9 done /
W11 this commit / W12 + R10 to follow).
Test sweep: 28/28 mcp-transport-pool.test.ts pass; typecheck clean.
* refactor(core): F2 PR A W12 — SessionMcpView precompute filter Sets
W12 (filed as F2 follow-up from #4336 review): `applyTools` /
`applyPrompts` precompute `excludeSet` + `includeSet` once per pass
instead of scanning `cfg.includeTools` / `cfg.excludeTools` arrays
inside every per-tool iteration.
Pre-fix the per-tool predicate (`passesSessionFilter`) walked both
arrays for every snapshot entry → O(M × N) per `applyTools` call.
With M tools × N filter entries, typical M=5-20 / N=2-5 case
finishes in microseconds either way; the win is data-structure
correctness and code clarity, not perceived perf.
`passesSessionFilter` / `passesSessionPromptFilter` (the array-
based predicates) stay exported and unchanged for unit tests + any
caller wanting to test a single name without paying Set construction.
The bulk path uses two new private helpers `compileNameFilter` +
`compiledFilterAccepts` whose Sets live on the `applyTools` /
`applyPrompts` stack frame.
Same semantics: `excludeTools` is direct-equality match (no parens
strip — pre-F2 behavior preserved); `includeTools` strips the first
`(...)` suffix so `toolName(args)` matches `toolName`.
Filed bucket: F2 perf cleanup PR A (R9 + W11 done / W12 this commit
/ R10 to follow).
Test sweep: 13/13 session-mcp-view.test.ts pass; typecheck clean.
* perf(core): F2 PR A R10 / R23 T7 — pid-descendants ps snapshot + pgrep fallback
R10 / R23 T7 (filed as F2 follow-up from #4336 review): the Linux
/ macOS pid-descendant enumeration moves from per-pid `pgrep -P
<pid>` BFS (one subprocess fork per node visited) to a single
`ps -A -o pid=,ppid=` snapshot followed by an in-memory tree walk
over `Map<ppid, pid[]>`. Windows analog: single `Get-CimInstance
Win32_Process | ConvertTo-Csv` snapshot of all `(ProcessId,
ParentProcessId)` rows replaces per-pid
`Get-CimInstance -Filter "ParentProcessId=$p"` BFS.
Two motivations:
1. **Fork count**: typical `npx → tool` / `uvx → tool` wrapper
trees are 2-3 levels deep with B=1-3 children per node →
pre-fix BFS forked ~5-10 subprocesses per pool-shutdown call.
Post-fix: exactly 1 fork regardless of tree depth.
2. **Snapshot consistency**: pre-fix BFS walked the table level
by level; a child that forked between two adjacent BFS levels
could be missed (we'd see the child but query its
descendants AFTER the new fork). The snapshot path captures
the table at one instant; new descendants forked after the
snapshot are tolerated by the existing ESRCH-tolerant
SIGTERM loop.
Caveats:
- `ps -A -o pid=,ppid=` is POSIX standard (macOS / Linux /
*BSD), but BusyBox `ps` <v1.28 (2018) doesn't support `-o`.
Distroless containers may not have `ps` at all. To preserve
behavior on those edge platforms, the legacy per-pid `pgrep`
BFS is retained as a fallback (`listDescendantPidsUnixPgrepFallback`).
Same retention on Windows for the per-pid filter path.
- Snapshot path uses `maxBuffer: 8MB` to cover ~250k-process
pathological hosts. Default 1MB would clip at ~30k processes.
- `MAX_DESCENDANTS = 256` / `MAX_DEPTH = 8` caps preserved on
both snapshot + fallback paths.
- Snapshot scans the entire host process table (not just the
target subtree). On the typical 200-500 process developer
machine this parses in <10ms; the win over BFS is real but
not order-of-magnitude — ~2x improvement, not 100x. PR A's
motivation framing is "fork hygiene + consistency", not raw
perf.
Empty-result detection: snapshot path tracks `parsedRows`. If the
ps/CIM tool runs successfully but produces 0 parseable rows
(BusyBox without `-o` echoing usage, AppLocker truncating CIM
output, etc.), we throw — the outer catch falls back to the
per-pid path. A genuine "root has no children" case parses many
rows and just returns empty from the walk. So the
"no-children-found" semantics are preserved across both paths.
Test gate update: pre-fix `integration: spawn-and-enumerate` test
skipped on `CI === '1'` because pgrep wasn't available on
minimal CI runners. Post-fix `ps -A` is universally available on
non-distroless Linux/macOS — only the Windows skip remains.
6/6 pid-descendants tests pass including the now-active
integration spawn test.
Design doc (`docs/design/f2-mcp-transport-pool.md` §6.4 + the F2
follow-up table at lines 82-85) updated to reflect the snapshot
+ fallback shape, and to mark W11 / W12 / R9 / R10 as ✅ Done in
PR A with the per-fix commit refs.
This commit completes F2 cleanup PR A. Filed bucket order:
R9 (commit 0cb1eaa27) → W11 (commit 2d546efca) → W12 (commit
a4a855ab3) → R10 (this commit). Issue #4175 item 7 "F2 post-
merge cleanup PRs": PR A done; PR B (W93 + W133-a + W134) and
PR C (W133-c SDK breaking) to follow as separate clusters.
Test sweep: 287/287 F2 + cli pass; ESLint clean; typecheck clean
(core + cli). Integration test on macOS local runs the new
snapshot path successfully.
* refactor(core): F2 PR A R2 — wenshao followup (visited set + dedup predicate)
Two Suggestions from wenshao's first PR #4411 review pass (07:15Z),
both small and worth folding before merge:
PR-A-R2 #1 (pid-descendants.ts:309 — walkDescendants visited set):
`walkDescendants`'s BFS lacked a `visited` set. If the snapshot
captures a PID-reuse cycle — rare but possible on busy hosts with
rapid pid churn between `ps -A`'s start and parse, where Linux
wraparound can show a freed pid in a different parent's children
list creating an A→B / B→A cycle — pre-fix BFS would revisit nodes
and fill the MAX_DESCENDANTS=256 quota with duplicate entries,
starving legitimate descendants. Pre-PR-A the per-pid `pgrep` BFS
had the same theoretical issue but was less exposed (each
`pgrep -P pid` call returns only DIRECT children; snapshot captures
the whole tree at once, making cycles instantly visible).
Fix: 3-LOC `Set<number>` add. `root` seeded into `visited` so a
malformed snapshot listing root as a descendant of its own child
doesn't re-enqueue root either.
PR-A-R2 #2 (session-mcp-view.ts:117 — predicate dedup):
After W12, the exported `passesSessionFilter` /
`passesSessionPromptFilter` still called `passesNameFilter` (the
pre-W12 array-based implementation), while `applyTools` /
`applyPrompts` used `compiledFilterAccepts(compileNameFilter(...))`.
Two parallel implementations of the same predicate — future change
to one without the other would silently diverge:
- the exported function's tests (passesSessionFilter unit tests)
would still pass
- the production filter path in applyTools/applyPrompts would
behave differently
Reviewer also noted `passesSessionPromptFilter` had zero callers
in production code or tests after W12 — `applyPrompts` no longer
references it. Kept the export rather than deleting it (matches
the `passesSessionFilter` shape for symmetry + the F3 audit-path
comment block earmarks both as the replay predicates), but routed
both through `compiledFilterAccepts(compileNameFilter(...))` so
there is a single source of truth. Set construction is per-call
for these exports (negligible for unit-test / one-off probes);
the bulk paths in `applyTools` / `applyPrompts` still construct
ONE filter per pass via the original W12 code path.
`passesNameFilter` (the standalone array-based helper) deleted —
its only callers were the two exports, which now use the compiled
path. Public-API surface unchanged: the two exported functions
keep their signatures and semantics.
Test sweep: 19/19 pid-descendants + session-mcp-view tests pass;
typecheck + ESLint clean.
Continues commit chain: f05917071 (R9) → 20d2f1b90 (W11) →
6cf18f641 (W12) → 2a41c6fae (R10) → this (R2 followups).
* fix(core): F2 PR A R3 T3 — Windows CSV delimiter locale fix
`ConvertTo-Csv -NoTypeInformation` honors the system locale's list
separator on PowerShell 5.1. On German / French / Dutch / Italian /
... locales the separator is `;` not `,`, so the regex
`^"(\d+)","(\d+)"$` in `snapshotProcessTreeWin` never matched →
`parsedRows === 0` → snapshot threw → fell back to the per-pid CIM
filter path with ~0.5-1s extra PowerShell startup latency per
descendant on every pool shutdown.
Fix: 1-LOC `-Delimiter ","` on `ConvertTo-Csv`. Forces comma
regardless of locale or PowerShell version. PowerShell 7+ defaults
to comma already; 5.1 (the Windows-bundled version most users have
without explicit upgrade) honored locale. The explicit delimiter
makes both consistent.
Skipped wenshao's companion Suggestion T4 (test coverage for
walkDescendants MAX_DESCENDANTS / MAX_DEPTH caps) as F2 hardening
follow-up — the caps are simple 2-line guards exercisable by
inspection; ~50 LOC of mock infrastructure isn't commensurate
with the regression risk on currently-stable defensive code,
and (per the issue #4175 follow-up bucket) we keep dedicated
test-coverage work out of perf-cleanup PRs.
Continues commit chain: f05917071 (R9) → 20d2f1b90 (W11) →
6cf18f641 (W12) → 2a41c6fae (R10) → ced5d62b0 (R2) → this (R3 T3).
Test sweep: 6/6 pid-descendants tests pass; typecheck + ESLint clean.
* refactor(acp-bridge): F1 test split — lift bridge.test.ts (6861 LOC) to acp-bridge (#4445)
* refactor(acp-bridge): rename httpAcpBridge.test.ts -> bridge.test.ts (git mv)
Pure file rename; zero content change. Follow-up commits will:
- extract FakeAgent + makeChannel + makeBridge into testUtils.ts
- split 4 daemon-host integration tests back to cli/daemonStatusProvider.test.ts
Part of #4175 F1 test split (deferred from #4334).
* refactor(acp-bridge): extract testUtils + split daemon-host tests to cli (#4175 F1)
Net mechanical extraction following commit 2aff1a4d1 (pure git mv of
httpAcpBridge.test.ts -> bridge.test.ts). After this commit
`@qwen-code/acp-bridge` owns the bulk of the lifted bridge test
suite, and cli keeps only the 4 daemon-host integration tests that
need to wire `createDaemonStatusProvider()`.
Changes:
1. New `packages/acp-bridge/src/internal/testUtils.ts` (~280 LOC):
FakeAgent, FakeAgentOpts, ChannelHandle, makeChannel, makeBridge
(no statusProvider default — acp-bridge tests exercise the
no-provider fallback path), WS_A/WS_B/SESS_A constants. Marked
@internal; lives under `internal/` matching the existing
`stderrLine.ts` package-private convention. Exposed via new
`./internal/testUtils` subpath in package.json exports.
2. `packages/acp-bridge/src/bridge.test.ts` shrinks from 6861 ->
~6400 LOC: fixtures replaced with named imports from
`./internal/testUtils.js`; cross-package import
`from './daemonStatusProvider.js'` removed (4 daemon-host tests
moved out); ACP SDK + bridgeErrors / workspacePaths / bridge /
channel / bridgeTypes imports split into multiple statements
reflecting actual post-F1 provenance.
3. New `packages/cli/src/serve/daemonStatusProvider.test.ts`
(~240 LOC, 4 tests): wires real `createDaemonStatusProvider()`
through a cli-side `makeBridge` wrapper to assert end-to-end
daemon env / preflight cells. Imports
`createHttpAcpBridge` via the `./httpAcpBridge.js` re-export
shim — doubles as a shim surface smoke check.
Verification:
- acp-bridge: 291/291 tests pass (177 in bridge.test.ts).
- cli: daemonStatusProvider.test.ts 4/4 pass; full cli suite 6742/6767
green (16 pre-existing failures in AuthDialog / memoryDiagnostics /
useAtCompletion — all on `daemon_mode_b_main` baseline, last
modified by commits predating this branch).
- Tests counts pre-split: 181 in httpAcpBridge.test.ts;
post-split: 177 in bridge.test.ts + 4 in daemonStatusProvider.test.ts
= 181 (parity preserved).
Part of #4175 F1 test split (deferred from #4334).
* refactor(acp-bridge): self-review round 1 — vitest alias + doc/comment polish
Five code-reviewer findings folded in on top of e97282f30:
S1 [Suggestion] — Test-utils ships to npm + cli reads stale dist.
Added `packages/cli/vitest.config.ts:resolve.alias` mapping
`@qwen-code/acp-bridge/internal/testUtils` → the .ts source. The
package subpath export is RETAINED (required for TypeScript
`nodenext` to resolve types — it won't fall back to tsconfig
paths once exports rejects a subpath). Dual-channel approach
documented in the testUtils JSDoc, including the alpha-stage 0.0.1
tradeoff that the file still ships in dist (stripInternal /
.npmignore deferred).
S2 [Suggestion] — Stale wording "two tests" in narrative comment.
bridge.test.ts split-marker now correctly says "4 fallback tests"
(no-provider × 2 surfaces + throwing-provider × 2 surfaces).
S3 [Suggestion] — "Shim smoke check" only half-applied.
daemonStatusProvider.test.ts now routes `BridgeOptions` and
`HttpAcpBridge` types through `./httpAcpBridge.js` shim too
(alongside `createHttpAcpBridge`), so the entire factory surface
the cli tests rely on flows through the F1 re-export shim.
N1 [Nit] — Asymmetric split-marker phrasing.
Both markers now describe the 4 moved tests by surface
(env real / preflight idle / preflight merged-live /
preflight extMethod-throws) rather than "1 of" + "3 more".
N2 [Nit] — testUtils "the suite" ambiguity.
makeChannel JSDoc now references `bridge.test.ts` explicitly
instead of "the suite" (which was unambiguous pre-split when
helpers + 10 createInMemoryChannel sites lived in the same file).
Verification: 291/291 acp-bridge tests pass; 4/4 cli daemon
integration tests pass; tsc clean on both packages (pre-existing
server.ts errors on baseline unchanged); eslint --max-warnings 0
clean on all 4 touched files.
* docs(cli): self-review round 2 — fix stale vitest.config.ts alias comment
Round 2 reviewer caught a 3-way contradiction in the round 1 docs:
- vitest.config.ts said: alias replaces the export, internal/* stays
unpublished (matches stderrLine convention).
- package.json: subpath export IS declared.
- testUtils.ts JSDoc: both channels intentionally retained,
testUtils ships in dist.
Round 1 explicitly chose to retain the export because TS `nodenext`
won't fall back to tsconfig `paths` once `exports` rejects a
subpath; the alias only serves to short-circuit *runtime* resolution
so cli reads src/ not dist/. Rewriting the vitest.config.ts comment
to reflect that dual-channel reality (and pointing readers at
testUtils.ts for the full rationale).
* fix(acp-bridge): #4445 round 3 fold-in — 4 of 7 reviewer threads adopted
PR #4445 review pass — 4 adopt + 3 decline (declines replied
inline; not folded here):
ADOPTED:
T1 [copilot daemonStatusProvider.test.ts:136 — bridge.shutdown
missing]: added `await bridge.shutdown()` to test 2 (preflight
idle). Three of four tests already shut down; symmetry +
future-proof if `createHttpAcpBridge` gains background work
even when no channel was spawned.
T5 [wenshao testUtils.ts:92 — makeBridge naming collision]: cli-
side helper renamed `makeBridge` -> `makeBridgeWithDaemonStatusProvider`
(4 call sites in daemonStatusProvider.test.ts), JSDoc updated to
reference the wenshao thread. testUtils.makeBridge stays as the
canonical name used by ~100 tests in bridge.test.ts. A future
contributor can no longer pick the wrong helper by accident.
T6 [wenshao testUtils.ts:32 — JSDoc mis-claims @internal tag matches
stderrLine.ts convention]: fixed wording. stderrLine.ts uses prose
only; @internal is an additional package-private signal, not a
convention match. Also restructured the npm-leak paragraph to
describe the new .npmignore-via-files-negation enforcement (T7).
T7 [wenshao package.json:70 — testUtils ships to npm]: switched
`files: ["dist"]` -> `files: ["dist", "!dist/internal/testUtils.*",
"!dist/**/*.test.*"]`. Wenshao's suggested `"test"` exports
condition wasn't viable: vitest sets `vitest` not `test`, and
gating on `vitest` would hide types from the cli's tsc compile.
The negation-pattern files-field excludes the built testUtils
from the publish surface while keeping the subpath export entry
that TypeScript `nodenext` needs to resolve types. Verified via
`npm pack --dry-run`: dist/internal/stderrLine.* still ships
(production internal helper); dist/internal/testUtils.* +
dist/**/*.test.* are excluded.
DECLINED (replied on PR threads, not folded here):
T2/T3 [copilot — `handles` array unused in tests 3/4]: bookkeeping
matches the pre-split bridge.test.ts verbatim; cleanup is scope
creep on this rename PR.
T4 [copilot — testUtils eager-imports createHttpAcpBridge,
cross-copy identity risk]: cli daemonStatusProvider.test.ts uses
its OWN local `makeBridgeWithDaemonStatusProvider` and never
imports testUtils.makeBridge — the cross-copy concern isn't
triggered. Premature abstraction on a test-only fixture.
Verification: 291/291 acp-bridge tests pass; 4/4 cli daemon tests
pass; tsc clean both packages; eslint --max-warnings 0 clean on
2 touched .ts files; `npm pack --dry-run` confirms publish-surface
exclusions.
* fix(core): F2 cleanup PR B — self-heal observability (W133-a + W134) (#4460)
* fix(core): F2 cleanup PR B — self-heal observability (W133-a + W134)
W93 declined as already satisfied by W1 fix in #4336 commit 6
(spawnEntry's catch already calls forceShutdown which runs the full
cleanup table — listener removal, timer clear, subscriber detach,
sweep+disconnect, onClosed eviction). Source-verified non-repro.
W133-a: McpClient.onerror now captures the error in a private
`lastTransportError` field (reset at each connect()); the W120
silent-drop block at mcp-pool-entry.ts:346 reads it via the new
`getLastTransportError()` getter and appends `: <error.message>` to
the lastError string on the emitted 'failed' event. Preserves the
literal "silent transport drop" prefix invariant for log-grep
backward compat — pre-fix marker stays a substring.
W134: sweepAndDisconnect now returns SweepResult instead of void —
{ pidSweepError?, disconnectError?, descendantsFound?,
descendantsSignaled? }. The silent-drop fire-and-forget caller chains
to inspect the result and emits a structured warn log when either
pid-sweep threw OR sigtermPids partially signaled (signaled < found)
— surfaces orphan-process pressure without inflating PR scope (no
new SSE event or SDK reducer state; deferred to W134-followup if
maintainers want metrics).
forceShutdown / doRestart sweep callers ignore the return value (JS
implicit-void at await sites preserves behavior).
4 new tests in mcp-transport-pool.test.ts covering W133-a happy path
+ fallback (no prior onerror) + W134 pidSweepError + W134
partial-signal failure modes. Module-mocks pid-descendants.js for
controllable sweep behavior, and debugLogger.js to observe warn
calls (production logger is session-gated and a no-op in tests).
Singleton-stub debugLogger mock so production module-load
`createDebugLogger('McpPool:Entry')` and the test's retrieval get
the same vi.fn instances.
Verification:
- tsc clean: packages/core, packages/cli (server.ts pre-existing
errors unchanged)
- F2 transport-pool: 32/32 pass (28 pre-existing + 4 new)
- mcp-client: 46/46 pass
- eslint --max-warnings 0 clean on 3 touched files
Part of #4175 #4336 follow-up bucket.
* fix(core): #4460 round 1 fold-in — 4 copilot doc/comment threads adopted
T1 [copilot mcp-pool-entry.ts:116 — stale line ref in SweepResult JSDoc]:
replaced `mcp-pool-entry.ts:383` with stable method-anchor reference
to the W120 silent-drop block inside `statusChangeListener`. Line
numbers drift on every edit; method names don't.
T2 [copilot mcp-pool-entry.ts:453 — `?? 0` ambiguous in warn payload]:
silent-drop warn log now prints `descendantsFound=unknown` and
`descendantsSignaled=unknown` when the values are undefined (only
reachable in the pidSweepError branch — sweep threw before
assignment). Operators triaging the warn can now distinguish
"sweep succeeded but found 0 descendants" from "sweep itself
threw, count is genuinely unmeasured". Locked in via a new
assertion in the W134 pidSweepError test.
T3 [copilot mcp-client.ts:116 — brittle line refs in lastTransportError
JSDoc]: replaced `mcp-pool-entry.ts:346` and `mcp-client.ts:130`
with stable method/block names (the `statusChangeListener` silent-
drop block; the `client.onerror` arrow inside connect()). Same
fix applied to the parallel comment in mcp-transport-pool.test.ts:730
for consistency.
T4 [copilot mcp-transport-pool.test.ts:797 — singleton-stub mock comment
contradictory]: rewrote the comment to unambiguously describe what
the mock DOES (factory body runs once; inner arrow returns the same
object on every call) instead of the prior hypothetical phrasing
("Returning a fresh object would have...") which read as a
description of current behavior at first glance.
All 4 are doc/comment fixes — zero behavior change apart from the
T2 string format ('unknown' instead of '0'). Verified:
- 32/32 mcp-transport-pool.test.ts pass
- tsc clean on packages/core
- eslint --max-warnings 0 clean on 3 touched files
* fix(core): #4460 round 2 fold-in — remove dead SweepResult.disconnectError field
T5 [wenshao mcp-pool-entry.ts:134 — `disconnectError` is dead data]:
glm-5.1 review caught that the field was populated when
`client.disconnect()` threw (line 844) but no consumer ever read
it — the silent-drop `.then()` handler gated only on
`pidSweepError` and partial-signal; `forceShutdown` and `doRestart`
ignore the return; no test asserted on it.
Removed the field from `SweepResult` and the assignment in the
disconnect catch. The pre-existing `debugLogger.error(`client.disconnect
failed for ...`)` inside `sweepAndDisconnect` already gives operators
the signal — adding it to the outer silent-drop warn would have been
duplicate noise. If a future consumer needs to gate logic on disconnect
failures, re-add the field + reader at that point.
Verification: 32/32 mcp-transport-pool.test.ts pass; tsc + eslint
clean on the touched file.
* feat(sdk/daemon-ui): unified completeness follow-up to #4328 (#4353)
* feat(sdk/daemon-ui): expand event coverage to 28+ daemon event types (PR-A)
Closes the "12+ daemon events fall through to debug" gap surfaced in the PR
the daemon currently emits (Stage 1 + Wave 3-4), so renderers stop having
to peek at `rawEvent.data` for known event categories.
Session-meta:
- session.metadata.changed (from session_metadata_updated)
- session.approval_mode.changed (from approval_mode_changed)
- session.available_commands (from available_commands_update; upgraded
from a status-text fallback to a typed event carrying the command list)
Workspace state (Wave 3-4):
- workspace.memory.changed
- workspace.agent.changed
- workspace.tool.toggled
- workspace.initialized
- workspace.mcp.budget_warning
- workspace.mcp.child_refused
- workspace.mcp.server_restarted
- workspace.mcp.server_restart_refused
Auth device-flow (Wave 4 OAuth, RFC 8628):
- auth.device_flow.started
- auth.device_flow.throttled
- auth.device_flow.authorized
- auth.device_flow.failed (carries DaemonAuthDeviceFlowSdkErrorKind)
- auth.device_flow.cancelled
- `DaemonUiErrorEvent.errorKind?: DaemonErrorKind` — closed-enum error
category propagated from daemon's typed-error taxonomy. Renderers can
branch on errorKind for "retry auth" vs "check file path" affordances
instead of regex-matching `text`.
- `DaemonUiToolUpdateEvent.provenance?: DaemonUiToolProvenance` +
`.serverId?` — closed enum ('builtin' | 'mcp' | 'subagent' | 'unknown').
Falls back to the `mcp__<server>__<tool>` naming heuristic when the
daemon doesn't stamp provenance explicitly. Unblocks UI namespace
dispatch without string-matching toolName.
Session-meta / workspace / auth events do NOT push transcript blocks.
They are intentional sidechannel observations: `lastEventId` advances
(monotonic invariant preserved), but the chat-stream transcript stays
focused on user/assistant/tool/shell/permission content. Renderers
consume them via selectors (introduced in follow-up PRs).
All new event types produce short structured lines in
`daemonUiEventToTerminalText` for tail-style debug consumers. Web/IDE
renderers should consume the typed events directly via subscription.
40/40 tests pass. New tests verify:
- All 16 new event types normalize correctly
- Malformed payloads fall back to debug without leaking raw data
(`secret` field never appears in fallback text)
- MCP tool provenance heuristic (`mcp__github__create_issue` →
provenance='mcp', serverId='github')
- errorKind propagation on session_died / stream_error
- Reducer is no-op on new event types; lastEventId still advances
This is PR-A of the unified-renderer-layer follow-up series:
- PR-A (this commit) — event coverage + closed-enum schema
- PR-B — server-side timestamps + ordering refactor
- PR-C — multimodal content + tool preview taxonomy
- PR-D — render contract (toMarkdown / toHtml / toPlainText) + adapter
conformance test framework
- PR-E — reducer state machine (subagent / progress / current tool /
cancellation propagation)
See https://github.com/QwenLM/qwen-code/pull/4328#issuecomment-4494179724
for the full proposal.
Generated with AI
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* feat(sdk/daemon-ui): server timestamps + event-id-based ordering (PR-B)
Closes the "时间定义不标准" gap surfaced in the PR #4328 review:
- Client-side `Date.now()` drifts across clients
- No daemon-authoritative timestamp propagated to UI
- Out-of-order replay events get fresher `state.now` than originals,
breaking `createdAt` ordering
- `DaemonUiEventBase.serverTimestamp?: number` — daemon-authoritative
wall-clock timestamp extracted from envelope.
- `DaemonTranscriptBlockBase.serverTimestamp?: number` + `clientReceivedAt: number`.
- `createdAt` preserved as `@deprecated` alias for `clientReceivedAt`
(backward compat for code written before this PR).
`extractServerTimestamp` looks at three candidate envelope locations:
1. `event.serverTimestamp` (preferred when daemon adds it)
2. `event._meta.serverTimestamp` (Anthropic-style metadata convention)
3. `event.data._meta.serverTimestamp` (sessionUpdate nested location)
The SDK is ready to consume serverTimestamp WHEN daemon emits it, without
requiring a coordinated SDK release. Undefined when daemon doesn't emit
(current state) — graceful degradation to client-clock ordering.
`selectTranscriptBlocksOrderedByEventId(state)` — returns blocks sorted by:
1. `eventId` (daemon-monotonic SSE cursor) — primary key
2. `serverTimestamp` (daemon wall clock) — fallback for synthetic frames
3. `clientReceivedAt` (local clock) — last resort
Use this when displaying long sessions where event id 5 may arrive AFTER
event id 7 (typical in SSE replay-after-reconnect).
`formatBlockTimestamp(block, opts)` — formats the most authoritative
timestamp on a block using `Intl.DateTimeFormat`. Prefers
`serverTimestamp` over `clientReceivedAt` for cross-client consistency.
Accepts locale / timeZone / dateStyle / timeStyle.
Daemon needs to stamp `_meta.serverTimestamp` on every SSE envelope. This
SDK PR is ready to consume it the moment the daemon ships the field; no
coordination needed.
- serverTimestamp extraction from all three envelope locations
- Defaults undefined when envelope has none
- `selectTranscriptBlocksOrderedByEventId` sorts mixed-arrival events by
eventId (replay scenario)
- `formatBlockTimestamp` prefers serverTimestamp; returns localized string
PR-B of the unified follow-up to PR #4328 (PR-A + PR-B + PR-C + PR-D +
PR-E in one branch).
Generated with AI
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* feat(sdk/daemon-ui): reducer state machine — currentTool / approvalMode / cancellation propagation (PR-E)
Closes the "reducer state machine 设计缺漏" gap surfaced in the PR #4328 review:
- No `currentTool` — UI scans `blocks[]` to find the running tool
- No mirrored approval mode — UI walks events to badge "plan"/"yolo"
- Cancellation does not propagate — in-flight tool blocks stuck at
'in_progress' forever when the parent prompt is cancelled
## State additions (sidechannel, no transcript blocks)
`DaemonTranscriptSidechannelState`:
- `currentToolCallId?: string` — toolCallId of the in-flight tool
- `approvalMode?: string` — mirrored from session.approval_mode.changed
- `toolProgress: Record<string, { ratio?, step? }>` — per-tool progress
shape (daemon-side emission of `tool.progress` events pending)
## Reducer behavior
### `tool.update` events
`IN_FLIGHT_TOOL_STATUSES` = { pending, confirming, running, in_progress }
`TERMINAL_TOOL_STATUSES` = { completed, success, failed, error, canceled, cancelled }
- Tool enters in-flight: set `currentToolCallId = event.toolCallId`
- Tool enters terminal: clear `currentToolCallId` if it matches
- Unknown status (forward-compat): leave pointer untouched
This avoids the failure mode where a future daemon-emitted status like
`'paused'` would silently mark unknown states as either in-flight or
terminal incorrectly.
### `session.approval_mode.changed`
Mirror `event.next` onto `state.approvalMode`. Renderers can render a
mode badge ("plan" / "default" / "auto-edit" / "yolo") with a single
selector call, no event-stream walking.
### `assistant.done` with `reason === 'cancelled'`
`propagateCancellationToInFlightTools` walks every tool block whose
status is still in-flight and force-sets it to 'cancelled'. The daemon
does not guarantee terminal `tool_call_update` for every in-flight tool
when the parent prompt is cancelled, so this propagation prevents UI
spinners from spinning forever.
`currentToolCallId` is also cleared in the same call.
Non-cancellation `assistant.done` (e.g., `reason: 'end_turn'`) does NOT
propagate — in-flight tools remain in-flight until the daemon emits
their terminal update naturally.
## Selectors
- `selectCurrentTool(state)` — returns the running tool block, or undefined
- `selectApprovalMode(state)` — returns the mirrored approval mode
- `selectToolProgress(state, toolCallId)` — per-tool progress query
All exported from `@qwen-code/sdk/daemon`.
## Scope deliberately deferred
Subagent nesting (`parentBlockId` / `delegationId` / `DaemonSubagentTranscriptBlock`)
is NOT in this PR. The shape needs design discussion (how to project nested
events; whether to bake delegation tracking into transcript or sidechannel).
PR-D / PR-F follow-up.
## Test coverage (51/51 pass)
- currentToolCallId set on enter, cleared on terminal
- approvalMode mirrors changes
- Cancellation marks in-flight tools 'cancelled', leaves completed alone
- Unknown status does NOT clear currentToolCallId (forward-compat)
- Non-cancellation `assistant.done` does NOT propagate
## Roadmap
PR-E of the unified follow-up to PR #4328 (PR-A + PR-B + PR-E in this
branch; PR-C / PR-D pending).
Generated with AI
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* feat(sdk/daemon-ui): tool preview taxonomy + multimodal content extraction (PR-C)
Closes two related gaps surfaced in the PR #4328 review:
- `DaemonToolPreview` had only 4 kinds — UI fell back to `key_value` /
`generic` for tools that deserved structured display
- `getTextContent` silently dropped non-text content (image / audio /
resource), so multimodal conversations vanished from the UI
`DaemonToolPreview` extends from 4 to 8 variants:
- `file_diff` — `{ path, oldText?, newText?, patch? }` — file edit tools
(Anthropic-style `oldText/newText`, aider-style `patch`, write-style
`newText` alone)
- `file_read` — `{ path, range?: [start, end] }` — file read tools, with
range extracted from `lineRange` tuple OR `offset/limit` pair
- `web_fetch` — `{ url, method? }` — HTTP fetch tools (requires URL
with scheme to avoid false positives on relative paths)
- `mcp_invocation` — `{ serverId, toolName, argsSummary? }` — MCP server
tool calls, identified via `mcp__<server>__<tool>` naming convention
(same heuristic as PR-A `DaemonUiToolUpdateEvent.provenance`)
Detector order matters — MCP wins first (most specific), then file_diff,
file_read, web_fetch, then the existing command / key_value fallbacks.
New helper `extractContentPart(value): DaemonUiContentPart | undefined`
returns a discriminated union:
```ts
type DaemonUiContentPart =
| { kind: 'text'; text: string }
| { kind: 'image'; mediaType: string; source: { url?, data? } }
| { kind: 'audio'; mediaType: string; source: { url?, data? } }
| { kind: 'resource'; uri: string; mediaType?, description? };
```
The existing `getTextContent` is preserved for backward compat. Renderers
that need to surface non-text content (web UI thumbnails, IDE attachment
chips) now have a typed shape to consume.
- Wiring `extractContentPart` into the normalizer / reducer so text
blocks accumulate `parts: DaemonUiContentPart[]` alongside `text`
(additive shape change requires render contract coordination — PR-D).
- 5 additional tool preview kinds (image_generation / code_block /
tabular / subagent_delegation / search) — useful but not urgent;
current 8 kinds cover the typical agent flows.
- file_diff detection from Anthropic / aider / write shapes
- file_read with lineRange tuple AND offset+limit pair
- web_fetch with method, REJECTS relative paths (no scheme)
- mcp_invocation with serverId + toolName extraction
- Detector priority: MCP wins over file_diff on conflicting shapes
- extractContentPart for text / image (url) / audio (data) / resource
- Unknown content type returns undefined (skip rather than synthesize)
- Image without source returns undefined (defensive)
PR-C of the unified follow-up to PR #4328 (PR-A + PR-B + PR-E + PR-C in
this branch; PR-D render contract pending).
Generated with AI
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* feat(sdk/daemon-ui): render contract — markdown / HTML / plain text helpers (PR-D)
Closes the "render 契约只覆盖 terminal" gap surfaced in the PR #4328 review:
> PR ships `daemonUiEventToTerminalText` for terminal. Web/IDE/channel
> adapters each roll their own projection. No shared contract → adapter
> divergence is inevitable.
## New helpers
```ts
daemonBlockToMarkdown(block, opts?): string // GFM-compatible
daemonBlockToHtml(block, opts?): string // conservatively escaped HTML
daemonBlockToPlainText(block, opts?): string // for copy-paste / logs
daemonToolPreviewToMarkdown(preview, opts?): string
```
All three respect the same `kind` discrimination so adapters can switch
between them without touching call sites.
## Per-kind projection
For each `DaemonTranscriptBlock['kind']`:
- `user` / `assistant` / `thought` — plain text with role labels
- `tool` — header with toolName + structured preview + status badge
- `shell` — fenced code block, stream-discriminated (stdout vs stderr)
- `permission` — title + options list + resolved/pending indicator
- `status` / `debug` / `error` — semantic class / role (error → role=alert)
For each `DaemonToolPreview['kind']`:
- `ask_user_question` — question + options as bullet list
- `command` — fenced bash with optional cwd comment
- `file_diff` — unified diff in fenced code block (oldText/newText OR patch)
- `file_read` — `path (lines N-M)` line
- `web_fetch` — `METHOD url` line
- `mcp_invocation` — `serverId::toolName` with args summary
- `key_value` — bullet list
- `generic` — emphasized summary
## Security
- Default HTML sanitizer escapes `<`, `>`, `&`, `"`, `'` and FIRST strips
ANSI/control sequences via `sanitizeTerminalText` (defense against
agent-emitted escape codes in HTML output).
- Custom sanitizer hook for consumers wanting markdown→HTML pipelines
(markdown-it + DOMPurify, etc.).
- `sanitizeUrls` option strips token-like query params (`token=`, `key=`,
`x-amz-`, etc.) from URLs in `web_fetch` previews.
- `maxFieldLength` truncation defaults 8192, prevents pathological
rendering on huge content.
## Adapter conformance (out of scope for this commit)
The conformance test framework (fixture corpus + `runAdapterConformanceSuite`)
mentioned in PR-D scope is deferred to a follow-up. The render helpers
here are the precondition — once stable, the conformance framework can
use them as the reference projection.
## Test coverage (77/77 pass)
- All 9 block kinds render in markdown (verified for user/assistant/tool/
shell/permission/error specifically)
- file_diff renders as unified diff with old/new lines
- mcp_invocation renders as `server::tool` format
- HTML escapes XSS (`<script>` → `<script>`)
- HTML strips terminal escape sequences before escaping
- Error blocks emit `role="alert"` for screen readers
- plain text drops markdown delimiters
- maxFieldLength truncates with ellipsis
- sanitizeUrls strips token query params
- Custom sanitizer hook works
## Roadmap
PR-D of the unified follow-up to PR #4328 — completes the 5-PR series
(A: event coverage, B: time schema, E: state machine, C: tool preview +
content extraction, D: render contract).
Generated with AI
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* feat(sdk/daemon-ui): 5 additional tool preview kinds — taxonomy complete (PR-F)
Closes the "5 additional preview kinds" item in PR #4353's TODO §A
(SDK-only work).
## New preview kinds (8 → 13)
- `code_block` — `{ language?, code, origin? }` — REPL / formatter /
generator output, fenced as `\`\`\`<language>` in markdown
- `search` — `{ query, resultCount?, top? }` — grep / ripgrep / find /
glob results with up to 5 top hits
- `tabular` — `{ columns, rows, totalRows? }` — structured table output
(50-row cap with `totalRows` truncation indicator); supports both
`columns: string[] + rows: unknown[][]` explicit shape and legacy
`data: Array<Record<>>` shape (auto-infers columns from first row)
- `image_generation` — `{ prompt, thumbnailUrl?, model? }` — dall-e /
diffusion / imagen / flux / sora style tools
- `subagent_delegation` — `{ agentName, task, parentDelegationId? }` —
Anthropic-style Task tool and similar sub-agent dispatchers
## Detector priority
Order matters — most specific wins. New detectors slot in between
`mcp_invocation` and `file_diff`:
```
mcp_invocation > subagent_delegation > search > image_generation
> file_diff > file_read > web_fetch > code_block > tabular
> command > key_value > generic
```
Rationale: subagent / search / image generation are most discriminable
(distinct toolName patterns); file ops next; code_block / tabular last
because their shapes (`code:`, `columns:`) can appear in other tools.
## Render projections
Both `daemonToolPreviewToMarkdown` and the plain-text rendering paths
extended with cases for all 5 new kinds:
- code_block: fenced markdown code block with language tag
- search: bold header + GFM bullet list of top results
- tabular: GFM pipe table with header / separator / body / truncation hint
- image_generation: bold header + blockquoted prompt + embedded markdown
image (URL sanitization respected via `sanitizeUrls` opt)
- subagent_delegation: bold delegate-arrow header + blockquoted task +
optional parent delegation reference
## Test coverage (91/91 pass, +14 new)
- Each detector with positive case
- Detector priority verified: subagent_delegation wins over file_diff
when toolName='Task' has both subagent + file-edit fields
- Tabular row cap (50) + totalRows stamping for truncated data
- Legacy data: Array<Record<>> auto-column inference
- Each render projection with structural assertions (markdown table
format, image embed, bullet lists)
## Roadmap
PR-F of the unified follow-up to PR #4328. Brings the preview taxonomy
to 13 kinds covering: file ops (3), web (1), code/data (2), media (1),
agent control (2 — ask_user_question + subagent_delegation), MCP (1),
search (1), generic fallbacks (2).
Generated with AI
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* feat(sdk/daemon-ui): adapter conformance framework + fixture corpus (PR-G)
Closes the "Adapter conformance test framework" item in PR #4353's TODO §A.
Lets any daemon-ui adapter (TUI / web / IDE / channel / mobile) validate
that it projects a fixed corpus of daemon SSE event streams to the same
semantic shape — catches projection drift before it reaches users.
## API surface
```ts
interface DaemonUiAdapterUnderTest {
reduce(events: readonly DaemonUiEvent[]): unknown;
renderToText(state: unknown): string;
}
interface DaemonUiConformanceFixture {
name: string;
description: string;
envelopes: DaemonEvent[]; // raw daemon envelopes
expectedContains: string[]; // phrases the rendered text MUST contain
expectedAbsent?: string[]; // phrases that MUST NOT appear
normalizeOptions?: { ... }; // forward-compat normalize opts
}
runAdapterConformanceSuite(adapter, opts?): ConformanceSuiteResult
DAEMON_UI_CONFORMANCE_FIXTURES: ReadonlyArray<DaemonUiConformanceFixture>
```
## Design
**Format-agnostic assertion**: adapters can render to ANSI / HTML /
markdown / JSX — the framework only inspects plain text via
`renderToText`. Catches semantic divergence (missing user message,
wrong tool status, leaked secret) without forcing identical formatting.
**Embedded fixture corpus** (no fs reads — works in browser bundle):
- `simple-chat` — user/assistant streaming flow
- `tool-call-lifecycle` — running → completed transition
- `file-edit-diff` — file_diff preview surfacing
- `mcp-invocation` — MCP serverId/toolName extraction via heuristic
- `permission-lifecycle` — request + resolved with outcome
- `mcp-budget-warning` — Wave 3 event (adapter must observe but rendering
is its choice)
- `cancellation-propagates` — tool block status flows
- `malformed-payload-redaction` — uses `includeRawEvent: true` to verify
even a debug-mode adapter doesn't leak `token: secret-do-not-leak`
- `auth-device-flow-success` — Wave 4 OAuth events
- `available-commands-typed-event` — PR-A upgrade from status text
Per-fixture `expectedContains` and `expectedAbsent` describe the
content contract independently of format.
## Suite result
```ts
{
passed: number,
failed: ConformanceFailure[], // each carries missing + leaked + excerpt
total: number,
}
```
**Does not throw** — caller asserts on `result.failed` so adapter test
suites can produce per-fixture diagnostics rather than a single opaque
exception.
## Filter options
`only` / `skip` allow targeted runs during adapter development:
```ts
runAdapterConformanceSuite(myAdapter, { only: ['simple-chat'] });
runAdapterConformanceSuite(myAdapter, { skip: ['cancellation-propagates'] });
```
## Test coverage (97/97 pass, +6 new)
- SDK reference adapter (reducer + markdown render) passes all fixtures
- SDK reference adapter (reducer + plainText render) also passes
- Buggy adapter (empty string output) fails every fixture with non-empty
`expectedContains`
- Buggy adapter (raw event dump via JSON.stringify) caught by redaction
fixture's `expectedAbsent`
- `only` filter narrows to a single fixture
- `skip` filter excludes named fixtures from the corpus
## Usage from adapter authors
```ts
// In your adapter's test file
import { runAdapterConformanceSuite } from '@qwen-code/sdk/daemon';
import { reduceForTui, renderTuiState } from './my-tui-adapter';
it('TUI adapter conforms to daemon UI corpus', () => {
const result = runAdapterConformanceSuite({
reduce: reduceForTui,
renderToText: renderTuiState,
});
expect(result.failed).toEqual([]);
});
```
## Roadmap
PR-G of the unified follow-up to PR #4328. The corpus is intentionally
small (10 fixtures) but extensible — adapter authors can submit new
fixtures via additions to `DAEMON_UI_CONFORMANCE_FIXTURES` to lock in
regression coverage for edge cases their adapter encountered.
Generated with AI
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* feat(webui+sdk/daemon-ui): wire transcriptAdapter to SDK render contract (PR-H)
Closes the "WebUI transcriptAdapter migration" item in PR #4353's TODO §A.
Validates the PR-D render contract end-to-end on the real WebUI consumer.
`daemonTranscriptToUnifiedMessages(blocks, options?)` gains a new options
parameter:
```ts
interface DaemonTranscriptAdapterOptions {
useMarkdown?: boolean; // default: false
enrichToolDetailsWithPreview?: boolean; // default: false
}
```
Defaults preserve legacy behavior — existing callers see no change.
For `user` / `assistant` / `thought` blocks, content is projected via
SDK's `daemonBlockToMarkdown` instead of raw sanitized text. The WebUI's
markdown renderer (markdown-it) then gets:
- `**You**\n\n<content>` for user blocks (bold "You" label)
- Raw text for assistant blocks (markdown formatting in agent output
passes through cleanly)
- `> *thought:* <text>` blockquote for thought blocks
For `tool` blocks, `rawOutput` is replaced with `daemonToolPreviewToMarkdown(block.preview)`.
This lets WebUI surfaces without per-preview-kind React components still
display:
- `file_diff` as a fenced unified diff
- `mcp_invocation` as `server::tool` with args summary
- `tabular` as GFM pipe table
- `search` as bullet list with match count
- `image_generation` as embedded markdown image
- `subagent_delegation` as delegate arrow + task quote
Renderers with per-kind components should leave this opt-out.
`packages/sdk-typescript/src/daemon/index.ts` was missing exports for
PR-D / PR-F / PR-G / PR-B / PR-E surface — WebUI's `@qwen-code/sdk/daemon`
import path uses the daemon root, not the ui/ sub-index. Added 15+
re-exports so consumers don't need to use the longer
`@qwen-code/sdk/daemon/ui/index.js` path.
Now exported from `@qwen-code/sdk/daemon` root:
- `daemonBlockToMarkdown` / `daemonBlockToHtml` / `daemonBlockToPlainText`
- `daemonToolPreviewToMarkdown`
- `extractContentPart` + `DaemonUiContentPart` type
- `formatBlockTimestamp` + `selectTranscriptBlocksOrderedByEventId`
- `selectCurrentTool` / `selectApprovalMode` / `selectToolProgress`
- `runAdapterConformanceSuite` + `DAEMON_UI_CONFORMANCE_FIXTURES`
- All associated types
`webui/src/daemon/transcriptAdapter.test.ts` mock blocks updated to include
`clientReceivedAt` (required field added in PR-B). Mechanical change —
every `createdAt: N` test fixture gets a matching `clientReceivedAt: N`.
- WebUI `npm run typecheck` — clean
- SDK `npm run typecheck` — clean
- SDK `vitest run test/unit/daemonUi.test.ts` — 97/97 pass
- WebUI transcriptAdapter test fixtures typecheck against updated
DaemonTranscriptBlockBase schema
PR-H of the unified follow-up to PR #4328. Closes the WebUI migration
gap in TODO §A.
Generated with AI
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* docs(daemon-ui): add developer guide + migration cookbook (PR-I)
Closes the final "Documentation" item in PR #4353's TODO §A. Brings the
unified daemon UI surface to ~95% SDK-side completion.
## Files added
- `docs/developers/daemon-ui/README.md` — full API reference
- Three-layer model (normalizer → reducer → render helpers)
- Quick start with idiomatic event-loop pattern
- Event taxonomy (28+ types categorized: chat-stream / session-meta /
workspace / auth device-flow)
- Render contract cookbook (markdown / HTML / plainText)
- Tool preview taxonomy (13 kinds with use cases)
- State selectors (currentTool / approvalMode / toolProgress / ordering)
- Cancellation propagation explanation
- Time semantics (eventId > serverTimestamp > clientReceivedAt
precedence)
- Adapter conformance usage
- ErrorKind dispatch pattern
- Tool provenance dispatch pattern
- Forward-compat principles
- `docs/developers/daemon-ui/MIGRATION.md` — adapter author migration
cookbook
- Step-by-step recommended adoption order (9 steps, value-ranked)
- Before/after code examples for each step
- Backward-compat checklist (everything is additive — no breaking
changes)
- Cross-references to PR-A through PR-H commits
## Roadmap
PR-I of the unified follow-up to PR #4328. Documentation-only — no
code changes; no tests affected.
Generated with AI
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* fix(daemon-ui): address review feedback
* fix(daemon-ui): address review hardening feedback
* fix(daemon-ui): handle resync-required events
* feat(sdk/daemon-ui): consume daemon-side subagent nesting context (PR-K)
Closes the SDK-side gap for §B1 in PR #4353's TODO list. PR-E originally
deferred subagent nesting because daemon-side parent-context wasn't yet
stamped on tool_call events. After the rebase onto current
daemon_mode_b_main, source verification confirms the daemon now emits
`tool_call._meta.parentToolCallId` + `tool_call._meta.subagentType` via
`SubAgentTracker.getSubagentMeta()` (core), so the SDK side is unblocked.
## Schema additions (additive, forward-compat-safe)
`DaemonUiToolUpdateEvent`:
- parentToolCallId?: string — toolCallId of the parent Task / delegation
- subagentType?: string — sub-agent type label (e.g. 'code-reviewer')
`DaemonToolTranscriptBlock`:
- parentToolCallId?: string — mirror of event field
- subagentType?: string — mirror of event field
- parentBlockId?: string — pre-resolved by reducer when parent already
in state, so renderers don't re-correlate
## Normalizer wiring
`normalizeToolUpdate` checks both top-level and `_meta` for parentToolCallId
+ subagentType (fallback chain mirrors how provenance/serverId are read).
Top-level tool calls without sub-agent context omit the fields cleanly.
## Reducer behavior
- New tool block: resolves `parentBlockId` from `toolBlockByCallId` at
create time. Out-of-order arrival (child before parent) leaves
`parentBlockId` undefined — selectors fall back to `parentToolCallId`
lookup.
- Existing tool block update: adopts parent context if not yet
correlated, never overwrites established correlation (handles the
flow where SubAgentTracker activates after the initial tool_call).
## New public selectors
- selectSubagentChildBlocks(state, parentToolCallId): returns the
array of tool blocks invoked inside a given parent delegation
- isSubagentChildBlock(block): type guard for "this tool block came
from a sub-agent"
Both exported from @qwen-code/sdk/daemon root + ui/index.
## Forward-compat properties
- Top-level tool calls (no sub-agent) work identically as before
- Trimmed parent blocks: child fallback to undefined parentBlockId
- Daemon emits both fields together; SDK reads independently to tolerate
partial future stamping
## Test coverage (129/129 pass, +5 new tests)
- Extract parentToolCallId + subagentType from `_meta`
- Top-level tool calls have undefined parent fields (forward-compat)
- Reducer correlates parentBlockId at create time
- Reducer adopts parent context on later update (out-of-order arrival)
- isSubagentChildBlock discriminator
## Roadmap
PR-K of the unified follow-up to PR #4353. Closes §B1 (subagent nesting)
in the TODO declaration; daemon-side already shipped on
`daemon_mode_b_main` via SubAgentTracker (core).
Remaining TODO §B / §D items still depend on further daemon/Core work:
- §B2 `tool.progress` event type (daemon emit pending)
- §D MessageEmitter multimodal echo + HistoryReplayer inlineData/fileData
(core change pending)
Generated with AI
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* fix(daemon-ui): PR-K self-review hardening — back-fill / trim / self-ref / docs
Multi-round self-review of PR-K (d8375fe46) surfaced two real bugs, a
few defensive gaps, and missing docs/fixture coverage. All addressed
in one commit.
## Bugs fixed
### Bug 1 — `parentBlockId` never back-filled for out-of-order arrival
Original PR-K resolved `parentBlockId` only at child create time, which
broke this flow:
1. Child arrives WITH parent stamp → block created with
`parentToolCallId` set, `parentBlockId` undefined (parent not in
state yet)
2. Parent arrives later → block created, `toolBlockByCallId` indexed
3. Subsequent child updates: existing-block branch only ran the
back-fill inside `!existing.parentToolCallId`, which is false (we
already adopted the stamp in step 1). `parentBlockId` stayed
undefined forever.
Fix: separate the two correlations.
- existing-block update: independently back-fill `parentBlockId`
whenever `parentToolCallId` is set and `parentBlockId` is missing
- new-block create: scan existing children whose `parentToolCallId`
matches the new block's `toolCallId` and back-fill their
`parentBlockId`. Cheap O(n) over current blocks.
### Bug 2 — dangling `parentBlockId` after trim
`trimTranscriptState` reset `toolBlockByCallId[id]` to the trimmed
sentinel for evicted blocks but did NOT walk surviving children to
null their `parentBlockId` references. Renderers walking
`blockIndexById.get(parentBlockId)` would get undefined, with no
"why" signal.
Fix: post-trim, walk remaining tool blocks; if `parentBlockId`
references an id not in `keptIds`, null it. `parentToolCallId` stays
(survives trimming so selector-keyed queries still work).
## Defensive hardening
- **Self-reference guard** (normalizer): drop
`parentToolCallId === toolCallId` before it reaches the reducer.
Daemon should never emit this, but defending costs nothing.
- **Selector docstring**: clarify `selectSubagentChildBlocks` returns
**direct** children only; document cycle / depth-cap responsibility
for renderers walking up the chain.
- **Cosmetic**: remove redundant `as DaemonToolTranscriptBlock` cast
in `isSubagentChildBlock` (TypeScript already narrows after
`block.kind === 'tool'` on the discriminated union).
- **Alphabetical**: move `isSubagentChildBlock` re-export to correct
position in both `daemon/index.ts` and `daemon/ui/index.ts`.
## Docs + conformance gaps closed
- `README.md` — new "Sub-agent nesting (PR-K)" section with full
reducer behavior, out-of-order handling note, recursive walk example,
cycle-defense note.
- `MIGRATION.md` — new step 8a with before/after for nested rendering.
- `conformance.ts` — new `subagent-nesting` fixture covering parent +
nested child via `tool_call._meta`. Markdown-safe phrases chosen
(markdown escapes `-` so titles cannot be substring-matched as-is).
## Test coverage (+5 tests, 134/134 pass)
- Self-reference dropped in normalizer
- Back-fill on out-of-order parent arrival (child first, parent after)
- Back-fill on later child update when parent now exists
- Dangling `parentBlockId` nulled after parent trimmed
- New `subagent-nesting` conformance fixture passes SDK reference adapter
## Side-effect verification
Verified no regressions:
- Cancellation propagation still cancels parent + children together
(iterates `toolBlockByCallId`, which includes both)
- Render contract unchanged (`daemonBlockToMarkdown` etc. project per
block, no nested awareness required)
- No serializer to update
- `selectTranscriptBlocksOrderedByEventId` unaffected (parent-agnostic)
Generated with AI
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* fix(daemon-ui): permission block trim contract — wenshao review
Addresses both items from wenshao's review on PR #4353:
## Critical — resolvePermissionBlock missing TRIMMED guard
The sibling `upsertPermissionBlock` (transcript.ts:544) correctly returns
early when `existingId === TRIMMED_PERMISSION_BLOCK_ID`, but
`resolvePermissionBlock` (transcript.ts:581) had no such guard. When
`maxBlocks` trimming evicted a pending permission request, a subsequent
`permission.resolved` event would:
1. Fail the `getWritableBlockById` lookup (sentinel is not a real block id)
2. Fall through and create a brand-new orphan resolution block
This wasted a block slot, accelerated further trimming, and silently
broke the trimmed-block contract that the request-side guard establishes.
Fix: mirror the request-side guard. Read the index entry up front,
return early on the sentinel.
## Suggestion — permissionBlockByRequestId grows unboundedly
`trimTranscriptState` writes `TRIMMED_PERMISSION_BLOCK_ID` for evicted
permission requests but never deletes those entries. Unlike the tool
side (which calls `pruneTrimmedToolIndexes` post-trim), the permission
index grew without bound in long sessions.
Fix: add `pruneTrimmedPermissionIndexes` analogous to the tool-side
helper. Caps the sentinel set at `maxBlocks` entries; older entries are
deleted (any later resolution event still drops cleanly via the new
Critical guard).
## Tests
- Updated existing `keeps orphan permission resolutions visible after
request trimming` test to encode the corrected contract (drops silently
instead of creating an orphan). Test rename: "drops resolution for
trimmed permission requests (wenshao Critical)".
- New `Suggestion: pruneTrimmedPermissionIndexes caps the trimmed
sentinel set` test verifies the cap.
Total: 136/136 tests pass, SDK + WebUI typecheck green.
## Side-effect verification
- `upsertPermissionBlock` already had the equivalent guard — no
asymmetry remains.
- `pruneTrimmedPermissionIndexes` only touches entries holding the
sentinel; live permission blocks are unaffected.
- Selectors over `state.blocks` (e.g. `selectPendingPermissionBlocks`)
iterate the block array, not the index — unaffected by cap.
Generated with AI
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* fix(daemon-ui): address wenshao + doudouOUC inline reviews (2026-05-23)
Addresses the 13 inline review comments from wenshao (6) and doudouOUC
(7, one overlap) on the 2026-05-23 review round.
## Critical / Important
### sanitizeUrls not threaded through HTML preview path (doudouOUC)
`daemonBlockToHtml` for tool blocks called `daemonToolPreviewToPlainText`
which didn't accept `opts` — when callers set `sanitizeUrls: true`, the
markdown path stripped auth tokens but the HTML path leaked them into
the DOM. Now: helper accepts opts, threads through `web_fetch.url` and
`image_generation.thumbnailUrl`.
### enrichToolDetailsWithPreview overwrote rawOutput (doudouOUC)
The webui adapter replaced structured `rawOutput` with a markdown
summary string when `enrichDetails: true`. Downstream `ToolCallData`
consumers may branch on the shape (object vs string) and break. Plus
the actual tool output was silently dropped.
Fix: keep `rawOutput` verbatim, surface markdown via a new optional
`previewMarkdown` field added to `ToolCallData`.
### transcriptBlockToTerminalText zero test coverage (wenshao)
Added 12 tests covering each `switch` branch (user / assistant / thought
/ tool / shell stdout+stderr / permission unresolved+resolved / status /
debug / error) plus the unknown-kind degradation path. Verified
`assertNever` returns a graceful error line (does NOT throw) — wenshao's
reviewer was slightly wrong on the throw claim but coverage gap was
real.
### selectTranscriptBlocksOrderedByEventId no memoization (wenshao)
Selector was called from React `useSyncExternalStore` and re-sorted on
every dispatch — including sidechannel-only events that don't touch
blocks. Added WeakMap cache keyed on `state.blocks` reference; the
reducer preserves the same array reference for non-block-mutating
events, so the cache hits across renders.
### selectSubagentChildBlocks O(n) per call (wenshao)
Naive `state.blocks.filter()` was O(n) per call; rendering a tree with
m parents made it O(n*m). Built a memoized reverse index keyed on
`state.blocks` reference (WeakMap of parentToolCallId →
DaemonToolTranscriptBlock[]). Each lookup now O(1) after first call.
### Test file TS errors at root tsc (wenshao)
Fixed multiple TS errors in `daemonUi.test.ts` flagged by root
`tsc --noEmit`:
- Added `DaemonTranscriptState` + `DaemonUiEvent` imports
- `block.content` access via `as Array<Record<string, unknown>>` cast
- `delete` on globalThis property via narrower interface cast
- `debug?.text` via `DaemonUiEvent & { text: string }` narrowing (Extract on
union with `'status' | 'debug'` literal would resolve to never)
- 6 occurrences of index-signature access via bracket notation
- `raw: null` added to 3 `DaemonUiPermissionOption` literals (required field)
- Explicit type annotations on conformance-suite `renderToText` params
Note: `webui/src/daemon/transcriptAdapter.test.ts` shows residual
"clientReceivedAt does not exist" errors at root tsc, but this is
environmental — the resolution trace shows `@qwen-code/sdk/daemon`
crossing into a sibling worktree's stale dist via shared workspace
node_modules. In a single-worktree CI checkout this resolves cleanly.
## Suggestions (cleanups)
### Hoist asDaemonErrorKind double-eval (doudouOUC)
`session_died` + `stream_error` cases each computed `asDaemonErrorKind`
twice in the conditional spread (predicate + value). Hoisted to const,
no functional change.
### renderToolHeader bypassed opts (doudouOUC)
Forwarded `opts` so `maxFieldLength` is honored for tool title /
toolName / toolKind.
### isSensitiveKey duplicates (doudouOUC)
Removed duplicate `endsWith('accesskey')` / `endsWith('secretkey')`
checks and the redundant exact-match `privatekey` (already covered by
`endsWith`).
### propagateCancellationToInFlightTools iterated trimmed (wenshao)
Filter `TRIMMED_TOOL_BLOCK_ID` sentinels up front. Avoids redundant
index dereferences in long sessions with many historical tools.
### toolProgress shallow clone (doudouOUC + wenshao)
`cloneTranscriptState` outer `...state` spread shared inner
`{ ratio?, step? }` references between snapshots. Once `tool.progress`
event handlers start mutating in place, the prior snapshot would leak.
Deep-clone the inner records now (cost bounded by in-flight tools,
small).
### isDeviceFlowErrorKind closed set (wenshao + doudouOUC)
Both reviewers suggested strict validation. We INTENTIONALLY kept
lenient pass-through — the public type
`DaemonAuthDeviceFlowSdkErrorKind` explicitly includes `(string & {})`
as a forward-compat escape hatch (existing test `keeps future
auth_device_flow_failed errorKind values observable` enforces this).
Now expose `KNOWN_DEVICE_FLOW_ERROR_KINDS` as documentation and
explain the design in the JSDoc.
## Validation
| | |
|---|---|
| SDK tests | 148/148 pass (+12 terminal coverage + assorted hardening) |
| SDK typecheck | clean |
| WebUI typecheck | clean |
## Side-effect verification
- WeakMap memos invalidate correctly: reducer creates a fresh
`state.blocks` reference only on block-mutating events. Sidechannel
events reuse t…
Status: ✅ Ready for review — F2 complete, 22 commits / +7966 / −97 across 34 files
This is the F2 shared MCP transport pool (#4175 Mode B Wave 5). Delivered as a single feature-cohesive PR with 6 atomic feature commits + 16 review fold-in commits per the maintainer's branching strategy on #4175.
git log -p origin/daemon_mode_b_main..HEADfor commit-by-commit walk.What's in this PR
refactor(core): split McpClient.discover into pure tool/prompt listfeat(core): McpTransportPool + SessionMcpView— pool core, fingerprint key, refcount, drain state machine, snapshot replay, generation guardfeat(core): cross-platform pid sweep— Linux/macOSpgrep -P+ Windows PowerShell CIMfeat(serve): wire McpTransportPool into QwenAgent daemon mode— Config / ToolRegistry / McpClientManager threading;acpAgent.QwenAgent.mcpPoolctor; pool lifecycle hooked to SIGTERM / IDE-closefeat(serve): pool-aware status + restart routes—entryCount/entrySummaryonGET /workspace/mcp;?entryIndex=on restart route; capability tagsmcp_workspace_pool+mcp_pool_restartfeat(serve): graduate MCP budget guardrails to workspace scope— newWorkspaceMcpBudgetclass;scope: 'workspace'cell; SDKisWorkspaceScopedBudgetEventhelperPlus 16 review fold-in commits (
fix(core): wenshao Wxx review fold-ins ...) and 1 merge commit reconciling F3 (#4335) ondaemon_mode_b_main.Design doc
📄
docs/design/f2-mcp-transport-pool.mdv2.2 — single source of truth + complete review-fold-in history.Key invariants:
(name + fingerprint). 4 sessions ×--mcp-client-budget=2caps workspace at 2, not 8.includeTools/excludeTools/trustexcluded; OAuth normalized viacanonicalOAuth.pooledTransports = {stdio, websocket}; HTTP/SSE go throughcreateUnpooledConnectionunless operator opts in viaQWEN_SERVE_MCP_POOL_TRANSPORTS.spawnInFlightdedupe (§6.2): 5 concurrent acquires for same key → 1 spawn (test verified).restart()(W4 / W31 fixes).sessionToEntriesreverse index (V21-2):releaseSessionis O(refs) not O(entries).acquirereleases reserved slot viahasNameSibling(W21 false-positive fix usesparseConnectionId, notstartsWith).applyTools/applyPromptsso late attachers don't see empty state — gated byskipReplayfor unpooled bypass.tool.withTrust(this.cfg.trust)clone; allocation pin when trust unchanged.pgrep -Precursive, Windows PowerShell CIM with$pvariable binding (W5 defense-in-depth); bounded (2s timeout, 256 max, 8 depth); graceful degradation.scope: 'workspace'events fan out to all attached sessions viabroadcastBudgetEvent.runWithTimeoutwraps connect+discover onspawnEntry/doRestart/createUnpooledConnection; stdio 30s default, remote 5s, per-serverdiscoveryTimeoutMsoverride clamped to [100ms, 300s].Review history — 9 batches, 32+ fold-ins
ae0b296c43e68c00bc72399f1090e58a098f4a3c5cd903fb453220ee3e60af3c0ffb65002569717a4d36aa8117All review threads resolved. Critical bugs caught + fixed mid-PR (a representative sample):
statusChangeListenercorruption (multi-fingerprint name collisions).doRestartskipped descendant pid sweep — orphan grandchildren on every restart.doRestartdrain-timer + maxIdleTimer race — orphaned new subprocess on shutdown overlap.PooledConnectionafter entry transitioned to'failed'.discoverAllMcpToolsViaPoolnot re-entrant — concurrent passes leaked one connection.hasNameSiblingstartsWithprefix collision — server names with::permanently leaked the budget slot.parseIntvsNumberdivergence between pool's and manager's env-var parsing — same env value produced 100× enforcement difference.runWithTimeoutIIFE re-inserted deleted entry on late-settle → zombie entry.doRestartreturned without sweeping the new transport spawn — orphan on supersede.drainAllreturned live array reference + leaked timer + retroactively negativeforced.applyPromptsignored per-sessionexcludeToolsfilter — restricted sessions still received every prompt.mcpDiscoveryState— preflight reportednot_startedfor completed pool discovery.drainAllPromise.allSettled([...spawnInFlight])was unbounded — daemon shutdown could block for full discovery timeout.'failed'listener pinned manager refs in closure — memory leak in cleanup path.entries, missedspawnInFlight).SDK additive contract
All new fields are additive. Pre-F2 SDK clients ignore unknown fields per PR 14 contract.
New capability tags (advertised conditionally, default-on unless
QWEN_SERVE_NO_MCP_POOL=1):mcp_workspace_pool—entryCount/entrySummaryfields present on per-server cells.mcp_pool_restart—?entryIndex=accepted on restart route; multi-entry response uses{entries: RestartResult[]}shape.EVENT_SCHEMA_VERSIONstays at1(additive).Verification
npx tsc --noEmit)daemon_mode_b_mainresolved (single conflict inserver.test.ts:EXPECTED_REGISTERED_FEATURES— combined F3permission_mediation+ F2mcp_workspace_pool/mcp_pool_restartordering)F2 follow-ups (filed for post-merge cleanup PR)
McpClientManagerctor 7-positional sentinel pattern → options-object refactorpgrep -P <pid>per-PID-per-level cost → singleps -eo pid,ppidtree buildermaxReconnectAttempts/reconnectStrategyhealth-monitor wire-up (forward-compat placeholders today)acquire→ shared tailpassesSessionFilterO(M×N) → pre-computed SetreleaseSessionwiring on per-session close (needs F4 ACP lifecycle hook, no per-session close notification today)onEntryEventAPI alongside its first F4 consumerdiscoveryTimeoutForsource-of-truth across pool + manager (refactor risk)Refs
'workspace'daemon_mode_b_mainwhile F2 was in reviewBranching
Targets
daemon_mode_b_mainper #4175 strategy. Will be rolled into main as a feature-cohesive batch alongside other Wave 5 / F-series work.🤖 Generated with Qwen Code