feat(core): add signal.reason convention for ShellExecutionService (#3831 PR-1 of 3)#3842
Conversation
…ecute() Foundation for #3831 Phase D (b) — Ctrl+B promote of a running foreground shell to background. Defines a discriminated `ShellAbortReason` union that the AbortSignal carries; default behavior (no reason / `{ kind: 'cancel' }`) keeps the existing tree-kill on abort. `{ kind: 'background' }` is a takeover signal — execute() skips the kill, drops the child from its active set (so cleanup() won't kill it later), flushes a snapshot of captured output, and resolves the result Promise immediately with `promoted: true` so the awaiting caller unblocks. Pure plumbing: no caller sets the reason yet, so this is a zero-behavior change for existing call sites. The `promoted?: boolean` field is optional on ShellExecutionResult so existing consumers compile against the new shape without source changes. Tests pin both branches in both childProcessFallback and executeWithPty: default abort still SIGTERM-tree-kills; `{ kind: 'cancel' }` is identical to default (pin against accidental routing through the background branch); `{ kind: 'background' }` skips the kill, snapshot output is preserved, mockProcessKill / mockPtyProcess.kill are NOT called. Part of #3831 (Phase D part b — Ctrl+B promote running shell to background). PR-1 of 3.
Code Coverage Summary
CLI Package - Full Text ReportCore Package - Full Text ReportFor detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run. |
wenshao
left a comment
There was a problem hiding this comment.
wenshao
left a comment
There was a problem hiding this comment.
…view) Addresses 4 Critical + 2 Suggestion findings on PR-1 of #3831: - **childProcess listener detach** (review line 555 + 573): Anonymous arrow listeners on stdout/stderr/error/exit could not be off()'d. After background-promote, post-promote bytes would re-enter handleOutput, which then calls decoder.decode() on a now-finalized text decoder (cleanup() already called .decode() without stream:true) → TypeError crash. Even without the crash, old onOutputEvent would fire for new data → ownership contract violation + duplication. Fix: extract named handler refs (stdoutHandler / stderrHandler / errorHandler / exitHandler) and call off() on all four in the background-promote branch via a detachServiceListeners() helper. - **PTY listener detach** (review line 967 + 990): node-pty's onData / onExit return IDisposable handles; the abort handler now captures dataDisposable / exitDisposable and calls .dispose() in the background-promote branch. ptyProcess.on('error') is EventEmitter-style (not IDisposable) — extract a named ptyErrorHandler ref and off() it. Without these, post-promote PTY error throws → Node.js crash; post-promote data continues writing to headlessTerminal and calling old onOutputEvent → ownership violation. - **PTY in-flight chain item ownership** (related to review line 990): processingChain may have already-enqueued callbacks past the early listenersDetached check. Refactored from "early-return short-circuit" to "guard each onOutputEvent emit individually" so in-flight writes still LAND in headlessTerminal (snapshot reflects them) but no events leak to the foreground onOutputEvent. Also clear renderTimeout in the abort handler so a pending throttled render doesn't fire post-promote. - **PTY snapshot freshness** (review line 972, suggestion): The original abort handler called serializeTerminalToText immediately. Now we await Promise.race([processingChain drain, SIGKILL_TIMEOUT_MS]) first (mirrors the onExit finalize pattern at ~line 970) so in-flight headlessTerminal.write callbacks land before serialization. Skipped render(true) intentionally because it would emit final onOutputEvent data (renderFn calls onOutputEvent), violating the "no emit post-promote" invariant — added a comment explaining why direct serialize is correct. - **Handoff-boundary tests** (review line 1257, suggestion): Added 4 new tests pinning the ownership contract — 2 for child_process (post-promote stdout/stderr does NOT route to onOutputEvent; child exit does NOT re-resolve result), 2 for PTY (data/exit disposables ARE called; result shape stays promoted: true even if post-promote events fire). Also: test setup now stubs mockPtyProcess.onData / .onExit to return { dispose: vi.fn() } so the background-promote path's dispose() calls don't crash on undefined (the stub's mock.results[0].value is then inspected by the new handoff tests). 58 / 58 tests pass (50 baseline + 4 first-pass + 4 handoff). Total +235 / -35 on top of the prior commit.
wenshao
left a comment
There was a problem hiding this comment.
No review findings. Downgraded from Approve to Comment: self-PR reviews cannot be approved by the author; CI has failing check: Test (macos-latest, 24.x). — gpt-5.5 via Qwen Code /review
wenshao
left a comment
There was a problem hiding this comment.
[Suggestion] Both promote paths (child_process and PTY) have zero logging. If promote fails silently — drain timeout, snapshot empty, serializeTerminalToText throws — there is no indicator in logs/stdout to show promote happened, which path was taken, or why the snapshot is missing. This makes debugging at 3am nearly impossible. Consider adding at minimum a console.warn with pid/shellId and the method path on promote entry.
— deepseek-v4-pro via Qwen Code /review
wenshao
left a comment
There was a problem hiding this comment.
Review found 11 Suggestions and 2 Nice-to-haves — no Critical issues. Core logic correct, previous listener-retention feedback addressed.
Additional finding (could not map to diff line): In childProcessFallback, abortSignal.addEventListener('abort', abortHandler) (~L622) runs before child.on('exit', exitHandler) (~L628). If the signal is pre-aborted, detachServiceListeners() tries child.off('exit', exitHandler) before the handler is attached — then L628 attaches it anyway, leaking through the promote boundary. Reorder: register the exit handler before the abort listener.
…romote (resolve 2nd review pass) Addresses 6 follow-up [Suggestion] threads on PR-1 of #3831 — all substantive code-quality issues raised by the second-pass review of the dispose-based detach commit (8e8e18c): - **Exhaustive switch on `ShellAbortReason.kind`** (both abort handlers). Earlier `if (reason?.kind === 'background')` form silently fell through to kill for any unrecognized variant — a future `{ kind: 'suspend' }` would have killed the process with zero compile-time signal. Switched to `switch (kind)` with a `never`-typed default that runs `debugLogger.warn` and falls back to the safest behavior (cancel/kill). Each branch is now extracted into a named helper (`performBackgroundPromote` / `performCancelKill`) so the switch body stays a single screenful. - **Each `dispose()` wrapped in its own try/catch** (PTY). node-pty's `IDisposable` contract doesn't guarantee no-throw. Without per-dispose try/catch a single throwing dispose() would skip subsequent cleanup (the other dispose, off('error'), activePtys.delete, drain, resolve) and the caller would hang forever on `await result`. Each call now logs via debugLogger.warn on failure but continues. - **`.catch(() => undefined)` on the processingChain side of the drain race** (PTY). `Promise.race([processingChain.then(drain).then(drain), timeout])` would propagate a chain rejection out of the race; since `addEventListener` doesn't await our handler, the rejection became unhandled and `resolve()` was never called → caller hung. Now the rejection is swallowed; the timeout side still terminates the race on time. - **Drain-timeout truncation now emits a diagnostic warning** (PTY). Previously the 200ms drain timeout could fire, the snapshot would be taken with the buffer in mid-write state, and the result.output would be silently truncated. Race result is now observed via a symbol sentinel; when the timeout side wins, debugLogger.warn fires pointing the user at rawOutput as the un-truncated fallback. - **Snapshot serialize failure logs instead of swallowing silently** (PTY). Empty `catch {}` made result.output indistinguishable from "command produced no output" if serializeTerminalToText threw. Now `debugLogger.warn` with the error message leaves a trail for support bundles. - **Dedicated `PROMOTE_DRAIN_TIMEOUT_MS` constant** separated from `SIGKILL_TIMEOUT_MS`. Both are 200ms today, but they have unrelated reasons-to-change (kill escalation timing vs. promote drain ceiling) — sharing the constant means tuning one would silently change the other. Also adds a module-level `debugLogger = createDebugLogger('SHELL_EXECUTION')` since the service had no logging surface before this commit. 58 / 58 tests pass; tsc clean; ESLint clean. No new tests added: the new behaviors (timeout sentinel firing, dispose throw, exhaustive switch default) are defensive log-only paths; existing handoff tests already cover the happy path. Adding mock-throw tests is reasonable follow-up but not blocking.
Code ReviewOverviewPR-1 of 3 implementing Phase D(b) of #3831. Pure plumbing for a future Ctrl+B "promote foreground shell to background" feature. It introduces a Both Code Quality & StyleStrengths
Specific SuggestionsMinor: phantom
|
…t-reason read Resolves the third review pass on PR-1 of #3831 — 1 real bug + 2 defensive hardenings: - **Real bug: `ptyProcess.off('error', ...)` throws TypeError at runtime** (line ~1074). `@lydell/node-pty`'s `IPty` interface exposes the legacy Node EventEmitter `removeListener`, not the modern `off` alias. Previous form threw, the surrounding try/catch swallowed it (post-prior-pass dispose hardening), but the old `ptyErrorHandler` stayed registered — so a post-promote PTY error would still hit our foreground handler and `throw err`, breaking the handoff contract that PR-1's whole listener-detach work is supposed to enforce. Switched to `removeListener`. The catch + warn stays as defense-in-depth; the message wording is updated. - **Prototype-pollution-safe `kind` read** (extracted to module-level helper `getShellAbortReasonKind`). The previous `reason?.kind` walked the prototype chain — a polluted `Object.prototype.kind = 'background'` would silently route `abortController.abort({})` (any plain object reason) into the promote branch and skip the kill. Lifecycle/safety branch deserves the extra check. Helper now: rejects non-object reasons; reads `kind` only as an OWN property (`hasOwnProperty`); whitelists against `'background' | 'cancel'`; defaults to `'cancel'` (the safe historical behavior) for everything else. Both abort handlers (childProcess + PTY) now share this helper. - **`streamStdout: true` + background-promote = silent empty snapshot** (childProcess `performBackgroundPromote`). The promote snapshot reads from the `stdout` / `stderr` string accumulators; but in `streamStdout` mode `handleOutput` forwards bytes through `onOutputEvent` and skips the accumulators entirely. Today PR-1's only call site (foreground shell.ts) uses `streamStdout: false`, so the combination is unreachable — but if a future caller pairs the two, `result.output` would be empty with no diagnostic. Added a `debugLogger.warn` when the combination occurs, pointing the caller at `rawOutput` as the fallback. Cheaper than building a parallel accumulator just for this latent case. 58 / 58 tests pass; tsc clean; ESLint clean.
There was a problem hiding this comment.
Pull request overview
This PR (PR-1 of 3 for #3831) introduces a signal.reason convention to let ShellExecutionService.execute() distinguish between a normal cancel abort vs a “promote to background” takeover abort. It adds a new discriminated ShellAbortReason union, updates both PTY and child_process fallback abort handlers to skip killing the child on { kind: 'background' }, detach service listeners, drop the child from active tracking, snapshot output, and resolve early with promoted: true. Unit tests are updated to pin both branches and verify post-promotion listener handoff boundaries.
Changes:
- Add
ShellAbortReason(viaAbortSignal.reason) andShellExecutionResult.promoted?: booleanto support background promotion signaling. - Implement reason-aware abort handling for both PTY and
child_processexecution paths (detach listeners, skip kill, snapshot output, resolve early). - Extend
shellExecutionServiceunit tests to cover cancel-vs-background discrimination and post-promotion handoff behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| packages/core/src/services/shellExecutionService.ts | Adds ShellAbortReason, promoted result field, and reason-aware abort handling with listener detachment + snapshot behavior for PTY and child_process. |
| packages/core/src/services/shellExecutionService.test.ts | Adds/updates tests for abort-reason branching and validates listener disposal/off behavior across the promotion boundary. |
Comments suppressed due to low confidence (1)
packages/core/src/services/shellExecutionService.ts:783
- This comment says in-flight callbacks should avoid emitting or writing to the "(now caller-owned) headlessTerminal", but the terminal instance is still owned/managed by ShellExecutionService; only the PTY process ownership is transferred. Consider rewording to avoid implying the terminal object is handed off to the caller.
? [...argsPrefix, commandToExecute].join(' ')
: [...argsPrefix, commandToExecute];
const ptyProcess = ptyInfo.module.spawn(executable, args, {
cwd,
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
wenshao
left a comment
There was a problem hiding this comment.
…ware PTY snapshot (resolve 4th review pass) Resolves 6 threads on PR-1 of #3831 — 1 Critical + 1 real bug + 2 quality + 2 test-coverage: - **[Critical] `getShellAbortReasonKind` throw-safe property read.** Previous form read `reason.kind` after only checking that `kind` is an own property. An own accessor that throws (or a Proxy with a trapping getter) would throw before the helper reached either the cancel kill path or the background promote path. Abort handlers are dispatched async and not awaited by AbortSignal, so a leaked throw here would have left the shell process alive instead of being killed on cancel — quietly. Wrapped the property read in try/catch with a fall-back to the safe 'cancel' kill behavior. - **Real bug: child_process post-exit race in background-promote** (`performBackgroundPromote`). The child may have already exited but the 'exit' event hasn't reached our handler yet (Node delivers events on the next microtask). Promoting in that window would detach our exit listener and report `promoted: true` for a process that's already dead — the caller would hold an inert pid expecting to take over. Now we read `child.exitCode` / `child.signalCode` before detaching: if either is non-null, fall through and let the pending exit handler resolve normally with the real exit info. Mirrored mock setup so `exitCode` / `signalCode` default to `null` (matching real ChildProcess) instead of `undefined`. - **PTY snapshot: re-decode + replay (mirror exit-path encoding).** The promoted snapshot was serializing `headlessTerminal` directly, which was fed by a streaming decoder initialized from the first-chunk encoding heuristic. When early output is ASCII-only but later output is in a different encoding (GBK / Shift-JIS / etc.), this produces mojibake — and the normal exit path doesn't, because it re-decodes `finalBuffer` with `getCachedEncodingForBuffer` and replays through a fresh terminal. Now mirrors that logic so `result.output` shape matches across the two paths. Direct-serialize remains as a last-ditch fallback if replay throws. - **Switch `default` no longer emits a runtime warn.** Reviewer noted the helper's whitelist made the `default: { _exhaustive: never }` branch unreachable at runtime — the `debugLogger.warn` in it could never fire. Kept the `_: never = kind` type assertion (so a future ShellAbortReason variant forces a TS error here, directing the developer to extend BOTH the helper's whitelist AND add a `case`), removed the unreachable warn. Added a comment that the assertion is the static-only safety net the union expansion would trigger. - **Direct unit tests for `getShellAbortReasonKind`** (8 cases). The helper's prototype-pollution defense is the main reason it exists; if `hasOwnProperty` is accidentally removed the regression would silently send `abortController.abort({})` (any plain reason) into the promote path. Exported the helper and added direct tests for: null / undefined, non-object, empty object (no own kind), prototype- only kind (pollution), unknown kind value, throwing accessor, Proxy trap, and the two happy paths. - **`removeListener` regression guard.** The fix to call `ptyProcess.removeListener('error', ...)` instead of `.off(...)` matters because `@lydell/node-pty`'s IPty interface only exposes `removeListener` — `.off()` throws TypeError on a real PTY but the EventEmitter mock tolerates both. Added a test that spies on both methods and asserts the production code uses `removeListener` for the 'error' event, so a future swap back to `.off()` regresses loudly under the mock instead of silently. 68 / 68 tests pass (58 baseline + 9 helper boundary + 1 removeListener guard + 1 post-exit race); tsc clean; ESLint clean.
…eview pass) Mirrors the child_process post-exit race fix from 4cc558b into the PTY path — addresses 1 [Critical] thread on PR-1 of #3831: The PTY may have already exited but our `exitDisposable` (onExit callback) hasn't run yet — node-pty delivers the exit event asynchronously after the PTY's native SIGCHLD, so there's a window between "PTY actually dead" and "service onExit fires". Promoting in that window detaches our exit listener and reports `promoted: true` for a dead PTY, losing the real exit status; the caller would hold an inert pid expecting to take over. The IPty interface doesn't expose an `exitCode` field we can read directly (unlike `child.exitCode` / `child.signalCode` for child_process), so use `process.kill(pid, 0)` as a best-effort liveness check via the existing `ShellExecutionService.isPtyActive` helper. If kill(pid, 0) throws ESRCH, the pid is gone — log at debug level and fall through, letting the pending onExit callback resolve normally with the real exit info. Also adds a unit test mirroring the child_process race test: mocks `process.kill(pid, 0)` to throw ESRCH on the liveness probe, asserts the result has no `promoted: true` and reports the real exitCode. 69 / 69 tests pass; tsc clean; ESLint clean.
Doc said 'all six edge cases' but the test suite has 8 cases (added Proxy-trap and undefined later). Off-by-2 cosmetic only — no behavior change. Caught during a multi-round self-audit of PR-1 of #3831. Audit summary: 7 rounds (correctness / reverse / consistency / coverage / build / exception paths / style) found one false-positive (a sync- abort registration-order race I initially thought existed). Verified that Node's WHATWG AbortSignal does NOT auto-fire 'abort' listeners on already-aborted signals, so the race window cannot open. No code change needed for that scenario; this commit is just the JSDoc fix. 69 / 69 tests still pass; tsc + ESLint clean.
…citly Multi-round self-audit found that `getShellAbortReasonKind`'s value whitelist has no compile-time tie to the `ShellAbortReason` union: when the union grows, TypeScript's `_exhaustive: never` in each switch forces #3 (the case arm) to be added, but the helper's whitelist (#2) silently keeps degrading the new variant to 'cancel', and the new case arm is never reached at runtime. Reviewer #4 raised this on the second pass; the original commit chose to accept it (option B in that thread) but didn't leave a strong in-code signal for future contributors. Added an INVARIANT block inside the helper enumerating the three sites that must be kept in sync, so the next person extending `ShellAbortReason` sees the coupling at the place where they're most likely to forget it. No behavior change — comment-only. 69 / 69 tests still pass; tsc + ESLint clean. Audit summary (this round + prior round): 18 angles total over two sweeps and one reverse-attack pass. Found: - 0 real bugs - 1 false-positive race (sync-abort registration order — Node WHATWG AbortSignal does NOT auto-fire on already-aborted signals; investigated, reverted) - 1 cosmetic doc fix (boundary-test count off-by-2) - 1 cosmetic INVARIANT block (this commit) Areas reviewed without finding new issues: caller-side ShellExecutionResult shape compatibility (optional `promoted?` field, existing callers spread-untouched); `exited` flag lifecycle (monotonic, cleanup() idempotent); processingChain in-flight ownership (listenersDetached guards every onOutputEvent emit including the renderFn-rendered case via the same flag); race between exit event and abort handler (both microtasks, FIFO ordering gives correct outcome either way); Node version dependence (`AbortSignal.reason` is Node 17.2+, engines: >=20 covers it); test isolation (mockImplementationOnce + module-level mockProcessKill clears each beforeEach); `process.kill(pid, 0)` Windows liveness reliability (best-effort, acceptable for PR-1 plumbing); PID reuse race on the PTY liveness check (theoretically possible, microsecond window, unavoidable at the OS level — rejected in spec discussion); PR-2/PR-3 contract surface (caller MUST attach listeners before abort — documented; any future caller violating this is its own bug).
…reEach The 'execution method selection' describe block has its own beforeEach (separate from 'child_process fallback') that builds mockChildProcess but does not set `exitCode` / `signalCode = null`. Real Node `ChildProcess.exitCode` / `signalCode` are `null` while the process is alive — and production now reads these in the background-promote race guard. The current tests in this block don't exercise the promote path, so they pass regardless, but any future promote-related test landing here would silently trip the guard (`undefined !== null` is true) and fall through to the normal-exit branch instead of promoting. Mirror the `child_process fallback` block's mock setup so the two beforeEach hooks produce equivalent ChildProcess shapes, eliminating a quiet foot-gun for future contributors. Comment-only / test-fixture change. 69 / 69 tests still pass; tsc clean. Found during a deeper third-round self-audit of PR-1 of #3831.
tanzhenxin
left a comment
There was a problem hiding this comment.
Review
Approving — clean plumbing after the iteration rounds, no behavior change for existing callers. A few notes to consider, none blocking.
Notes
1. aborted + promoted shape. Both promote branches resolve with aborted: true, promoted: true. The existing tools/shell.ts consumer checks aborted first and emits "cancelled" copy. PR-2 callers will need to branch on promoted before aborted. Might be cleaner to set aborted: false when promoted: true so existing branches just work.
2. getShellAbortReasonKind defensive read. The hasOwnProperty.call on the reason runs outside the try. A Proxy with a throwing getOwnPropertyDescriptor trap would propagate past the helper. Probably worth moving the check inside the try.
3. PTY handoff-boundary test. Asserts dispose was called, but doesn't emit data after the abort to verify the foreground handler actually stops firing — the child_process equivalent does. Worth mirroring when convenient.
APPROVE.
…wenLM#3831 PR-1 of 3) (QwenLM#3842) * feat(core): add signal.reason convention for ShellExecutionService.execute() Foundation for QwenLM#3831 Phase D (b) — Ctrl+B promote of a running foreground shell to background. Defines a discriminated `ShellAbortReason` union that the AbortSignal carries; default behavior (no reason / `{ kind: 'cancel' }`) keeps the existing tree-kill on abort. `{ kind: 'background' }` is a takeover signal — execute() skips the kill, drops the child from its active set (so cleanup() won't kill it later), flushes a snapshot of captured output, and resolves the result Promise immediately with `promoted: true` so the awaiting caller unblocks. Pure plumbing: no caller sets the reason yet, so this is a zero-behavior change for existing call sites. The `promoted?: boolean` field is optional on ShellExecutionResult so existing consumers compile against the new shape without source changes. Tests pin both branches in both childProcessFallback and executeWithPty: default abort still SIGTERM-tree-kills; `{ kind: 'cancel' }` is identical to default (pin against accidental routing through the background branch); `{ kind: 'background' }` skips the kill, snapshot output is preserved, mockProcessKill / mockPtyProcess.kill are NOT called. Part of QwenLM#3831 (Phase D part b — Ctrl+B promote running shell to background). PR-1 of 3. * fix(core): detach service listeners on background-promote (resolve review) Addresses 4 Critical + 2 Suggestion findings on PR-1 of QwenLM#3831: - **childProcess listener detach** (review line 555 + 573): Anonymous arrow listeners on stdout/stderr/error/exit could not be off()'d. After background-promote, post-promote bytes would re-enter handleOutput, which then calls decoder.decode() on a now-finalized text decoder (cleanup() already called .decode() without stream:true) → TypeError crash. Even without the crash, old onOutputEvent would fire for new data → ownership contract violation + duplication. Fix: extract named handler refs (stdoutHandler / stderrHandler / errorHandler / exitHandler) and call off() on all four in the background-promote branch via a detachServiceListeners() helper. - **PTY listener detach** (review line 967 + 990): node-pty's onData / onExit return IDisposable handles; the abort handler now captures dataDisposable / exitDisposable and calls .dispose() in the background-promote branch. ptyProcess.on('error') is EventEmitter-style (not IDisposable) — extract a named ptyErrorHandler ref and off() it. Without these, post-promote PTY error throws → Node.js crash; post-promote data continues writing to headlessTerminal and calling old onOutputEvent → ownership violation. - **PTY in-flight chain item ownership** (related to review line 990): processingChain may have already-enqueued callbacks past the early listenersDetached check. Refactored from "early-return short-circuit" to "guard each onOutputEvent emit individually" so in-flight writes still LAND in headlessTerminal (snapshot reflects them) but no events leak to the foreground onOutputEvent. Also clear renderTimeout in the abort handler so a pending throttled render doesn't fire post-promote. - **PTY snapshot freshness** (review line 972, suggestion): The original abort handler called serializeTerminalToText immediately. Now we await Promise.race([processingChain drain, SIGKILL_TIMEOUT_MS]) first (mirrors the onExit finalize pattern at ~line 970) so in-flight headlessTerminal.write callbacks land before serialization. Skipped render(true) intentionally because it would emit final onOutputEvent data (renderFn calls onOutputEvent), violating the "no emit post-promote" invariant — added a comment explaining why direct serialize is correct. - **Handoff-boundary tests** (review line 1257, suggestion): Added 4 new tests pinning the ownership contract — 2 for child_process (post-promote stdout/stderr does NOT route to onOutputEvent; child exit does NOT re-resolve result), 2 for PTY (data/exit disposables ARE called; result shape stays promoted: true even if post-promote events fire). Also: test setup now stubs mockPtyProcess.onData / .onExit to return { dispose: vi.fn() } so the background-promote path's dispose() calls don't crash on undefined (the stub's mock.results[0].value is then inspected by the new handoff tests). 58 / 58 tests pass (50 baseline + 4 first-pass + 4 handoff). Total +235 / -35 on top of the prior commit. * fix(core): defensive hardening for ShellExecutionService background-promote (resolve 2nd review pass) Addresses 6 follow-up [Suggestion] threads on PR-1 of QwenLM#3831 — all substantive code-quality issues raised by the second-pass review of the dispose-based detach commit (8e8e18c): - **Exhaustive switch on `ShellAbortReason.kind`** (both abort handlers). Earlier `if (reason?.kind === 'background')` form silently fell through to kill for any unrecognized variant — a future `{ kind: 'suspend' }` would have killed the process with zero compile-time signal. Switched to `switch (kind)` with a `never`-typed default that runs `debugLogger.warn` and falls back to the safest behavior (cancel/kill). Each branch is now extracted into a named helper (`performBackgroundPromote` / `performCancelKill`) so the switch body stays a single screenful. - **Each `dispose()` wrapped in its own try/catch** (PTY). node-pty's `IDisposable` contract doesn't guarantee no-throw. Without per-dispose try/catch a single throwing dispose() would skip subsequent cleanup (the other dispose, off('error'), activePtys.delete, drain, resolve) and the caller would hang forever on `await result`. Each call now logs via debugLogger.warn on failure but continues. - **`.catch(() => undefined)` on the processingChain side of the drain race** (PTY). `Promise.race([processingChain.then(drain).then(drain), timeout])` would propagate a chain rejection out of the race; since `addEventListener` doesn't await our handler, the rejection became unhandled and `resolve()` was never called → caller hung. Now the rejection is swallowed; the timeout side still terminates the race on time. - **Drain-timeout truncation now emits a diagnostic warning** (PTY). Previously the 200ms drain timeout could fire, the snapshot would be taken with the buffer in mid-write state, and the result.output would be silently truncated. Race result is now observed via a symbol sentinel; when the timeout side wins, debugLogger.warn fires pointing the user at rawOutput as the un-truncated fallback. - **Snapshot serialize failure logs instead of swallowing silently** (PTY). Empty `catch {}` made result.output indistinguishable from "command produced no output" if serializeTerminalToText threw. Now `debugLogger.warn` with the error message leaves a trail for support bundles. - **Dedicated `PROMOTE_DRAIN_TIMEOUT_MS` constant** separated from `SIGKILL_TIMEOUT_MS`. Both are 200ms today, but they have unrelated reasons-to-change (kill escalation timing vs. promote drain ceiling) — sharing the constant means tuning one would silently change the other. Also adds a module-level `debugLogger = createDebugLogger('SHELL_EXECUTION')` since the service had no logging surface before this commit. 58 / 58 tests pass; tsc clean; ESLint clean. No new tests added: the new behaviors (timeout sentinel firing, dispose throw, exhaustive switch default) are defensive log-only paths; existing handoff tests already cover the happy path. Adding mock-throw tests is reasonable follow-up but not blocking. * fix(core): real bug — ptyProcess.off → removeListener; defensive abort-reason read Resolves the third review pass on PR-1 of QwenLM#3831 — 1 real bug + 2 defensive hardenings: - **Real bug: `ptyProcess.off('error', ...)` throws TypeError at runtime** (line ~1074). `@lydell/node-pty`'s `IPty` interface exposes the legacy Node EventEmitter `removeListener`, not the modern `off` alias. Previous form threw, the surrounding try/catch swallowed it (post-prior-pass dispose hardening), but the old `ptyErrorHandler` stayed registered — so a post-promote PTY error would still hit our foreground handler and `throw err`, breaking the handoff contract that PR-1's whole listener-detach work is supposed to enforce. Switched to `removeListener`. The catch + warn stays as defense-in-depth; the message wording is updated. - **Prototype-pollution-safe `kind` read** (extracted to module-level helper `getShellAbortReasonKind`). The previous `reason?.kind` walked the prototype chain — a polluted `Object.prototype.kind = 'background'` would silently route `abortController.abort({})` (any plain object reason) into the promote branch and skip the kill. Lifecycle/safety branch deserves the extra check. Helper now: rejects non-object reasons; reads `kind` only as an OWN property (`hasOwnProperty`); whitelists against `'background' | 'cancel'`; defaults to `'cancel'` (the safe historical behavior) for everything else. Both abort handlers (childProcess + PTY) now share this helper. - **`streamStdout: true` + background-promote = silent empty snapshot** (childProcess `performBackgroundPromote`). The promote snapshot reads from the `stdout` / `stderr` string accumulators; but in `streamStdout` mode `handleOutput` forwards bytes through `onOutputEvent` and skips the accumulators entirely. Today PR-1's only call site (foreground shell.ts) uses `streamStdout: false`, so the combination is unreachable — but if a future caller pairs the two, `result.output` would be empty with no diagnostic. Added a `debugLogger.warn` when the combination occurs, pointing the caller at `rawOutput` as the fallback. Cheaper than building a parallel accumulator just for this latent case. 58 / 58 tests pass; tsc clean; ESLint clean. * fix(core): liveness check + throw-safe abort-reason read + encoding-aware PTY snapshot (resolve 4th review pass) Resolves 6 threads on PR-1 of QwenLM#3831 — 1 Critical + 1 real bug + 2 quality + 2 test-coverage: - **[Critical] `getShellAbortReasonKind` throw-safe property read.** Previous form read `reason.kind` after only checking that `kind` is an own property. An own accessor that throws (or a Proxy with a trapping getter) would throw before the helper reached either the cancel kill path or the background promote path. Abort handlers are dispatched async and not awaited by AbortSignal, so a leaked throw here would have left the shell process alive instead of being killed on cancel — quietly. Wrapped the property read in try/catch with a fall-back to the safe 'cancel' kill behavior. - **Real bug: child_process post-exit race in background-promote** (`performBackgroundPromote`). The child may have already exited but the 'exit' event hasn't reached our handler yet (Node delivers events on the next microtask). Promoting in that window would detach our exit listener and report `promoted: true` for a process that's already dead — the caller would hold an inert pid expecting to take over. Now we read `child.exitCode` / `child.signalCode` before detaching: if either is non-null, fall through and let the pending exit handler resolve normally with the real exit info. Mirrored mock setup so `exitCode` / `signalCode` default to `null` (matching real ChildProcess) instead of `undefined`. - **PTY snapshot: re-decode + replay (mirror exit-path encoding).** The promoted snapshot was serializing `headlessTerminal` directly, which was fed by a streaming decoder initialized from the first-chunk encoding heuristic. When early output is ASCII-only but later output is in a different encoding (GBK / Shift-JIS / etc.), this produces mojibake — and the normal exit path doesn't, because it re-decodes `finalBuffer` with `getCachedEncodingForBuffer` and replays through a fresh terminal. Now mirrors that logic so `result.output` shape matches across the two paths. Direct-serialize remains as a last-ditch fallback if replay throws. - **Switch `default` no longer emits a runtime warn.** Reviewer noted the helper's whitelist made the `default: { _exhaustive: never }` branch unreachable at runtime — the `debugLogger.warn` in it could never fire. Kept the `_: never = kind` type assertion (so a future ShellAbortReason variant forces a TS error here, directing the developer to extend BOTH the helper's whitelist AND add a `case`), removed the unreachable warn. Added a comment that the assertion is the static-only safety net the union expansion would trigger. - **Direct unit tests for `getShellAbortReasonKind`** (8 cases). The helper's prototype-pollution defense is the main reason it exists; if `hasOwnProperty` is accidentally removed the regression would silently send `abortController.abort({})` (any plain reason) into the promote path. Exported the helper and added direct tests for: null / undefined, non-object, empty object (no own kind), prototype- only kind (pollution), unknown kind value, throwing accessor, Proxy trap, and the two happy paths. - **`removeListener` regression guard.** The fix to call `ptyProcess.removeListener('error', ...)` instead of `.off(...)` matters because `@lydell/node-pty`'s IPty interface only exposes `removeListener` — `.off()` throws TypeError on a real PTY but the EventEmitter mock tolerates both. Added a test that spies on both methods and asserts the production code uses `removeListener` for the 'error' event, so a future swap back to `.off()` regresses loudly under the mock instead of silently. 68 / 68 tests pass (58 baseline + 9 helper boundary + 1 removeListener guard + 1 post-exit race); tsc clean; ESLint clean. * fix(core): PTY background-promote post-exit race guard (resolve 5th review pass) Mirrors the child_process post-exit race fix from 4cc558b into the PTY path — addresses 1 [Critical] thread on PR-1 of QwenLM#3831: The PTY may have already exited but our `exitDisposable` (onExit callback) hasn't run yet — node-pty delivers the exit event asynchronously after the PTY's native SIGCHLD, so there's a window between "PTY actually dead" and "service onExit fires". Promoting in that window detaches our exit listener and reports `promoted: true` for a dead PTY, losing the real exit status; the caller would hold an inert pid expecting to take over. The IPty interface doesn't expose an `exitCode` field we can read directly (unlike `child.exitCode` / `child.signalCode` for child_process), so use `process.kill(pid, 0)` as a best-effort liveness check via the existing `ShellExecutionService.isPtyActive` helper. If kill(pid, 0) throws ESRCH, the pid is gone — log at debug level and fall through, letting the pending onExit callback resolve normally with the real exit info. Also adds a unit test mirroring the child_process race test: mocks `process.kill(pid, 0)` to throw ESRCH on the liveness probe, asserts the result has no `promoted: true` and reports the real exitCode. 69 / 69 tests pass; tsc clean; ESLint clean. * docs(core): correct getShellAbortReasonKind boundary-test count in JSDoc Doc said 'all six edge cases' but the test suite has 8 cases (added Proxy-trap and undefined later). Off-by-2 cosmetic only — no behavior change. Caught during a multi-round self-audit of PR-1 of QwenLM#3831. Audit summary: 7 rounds (correctness / reverse / consistency / coverage / build / exception paths / style) found one false-positive (a sync- abort registration-order race I initially thought existed). Verified that Node's WHATWG AbortSignal does NOT auto-fire 'abort' listeners on already-aborted signals, so the race window cannot open. No code change needed for that scenario; this commit is just the JSDoc fix. 69 / 69 tests still pass; tsc + ESLint clean. * docs(core): document the helper / union / switch sync invariant explicitly Multi-round self-audit found that `getShellAbortReasonKind`'s value whitelist has no compile-time tie to the `ShellAbortReason` union: when the union grows, TypeScript's `_exhaustive: never` in each switch forces #3 (the case arm) to be added, but the helper's whitelist (#2) silently keeps degrading the new variant to 'cancel', and the new case arm is never reached at runtime. Reviewer #4 raised this on the second pass; the original commit chose to accept it (option B in that thread) but didn't leave a strong in-code signal for future contributors. Added an INVARIANT block inside the helper enumerating the three sites that must be kept in sync, so the next person extending `ShellAbortReason` sees the coupling at the place where they're most likely to forget it. No behavior change — comment-only. 69 / 69 tests still pass; tsc + ESLint clean. Audit summary (this round + prior round): 18 angles total over two sweeps and one reverse-attack pass. Found: - 0 real bugs - 1 false-positive race (sync-abort registration order — Node WHATWG AbortSignal does NOT auto-fire on already-aborted signals; investigated, reverted) - 1 cosmetic doc fix (boundary-test count off-by-2) - 1 cosmetic INVARIANT block (this commit) Areas reviewed without finding new issues: caller-side ShellExecutionResult shape compatibility (optional `promoted?` field, existing callers spread-untouched); `exited` flag lifecycle (monotonic, cleanup() idempotent); processingChain in-flight ownership (listenersDetached guards every onOutputEvent emit including the renderFn-rendered case via the same flag); race between exit event and abort handler (both microtasks, FIFO ordering gives correct outcome either way); Node version dependence (`AbortSignal.reason` is Node 17.2+, engines: >=20 covers it); test isolation (mockImplementationOnce + module-level mockProcessKill clears each beforeEach); `process.kill(pid, 0)` Windows liveness reliability (best-effort, acceptable for PR-1 plumbing); PID reuse race on the PTY liveness check (theoretically possible, microsecond window, unavoidable at the OS level — rejected in spec discussion); PR-2/PR-3 contract surface (caller MUST attach listeners before abort — documented; any future caller violating this is its own bug). * test(core): align mockChildProcess.exitCode/signalCode in second beforeEach The 'execution method selection' describe block has its own beforeEach (separate from 'child_process fallback') that builds mockChildProcess but does not set `exitCode` / `signalCode = null`. Real Node `ChildProcess.exitCode` / `signalCode` are `null` while the process is alive — and production now reads these in the background-promote race guard. The current tests in this block don't exercise the promote path, so they pass regardless, but any future promote-related test landing here would silently trip the guard (`undefined !== null` is true) and fall through to the normal-exit branch instead of promoting. Mirror the `child_process fallback` block's mock setup so the two beforeEach hooks produce equivalent ChildProcess shapes, eliminating a quiet foot-gun for future contributors. Comment-only / test-fixture change. 69 / 69 tests still pass; tsc clean. Found during a deeper third-round self-audit of PR-1 of QwenLM#3831.
…-up to #3842) (#3886) * fix(core): wrap hasOwnProperty.call inside try + add post-abort PTY data assertion Two non-blocking review notes from @tanzhenxin's PR-1 approval (#3842, post-merge follow-up): - **Note 2 (real bug)**: `getShellAbortReasonKind` had `Object.prototype.hasOwnProperty.call(reason, 'kind')` outside the try/catch. `hasOwnProperty.call` triggers the `[[GetOwnProperty]]` Proxy trap (`getOwnPropertyDescriptor` handler). A Proxy whose `getOwnPropertyDescriptor` throws — separate from a throwing `get` trap, which the prior commit already covered — would propagate past the helper, leaving the abort handler's switch on `kind` to throw through `addEventListener` (which doesn't await async listener return values), so the shell process would stay alive instead of being killed on cancel. Moved the descriptor probe inside the same try block as the value read. My own multi-round audit covered six attack vectors but missed this one — only `get` trap throws were considered. The reviewer caught it on first pass. - **Note 3 (test parity)**: the PTY post-promotion handoff test asserted `dispose` was called but never re-invoked the data callback to verify the foreground `onOutputEvent` actually stops firing — the child_process equivalent has that assertion. Mirrored it: emit data AFTER abort by re-invoking the captured `dataCallback` reference, and assert `onOutputEventMock.mock.calls.length` does NOT increase past the moment of promote. Exercises the production `listenersDetached` guard inside the chain callback, which the bare dispose-was-called check didn't. Note 1 from the same review (the `aborted: true + promoted: true` shape forcing PR-2 callers to check `promoted` before `aborted`) is deliberately NOT addressed here — it's a contract simplification that affects PR-2's branching, so it belongs in PR-2 along with the caller-side decision on whether to flip `aborted` for promoted results. Added a TODO upstream in #3831 (PR-2 design) to track. 70 / 70 tests pass (69 baseline + 1 new helper boundary for the throwing-getOwnPropertyDescriptor case). tsc + ESLint clean. * test(core): fix tautological PTY post-promote assertion (audit follow-up) The PTY post-promotion handoff test added in the previous commit copied the child_process equivalent's pattern verbatim — sync \`expect(count).toBe(countAtPromote)\` immediately after dataCallback returns. That works for child_process because its \`handleOutput\` is fully synchronous (sniff → decoder → emit, all on the same call stack), so the count change happens BEFORE the assertion. PTY's \`handleOutput\` is async — \`processingChain.then(...)\` queues a microtask that does the sniff + write + render-then-emit work. The sync assertion captures both \`countAtPromote\` and the post-emit count BEFORE the chain microtask ever runs, so both reads return whatever happened before the assert (typically 0). The test would tautologically pass even if the production \`listenersDetached\` guard were removed — i.e., it didn't actually verify the guard. Restructured to: 1. Drive the PTY through \`simulateExecution\` so \`await handle.result\` forces all queued microtasks (including pre-promote chain items AND the abort handler's drain) to settle. 2. Capture \`eventCountAfterSettle\` once everything has stabilized. 3. Re-invoke the captured \`dataCallback\` with post-promote data, await two more macrotask boundaries to let the new chain item fully run. 4. Assert the count hasn't moved. If the production \`listenersDetached\` guard is removed, the post-promote chain item emits, count increases past \`eventCountAfterSettle\`, and this assertion fails. So the test actually exercises the guard now. Found in self-audit while reviewing my own follow-up commit. Caught because audit was paranoid about *whether the test verifies what it claims to verify*, not just whether it passes. 70 / 70 tests pass; tsc + ESLint clean. * test(core): pin eventCountAfterSettle === 0 in PTY post-promote test The previous fix asserted post-promote count equals the \`eventCountAfterSettle\` baseline, but didn't pin the baseline itself. With the production \`listenersDetached\` guard intact, both halves (pre-promote chain and post-promote chain) suppress emit, so \`eventCountAfterSettle === 0 === post-count\` and the relative comparison is vacuously true. If a future refactor changed the production guard semantics so the pre-promote chain item DID emit (count becomes 1+ after settle), the relative-comparison test would still pass as long as post-promote also emitted the same number — that's a regression the test should catch. Adding \`expect(eventCountAfterSettle).toBe(0)\` makes the contract explicit: once \`listenersDetached\` is set during the abort handler's sync part, BOTH the in-flight chain item (pre-promote) and the future chain item (post-promote) skip emit. Found in another self-audit pass — even after fixing the tautological assertion, the test could still mask certain future regressions. 70 / 70 tests pass; tsc + ESLint clean. * test(core): drop dataCallbackHolder pattern, read mock.calls directly The dataCallbackHolder pattern (capturing the onData callback inside simulateExecution then invoking it after via a closure-shared object) was unnecessary indirection — \`mockPtyProcess.onData.mock.calls[0][0]\` reads the same callback reference whether you read it inside or outside the simulation closure. Vi's mock.calls array is per-mock-instance and beforeEach re-creates mockPtyProcess + .onData freshly, so there's no stale-reference risk in the simpler form. No behavior change in what's tested. 70 / 70 pass. * chore(core): drop in-source @-mention attribution + dead Proxy.get handler Two cosmetic cleanups found in another self-audit pass: - **Helper comment**: removed the parenthetical attribution ("Caught by @tanzhenxin in the PR-1 review; my own audit only covered \`get\` trap throws."). The technical content of the comment — explaining why both the descriptor probe and the value read live inside the try — stands on its own. The reviewer credit lives in commit history / PR description, where it belongs; an in-source @-mention ages poorly (handles change, the relevant person may move on) and doesn't help future readers reason about the code. - **Test Proxy**: \`throwingDescriptorProxy\` declared a \`get()\` handler that always returned \`undefined\`. The descriptor probe throws before the helper ever reaches the value read, so the \`get\` handler is unreachable — dropped it and added a one-line comment explaining why no \`get\` handler is needed for this test. Mirror test (\`throwingReason\` with throwing accessor + \`proxyReason\` with throwing \`get\`) keeps the symmetric "throwing-`get`" coverage. 70 / 70 tests pass; tsc + ESLint clean.
…3831 PR-1 of 3) (#3842) * feat(core): add signal.reason convention for ShellExecutionService.execute() Foundation for #3831 Phase D (b) — Ctrl+B promote of a running foreground shell to background. Defines a discriminated `ShellAbortReason` union that the AbortSignal carries; default behavior (no reason / `{ kind: 'cancel' }`) keeps the existing tree-kill on abort. `{ kind: 'background' }` is a takeover signal — execute() skips the kill, drops the child from its active set (so cleanup() won't kill it later), flushes a snapshot of captured output, and resolves the result Promise immediately with `promoted: true` so the awaiting caller unblocks. Pure plumbing: no caller sets the reason yet, so this is a zero-behavior change for existing call sites. The `promoted?: boolean` field is optional on ShellExecutionResult so existing consumers compile against the new shape without source changes. Tests pin both branches in both childProcessFallback and executeWithPty: default abort still SIGTERM-tree-kills; `{ kind: 'cancel' }` is identical to default (pin against accidental routing through the background branch); `{ kind: 'background' }` skips the kill, snapshot output is preserved, mockProcessKill / mockPtyProcess.kill are NOT called. Part of #3831 (Phase D part b — Ctrl+B promote running shell to background). PR-1 of 3. * fix(core): detach service listeners on background-promote (resolve review) Addresses 4 Critical + 2 Suggestion findings on PR-1 of #3831: - **childProcess listener detach** (review line 555 + 573): Anonymous arrow listeners on stdout/stderr/error/exit could not be off()'d. After background-promote, post-promote bytes would re-enter handleOutput, which then calls decoder.decode() on a now-finalized text decoder (cleanup() already called .decode() without stream:true) → TypeError crash. Even without the crash, old onOutputEvent would fire for new data → ownership contract violation + duplication. Fix: extract named handler refs (stdoutHandler / stderrHandler / errorHandler / exitHandler) and call off() on all four in the background-promote branch via a detachServiceListeners() helper. - **PTY listener detach** (review line 967 + 990): node-pty's onData / onExit return IDisposable handles; the abort handler now captures dataDisposable / exitDisposable and calls .dispose() in the background-promote branch. ptyProcess.on('error') is EventEmitter-style (not IDisposable) — extract a named ptyErrorHandler ref and off() it. Without these, post-promote PTY error throws → Node.js crash; post-promote data continues writing to headlessTerminal and calling old onOutputEvent → ownership violation. - **PTY in-flight chain item ownership** (related to review line 990): processingChain may have already-enqueued callbacks past the early listenersDetached check. Refactored from "early-return short-circuit" to "guard each onOutputEvent emit individually" so in-flight writes still LAND in headlessTerminal (snapshot reflects them) but no events leak to the foreground onOutputEvent. Also clear renderTimeout in the abort handler so a pending throttled render doesn't fire post-promote. - **PTY snapshot freshness** (review line 972, suggestion): The original abort handler called serializeTerminalToText immediately. Now we await Promise.race([processingChain drain, SIGKILL_TIMEOUT_MS]) first (mirrors the onExit finalize pattern at ~line 970) so in-flight headlessTerminal.write callbacks land before serialization. Skipped render(true) intentionally because it would emit final onOutputEvent data (renderFn calls onOutputEvent), violating the "no emit post-promote" invariant — added a comment explaining why direct serialize is correct. - **Handoff-boundary tests** (review line 1257, suggestion): Added 4 new tests pinning the ownership contract — 2 for child_process (post-promote stdout/stderr does NOT route to onOutputEvent; child exit does NOT re-resolve result), 2 for PTY (data/exit disposables ARE called; result shape stays promoted: true even if post-promote events fire). Also: test setup now stubs mockPtyProcess.onData / .onExit to return { dispose: vi.fn() } so the background-promote path's dispose() calls don't crash on undefined (the stub's mock.results[0].value is then inspected by the new handoff tests). 58 / 58 tests pass (50 baseline + 4 first-pass + 4 handoff). Total +235 / -35 on top of the prior commit. * fix(core): defensive hardening for ShellExecutionService background-promote (resolve 2nd review pass) Addresses 6 follow-up [Suggestion] threads on PR-1 of #3831 — all substantive code-quality issues raised by the second-pass review of the dispose-based detach commit (8e8e18c): - **Exhaustive switch on `ShellAbortReason.kind`** (both abort handlers). Earlier `if (reason?.kind === 'background')` form silently fell through to kill for any unrecognized variant — a future `{ kind: 'suspend' }` would have killed the process with zero compile-time signal. Switched to `switch (kind)` with a `never`-typed default that runs `debugLogger.warn` and falls back to the safest behavior (cancel/kill). Each branch is now extracted into a named helper (`performBackgroundPromote` / `performCancelKill`) so the switch body stays a single screenful. - **Each `dispose()` wrapped in its own try/catch** (PTY). node-pty's `IDisposable` contract doesn't guarantee no-throw. Without per-dispose try/catch a single throwing dispose() would skip subsequent cleanup (the other dispose, off('error'), activePtys.delete, drain, resolve) and the caller would hang forever on `await result`. Each call now logs via debugLogger.warn on failure but continues. - **`.catch(() => undefined)` on the processingChain side of the drain race** (PTY). `Promise.race([processingChain.then(drain).then(drain), timeout])` would propagate a chain rejection out of the race; since `addEventListener` doesn't await our handler, the rejection became unhandled and `resolve()` was never called → caller hung. Now the rejection is swallowed; the timeout side still terminates the race on time. - **Drain-timeout truncation now emits a diagnostic warning** (PTY). Previously the 200ms drain timeout could fire, the snapshot would be taken with the buffer in mid-write state, and the result.output would be silently truncated. Race result is now observed via a symbol sentinel; when the timeout side wins, debugLogger.warn fires pointing the user at rawOutput as the un-truncated fallback. - **Snapshot serialize failure logs instead of swallowing silently** (PTY). Empty `catch {}` made result.output indistinguishable from "command produced no output" if serializeTerminalToText threw. Now `debugLogger.warn` with the error message leaves a trail for support bundles. - **Dedicated `PROMOTE_DRAIN_TIMEOUT_MS` constant** separated from `SIGKILL_TIMEOUT_MS`. Both are 200ms today, but they have unrelated reasons-to-change (kill escalation timing vs. promote drain ceiling) — sharing the constant means tuning one would silently change the other. Also adds a module-level `debugLogger = createDebugLogger('SHELL_EXECUTION')` since the service had no logging surface before this commit. 58 / 58 tests pass; tsc clean; ESLint clean. No new tests added: the new behaviors (timeout sentinel firing, dispose throw, exhaustive switch default) are defensive log-only paths; existing handoff tests already cover the happy path. Adding mock-throw tests is reasonable follow-up but not blocking. * fix(core): real bug — ptyProcess.off → removeListener; defensive abort-reason read Resolves the third review pass on PR-1 of #3831 — 1 real bug + 2 defensive hardenings: - **Real bug: `ptyProcess.off('error', ...)` throws TypeError at runtime** (line ~1074). `@lydell/node-pty`'s `IPty` interface exposes the legacy Node EventEmitter `removeListener`, not the modern `off` alias. Previous form threw, the surrounding try/catch swallowed it (post-prior-pass dispose hardening), but the old `ptyErrorHandler` stayed registered — so a post-promote PTY error would still hit our foreground handler and `throw err`, breaking the handoff contract that PR-1's whole listener-detach work is supposed to enforce. Switched to `removeListener`. The catch + warn stays as defense-in-depth; the message wording is updated. - **Prototype-pollution-safe `kind` read** (extracted to module-level helper `getShellAbortReasonKind`). The previous `reason?.kind` walked the prototype chain — a polluted `Object.prototype.kind = 'background'` would silently route `abortController.abort({})` (any plain object reason) into the promote branch and skip the kill. Lifecycle/safety branch deserves the extra check. Helper now: rejects non-object reasons; reads `kind` only as an OWN property (`hasOwnProperty`); whitelists against `'background' | 'cancel'`; defaults to `'cancel'` (the safe historical behavior) for everything else. Both abort handlers (childProcess + PTY) now share this helper. - **`streamStdout: true` + background-promote = silent empty snapshot** (childProcess `performBackgroundPromote`). The promote snapshot reads from the `stdout` / `stderr` string accumulators; but in `streamStdout` mode `handleOutput` forwards bytes through `onOutputEvent` and skips the accumulators entirely. Today PR-1's only call site (foreground shell.ts) uses `streamStdout: false`, so the combination is unreachable — but if a future caller pairs the two, `result.output` would be empty with no diagnostic. Added a `debugLogger.warn` when the combination occurs, pointing the caller at `rawOutput` as the fallback. Cheaper than building a parallel accumulator just for this latent case. 58 / 58 tests pass; tsc clean; ESLint clean. * fix(core): liveness check + throw-safe abort-reason read + encoding-aware PTY snapshot (resolve 4th review pass) Resolves 6 threads on PR-1 of #3831 — 1 Critical + 1 real bug + 2 quality + 2 test-coverage: - **[Critical] `getShellAbortReasonKind` throw-safe property read.** Previous form read `reason.kind` after only checking that `kind` is an own property. An own accessor that throws (or a Proxy with a trapping getter) would throw before the helper reached either the cancel kill path or the background promote path. Abort handlers are dispatched async and not awaited by AbortSignal, so a leaked throw here would have left the shell process alive instead of being killed on cancel — quietly. Wrapped the property read in try/catch with a fall-back to the safe 'cancel' kill behavior. - **Real bug: child_process post-exit race in background-promote** (`performBackgroundPromote`). The child may have already exited but the 'exit' event hasn't reached our handler yet (Node delivers events on the next microtask). Promoting in that window would detach our exit listener and report `promoted: true` for a process that's already dead — the caller would hold an inert pid expecting to take over. Now we read `child.exitCode` / `child.signalCode` before detaching: if either is non-null, fall through and let the pending exit handler resolve normally with the real exit info. Mirrored mock setup so `exitCode` / `signalCode` default to `null` (matching real ChildProcess) instead of `undefined`. - **PTY snapshot: re-decode + replay (mirror exit-path encoding).** The promoted snapshot was serializing `headlessTerminal` directly, which was fed by a streaming decoder initialized from the first-chunk encoding heuristic. When early output is ASCII-only but later output is in a different encoding (GBK / Shift-JIS / etc.), this produces mojibake — and the normal exit path doesn't, because it re-decodes `finalBuffer` with `getCachedEncodingForBuffer` and replays through a fresh terminal. Now mirrors that logic so `result.output` shape matches across the two paths. Direct-serialize remains as a last-ditch fallback if replay throws. - **Switch `default` no longer emits a runtime warn.** Reviewer noted the helper's whitelist made the `default: { _exhaustive: never }` branch unreachable at runtime — the `debugLogger.warn` in it could never fire. Kept the `_: never = kind` type assertion (so a future ShellAbortReason variant forces a TS error here, directing the developer to extend BOTH the helper's whitelist AND add a `case`), removed the unreachable warn. Added a comment that the assertion is the static-only safety net the union expansion would trigger. - **Direct unit tests for `getShellAbortReasonKind`** (8 cases). The helper's prototype-pollution defense is the main reason it exists; if `hasOwnProperty` is accidentally removed the regression would silently send `abortController.abort({})` (any plain reason) into the promote path. Exported the helper and added direct tests for: null / undefined, non-object, empty object (no own kind), prototype- only kind (pollution), unknown kind value, throwing accessor, Proxy trap, and the two happy paths. - **`removeListener` regression guard.** The fix to call `ptyProcess.removeListener('error', ...)` instead of `.off(...)` matters because `@lydell/node-pty`'s IPty interface only exposes `removeListener` — `.off()` throws TypeError on a real PTY but the EventEmitter mock tolerates both. Added a test that spies on both methods and asserts the production code uses `removeListener` for the 'error' event, so a future swap back to `.off()` regresses loudly under the mock instead of silently. 68 / 68 tests pass (58 baseline + 9 helper boundary + 1 removeListener guard + 1 post-exit race); tsc clean; ESLint clean. * fix(core): PTY background-promote post-exit race guard (resolve 5th review pass) Mirrors the child_process post-exit race fix from 4cc558b into the PTY path — addresses 1 [Critical] thread on PR-1 of #3831: The PTY may have already exited but our `exitDisposable` (onExit callback) hasn't run yet — node-pty delivers the exit event asynchronously after the PTY's native SIGCHLD, so there's a window between "PTY actually dead" and "service onExit fires". Promoting in that window detaches our exit listener and reports `promoted: true` for a dead PTY, losing the real exit status; the caller would hold an inert pid expecting to take over. The IPty interface doesn't expose an `exitCode` field we can read directly (unlike `child.exitCode` / `child.signalCode` for child_process), so use `process.kill(pid, 0)` as a best-effort liveness check via the existing `ShellExecutionService.isPtyActive` helper. If kill(pid, 0) throws ESRCH, the pid is gone — log at debug level and fall through, letting the pending onExit callback resolve normally with the real exit info. Also adds a unit test mirroring the child_process race test: mocks `process.kill(pid, 0)` to throw ESRCH on the liveness probe, asserts the result has no `promoted: true` and reports the real exitCode. 69 / 69 tests pass; tsc clean; ESLint clean. * docs(core): correct getShellAbortReasonKind boundary-test count in JSDoc Doc said 'all six edge cases' but the test suite has 8 cases (added Proxy-trap and undefined later). Off-by-2 cosmetic only — no behavior change. Caught during a multi-round self-audit of PR-1 of #3831. Audit summary: 7 rounds (correctness / reverse / consistency / coverage / build / exception paths / style) found one false-positive (a sync- abort registration-order race I initially thought existed). Verified that Node's WHATWG AbortSignal does NOT auto-fire 'abort' listeners on already-aborted signals, so the race window cannot open. No code change needed for that scenario; this commit is just the JSDoc fix. 69 / 69 tests still pass; tsc + ESLint clean. * docs(core): document the helper / union / switch sync invariant explicitly Multi-round self-audit found that `getShellAbortReasonKind`'s value whitelist has no compile-time tie to the `ShellAbortReason` union: when the union grows, TypeScript's `_exhaustive: never` in each switch forces #3 (the case arm) to be added, but the helper's whitelist (#2) silently keeps degrading the new variant to 'cancel', and the new case arm is never reached at runtime. Reviewer #4 raised this on the second pass; the original commit chose to accept it (option B in that thread) but didn't leave a strong in-code signal for future contributors. Added an INVARIANT block inside the helper enumerating the three sites that must be kept in sync, so the next person extending `ShellAbortReason` sees the coupling at the place where they're most likely to forget it. No behavior change — comment-only. 69 / 69 tests still pass; tsc + ESLint clean. Audit summary (this round + prior round): 18 angles total over two sweeps and one reverse-attack pass. Found: - 0 real bugs - 1 false-positive race (sync-abort registration order — Node WHATWG AbortSignal does NOT auto-fire on already-aborted signals; investigated, reverted) - 1 cosmetic doc fix (boundary-test count off-by-2) - 1 cosmetic INVARIANT block (this commit) Areas reviewed without finding new issues: caller-side ShellExecutionResult shape compatibility (optional `promoted?` field, existing callers spread-untouched); `exited` flag lifecycle (monotonic, cleanup() idempotent); processingChain in-flight ownership (listenersDetached guards every onOutputEvent emit including the renderFn-rendered case via the same flag); race between exit event and abort handler (both microtasks, FIFO ordering gives correct outcome either way); Node version dependence (`AbortSignal.reason` is Node 17.2+, engines: >=20 covers it); test isolation (mockImplementationOnce + module-level mockProcessKill clears each beforeEach); `process.kill(pid, 0)` Windows liveness reliability (best-effort, acceptable for PR-1 plumbing); PID reuse race on the PTY liveness check (theoretically possible, microsecond window, unavoidable at the OS level — rejected in spec discussion); PR-2/PR-3 contract surface (caller MUST attach listeners before abort — documented; any future caller violating this is its own bug). * test(core): align mockChildProcess.exitCode/signalCode in second beforeEach The 'execution method selection' describe block has its own beforeEach (separate from 'child_process fallback') that builds mockChildProcess but does not set `exitCode` / `signalCode = null`. Real Node `ChildProcess.exitCode` / `signalCode` are `null` while the process is alive — and production now reads these in the background-promote race guard. The current tests in this block don't exercise the promote path, so they pass regardless, but any future promote-related test landing here would silently trip the guard (`undefined !== null` is true) and fall through to the normal-exit branch instead of promoting. Mirror the `child_process fallback` block's mock setup so the two beforeEach hooks produce equivalent ChildProcess shapes, eliminating a quiet foot-gun for future contributors. Comment-only / test-fixture change. 69 / 69 tests still pass; tsc clean. Found during a deeper third-round self-audit of PR-1 of #3831.
…-up to #3842) (#3886) * fix(core): wrap hasOwnProperty.call inside try + add post-abort PTY data assertion Two non-blocking review notes from @tanzhenxin's PR-1 approval (#3842, post-merge follow-up): - **Note 2 (real bug)**: `getShellAbortReasonKind` had `Object.prototype.hasOwnProperty.call(reason, 'kind')` outside the try/catch. `hasOwnProperty.call` triggers the `[[GetOwnProperty]]` Proxy trap (`getOwnPropertyDescriptor` handler). A Proxy whose `getOwnPropertyDescriptor` throws — separate from a throwing `get` trap, which the prior commit already covered — would propagate past the helper, leaving the abort handler's switch on `kind` to throw through `addEventListener` (which doesn't await async listener return values), so the shell process would stay alive instead of being killed on cancel. Moved the descriptor probe inside the same try block as the value read. My own multi-round audit covered six attack vectors but missed this one — only `get` trap throws were considered. The reviewer caught it on first pass. - **Note 3 (test parity)**: the PTY post-promotion handoff test asserted `dispose` was called but never re-invoked the data callback to verify the foreground `onOutputEvent` actually stops firing — the child_process equivalent has that assertion. Mirrored it: emit data AFTER abort by re-invoking the captured `dataCallback` reference, and assert `onOutputEventMock.mock.calls.length` does NOT increase past the moment of promote. Exercises the production `listenersDetached` guard inside the chain callback, which the bare dispose-was-called check didn't. Note 1 from the same review (the `aborted: true + promoted: true` shape forcing PR-2 callers to check `promoted` before `aborted`) is deliberately NOT addressed here — it's a contract simplification that affects PR-2's branching, so it belongs in PR-2 along with the caller-side decision on whether to flip `aborted` for promoted results. Added a TODO upstream in #3831 (PR-2 design) to track. 70 / 70 tests pass (69 baseline + 1 new helper boundary for the throwing-getOwnPropertyDescriptor case). tsc + ESLint clean. * test(core): fix tautological PTY post-promote assertion (audit follow-up) The PTY post-promotion handoff test added in the previous commit copied the child_process equivalent's pattern verbatim — sync \`expect(count).toBe(countAtPromote)\` immediately after dataCallback returns. That works for child_process because its \`handleOutput\` is fully synchronous (sniff → decoder → emit, all on the same call stack), so the count change happens BEFORE the assertion. PTY's \`handleOutput\` is async — \`processingChain.then(...)\` queues a microtask that does the sniff + write + render-then-emit work. The sync assertion captures both \`countAtPromote\` and the post-emit count BEFORE the chain microtask ever runs, so both reads return whatever happened before the assert (typically 0). The test would tautologically pass even if the production \`listenersDetached\` guard were removed — i.e., it didn't actually verify the guard. Restructured to: 1. Drive the PTY through \`simulateExecution\` so \`await handle.result\` forces all queued microtasks (including pre-promote chain items AND the abort handler's drain) to settle. 2. Capture \`eventCountAfterSettle\` once everything has stabilized. 3. Re-invoke the captured \`dataCallback\` with post-promote data, await two more macrotask boundaries to let the new chain item fully run. 4. Assert the count hasn't moved. If the production \`listenersDetached\` guard is removed, the post-promote chain item emits, count increases past \`eventCountAfterSettle\`, and this assertion fails. So the test actually exercises the guard now. Found in self-audit while reviewing my own follow-up commit. Caught because audit was paranoid about *whether the test verifies what it claims to verify*, not just whether it passes. 70 / 70 tests pass; tsc + ESLint clean. * test(core): pin eventCountAfterSettle === 0 in PTY post-promote test The previous fix asserted post-promote count equals the \`eventCountAfterSettle\` baseline, but didn't pin the baseline itself. With the production \`listenersDetached\` guard intact, both halves (pre-promote chain and post-promote chain) suppress emit, so \`eventCountAfterSettle === 0 === post-count\` and the relative comparison is vacuously true. If a future refactor changed the production guard semantics so the pre-promote chain item DID emit (count becomes 1+ after settle), the relative-comparison test would still pass as long as post-promote also emitted the same number — that's a regression the test should catch. Adding \`expect(eventCountAfterSettle).toBe(0)\` makes the contract explicit: once \`listenersDetached\` is set during the abort handler's sync part, BOTH the in-flight chain item (pre-promote) and the future chain item (post-promote) skip emit. Found in another self-audit pass — even after fixing the tautological assertion, the test could still mask certain future regressions. 70 / 70 tests pass; tsc + ESLint clean. * test(core): drop dataCallbackHolder pattern, read mock.calls directly The dataCallbackHolder pattern (capturing the onData callback inside simulateExecution then invoking it after via a closure-shared object) was unnecessary indirection — \`mockPtyProcess.onData.mock.calls[0][0]\` reads the same callback reference whether you read it inside or outside the simulation closure. Vi's mock.calls array is per-mock-instance and beforeEach re-creates mockPtyProcess + .onData freshly, so there's no stale-reference risk in the simpler form. No behavior change in what's tested. 70 / 70 pass. * chore(core): drop in-source @-mention attribution + dead Proxy.get handler Two cosmetic cleanups found in another self-audit pass: - **Helper comment**: removed the parenthetical attribution ("Caught by @tanzhenxin in the PR-1 review; my own audit only covered \`get\` trap throws."). The technical content of the comment — explaining why both the descriptor probe and the value read live inside the try — stands on its own. The reviewer credit lives in commit history / PR description, where it belongs; an in-source @-mention ages poorly (handles change, the relevant person may move on) and doesn't help future readers reason about the code. - **Test Proxy**: \`throwingDescriptorProxy\` declared a \`get()\` handler that always returned \`undefined\`. The descriptor probe throws before the helper ever reaches the value read, so the \`get\` handler is unreachable — dropped it and added a one-line comment explaining why no \`get\` handler is needed for this test. Mirror test (\`throwingReason\` with throwing accessor + \`proxyReason\` with throwing \`get\`) keeps the symmetric "throwing-`get`" coverage. 70 / 70 tests pass; tsc + ESLint clean.
…f 3) (#3894) * feat(core): foreground → background promote integration (#3831 PR-2 of 3) Builds on the \`signal.reason\` foundation merged in #3842 / #3886. Wires the foreground \`shell\` tool to detect a background-promote abort, snapshot the captured output to a \`bg_xxx.output\` file, register a \`BackgroundShellEntry\` in the existing \`BackgroundShellRegistry\`, and return a model-facing \`ToolResult\` pointing at \`/tasks\` / the dialog / \`task_stop\`. Also resolves design question 7 from #3831 (raised by @tanzhenxin in the PR-1 review): set \`result.aborted: false\` when \`result.promoted: true\` so existing \`if (result.aborted)\` consumer branches fall through naturally. ## Changes **\`shellExecutionService.ts\`** — both \`executeWithPty\` and \`childProcessFallback\` background-promote branches now resolve with \`aborted: false, promoted: true\` (was \`aborted: true\`). The flag now answers "should the caller emit a cancel/timeout message?" rather than "did the abort signal fire?" — and a promoted shell is neither cancelled nor timed out (the child is still running, ownership simply transferred). \`ShellExecutionResult.promoted?\` JSDoc updated to document this contract. **\`shell.ts\`** — \`ShellToolInvocation.execute()\` gains a 5th optional parameter \`setPromoteAbortControllerCallback?: (ac: AbortController) => void\`. The foreground path now creates an internal \`promoteAbortController\` and combines its signal into the existing \`signal + timeoutSignal\` AbortSignal.any() chain. Right after \`setPidCallback\` fires, \`setPromoteAbortControllerCallback\` exposes the controller to the scheduler so a UI surface (PR-3 Ctrl+B keybind) can find it by callId and trigger \`abort({ kind: 'background', shellId })\`. When \`result.promoted\` is observed after \`await resultPromise\`, a new \`handlePromotedForeground\` private method: 1. Generates \`bg_xxx\` shellId + on-disk \`outputPath\` under the same project temp dir \`executeBackground\` uses. 2. Writes \`result.output\` (the snapshot the service flushed at promote time) as the file's initial content (best-effort — ENOSPC / EACCES logged + swallowed; the registry entry is valuable on its own). 3. Constructs a \`BackgroundShellEntry\` with the running pid + the same \`promoteAbortController\` already wired into the live child — \`task_stop bg_xxx\` and the dialog's \`x\` key both abort via \`entry.abortController\` and will land on the still-running process. 4. Returns a model-facing \`ToolResult\` pointing at \`/tasks\` / the Background tasks dialog / \`task_stop\` for follow-up. **\`coreToolScheduler.ts\`** — \`TrackedExecutingToolCall\` gains an optional \`promoteAbortController?: AbortController\` field, populated when the shell tool's \`setPromoteAbortControllerCallback\` fires. The scheduler routes only the shell-tool branch to pass this callback, matching the existing \`setPidCallback\` pattern. ## Limitations (deferred to PR-2.5) Two follow-up items intentionally NOT in scope here. Scope discipline keeps PR-2 reviewable while still delivering the user-facing promote flow end-to-end (PR-3's Ctrl+B keybind can wire to this PR's \`promoteAbortController\` to ship a working feature). - **Post-promote stream redirect**: today the \`outputPath\` content is FROZEN at the promote moment. The service detached its data listener as part of PR-1's ownership-transfer contract, so post-promote bytes from the still-running child don't reach the file. \`Read\`-ing the output via \`/tasks\` shows what was captured before promote, not live updates. PR-2.5 will add caller-side \`onPostPromoteData\` callback (or equivalent) so post-promote bytes stream to the file like a normal background shell. - **Natural-exit registry settle**: the registry entry stays \`'running'\` until \`task_stop bg_xxx\` or session-end \`abortAll\` clears it. The service's exit listener was disposed at promote, so there's no observation point for natural child exit. PR-2.5 will keep the exit listener attached post-promote (with a separate \`onPostPromoteSettle\` callback) so the entry transitions to \`completed\` / \`failed\` like a normal background shell. These limitations are visible to users (output frozen, entry stays running until task_stop/session end) but don't break the core promote contract: the agent unblocks, the registry entry is observable, the process stays alive, cancel via \`task_stop\` works. ## Tests **\`shellExecutionService.test.ts\`** — two existing promote tests now assert \`aborted: false\` (per design question 7) instead of \`true\`. \`70 / 70 pass\`. **\`shell.test.ts\`** — three new tests in a \`foreground → background promote (#3831 PR-2)\` describe block: 1. \`setPromoteAbortControllerCallback\` exposes a real \`AbortController\` after spawn. 2. On \`result.promoted: true\`, the registry receives a \`bg_xxx\` entry with pid + abortController + outputPath, the snapshot is written via \`fs.writeFileSync\`, and the model-facing copy references \`/tasks\` + \`task_stop\` + the dialog. 3. A snapshot-write failure (mocked ENOSPC) doesn't break promote — the registry entry still gets registered with the running pid. \`96 / 96 pass\`. **\`coreToolScheduler.test.ts\`** — \`98 / 98 pass\` (no new tests; the new \`promoteAbortController\` field is exercised end-to-end via shell.test.ts). Total: \`264 / 264 affected tests pass\`; tsc + ESLint clean. ## Related - #3831 (Phase D part b — design + 3-PR sequencing; question 7 resolved here) - #3842 (PR-1 — \`signal.reason\` foundation) - #3886 (PR-1 follow-up — Proxy-trap fix + handoff test parity) - #3634 (Background task management roadmap) cc @tanzhenxin * fix(core): give promoted shell entry a FRESH AbortController so task_stop kills the child Real bug found in self-audit of #3894 PR-2: \`entry.abortController\` was being set to the same \`promoteAbortController\` that triggered the promote — which is **already aborted** by the time we reach \`handlePromotedForeground\`. Two consequences: 1. \`task_stop bg_xxx\` calls \`entry.abortController.abort()\`. On an already-aborted controller this is a no-op (the abort event was dispatched once when the controller fired; the second \`abort()\` doesn't re-fire listeners per WHATWG spec). 2. \`ShellExecutionService\` has already detached its own abort listener as part of the PR-1 ownership-transfer contract, so even if the abort COULD re-fire, there's nobody left listening to translate the signal into an actual SIGTERM/SIGKILL on the still- running child. Net effect: a promoted shell would survive \`task_stop\` forever — the agent would think it cancelled, the registry entry would stay \`'running'\`, and the OS process would keep running until the user killed the CLI session. Fix: \`handlePromotedForeground\` now creates a fresh \`AbortController\` for the registry entry and wires its abort listener to: 1. Send SIGTERM → SIGKILL to the still-running child via \`process.kill(-pid, …)\` (Linux/Mac process group, mirroring the \`detached: !isWindows\` spawn the foreground path uses) or \`taskkill /pid /f /t\` (Windows). Reuses the same SIGTERM-then- timeout-then-SIGKILL pattern \`ShellExecutionService.execute()\` uses on the non-promote cancel path; new constant \`PROMOTE_CANCEL_SIGKILL_TIMEOUT_MS = 200ms\` (intentionally separate from the service's \`SIGKILL_TIMEOUT_MS\` so tuning one doesn't silently change the other). 2. Sync-mark the registry entry \`cancelled\` via \`registry.cancel()\` so \`/tasks\` and the dialog reflect the user intent immediately. Added a regression test pinning \`entry.abortController.signal.aborted === false\` at registration time. Without the fix, this asserts \`true\` and the test fails — which is the visible canary for the silent-task_stop-failure mode. 97 / 97 shell.test.ts pass; tsc + ESLint clean. * fix(core): add 'error' listener on Windows taskkill spawn (audit follow-up) Reverse-audit found a Windows-specific crash mode: \`cpSpawn('taskkill', …)\` returns a \`ChildProcess\` whose 'error' event (emitted when the spawn fails — taskkill binary missing, EACCES, etc.) crashes Node by default if no 'error' listener is attached. Same pattern as PR-1's \`@lydell/node-pty\` IPty incident — Web/Node spec quirk easy to miss without specifically thinking about Windows + spawn-failure. Also wrapped the \`cpSpawn\` call itself in try/catch for the rarer sync-throw mode (EMFILE / ENOMEM at spawn-time). Recovery in both cases: log via debugLogger.warn + continue; \`registry.cancel\` below still transitions the entry, and the still-running child becomes an orphan that Windows reaps when the CLI session ends. 97 / 97 shell.test.ts pass; tsc + ESLint clean. * test(core): close 3 test gaps from #3894 review Three [Suggestion] threads from the @tanzhenxin-style review on PR-2, all real test gaps that would have let silent regressions through: 1. **\`setPromoteAbortControllerCallback\` test was too weak.** The old test only asserted that the callback received an \`AbortController\` instance, not that the controller's signal was actually wired into the \`AbortSignal.any(...)\` chain handed to ShellExecutionService. If \`shell.ts\` exposed the controller but forgot to combine its signal, Ctrl+B promotion would never reach the service while the bare-instance test still passed. Strengthened: capture the AbortSignal handed to ShellExecutionService.execute (4th arg), abort the promote controller, and assert the captured signal goes from \`aborted: false\` → \`true\`. 2. **The post-promote cancellation kill path was unverified.** The prior commit added a real-bug fix (fresh \`entryAc\` + abort listener that sends SIGTERM/SIGKILL + sync-marks the registry entry cancelled) but the only test it had was "the controller is fresh, signal not aborted". Reviewer rightly noted that this is the **core operational guarantee** for promoted shells — \`task_stop bg_xxx\` must actually stop the child. Added a test that uses fake timers + a \`process.kill\` spy: register a promoted entry, abort \`entry.abortController\`, flush microtasks (SIGTERM dispatch), advance fake time past \`PROMOTE_CANCEL_SIGKILL_TIMEOUT_MS\` (SIGKILL dispatch + \`registry.cancel\` mark). Pins the entire kill chain. 3. **Scheduler-side wiring of \`promoteAbortController\` was untested.** PR-3's Ctrl+B keybind looks up the executing tool call by callId and aborts \`tc.promoteAbortController\` — if \`CoreToolScheduler\` stops populating that field, the keybind silently breaks. Added a test in \`coreToolScheduler.test.ts\` that uses a \`TestShellInvocation extends ShellToolInvocation\` (so the scheduler's \`instanceof ShellToolInvocation\` check still routes the call through the shell-specific branch that wires the callback) and asserts that an \`onToolCallsUpdate\` batch emitted during the executing window contains a tool call where \`tc.promoteAbortController\` matches the controller the test exposed. 98 / 98 shell.test.ts pass; 99 / 99 coreToolScheduler.test.ts pass; tsc + ESLint clean. * fix(core): use commandToExecute in promoted entry + try/catch register Resolves 3 #3894 review threads: - **Critical**: `entry.command` and `llmContent` for the promoted foreground shell now use `commandToExecute` (post-co-author-rewrite form) instead of raw `this.params.command`. For `git commit -m` invocations that `addCoAuthorToGitCommit()` rewrote, the registry entry now mirrors what actually ran — matching `executeBackground`'s long-standing convention (line 1234). - Defensive try/catch around `registry.register(entry)`: today the call is internally safe (Map.set + emit), but a future implementation that throws would leave a zombie child detached from service listeners with no kill path. Catch path logs, fires `entryAc.abort()` for best-effort kill via the wired listener, and re-throws so the scheduler surfaces the failure. - Updates the misleading comment (line 748) that claimed the registry entry uses "the same `promoteAbortController`" — actual impl uses a fresh `entryAc` (the audit-fix from the previous push). Tests: - `entry.command` git-commit case pinning post-rewrite form - register-throw rejection + SIGTERM/SIGKILL kill via fake timers - 100/100 shell.test.ts pass; tsc + ESLint clean * fix(core): close 2 #3894 review findings — promote refused-race + mkdir orphan Resolves @tanzhenxin's CHANGES_REQUESTED review on #3894. 1. **Refused-promote race no longer reported as "Command timed out"** The combined-abort signal folds in `signal | timeoutSignal | promoteAbortController.signal`, but the timeout discriminator only excluded the user-cancel signal — not the promote signal. When the user fires Ctrl+B (PR-3's keybind) but the service's race guard refuses promotion (the child terminated a beat earlier), the result lands `aborted: true, promoted: false` and the foreground path falsely reported `Command timed out after 120000ms`. Both the agent and the user would see a timeout that didn't happen. Fix: extend the discriminator to ALSO exclude `promoteAbortController.signal.aborted`. Add a `wasPromoteRefused` branch that surfaces the actual cause: "Command finished before the background-promote request could be honoured (the child had already exited)." Same fix applied to both the llmContent path and the returnDisplay path so the model and the visible UI agree. Latent in PR-2 itself (no in-tree caller fires the promote yet), but PR-3's keybind would expose it on first ship. 2. **Unguarded mkdirSync orphans the promoted child** After `result.promoted: true`, ownership of the still-running child has transferred and the service's kill path is detached. The promote handler creates the snapshot output directory next, but the original `fs.mkdirSync(outputDir, { recursive: true })` had no guard — read- only temp mounts, sandboxed perms, full disk on inode/metadata exhaustion would reject the handler BEFORE the registry's kill listener was wired. The still-running child became an orphan zombie with no kill path until the OS reaped it on session end. Fix: wrap mkdirSync in try/catch (matches the safety pattern around `registry.register`). On failure, log + best-effort kill the child (SIGTERM via process.kill(-pid) on POSIX, taskkill /f /t on Windows with an `error` listener so a spawn failure doesn't crash Node) + re-throw so the scheduler surfaces the failure to the agent. Tests: 2 new regressions in `shell.test.ts`: - `mkdirSync(outputDir) throws → child gets SIGTERM, error re-raised` - `promote-refused race (aborted: true, promoted: false after promote signal) is NOT reported as "Command timed out"` 171/171 shell.test.ts pass; tsc + ESLint clean.
…PR-2 of 3) (QwenLM#3894) * feat(core): foreground → background promote integration (QwenLM#3831 PR-2 of 3) Builds on the \`signal.reason\` foundation merged in QwenLM#3842 / QwenLM#3886. Wires the foreground \`shell\` tool to detect a background-promote abort, snapshot the captured output to a \`bg_xxx.output\` file, register a \`BackgroundShellEntry\` in the existing \`BackgroundShellRegistry\`, and return a model-facing \`ToolResult\` pointing at \`/tasks\` / the dialog / \`task_stop\`. Also resolves design question 7 from QwenLM#3831 (raised by @tanzhenxin in the PR-1 review): set \`result.aborted: false\` when \`result.promoted: true\` so existing \`if (result.aborted)\` consumer branches fall through naturally. **\`shellExecutionService.ts\`** — both \`executeWithPty\` and \`childProcessFallback\` background-promote branches now resolve with \`aborted: false, promoted: true\` (was \`aborted: true\`). The flag now answers "should the caller emit a cancel/timeout message?" rather than "did the abort signal fire?" — and a promoted shell is neither cancelled nor timed out (the child is still running, ownership simply transferred). \`ShellExecutionResult.promoted?\` JSDoc updated to document this contract. **\`shell.ts\`** — \`ShellToolInvocation.execute()\` gains a 5th optional parameter \`setPromoteAbortControllerCallback?: (ac: AbortController) => void\`. The foreground path now creates an internal \`promoteAbortController\` and combines its signal into the existing \`signal + timeoutSignal\` AbortSignal.any() chain. Right after \`setPidCallback\` fires, \`setPromoteAbortControllerCallback\` exposes the controller to the scheduler so a UI surface (PR-3 Ctrl+B keybind) can find it by callId and trigger \`abort({ kind: 'background', shellId })\`. When \`result.promoted\` is observed after \`await resultPromise\`, a new \`handlePromotedForeground\` private method: 1. Generates \`bg_xxx\` shellId + on-disk \`outputPath\` under the same project temp dir \`executeBackground\` uses. 2. Writes \`result.output\` (the snapshot the service flushed at promote time) as the file's initial content (best-effort — ENOSPC / EACCES logged + swallowed; the registry entry is valuable on its own). 3. Constructs a \`BackgroundShellEntry\` with the running pid + the same \`promoteAbortController\` already wired into the live child — \`task_stop bg_xxx\` and the dialog's \`x\` key both abort via \`entry.abortController\` and will land on the still-running process. 4. Returns a model-facing \`ToolResult\` pointing at \`/tasks\` / the Background tasks dialog / \`task_stop\` for follow-up. **\`coreToolScheduler.ts\`** — \`TrackedExecutingToolCall\` gains an optional \`promoteAbortController?: AbortController\` field, populated when the shell tool's \`setPromoteAbortControllerCallback\` fires. The scheduler routes only the shell-tool branch to pass this callback, matching the existing \`setPidCallback\` pattern. Two follow-up items intentionally NOT in scope here. Scope discipline keeps PR-2 reviewable while still delivering the user-facing promote flow end-to-end (PR-3's Ctrl+B keybind can wire to this PR's \`promoteAbortController\` to ship a working feature). - **Post-promote stream redirect**: today the \`outputPath\` content is FROZEN at the promote moment. The service detached its data listener as part of PR-1's ownership-transfer contract, so post-promote bytes from the still-running child don't reach the file. \`Read\`-ing the output via \`/tasks\` shows what was captured before promote, not live updates. PR-2.5 will add caller-side \`onPostPromoteData\` callback (or equivalent) so post-promote bytes stream to the file like a normal background shell. - **Natural-exit registry settle**: the registry entry stays \`'running'\` until \`task_stop bg_xxx\` or session-end \`abortAll\` clears it. The service's exit listener was disposed at promote, so there's no observation point for natural child exit. PR-2.5 will keep the exit listener attached post-promote (with a separate \`onPostPromoteSettle\` callback) so the entry transitions to \`completed\` / \`failed\` like a normal background shell. These limitations are visible to users (output frozen, entry stays running until task_stop/session end) but don't break the core promote contract: the agent unblocks, the registry entry is observable, the process stays alive, cancel via \`task_stop\` works. **\`shellExecutionService.test.ts\`** — two existing promote tests now assert \`aborted: false\` (per design question 7) instead of \`true\`. \`70 / 70 pass\`. **\`shell.test.ts\`** — three new tests in a \`foreground → background promote (QwenLM#3831 PR-2)\` describe block: 1. \`setPromoteAbortControllerCallback\` exposes a real \`AbortController\` after spawn. 2. On \`result.promoted: true\`, the registry receives a \`bg_xxx\` entry with pid + abortController + outputPath, the snapshot is written via \`fs.writeFileSync\`, and the model-facing copy references \`/tasks\` + \`task_stop\` + the dialog. 3. A snapshot-write failure (mocked ENOSPC) doesn't break promote — the registry entry still gets registered with the running pid. \`96 / 96 pass\`. **\`coreToolScheduler.test.ts\`** — \`98 / 98 pass\` (no new tests; the new \`promoteAbortController\` field is exercised end-to-end via shell.test.ts). Total: \`264 / 264 affected tests pass\`; tsc + ESLint clean. - QwenLM#3831 (Phase D part b — design + 3-PR sequencing; question 7 resolved here) - QwenLM#3842 (PR-1 — \`signal.reason\` foundation) - QwenLM#3886 (PR-1 follow-up — Proxy-trap fix + handoff test parity) - QwenLM#3634 (Background task management roadmap) cc @tanzhenxin * fix(core): give promoted shell entry a FRESH AbortController so task_stop kills the child Real bug found in self-audit of QwenLM#3894 PR-2: \`entry.abortController\` was being set to the same \`promoteAbortController\` that triggered the promote — which is **already aborted** by the time we reach \`handlePromotedForeground\`. Two consequences: 1. \`task_stop bg_xxx\` calls \`entry.abortController.abort()\`. On an already-aborted controller this is a no-op (the abort event was dispatched once when the controller fired; the second \`abort()\` doesn't re-fire listeners per WHATWG spec). 2. \`ShellExecutionService\` has already detached its own abort listener as part of the PR-1 ownership-transfer contract, so even if the abort COULD re-fire, there's nobody left listening to translate the signal into an actual SIGTERM/SIGKILL on the still- running child. Net effect: a promoted shell would survive \`task_stop\` forever — the agent would think it cancelled, the registry entry would stay \`'running'\`, and the OS process would keep running until the user killed the CLI session. Fix: \`handlePromotedForeground\` now creates a fresh \`AbortController\` for the registry entry and wires its abort listener to: 1. Send SIGTERM → SIGKILL to the still-running child via \`process.kill(-pid, …)\` (Linux/Mac process group, mirroring the \`detached: !isWindows\` spawn the foreground path uses) or \`taskkill /pid /f /t\` (Windows). Reuses the same SIGTERM-then- timeout-then-SIGKILL pattern \`ShellExecutionService.execute()\` uses on the non-promote cancel path; new constant \`PROMOTE_CANCEL_SIGKILL_TIMEOUT_MS = 200ms\` (intentionally separate from the service's \`SIGKILL_TIMEOUT_MS\` so tuning one doesn't silently change the other). 2. Sync-mark the registry entry \`cancelled\` via \`registry.cancel()\` so \`/tasks\` and the dialog reflect the user intent immediately. Added a regression test pinning \`entry.abortController.signal.aborted === false\` at registration time. Without the fix, this asserts \`true\` and the test fails — which is the visible canary for the silent-task_stop-failure mode. 97 / 97 shell.test.ts pass; tsc + ESLint clean. * fix(core): add 'error' listener on Windows taskkill spawn (audit follow-up) Reverse-audit found a Windows-specific crash mode: \`cpSpawn('taskkill', …)\` returns a \`ChildProcess\` whose 'error' event (emitted when the spawn fails — taskkill binary missing, EACCES, etc.) crashes Node by default if no 'error' listener is attached. Same pattern as PR-1's \`@lydell/node-pty\` IPty incident — Web/Node spec quirk easy to miss without specifically thinking about Windows + spawn-failure. Also wrapped the \`cpSpawn\` call itself in try/catch for the rarer sync-throw mode (EMFILE / ENOMEM at spawn-time). Recovery in both cases: log via debugLogger.warn + continue; \`registry.cancel\` below still transitions the entry, and the still-running child becomes an orphan that Windows reaps when the CLI session ends. 97 / 97 shell.test.ts pass; tsc + ESLint clean. * test(core): close 3 test gaps from QwenLM#3894 review Three [Suggestion] threads from the @tanzhenxin-style review on PR-2, all real test gaps that would have let silent regressions through: 1. **\`setPromoteAbortControllerCallback\` test was too weak.** The old test only asserted that the callback received an \`AbortController\` instance, not that the controller's signal was actually wired into the \`AbortSignal.any(...)\` chain handed to ShellExecutionService. If \`shell.ts\` exposed the controller but forgot to combine its signal, Ctrl+B promotion would never reach the service while the bare-instance test still passed. Strengthened: capture the AbortSignal handed to ShellExecutionService.execute (4th arg), abort the promote controller, and assert the captured signal goes from \`aborted: false\` → \`true\`. 2. **The post-promote cancellation kill path was unverified.** The prior commit added a real-bug fix (fresh \`entryAc\` + abort listener that sends SIGTERM/SIGKILL + sync-marks the registry entry cancelled) but the only test it had was "the controller is fresh, signal not aborted". Reviewer rightly noted that this is the **core operational guarantee** for promoted shells — \`task_stop bg_xxx\` must actually stop the child. Added a test that uses fake timers + a \`process.kill\` spy: register a promoted entry, abort \`entry.abortController\`, flush microtasks (SIGTERM dispatch), advance fake time past \`PROMOTE_CANCEL_SIGKILL_TIMEOUT_MS\` (SIGKILL dispatch + \`registry.cancel\` mark). Pins the entire kill chain. 3. **Scheduler-side wiring of \`promoteAbortController\` was untested.** PR-3's Ctrl+B keybind looks up the executing tool call by callId and aborts \`tc.promoteAbortController\` — if \`CoreToolScheduler\` stops populating that field, the keybind silently breaks. Added a test in \`coreToolScheduler.test.ts\` that uses a \`TestShellInvocation extends ShellToolInvocation\` (so the scheduler's \`instanceof ShellToolInvocation\` check still routes the call through the shell-specific branch that wires the callback) and asserts that an \`onToolCallsUpdate\` batch emitted during the executing window contains a tool call where \`tc.promoteAbortController\` matches the controller the test exposed. 98 / 98 shell.test.ts pass; 99 / 99 coreToolScheduler.test.ts pass; tsc + ESLint clean. * fix(core): use commandToExecute in promoted entry + try/catch register Resolves 3 QwenLM#3894 review threads: - **Critical**: `entry.command` and `llmContent` for the promoted foreground shell now use `commandToExecute` (post-co-author-rewrite form) instead of raw `this.params.command`. For `git commit -m` invocations that `addCoAuthorToGitCommit()` rewrote, the registry entry now mirrors what actually ran — matching `executeBackground`'s long-standing convention (line 1234). - Defensive try/catch around `registry.register(entry)`: today the call is internally safe (Map.set + emit), but a future implementation that throws would leave a zombie child detached from service listeners with no kill path. Catch path logs, fires `entryAc.abort()` for best-effort kill via the wired listener, and re-throws so the scheduler surfaces the failure. - Updates the misleading comment (line 748) that claimed the registry entry uses "the same `promoteAbortController`" — actual impl uses a fresh `entryAc` (the audit-fix from the previous push). Tests: - `entry.command` git-commit case pinning post-rewrite form - register-throw rejection + SIGTERM/SIGKILL kill via fake timers - 100/100 shell.test.ts pass; tsc + ESLint clean * fix(core): close 2 QwenLM#3894 review findings — promote refused-race + mkdir orphan Resolves @tanzhenxin's CHANGES_REQUESTED review on QwenLM#3894. 1. **Refused-promote race no longer reported as "Command timed out"** The combined-abort signal folds in `signal | timeoutSignal | promoteAbortController.signal`, but the timeout discriminator only excluded the user-cancel signal — not the promote signal. When the user fires Ctrl+B (PR-3's keybind) but the service's race guard refuses promotion (the child terminated a beat earlier), the result lands `aborted: true, promoted: false` and the foreground path falsely reported `Command timed out after 120000ms`. Both the agent and the user would see a timeout that didn't happen. Fix: extend the discriminator to ALSO exclude `promoteAbortController.signal.aborted`. Add a `wasPromoteRefused` branch that surfaces the actual cause: "Command finished before the background-promote request could be honoured (the child had already exited)." Same fix applied to both the llmContent path and the returnDisplay path so the model and the visible UI agree. Latent in PR-2 itself (no in-tree caller fires the promote yet), but PR-3's keybind would expose it on first ship. 2. **Unguarded mkdirSync orphans the promoted child** After `result.promoted: true`, ownership of the still-running child has transferred and the service's kill path is detached. The promote handler creates the snapshot output directory next, but the original `fs.mkdirSync(outputDir, { recursive: true })` had no guard — read- only temp mounts, sandboxed perms, full disk on inode/metadata exhaustion would reject the handler BEFORE the registry's kill listener was wired. The still-running child became an orphan zombie with no kill path until the OS reaped it on session end. Fix: wrap mkdirSync in try/catch (matches the safety pattern around `registry.register`). On failure, log + best-effort kill the child (SIGTERM via process.kill(-pid) on POSIX, taskkill /f /t on Windows with an `error` listener so a spawn failure doesn't crash Node) + re-throw so the scheduler surfaces the failure to the agent. Tests: 2 new regressions in `shell.test.ts`: - `mkdirSync(outputDir) throws → child gets SIGTERM, error re-raised` - `promote-refused race (aborted: true, promoted: false after promote signal) is NOT reported as "Command timed out"` 171/171 shell.test.ts pass; tsc + ESLint clean.
…PR-2 of 3) (QwenLM#3894) * feat(core): foreground → background promote integration (QwenLM#3831 PR-2 of 3) Builds on the \`signal.reason\` foundation merged in QwenLM#3842 / QwenLM#3886. Wires the foreground \`shell\` tool to detect a background-promote abort, snapshot the captured output to a \`bg_xxx.output\` file, register a \`BackgroundShellEntry\` in the existing \`BackgroundShellRegistry\`, and return a model-facing \`ToolResult\` pointing at \`/tasks\` / the dialog / \`task_stop\`. Also resolves design question 7 from QwenLM#3831 (raised by @tanzhenxin in the PR-1 review): set \`result.aborted: false\` when \`result.promoted: true\` so existing \`if (result.aborted)\` consumer branches fall through naturally. ## Changes **\`shellExecutionService.ts\`** — both \`executeWithPty\` and \`childProcessFallback\` background-promote branches now resolve with \`aborted: false, promoted: true\` (was \`aborted: true\`). The flag now answers "should the caller emit a cancel/timeout message?" rather than "did the abort signal fire?" — and a promoted shell is neither cancelled nor timed out (the child is still running, ownership simply transferred). \`ShellExecutionResult.promoted?\` JSDoc updated to document this contract. **\`shell.ts\`** — \`ShellToolInvocation.execute()\` gains a 5th optional parameter \`setPromoteAbortControllerCallback?: (ac: AbortController) => void\`. The foreground path now creates an internal \`promoteAbortController\` and combines its signal into the existing \`signal + timeoutSignal\` AbortSignal.any() chain. Right after \`setPidCallback\` fires, \`setPromoteAbortControllerCallback\` exposes the controller to the scheduler so a UI surface (PR-3 Ctrl+B keybind) can find it by callId and trigger \`abort({ kind: 'background', shellId })\`. When \`result.promoted\` is observed after \`await resultPromise\`, a new \`handlePromotedForeground\` private method: 1. Generates \`bg_xxx\` shellId + on-disk \`outputPath\` under the same project temp dir \`executeBackground\` uses. 2. Writes \`result.output\` (the snapshot the service flushed at promote time) as the file's initial content (best-effort — ENOSPC / EACCES logged + swallowed; the registry entry is valuable on its own). 3. Constructs a \`BackgroundShellEntry\` with the running pid + the same \`promoteAbortController\` already wired into the live child — \`task_stop bg_xxx\` and the dialog's \`x\` key both abort via \`entry.abortController\` and will land on the still-running process. 4. Returns a model-facing \`ToolResult\` pointing at \`/tasks\` / the Background tasks dialog / \`task_stop\` for follow-up. **\`coreToolScheduler.ts\`** — \`TrackedExecutingToolCall\` gains an optional \`promoteAbortController?: AbortController\` field, populated when the shell tool's \`setPromoteAbortControllerCallback\` fires. The scheduler routes only the shell-tool branch to pass this callback, matching the existing \`setPidCallback\` pattern. ## Limitations (deferred to PR-2.5) Two follow-up items intentionally NOT in scope here. Scope discipline keeps PR-2 reviewable while still delivering the user-facing promote flow end-to-end (PR-3's Ctrl+B keybind can wire to this PR's \`promoteAbortController\` to ship a working feature). - **Post-promote stream redirect**: today the \`outputPath\` content is FROZEN at the promote moment. The service detached its data listener as part of PR-1's ownership-transfer contract, so post-promote bytes from the still-running child don't reach the file. \`Read\`-ing the output via \`/tasks\` shows what was captured before promote, not live updates. PR-2.5 will add caller-side \`onPostPromoteData\` callback (or equivalent) so post-promote bytes stream to the file like a normal background shell. - **Natural-exit registry settle**: the registry entry stays \`'running'\` until \`task_stop bg_xxx\` or session-end \`abortAll\` clears it. The service's exit listener was disposed at promote, so there's no observation point for natural child exit. PR-2.5 will keep the exit listener attached post-promote (with a separate \`onPostPromoteSettle\` callback) so the entry transitions to \`completed\` / \`failed\` like a normal background shell. These limitations are visible to users (output frozen, entry stays running until task_stop/session end) but don't break the core promote contract: the agent unblocks, the registry entry is observable, the process stays alive, cancel via \`task_stop\` works. ## Tests **\`shellExecutionService.test.ts\`** — two existing promote tests now assert \`aborted: false\` (per design question 7) instead of \`true\`. \`70 / 70 pass\`. **\`shell.test.ts\`** — three new tests in a \`foreground → background promote (QwenLM#3831 PR-2)\` describe block: 1. \`setPromoteAbortControllerCallback\` exposes a real \`AbortController\` after spawn. 2. On \`result.promoted: true\`, the registry receives a \`bg_xxx\` entry with pid + abortController + outputPath, the snapshot is written via \`fs.writeFileSync\`, and the model-facing copy references \`/tasks\` + \`task_stop\` + the dialog. 3. A snapshot-write failure (mocked ENOSPC) doesn't break promote — the registry entry still gets registered with the running pid. \`96 / 96 pass\`. **\`coreToolScheduler.test.ts\`** — \`98 / 98 pass\` (no new tests; the new \`promoteAbortController\` field is exercised end-to-end via shell.test.ts). Total: \`264 / 264 affected tests pass\`; tsc + ESLint clean. ## Related - QwenLM#3831 (Phase D part b — design + 3-PR sequencing; question 7 resolved here) - QwenLM#3842 (PR-1 — \`signal.reason\` foundation) - QwenLM#3886 (PR-1 follow-up — Proxy-trap fix + handoff test parity) - QwenLM#3634 (Background task management roadmap) cc @tanzhenxin * fix(core): give promoted shell entry a FRESH AbortController so task_stop kills the child Real bug found in self-audit of QwenLM#3894 PR-2: \`entry.abortController\` was being set to the same \`promoteAbortController\` that triggered the promote — which is **already aborted** by the time we reach \`handlePromotedForeground\`. Two consequences: 1. \`task_stop bg_xxx\` calls \`entry.abortController.abort()\`. On an already-aborted controller this is a no-op (the abort event was dispatched once when the controller fired; the second \`abort()\` doesn't re-fire listeners per WHATWG spec). 2. \`ShellExecutionService\` has already detached its own abort listener as part of the PR-1 ownership-transfer contract, so even if the abort COULD re-fire, there's nobody left listening to translate the signal into an actual SIGTERM/SIGKILL on the still- running child. Net effect: a promoted shell would survive \`task_stop\` forever — the agent would think it cancelled, the registry entry would stay \`'running'\`, and the OS process would keep running until the user killed the CLI session. Fix: \`handlePromotedForeground\` now creates a fresh \`AbortController\` for the registry entry and wires its abort listener to: 1. Send SIGTERM → SIGKILL to the still-running child via \`process.kill(-pid, …)\` (Linux/Mac process group, mirroring the \`detached: !isWindows\` spawn the foreground path uses) or \`taskkill /pid /f /t\` (Windows). Reuses the same SIGTERM-then- timeout-then-SIGKILL pattern \`ShellExecutionService.execute()\` uses on the non-promote cancel path; new constant \`PROMOTE_CANCEL_SIGKILL_TIMEOUT_MS = 200ms\` (intentionally separate from the service's \`SIGKILL_TIMEOUT_MS\` so tuning one doesn't silently change the other). 2. Sync-mark the registry entry \`cancelled\` via \`registry.cancel()\` so \`/tasks\` and the dialog reflect the user intent immediately. Added a regression test pinning \`entry.abortController.signal.aborted === false\` at registration time. Without the fix, this asserts \`true\` and the test fails — which is the visible canary for the silent-task_stop-failure mode. 97 / 97 shell.test.ts pass; tsc + ESLint clean. * fix(core): add 'error' listener on Windows taskkill spawn (audit follow-up) Reverse-audit found a Windows-specific crash mode: \`cpSpawn('taskkill', …)\` returns a \`ChildProcess\` whose 'error' event (emitted when the spawn fails — taskkill binary missing, EACCES, etc.) crashes Node by default if no 'error' listener is attached. Same pattern as PR-1's \`@lydell/node-pty\` IPty incident — Web/Node spec quirk easy to miss without specifically thinking about Windows + spawn-failure. Also wrapped the \`cpSpawn\` call itself in try/catch for the rarer sync-throw mode (EMFILE / ENOMEM at spawn-time). Recovery in both cases: log via debugLogger.warn + continue; \`registry.cancel\` below still transitions the entry, and the still-running child becomes an orphan that Windows reaps when the CLI session ends. 97 / 97 shell.test.ts pass; tsc + ESLint clean. * test(core): close 3 test gaps from QwenLM#3894 review Three [Suggestion] threads from the @tanzhenxin-style review on PR-2, all real test gaps that would have let silent regressions through: 1. **\`setPromoteAbortControllerCallback\` test was too weak.** The old test only asserted that the callback received an \`AbortController\` instance, not that the controller's signal was actually wired into the \`AbortSignal.any(...)\` chain handed to ShellExecutionService. If \`shell.ts\` exposed the controller but forgot to combine its signal, Ctrl+B promotion would never reach the service while the bare-instance test still passed. Strengthened: capture the AbortSignal handed to ShellExecutionService.execute (4th arg), abort the promote controller, and assert the captured signal goes from \`aborted: false\` → \`true\`. 2. **The post-promote cancellation kill path was unverified.** The prior commit added a real-bug fix (fresh \`entryAc\` + abort listener that sends SIGTERM/SIGKILL + sync-marks the registry entry cancelled) but the only test it had was "the controller is fresh, signal not aborted". Reviewer rightly noted that this is the **core operational guarantee** for promoted shells — \`task_stop bg_xxx\` must actually stop the child. Added a test that uses fake timers + a \`process.kill\` spy: register a promoted entry, abort \`entry.abortController\`, flush microtasks (SIGTERM dispatch), advance fake time past \`PROMOTE_CANCEL_SIGKILL_TIMEOUT_MS\` (SIGKILL dispatch + \`registry.cancel\` mark). Pins the entire kill chain. 3. **Scheduler-side wiring of \`promoteAbortController\` was untested.** PR-3's Ctrl+B keybind looks up the executing tool call by callId and aborts \`tc.promoteAbortController\` — if \`CoreToolScheduler\` stops populating that field, the keybind silently breaks. Added a test in \`coreToolScheduler.test.ts\` that uses a \`TestShellInvocation extends ShellToolInvocation\` (so the scheduler's \`instanceof ShellToolInvocation\` check still routes the call through the shell-specific branch that wires the callback) and asserts that an \`onToolCallsUpdate\` batch emitted during the executing window contains a tool call where \`tc.promoteAbortController\` matches the controller the test exposed. 98 / 98 shell.test.ts pass; 99 / 99 coreToolScheduler.test.ts pass; tsc + ESLint clean. * fix(core): use commandToExecute in promoted entry + try/catch register Resolves 3 QwenLM#3894 review threads: - **Critical**: `entry.command` and `llmContent` for the promoted foreground shell now use `commandToExecute` (post-co-author-rewrite form) instead of raw `this.params.command`. For `git commit -m` invocations that `addCoAuthorToGitCommit()` rewrote, the registry entry now mirrors what actually ran — matching `executeBackground`'s long-standing convention (line 1234). - Defensive try/catch around `registry.register(entry)`: today the call is internally safe (Map.set + emit), but a future implementation that throws would leave a zombie child detached from service listeners with no kill path. Catch path logs, fires `entryAc.abort()` for best-effort kill via the wired listener, and re-throws so the scheduler surfaces the failure. - Updates the misleading comment (line 748) that claimed the registry entry uses "the same `promoteAbortController`" — actual impl uses a fresh `entryAc` (the audit-fix from the previous push). Tests: - `entry.command` git-commit case pinning post-rewrite form - register-throw rejection + SIGTERM/SIGKILL kill via fake timers - 100/100 shell.test.ts pass; tsc + ESLint clean * fix(core): close 2 QwenLM#3894 review findings — promote refused-race + mkdir orphan Resolves @tanzhenxin's CHANGES_REQUESTED review on QwenLM#3894. 1. **Refused-promote race no longer reported as "Command timed out"** The combined-abort signal folds in `signal | timeoutSignal | promoteAbortController.signal`, but the timeout discriminator only excluded the user-cancel signal — not the promote signal. When the user fires Ctrl+B (PR-3's keybind) but the service's race guard refuses promotion (the child terminated a beat earlier), the result lands `aborted: true, promoted: false` and the foreground path falsely reported `Command timed out after 120000ms`. Both the agent and the user would see a timeout that didn't happen. Fix: extend the discriminator to ALSO exclude `promoteAbortController.signal.aborted`. Add a `wasPromoteRefused` branch that surfaces the actual cause: "Command finished before the background-promote request could be honoured (the child had already exited)." Same fix applied to both the llmContent path and the returnDisplay path so the model and the visible UI agree. Latent in PR-2 itself (no in-tree caller fires the promote yet), but PR-3's keybind would expose it on first ship. 2. **Unguarded mkdirSync orphans the promoted child** After `result.promoted: true`, ownership of the still-running child has transferred and the service's kill path is detached. The promote handler creates the snapshot output directory next, but the original `fs.mkdirSync(outputDir, { recursive: true })` had no guard — read- only temp mounts, sandboxed perms, full disk on inode/metadata exhaustion would reject the handler BEFORE the registry's kill listener was wired. The still-running child became an orphan zombie with no kill path until the OS reaped it on session end. Fix: wrap mkdirSync in try/catch (matches the safety pattern around `registry.register`). On failure, log + best-effort kill the child (SIGTERM via process.kill(-pid) on POSIX, taskkill /f /t on Windows with an `error` listener so a spawn failure doesn't crash Node) + re-throw so the scheduler surfaces the failure to the agent. Tests: 2 new regressions in `shell.test.ts`: - `mkdirSync(outputDir) throws → child gets SIGTERM, error re-raised` - `promote-refused race (aborted: true, promoted: false after promote signal) is NOT reported as "Command timed out"` 171/171 shell.test.ts pass; tsc + ESLint clean.
* feat(cli): Ctrl+B promote keybind — wire UI to PR-2's promoteAbortController (#3831 PR-3 of 3) Final piece of the foreground → background promote feature. PR-1 (#3842) landed the `signal.reason` foundation; PR-2 (#3894) wired `shell.ts` to detect a `{ kind: 'background' }` abort, snapshot output, register a `BackgroundShellEntry`, and stash the promote `AbortController` on `TrackedExecutingToolCall`. This PR exposes the user-visible surface: pressing Ctrl+B during an in-flight foreground shell command transfers ownership to a background task the user can inspect via `/tasks` or stop via `task_stop`. ## Changes - `keyBindings.ts`: new `Command.PROMOTE_SHELL_TO_BACKGROUND` bound to `Ctrl+B`. JSDoc explains the no-shell-running no-op semantics. - `useReactToolScheduler.ts`: project `promoteAbortController` from the core's `ExecutingToolCall` through `TrackedExecutingToolCall` so the React layer (AppContainer keypress handler) can find it by callId without re-plumbing through the scheduler. - `AppContainer.tsx`: `handleGlobalKeypress` gains a `PROMOTE_SHELL_TO_BACKGROUND` branch that walks `pendingToolCallsRef.current` (the ref, not the destructured array — keeps the deps list stable so the handler isn't re-bound on every tool-call status update), finds the executing tool call with a defined `promoteAbortController`, calls `.abort({ kind: 'background' })`, and returns early. No-op when no foreground shell is executing — Ctrl+B then falls through to the input layer's existing cursor-left binding. - `keyboard-shortcuts.md`: documents Ctrl+B with explicit fall-through behavior so the conflict with the prompt-area cursor-left binding is intentional + understandable. ## Tests - `keyMatchers.test.ts` (+1): Ctrl+B positive / bare-b + meta+b + Ctrl+other negatives. - `AppContainer.test.tsx` (+2): - **Ctrl+B promotes** — pendingToolCalls includes an executing shell with a stubbed `AbortController` + spy; firing Ctrl+B asserts `abort({ kind: 'background' })` is called once. - **Ctrl+B no-op** — empty `pendingToolCalls` + Ctrl+B must NOT throw (pins the safety contract for the typing-mid-prompt case where the input layer's own Ctrl+B should still fire). - 37/37 keyMatchers + 58/58 AppContainer pass; tsc + ESLint clean. ## E2E (manual, PR description guidance) The unit / integration tests cover the keybind → abort wiring and the promote handler's downstream behavior (PR-2's tests). Real-PTY E2E is intentionally manual since headless test infrastructure doesn't drive a real shell child + Ctrl+B keystroke; documented in the PR description checklist. Closes the 3-PR sequence for #3831 (Phase D part b of #3634). * fix(cli): #3969 review wave — broadcast comment + debug log + redundancy 5 #3969 review threads addressed: - **AppContainer.tsx Ctrl+B handler**: documented the KeypressContext.broadcast caveat (after `return`, the same Ctrl+B is still dispatched to text-buffer cursor-left + DebugProfiler; visible cursor-left side effect is cosmetic) so future readers understand why the prompt cursor moves on a successful promote. Added `debugLogger.debug` calls on both branches (matched callId on success; streamingState + pendingToolCalls.length on no-op fall-through) so "Ctrl+B doesn't work" reports are debuggable. - **useReactToolScheduler.ts TrackedExecutingToolCall**: dropped the redundant `pid?` and `promoteAbortController?` declarations — both come through the `& ExecutingToolCall` intersection unchanged. Fixed the JSDoc that wrote `{ kind: 'background', shellId }`: callers don't generate `shellId` (it's optional on the abort-reason union and `handlePromotedForeground` produces it downstream). The corresponding executing branch in `toolCallsUpdateHandler` no longer projects pid / promoteAbortController explicitly — `...coreTc` already spreads them; the explicit-undefined clearing in the non-executing branch is also dropped (those fields aren't on coreTc when status !== 'executing', so `...coreTc` doesn't carry them). - **AppContainer.test.tsx**: replaced two `as unknown as Key` double-casts with direct `: Key` annotations on the literal — the object already conforms to the Key interface, double-cast was bypassing type safety needlessly. Tests: 37/37 keyMatchers + 58/58 AppContainer pass; tsc + ESLint clean. No behavior change beyond the new debug log lines. * fix(cli): #3969 wave — tool-name guard + non-shell test + defensive clear 3 #3969 review threads addressed; 1 deferred: - AppContainer.tsx: Ctrl+B `find()` predicate now also checks `tc.request.name === ToolNames.SHELL` before matching the executing tool call. Defense-in-depth — today only the shell tool wires `promoteAbortController`, but a future copy-paste / type confusion that adds the property to a non-shell tool would otherwise let Ctrl+B mistakenly fire `abort({kind:'background'})` on a tool whose service has no promote-handoff handler. - useReactToolScheduler.ts: re-added explicit `pid: undefined` and `promoteAbortController: undefined` to the non-executing return. Previously dropped on the assumption that `...coreTc` doesn't carry these fields when the status isn't `executing` — true today, but the explicit clearing is defense-in-depth against a future core change that adds either field to a non-executing status type (would surface as a stuck PID display or a Ctrl+B handler that matches a no-longer-executing tool call). - AppContainer.test.tsx: replaced the placeholder "no-op when no pending tool calls" framing on the empty-array case (it does exercise the `executing-status` predicate but NOT the tool-name guard) with TWO tests: 1. existing empty-array no-throw test (renamed for clarity) 2. NEW: executing non-shell tool with a hostile-shape `promoteAbortController` — asserts `abortSpy` is NOT called. This is the regression test for the new tool-name guard above. Tests: 61/61 AppContainer.test.tsx pass; tsc + ESLint clean. Deferred to follow-up (replied + tracked): - `debugLogger.debug` is file-only; success-path "agent unblocks + next message says 'promoted to bg_xxx'" is the user-visible signal. Adding a synthetic history item or stderr line for the gap between keypress and agent message conflicts with Ink rendering and is better as a focused UX PR. * test(cli): pin inheritance of pid + promoteAbortController via type assertions #3969 review: the earlier "redundant declaration" review removed the explicit `pid?: number` and `promoteAbortController?: AbortController` from `TrackedExecutingToolCall`, relying on the `& ExecutingToolCall` intersection to inherit them. Current review flags the type-safety regression: if core renames or removes either field, the React-side build won't catch it locally — Ctrl+B handler silently breaks at runtime. Compromise: keep the type minimal (no re-declaration noise the prior review flagged) but add compile-time `extends keyof ExecutingToolCall` assertions that fail loudly + locally if either field disappears. The assertions are evaluated at compile time and zero-cost at runtime; the dummy `const` pins them so they aren't dead code. 61/61 AppContainer tests pass; tsc clean.
…wenLM#3831 PR-1 of 3) (QwenLM#3842) * feat(core): add signal.reason convention for ShellExecutionService.execute() Foundation for QwenLM#3831 Phase D (b) — Ctrl+B promote of a running foreground shell to background. Defines a discriminated `ShellAbortReason` union that the AbortSignal carries; default behavior (no reason / `{ kind: 'cancel' }`) keeps the existing tree-kill on abort. `{ kind: 'background' }` is a takeover signal — execute() skips the kill, drops the child from its active set (so cleanup() won't kill it later), flushes a snapshot of captured output, and resolves the result Promise immediately with `promoted: true` so the awaiting caller unblocks. Pure plumbing: no caller sets the reason yet, so this is a zero-behavior change for existing call sites. The `promoted?: boolean` field is optional on ShellExecutionResult so existing consumers compile against the new shape without source changes. Tests pin both branches in both childProcessFallback and executeWithPty: default abort still SIGTERM-tree-kills; `{ kind: 'cancel' }` is identical to default (pin against accidental routing through the background branch); `{ kind: 'background' }` skips the kill, snapshot output is preserved, mockProcessKill / mockPtyProcess.kill are NOT called. Part of QwenLM#3831 (Phase D part b — Ctrl+B promote running shell to background). PR-1 of 3. * fix(core): detach service listeners on background-promote (resolve review) Addresses 4 Critical + 2 Suggestion findings on PR-1 of QwenLM#3831: - **childProcess listener detach** (review line 555 + 573): Anonymous arrow listeners on stdout/stderr/error/exit could not be off()'d. After background-promote, post-promote bytes would re-enter handleOutput, which then calls decoder.decode() on a now-finalized text decoder (cleanup() already called .decode() without stream:true) → TypeError crash. Even without the crash, old onOutputEvent would fire for new data → ownership contract violation + duplication. Fix: extract named handler refs (stdoutHandler / stderrHandler / errorHandler / exitHandler) and call off() on all four in the background-promote branch via a detachServiceListeners() helper. - **PTY listener detach** (review line 967 + 990): node-pty's onData / onExit return IDisposable handles; the abort handler now captures dataDisposable / exitDisposable and calls .dispose() in the background-promote branch. ptyProcess.on('error') is EventEmitter-style (not IDisposable) — extract a named ptyErrorHandler ref and off() it. Without these, post-promote PTY error throws → Node.js crash; post-promote data continues writing to headlessTerminal and calling old onOutputEvent → ownership violation. - **PTY in-flight chain item ownership** (related to review line 990): processingChain may have already-enqueued callbacks past the early listenersDetached check. Refactored from "early-return short-circuit" to "guard each onOutputEvent emit individually" so in-flight writes still LAND in headlessTerminal (snapshot reflects them) but no events leak to the foreground onOutputEvent. Also clear renderTimeout in the abort handler so a pending throttled render doesn't fire post-promote. - **PTY snapshot freshness** (review line 972, suggestion): The original abort handler called serializeTerminalToText immediately. Now we await Promise.race([processingChain drain, SIGKILL_TIMEOUT_MS]) first (mirrors the onExit finalize pattern at ~line 970) so in-flight headlessTerminal.write callbacks land before serialization. Skipped render(true) intentionally because it would emit final onOutputEvent data (renderFn calls onOutputEvent), violating the "no emit post-promote" invariant — added a comment explaining why direct serialize is correct. - **Handoff-boundary tests** (review line 1257, suggestion): Added 4 new tests pinning the ownership contract — 2 for child_process (post-promote stdout/stderr does NOT route to onOutputEvent; child exit does NOT re-resolve result), 2 for PTY (data/exit disposables ARE called; result shape stays promoted: true even if post-promote events fire). Also: test setup now stubs mockPtyProcess.onData / .onExit to return { dispose: vi.fn() } so the background-promote path's dispose() calls don't crash on undefined (the stub's mock.results[0].value is then inspected by the new handoff tests). 58 / 58 tests pass (50 baseline + 4 first-pass + 4 handoff). Total +235 / -35 on top of the prior commit. * fix(core): defensive hardening for ShellExecutionService background-promote (resolve 2nd review pass) Addresses 6 follow-up [Suggestion] threads on PR-1 of QwenLM#3831 — all substantive code-quality issues raised by the second-pass review of the dispose-based detach commit (2bca15d5c): - **Exhaustive switch on `ShellAbortReason.kind`** (both abort handlers). Earlier `if (reason?.kind === 'background')` form silently fell through to kill for any unrecognized variant — a future `{ kind: 'suspend' }` would have killed the process with zero compile-time signal. Switched to `switch (kind)` with a `never`-typed default that runs `debugLogger.warn` and falls back to the safest behavior (cancel/kill). Each branch is now extracted into a named helper (`performBackgroundPromote` / `performCancelKill`) so the switch body stays a single screenful. - **Each `dispose()` wrapped in its own try/catch** (PTY). node-pty's `IDisposable` contract doesn't guarantee no-throw. Without per-dispose try/catch a single throwing dispose() would skip subsequent cleanup (the other dispose, off('error'), activePtys.delete, drain, resolve) and the caller would hang forever on `await result`. Each call now logs via debugLogger.warn on failure but continues. - **`.catch(() => undefined)` on the processingChain side of the drain race** (PTY). `Promise.race([processingChain.then(drain).then(drain), timeout])` would propagate a chain rejection out of the race; since `addEventListener` doesn't await our handler, the rejection became unhandled and `resolve()` was never called → caller hung. Now the rejection is swallowed; the timeout side still terminates the race on time. - **Drain-timeout truncation now emits a diagnostic warning** (PTY). Previously the 200ms drain timeout could fire, the snapshot would be taken with the buffer in mid-write state, and the result.output would be silently truncated. Race result is now observed via a symbol sentinel; when the timeout side wins, debugLogger.warn fires pointing the user at rawOutput as the un-truncated fallback. - **Snapshot serialize failure logs instead of swallowing silently** (PTY). Empty `catch {}` made result.output indistinguishable from "command produced no output" if serializeTerminalToText threw. Now `debugLogger.warn` with the error message leaves a trail for support bundles. - **Dedicated `PROMOTE_DRAIN_TIMEOUT_MS` constant** separated from `SIGKILL_TIMEOUT_MS`. Both are 200ms today, but they have unrelated reasons-to-change (kill escalation timing vs. promote drain ceiling) — sharing the constant means tuning one would silently change the other. Also adds a module-level `debugLogger = createDebugLogger('SHELL_EXECUTION')` since the service had no logging surface before this commit. 58 / 58 tests pass; tsc clean; ESLint clean. No new tests added: the new behaviors (timeout sentinel firing, dispose throw, exhaustive switch default) are defensive log-only paths; existing handoff tests already cover the happy path. Adding mock-throw tests is reasonable follow-up but not blocking. * fix(core): real bug — ptyProcess.off → removeListener; defensive abort-reason read Resolves the third review pass on PR-1 of QwenLM#3831 — 1 real bug + 2 defensive hardenings: - **Real bug: `ptyProcess.off('error', ...)` throws TypeError at runtime** (line ~1074). `@lydell/node-pty`'s `IPty` interface exposes the legacy Node EventEmitter `removeListener`, not the modern `off` alias. Previous form threw, the surrounding try/catch swallowed it (post-prior-pass dispose hardening), but the old `ptyErrorHandler` stayed registered — so a post-promote PTY error would still hit our foreground handler and `throw err`, breaking the handoff contract that PR-1's whole listener-detach work is supposed to enforce. Switched to `removeListener`. The catch + warn stays as defense-in-depth; the message wording is updated. - **Prototype-pollution-safe `kind` read** (extracted to module-level helper `getShellAbortReasonKind`). The previous `reason?.kind` walked the prototype chain — a polluted `Object.prototype.kind = 'background'` would silently route `abortController.abort({})` (any plain object reason) into the promote branch and skip the kill. Lifecycle/safety branch deserves the extra check. Helper now: rejects non-object reasons; reads `kind` only as an OWN property (`hasOwnProperty`); whitelists against `'background' | 'cancel'`; defaults to `'cancel'` (the safe historical behavior) for everything else. Both abort handlers (childProcess + PTY) now share this helper. - **`streamStdout: true` + background-promote = silent empty snapshot** (childProcess `performBackgroundPromote`). The promote snapshot reads from the `stdout` / `stderr` string accumulators; but in `streamStdout` mode `handleOutput` forwards bytes through `onOutputEvent` and skips the accumulators entirely. Today PR-1's only call site (foreground shell.ts) uses `streamStdout: false`, so the combination is unreachable — but if a future caller pairs the two, `result.output` would be empty with no diagnostic. Added a `debugLogger.warn` when the combination occurs, pointing the caller at `rawOutput` as the fallback. Cheaper than building a parallel accumulator just for this latent case. 58 / 58 tests pass; tsc clean; ESLint clean. * fix(core): liveness check + throw-safe abort-reason read + encoding-aware PTY snapshot (resolve 4th review pass) Resolves 6 threads on PR-1 of QwenLM#3831 — 1 Critical + 1 real bug + 2 quality + 2 test-coverage: - **[Critical] `getShellAbortReasonKind` throw-safe property read.** Previous form read `reason.kind` after only checking that `kind` is an own property. An own accessor that throws (or a Proxy with a trapping getter) would throw before the helper reached either the cancel kill path or the background promote path. Abort handlers are dispatched async and not awaited by AbortSignal, so a leaked throw here would have left the shell process alive instead of being killed on cancel — quietly. Wrapped the property read in try/catch with a fall-back to the safe 'cancel' kill behavior. - **Real bug: child_process post-exit race in background-promote** (`performBackgroundPromote`). The child may have already exited but the 'exit' event hasn't reached our handler yet (Node delivers events on the next microtask). Promoting in that window would detach our exit listener and report `promoted: true` for a process that's already dead — the caller would hold an inert pid expecting to take over. Now we read `child.exitCode` / `child.signalCode` before detaching: if either is non-null, fall through and let the pending exit handler resolve normally with the real exit info. Mirrored mock setup so `exitCode` / `signalCode` default to `null` (matching real ChildProcess) instead of `undefined`. - **PTY snapshot: re-decode + replay (mirror exit-path encoding).** The promoted snapshot was serializing `headlessTerminal` directly, which was fed by a streaming decoder initialized from the first-chunk encoding heuristic. When early output is ASCII-only but later output is in a different encoding (GBK / Shift-JIS / etc.), this produces mojibake — and the normal exit path doesn't, because it re-decodes `finalBuffer` with `getCachedEncodingForBuffer` and replays through a fresh terminal. Now mirrors that logic so `result.output` shape matches across the two paths. Direct-serialize remains as a last-ditch fallback if replay throws. - **Switch `default` no longer emits a runtime warn.** Reviewer noted the helper's whitelist made the `default: { _exhaustive: never }` branch unreachable at runtime — the `debugLogger.warn` in it could never fire. Kept the `_: never = kind` type assertion (so a future ShellAbortReason variant forces a TS error here, directing the developer to extend BOTH the helper's whitelist AND add a `case`), removed the unreachable warn. Added a comment that the assertion is the static-only safety net the union expansion would trigger. - **Direct unit tests for `getShellAbortReasonKind`** (8 cases). The helper's prototype-pollution defense is the main reason it exists; if `hasOwnProperty` is accidentally removed the regression would silently send `abortController.abort({})` (any plain reason) into the promote path. Exported the helper and added direct tests for: null / undefined, non-object, empty object (no own kind), prototype- only kind (pollution), unknown kind value, throwing accessor, Proxy trap, and the two happy paths. - **`removeListener` regression guard.** The fix to call `ptyProcess.removeListener('error', ...)` instead of `.off(...)` matters because `@lydell/node-pty`'s IPty interface only exposes `removeListener` — `.off()` throws TypeError on a real PTY but the EventEmitter mock tolerates both. Added a test that spies on both methods and asserts the production code uses `removeListener` for the 'error' event, so a future swap back to `.off()` regresses loudly under the mock instead of silently. 68 / 68 tests pass (58 baseline + 9 helper boundary + 1 removeListener guard + 1 post-exit race); tsc clean; ESLint clean. * fix(core): PTY background-promote post-exit race guard (resolve 5th review pass) Mirrors the child_process post-exit race fix from 0df1be23f into the PTY path — addresses 1 [Critical] thread on PR-1 of QwenLM#3831: The PTY may have already exited but our `exitDisposable` (onExit callback) hasn't run yet — node-pty delivers the exit event asynchronously after the PTY's native SIGCHLD, so there's a window between "PTY actually dead" and "service onExit fires". Promoting in that window detaches our exit listener and reports `promoted: true` for a dead PTY, losing the real exit status; the caller would hold an inert pid expecting to take over. The IPty interface doesn't expose an `exitCode` field we can read directly (unlike `child.exitCode` / `child.signalCode` for child_process), so use `process.kill(pid, 0)` as a best-effort liveness check via the existing `ShellExecutionService.isPtyActive` helper. If kill(pid, 0) throws ESRCH, the pid is gone — log at debug level and fall through, letting the pending onExit callback resolve normally with the real exit info. Also adds a unit test mirroring the child_process race test: mocks `process.kill(pid, 0)` to throw ESRCH on the liveness probe, asserts the result has no `promoted: true` and reports the real exitCode. 69 / 69 tests pass; tsc clean; ESLint clean. * docs(core): correct getShellAbortReasonKind boundary-test count in JSDoc Doc said 'all six edge cases' but the test suite has 8 cases (added Proxy-trap and undefined later). Off-by-2 cosmetic only — no behavior change. Caught during a multi-round self-audit of PR-1 of QwenLM#3831. Audit summary: 7 rounds (correctness / reverse / consistency / coverage / build / exception paths / style) found one false-positive (a sync- abort registration-order race I initially thought existed). Verified that Node's WHATWG AbortSignal does NOT auto-fire 'abort' listeners on already-aborted signals, so the race window cannot open. No code change needed for that scenario; this commit is just the JSDoc fix. 69 / 69 tests still pass; tsc + ESLint clean. * docs(core): document the helper / union / switch sync invariant explicitly Multi-round self-audit found that `getShellAbortReasonKind`'s value whitelist has no compile-time tie to the `ShellAbortReason` union: when the union grows, TypeScript's `_exhaustive: never` in each switch forces QwenLM#3 (the case arm) to be added, but the helper's whitelist (QwenLM#2) silently keeps degrading the new variant to 'cancel', and the new case arm is never reached at runtime. Reviewer QwenLM#4 raised this on the second pass; the original commit chose to accept it (option B in that thread) but didn't leave a strong in-code signal for future contributors. Added an INVARIANT block inside the helper enumerating the three sites that must be kept in sync, so the next person extending `ShellAbortReason` sees the coupling at the place where they're most likely to forget it. No behavior change — comment-only. 69 / 69 tests still pass; tsc + ESLint clean. Audit summary (this round + prior round): 18 angles total over two sweeps and one reverse-attack pass. Found: - 0 real bugs - 1 false-positive race (sync-abort registration order — Node WHATWG AbortSignal does NOT auto-fire on already-aborted signals; investigated, reverted) - 1 cosmetic doc fix (boundary-test count off-by-2) - 1 cosmetic INVARIANT block (this commit) Areas reviewed without finding new issues: caller-side ShellExecutionResult shape compatibility (optional `promoted?` field, existing callers spread-untouched); `exited` flag lifecycle (monotonic, cleanup() idempotent); processingChain in-flight ownership (listenersDetached guards every onOutputEvent emit including the renderFn-rendered case via the same flag); race between exit event and abort handler (both microtasks, FIFO ordering gives correct outcome either way); Node version dependence (`AbortSignal.reason` is Node 17.2+, engines: >=20 covers it); test isolation (mockImplementationOnce + module-level mockProcessKill clears each beforeEach); `process.kill(pid, 0)` Windows liveness reliability (best-effort, acceptable for PR-1 plumbing); PID reuse race on the PTY liveness check (theoretically possible, microsecond window, unavoidable at the OS level — rejected in spec discussion); PR-2/PR-3 contract surface (caller MUST attach listeners before abort — documented; any future caller violating this is its own bug). * test(core): align mockChildProcess.exitCode/signalCode in second beforeEach The 'execution method selection' describe block has its own beforeEach (separate from 'child_process fallback') that builds mockChildProcess but does not set `exitCode` / `signalCode = null`. Real Node `ChildProcess.exitCode` / `signalCode` are `null` while the process is alive — and production now reads these in the background-promote race guard. The current tests in this block don't exercise the promote path, so they pass regardless, but any future promote-related test landing here would silently trip the guard (`undefined !== null` is true) and fall through to the normal-exit branch instead of promoting. Mirror the `child_process fallback` block's mock setup so the two beforeEach hooks produce equivalent ChildProcess shapes, eliminating a quiet foot-gun for future contributors. Comment-only / test-fixture change. 69 / 69 tests still pass; tsc clean. Found during a deeper third-round self-audit of PR-1 of QwenLM#3831.
…-up to QwenLM#3842) (QwenLM#3886) * fix(core): wrap hasOwnProperty.call inside try + add post-abort PTY data assertion Two non-blocking review notes from @tanzhenxin's PR-1 approval (QwenLM#3842, post-merge follow-up): - **Note 2 (real bug)**: `getShellAbortReasonKind` had `Object.prototype.hasOwnProperty.call(reason, 'kind')` outside the try/catch. `hasOwnProperty.call` triggers the `[[GetOwnProperty]]` Proxy trap (`getOwnPropertyDescriptor` handler). A Proxy whose `getOwnPropertyDescriptor` throws — separate from a throwing `get` trap, which the prior commit already covered — would propagate past the helper, leaving the abort handler's switch on `kind` to throw through `addEventListener` (which doesn't await async listener return values), so the shell process would stay alive instead of being killed on cancel. Moved the descriptor probe inside the same try block as the value read. My own multi-round audit covered six attack vectors but missed this one — only `get` trap throws were considered. The reviewer caught it on first pass. - **Note 3 (test parity)**: the PTY post-promotion handoff test asserted `dispose` was called but never re-invoked the data callback to verify the foreground `onOutputEvent` actually stops firing — the child_process equivalent has that assertion. Mirrored it: emit data AFTER abort by re-invoking the captured `dataCallback` reference, and assert `onOutputEventMock.mock.calls.length` does NOT increase past the moment of promote. Exercises the production `listenersDetached` guard inside the chain callback, which the bare dispose-was-called check didn't. Note 1 from the same review (the `aborted: true + promoted: true` shape forcing PR-2 callers to check `promoted` before `aborted`) is deliberately NOT addressed here — it's a contract simplification that affects PR-2's branching, so it belongs in PR-2 along with the caller-side decision on whether to flip `aborted` for promoted results. Added a TODO upstream in QwenLM#3831 (PR-2 design) to track. 70 / 70 tests pass (69 baseline + 1 new helper boundary for the throwing-getOwnPropertyDescriptor case). tsc + ESLint clean. * test(core): fix tautological PTY post-promote assertion (audit follow-up) The PTY post-promotion handoff test added in the previous commit copied the child_process equivalent's pattern verbatim — sync \`expect(count).toBe(countAtPromote)\` immediately after dataCallback returns. That works for child_process because its \`handleOutput\` is fully synchronous (sniff → decoder → emit, all on the same call stack), so the count change happens BEFORE the assertion. PTY's \`handleOutput\` is async — \`processingChain.then(...)\` queues a microtask that does the sniff + write + render-then-emit work. The sync assertion captures both \`countAtPromote\` and the post-emit count BEFORE the chain microtask ever runs, so both reads return whatever happened before the assert (typically 0). The test would tautologically pass even if the production \`listenersDetached\` guard were removed — i.e., it didn't actually verify the guard. Restructured to: 1. Drive the PTY through \`simulateExecution\` so \`await handle.result\` forces all queued microtasks (including pre-promote chain items AND the abort handler's drain) to settle. 2. Capture \`eventCountAfterSettle\` once everything has stabilized. 3. Re-invoke the captured \`dataCallback\` with post-promote data, await two more macrotask boundaries to let the new chain item fully run. 4. Assert the count hasn't moved. If the production \`listenersDetached\` guard is removed, the post-promote chain item emits, count increases past \`eventCountAfterSettle\`, and this assertion fails. So the test actually exercises the guard now. Found in self-audit while reviewing my own follow-up commit. Caught because audit was paranoid about *whether the test verifies what it claims to verify*, not just whether it passes. 70 / 70 tests pass; tsc + ESLint clean. * test(core): pin eventCountAfterSettle === 0 in PTY post-promote test The previous fix asserted post-promote count equals the \`eventCountAfterSettle\` baseline, but didn't pin the baseline itself. With the production \`listenersDetached\` guard intact, both halves (pre-promote chain and post-promote chain) suppress emit, so \`eventCountAfterSettle === 0 === post-count\` and the relative comparison is vacuously true. If a future refactor changed the production guard semantics so the pre-promote chain item DID emit (count becomes 1+ after settle), the relative-comparison test would still pass as long as post-promote also emitted the same number — that's a regression the test should catch. Adding \`expect(eventCountAfterSettle).toBe(0)\` makes the contract explicit: once \`listenersDetached\` is set during the abort handler's sync part, BOTH the in-flight chain item (pre-promote) and the future chain item (post-promote) skip emit. Found in another self-audit pass — even after fixing the tautological assertion, the test could still mask certain future regressions. 70 / 70 tests pass; tsc + ESLint clean. * test(core): drop dataCallbackHolder pattern, read mock.calls directly The dataCallbackHolder pattern (capturing the onData callback inside simulateExecution then invoking it after via a closure-shared object) was unnecessary indirection — \`mockPtyProcess.onData.mock.calls[0][0]\` reads the same callback reference whether you read it inside or outside the simulation closure. Vi's mock.calls array is per-mock-instance and beforeEach re-creates mockPtyProcess + .onData freshly, so there's no stale-reference risk in the simpler form. No behavior change in what's tested. 70 / 70 pass. * chore(core): drop in-source @-mention attribution + dead Proxy.get handler Two cosmetic cleanups found in another self-audit pass: - **Helper comment**: removed the parenthetical attribution ("Caught by @tanzhenxin in the PR-1 review; my own audit only covered \`get\` trap throws."). The technical content of the comment — explaining why both the descriptor probe and the value read live inside the try — stands on its own. The reviewer credit lives in commit history / PR description, where it belongs; an in-source @-mention ages poorly (handles change, the relevant person may move on) and doesn't help future readers reason about the code. - **Test Proxy**: \`throwingDescriptorProxy\` declared a \`get()\` handler that always returned \`undefined\`. The descriptor probe throws before the helper ever reaches the value read, so the \`get\` handler is unreachable — dropped it and added a one-line comment explaining why no \`get\` handler is needed for this test. Mirror test (\`throwingReason\` with throwing accessor + \`proxyReason\` with throwing \`get\`) keeps the symmetric "throwing-`get`" coverage. 70 / 70 tests pass; tsc + ESLint clean.
Summary
PR-1 of 3 implementing Phase D part (b) of #3831 — Ctrl+B promote of a running foreground shell to background. Pure plumbing, zero behavior change for existing callers (no caller sets the new reason yet).
Defines a discriminated
ShellAbortReasonunion that anAbortController.abort(reason)can carry intoShellExecutionService.execute():{ kind: 'cancel' }) → unchanged tree-kill on abort.{ kind: 'background' }→ takeover signal:execute()skips the kill, drops the child from its active set (soShellExecutionService.cleanup()won't kill it later either), flushes a snapshot of captured output into the result, and resolves the result Promise immediately withpromoted: trueso the awaiting caller unblocks. The caller has accepted ownership of the live child.ShellExecutionResultgains an optionalpromoted?: booleanfield; existing consumers compile against the new shape without source changes.Why this PR alone
Full sequence per #3831:
signal.reasonfoundation + reason-aware abort handling. Zero behavior change.shell.tsintegration: lazy promote,BackgroundShellRegistry.promoteFromForeground(...), output buffer flush + stream redirect.Command.PROMOTE_TO_BACKGROUNDkeybinding (Ctrl+B), AppContainer wire-up, E2E tests.#3831's design discussion is open. Pushing PR-1 because the foundation has zero user-visible behavior change and is independently mergeable / revertable — surfacing the contract in code form alongside the issue. Reviewer can sanity-check the abort-handler shape independently from the lazy-promote semantics in PR-2.
Caller contract
Callers wishing to promote MUST attach their own listeners (
data/exit/error) to the live child before callingabortController.abort({ kind: 'background', ... }). After the background-abort,execute()no longer routes events through its own handlers' downstream consumers. The caller seeingresult.promoted === trueknows ownership has been transferred — it must NOT treatresult.exitCode/result.signalas terminal (the underlying process has not exited).Design rationale (why
signal.reason, not a new parameter)Three alternatives were considered (full discussion in #3831):
execute()parameter likeonPromoteRequestcallback orpromotionModeflag — adds an always-present API surface for a rare path; couples the takeover lifecycle into the call signature; harder to extend later (would need a 2nd parameter for shellId, etc.).BackgroundShellEntryfor every foreground spawn, then flipvisible: trueon Ctrl+B — touchesShellExecutionServiceheavily (output stream conditionally file-vs-memory throughout) for a feature that fires <1% of the time. Wasteful in the common case.signal.reasonas takeover channel (this PR) — reuses existingAbortControllerplumbing every caller already has; the discrimination is a 5-line addition to the existing abort handler; zero overhead in the common case (no Ctrl+B → no promote logic runs); leverages Node's nativeAbortSignal.reason(Node 17.2+, ourengines: >=20covers it).Option 3 was preferred in the #3831 design issue. This PR implements that choice.
Why resolve the Promise immediately on background-abort (vs. wait for natural child exit): blocking the caller's
awaituntil the (potentially long-running) child exits would defeat the entire purpose — the caller needs to unblock so the agent can move on. Once the caller has accepted ownership (per the contract above), the result Promise's job is done.Implementation notes
getShellAbortReasonKind(reason)helper (exported) readssignal.reason.kinddefensively: rejects non-object reasons; readskindonly as an OWN property viahasOwnProperty.call(prototype-pollution defense —Object.prototype.kind = 'background'cannot force the promote branch); wraps the read in try/catch so an own getter /Proxytrap that throws can't leak past the async abort handler (which is dispatched viaaddEventListenerand not awaited — a leaked throw would leave the shell process alive instead of killed); whitelists the value against the known union members and defaults to'cancel'for everything else (safe historical kill behavior).switch—executeWithPtyandchildProcessFallbackextractperformBackgroundPromote/performCancelKillnamed helpers andswitch (kind)between them. Thedefault: { const _exhaustive: never = kind; ... }arm forces a compile-time error ifShellAbortReasonever grows a new variant — directing the developer to (1) extend the helper's whitelist and (2) add acasearm. (The runtimedefaultbranch is unreachable today; the assertion is the static-only safety net.)activeChildProcesses/activePtysset deletion — critical soShellExecutionService.cleanup()(the process-exit handler that wipes active sets) won't kill the promoted child on Ctrl+C or process shutdown later.if (child.exitCode !== null || child.signalCode !== null)— fall through to the normal exit path. The child may have already exited but the'exit'event hasn't reached our handler yet (Node delivers child_process events on the next microtask); promoting in that window would detach our exit listener and reportpromoted: truefor a process that's already dead.if (!ShellExecutionService.isPtyActive(ptyProcess.pid))— node-pty'sIPtydoesn't exposeexitCode, so we useprocess.kill(pid, 0)(the existing helper) as a best-effort liveness probe. If the pid is gone, fall through and let the pendingonExitcallback resolve normally with the real exit status.stdoutHandler/stderrHandler/errorHandler/exitHandler); the abort handler callsdetachServiceListeners()whichoff()s all four. Without this, post-promote bytes re-enterhandleOutput, which callsdecoder.decode()on the now-finalized text decoder (cleanup() called.decode()withoutstream:true) → TypeError crash. Even without the crash, the oldonOutputEventwould fire for new data → ownership contract violation + duplication.node-pty'sonData/onExitreturnIDisposablehandles; the abort handler capturesdataDisposable/exitDisposableand calls.dispose()— each in its own try/catch (the IDisposable contract doesn't guarantee no-throw, and a single throwing dispose must not skip subsequent teardown).ptyProcess.on('error')is EventEmitter-style — extracted namedptyErrorHandlerref andremoveListener'd (NOToff()—@lydell/node-pty'sIPtyinterface only exposes the legacyremoveListener;.off()throws TypeError on a real PTY). Without these, post-promote PTY error throws → Node.js process crash; post-promote data continues writing toheadlessTerminaland calling oldonOutputEvent→ ownership violation.processingChainownership —processingChainmay have already-enqueued callbacks past the dispose point.listenersDetachedflag guards eachonOutputEventemit individually inside the chain callbacks. This way in-flight writes still LAND inheadlessTerminal(so the snapshot reflects them), but no events leak to the foregroundonOutputEvent. Also clearrenderTimeoutin the abort handler so a pending throttled render doesn't fire post-promote.PROMOTE_DRAIN_TIMEOUT_MSconstant — abort handlerawaitsPromise.race([processingChain.then(drain).then(drain).catch(() => undefined), timeout])before serialization (mirrors theonExitfinalize pattern at line ~970) so in-flightheadlessTerminal.writecallbacks land before we read the buffer. The chain side has.catch(() => undefined)so a chain rejection becomes the resolved branch instead of an unhandled rejection (abort handlers run viaaddEventListenerwhich doesn't await our return, so a leaked rejection would meanresolve()is never called and the caller hangs). Race result is observed via aSymbolsentinel; if the timeout side wins,debugLogger.warnfires pointing the caller atrawOutputas the un-truncated fallback. The drain timeout has its own constant separate fromSIGKILL_TIMEOUT_MSso tuning kill escalation doesn't silently change drain behavior.result.outputfor the promoted PTY is the same shape as the normal exit path: re-decodefinalBufferwithgetCachedEncodingForBuffer(finalBuffer)+replayTerminalOutput(...). DirectserializeTerminalToText(headlessTerminal)would risk mojibake when early output is ASCII-only but later output is in a different encoding (GBK / Shift-JIS). Direct serialize stays as a last-ditch fallback if replay throws. Deliberately does NOT callrender(true)—renderFnemits viaonOutputEvent, which would violate the "no emit post-promote" invariant.result.outputfor the promoted result is the output captured up to the moment of abort (text decoders flushed for child_process viacleanup(); encoding-aware re-decode + replay for PTY after drain). Caller (PR-2) uses this to seed theBackgroundShellEntry's on-disk output file before installing its own data listener for subsequent output.streamStdout: true+{ kind: 'background' }is a latent empty-snapshot combo (snapshot reads from string accumulators thatstreamStdoutmode bypasses). Today PR-1's only caller usesstreamStdout: falseso the combination is unreachable, butdebugLogger.warnfires if a future caller pairs them, pointing atrawOutputas the fallback.?? ''on PTY snapshot —serializeTerminalToTextmay return undefined under unusual conditions (mid-render race, or in test mocks); the fallback keepsresult.outputtyped as string.Test coverage
shellExecutionService.test.ts: 69 / 69 pass (50 baseline + 19 new across five review rounds).First-pass — discrimination (round 1, 4 tests):
{ kind: 'cancel' }still tree-kills viaprocess.kill(-pid, 'SIGTERM');{ kind: 'background' }skips kill —mockProcessKillNOT called with SIGTERM/SIGKILL on group pid;mockPtyProcess.killNOT called;result.promoted === true;result.exitCode === null;result.signal === null.{ kind: 'cancel' }still tree-kills (mirror of PTY);{ kind: 'background' }skips kill — output captured pre-abort preserved as snapshot inresult.output(test emits'line1\nline2\n'via stdout, then aborts; verifies snapshot contains both lines).Handoff-boundary (round 1, 2 tests):
mockPtyProcess.onData.mock.results[0].value.disposeandmockPtyProcess.onExit.mock.results[0].value.disposeARE called — verifying the disposable handles returned by node-pty are correctly disposed; result shape stayspromoted: trueeven when subsequent events would otherwise fire.'data'on the livecp.stdout/cp.stderrdoes NOT increaseonOutputEvent.mock.calls.length; emitting'exit', 42, nulldoes NOT changeresult.exitCode(staysnull). Pre-promote bytes ARE in the snapshot; post-promote bytes are NOT.Post-exit race guards (round 4, 2 tests):
child.exitCodewas set to 0 BEFORE'exit'reaches our handler, abort with{ kind: 'background' }— production race guard detects the terminal state and falls through; result has nopromoted: true,exitCode: 0flows through.process.kill(pid, 0)to throw ESRCH on the liveness probe, abort with{ kind: 'background' }— production guard refuses to promote a dead PTY; result has nopromoted: true.removeListenerregression guard (round 3, 1 test):mockPtyProcess.removeListenerandmockPtyProcess.off, asserts production teardown callsremoveListener('error', …)and NOToff('error', …)—@lydell/node-pty'sIPtyinterface only exposesremoveListener, not the modernoffalias. The EventEmitter mock tolerates both, so without this guard a future swap to.off()would silently regress under tests but throw TypeError on a real PTY.getShellAbortReasonKindboundary tests (round 4, 8 tests, helper exported for direct testing):Object.create({ kind: 'background' })) → 'cancel' (pollution defense)Object.defineProperty('kind', { get() { throw } })) → 'cancel'gettrap → 'cancel'{ kind: 'background' }/{ kind: 'cancel' }happy pathsTest fixture hygiene (round 4, 2 mock-setup adjustments):
child_process fallbackandexecution method selectionbeforeEachhooks setmockChildProcess.exitCode/signalCodetonull(matches real NodeChildProcessshape — alive process has these asnull, notundefined). Without this, the new race guardchild.exitCode !== nullwould silently misfire onundefined.mockPtyProcess.onData/.onExitreturn{ dispose: vi.fn() }so the productiondataDisposable.dispose()/exitDisposable.dispose()calls don't crash onundefined.dispose(). The stub's.mock.results[0].valueis then inspected by the handoff-boundary tests.Test plan
Related
中文版
PR-1 of 3 — 实现 #3831 的 Phase D 第 (b) 部分:Ctrl+B 把运行中的前台 shell 甩到后台。纯 plumbing,对现有调用方零行为变化(没人传新 reason)。
定义一个 discriminated union
ShellAbortReason,让AbortController.abort(reason)携带进ShellExecutionService.execute():{ kind: 'cancel' })→ 维持现状的 tree-kill。{ kind: 'background' }→ takeover 信号:execute()跳过 kill,把 child 从 active 集合移除(防ShellExecutionService.cleanup()后续杀它),flush 一份已捕获 output 的 snapshot 进 result,立即 resolve result Promise 带promoted: true让 await 中的调用方解锁。调用方已经接管了运行中的 child。ShellExecutionResult加可选字段promoted?: boolean,现有消费方不用改源码就能编译通过。为什么 PR-1 单独推(不等 #3831 alignment):foundation 零用户可见行为变化,独立可合可回退;用 PR 把契约具象化呈现给 reviewer,跟 issue 的设计讨论对照。Reviewer 可以在不看 PR-2 lazy-promote 语义的情况下独立 sanity-check abort handler 的形状。
调用方契约:想 promote 的调用方必须在调
abortController.abort({ kind: 'background', ... })之前给 live child attach 自己的 listener(data/exit/error)。background-abort 之后,execute()不再把事件路由给自己 handler 的下游消费方。调用方看到result.promoted === true就知道所有权已交接 — 不能把result.exitCode/result.signal当作终态(底层进程没真退出)。设计 rationale(为啥用
signal.reason不加新参数):考虑过 3 个选项(详细在 #3831):(1) 给execute()加新参数 — 给罕见路径加常驻 API 表面,耦合 takeover 生命周期到调用签名;(2) 给每个前台 spawn 预分配 hiddenBackgroundShellEntry然后 Ctrl+B 翻visible: true— 改 ShellExecutionService 太多(输出流条件 file-vs-memory),common case 浪费;(3)signal.reason作 takeover 通道(本 PR)— 复用每个调用方都有的 AbortController plumbing,区分逻辑是 5 行加在现有 abort handler 里,common case 零开销,用 Node 原生AbortSignal.reason(Node 17.2+,我们engines: >=20覆盖)。#3831 design 倾向 (3),本 PR 实现这个选择。为什么 background-abort 立刻 resolve Promise(不是等 child 自然 exit):让调用方
await阻塞到(可能长跑的)child exit 就完全违背初衷 — 调用方需要解锁让 agent 继续。一旦调用方接管了所有权(按上面契约),result Promise 的使命完成。Review 状态:5 轮 review 累计 22 个 thread 全部 resolved。第 1 轮(
8e8e18ca7,6 个 — 4 Critical + 2 Suggestion):引入 dispose-based listener detach。第 2 轮(9008717b8,6 个):defensive hardening — 穷尽 switch、per-dispose try/catch、drain race.catch防 unhandled rejection、独立PROMOTE_DRAIN_TIMEOUT_MS、drain timeout sentinel/warn、snapshot serialize warn。第 3 轮(094fff09c,3 个):真 bug —ptyProcess.off('error')在@lydell/node-pty抛 TypeError(只暴露removeListener),改 + 加 regression guard;prototype-pollution-safe own-property kind 读。第 4 轮(4cc558b3d,6 个 — 1 Critical + 5):throw-safe abort-reason 读(own getter / Proxy trap try/catch 包)、child_process post-exit liveness check、encoding-aware PTY snapshot(re-decode + replay)、删 unreachabledefault-warn(保_exhaustive: never静态检查)、8 helper 边界测试。第 5 轮(289feaa74,1 Critical):PTY 同款 race guard viaprocess.kill(pid, 0)liveness probe。下面"实现要点"描述最终 ownership 模型。实现要点:
getShellAbortReasonKind(reason)helper(已 export)防御性读signal.reason.kind:拒绝非对象 reason;用hasOwnProperty.call只读 OWN property(防原型污染 —Object.prototype.kind = 'background'不能进 promote);try/catch 包读取(防 own getter / Proxy trap 抛错穿透 async abort handler — addEventListener 不 await,泄漏 throw 会让 shell 进程留活而非被 kill);返值白名单到 union 成员,未知值默认 'cancel'(安全历史 kill 行为)。switch:executeWithPty+childProcessFallback提取performBackgroundPromote/performCancelKill命名 helper,switch (kind)分支。default: { const _exhaustive: never = kind; ... }在ShellAbortReasonunion 扩展时强制 TS 报错 — 提示开发者扩 helper 白名单 + 加 case。(runtime default 不可达,仅静态保护。)activeChildProcesses/activePtys集合 delete 关键:防ShellExecutionService.cleanup()(process exit handler 清 active 集合)在 Ctrl+C 或 process shutdown 时杀掉 promoted child。if (child.exitCode !== null || child.signalCode !== null)→ fall through 走正常 exit 路径。child 可能已 exit 但'exit'事件还没 reach handler(Node 异步派发);该窗口 promote 会 detach exit listener 上报promoted: true给已死进程。if (!ShellExecutionService.isPtyActive(ptyProcess.pid))。node-pty 的IPty不暴露exitCode,用process.kill(pid, 0)(现有 helper)作 best-effort liveness probe。pid 不存在 → fall through,让 pendingonExit走真实 exit。stdoutHandler/stderrHandler/errorHandler/exitHandler);abort handler 调detachServiceListeners()把 4 个全off()。否则 post-promote 字节会 re-enterhandleOutput→ 它会调decoder.decode()在已 finalize 的 decoder(cleanup() 已调.decode()nostream:true)→ TypeError crash。即使没 crash,老onOutputEvent也会 fire → 违反 ownership + duplicate。onData/onExit返IDisposable;abort handler 拿dataDisposable/exitDisposable调.dispose()— 每个独立 try/catch(IDisposable contract 不保证 no-throw,单一 dispose 抛错不能跳过后续 teardown)。ptyProcess.on('error')是 EventEmitter 风格 — 提取 namedptyErrorHandlerref +removeListener(不是off()—@lydell/node-pty的 IPty 接口只暴露 legacyremoveListener;.off()在真 PTY 抛 TypeError)。否则 post-promote PTY error throw → Node.js 进程 crash;post-promote data 继续写headlessTerminal调老onOutputEvent→ 违反 ownership。processingChainownership — chain 可能有 dispose 之前已 enqueue 的 callback。listenersDetachedflag 守每个 chain callback 内的onOutputEventemit。这样 in-flight write 仍 LAND 进headlessTerminal(snapshot 反映它们),但事件不 leak 到老onOutputEvent。abort handler 内也清renderTimeout防 pending 节流 render fire。PROMOTE_DRAIN_TIMEOUT_MS常量 — abort handlerawait Promise.race([processingChain.then(drain).then(drain).catch(() => undefined), timeout])后再 serialize(镜像onExit的 finalize pattern at line ~970)让 in-flightheadlessTerminal.writecallback 在读 buffer 之前 land。chain 边带.catch(() => undefined),让 chain rejection 变 resolved 而非 unhandled rejection(abort handler 经addEventListener不被 await,泄漏 rejection → resolve 永不调 → caller 死锁)。Race 结果用 Symbol sentinel 检测;timeout 边赢 →debugLogger.warn指向rawOutput作未截断 fallback。drain timeout 用独立常量与SIGKILL_TIMEOUT_MS解耦,调一不动另一。result.output跟正常 exit 同形:用getCachedEncodingForBuffer(finalBuffer)+replayTerminalOutput(...)重新解码finalBuffer。直接serializeTerminalToText(headlessTerminal)在早期 ASCII 但后段 GBK / Shift-JIS 时会 mojibake。replay 抛错时直接 serialize 作 last-ditch fallback。故意不调render(true)—renderFn通过onOutputEventemit,会违反 "no emit post-promote" 不变量。output是 abort 那一刻已捕获的 output(child_process 走 text decoder flush viacleanup();PTY drain 后 encoding-aware re-decode + replay)。caller (PR-2) 用这个 seedBackgroundShellEntry的 on-disk output file,再 attach 自己的 data listener 接后续输出。streamStdout: true+{ kind: 'background' }是潜在空 snapshot 组合(snapshot 读 string accumulator 但 streamStdout 模式跳过累积)。今 PR-1 的唯一 caller 用streamStdout: false不踩,但 future caller 配对会触发debugLogger.warn指向rawOutput兜底。?? ''防御:serializeTerminalToText在异常情况下(mid-render race / 测试 mock)可能返 undefined,fallback 保result.output类型仍是 string。测试覆盖:
shellExecutionService.test.ts69 / 69 pass(50 baseline + 19 个新加,跨 5 轮 review)。第 1 轮 — discrimination(4 测试):
{ kind: 'cancel' }仍走process.kill(-pid, 'SIGTERM')tree-kill;{ kind: 'background' }跳 kill —mockProcessKill不被调,mockPtyProcess.kill不被调,result.promoted === true。result.outputsnapshot 包含 abort 前 emit 的'line1\nline2\n'。第 1 轮 — Handoff-boundary(2 测试):
mockPtyProcess.onData.mock.results[0].value.dispose和.onExit.mock.results[0].value.dispose被调用 — 验证 node-pty 返回的 disposable handle 被正确 dispose;result shape 仍promoted: true。cp.stdout/cp.stderr再 emit'data'不增加onOutputEvent.mock.calls.length;emit'exit', 42, null不改result.exitCode(仍null)。第 4 轮 — Post-exit race guards(2 测试):
child.exitCode已设 0 但'exit'还没 reach handler 的 race 窗口,abort with{ kind: 'background' }— production race guard 检测到 terminal 状态 → fall through,result 没promoted: true,正常exitCode: 0。process.kill(pid, 0)在 liveness probe 时抛 ESRCH,abort with{ kind: 'background' }— production guard 拒绝 promote 已死 PTY;result 没promoted: true。第 3 轮 —
removeListenerregression guard(1 测试):mockPtyProcess.removeListener+mockPtyProcess.off,断言 production teardown 调removeListener('error', …)而不调off('error', …)。@lydell/node-pty的 IPty 接口只暴露removeListener(无offalias)— EventEmitter mock 容忍两者,没这测试 future 改回.off()会在 mock 下静默 regress 但 real PTY 抛 TypeError。第 4 轮 —
getShellAbortReasonKind边界(8 测试,helper 已 export 给直接测试):Object.create({ kind: 'background' }))→ 'cancel'(污染防御)Object.defineProperty('kind', { get() { throw } }))→ 'cancel'gettrap → 'cancel'{ kind: 'background' }/{ kind: 'cancel' }happy path第 4 轮 — Test fixture 一致性(2 mock setup 调整):
beforeEach(child_process fallback+execution method selection)都设mockChildProcess.exitCode/signalCode = null— 镜像真 Node ChildProcess 形状(alive 时 null 不 undefined)。否则 race guardchild.exitCode !== null会因undefined误触发。mockPtyProcess.onData/.onExit返{ dispose: vi.fn() }让 production dispose path 不会 crash onundefined.dispose()— stub 的.mock.results[0].value然后被 handoff 测试 inspect。测试计划:
vitest run packages/core/src/services/shellExecutionService.test.ts:69 / 69 pass(50 baseline + 19 个新加,跨 5 轮 review)tsc --build packages/cli(CI 等价 full mode):cleannpm run build --workspace=@qwen-code/qwen-code-core:cleanRelated:#3831(Phase D b 设计 + 3-PR sequencing)/ #3634(Background task 管理 roadmap)/ #3809(Phase D a,已合)