fix(gateway): count active CLI runs in restart deferral — prevents mid-turn termination #2460
Merged
alexey-pelykh merged 1 commit into main on Apr 22, 2026
…d-turn termination (#2345)

`server-reload-handlers.ts::getActiveCounts()` and `server.impl.ts::setPreRestartDeferralCheck()` both computed gateway capacity / restart-deferral as `queueSize + pendingReplies`, silently omitting active CLI agent subprocess runs. Result: a config reload or SIGUSR1-triggered restart would fire while CLI agents were mid-turn, killing live subprocess runs.

Fix: include `getActiveSessionRunCount()` from `src/agents/session-run-registry.ts` in both sums. The registry is already populated by `ChannelBridge` (register at `:160`, unregister at `:349`) and was designed as the replacement for the `getActiveEmbeddedRunCount()` stub that was gutted with the Pi-embedded execution engine.

Changes:
- `src/gateway/server-reload-handlers.ts`: import `getActiveSessionRunCount`; add `activeCliRuns` field to `getActiveCounts()` return; fold into `totalActive` sum; extend `formatActiveDetails()` with an `activeCliRuns > 0` branch so the deferral log reports "N active CLI run(s)" alongside operations and replies. Field name `activeCliRuns` (not `activeRuns`) to disambiguate from the per-channel `activeRuns` concept used in `channels/run-state-machine.ts`, `gateway/channel-health-policy.ts`, and related modules.
- `src/gateway/server.impl.ts`: import `getActiveSessionRunCount`; add `+ getActiveSessionRunCount()` to the `setPreRestartDeferralCheck` callback arrow.

Note on stale issue body: #2345 prescribes editing a hardcoded `embeddedRuns = 0` in `server-reload-handlers.ts:150-158` and replacing a `getActiveEmbeddedRunCount()` import + call at `server.impl.ts:302`. Neither exists in the current tree; both were removed during the pi-embedded-runner gut (`f749ed3fb6`, #2146/#2273) and the subsequent cherry-pick cleanup (`028566c42b`, #2442). The semantic bug the issue names (capacity sums miscount active CLI runs as zero) is still present as an *omission* rather than a hardcoded literal, and this PR fixes it.

Verification:
- `pnpm check` (format + tsgo + lint + project-specific lints) → exit 0
- `pnpm vitest run --config vitest.unit.config.ts src/infra/restart src/infra/infra-runtime src/agents/session-run-registry src/gateway/server.impl` → 5 files, 71 tests, all passed
- Rescan: `git grep "getTotalQueueSize() + getTotalPendingReplies"` returns only the one site I updated; no other sum call sites in the codebase need the same fix
- Adversarial validation (fresh-context subclaude): CLEAN verdict on 8 AC + 11 adversarial checks. Confirms `getActiveSessionRunCount` is LIVE (not a stub), registry is actively populated by `ChannelBridge`, no import cycle possible (session-run-registry has zero imports), `formatActiveDetails` correctly handles multi-counter output

Closes #2345
Refs: #2089
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
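The capacity sum described in this commit can be illustrated with a minimal sketch. It is not the actual gateway code: the `CounterSources` parameter is a hypothetical injection point added so the example is self-contained, and the "operation(s)" and "reply(ies)" wording is assumed. Only the `activeCliRuns` field, the `totalActive` sum, and the "N active CLI run(s)" text come from the commit message.

```typescript
// Shape of the counts; `activeCliRuns` is the field this change adds.
type ActiveCounts = {
  queueSize: number;
  pendingReplies: number;
  activeCliRuns: number;
  totalActive: number;
};

// Hypothetical injection point standing in for the real module-level helpers.
type CounterSources = {
  getTotalQueueSize: () => number;
  getTotalPendingReplies: () => number;
  getActiveSessionRunCount: () => number;
};

function getActiveCounts(sources: CounterSources): ActiveCounts {
  const queueSize = sources.getTotalQueueSize();
  const pendingReplies = sources.getTotalPendingReplies();
  const activeCliRuns = sources.getActiveSessionRunCount();
  return {
    queueSize,
    pendingReplies,
    activeCliRuns,
    // Before the fix the sum stopped at queueSize + pendingReplies.
    totalActive: queueSize + pendingReplies + activeCliRuns,
  };
}

// Builds the comma-separated detail string used in the deferral log.
// The first two labels are assumed wording, not quoted from the codebase.
function formatActiveDetails(counts: ActiveCounts): string {
  const details: string[] = [];
  if (counts.queueSize > 0) details.push(`${counts.queueSize} operation(s)`);
  if (counts.pendingReplies > 0) details.push(`${counts.pendingReplies} reply(ies)`);
  if (counts.activeCliRuns > 0) details.push(`${counts.activeCliRuns} active CLI run(s)`);
  return details.join(", ");
}
```

With an empty queue, no pending replies, and three registered runs, `totalActive` is 3 in this sketch, so a deferral check built on it would no longer report the gateway as idle.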
alexey-pelykh added a commit that referenced this pull request on Apr 22, 2026
…ctually terminate CLI subprocesses (#2343)

Replace three dead `const aborted = false` / `const ended = true` placeholders left behind by the Pi-embedded gut (`b27cecc795`, #76/#77) with real calls to `killSessionRun` / `waitForSessionRunEnd` from `src/agents/session-run-registry.ts`:
- `abort.ts::stopSubagentsForRequester` inner loop: `killSessionRun(childKey)` now flows into the `killed` aggregate, so the loop actually counts subagent runs whose subprocesses were signaled.
- `abort.ts::tryFastAbortFromMessage`: `killSessionRun(resolvedTargetKey)` is returned in the `aborted` field (previously always `false`). User-facing `/abort` now terminates the running CLI subprocess for non-ACP sessions; the ACP cancellation path remains separate.
- `sessions.ts::ensureSessionRuntimeCleanup`: replaces the `const ended = true` placeholder with `killSessionRun(canonicalKey)` + `waitForSessionRunEnd` (15s timeout). Returns `UNAVAILABLE` on timeout, restoring the pre-gut contract that `sessions.reset` / `sessions.delete` surface the error when a subprocess refuses to end.

Also removes the `if (!params.sessionId) return undefined;` early-exit in `ensureSessionRuntimeCleanup`: `sessionId` is used only to extend the queue-key set; termination operates on `canonicalKey`, which is always present. Callers with no `sessionId` now correctly attempt termination (`waitForSessionRunEnd` short-circuits to `true` when no run is registered, so the wait has no cost when there's nothing to wait for).

Polish: extracted `15_000` to `SESSION_RUN_TERMINATION_TIMEOUT_MS`, grouped with the sibling `ACP_RUNTIME_CLEANUP_TIMEOUT_MS` constant.

Issue body lists three regression sites; only two apply. The third (`commands-session.ts::applyAbortTarget`) does not exist in the current tree: `git grep applyAbortTarget` returns zero hits. The `/session abort` pathway was consolidated into `tryFastAbortFromMessage` at some earlier point, so no duplicate wiring is needed. AC3 marked N/A.

Verification:
- `pnpm check` → exit 0
- `pnpm test` (full parallel suite) → 800 files / 7010 passed / 3 skipped, no regression
- Rescan: `git grep "const aborted = false"` → zero hits across the whole tree (combined sweep of #2344 subagent kill/steer + #2343 abort/subagent-cascade/fastAbort)
- Rescan: `git grep "abortEmbeddedPiRun"` → only test-mock contexts; production has no remaining references (test-mock cleanup tracked under #2089 sweep)
- Adversarial validation (fresh-context subclaude): CLEAN on 7 AC + 9 adversarial checks. Confirms `waitForSessionRunEnd` is LIVE (50ms poll tick, max 15s wall-clock), `ensureSessionRuntimeCleanup` callers always pass a valid `canonicalKey`, and no test asserts `aborted === false` on the fast-abort path

Closes #2343
Refs: #2089, #2344, #2460, #2461
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
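The termination path this commit describes (signal the run, then poll until it leaves the registry or a timeout elapses) could look roughly like the sketch below. The `Map`-based registry, the `RunHandle` type, the `registerSessionRun` / `unregisterSessionRun` helpers, every function signature, and the error shape are assumptions made for illustration. Only the names `killSessionRun`, `waitForSessionRunEnd`, `ensureSessionRuntimeCleanup`, and `SESSION_RUN_TERMINATION_TIMEOUT_MS`, the 15s timeout, the 50ms poll tick, and the `UNAVAILABLE` result are taken from the commit message.

```typescript
// Hypothetical handle for a running CLI subprocess.
type RunHandle = { kill: () => void };

// Hypothetical in-memory registry keyed by canonical session key.
const sessionRuns = new Map<string, RunHandle>();

const SESSION_RUN_TERMINATION_TIMEOUT_MS = 15_000; // value named in the commit
const POLL_TICK_MS = 50; // poll tick named in the commit

function registerSessionRun(key: string, handle: RunHandle): void {
  sessionRuns.set(key, handle);
}

function unregisterSessionRun(key: string): void {
  sessionRuns.delete(key);
}

// Signals the run registered under `key`; true only if a run was signaled.
function killSessionRun(key: string): boolean {
  const run = sessionRuns.get(key);
  if (!run) return false;
  run.kill();
  return true;
}

// Resolves true once no run is registered under `key`, false on timeout.
// With nothing registered the loop body never runs, so there is no waiting.
async function waitForSessionRunEnd(key: string, timeoutMs: number): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (sessionRuns.has(key)) {
    if (Date.now() >= deadline) return false;
    await new Promise<void>((resolve) => setTimeout(resolve, POLL_TICK_MS));
  }
  return true;
}

// Assumed error shape; the commit only says that UNAVAILABLE is returned.
type CleanupError = { code: "UNAVAILABLE"; message: string };

async function ensureSessionRuntimeCleanup(
  canonicalKey: string,
  timeoutMs: number = SESSION_RUN_TERMINATION_TIMEOUT_MS,
): Promise<CleanupError | undefined> {
  killSessionRun(canonicalKey);
  const ended = await waitForSessionRunEnd(canonicalKey, timeoutMs);
  if (ended) return undefined;
  return {
    code: "UNAVAILABLE",
    message: `session run ${canonicalKey} did not end within ${timeoutMs}ms`,
  };
}
```

In this sketch the boolean from `killSessionRun` is what a caller such as the fast-abort path would report as `aborted`, which is why a hardcoded `false` made `/abort` look like a no-op.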
alexey-pelykh added a commit that referenced this pull request on Apr 22, 2026
…ctually terminate CLI subprocesses (#2343) (#2462)
Summary

`server-reload-handlers.ts::getActiveCounts()` and `server.impl.ts::setPreRestartDeferralCheck()` both computed gateway capacity as `queueSize + pendingReplies`, silently omitting active CLI agent subprocess runs. Result: a config reload or SIGUSR1-triggered restart would fire while CLI agents were mid-turn, killing live subprocess runs.

This PR adds `getActiveSessionRunCount()` (from the already-live `src/agents/session-run-registry.ts`, populated by `ChannelBridge`) to both capacity sums.

⚠ Note on stale issue body

Issue #2345's prescriptive text quotes code (`embeddedRuns = 0` hardcode, `getActiveEmbeddedRunCount()` import + call, `pi-embedded-runner/runs.ts` stub file) that was already removed by prior PRs:
- `f749ed3fb6` / #2146 / #2273: gut: remove vestigial embedded Pi orchestrator stubs and call sites
- `028566c42b` / #2442: cherry-pick B1-B10 audit real gaps, #2406 batch close (last touch of both gateway files)

The semantic bug (capacity sums miscount active CLI runs as zero) is still present, now as an omission from the sum rather than a hardcoded literal, and this PR fixes it. The issue's prescriptive AC items (delete stub file, rename `embeddedRuns` → `activeRuns`, update test mocks of `getActiveEmbeddedRunCount`) are all N/A because those code shapes no longer exist.

Changes
- `src/gateway/server-reload-handlers.ts`: Import `getActiveSessionRunCount`. Add `activeCliRuns` field to `getActiveCounts()` return; fold into `totalActive` sum. Extend `formatActiveDetails()` with an `activeCliRuns > 0` branch so the deferral log reports `"N active CLI run(s)"` alongside operations and replies.
- `src/gateway/server.impl.ts`: Import `getActiveSessionRunCount`. Add `+ getActiveSessionRunCount()` to the `setPreRestartDeferralCheck` callback.

Field name `activeCliRuns` (not `activeRuns`) to disambiguate from the per-channel `activeRuns` concept used in `channels/run-state-machine.ts`, `gateway/channel-health-policy.ts`, `gateway/protocol/schema/channels.ts`, and `discord/monitor/status.ts`: different semantics, gateway-wide vs. per-channel.

Behavior change
Before: A config reload arriving while CLI agents are mid-turn proceeds immediately. `deferGatewayRestartUntilIdle({getPendingCount: () => getActiveCounts().totalActive})` polls a sum that doesn't include CLI runs, so `totalActive === 0` even when runs are in flight. Restart fires, CLI subprocess receives SIGTERM, run dies.

After: `totalActive` includes `getActiveSessionRunCount()`. Reload defers via `deferGatewayRestartUntilIdle(...)` until the registry drains OR the configured `gateway.reload.deferralTimeoutMs` fires (existing safety net). Log surface: `"config change requires gateway restart — deferring until 3 active CLI run(s) complete"` instead of silent mid-turn termination.

Verification
- `pnpm check` (format + tsgo + lint + `lint:tmp:no-random-messaging` + `lint:no-remoteclaw-ai`) → exit 0
- `pnpm vitest run --config vitest.unit.config.ts src/infra/restart src/infra/infra-runtime src/agents/session-run-registry src/gateway/server.impl` → 5 files, 71 tests, all passed
- `git grep "getTotalQueueSize() + getTotalPendingReplies"`: only the one line I updated; no other sum sites in the codebase
- `getActiveSessionRunCount` is LIVE (`ACTIVE_SESSION_RUNS.size`, not a stub)
- `ChannelBridge` at `channel-bridge.ts:160` (register in a `try`) + `:349` (unregister in `finally`)
- `session-run-registry.ts` has zero imports, so no cycle is possible
- `formatActiveDetails()` correctly produces a comma-separated list with all three counters when each is `> 0`
- … `getActiveCounts()` inside `server-reload-handlers.ts` use the corrected `totalActive` struct

Test plan
- `pnpm check` exit 0
- … `getActiveEmbeddedRunCount` to update (verified zero hits pre-commit)
- `build`, `test`, `lint`, `docs`, `rebrand-gate`, `zombie-import-gate`, `stub-debt-gate`, `throwing-stub-callers-gate`, `obsolescence-audit-gate`, `attestation-gate`

Context
- `src/agents/session-run-registry.ts` (live; already used by `ChannelBridge`)

Closes #2345
Refs: #2089
🤖 Generated with Claude Code
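
The deferral described under "Behavior change" (poll a pending count until it reaches zero or the configured timeout fires) can be sketched as follows. This is an illustrative stand-in, not the implementation of `deferGatewayRestartUntilIdle`: the function name `deferUntilIdleSketch`, the `pollIntervalMs` option with its default, and the `"idle"` / `"timeout"` outcome values are all hypothetical. Only the `getPendingCount` callback and the idea of a deferral timeout come from the description above.

```typescript
type DeferralOptions = {
  // Callback named in the PR description; expected to return totalActive.
  getPendingCount: () => number;
  // Stands in for the configured gateway.reload.deferralTimeoutMs value.
  deferralTimeoutMs: number;
  // Hypothetical knob; the real polling cadence is not stated in the PR.
  pollIntervalMs?: number;
};

// Hypothetical outcome type: either work drained or the safety net fired.
type DeferralOutcome = "idle" | "timeout";

function sleep(ms: number): Promise<void> {
  return new Promise<void>((resolve) => setTimeout(resolve, ms));
}

// Waits until getPendingCount() reports 0, or gives up at the deadline.
async function deferUntilIdleSketch(opts: DeferralOptions): Promise<DeferralOutcome> {
  const pollIntervalMs = opts.pollIntervalMs ?? 100;
  const deadline = Date.now() + opts.deferralTimeoutMs;
  while (opts.getPendingCount() > 0) {
    if (Date.now() >= deadline) return "timeout";
    await sleep(pollIntervalMs);
  }
  return "idle";
}
```

The loop is only as good as the count it polls. If `getPendingCount` omits active CLI runs, the first check sees zero and the restart proceeds immediately, which matches the "Before" behavior.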