Skip to content

fix(gateway): count active CLI runs in restart deferral — prevents mid-turn termination#2460

Merged
alexey-pelykh merged 1 commit intomainfrom
fix/2345-gateway-session-run-count
Apr 22, 2026
Merged

fix(gateway): count active CLI runs in restart deferral — prevents mid-turn termination#2460
alexey-pelykh merged 1 commit intomainfrom
fix/2345-gateway-session-run-count

Conversation

@alexey-pelykh
Copy link
Copy Markdown

Summary

server-reload-handlers.ts::getActiveCounts() and server.impl.ts::setPreRestartDeferralCheck() both computed gateway capacity as queueSize + pendingReplies, silently omitting active CLI agent subprocess runs. Result: a config reload or SIGUSR1-triggered restart would fire while CLI agents were mid-turn, killing live subprocess runs.

This PR adds getActiveSessionRunCount() (from the already-live src/agents/session-run-registry.ts, populated by ChannelBridge) to both capacity sums.

⚠ Note on stale issue body

Issue #2345's prescriptive text quotes code (embeddedRuns = 0 hardcode, getActiveEmbeddedRunCount() import + call, pi-embedded-runner/runs.ts stub file) that was already removed by prior PRs:

The semantic bug (capacity sums miscount active CLI runs as zero) is still present — now as an omission from the sum rather than a hardcoded literal — and this PR fixes it. The issue's prescriptive AC items (delete stub file, rename embeddedRunsactiveRuns, update test mocks of getActiveEmbeddedRunCount) are all N/A because those code shapes no longer exist.

Changes

File What
src/gateway/server-reload-handlers.ts Import getActiveSessionRunCount. Add activeCliRuns field to getActiveCounts() return; fold into totalActive sum. Extend formatActiveDetails() with activeCliRuns > 0 branch so deferral log reports "N active CLI run(s)" alongside operations and replies.
src/gateway/server.impl.ts Import getActiveSessionRunCount. Add + getActiveSessionRunCount() to the setPreRestartDeferralCheck callback.

Field name activeCliRuns (not activeRuns) to disambiguate from the per-channel activeRuns concept used in channels/run-state-machine.ts, gateway/channel-health-policy.ts, gateway/protocol/schema/channels.ts, and discord/monitor/status.ts — different semantics, gateway-wide vs. per-channel.

Behavior change

Before: A config reload arriving while CLI agents are mid-turn proceeds immediately. deferGatewayRestartUntilIdle({getPendingCount: () => getActiveCounts().totalActive}) polls a sum that doesn't include CLI runs, so totalActive === 0 even when runs are in flight. Restart fires, CLI subprocess receives SIGTERM, run dies.

After: totalActive includes getActiveSessionRunCount(). Reload defers via deferGatewayRestartUntilIdle(...) until the registry drains OR the configured gateway.reload.deferralTimeoutMs fires (existing safety net). Log surface: "config change requires gateway restart — deferring until 3 active CLI run(s) complete" instead of silent mid-turn termination.

Verification

  • pnpm check (format + tsgo + lint + lint:tmp:no-random-messaging + lint:no-remoteclaw-ai) → exit 0
  • pnpm vitest run --config vitest.unit.config.ts src/infra/restart src/infra/infra-runtime src/agents/session-run-registry src/gateway/server.impl5 files, 71 tests, all passed
  • Rescan: git grep "getTotalQueueSize() + getTotalPendingReplies" — only the one line I updated; no other sum sites in the codebase
  • Adversarial validation (fresh-context subclaude, 8 AC + 11 adversarial checks): CLEAN. Key confirmations:
    • getActiveSessionRunCount is LIVE (ACTIVE_SESSION_RUNS.size, not a stub)
    • Registry is populated by ChannelBridge at channel-bridge.ts:160 (register in a try) + :349 (unregister in finally)
    • session-run-registry.ts has zero imports — no cycle possible
    • formatActiveDetails() correctly produces comma-separated list with all three counters when each > 0
    • Both callers of getActiveCounts() inside server-reload-handlers.ts use the corrected totalActive struct

Test plan

  • pnpm check exit 0
  • Targeted unit tests green (restart, infra-runtime, session-run-registry, gateway/server.impl)
  • No test mocks of getActiveEmbeddedRunCount to update (verified zero hits pre-commit)
  • CI green on build, test, lint, docs, rebrand-gate, zombie-import-gate, stub-debt-gate, throwing-stub-callers-gate, obsolescence-audit-gate, attestation-gate

Context

Closes #2345

Refs: #2089

🤖 Generated with Claude Code

…d-turn termination (#2345)

`server-reload-handlers.ts::getActiveCounts()` and
`server.impl.ts::setPreRestartDeferralCheck()` both computed gateway
capacity / restart-deferral as `queueSize + pendingReplies`, silently
omitting active CLI agent subprocess runs. Result: a config reload or
SIGUSR1-triggered restart would fire while CLI agents were mid-turn,
killing live subprocess runs.

Fix: include `getActiveSessionRunCount()` from
`src/agents/session-run-registry.ts` in both sums. The registry is
already populated by `ChannelBridge` (register at `:160`, unregister
at `:349`) and was designed as the replacement for the
`getActiveEmbeddedRunCount()` stub that was gutted with the Pi-embedded
execution engine.

Changes:
- `src/gateway/server-reload-handlers.ts`: import
  `getActiveSessionRunCount`; add `activeCliRuns` field to
  `getActiveCounts()` return; fold into `totalActive` sum; extend
  `formatActiveDetails()` with an `activeCliRuns > 0` branch so the
  deferral log reports "N active CLI run(s)" alongside operations
  and replies. Field name `activeCliRuns` (not `activeRuns`) to
  disambiguate from the per-channel `activeRuns` concept used in
  `channels/run-state-machine.ts`, `gateway/channel-health-policy.ts`,
  and related modules.
- `src/gateway/server.impl.ts`: import `getActiveSessionRunCount`;
  add `+ getActiveSessionRunCount()` to the `setPreRestartDeferralCheck`
  callback arrow.

Note on stale issue body: #2345 prescribes editing a hardcoded
`embeddedRuns = 0` in `server-reload-handlers.ts:150-158` and
replacing a `getActiveEmbeddedRunCount()` import + call at
`server.impl.ts:302`. Neither exists in the current tree — both were
removed during the pi-embedded-runner gut (`f749ed3fb6`, #2146/#2273)
and the subsequent cherry-pick cleanup (`028566c42b`, #2442). The
semantic bug the issue names (capacity sums miscount active CLI runs
as zero) is still present as an *omission* rather than a hardcoded
literal, and this PR fixes it.

Verification:
- `pnpm check` (format + tsgo + lint + project-specific lints) → exit 0
- `pnpm vitest run --config vitest.unit.config.ts src/infra/restart
  src/infra/infra-runtime src/agents/session-run-registry
  src/gateway/server.impl` → 5 files, 71 tests, all passed
- Rescan: `git grep "getTotalQueueSize() + getTotalPendingReplies"`
  returns only the one site I updated; no other sum call sites in the
  codebase need the same fix
- Adversarial validation (fresh-context subclaude): CLEAN verdict on
  8 AC + 11 adversarial checks — confirms `getActiveSessionRunCount`
  is LIVE (not a stub), registry is actively populated by ChannelBridge,
  no import cycle possible (session-run-registry has zero imports),
  `formatActiveDetails` correctly handles multi-counter output

Closes #2345

Refs: #2089

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@alexey-pelykh alexey-pelykh merged commit c49bdaf into main Apr 22, 2026
15 checks passed
@alexey-pelykh alexey-pelykh deleted the fix/2345-gateway-session-run-count branch April 22, 2026 10:09
alexey-pelykh added a commit that referenced this pull request Apr 22, 2026
…ctually terminate CLI subprocesses (#2343)

Replace three dead `const aborted = false` / `const ended = true`
placeholders left behind by the Pi-embedded gut (`b27cecc795`, #76/#77)
with real calls to `killSessionRun` / `waitForSessionRunEnd` from
`src/agents/session-run-registry.ts`:

- `abort.ts::stopSubagentsForRequester` inner loop — `killSessionRun(childKey)`
  now flows into the `killed` aggregate, so the loop actually counts
  subagent runs whose subprocesses were signaled.
- `abort.ts::tryFastAbortFromMessage` — `killSessionRun(resolvedTargetKey)`
  returned in the `aborted` field (previously always `false`). User-facing
  `/abort` now terminates the running CLI subprocess for non-ACP sessions;
  ACP cancellation path remains separate.
- `sessions.ts::ensureSessionRuntimeCleanup` — replaces the `const ended = true`
  placeholder with `killSessionRun(canonicalKey)` + `waitForSessionRunEnd`
  (15s timeout). Returns `UNAVAILABLE` on timeout, restoring the pre-gut
  contract that `sessions.reset` / `sessions.delete` surface the error
  when a subprocess refuses to end.

Also removes the `if (!params.sessionId) return undefined;` early-exit
in `ensureSessionRuntimeCleanup`: `sessionId` is used only to extend the
queue-key set; termination operates on `canonicalKey`, which is always
present. Callers with no `sessionId` now correctly attempt termination
(`waitForSessionRunEnd` short-circuits to `true` when no run is
registered, so the wait has no cost when there's nothing to wait for).

Polish: extracted `15_000` to `SESSION_RUN_TERMINATION_TIMEOUT_MS`,
grouped with the sibling `ACP_RUNTIME_CLEANUP_TIMEOUT_MS` constant.

Issue body lists three regression sites; only two apply. The third
(`commands-session.ts::applyAbortTarget`) does not exist in the current
tree — `git grep applyAbortTarget` returns zero hits. The `/session
abort` pathway was consolidated into `tryFastAbortFromMessage` at some
earlier point, so no duplicate wiring is needed. AC3 marked N/A.

Verification:
- `pnpm check` → exit 0
- `pnpm test` (full parallel suite) → 800 files / 7010 passed /
  3 skipped — no regression
- Rescan: `git grep "const aborted = false"` → zero hits across the
  whole tree (combined sweep of #2344 subagent kill/steer + #2343
  abort/subagent-cascade/fastAbort)
- Rescan: `git grep "abortEmbeddedPiRun"` → only test-mock contexts;
  production has no remaining references (test-mock cleanup tracked
  under #2089 sweep)
- Adversarial validation (fresh-context subclaude): CLEAN on 7 AC +
  9 adversarial checks — confirms waitForSessionRunEnd is LIVE (50ms
  poll tick, max 15s wall-clock), `ensureSessionRuntimeCleanup` callers
  always pass a valid `canonicalKey`, and no test asserts `aborted ===
  false` on the fast-abort path

Closes #2343

Refs: #2089, #2344, #2460, #2461

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
alexey-pelykh added a commit that referenced this pull request Apr 22, 2026
…ctually terminate CLI subprocesses (#2343) (#2462)

Replace three dead `const aborted = false` / `const ended = true`
placeholders left behind by the Pi-embedded gut (`b27cecc795`, #76/#77)
with real calls to `killSessionRun` / `waitForSessionRunEnd` from
`src/agents/session-run-registry.ts`:

- `abort.ts::stopSubagentsForRequester` inner loop — `killSessionRun(childKey)`
  now flows into the `killed` aggregate, so the loop actually counts
  subagent runs whose subprocesses were signaled.
- `abort.ts::tryFastAbortFromMessage` — `killSessionRun(resolvedTargetKey)`
  returned in the `aborted` field (previously always `false`). User-facing
  `/abort` now terminates the running CLI subprocess for non-ACP sessions;
  ACP cancellation path remains separate.
- `sessions.ts::ensureSessionRuntimeCleanup` — replaces the `const ended = true`
  placeholder with `killSessionRun(canonicalKey)` + `waitForSessionRunEnd`
  (15s timeout). Returns `UNAVAILABLE` on timeout, restoring the pre-gut
  contract that `sessions.reset` / `sessions.delete` surface the error
  when a subprocess refuses to end.

Also removes the `if (!params.sessionId) return undefined;` early-exit
in `ensureSessionRuntimeCleanup`: `sessionId` is used only to extend the
queue-key set; termination operates on `canonicalKey`, which is always
present. Callers with no `sessionId` now correctly attempt termination
(`waitForSessionRunEnd` short-circuits to `true` when no run is
registered, so the wait has no cost when there's nothing to wait for).

Polish: extracted `15_000` to `SESSION_RUN_TERMINATION_TIMEOUT_MS`,
grouped with the sibling `ACP_RUNTIME_CLEANUP_TIMEOUT_MS` constant.

Issue body lists three regression sites; only two apply. The third
(`commands-session.ts::applyAbortTarget`) does not exist in the current
tree — `git grep applyAbortTarget` returns zero hits. The `/session
abort` pathway was consolidated into `tryFastAbortFromMessage` at some
earlier point, so no duplicate wiring is needed. AC3 marked N/A.

Verification:
- `pnpm check` → exit 0
- `pnpm test` (full parallel suite) → 800 files / 7010 passed /
  3 skipped — no regression
- Rescan: `git grep "const aborted = false"` → zero hits across the
  whole tree (combined sweep of #2344 subagent kill/steer + #2343
  abort/subagent-cascade/fastAbort)
- Rescan: `git grep "abortEmbeddedPiRun"` → only test-mock contexts;
  production has no remaining references (test-mock cleanup tracked
  under #2089 sweep)
- Adversarial validation (fresh-context subclaude): CLEAN on 7 AC +
  9 adversarial checks — confirms waitForSessionRunEnd is LIVE (50ms
  poll tick, max 15s wall-clock), `ensureSessionRuntimeCleanup` callers
  always pass a valid `canonicalKey`, and no test asserts `aborted ===
  false` on the fast-abort path

Closes #2343

Refs: #2089, #2344, #2460, #2461

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(gateway): wire reload handlers and server.impl capacity accounting to getActiveSessionRunCount

1 participant