Summary
The fix in #1371 (`5283ea93 fix(providers/codex): fresh AbortController per retry attempt`) added a per-attempt `attemptController.abort()` inside the retry loop's `finally` block as cleanup. By the time that abort fires, `@openai/codex-sdk@0.125.0`'s own finally has already executed `child.removeAllListeners()` + `child.kill()` (the SDK's `dist/index.js:282-289`). The subsequent abort fires Node's internal `spawn({signal})` abort listener, which calls `abortChildProcess` → emits an `'error'` event on the now-listenerless child → Bun surfaces it as an uncaught exception that bypasses any surrounding try/catch (the throw originates inside child_process's internal abort handler, not as a JS sync throw from `abort()`).
100% reproducible. Crashes the workflow ~9-15s into the first `codex` node every time. Affects any DAG workflow with codex provider nodes; observed on `archon-comprehensive-pr-review` and `archon-smart-pr-review`.
Environment
- Archon: latest `dev` branch (HEAD `cfe7076e`, includes `5283ea93`)
- `@openai/codex-sdk`: 0.125.0
- `@openai/codex` (binary): 0.120.0 + bundled 0.125.0 SDK binary
- Bun: 1.3.10 (macOS arm64)
- Reproduces both nested-inside-Claude-Code and standalone (the `CLAUDECODE=1` warning at startup is a DIFFERENT issue — that one is silent hangs of the Claude provider; this one crashes the codex provider with a stack trace)
Stack trace
```
AbortError: The operation was aborted.
at emitError (node:events:51:13)
at abortChildProcess (node:child_process:937:15)
at onAbortListener2 (node:child_process:35:24)
at sendQuery (.../codex/provider.ts:848:27)
```
Line 848 is the `attemptController.abort()` call added in #1371.
The race
- `thread.runStreamed(prompt, turnOptions)` runs. The codex SDK spawns the codex CLI with `spawn(this.executablePath, args, { env, signal: args.signal })`. Node auto-registers an internal abort listener on the child for that signal.
- The codex CLI runs and exits (success OR failure).
- Codex SDK's finally (`@openai/codex-sdk@0.125.0/dist/index.js:282-289`):
```js
} finally {
rl.close();
child.removeAllListeners(); // ← wipes child.once('error', ...) listener too
try { if (!child.killed) child.kill(); } catch {}
}
```
- Returns control to Archon.
- Archon's outer finally (`packages/providers/src/codex/provider.ts:848`):
```ts
} finally {
if (requestOptions?.abortSignal) {
requestOptions.abortSignal.removeEventListener('abort', onCallerAbort);
}
attemptController.abort(); // ← THIS LINE
}
```
- `attemptController.abort()` fires Node's spawn-signal listener → `abortChildProcess` → emits `'error'` on the now-listenerless child → uncaught, crashes the process.
Why try/catch around the abort doesn't work
I confirmed this empirically. Wrapping in `try { attemptController.abort() } catch {}` does NOT catch the AbortError — Bun surfaces it as a process-level uncaught exception via the child_process internals, not as a sync JS throw.
Suggested fix
Just don't call `attemptController.abort()` in the finally. The per-attempt `AbortController` is short-lived and goes out of scope at iteration end — no explicit cleanup needed. The cancel propagation (#1266's original concern) is unaffected: the caller's `abortSignal` still chains in via `addEventListener('abort', onCallerAbort, { once: true })`, which we DO remove with `removeEventListener` in finally.
```diff
} finally {
if (requestOptions?.abortSignal) {
requestOptions.abortSignal.removeEventListener('abort', onCallerAbort);
}
-
// Signal to any downstream consumers that this attempt is done.
-
// Next iteration creates a fresh controller; caller's signal state
-
-
attemptController.abort();
}
```
Verified locally — patched at `/Users/taylorsmith/Coding/MelowLLC/Archon/packages/providers/src/codex/provider.ts` and the comprehensive-pr-review workflow runs end-to-end (review posted to GitHub PR + auto-fix commits pushed). Previously crashed reliably ~9s in on the first `scope` node.
Repro
Any workflow with a `provider: codex` node:
```bash
archon workflow run archon-comprehensive-pr-review --no-worktree "Review PR #"
```
100% crash rate within ~15s on the first codex node, regardless of model or prompt.
Adjacent: existing tests
The PR #1371 added two regression tests for the original signal-reuse bug ("retry after crash receives a fresh signal" and "caller abort forwards into the active per-attempt signal"). Neither covers the "abort after attempt completes cleanly" case, which is the path that hits the bug. A test that mocks the SDK to invoke its finally THEN expects the route handler to complete without uncaughtException would have caught this.
Summary
The fix in #1371 (`5283ea93 fix(providers/codex): fresh AbortController per retry attempt`) added a per-attempt `attemptController.abort()` inside the retry loop's `finally` block as cleanup. By the time that abort fires, `@openai/codex-sdk@0.125.0`'s own finally has already executed `child.removeAllListeners()` + `child.kill()` (the SDK's `dist/index.js:282-289`). The subsequent abort fires Node's internal `spawn({signal})` abort listener, which calls `abortChildProcess` → emits an `'error'` event on the now-listenerless child → Bun surfaces it as an uncaught exception that bypasses any surrounding try/catch (the throw originates inside child_process's internal abort handler, not as a JS sync throw from `abort()`).
100% reproducible. Crashes the workflow ~9-15s into the first `codex` node every time. Affects any DAG workflow with codex provider nodes; observed on `archon-comprehensive-pr-review` and `archon-smart-pr-review`.
Environment
Stack trace
```
AbortError: The operation was aborted.
at emitError (node:events:51:13)
at abortChildProcess (node:child_process:937:15)
at onAbortListener2 (node:child_process:35:24)
at sendQuery (.../codex/provider.ts:848:27)
```
Line 848 is the `attemptController.abort()` call added in #1371.
The race
```js
} finally {
rl.close();
child.removeAllListeners(); // ← wipes child.once('error', ...) listener too
try { if (!child.killed) child.kill(); } catch {}
}
```
```ts
} finally {
if (requestOptions?.abortSignal) {
requestOptions.abortSignal.removeEventListener('abort', onCallerAbort);
}
attemptController.abort(); // ← THIS LINE
}
```
Why try/catch around the abort doesn't work
I confirmed this empirically. Wrapping in `try { attemptController.abort() } catch {}` does NOT catch the AbortError — Bun surfaces it as a process-level uncaught exception via the child_process internals, not as a sync JS throw.
Suggested fix
Just don't call `attemptController.abort()` in the finally. The per-attempt `AbortController` is short-lived and goes out of scope at iteration end — no explicit cleanup needed. The cancel propagation (#1266's original concern) is unaffected: the caller's `abortSignal` still chains in via `addEventListener('abort', onCallerAbort, { once: true })`, which we DO remove with `removeEventListener` in finally.
```diff
} finally {
if (requestOptions?.abortSignal) {
requestOptions.abortSignal.removeEventListener('abort', onCallerAbort);
}
```
Verified locally — patched at `/Users/taylorsmith/Coding/MelowLLC/Archon/packages/providers/src/codex/provider.ts` and the comprehensive-pr-review workflow runs end-to-end (review posted to GitHub PR + auto-fix commits pushed). Previously crashed reliably ~9s in on the first `scope` node.
Repro
Any workflow with a `provider: codex` node:
```bash
archon workflow run archon-comprehensive-pr-review --no-worktree "Review PR #"
```
100% crash rate within ~15s on the first codex node, regardless of model or prompt.
Adjacent: existing tests
The PR #1371 added two regression tests for the original signal-reuse bug ("retry after crash receives a fresh signal" and "caller abort forwards into the active per-attempt signal"). Neither covers the "abort after attempt completes cleanly" case, which is the path that hits the bug. A test that mocks the SDK to invoke its finally THEN expects the route handler to complete without uncaughtException would have caught this.