Skip to content

bug(providers/codex): attemptController.abort() in retry finally crashes via codex-sdk's removeAllListeners (regression from #1371) #1735

@melowllc

Description

@melowllc

Summary

The fix in #1371 (`5283ea93 fix(providers/codex): fresh AbortController per retry attempt`) added a per-attempt `attemptController.abort()` inside the retry loop's `finally` block as cleanup. By the time that abort fires, `@openai/codex-sdk@0.125.0`'s own finally has already executed `child.removeAllListeners()` + `child.kill()` (the SDK's `dist/index.js:282-289`). The subsequent abort fires Node's internal `spawn({signal})` abort listener, which calls `abortChildProcess` → emits an `'error'` event on the now-listenerless child → Bun surfaces it as an uncaught exception that bypasses any surrounding try/catch (the throw originates inside child_process's internal abort handler, not as a JS sync throw from `abort()`).

100% reproducible. Crashes the workflow ~9-15s into the first `codex` node every time. Affects any DAG workflow with codex provider nodes; observed on `archon-comprehensive-pr-review` and `archon-smart-pr-review`.

Environment

  • Archon: latest `dev` branch (HEAD `cfe7076e`, includes `5283ea93`)
  • `@openai/codex-sdk`: 0.125.0
  • `@openai/codex` (binary): 0.120.0 + bundled 0.125.0 SDK binary
  • Bun: 1.3.10 (macOS arm64)
  • Reproduces both nested-inside-Claude-Code and standalone (the `CLAUDECODE=1` warning at startup is a DIFFERENT issue — that one is silent hangs of the Claude provider; this one crashes the codex provider with a stack trace)

Stack trace

```
AbortError: The operation was aborted.
at emitError (node:events:51:13)
at abortChildProcess (node:child_process:937:15)
at onAbortListener2 (node:child_process:35:24)
at sendQuery (.../codex/provider.ts:848:27)
```

Line 848 is the `attemptController.abort()` call added in #1371.

The race

  1. `thread.runStreamed(prompt, turnOptions)` runs. The codex SDK spawns the codex CLI with `spawn(this.executablePath, args, { env, signal: args.signal })`. Node auto-registers an internal abort listener on the child for that signal.
  2. The codex CLI runs and exits (success OR failure).
  3. Codex SDK's finally (`@openai/codex-sdk@0.125.0/dist/index.js:282-289`):
    ```js
    } finally {
    rl.close();
    child.removeAllListeners(); // ← wipes child.once('error', ...) listener too
    try { if (!child.killed) child.kill(); } catch {}
    }
    ```
  4. Returns control to Archon.
  5. Archon's outer finally (`packages/providers/src/codex/provider.ts:848`):
    ```ts
    } finally {
    if (requestOptions?.abortSignal) {
    requestOptions.abortSignal.removeEventListener('abort', onCallerAbort);
    }
    attemptController.abort(); // ← THIS LINE
    }
    ```
  6. `attemptController.abort()` fires Node's spawn-signal listener → `abortChildProcess` → emits `'error'` on the now-listenerless child → uncaught, crashes the process.

Why try/catch around the abort doesn't work

I confirmed this empirically. Wrapping in `try { attemptController.abort() } catch {}` does NOT catch the AbortError — Bun surfaces it as a process-level uncaught exception via the child_process internals, not as a sync JS throw.

Suggested fix

Just don't call `attemptController.abort()` in the finally. The per-attempt `AbortController` is short-lived and goes out of scope at iteration end — no explicit cleanup needed. The cancel propagation (#1266's original concern) is unaffected: the caller's `abortSignal` still chains in via `addEventListener('abort', onCallerAbort, { once: true })`, which we DO remove with `removeEventListener` in finally.

```diff
} finally {
if (requestOptions?.abortSignal) {
requestOptions.abortSignal.removeEventListener('abort', onCallerAbort);
}

  •    // Signal to any downstream consumers that this attempt is done.
    
  •    // Next iteration creates a fresh controller; caller's signal state
    
  •    // is unchanged.
    
  •    attemptController.abort();
     }
    

```

Verified locally — patched at `/Users/taylorsmith/Coding/MelowLLC/Archon/packages/providers/src/codex/provider.ts` and the comprehensive-pr-review workflow runs end-to-end (review posted to GitHub PR + auto-fix commits pushed). Previously crashed reliably ~9s in on the first `scope` node.

Repro

Any workflow with a `provider: codex` node:

```bash
archon workflow run archon-comprehensive-pr-review --no-worktree "Review PR #"
```

100% crash rate within ~15s on the first codex node, regardless of model or prompt.

Adjacent: existing tests

The PR #1371 added two regression tests for the original signal-reuse bug ("retry after crash receives a fresh signal" and "caller abort forwards into the active per-attempt signal"). Neither covers the "abort after attempt completes cleanly" case, which is the path that hits the bug. A test that mocks the SDK to invoke its finally THEN expects the route handler to complete without uncaughtException would have caught this.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions