bug(providers/codex): `attemptController.abort()` in retry finally crashes via codex-sdk's removeAllListeners (regression from #1371)

## Summary

The fix in #1371 (\`5283ea93 fix(providers/codex): fresh AbortController per retry attempt\`) added a per-attempt \`attemptController.abort()\` inside the retry loop's \`finally\` block as cleanup. By the time that abort fires, \`@openai/codex-sdk@0.125.0\`'s own finally has already executed \`child.removeAllListeners()\` + \`child.kill()\` (the SDK's \`dist/index.js:282-289\`). The subsequent abort fires Node's internal \`spawn({signal})\` abort listener, which calls \`abortChildProcess\` → emits an \`'error'\` event on the now-listenerless child → Bun surfaces it as an **uncaught exception that bypasses any surrounding try/catch** (the throw originates inside child_process's internal abort handler, not as a JS sync throw from \`abort()\`).

100% reproducible. Crashes the workflow ~9-15s into the first \`codex\` node every time. Affects any DAG workflow with codex provider nodes; observed on \`archon-comprehensive-pr-review\` and \`archon-smart-pr-review\`.

## Environment

- Archon: latest \`dev\` branch (HEAD \`cfe7076e\`, includes \`5283ea93\`)
- \`@openai/codex-sdk\`: 0.125.0
- \`@openai/codex\` (binary): 0.120.0 + bundled 0.125.0 SDK binary
- Bun: 1.3.10 (macOS arm64)
- Reproduces both nested-inside-Claude-Code and standalone (the [\`CLAUDECODE=1\` warning](https://github.com/coleam00/Archon/issues/1067) at startup is a DIFFERENT issue — that one is silent hangs of the Claude provider; this one crashes the codex provider with a stack trace)

## Stack trace

\`\`\`
AbortError: The operation was aborted.
    at emitError (node:events:51:13)
    at abortChildProcess (node:child_process:937:15)
    at onAbortListener2 (node:child_process:35:24)
    at sendQuery (.../codex/provider.ts:848:27)
\`\`\`

Line 848 is the \`attemptController.abort()\` call added in \#1371.

## The race

1. \`thread.runStreamed(prompt, turnOptions)\` runs. The codex SDK spawns the codex CLI with \`spawn(this.executablePath, args, { env, signal: args.signal })\`. Node auto-registers an internal abort listener on the child for that signal.
2. The codex CLI runs and exits (success OR failure).
3. **Codex SDK's finally** (\`@openai/codex-sdk@0.125.0/dist/index.js:282-289\`):
   \`\`\`js
   } finally {
     rl.close();
     child.removeAllListeners();   // ← wipes child.once('error', ...) listener too
     try { if (!child.killed) child.kill(); } catch {}
   }
   \`\`\`
4. Returns control to Archon.
5. **Archon's outer finally** (\`packages/providers/src/codex/provider.ts:848\`):
   \`\`\`ts
   } finally {
     if (requestOptions?.abortSignal) {
       requestOptions.abortSignal.removeEventListener('abort', onCallerAbort);
     }
     attemptController.abort();   // ← THIS LINE
   }
   \`\`\`
6. \`attemptController.abort()\` fires Node's spawn-signal listener → \`abortChildProcess\` → emits \`'error'\` on the now-listenerless child → **uncaught**, crashes the process.

## Why try/catch around the abort doesn't work

I confirmed this empirically. Wrapping in \`try { attemptController.abort() } catch {}\` does NOT catch the AbortError — Bun surfaces it as a process-level uncaught exception via the child_process internals, not as a sync JS throw.

## Suggested fix

Just don't call \`attemptController.abort()\` in the finally. The per-attempt \`AbortController\` is short-lived and goes out of scope at iteration end — no explicit cleanup needed. The cancel propagation (\#1266's original concern) is unaffected: the caller's \`abortSignal\` still chains in via \`addEventListener('abort', onCallerAbort, { once: true })\`, which we DO remove with \`removeEventListener\` in finally.

\`\`\`diff
       } finally {
         if (requestOptions?.abortSignal) {
           requestOptions.abortSignal.removeEventListener('abort', onCallerAbort);
         }
-        // Signal to any downstream consumers that this attempt is done.
-        // Next iteration creates a fresh controller; caller's signal state
-        // is unchanged.
-        attemptController.abort();
       }
\`\`\`

Verified locally — patched at \`/Users/taylorsmith/Coding/MelowLLC/Archon/packages/providers/src/codex/provider.ts\` and the comprehensive-pr-review workflow runs end-to-end (review posted to GitHub PR + auto-fix commits pushed). Previously crashed reliably ~9s in on the first \`scope\` node.

## Repro

Any workflow with a \`provider: codex\` node:

\`\`\`bash
archon workflow run archon-comprehensive-pr-review --no-worktree "Review PR #<any>"
\`\`\`

100% crash rate within ~15s on the first codex node, regardless of model or prompt.

## Adjacent: existing tests

The PR #1371 added two regression tests for the original signal-reuse bug ("retry after crash receives a fresh signal" and "caller abort forwards into the active per-attempt signal"). Neither covers the "abort after attempt completes cleanly" case, which is the path that hits the bug. A test that mocks the SDK to invoke its finally THEN expects the route handler to complete without uncaughtException would have caught this.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bug(providers/codex): `attemptController.abort()` in retry finally crashes via codex-sdk's removeAllListeners (regression from #1371) #1735

Summary

Environment

Stack trace

The race

Why try/catch around the abort doesn't work

Suggested fix

Repro

Adjacent: existing tests

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

bug(providers/codex): attemptController.abort() in retry finally crashes via codex-sdk's removeAllListeners (regression from #1371) #1735

Description

Summary

Environment

Stack trace

The race

Why try/catch around the abort doesn't work

Suggested fix

Repro

Adjacent: existing tests

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

bug(providers/codex): `attemptController.abort()` in retry finally crashes via codex-sdk's removeAllListeners (regression from #1371) #1735