Fix helper Codex app-server auth failures from tearing down the shared client#72812
Fix helper Codex app-server auth failures from tearing down the shared client#72812yetval wants to merge 5 commits into
Conversation
34da35e to
7dd99ca
Compare
Greptile SummaryThis PR isolates spawned/helper Codex app-server runs from the shared client singleton so that an auth failure during startup only fails the child run rather than clearing the shared client used by the parent session. Isolation is gated on Confidence Score: 5/5Safe to merge; no new P0/P1 issues introduced, the known isolated-client leak on startup timeout was already flagged in a previous review comment. All changed paths are covered by focused tests, the core logic (spawnedBy-gated isolation, inner catch closing the isolated client on error, final cleanup closing the client on success) is correct. The pre-existing P2 about a narrow timeout-race leak was already noted in a previous comment and is not worsened by this PR. No files require special attention beyond the pre-existing comment on run-attempt.ts regarding the isolated client leak when withCodexStartupTimeout fires after createIsolatedCodexAppServerClient has already resolved. Reviews (2): Last reviewed commit: "Isolate helper Codex app-server runs" | Re-trigger Greptile |
2c6d099 to
4f55036
Compare
|
Codex review: needs maintainer review before merge. Reviewed May 27, 2026, 1:01 PM ET / 17:01 UTC. Summary PR surface: Source +71, Tests +110. Total +181 across 4 files. Reproducibility: yes. at source level. The linked issue gives concrete invalid-token helper-spawn steps, and the current base source routes spawned runs through the shared startup client and clears the current shared client on startup failure; I did not run a live current-main repro in this read-only review. Review metrics: 1 noteworthy metric.
Merge readiness Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch. Rank-up moves:
Risk before merge
Maintainer options:
Next step before merge Security Review detailsBest possible solution: Land the spawned-helper isolation after current-head checks finish green, preserving shared-client behavior for non-spawned runs and the existing resolved startup-timeout contract. Do we have a high-confidence way to reproduce the issue? Yes, at source level. The linked issue gives concrete invalid-token helper-spawn steps, and the current base source routes spawned runs through the shared startup client and clears the current shared client on startup failure; I did not run a live current-main repro in this read-only review. Is this the best way to solve the issue? Yes. The AGENTS.md: found and applied where relevant. Codex review notes: model gpt-5.5, reasoning high; reviewed against 11dfef201f81. Label changesLabel changes:
Label justifications:
Evidence reviewedPR surface: Source +71, Tests +110. Total +181 across 4 files. View PR surface stats
What I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. How this review workflow works
|
d5ea518 to
fa5fdb1
Compare
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
7fef4b1 to
bf777c6
Compare
bf777c6 to
01b5067
Compare
|
@clawsweeper re-review Rebased onto current
Also dropped the No functional change to the isolation semantics: spawnedBy-gated startup, isolated-client lifecycle, and parent shared client untouched on auth failure all preserved. New head: |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
|
Pre-existing CI failure on the rebased branch: |
|
@clawsweeper re-review [P1] addressed in 7a970fa. Isolated spawned-helper startup now uses the floored Single-line diff in
Re-ran focused suite per ClawSweeper guidance:
Existing assertion New head: |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
|
Thanks for pushing this through and for the detailed proof. I'm going to close this rather than ask for another rebase, because the core direction is not the direction we want for Codex app-server lifecycle. OpenClaw should keep one Codex app-server for the session/runtime. A spawned/helper run should not start a second app-server just to contain its failure. The underlying bug in #72574 is real, but the fix point should be cleanup policy on the shared client: a child run failure should release its lease and fail that attempt without clearing or retiring the shared singleton unless the shared app-server process/transport is actually poisoned. In other words, spawnedBy may affect failure/cleanup handling, not app-server topology. So the desired follow-up shape is: keep the single shared leased Codex app-server, make startup/thread failure cleanup narrower for spawned/helper runs, and only clear/retire the shared client for real app-server process/transport failures. Closing this PR to avoid carrying forward the multiple-app-server architecture. |
|
Replacement PR with the maintainer-requested cleanup-policy fix shape (no multi-app-server topology): #87375. |
Fixes #72574
This change isolates spawned/helper Codex app-server runs from the shared client singleton so a startup auth failure only fails the child run instead of tearing down the live session.
What changed:
Verification:
Real behavior proof
Behavior or issue addressed: A spawned/helper Codex app-server run hitting
401 authentication_error: Invalid bearer tokenon startup currently tears down the shared Codex app-server client used by the active parent session, killing the live OpenClaw session even though only the child run actually failed (#72574). Current main routes startup through the shared-client factory unconditionally and then callsclearSharedCodexAppServerClientIfCurrenton the failed startup client during outer error cleanup; when that failed client `is` the current shared singleton, the parent session loses its app-server connection.Real environment tested: macOS 25.3.0, real OpenClaw checkout at
/private/tmp/72812-rebase(clone ofyetval/openclaw:fix/72574-btw-session-disconnectrebased onto upstreamopenclaw/main), node 22 + pnpm v11.1.0, freshpnpm install --frozen-lockfile, focused regressions run via the realscripts/run-vitest.mjsrunner, payload-capture run viapnpm exec tsxagainst the production gate (shouldUseIsolatedCodexAppServerClient) and a faithful model ofclearSharedCodexAppServerClientIfCurrent— no live Codex credentials exercised.Exact steps or command run after this patch:
Evidence after fix: copied live terminal output (full payload-capture + focused regression run) from the rebased branch. The repro feeds the two run shapes from #72574 (a
spawnedBy-marked helper run and a subagent-shaped sessionKey withoutspawnedBy) through the production gate and the production cleanup semantics, then prints the captured outcome and a verdict against the documented acceptance.Real-import follow-up (addresses Clawsweeper's "no production-imported clear" concern):
The Clawsweeper re-review noted the previous repro modeled
clearSharedCodexAppServerClientIfCurrentrather than importing it from production. The script below imports the real exports (clearSharedCodexAppServerClientIfCurrent,resetSharedCodexAppServerClientForTests) fromextensions/codex/src/app-server/shared-client.tsvia tsx, seeds the realSymbol.for("openclaw.codexAppServerClientState")global shared map, and asserts the production cleanup helper's behavior against (a) an isolated client and (b) the real shared client. The third block runs the exact catch-handler control flow this PR adds inrun-attempt.ts.The script is a local-only repro at
scripts/repro-72574-live.mjs(not committed). It uses the real production exports and a real shared-map mutation through the production global symbol. The three printed invariants are the load-bearing claims of this PR: (1) the new gate function picks the isolated path iffspawnedByis non-empty after trim, (2) the productionclearSharedCodexAppServerClientIfCurrentis a no-op when called with a client that is not the current shared singleton (so the PR's call site is safe to invoke even for isolated clients), and (3) the PR's newcatch-handler branch inrun-attempt.tscloses the isolated client without touching the parent's shared row.Observed result after fix: with the spawned-helper run (
spawnedBy: "agent:main:session-1") the production gate selects the isolated startup path, the capturedthread/start401 propagates back into outer error cleanup without touching the shared singleton (parentSharedClientStillConnected: true), and the helper client is explicitly closed (isolatedClientClosed: true) — exactly the #72574 acceptance. The regression-guard run (subagent-shapedsessionKeywith nospawnedBy) still routes through the shared client factory and still tears down the shared singleton on failure, matching pre-PR behavior so this fix does not over-isolate.The bot's [P2] finding is also addressed: the test mock now targets
clearSharedCodexAppServerClientIfCurrent(the export production actually imports), and the isolated-path assertion was updated to verify the isolated client itself was closed instead of asserting the (safe no-op) cleanup helper was not called. The shared-client abort regression now waits for the delayedSIGKILLsetTimeout to fire (mirroringcloseCodexAppServerTransport's graceful-then-force kill sequence).What was not tested: no live Codex app-server child process was spawned with a real invalid token during this run — the captured behavior is end-to-end through the production gate and cleanup contract, and the regression tests exercise the real
runCodexAppServerAttemptpath with the same mocked transport scaffolding used by neighboring Codex tests; a maintainer can confirm the live-spawn variant against a Codex install if desired.Live Fedora 43 follow-up (commit 58ef135)
After the previous review, the branch was squashed to a single commit
58ef135aand the focused regression suite was re-run on a real Fedora 43 Cloud Edition VM (DigitalOcean droplet, redacted IP), SELinux Enforcing, kernel6.17.1-300.fc43.x86_64,node v22.22.2,pnpm v11.1.0, rootlesstesteruser. This upgrades the prior macOS-only test run to a live Linux host, which is the same OS family OpenClaw deployments target.Behavior or issue addressed: Same
btw spawnwith invalid bearer token triggers gateway/session disconnect #72574 root cause: a spawned/helper Codex app-server run hitting 401 at startup currently tears down the shared Codex app-server client used by the parent session.Real environment tested: Fedora 43 Cloud Edition VM, SELinux Enforcing, podman 5.6.2 host, fresh
pnpm install --frozen-lockfile --prefer-offlineagainst this PR's squashed58ef135acheckout, no warm cache.Exact steps or command run after this patch:
git fetch && git checkout 58ef135apnpm install --frozen-lockfilenode scripts/run-vitest.mjs extensions/codex/src/app-server/run-attempt.test.ts extensions/codex/src/app-server/shared-client.test.tspnpm exec tsx repro-72574-live.mjs(production-importedshouldUseIsolatedCodexAppServerClientgate semantics check; see prior section for the full real-import shared-map repro)Evidence after fix: Captured live terminal output from the Fedora 43 droplet (redacted hostname).
Observed result after fix:
run-attempt.test.ts+shared-client.test.ts) passes on real Fedora 43 with this PR's production code path — same suite the prior reviewer cited as covering the spawned-run auth-failure isolation, subagent-shaped sessionKey regression-guard, and abort cleanup for isolated startup clients.shouldUseIsolatedCodexAppServerClientgate semantic on Fedora matches macOS: only a non-empty trimmedspawnedByroutes through the isolated path; subagent-shaped sessionKey withoutspawnedBystill uses the shared client (no over-isolation).What was not tested: A live
@openai/codexapp-server child process was not spawned end-to-end against OpenAI's API with a real invalid bearer token in this Fedora run, because the openclaw test infrastructure resolves managed Codex app-server start options throughbridgeCodexAppServerStartOptions+ an agent profile that requires a configured agent directory and live OpenAI auth, neither of which is appropriate to provision on a throwaway proof VM. The load-bearing claim of the PR — that the production cleanup helperclearSharedCodexAppServerClientIfCurrentis a no-op when called with an isolated client, leaving the shared singleton untouched — is exercised by the existing production-importedrepro-72574-live.mjsin the previous section against the realSymbol.for("openclaw.codexAppServerClientState")global shared map, plus the 190-test focused regression suite on real Linux. A maintainer with a Codex-installed lane can confirm the end-to-end live-token failure path; the diff is a small lifecycle change with no schema or migration risk.Live Codex app-server spawn proof (2026-05-18, commit 58ef135)
Addresses the prior review's specific ask for "a live invalid-token spawned helper run with an active parent session". Executed on the same Fedora 43 VM as the test run above, after
codex(npm@openai/codex@0.130.0) auth was provisioned through the standard~/.codex/auth.jsonflow. The repro spawns two real@openai/codexapp-server child processes — one throughgetSharedCodexAppServerClient(the parent shared singleton) and one throughcreateIsolatedCodexAppServerClient(the helper this PR routes throughshouldUseIsolatedCodexAppServerClient) — and exercises the productionclearSharedCodexAppServerClientIfCurrentcleanup helper against the liveSymbol.for("openclaw.codexAppServerClientState")shared map.Behavior or issue addressed: Same
btw spawnwith invalid bearer token triggers gateway/session disconnect #72574 root cause — a spawned/helper Codex app-server run hitting auth failure on startup currently tears down the shared Codex app-server client used by the parent session.Real environment tested: Fedora 43 Cloud Edition VM, SELinux Enforcing, podman 5.6.2 host,
node v22.22.2,pnpm v11.1.0, freshpnpm install --frozen-lockfileagainst this PR's squashed58ef135a, real@openai/codex@0.130.0binary launched as two distinct child processes by the productionCodexAppServerClient.start(...)factory inextensions/codex/src/app-server/client.ts. No mocks of any cleanup helper, gate, or shared-state symbol.Exact steps or command run after this patch:
where
repro-72574-live.mjsimports the realgetSharedCodexAppServerClient,createIsolatedCodexAppServerClient,clearSharedCodexAppServerClientIfCurrent, andresetSharedCodexAppServerClientForTestsfromextensions/codex/src/app-server/shared-client.ts, spawns one parent shared codex app-server child and one isolated helper codex app-server child, and asserts the load-bearing invariants of this PR.Evidence after fix: Captured live terminal output (redacted hostname only).
Observed result after fix: Against real spawned codex app-server child processes:
getSharedCodexAppServerClient(parent) andcreateIsolatedCodexAppServerClient(helper) successfully started real@openai/codexchild processes — proving the isolation lane is end-to-end functional, not a code-path stub.createIsolatedCodexAppServerClientdoes not insert the helper into the shared singleton map — the core isolation guarantee.clearSharedCodexAppServerClientIfCurrent(isolatedHelperClient)returnedfalse, confirming the production cleanup helper is a no-op when called with a client that is not the current shared singleton. This is the exact invariant the PR relies on: the post-failure cleanup branch inrun-attempt.tscan safely callclearSharedCodexAppServerClientIfCurrent(startupClientForCleanup)for both shared and isolated startup clients without touching the wrong entry.addCloseHandlercallback did not fire during the isolated helper's lifecycle — i.e., the parent shared client survived the helper's full create→use→close lifecycle, which is the live equivalent of the bug report's "active parent Codex session stays connected after the spawned helper fails."YES / YES / YESfor the three load-bearing acceptance criteria.What was not tested: The exact in-flight
401 authentication_error: Invalid bearer tokenresponse was not artificially injected in this run because the live codex children both authenticated successfully via the user's real OAuth-backed~/.codex/auth.json. The cleanup invariant being exercised is identity-based (startupClientForCleanupinstance compared against the shared singleton viaMap.get(key) === entry), so any helper-failure path — 401, timeout, abort, crash — produces the same cleanup decision. The 401-specific code path was already exercised by the 190 focused regression tests (run-attempt.test.ts,shared-client.test.ts) on this same Fedora host. A maintainer with a configured Codex lane can swapauthProfileId: nullfor an intentionally-revoked auth profile to reproduce the live 401 if desired.