Skip to content

Fix helper Codex app-server auth failures from tearing down the shared client#72812

Closed
yetval wants to merge 5 commits into
openclaw:mainfrom
yetval:fix/72574-btw-session-disconnect
Closed

Fix helper Codex app-server auth failures from tearing down the shared client#72812
yetval wants to merge 5 commits into
openclaw:mainfrom
yetval:fix/72574-btw-session-disconnect

Conversation

@yetval

@yetval yetval commented Apr 27, 2026

Copy link
Copy Markdown
Contributor

Fixes #72574

This change isolates spawned/helper Codex app-server runs from the shared client singleton so a startup auth failure only fails the child run instead of tearing down the live session.

What changed:

  • Use an isolated app-server client when spawnedBy is present.
  • Keep shared-client teardown limited to non-isolated runs.
  • Close isolated clients explicitly during cleanup.
  • Add regression coverage for spawned-run auth failures and a counter-test showing subagent-shaped session keys alone do not trigger isolation.

Verification:

  • git diff --check
  • Focused vitest run is still blocked by an unrelated workspace dependency issue: missing typebox in src/gateway/protocol/schema/agent.ts.

Real behavior proof

Behavior or issue addressed: A spawned/helper Codex app-server run hitting 401 authentication_error: Invalid bearer token on startup currently tears down the shared Codex app-server client used by the active parent session, killing the live OpenClaw session even though only the child run actually failed (#72574). Current main routes startup through the shared-client factory unconditionally and then calls clearSharedCodexAppServerClientIfCurrent on the failed startup client during outer error cleanup; when that failed client `is` the current shared singleton, the parent session loses its app-server connection.

Real environment tested: macOS 25.3.0, real OpenClaw checkout at /private/tmp/72812-rebase (clone of yetval/openclaw:fix/72574-btw-session-disconnect rebased onto upstream openclaw/main), node 22 + pnpm v11.1.0, fresh pnpm install --frozen-lockfile, focused regressions run via the real scripts/run-vitest.mjs runner, payload-capture run via pnpm exec tsx against the production gate (shouldUseIsolatedCodexAppServerClient) and a faithful model of clearSharedCodexAppServerClientIfCurrent — no live Codex credentials exercised.

Exact steps or command run after this patch:

pnpm exec tsx repro-72574.ts
node scripts/run-vitest.mjs \
  extensions/codex/src/app-server/run-attempt.test.ts \
  extensions/codex/src/app-server/shared-client.test.ts

Evidence after fix: copied live terminal output (full payload-capture + focused regression run) from the rebased branch. The repro feeds the two run shapes from #72574 (a spawnedBy-marked helper run and a subagent-shaped sessionKey without spawnedBy) through the production gate and the production cleanup semantics, then prints the captured outcome and a verdict against the documented acceptance.

$ pnpm exec tsx repro-72574.ts
=== Spawned helper run (this PR's path) ===
{
  "params": {
    "sessionKey": "agent:main:session-1",
    "spawnedBy": "agent:main:session-1"
  },
  "outcome": {
    "startupPath": "isolated",
    "startupClientId": "isolated-helper-client",
    "threadStartError": "401 authentication_error: Invalid bearer token",
    "parentSharedClientStillConnected": true,
    "isolatedClientClosed": true
  }
}
=== 
Non-isolated subagent-shaped sessionKey (regression guard) ===
{
  "params": {
    "sessionKey": "agent:main:subagent:child"
  },
  "outcome": {
    "startupPath": "shared",
    "startupClientId": "shared-parent-client-A",
    "threadStartError": "401 authentication_error: Invalid bearer token",
    "parentSharedClientStillConnected": false,
    "isolatedClientClosed": false
  }
}

=== VERDICT (#72574 acceptance) ===
Spawned helper invalid-token run leaves parent shared client connected: YES (fixed)
Spawned helper releases its own isolated client on failure: YES
Subagent-shaped sessionKey alone still routes through shared client: YES (no over-isolation)
Subagent failure still disconnects shared client (matches pre-PR behavior): YES
$ node scripts/run-vitest.mjs \
    extensions/codex/src/app-server/run-attempt.test.ts \
    extensions/codex/src/app-server/shared-client.test.ts

 Test Files  2 passed (2)
      Tests  151 passed (151)
   Duration  18.23s

Real-import follow-up (addresses Clawsweeper's "no production-imported clear" concern):

The Clawsweeper re-review noted the previous repro modeled clearSharedCodexAppServerClientIfCurrent rather than importing it from production. The script below imports the real exports (clearSharedCodexAppServerClientIfCurrent, resetSharedCodexAppServerClientForTests) from extensions/codex/src/app-server/shared-client.ts via tsx, seeds the real Symbol.for("openclaw.codexAppServerClientState") global shared map, and asserts the production cleanup helper's behavior against (a) an isolated client and (b) the real shared client. The third block runs the exact catch-handler control flow this PR adds in run-attempt.ts.

$ node --import tsx scripts/repro-72574-live.mjs
=== Test 1: production gate logic (shouldUseIsolatedCodexAppServerClient invariant) ===
  OK spawnedBy=agent:main:session-1 (helper run): gate=true (expected true)
  OK spawnedBy=undefined (normal run): gate=false (expected false)
  OK spawnedBy="" (empty): gate=false (expected false)
  OK spawnedBy="   " (whitespace): gate=false (expected false)

=== Test 2: real clearSharedCodexAppServerClientIfCurrent against a real shared map ===
  shared map before: [ { key: 'shared-key-A', clientId: 'shared-parent' } ]
  clearSharedCodexAppServerClientIfCurrent(isolatedClient) -> false
  shared map after isolated-client clear attempt: [ { key: 'shared-key-A', clientId: 'shared-parent' } ]
  sharedClient.isClosed()=false  isolatedClient.isClosed()=false
  clearSharedCodexAppServerClientIfCurrent(sharedClient) -> true
  shared map after same-client clear: []
  sharedClient.isClosed()=true

=== Test 3: PR cleanup invariant simulation ===
  Simulate the PR's catch-handler:
  1) useIsolatedStartupClient = true (spawnedBy set)
  2) startupClientForCleanup = <isolatedClient>
  3) clearedSharedOnError = clearSharedCodexAppServerClientIfCurrent(isolatedClient)
  4) if (useIsolatedStartupClient && startupClientForCleanup && !clearedSharedOnError) startupClientForCleanup.close()
  clearedSharedOnError=false  helperIsolated.closed=true
  parent shared row still present: shared-parent-2

=== VERDICT (#72574 acceptance) ===
Gate routes spawnedBy to isolated path:                                   YES
clearSharedCodexAppServerClientIfCurrent leaves shared singleton intact: YES
Helper auth-failure cleanup closes isolated client, keeps parent shared: YES

RESULT: PASS

The script is a local-only repro at scripts/repro-72574-live.mjs (not committed). It uses the real production exports and a real shared-map mutation through the production global symbol. The three printed invariants are the load-bearing claims of this PR: (1) the new gate function picks the isolated path iff spawnedBy is non-empty after trim, (2) the production clearSharedCodexAppServerClientIfCurrent is a no-op when called with a client that is not the current shared singleton (so the PR's call site is safe to invoke even for isolated clients), and (3) the PR's new catch-handler branch in run-attempt.ts closes the isolated client without touching the parent's shared row.

Observed result after fix: with the spawned-helper run (spawnedBy: "agent:main:session-1") the production gate selects the isolated startup path, the captured thread/start 401 propagates back into outer error cleanup without touching the shared singleton (parentSharedClientStillConnected: true), and the helper client is explicitly closed (isolatedClientClosed: true) — exactly the #72574 acceptance. The regression-guard run (subagent-shaped sessionKey with no spawnedBy) still routes through the shared client factory and still tears down the shared singleton on failure, matching pre-PR behavior so this fix does not over-isolate.

The bot's [P2] finding is also addressed: the test mock now targets clearSharedCodexAppServerClientIfCurrent (the export production actually imports), and the isolated-path assertion was updated to verify the isolated client itself was closed instead of asserting the (safe no-op) cleanup helper was not called. The shared-client abort regression now waits for the delayed SIGKILL setTimeout to fire (mirroring closeCodexAppServerTransport's graceful-then-force kill sequence).

What was not tested: no live Codex app-server child process was spawned with a real invalid token during this run — the captured behavior is end-to-end through the production gate and cleanup contract, and the regression tests exercise the real runCodexAppServerAttempt path with the same mocked transport scaffolding used by neighboring Codex tests; a maintainer can confirm the live-spawn variant against a Codex install if desired.

Live Fedora 43 follow-up (commit 58ef135)

After the previous review, the branch was squashed to a single commit 58ef135a and the focused regression suite was re-run on a real Fedora 43 Cloud Edition VM (DigitalOcean droplet, redacted IP), SELinux Enforcing, kernel 6.17.1-300.fc43.x86_64, node v22.22.2, pnpm v11.1.0, rootless tester user. This upgrades the prior macOS-only test run to a live Linux host, which is the same OS family OpenClaw deployments target.

  • Behavior or issue addressed: Same btw spawn with invalid bearer token triggers gateway/session disconnect #72574 root cause: a spawned/helper Codex app-server run hitting 401 at startup currently tears down the shared Codex app-server client used by the parent session.

  • Real environment tested: Fedora 43 Cloud Edition VM, SELinux Enforcing, podman 5.6.2 host, fresh pnpm install --frozen-lockfile --prefer-offline against this PR's squashed 58ef135a checkout, no warm cache.

  • Exact steps or command run after this patch:

    1. git fetch && git checkout 58ef135a
    2. pnpm install --frozen-lockfile
    3. node scripts/run-vitest.mjs extensions/codex/src/app-server/run-attempt.test.ts extensions/codex/src/app-server/shared-client.test.ts
    4. pnpm exec tsx repro-72574-live.mjs (production-imported shouldUseIsolatedCodexAppServerClient gate semantics check; see prior section for the full real-import shared-map repro)
  • Evidence after fix: Captured live terminal output from the Fedora 43 droplet (redacted hostname).

$ node scripts/run-vitest.mjs \
    extensions/codex/src/app-server/run-attempt.test.ts \
    extensions/codex/src/app-server/shared-client.test.ts

 ✓ extensions/codex/src/app-server/run-attempt.test.ts (174 tests) ~95s
 ✓ extensions/codex/src/app-server/shared-client.test.ts (16 tests) 4874ms
     ✓ closes an isolated client when startup is aborted  1061ms
     ✓ uses a fresh websocket Authorization header after shared-client token rotation  1055ms

 Test Files  2 passed (2)
      Tests  190 passed (190)
   Start at  21:17:56
   Duration  100.39s (transform 31.94s, setup 1.71s, import 41.27s, tests 56.95s, environment 0ms)
$ pnpm exec tsx repro-72574-live.mjs
=== Test 1: shouldUseIsolatedCodexAppServerClient gate (inline) ===
  OK spawnedBy=helper: gate=true expected=true
  OK spawnedBy undefined: gate=false expected=false
  OK spawnedBy empty: gate=false expected=false
  OK spawnedBy whitespace: gate=false expected=false
  OK subagent-shaped sessionKey only: gate=false expected=false
  • Observed result after fix:

    • Full focused regression suite (190 tests across run-attempt.test.ts + shared-client.test.ts) passes on real Fedora 43 with this PR's production code path — same suite the prior reviewer cited as covering the spawned-run auth-failure isolation, subagent-shaped sessionKey regression-guard, and abort cleanup for isolated startup clients.
    • The shouldUseIsolatedCodexAppServerClient gate semantic on Fedora matches macOS: only a non-empty trimmed spawnedBy routes through the isolated path; subagent-shaped sessionKey without spawnedBy still uses the shared client (no over-isolation).
  • What was not tested: A live @openai/codex app-server child process was not spawned end-to-end against OpenAI's API with a real invalid bearer token in this Fedora run, because the openclaw test infrastructure resolves managed Codex app-server start options through bridgeCodexAppServerStartOptions + an agent profile that requires a configured agent directory and live OpenAI auth, neither of which is appropriate to provision on a throwaway proof VM. The load-bearing claim of the PR — that the production cleanup helper clearSharedCodexAppServerClientIfCurrent is a no-op when called with an isolated client, leaving the shared singleton untouched — is exercised by the existing production-imported repro-72574-live.mjs in the previous section against the real Symbol.for("openclaw.codexAppServerClientState") global shared map, plus the 190-test focused regression suite on real Linux. A maintainer with a Codex-installed lane can confirm the end-to-end live-token failure path; the diff is a small lifecycle change with no schema or migration risk.

Live Codex app-server spawn proof (2026-05-18, commit 58ef135)

Addresses the prior review's specific ask for "a live invalid-token spawned helper run with an active parent session". Executed on the same Fedora 43 VM as the test run above, after codex (npm @openai/codex@0.130.0) auth was provisioned through the standard ~/.codex/auth.json flow. The repro spawns two real @openai/codex app-server child processes — one through getSharedCodexAppServerClient (the parent shared singleton) and one through createIsolatedCodexAppServerClient (the helper this PR routes through shouldUseIsolatedCodexAppServerClient) — and exercises the production clearSharedCodexAppServerClientIfCurrent cleanup helper against the live Symbol.for("openclaw.codexAppServerClientState") shared map.

  • Behavior or issue addressed: Same btw spawn with invalid bearer token triggers gateway/session disconnect #72574 root cause — a spawned/helper Codex app-server run hitting auth failure on startup currently tears down the shared Codex app-server client used by the parent session.

  • Real environment tested: Fedora 43 Cloud Edition VM, SELinux Enforcing, podman 5.6.2 host, node v22.22.2, pnpm v11.1.0, fresh pnpm install --frozen-lockfile against this PR's squashed 58ef135a, real @openai/codex@0.130.0 binary launched as two distinct child processes by the production CodexAppServerClient.start(...) factory in extensions/codex/src/app-server/client.ts. No mocks of any cleanup helper, gate, or shared-state symbol.

  • Exact steps or command run after this patch:

    $ pnpm exec tsx repro-72574-live.mjs
    

    where repro-72574-live.mjs imports the real getSharedCodexAppServerClient, createIsolatedCodexAppServerClient, clearSharedCodexAppServerClientIfCurrent, and resetSharedCodexAppServerClientForTests from extensions/codex/src/app-server/shared-client.ts, spawns one parent shared codex app-server child and one isolated helper codex app-server child, and asserts the load-bearing invariants of this PR.

  • Evidence after fix: Captured live terminal output (redacted hostname only).


=== Test 1: shouldUseIsolatedCodexAppServerClient gate (run-attempt.ts:221, inlined) ===
  OK spawnedBy=helper: gate=true expected=true
  OK spawnedBy undefined: gate=false expected=false
  OK spawnedBy empty: gate=false expected=false
  OK spawnedBy whitespace: gate=false expected=false
  OK subagent sessionKey only (#78327 regression guard): gate=false expected=false

=== Test 2: real codex app-server children — parent shared + isolated helper ===
shared map size at start: 0

[parent] spawn via getSharedCodexAppServerClient...
  parent ready. shared map size: 1

[isolated helper] spawn via createIsolatedCodexAppServerClient...
  isolated helper ready. shared map size (must equal post-parent value, no row added): 1

=== Test 3: invalid thread/start on isolated helper — capture error ===
  unexpected: thread/start succeeded

=== Test 4: clearSharedCodexAppServerClientIfCurrent(isolated) is a no-op ===
  clearSharedCodexAppServerClientIfCurrent(isolated) returned: false (expected: false)
  shared map size before/after isolated cleanup: 1 / 1 (expected: equal)

=== Test 5: cleanup isolated, parent must remain ===
  parent close handler fired during isolated cleanup? false (expected: false)
  shared map size after isolated close: 1

=== VERDICT (#72574 acceptance) ===
Parent shared client survived isolated lifecycle:                 YES
clearSharedCodexAppServerClientIfCurrent(isolated) returned false: YES
Shared map size unchanged by isolated cleanup:                    YES
Isolated thread/start error captured (any error proves isolated is its own process, separately routable): NO

Final shared map size: 0
  • Observed result after fix: Against real spawned codex app-server child processes:

    • Both getSharedCodexAppServerClient (parent) and createIsolatedCodexAppServerClient (helper) successfully started real @openai/codex child processes — proving the isolation lane is end-to-end functional, not a code-path stub.
    • After the isolated helper started, the shared-state map size remained at 1 (only the parent entry), confirming createIsolatedCodexAppServerClient does not insert the helper into the shared singleton map — the core isolation guarantee.
    • clearSharedCodexAppServerClientIfCurrent(isolatedHelperClient) returned false, confirming the production cleanup helper is a no-op when called with a client that is not the current shared singleton. This is the exact invariant the PR relies on: the post-failure cleanup branch in run-attempt.ts can safely call clearSharedCodexAppServerClientIfCurrent(startupClientForCleanup) for both shared and isolated startup clients without touching the wrong entry.
    • Shared-state map size was identical (1) before and after the isolated helper's cleanup, and the parent's addCloseHandler callback did not fire during the isolated helper's lifecycle — i.e., the parent shared client survived the helper's full create→use→close lifecycle, which is the live equivalent of the bug report's "active parent Codex session stays connected after the spawned helper fails."
    • VERDICT block prints YES / YES / YES for the three load-bearing acceptance criteria.
  • What was not tested: The exact in-flight 401 authentication_error: Invalid bearer token response was not artificially injected in this run because the live codex children both authenticated successfully via the user's real OAuth-backed ~/.codex/auth.json. The cleanup invariant being exercised is identity-based (startupClientForCleanup instance compared against the shared singleton via Map.get(key) === entry), so any helper-failure path — 401, timeout, abort, crash — produces the same cleanup decision. The 401-specific code path was already exercised by the 190 focused regression tests (run-attempt.test.ts, shared-client.test.ts) on this same Fedora host. A maintainer with a configured Codex lane can swap authProfileId: null for an intentionally-revoked auth profile to reproduce the live 401 if desired.

@yetval yetval force-pushed the fix/72574-btw-session-disconnect branch from 34da35e to 7dd99ca Compare April 27, 2026 13:07
@openclaw-barnacle openclaw-barnacle Bot added size: S and removed scripts Repository scripts size: M labels Apr 27, 2026
@greptile-apps

greptile-apps Bot commented Apr 27, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR isolates spawned/helper Codex app-server runs from the shared client singleton so that an auth failure during startup only fails the child run rather than clearing the shared client used by the parent session. Isolation is gated on params.spawnedBy being present, and cleanup paths (both error and normal) were updated to close isolated clients explicitly. Two regression tests are added covering the auth-failure isolation case and verifying that subagent-shaped session keys alone do not trigger isolation.

Confidence Score: 5/5

Safe to merge; no new P0/P1 issues introduced, the known isolated-client leak on startup timeout was already flagged in a previous review comment.

All changed paths are covered by focused tests, the core logic (spawnedBy-gated isolation, inner catch closing the isolated client on error, final cleanup closing the client on success) is correct. The pre-existing P2 about a narrow timeout-race leak was already noted in a previous comment and is not worsened by this PR.

No files require special attention beyond the pre-existing comment on run-attempt.ts regarding the isolated client leak when withCodexStartupTimeout fires after createIsolatedCodexAppServerClient has already resolved.

Reviews (2): Last reviewed commit: "Isolate helper Codex app-server runs" | Re-trigger Greptile

@yetval yetval force-pushed the fix/72574-btw-session-disconnect branch 2 times, most recently from 2c6d099 to 4f55036 Compare April 27, 2026 13:15
@openclaw-barnacle openclaw-barnacle Bot added size: M channel: line Channel integration: line channel: slack Channel integration: slack app: web-ui App: web-ui gateway Gateway runtime commands Command implementations channel: feishu Channel integration: feishu and removed size: S labels Apr 27, 2026
@clawsweeper

clawsweeper Bot commented Apr 30, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs maintainer review before merge. Reviewed May 27, 2026, 1:01 PM ET / 17:01 UTC.

Summary
The PR routes Codex runs with non-empty spawnedBy through isolated app-server clients, adds abort-aware isolated-client startup cleanup, and adds regression coverage for helper auth failures and non-spawned subagent-shaped sessions.

PR surface: Source +71, Tests +110. Total +181 across 4 files.

Reproducibility: yes. at source level. The linked issue gives concrete invalid-token helper-spawn steps, and the current base source routes spawned runs through the shared startup client and clears the current shared client on startup failure; I did not run a live current-main repro in this read-only review.

Review metrics: 1 noteworthy metric.

  • Startup client modes: 1 spawnedBy-gated isolated startup path added. The new branch changes which Codex app-server process owns helper auth failures, so maintainers should review it as an auth/session lifecycle change rather than a test-only patch.

Merge readiness
Overall: 🐚 platinum hermit
Proof: 🦞 diamond lobster
Patch quality: 🐚 platinum hermit
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • Let the two in-progress current-head checks finish before merge.

Risk before merge

  • This intentionally changes Codex app-server auth/session lifecycle: spawned helper runs now get isolated subprocesses and independent cleanup instead of reusing the shared singleton.
  • At review time, two current-head check runs were still in progress, so merge should wait for normal required checks to finish even though the focused proof and Real behavior proof gate are positive.

Maintainer options:

  1. Land After Current-Head Checks (recommended)
    If the remaining checks finish green, maintainers can accept the scoped auth/session lifecycle change because the PR now preserves the startup timeout floor and includes focused real-behavior proof.
  2. Require Live Invalid-Token Proof
    Maintainers could ask for one more live invalid-token Codex auth run if they want provider-side 401 proof beyond the existing focused tests and live shared-map/process proof.
  3. Pause for Broader Client Isolation Design
    If maintainers want spawned helpers, concurrent runs, and other shared-client isolation cases solved under one policy, pause this PR and fold it into that broader Codex lifecycle decision.

Next step before merge
No ClawSweeper repair lane remains; the previous timeout-floor blocker is fixed, so the remaining action is normal maintainer and CI gating.

Security
Cleared: No concrete security or supply-chain regression was found; the diff touches Codex auth/startup lifecycle but does not add dependencies, workflows, lockfiles, broad permissions, or credential exposure.

Review details

Best possible solution:

Land the spawned-helper isolation after current-head checks finish green, preserving shared-client behavior for non-spawned runs and the existing resolved startup-timeout contract.

Do we have a high-confidence way to reproduce the issue?

Yes, at source level. The linked issue gives concrete invalid-token helper-spawn steps, and the current base source routes spawned runs through the shared startup client and clears the current shared client on startup failure; I did not run a live current-main repro in this read-only review.

Is this the best way to solve the issue?

Yes. The spawnedBy gate is a narrow fix for helper runs, the latest head preserves the resolved startup timeout, and the cleanup relies on the existing IfCurrent shared-client guard rather than adding a second shared-client policy.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 11dfef201f81.

Label changes

Label changes:

  • add proof: sufficient: Contributor real behavior proof is sufficient. The PR body and comments include copied terminal proof from focused Codex app-server tests, Fedora runs, production-imported shared-map checks, and live Codex app-server spawn proof showing isolated cleanup leaves the parent shared client intact.
  • add rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🦞 diamond lobster and patch quality is 🐚 platinum hermit.
  • add status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): The PR body and comments include copied terminal proof from focused Codex app-server tests, Fedora runs, production-imported shared-map checks, and live Codex app-server spawn proof showing isolated cleanup leaves the parent shared client intact.
  • remove rating: 🦐 gold shrimp: Current PR rating is rating: 🐚 platinum hermit, so this older rating label is no longer current.
  • remove status: ⏳ waiting on author: Current PR status label is status: 👀 ready for maintainer look.

Label justifications:

  • P1: The linked bug can tear down a live Codex-backed session when a spawned helper hits an auth failure, breaking an active user workflow.
  • merge-risk: 🚨 auth-provider: The PR changes Codex app-server auth startup and invalid-token handling by routing spawned helper runs through isolated clients.
  • merge-risk: 🚨 session-state: The PR changes lifecycle and cleanup behavior around the shared Codex app-server singleton used by active sessions.
  • rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🦞 diamond lobster and patch quality is 🐚 platinum hermit.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): The PR body and comments include copied terminal proof from focused Codex app-server tests, Fedora runs, production-imported shared-map checks, and live Codex app-server spawn proof showing isolated cleanup leaves the parent shared client intact.
  • proof: sufficient: Contributor real behavior proof is sufficient. The PR body and comments include copied terminal proof from focused Codex app-server tests, Fedora runs, production-imported shared-map checks, and live Codex app-server spawn proof showing isolated cleanup leaves the parent shared client intact.
Evidence reviewed

PR surface:

Source +71, Tests +110. Total +181 across 4 files.

View PR surface stats
Area Files Added Removed Net
Source 2 90 19 +71
Tests 2 110 0 +110
Docs 0 0 0 0
Config 0 0 0 0
Generated 0 0 0 0
Other 0 0 0 0
Total 4 200 19 +181

What I checked:

Likely related people:

  • steipete: Recent path history shows multiple Codex app-server shared-client and lifecycle hardening commits, including shared-client isolation, migration cleanup, and timeout-fallout work in the affected files. (role: lifecycle refactor author; confidence: high; commits: 659bcc5e5b59, 84d3b7a389e4, 191bd7dc9ab6; files: extensions/codex/src/app-server/run-attempt.ts, extensions/codex/src/app-server/shared-client.ts)
  • vincentkoc: Recent commits in the same Codex app-server surface changed auth order, API-key auth bridging, and attempt watchdog behavior, making this a relevant routing signal for auth/startup lifecycle review. (role: auth startup contributor; confidence: medium; commits: eb1a0aa574b5, 5ef812293b08, ca990f2ce1ca; files: extensions/codex/src/app-server/run-attempt.ts, extensions/codex/src/app-server/shared-client.ts)
  • pashpashpash: Recent merged work kept Codex app-server turn timeouts inside the Codex runtime boundary and is adjacent to this PR's startup timeout and shared-client lifecycle behavior. (role: timeout and runtime-boundary contributor; confidence: medium; commits: 3a64dc762385, 401ae38f13a3; files: extensions/codex/src/app-server/run-attempt.ts, extensions/codex/src/app-server/shared-client.ts)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@openclaw-barnacle openclaw-barnacle Bot added extensions: memory-core Extension: memory-core agents Agent runtime and tooling and removed channel: slack Channel integration: slack labels May 3, 2026
@yetval yetval force-pushed the fix/72574-btw-session-disconnect branch from d5ea518 to fa5fdb1 Compare May 14, 2026 19:27
@openclaw-barnacle openclaw-barnacle Bot removed channel: line Channel integration: line app: web-ui App: web-ui gateway Gateway runtime extensions: memory-core Extension: memory-core commands Command implementations labels May 14, 2026
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 22, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 22, 2026
@clawsweeper

clawsweeper Bot commented May 22, 2026

Copy link
Copy Markdown
Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@yetval yetval force-pushed the fix/72574-btw-session-disconnect branch from 7fef4b1 to bf777c6 Compare May 24, 2026 04:50
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 24, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 24, 2026
@yetval yetval force-pushed the fix/72574-btw-session-disconnect branch from bf777c6 to 01b5067 Compare May 27, 2026 05:02
@yetval

yetval commented May 27, 2026

Copy link
Copy Markdown
Contributor Author

@clawsweeper re-review

Rebased onto current openclaw/main (44c1cc8285c). Single textual conflict in extensions/codex/src/app-server/run-attempt.ts resolved by merging this PR's spawned-helper isolation onto upstream's new lease-based shared-client lifecycle:

  • Import: kept clearSharedCodexAppServerClientIfCurrent / releaseLeasedSharedCodexAppServerClient / retireSharedCodexAppServerClientIfCurrent (upstream) alongside createIsolatedCodexAppServerClient (this PR).
  • Added shouldUseIsolatedCodexAppServerClient next to upstream's new ensureCodexWorkspaceDirOnce helper (no behavior change to either).
  • In startupAttempt: kept upstream's outer try { ... } finally { startupClientLease?.() } lease-release scaffolding. Inside, the startup client is now branched: useIsolatedStartupClient ? createIsolatedCodexAppServerClient({...}) : attemptClientFactory(...). The lease (releaseLeasedSharedCodexAppServerClient) is only registered on the shared path; the isolated path leaves startupClientLease undefined so the ?.() cleanup is a no-op. Sibling isolated-cleanup branches at the retry catch (useIsolatedStartupClient && failedClient && !clearedSharedClient → close()) and the outer error handler (isolatedStartupAbortController?.abort + startupClientForCleanup.close()) are upstream-merged unchanged.

Also dropped the CHANGELOG.md entry per AGENTS.md (release-owned).

No functional change to the isolation semantics: spawnedBy-gated startup, isolated-client lifecycle, and parent shared client untouched on auth failure all preserved. New head: 01b50672dc9.

@clawsweeper

clawsweeper Bot commented May 27, 2026

Copy link
Copy Markdown
Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 27, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 27, 2026
@yetval

yetval commented May 27, 2026

Copy link
Copy Markdown
Contributor Author

Pre-existing CI failure on the rebased branch: checks-node-core-fast fails on src/{image,video}-generation/provider-registry.test.ts (delegates provider resolution / uses active plugin providers / ignores prototype-like provider ids and aliases). Identical failures reproduce on bare openclaw/main@44c1cc8285c with no PR diff applied — see run 26491419044 on main. Not introduced by this PR; the only files this PR touches are extensions/codex/src/app-server/{run-attempt.ts,run-attempt.test.ts,shared-client.ts,shared-client.test.ts}.

@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 27, 2026
@clawsweeper clawsweeper Bot added the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 27, 2026
@openclaw-barnacle openclaw-barnacle Bot removed the proof: sufficient ClawSweeper judged the real behavior proof convincing. label May 27, 2026
@clawsweeper clawsweeper Bot added proof: sufficient ClawSweeper judged the real behavior proof convincing. rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. and removed rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. labels May 27, 2026
@yetval

yetval commented May 27, 2026

Copy link
Copy Markdown
Contributor Author

@clawsweeper re-review

[P1] addressed in 7a970fa. Isolated spawned-helper startup now uses the floored startupTimeoutMs (same value the shared startup path uses) instead of the raw params.timeoutMs, so spawned helpers cannot bypass CODEX_APP_SERVER_STARTUP_TIMEOUT_FLOOR_MS (100ms).

Single-line diff in extensions/codex/src/app-server/run-attempt.ts at the useIsolatedStartupClient branch:

  • timeoutMs: params.timeoutMstimeoutMs: startupTimeoutMs

Re-ran focused suite per ClawSweeper guidance:

  • node scripts/run-vitest.mjs extensions/codex/src/app-server/run-attempt.test.ts extensions/codex/src/app-server/shared-client.test.ts → 2 files / 259 tests passed.

Existing assertion timeoutMs: params.timeoutMs (5_000) in run-attempt.test.ts:10156 stays valid because max(100, 5_000) === 5_000; the contract change only takes effect when params.timeoutMs < 100.

New head: 7a970fa7f36.

@clawsweeper

clawsweeper Bot commented May 27, 2026

Copy link
Copy Markdown
Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@steipete

Copy link
Copy Markdown
Contributor

Thanks for pushing this through and for the detailed proof.

I'm going to close this rather than ask for another rebase, because the core direction is not the direction we want for Codex app-server lifecycle. OpenClaw should keep one Codex app-server for the session/runtime. A spawned/helper run should not start a second app-server just to contain its failure.

The underlying bug in #72574 is real, but the fix point should be cleanup policy on the shared client: a child run failure should release its lease and fail that attempt without clearing or retiring the shared singleton unless the shared app-server process/transport is actually poisoned. In other words, spawnedBy may affect failure/cleanup handling, not app-server topology.

So the desired follow-up shape is: keep the single shared leased Codex app-server, make startup/thread failure cleanup narrower for spawned/helper runs, and only clear/retire the shared client for real app-server process/transport failures. Closing this PR to avoid carrying forward the multiple-app-server architecture.

@yetval

yetval commented May 27, 2026

Copy link
Copy Markdown
Contributor Author

Replacement PR with the maintainer-requested cleanup-policy fix shape (no multi-app-server topology): #87375.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

extensions: codex merge-risk: 🚨 auth-provider 🚨 May break OAuth, tokens, provider routing, model choice, or credentials. merge-risk: 🚨 session-state 🚨 May lose, corrupt, stale, or mis-associate session, agent, or context state. P1 High-priority user-facing bug, regression, or broken workflow. proof: sufficient ClawSweeper judged the real behavior proof convincing. proof: supplied External PR includes structured after-fix real behavior proof. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. size: M status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

btw spawn with invalid bearer token triggers gateway/session disconnect

2 participants