Codex app-server rotates context-engine bootstrap threads after large first turns #85975
Open
Labels
- P2: Normal backlog priority with limited blast radius.
- clawsweeper:fix-shape-clear: ClawSweeper found a clear likely implementation shape for this issue.
- clawsweeper:linked-pr-open: ClawSweeper found an open linked pull request for this issue.
- clawsweeper:no-new-fix-pr: ClawSweeper does not recommend queueing a new automated fix PR for this issue.
- clawsweeper:queueable-fix: ClawSweeper marked this issue as an existing queue_fix_pr work candidate.
- clawsweeper:source-repro: ClawSweeper found a high-confidence source-level issue reproduction.
- impact:session-state: Session, memory, transcript, context, or agent state can drift or corrupt.
- issue-rating: 🦞 diamond lobster: Very strong issue quality with high-confidence source-level or clear reproduction.
Summary
On the current `main` branch and the latest stable release I verified (v2026.5.22, published 2026-05-24), Codex app-server sessions can repeatedly lose their warmed native thread after a large context-engine bootstrap turn. The observed release log shape is:

This is not Discord losing messages, and it is not the model context limit. OpenClaw is clearing the saved Codex app-server native thread binding before the context-engine compatibility path can decide whether that thread is still valid.
Impact
For long-running Codex-backed agents with `contextEngine` projection mode `thread_bootstrap`, a large first (bootstrap) native turn can exceed the local 70k native active-token guard. Once that happens, each later turn can cold-start the native Codex thread instead of using `thread/resume`, causing repeated bootstrap/projection work and loss of the warmed app-server fast path.

That matches the token/latency symptom we are seeing in long Discord sessions: the Gateway still routes the turn, but the Codex-side native wrapper repeatedly starts fresh threads and burns tokens/CPU.
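To make the failure mode concrete, here is a minimal model of the behavior described above. It is a sketch, not OpenClaw's code: `Binding`, `TurnStart`, `startTurn`, and `simulate` are hypothetical names, and the only values taken from this issue are the 70k limit and the `thread/resume` method.

```typescript
// Illustrative model only: every name here is hypothetical, not OpenClaw's API.
const NATIVE_THREAD_MAX_TOKENS = 70_000; // limit quoted in this issue

interface Binding {
  threadId: string;
  latestActiveTokens: number;
}

type TurnStart = "cold-start" | "thread/resume";

// Models the current ordering: the size guard runs first and discards the
// binding before any compatibility check can look at it.
function startTurn(saved: Binding | undefined): TurnStart {
  const usable =
    saved && saved.latestActiveTokens < NATIVE_THREAD_MAX_TOKENS ? saved : undefined;
  return usable ? "thread/resume" : "cold-start";
}

// Replays a session: a cold start re-runs the bootstrap projection, so the new
// thread begins at `bootstrapTokens`; a resume adds `perTurnTokens` on top.
function simulate(bootstrapTokens: number, perTurnTokens: number, turns: number): TurnStart[] {
  const log: TurnStart[] = [];
  let binding: Binding | undefined;
  for (let i = 0; i < turns; i++) {
    const how = startTurn(binding);
    log.push(how);
    if (how === "cold-start") {
      binding = { threadId: `thread-${i}`, latestActiveTokens: bootstrapTokens };
    } else if (binding) {
      binding = { ...binding, latestActiveTokens: binding.latestActiveTokens + perTurnTokens };
    }
  }
  return log;
}
```

In this model, `simulate(80_000, 1_000, 4)` never reaches `thread/resume`, because each fresh thread is already over the limit after its bootstrap turn, while `simulate(20_000, 1_000, 4)` cold-starts once and resumes afterwards.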
Root Cause
`extensions/codex/src/app-server/run-attempt.ts` calls `rotateOversizedCodexAppServerStartupBinding(...)` immediately after reading the startup binding. That helper reads the native Codex rollout/session token stats and clears the binding when the latest usage is at or above `CODEX_APP_SERVER_NATIVE_THREAD_MAX_TOKENS` (`70_000`).

For context-engine `thread_bootstrap`, that ordering is wrong: the bootstrap turn is expected to be large, and later turns should be able to reuse the same native thread as long as the stored context-engine projection metadata still matches the current engine/policy/epoch. The later context-engine reuse logic already knows how to decide whether the binding is compatible, but it never gets the chance because the startup guard deletes the binding first.

Expected Behavior
A saved Codex native thread binding with `contextEngine.projection.mode === "thread_bootstrap"` should survive the startup native transcript size guard. Compatibility should then be decided by the context-engine projection/epoch checks and the existing per-turn overflow recovery path. If the epoch or policy changes, OpenClaw should still rotate and reproject.

Proposed Fix
Defer the startup native token/byte guard for context-engine `thread_bootstrap` bindings. Keep the existing guard behavior for non-context-engine and non-bootstrap native sessions.

I have a focused regression test and patch in progress that proves:
`thread/resume`

Validation So Far
Local validation ran from the Lexar-backed worktree:
Focused checks:
Parallel review also checked Pi runtime risk. The proposed change is limited to the Codex app-server startup binding guard and should not change Pi embedded-runner compaction semantics.
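For reviewers, the intended guard ordering can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the patch itself: `StartupBinding`, `resolveStartupBinding`, and the numeric `epoch` field are hypothetical, and the real compatibility check also covers engine and policy, which are omitted here. Only the constant name, its value, and the `thread_bootstrap` mode come from this issue.

```typescript
// Constant name and value are quoted from this issue.
const CODEX_APP_SERVER_NATIVE_THREAD_MAX_TOKENS = 70_000;

// Hypothetical binding shape; the real type lives in OpenClaw and may differ.
interface StartupBinding {
  threadId: string;
  latestActiveTokens: number;
  contextEngine?: { projection: { mode: string; epoch: number } };
}

// Returns the binding to resume, or undefined to rotate to a fresh thread.
function resolveStartupBinding(
  binding: StartupBinding | undefined,
  currentEpoch: number,
): StartupBinding | undefined {
  if (!binding) return undefined;

  const projection = binding.contextEngine?.projection;
  if (projection?.mode === "thread_bootstrap") {
    // Size guard deferred: reuse is decided by projection metadata instead.
    // Assumption: compatibility is reduced to an epoch comparison here.
    return projection.epoch === currentEpoch ? binding : undefined;
  }

  // Non-context-engine and non-bootstrap sessions keep the existing guard.
  return binding.latestActiveTokens >= CODEX_APP_SERVER_NATIVE_THREAD_MAX_TOKENS
    ? undefined
    : binding;
}
```

The design point is that the size check stays in place for every other session type, so the change in behavior is confined to bindings that carry `thread_bootstrap` projection metadata. Per-turn overflow recovery is out of scope for this sketch.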