
Codex app-server rotates context-engine bootstrap threads after large first turns #85975

@100yenadmin

Summary

On the current main branch and the latest stable release I verified (v2026.5.22, published 2026-05-24), Codex app-server sessions can repeatedly lose their warmed native thread after a large context-engine bootstrap turn. The observed release log shape is:

codex app-server native transcript exceeded active token limit; starting a fresh thread

This is not Discord losing messages and it is not the model context limit. It is OpenClaw clearing the saved Codex app-server native thread binding before the context-engine compatibility path can decide whether that thread is still valid.

Impact

For long-running Codex-backed agents with contextEngine projection mode thread_bootstrap, a large first/bootstrap native turn can exceed the local 70k native active-token guard. Once that happens, each later turn can cold-start the native Codex thread instead of using thread/resume, causing repeated bootstrap/projection work and loss of the warmed app-server fast path.

That matches the token/latency symptom we are seeing in long Discord sessions: the Gateway still routes the turn, but the Codex-side native wrapper repeatedly starts fresh threads and burns tokens/CPU.

Root Cause

extensions/codex/src/app-server/run-attempt.ts calls rotateOversizedCodexAppServerStartupBinding(...) immediately after reading the startup binding. That helper reads the native Codex rollout/session token stats and clears the binding when the latest usage is at or above CODEX_APP_SERVER_NATIVE_THREAD_MAX_TOKENS (70_000).

For context-engine thread_bootstrap, that ordering is wrong: the bootstrap turn is expected to be large, and later turns should be able to reuse the same native thread as long as the stored context-engine projection metadata still matches the current engine/policy/epoch. The later context-engine reuse logic already knows how to decide whether the binding is compatible, but it never gets the chance because the startup guard deletes the binding first.
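A rough sketch of that ordering problem, for illustration only. The 70_000 limit and the thread_bootstrap mode come from this report; the types and function names below are hypothetical stand-ins, not the actual run-attempt.ts code.

```typescript
// Hypothetical, simplified model of the startup ordering described above.
const NATIVE_THREAD_MAX_TOKENS = 70_000; // value taken from the issue text

interface SavedBinding {
  threadId: string;
  projectionMode?: string;
  projectionEpoch?: number;
}

// Startup guard as described: drops the binding at or above the limit.
function startupGuard(
  binding: SavedBinding | undefined,
  latestTokens: number,
): SavedBinding | undefined {
  if (binding && latestTokens >= NATIVE_THREAD_MAX_TOKENS) {
    return undefined;
  }
  return binding;
}

// Later compatibility check: it can only approve a binding that still exists.
function canReuse(binding: SavedBinding | undefined, currentEpoch: number): boolean {
  return (
    binding !== undefined &&
    binding.projectionMode === "thread_bootstrap" &&
    binding.projectionEpoch === currentEpoch
  );
}

const saved: SavedBinding = {
  threadId: "t-1",
  projectionMode: "thread_bootstrap",
  projectionEpoch: 3,
};

// Current order: the guard runs first, so the compatible binding is already gone.
const reusedInCurrentOrder = canReuse(startupGuard(saved, 86_000), 3);

// If the compatibility check saw the binding, it would accept it.
const reusedIfCheckedFirst = canReuse(saved, 3);
```

In this model the first result is false and the second is true, which is the gap described above: the binding is compatible, but the guard removes it before anything can say so.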

Expected Behavior

A saved Codex native thread binding with contextEngine.projection.mode === "thread_bootstrap" should survive the startup native transcript size guard. Compatibility should then be decided by the context-engine projection/epoch checks and the existing per-turn overflow recovery path. If the epoch or policy changes, OpenClaw should still rotate and reproject.
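As a sketch of what "decided by the projection/epoch checks" could look like for a thread_bootstrap binding: the decision below depends only on stored versus current metadata, not on transcript size. The field names (engineId, policyId, epoch) and the decision labels are assumptions for illustration, not the real metadata shape.

```typescript
// Illustrative only: field names and decision labels are assumed.
interface ProjectionMetadata {
  engineId: string;
  policyId: string;
  epoch: number;
}

type ReuseDecision = "resume" | "rotate_and_reproject";

// Reuse decision for a thread_bootstrap binding. Note that native transcript
// size is deliberately not an input here.
function decideReuse(
  stored: ProjectionMetadata | undefined,
  current: ProjectionMetadata,
): ReuseDecision {
  if (!stored) {
    return "rotate_and_reproject";
  }
  const matches =
    stored.engineId === current.engineId &&
    stored.policyId === current.policyId &&
    stored.epoch === current.epoch;
  return matches ? "resume" : "rotate_and_reproject";
}

const currentMeta: ProjectionMetadata = { engineId: "engine-a", policyId: "policy-1", epoch: 7 };

const sameEpoch = decideReuse({ ...currentMeta }, currentMeta);
const staleEpoch = decideReuse({ ...currentMeta, epoch: 6 }, currentMeta);
```

Here a matching binding resumes, while a stale epoch still rotates and reprojects.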

Proposed Fix

Defer the startup native token/byte guard for context-engine thread_bootstrap bindings. Keep the existing guard behavior for non-context-engine and non-bootstrap native sessions.
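A minimal sketch of the deferral, assuming the binding carries contextEngine.projection.mode as described under Expected Behavior. The function names, the stats shape, and the byte limit value are placeholders; the actual patch may be structured differently.

```typescript
// Hypothetical sketch of the proposed deferral; not the actual patch.
const MAX_TOKENS = 70_000; // from the issue text
const MAX_BYTES = 1_000_000; // placeholder; the real byte limit is configured elsewhere

interface NativeBinding {
  threadId: string;
  contextEngine?: { projection?: { mode?: string } };
}

interface RolloutStats {
  tokens: number;
  bytes: number;
}

function isBootstrapBinding(binding: NativeBinding): boolean {
  return binding.contextEngine?.projection?.mode === "thread_bootstrap";
}

// Returns the binding to keep, or undefined when it should be rotated.
function applyStartupGuard(
  binding: NativeBinding,
  stats: RolloutStats,
): NativeBinding | undefined {
  if (isBootstrapBinding(binding)) {
    // Deferred: leave the decision to the context-engine compatibility checks.
    return binding;
  }
  const oversized = stats.tokens >= MAX_TOKENS || stats.bytes >= MAX_BYTES;
  return oversized ? undefined : binding;
}

const bootstrapBinding: NativeBinding = {
  threadId: "t-bootstrap",
  contextEngine: { projection: { mode: "thread_bootstrap" } },
};
const plainBinding: NativeBinding = { threadId: "t-plain" };
const bigRollout: RolloutStats = { tokens: 86_000, bytes: 10_000 };

const keptBootstrap = applyStartupGuard(bootstrapBinding, bigRollout);
const rotatedPlain = applyStartupGuard(plainBinding, bigRollout);
```

In this sketch the bootstrap binding survives an 86k-token rollout, while the plain native binding is still cleared exactly as before.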

I have a patch in progress with a focused regression test that verifies:

  • an 86k-token bootstrap rollout still resumes with thread/resume
  • the following turn sends only the current user prompt, not the assembled bootstrap context again
  • existing non-bootstrap native over-budget rotation tests still pass
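To illustrate the second point, here is a toy model of how the turn input differs between a warm resume and a cold start. The TurnPlan shape and planTurn helper are invented for this example and do not mirror the real request types.

```typescript
// Toy model only: TurnPlan and planTurn are hypothetical names.
interface TurnPlan {
  resumesThread: boolean;
  input: string[];
}

function planTurn(
  hasReusableBinding: boolean,
  bootstrapContext: string,
  userPrompt: string,
): TurnPlan {
  return hasReusableBinding
    ? // Warm path: the native thread already holds the bootstrap context.
      { resumesThread: true, input: [userPrompt] }
    : // Cold path: the assembled context has to be projected again.
      { resumesThread: false, input: [bootstrapContext, userPrompt] };
}

const warmTurn = planTurn(true, "<assembled bootstrap context>", "next question");
const coldTurn = planTurn(false, "<assembled bootstrap context>", "next question");
```

The warm turn carries one input item and the cold turn carries two, which is the repeated bootstrap cost the fix is meant to avoid.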

Validation So Far

Local validation was run from the Lexar-backed worktree:

/Volumes/LEXAR/repos/worktrees/openclaw-codex-native-thread-reuse

Focused checks:

OPENCLAW_VITEST_MAX_WORKERS=1 node scripts/run-vitest.mjs extensions/codex/src/app-server/run-attempt.context-engine.test.ts --run
OPENCLAW_VITEST_MAX_WORKERS=1 node scripts/run-vitest.mjs extensions/codex/src/app-server/run-attempt.test.ts --run -t "starts a fresh Codex thread before resume when the native rollout is over budget|uses current rollout token usage before cumulative usage|clears native rollouts at the configured byte limit"
pnpm exec oxfmt --check --threads=1 extensions/codex/src/app-server/run-attempt.ts extensions/codex/src/app-server/run-attempt.context-engine.test.ts
git diff --check

Parallel review also checked Pi runtime risk. The proposed change is limited to the Codex app-server startup binding guard and should not change Pi embedded-runner compaction semantics.

Labels

  • P2 - Normal backlog priority with limited blast radius.
  • clawsweeper:fix-shape-clear - ClawSweeper found a clear likely implementation shape for this issue.
  • clawsweeper:linked-pr-open - ClawSweeper found an open linked pull request for this issue.
  • clawsweeper:no-new-fix-pr - ClawSweeper does not recommend queueing a new automated fix PR for this issue.
  • clawsweeper:queueable-fix - ClawSweeper marked this issue as an existing queue_fix_pr work candidate.
  • clawsweeper:source-repro - ClawSweeper found a high-confidence source-level issue reproduction.
  • impact:session-state - Session, memory, transcript, context, or agent state can drift or corrupt.
  • issue-rating: 🦞 diamond lobster - Very strong issue quality with high-confidence source-level or clear reproduction.
