Codex long-running sessions should use semantic thread/bootstrap cache ownership #86023

@100yenadmin

Codex long-running sessions should use semantic thread/bootstrap cache ownership instead of hard native-token rotation

Problem

Long-running Discord/Codex sessions can still become slow after #85978 when they do not enter or retain the context-engine thread_bootstrap path. The local Codex app-server startup guard can still log:

codex app-server native transcript exceeded active token limit; starting a fresh thread
nativeTokens=116268, max 70000

That means OpenClaw clears the saved native Codex thread and starts cold, even though the selected model may have much more context headroom. The 70k number is an OpenClaw native-thread reuse guard, not the model context window.
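
To make the distinction concrete, here is a minimal sketch that separates "over the OpenClaw reuse guard" from "over the model context window". The function and type names are hypothetical and do not come from the OpenClaw codebase. The 200000-token window is an illustrative assumption, while 116268 and 70000 are the values from the log above.

```typescript
// Hypothetical sketch: none of these names are OpenClaw APIs.
type TokenPressure = "within-guard" | "reuse-guard-only" | "model-overflow";

function classifyTokenPressure(
  nativeTokens: number,
  reuseGuardLimit: number, // OpenClaw-side reuse guard (70000 in the log above)
  modelContextWindow: number, // assumed to be supplied by the caller
): TokenPressure {
  // Real overflow: the transcript no longer fits the selected model.
  if (nativeTokens > modelContextWindow) return "model-overflow";
  // Only the local guard is exceeded; the model still has headroom.
  if (nativeTokens > reuseGuardLimit) return "reuse-guard-only";
  return "within-guard";
}

// Log values from above; the 200000 window is an illustrative assumption.
const pressure = classifyTokenPressure(116268, 70000, 200000);
console.log(pressure); // prints "reuse-guard-only"
```

Under these assumed numbers the session is rotated by the guard alone, which is the case this issue argues should not cold-start a compatible thread.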

Current Understanding

  • Pi embedded runner normally reloads and injects bootstrap files every turn, then relies on stable prompt-prefix/provider-cache behavior to amortize repeated bytes.
  • Codex app-server has a different efficient path: persistent native thread reuse. Context engines can return contextProjection.mode = "thread_bootstrap" so OpenClaw injects assembled history once for a stable epoch and then resumes the same native thread.
  • Current lossless-claw main appears designed for this path: it returns thread_bootstrap with an epoch derived from summary context state rather than ordinary fresh-tail growth.
  • #85978 ("Fix Codex native thread reuse for context-engine bootstraps") fixes one bug in this path by preventing the startup size guard from deleting a still-valid bootstrapped binding before compatibility is checked. It now also keeps stale/no-active-engine bindings safe.
  • The broader architecture gap remains for legacy/non-bootstrap sessions, old lossless-claw builds, session-file rollover, blunt compaction invalidation, workspace bootstrap surfaces outside the projection contract, and insufficient rotation diagnostics.

Desired Architecture

For Codex app-server, native-thread rotation should be primarily semantic:

  • rotate on /new or /reset;
  • rotate on model/provider/auth/tool/MCP/app/environment incompatibility;
  • rotate on context-engine policy or projection epoch/fingerprint change;
  • rotate when a saved context-engine binding has no current active context engine;
  • rotate when the app-server reports the native thread is gone or actually overflows;
  • do not rotate solely because a compatible thread_bootstrap native rollout is above a hard-coded 70k guard.
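
The rules above can be sketched as a single decision function for a compatible thread_bootstrap binding. This is only an illustration under stated assumptions: every type, field, and reason name below is hypothetical, the compatibility dimensions are collapsed into one fingerprint string for brevity, and the strict guard for legacy or ownerless sessions is deliberately left out.

```typescript
// Hypothetical shapes: field names are illustrative, not OpenClaw's actual types.
interface ThreadBinding {
  compatFingerprint: string; // model/provider/auth/tool/MCP/app/environment
  engineId: string | null; // context engine that owns the binding, if any
  policyFingerprint: string;
  projectionEpoch: string;
}

interface TurnState {
  resetRequested: boolean; // user issued /new or /reset
  compatFingerprint: string;
  activeEngineId: string | null;
  policyFingerprint: string;
  projectionEpoch: string;
  threadMissing: boolean; // app-server reports the native thread is gone
  overflowed: boolean; // app-server actually rejected the turn on overflow
}

type RotationReason =
  | "user-reset"
  | "compat-mismatch"
  | "engine-missing"
  | "policy-or-epoch-changed"
  | "thread-gone"
  | "overflow";

// Returns the reason to rotate, or null when the native thread can be reused.
function rotationReason(saved: ThreadBinding, turn: TurnState): RotationReason | null {
  if (turn.resetRequested) return "user-reset";
  if (saved.compatFingerprint !== turn.compatFingerprint) return "compat-mismatch";
  if (saved.engineId !== null && turn.activeEngineId === null) return "engine-missing";
  if (
    saved.policyFingerprint !== turn.policyFingerprint ||
    saved.projectionEpoch !== turn.projectionEpoch
  ) {
    return "policy-or-epoch-changed";
  }
  if (turn.threadMissing) return "thread-gone";
  if (turn.overflowed) return "overflow";
  return null; // token count alone never forces rotation in this sketch
}
```

Note that no branch inspects a token count: in this sketch, size only matters once the app-server reports a real overflow.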

The native thread should be treated as a projection cache keyed by stable session/channel identity plus context-engine conversation/projection identity, not only by sessionFile + ".codex-app-server.json".
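
A minimal sketch of such a key, assuming hypothetical field names (`sessionKey`, `conversationId`, `projectionEpoch`) and an in-memory lookup table in place of whatever persistence OpenClaw actually uses:

```typescript
// Hypothetical key shape: these fields are assumptions, not OpenClaw's schema.
interface ProjectionCacheKey {
  sessionKey: string; // stable session/channel identity
  conversationId: string; // context-engine conversation identity
  projectionEpoch: string; // context-engine projection identity
}

function cacheKeyString(key: ProjectionCacheKey): string {
  // JSON-encode the tuple so a separator inside a field cannot cause collisions.
  return JSON.stringify([key.sessionKey, key.conversationId, key.projectionEpoch]);
}

// Semantic key -> native thread id (in-memory stand-in for the saved binding).
const bindings: Record<string, string> = {};

const beforeRollover: ProjectionCacheKey = {
  sessionKey: "discord:channel-1",
  conversationId: "conv-a",
  projectionEpoch: "epoch-1",
};
bindings[cacheKeyString(beforeRollover)] = "native-thread-1";

// After a session-file rollover the file path changes, but the semantic key does not.
const afterRollover: ProjectionCacheKey = { ...beforeRollover };
console.log(bindings[cacheKeyString(afterRollover)]); // prints "native-thread-1"
```

Because the session file path is not part of the key, a rollover that keeps the same conversation and epoch still resolves to the same native thread, while an epoch change produces a different key and therefore a miss.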

Proposed Follow-Ups

  1. Add a Codex native-thread rotation reason enum and diagnostics block.
    Log current/saved engine id, policy fingerprint, epoch/fingerprint, dynamic tools, MCP/app/environment/auth/model fingerprints, token source, native/session tokens, and whether mirrored history was projected.

  2. Make the native reuse guard model/config/context-owner aware.
    Keep strict clearing for legacy or ownerless sessions, but treat compatible context-engine thread_bootstrap sessions as semantically owned by the context engine unless the app-server actually rejects the turn.

  3. Preserve or migrate Codex bindings across LCM/session-file rollover when conversation identity and projection epoch remain compatible.

  4. Add explicit workspace bootstrap fingerprints to Codex thread binding/diagnostics.
    Track stable inherited developer instructions, turn-scoped collaboration instructions, prompt context contributors, and native project-doc loading separately.

  5. Revisit compaction invalidation.
    Successful context-engine-owned compaction currently clears Codex bindings. If compaction does not change projection epoch/fingerprint, native reuse may be preservable.
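
As one possible shape for the diagnostics block in follow-up 1, the sketch below formats a single structured log line. The interface, its field names, and the log prefix are all hypothetical, and only a subset of the fields listed above is included to keep the example short.

```typescript
// Hypothetical diagnostics record: names are illustrative, not an existing OpenClaw type.
interface RotationDiagnostics {
  reason: string; // rotation reason, e.g. a value from a reason enum
  savedEngineId: string | null;
  currentEngineId: string | null;
  savedEpoch: string | null;
  currentEpoch: string | null;
  tokenSource: string; // where the token count came from
  nativeTokens: number;
  mirroredHistoryProjected: boolean;
}

function formatRotationLog(d: RotationDiagnostics): string {
  // Emit key=value pairs in a fixed order so log lines stay easy to grep and diff.
  const fields = [
    "reason=" + d.reason,
    "savedEngineId=" + String(d.savedEngineId),
    "currentEngineId=" + String(d.currentEngineId),
    "savedEpoch=" + String(d.savedEpoch),
    "currentEpoch=" + String(d.currentEpoch),
    "tokenSource=" + d.tokenSource,
    "nativeTokens=" + String(d.nativeTokens),
    "mirroredHistoryProjected=" + String(d.mirroredHistoryProjected),
  ];
  return "codex native thread rotated; " + fields.join(" ");
}

const line = formatRotationLog({
  reason: "policy-or-epoch-changed",
  savedEngineId: "lossless-claw",
  currentEngineId: "lossless-claw",
  savedEpoch: "epoch-1",
  currentEpoch: "epoch-2",
  tokenSource: "native",
  nativeTokens: 116268,
  mirroredHistoryProjected: true,
});
console.log(line);
```

Logging the saved and current values side by side lets a reader see which dimension changed without having to reproduce the session.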

Acceptance Criteria

  • A long-running single-agent Codex/Discord session with stable lossless-claw thread_bootstrap epoch can exceed 70k native rollout tokens without cold-starting every turn.
  • When a turn cold-starts, logs state exactly which semantic or runtime compatibility dimension forced it.
  • If LCM compacts, rotates, or rewrites the transcript, OpenClaw either preserves the compatible Codex binding or logs the exact epoch/policy/session identity reason it could not.
  • /doctor or equivalent status output distinguishes model/provider context overflow from OpenClaw native-thread reuse guard rotation.

Related

Labels

  • P2 - Normal backlog priority with limited blast radius.
  • clawsweeper:fix-shape-clear - ClawSweeper found a clear likely implementation shape for this issue.
  • clawsweeper:needs-live-repro - ClawSweeper needs live local, crabbox, or manual validation to confirm this issue.
  • clawsweeper:needs-maintainer-review - ClawSweeper marked this issue as needing maintainer review before automation.
  • clawsweeper:needs-product-decision - ClawSweeper marked this issue as needing a product or behavior decision.
  • clawsweeper:no-new-fix-pr - ClawSweeper does not recommend queueing a new automated fix PR for this issue.
  • impact:session-state - Session, memory, transcript, context, or agent state can drift or corrupt.
  • issue-rating: 🐚 platinum hermit - Good issue quality with a plausible reproduction path needing some confirmation.
