fix(codex): recover context overflow and budget skip paths#87879
fix(codex): recover context overflow and budget skip paths#87879fuller-stack-dev wants to merge 5 commits into
Conversation
|
Codex review: needs maintainer review before merge. Reviewed May 29, 2026, 10:32 PM ET / 02:32 UTC. Summary PR surface: Source +40, Tests +164. Total +204 across 4 files. Reproducibility: yes. Current main source shows Codex non-manual compaction returns the owned-skip reason while CLI compaction throws on non-fallback native outcomes; the PR body also includes before/after redacted gateway logs for the same failure mode. Review metrics: 1 noteworthy metric.
Merge readiness Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch. Rank-up moves:
Risk before merge
Maintainer options:
Next step before merge
Security Review detailsBest possible solution: Adopt one canonical Codex app-server recovery contract that preserves successful turns, clears stale resumed context-engine bindings after terminal overflow, and retires the overlapping compaction PRs after the chosen path lands. Do we have a high-confidence way to reproduce the issue? Yes. Current main source shows Codex non-manual compaction returns the owned-skip reason while CLI compaction throws on non-fallback native outcomes; the PR body also includes before/after redacted gateway logs for the same failure mode. Is this the best way to solve the issue? Unclear but promising. The code change is narrow and tested, but maintainers still need to choose whether context-engine fallback after a Codex-owned budget skip is the canonical behavior versus the overlapping open PRs. AGENTS.md: found and applied where relevant. Codex review notes: model gpt-5.5, reasoning high; reviewed against 440e737c67dc. Label changesLabel justifications:
Evidence reviewedPR surface: Source +40, Tests +164. Total +204 across 4 files. View PR surface stats
What I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. How this review workflow works
|
fff0954 to
07467ce
Compare
compoodment
left a comment
There was a problem hiding this comment.
Review: upstream guard blocks CLI compaction for Codex sessions entirely
Both commits are correct in isolation, but there is a structural gap upstream of both fixes that leaves Codex sessions without any compaction when the Codex app-server thread binding is absent.
The guard blocks everything
shouldSkipAutomaticCompactionForCodexRuntime() (cli-compaction.ts ~L268) runs before isNativeHarnessCompactionSession and compactNativeHarnessCliTranscript. For any openai-codex provider session where resolveAgentHarnessPolicy returns runtime: "codex", this function returns true and the entire runCliTurnCompactionLifecycle short-circuits — it returns sessionEntry unchanged with a debug log.
This means:
-
Commit 2's fix (
isIntentionalNativeAutoCompactionSkip→skipped: true→ early return) is unreachable for Codex sessions because the outer guard already skipped them. -
When Commit 1 clears the thread binding after terminal overflow, the Codex app-server no longer has a thread to compact. But
shouldSkipAutomaticCompactionForCodexRuntimestill returnstrue, so OpenClaw CLI compaction is also blocked. Nobody handles compaction.
Reproduction
On Dicky's OpenClaw instance (2026.5.27-beta.1, both commits patched in), a Discord session on openai-codex/gpt-5.5 grew to 184 messages / 277k prompt tokens / 22k overflow tokens with compactionCount: 0. The session never compacted because shouldSkipAutomaticCompactionForCodexRuntime returned true every turn.
Suggested fix
When a Codex session has no active thread binding (i.e., readCodexAppServerBinding returns undefined), shouldSkipAutomaticCompactionForCodexRuntime should return false so the CLI compaction lifecycle falls through to context-engine compaction. Alternatively, compactNativeHarnessCliTranscript could detect the missing binding and set fallbackToContextEngine: true so the caller falls through to context-engine compaction instead of returning early on the skipped path.
The current skipped early-return in runCliTurnCompactionLifecycle should also fall through to context-engine compaction rather than returning sessionEntry unchanged — otherwise the session will grow unbounded with no compaction at all.
|
Correction: my earlier review referenced The actual gap I encountered was specific to the beta release, not this PR. The PR itself correctly returns The PR changes are correct as-is. Apologies for the noise. |
…overflow-binding # Conflicts: # extensions/codex/src/app-server/run-attempt.ts
|
Thanks Jason. I landed #88207 as the superset fix in 81505ad. #88207 includes this PR's Codex budget-skip recovery and stale resumed context-engine binding cleanup, then adds the native thread overflow rotation/headroom checks and the maintainer follow-ups from review. Closing this one so we do not keep two divergent fixes open for the same session-state path. |
Summary
This PR fixes two tightly related Codex native-session recovery edges found while replaying a Discord-shaped session after the 5.27 upgrade:
turn/startbut later terminally fails with a context-window overflow while the context engine owns compaction, clear the stale app-server thread binding so the outer retry starts a fresh native Codex thread.codex app-server owns automatic compaction), treat that as a non-fatal skip instead of failing a successful turn after the model already responded.Real behavior proof
OpenClaw 2026.5.28 (3fd6cc0), the gateway health endpoint returned{"ok":true,"status":"live"}, the exact-route smoke returnedstatus=ok,payload.text="OK", andexitCode=0, and the anonymized logs containedskipping codex app-server compaction for non-manual triggerwithout the prior fatal compaction failure.Auto-compaction could not recover this turnmessage did not recur, and the successful assistant response was preserved instead of being converted into a CLI failure.Before Evidence
All identifiers below are anonymized. No personal names, hostnames, channel IDs, session IDs, auth profile IDs, or absolute local paths are included.
Overflow recovery before patch
A production-like replay showed the context-engine projection repeatedly resuming the same stale native thread:
The app-server then reported auto-compaction success, but the next retry still resumed
<old-native-thread>instead of starting fresh. The turn eventually exhausted recovery and surfaced the user-facing error:There was no log evidence that the binding was cleared after the terminal context-window failure, because the overflow happened after
turn/starthad already been accepted.Budget compaction before patch
After the first local fix was installed, a Discord-shaped CLI smoke successfully wrote the assistant response:
But the CLI still exited non-zero during the post-turn budget lifecycle:
So the user-visible response succeeded, then the CLI incorrectly converted the intentional Codex app-server budget skip into a failed command.
After Evidence
Overflow recovery after patch
With the stale binding restored locally for replay, the same Discord-shaped session no longer reused
<old-native-thread>. The retry cleared the binding and started a new native thread:No
context-overflow-diag, exhausted recovery, or auto-compaction recovery failure appeared during the replay. The assistant transcript completed with:Budget compaction after patch
After adding the CLI-side skip handling and installing this branch into the local runtime, the same smoke command completed successfully:
This proves the CLI now preserves the successful turn result when Codex app-server intentionally owns automatic compaction for a budget trigger.
Follow-up Exact-Route Smoke
A follow-up Discord-shaped replay intentionally used the long session that previously triggered the user-facing failure. The first replay reproduced a failure because the shell CLI was patched, but the LaunchAgent gateway was still resolving a separate global OpenClaw install at
2026.5.27.After installing this same PR commit into the gateway-resolved global prefix and restarting the gateway:
Correlated anonymized log evidence from the successful replay:
The old fatal lines did not recur in that successful replay:
Tests
Focused tests: