feat(codex): diagnose native thread lifecycle #86094
Codex review: needs real behavior proof before merge. Reviewed May 27, 2026, 7:18 AM ET / 11:18 UTC.

Summary
PR surface: Source +1087, Tests +1282, Docs +60, Generated 0. Total +2429 across 27 files.
Reproducibility: not applicable as a bug reproduction; this is a feature/config diagnostics PR. The code and tests show the intended paths, but the PR body does not provide real behavior proof for the latest head.
Review metrics: 2 noteworthy metrics.

Merge readiness
Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Review details
Best possible solution: Keep the branch open, require current-head real behavior proof for the latest lifecycle/session-state paths, and merge only after maintainers accept the new config and diagnostic surfaces.
Do we have a high-confidence way to reproduce the issue? Not applicable as a bug reproduction: this is a feature/config diagnostics PR.
Is this the best way to solve the issue? Yes, pending proof: centralizing lifecycle reasons and low-cardinality logs is a maintainable shape for this diagnostic surface. The safer pre-merge path is to keep the default behavior stable while requiring current-head proof and maintainer acceptance of the new config key.
AGENTS.md: found and applied where relevant.
Codex review notes: model gpt-5.5, reasoning high; reviewed against 545ad7f256e2.

What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

95772ec to fe59254

ClawSweeper PR egg 🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat.

fe59254 to 2372e21

@clawsweeper re-review The previous finding was fixed on the current head.

🦞🧹 I asked ClawSweeper to review this item again.

ab34c3c to 5e41475

@clawsweeper re-review The previous field-collision finding is fixed on the current head.

🦞🧹 I asked ClawSweeper to review this item again.

fe0fa44 to 1679056
1679056 to 8e91314

Add agents.defaults.compaction.maxActiveTranscriptTokens for Codex app-server native thread reuse. The default remains 70000 tokens for existing deployments; positive numeric or shorthand token-count values override it, and 0 disables only the proactive token guard while preserving byte limits and semantic binding invalidation. Also skip rollout directory scans when both native guards are disabled, document the setting, regenerate the config baseline hash, and cover rollout/session token sources plus byte-limit preservation in focused tests.
Add Vincent Koc as a co-author for the PR context and review trail. Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
8e91314 to f623fc0

Summary
This is PR #3 in the Codex native-thread reuse stack, after #85978 and #86069. It does not change the 70k default guard, and it does not preserve bindings across context-engine compaction or successor transcript rollover yet. It makes Codex app-server native thread reuse, bypass, rotation, and compaction-invalidation decisions observable first.
Changes:
- codex.native_thread.lifecycle as a trusted diagnostic event.
- extensions/codex/src/app-server/native-thread-diagnostics.ts with stable lifecycle reasons and one helper for trusted events plus low-cardinality logs.
- thread_bootstrap semantic reuse, context-engine/projection/dynamic-tool/MCP/environment/plugin/auth mismatch, native-tool-surface-disabled bypass, missing binding, app-server rejection, invalid image rejection, and context-engine-owned compaction invalidation through the helper.
- thread_bootstrap bindings still bypass proactive token/byte guards; legacy/non-bootstrap bindings still honor configured token/byte guards.

Why this matters
The confusing part of the slowdown is that the 70000 guard is not GPT-5.5's model context limit. It is a local warm-thread reuse threshold. If a legacy Codex native transcript is already at or above that threshold, OpenClaw refuses to reuse the saved native thread and starts a fresh app-server thread. That can make every turn pay the cold-start cost of rebuilding context even when the channel/session itself is fine.

PR #82981 is the historical source for the default: it introduced a guard that rotates stale Codex app-server bindings before resume when the bound native rollout is at or above 70k tokens, because large native Codex rollouts could keep getting resumed after OpenClaw had already compacted/mirrored a smaller transcript. So the default should be read as a proactive warm-thread reuse guard, not as a model-window cap, not as a bootstrap-size cap, and not as a carefully documented LCM reserve.
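
The guard rule described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the function, type, and constant names are hypothetical, and only the 70000 default and the "at or above" comparison come from the description.

```typescript
// Hypothetical sketch: names are illustrative, not OpenClaw's actual API.
// Grounded in the text: the default is 70000 and rotation happens at or above it.
const DEFAULT_MAX_ACTIVE_TRANSCRIPT_TOKENS = 70_000;

type TokenGuardDecision = "resume-warm-thread" | "rotate-to-fresh-thread";

function decideTokenGuard(
  nativeRolloutTokens: number,
  maxActiveTranscriptTokens: number = DEFAULT_MAX_ACTIVE_TRANSCRIPT_TOKENS,
): TokenGuardDecision {
  // Only the local threshold is compared; the model's context window is never
  // consulted, which is why the guard is not a model-window cap.
  return nativeRolloutTokens >= maxActiveTranscriptTokens
    ? "rotate-to-fresh-thread"
    : "resume-warm-thread";
}

console.log(decideTokenGuard(86_000)); // rotate-to-fresh-thread
console.log(decideTokenGuard(42_000)); // resume-warm-thread
```

Under this reading, an 86k legacy transcript rotates on every turn until the threshold is raised or the native rollout shrinks, regardless of how much context the model itself could accept.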
Example failure mode:
- native-token-guard rotates
- thread/start; bootstrap/context is rebuilt

With #86069 config support, maxActiveTranscriptTokens: "120k" preserves an 86k legacy binding, while "50k" rotates a 60k binding. Setting the token guard to 0 disables only proactive token rotation; byte guards and semantic identity mismatches can still rotate.

The best-case long-running context-engine path is different: thread_bootstrap projects the large context once into the fresh Codex thread, then reuses it as long as the context-engine id, policy fingerprint, projection epoch/fingerprint, and dynamic-tool surface still match. This PR emits thread-bootstrap-semantic-reuse only after that semantic match is proven.

    graph TD
        A[OpenClaw turn] --> B{Saved Codex binding?}
        B -->|no| S[Start fresh native thread]
        B -->|yes| C{Semantic identity matches?}
        C -->|no| R[Rotate with mismatch reason]
        C -->|yes legacy/per-turn| G{Native token/byte guard exceeded?}
        G -->|yes| T[Rotate with native-token/native-byte reason]
        G -->|no| U[Resume warm native thread]
        C -->|yes thread_bootstrap| H[Emit thread-bootstrap-semantic-reuse]
        H --> U
        R --> S
        T --> S

Adversarial review fixes in latest head
A four-agent adversarial pass found diagnostics issues rather than behavior changes in the guard itself. Latest head fixes them:
- thread-bootstrap-semantic-reuse events from the early proactive guard bypass; semantic reuse is emitted only after the later identity match.
- native-tool-surface-disabled so transient native-disabled turns are not misclassified as environment-selection-mismatch.
- thread_bootstrap/per_turn binding mode and context-engine metadata when native compaction rejects a stale binding.

Diagnostics
codex.native_thread.lifecycle includes stable reason/action fields plus thread/session/run ids, binding mode, context-engine identity, projection epoch/fingerprint, previous/current fingerprints, token/byte guard values, native/session token counts, prompt/developer-instruction size estimates, and bootstrap/context-engine contribution booleans.

It intentionally does not include raw prompt text, bootstrap file contents, tool args, or secrets. The companion generic log record is lower-cardinality and omits scoped thread/session ids, epoch/fingerprint fields, and path-like/error extra values.

Non-goals
- Changing the 70000 default in this PR.

Real behavior proof
- /Volumes/LEXAR/repos/worktrees/openclaw-codex-semantic-reuse-guard, using the real OpenClaw package runtime with node --import tsx.
- codex.native_thread.lifecycle with action: "rotated", reason: "native-token-guard", token guard values, native/session token counts, and thread/session identity, with no raw prompt text, bootstrap file contents, tool args, or secrets.
- git diff --check locally.

Test Plan
Tests added/extended for the latest head:
- 120k preserving 86k, 50k rotating 60k, 0 disabling proactive token rotation, byte guard pre-read behavior, and sanitized lifecycle log payloads in run-attempt.test.ts
- thread_bootstrap semantic reuse, no false semantic reuse on mismatched over-budget bindings, projection/context-engine mismatch, and overflow rollover successor-binding diagnostics in run-attempt.context-engine.test.ts
- thread_bootstrap binding metadata, auth-profile mismatch, and context-engine-owned compaction invalidation in compact.test.ts
- startOrResumeThread

Remote validation expected on GitHub CI for the pushed head 2372e215a2.
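
For readers following the test plan, the reuse decision and the token-guard cases it lists (120k preserving 86k, 50k rotating 60k, 0 disabling proactive token rotation) can be sketched as below. This is a rough sketch under stated assumptions, not the code under test: every type and function name is hypothetical, only the "k" shorthand is handled, treating 0 as "disabled" for the byte guard is an assumption, and the reason strings marked "assumed" do not appear in the PR text.

```typescript
// Hypothetical sketch of the decision flow; all names are illustrative.
type BindingMode = "thread_bootstrap" | "per_turn";

interface SavedBinding {
  mode: BindingMode;
  semanticIdentityMatches: boolean; // outcome of the identity comparison
  nativeTokens: number;
  nativeBytes: number;
}

interface NativeGuards {
  maxTokens: number; // 0 disables only the proactive token guard
  maxBytes: number; // assumed: 0 disables the byte guard
}

interface Decision {
  action: "started" | "resumed" | "rotated"; // only "rotated" is quoted in the PR
  reason: string;
}

// Accepts a plain number or a "120k"-style shorthand (only "k" is assumed here).
function parseTokenCount(value: number | string): number {
  if (typeof value === "number") return value;
  const match = /^(\d+)(k?)$/i.exec(value.trim());
  if (!match) throw new Error(`Unsupported token count: ${value}`);
  return Number(match[1]) * (match[2] ? 1000 : 1);
}

function decideThreadReuse(
  binding: SavedBinding | undefined,
  guards: NativeGuards,
): Decision {
  if (!binding) return { action: "started", reason: "missing-binding" }; // assumed reason
  if (!binding.semanticIdentityMatches) {
    return { action: "rotated", reason: "semantic-identity-mismatch" }; // assumed reason
  }
  if (binding.mode === "thread_bootstrap") {
    // Reported only after the semantic match is proven; proactive guards are bypassed.
    return { action: "resumed", reason: "thread-bootstrap-semantic-reuse" };
  }
  if (guards.maxTokens > 0 && binding.nativeTokens >= guards.maxTokens) {
    return { action: "rotated", reason: "native-token-guard" };
  }
  if (guards.maxBytes > 0 && binding.nativeBytes >= guards.maxBytes) {
    return { action: "rotated", reason: "native-byte-guard" }; // assumed reason
  }
  return { action: "resumed", reason: "within-guards" }; // assumed reason
}

const legacy86k: SavedBinding = {
  mode: "per_turn",
  semanticIdentityMatches: true,
  nativeTokens: 86_000,
  nativeBytes: 1_000,
};
const raised: NativeGuards = { maxTokens: parseTokenCount("120k"), maxBytes: 0 };
console.log(decideThreadReuse(legacy86k, raised).action); // resumed
```

Checking the token guard before the byte guard is an arbitrary ordering in this sketch; the description only says that either guard can rotate a legacy/per-turn binding.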