Skip to content

fix(codex): prefer native tokens for resume budget#83229

Open
hansolo949 wants to merge 1 commit into
openclaw:mainfrom
hansolo949:fix/codex-context-budget-native-precedence
Open

fix(codex): prefer native tokens for resume budget#83229
hansolo949 wants to merge 1 commit into
openclaw:mainfrom
hansolo949:fix/codex-context-budget-native-precedence

Conversation

@hansolo949

Copy link
Copy Markdown
Contributor

Summary

This is the narrow follow-up for the remaining Codex app-server resume-budget edge case. Upstream main already has the broader native rollout guard and dynamic tool-result cap; this patch only changes token precedence when both native rollout usage and mirrored OpenClaw session totals are present.

Real behavior proof

Behavior or issue addressed: OpenClaw can keep a high mirrored session token total after Codex app-server compaction. Current main reads native last_token_usage, but still evaluates the resume guard with max(sessionTokens, nativeTokens), so a stale mirrored total can rotate a healthy compacted native thread. This patch prefers nativeTokens ?? sessionTokens.

Real environment tested: Han local OpenClaw install on macOS, OpenClaw 2026.5.16-beta.3, Gateway running against the main agent with WebChat/Codex app-server bindings.

Exact steps or command run after this patch: Applied the same one-line runtime patch locally, restarted/checked OpenClaw, verified the installed runtime contains const tokenCount = nativeTokens ?? sessionTokens;, and ran the focused app-server regression lane in this rebased PR worktree.

Evidence after fix: Live terminal output from the local OpenClaw setup after the patch:

$ rg -n "const tokenCount = nativeTokens \\?\\? sessionTokens" /usr/local/lib/node_modules/openclaw/dist/run-attempt-DEhr_oag.js
/usr/local/lib/node_modules/openclaw/dist/run-attempt-DEhr_oag.js:2543:\tconst tokenCount = nativeTokens ?? sessionTokens;

$ openclaw health --json | jq '{ok,eventLoop:.eventLoop,telegram:.channels.telegram.connected}'
{
  "ok": true,
  "eventLoop": {
    "degraded": false,
    "reasons": [],
    "delayP99Ms": 22.8,
    "utilization": 0.021,
    "cpuCoreRatio": 0.022
  },
  "telegram": true
}

Observed result after fix: The local installed runtime now uses native Codex rollout token usage first, so a compacted native thread is not forced to rotate solely because OpenClaw mirrored session totals are stale. Gateway stayed healthy after the patch. The rebased PR regression also passed: Test Files 1 passed (1); Tests 7 passed | 151 skipped (158).

What was not tested: Full repository test suite and a live upstream CI WebChat session. The live local Gateway/runtime check and focused app-server regression were tested.

Supplemental validation

node_modules/.bin/oxfmt --check --threads=1 extensions/codex/src/app-server/run-attempt.ts extensions/codex/src/app-server/run-attempt.test.ts
git diff --check
tmp=$(mktemp); printf '%s\n' '["codex/src/app-server/run-attempt.test.ts"]' > "$tmp"
OPENCLAW_VITEST_INCLUDE_FILE="$tmp" OPENCLAW_VITEST_MAX_WORKERS=1 node scripts/run-vitest.mjs run --config test/vitest/vitest.extensions.config.ts --testNamePattern 'native rollout|stale mirrored'

Supplemental results: oxfmt passed, git diff whitespace passed, focused Vitest passed.

@openclaw-barnacle openclaw-barnacle Bot added extensions: codex size: S proof: supplied External PR includes structured after-fix real behavior proof. labels May 17, 2026
@clawsweeper

clawsweeper Bot commented May 17, 2026

Copy link
Copy Markdown
Contributor

Codex review: needs real behavior proof before merge.

Latest ClawSweeper review: 2026-05-22 22:30 UTC / May 22, 2026, 6:30 PM ET.

Workflow note: Future ClawSweeper reviews update this same comment in place.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

Summary
This PR changes the Codex app-server resume-budget guard to prefer native rollout token usage over mirrored session totals and adds a stale-token regression test.

Reproducibility: yes. source-reproducible: current main computes max(sessionTokens, nativeTokens), so a 96,000 mirrored total with native last_token_usage of 12,000 crosses the 70,000 resume limit and rotates. I did not execute tests in this read-only review.

PR rating
Overall: 🦪 silver shellfish
Proof: 🦪 silver shellfish
Patch quality: 🐚 platinum hermit
Summary: The patch is clean and focused, but merge confidence is capped because real behavior proof does not exercise the changed resume path.

Rank-up moves:

  • Add redacted live terminal output, logs, or a short recording showing a stale mirrored session total with native last_token_usage resuming without thread/start.
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

Real behavior proof
Needs stronger real behavior proof before merge: Terminal proof shows the patched installed file and a healthy Gateway, but it does not show the actual stale mirrored-token resume behavior after the fix. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Risk before merge

  • Contributor proof shows the patched installed runtime and Gateway health, but not the real stale mirrored-token resume path that this PR changes.
  • No tests were executed in this read-only review; validation relies on source inspection and the PR-reported focused regression lane.

Maintainer options:

  1. Decide the mitigation before merge
    Land the narrow native-token precedence fix after adding redacted live terminal/log output or a short recording showing a stale mirrored Codex session resuming instead of rotating, while keeping the focused regression test.
  2. Pause or close
    Do not merge this PR until maintainers decide whether the risk is worth taking.

Next step before merge
Needs contributor-supplied real behavior proof of the stale-token resume path; there is no safe ClawSweeper repair lane for the contributor's local setup.

Security
Cleared: No concrete security or supply-chain concern found in the two-file Codex app-server runtime/test diff.

Review details

Best possible solution:

Land the narrow native-token precedence fix after adding redacted live terminal/log output or a short recording showing a stale mirrored Codex session resuming instead of rotating, while keeping the focused regression test.

Do we have a high-confidence way to reproduce the issue?

Yes, source-reproducible: current main computes max(sessionTokens, nativeTokens), so a 96,000 mirrored total with native last_token_usage of 12,000 crosses the 70,000 resume limit and rotates. I did not execute tests in this read-only review.

Is this the best way to solve the issue?

Yes for the code path: preferring native rollout tokens when present is the narrowest maintainable fix because current app-server code already treats last/current native usage as active context usage. The remaining blocker is proof quality, not the implementation shape.

Label changes:

  • add rating: 🦪 silver shellfish: Current PR rating is 🦪 silver shellfish because proof is 🦪 silver shellfish, patch quality is 🐚 platinum hermit, and The patch is clean and focused, but merge confidence is capped because real behavior proof does not exercise the changed resume path.
  • add status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs stronger real behavior proof before merge: Terminal proof shows the patched installed file and a healthy Gateway, but it does not show the actual stale mirrored-token resume behavior after the fix. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.
  • remove impact:session-state: Current review selected no impact labels.

Label justifications:

  • P2: The PR fixes a bounded Codex app-server session-state edge case with limited blast radius and clear regression coverage.
  • rating: 🦪 silver shellfish: Current PR rating is 🦪 silver shellfish because proof is 🦪 silver shellfish, patch quality is 🐚 platinum hermit, and The patch is clean and focused, but merge confidence is capped because real behavior proof does not exercise the changed resume path.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs stronger real behavior proof before merge: Terminal proof shows the patched installed file and a healthy Gateway, but it does not show the actual stale mirrored-token resume behavior after the fix. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

What I checked:

Likely related people:

  • steipete: Peter Steinberger authored the Codex app-server controls and major app-server module/lifecycle refactors, and shortlog shows the most commits in the central app-server files. (role: feature owner and heavy area contributor; confidence: high; commits: 31a0b7bd42a5, 8d72aafdbb8d, 659bcc5e5b59; files: extensions/codex/src/app-server/run-attempt.ts, extensions/codex/src/app-server/run-attempt.test.ts, extensions/codex/src/app-server/event-projector.ts)
  • cyrusaf: Cyrus Forbes authored the prior merged fix that avoided treating cumulative Codex token usage as current context usage. (role: prior token-usage fix author; confidence: high; commits: 9a94194329de; files: extensions/codex/src/app-server/event-projector.ts, extensions/codex/src/app-server/event-projector.test.ts)
  • vincentkoc: Vincent Koc recently worked on Codex app-server startup/auth handling and shared run-attempt harness coverage around the same runtime path. (role: recent adjacent Codex app-server contributor; confidence: medium; commits: f1cc8f0cfc7c, 859eb0666282, b5dfeaab4ce3; files: extensions/codex/src/app-server/run-attempt.ts, extensions/codex/src/app-server/run-attempt.test.ts)

Codex review notes: model gpt-5.5, reasoning high; reviewed against 8f8638393ef4.

@clawsweeper clawsweeper Bot added P2 Normal backlog priority with limited blast radius. impact:session-state Session, memory, transcript, context, or agent state can drift or corrupt. labels May 17, 2026
@hansolo949

Copy link
Copy Markdown
Contributor Author

@clawsweeper re-review

The Real behavior proof check is passing and the PR body includes live local runtime evidence plus the focused regression lane. Please refresh the durable review verdict against the current proof.

@clawsweeper

clawsweeper Bot commented May 22, 2026

Copy link
Copy Markdown
Contributor

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@clawsweeper clawsweeper Bot added rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. and removed impact:session-state Session, memory, transcript, context, or agent state can drift or corrupt. labels May 22, 2026
@clawsweeper

clawsweeper Bot commented May 22, 2026

Copy link
Copy Markdown
Contributor

ClawSweeper PR egg

🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat.

Where did the egg go?
  • The egg game starts only after the PR passes the real-behavior proof check.
  • Before that, no creature or rarity is rolled. The treat waits for real proof.
  • This is still just collectible flavor: proof affects review readiness, not creature quality.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

extensions: codex P2 Normal backlog priority with limited blast radius. proof: supplied External PR includes structured after-fix real behavior proof. rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. size: S status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant