TLDR
A local live run with a real OPENAI_API_KEY made codex-native-live pass functionally (11/11 scenarios), but the live token-efficiency report failed. Most flagged rows are actually Codex savings versus Pi, so the report currently conflates "large token delta" with "token regression." Two rows are genuine Codex-more-expensive candidates and need triage: streaming-final-integrity and especially runtime-tool-fs-read.
This is not a confirmed correctness bug in the Codex runner. It is a release-confidence blocker for the live token-efficiency gate and a possible Codex efficiency regression in one native read fixture.
Impact if OpenClaw moved fully to Codex today
- Product impact: P2 for token/cost risk if the runtime-tool-fs-read row reflects real default behavior, because Codex used 119,489 tokens and 40 tool calls versus Pi 72,381 tokens and 2 tool calls.
- Product correctness impact: P4 from this evidence; all codex-native-live functional rows passed.
- QA impact: P1 because strict-global confidence cannot pass live token efficiency until the report semantics are clarified and the Codex-more-expensive rows are classified.
Reproduction
Local checkout: /Volumes/LEXAR/repos/openclaw-runtime-parity-rebase
PR head: 210f900ce81e7cf18f9af921b0f1a31cc7f95c0b
Env source: /Users/lume/.openclaw/secrets/openai.env (OPENAI_API_KEY, value not logged)
set -a
source /Users/lume/.openclaw/secrets/openai.env
set +a
OPENCLAW_BUILD_PRIVATE_QA=1 OPENCLAW_ENABLE_PRIVATE_QA_CLI=1 pnpm openclaw qa suite \
--provider-mode live-frontier \
--runtime-pair pi,codex \
--runtime-suite codex-native-live \
--codex-tool-loading direct \
--concurrency 1 \
--model openai/gpt-5.5 \
--alt-model openai/gpt-5.5 \
--output-dir .artifacts/local-live-key-smoke-210f900/instruction-followthrough
OPENCLAW_BUILD_PRIVATE_QA=1 OPENCLAW_ENABLE_PRIVATE_QA_CLI=1 pnpm openclaw qa parity-report \
--repo-root . \
--runtime-axis \
--token-efficiency \
--summary .artifacts/local-live-key-smoke-210f900/instruction-followthrough/qa-suite-summary.json \
--output-dir .artifacts/local-live-key-smoke-210f900/codex-native-live-token-report
Artifacts
- Suite summary: .artifacts/local-live-key-smoke-210f900/instruction-followthrough/qa-suite-summary.json
- Token summary: .artifacts/local-live-key-smoke-210f900/codex-native-live-token-report/qa-runtime-token-efficiency-summary.json
Functional result
codex-native-live passed:
{ "total": 11, "passed": 11, "skipped": 0, "failed": 0 }
The API key fixed the earlier Pi missing-api-key blocker. Both runtimes produced live assistant-message usage.
Token report result
The token report had status=evaluated with usageSource=live-usage rows, and it failed because rows exceeded the delta threshold and were flagged.
Rows where Codex was cheaper but still flagged:
- instruction-followthrough-repo-contract: Codex 23,142 vs Pi 54,627 (-57.6%)
- approval-turn-tool-followthrough: Codex 44,413 vs Pi 54,015 (-17.8%)
- compaction-retry-mutating-tool: Codex 25,195 vs Pi 94,676 (-73.4%)
- runtime-tool-apply-patch: Codex 44,584 vs Pi 72,465 (-38.5%)
- runtime-tool-bash: Codex 44,955 vs Pi 92,394 (-51.3%)
- runtime-tool-exec: Codex 49,639 vs Pi 111,369 (-55.4%)
- runtime-tool-fs-write: Codex 46,318 vs Pi 72,374 (-36.0%)
- runtime-tool-grep: Codex 50,828 vs Pi 92,200 (-44.9%)
Rows where Codex was more expensive:
- streaming-final-integrity: Codex 21,946 vs Pi 17,887 (+22.7%), no tools.
- runtime-tool-fs-read: Codex 119,489 vs Pi 72,381 (+65.1%), Codex made 40 tool calls vs Pi 2.
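The percentages in both lists are consistent with a signed relative delta measured against Pi, that is (codex - pi) / pi. A minimal Python sketch of that arithmetic follows, assuming this is how the report derives the figure. The function name `relative_delta_pct` is illustrative and not taken from the QA tooling.

```python
def relative_delta_pct(codex_tokens: int, pi_tokens: int) -> float:
    """Signed percent change of Codex token usage relative to Pi.

    Negative means Codex used fewer tokens (a saving); positive means
    Codex used more. Assumed formula, not confirmed from the report code.
    """
    return (codex_tokens - pi_tokens) / pi_tokens * 100.0


# Token totals copied from the rows listed in this issue.
rows = {
    "compaction-retry-mutating-tool": (25_195, 94_676),
    "streaming-final-integrity": (21_946, 17_887),
    "runtime-tool-fs-read": (119_489, 72_381),
}

for name, (codex, pi) in rows.items():
    print(f"{name}: {relative_delta_pct(codex, pi):+.1f}%")
```

The sign is what separates a saving from a regression; a gate that compares only the absolute value of this number would flag both directions alike.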
Expected behavior
The live token-efficiency gate should distinguish at least these cases:
- Codex savings versus Pi should probably be reported as savings, not a failing regression, unless the intended policy is symmetric drift rather than Codex-regression detection.
- Codex-more-expensive rows should fail or warn according to a clear threshold policy.
- runtime-tool-fs-read should be investigated as a possible native-tool loop or inefficient live behavior because the tool-call and token delta are both large.
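One way to express the distinction above is a direction-aware classifier. The sketch below is a hypothetical illustration, not the existing parity-report logic: the 15% threshold, the label names, and the functions `classify_row` and `gate_passes` are all assumptions chosen for the example.

```python
from typing import Literal

Verdict = Literal["saving", "within-threshold", "regression"]

# Hypothetical threshold; the real gate's value is not stated in this issue.
THRESHOLD_PCT = 15.0


def classify_row(codex_tokens: int, pi_tokens: int,
                 threshold_pct: float = THRESHOLD_PCT) -> Verdict:
    """Classify one row by the sign and size of its token delta versus Pi."""
    delta_pct = (codex_tokens - pi_tokens) / pi_tokens * 100.0
    if delta_pct > threshold_pct:
        return "regression"   # Codex more expensive: fail or warn
    if delta_pct < -threshold_pct:
        return "saving"       # Codex cheaper: report, do not fail
    return "within-threshold"


def gate_passes(rows: dict[str, tuple[int, int]]) -> bool:
    """Fail the lane only when at least one row is a Codex regression."""
    return all(classify_row(c, p) != "regression" for c, p in rows.values())
```

Under these assumptions, runtime-tool-exec (49,639 vs 111,369) would classify as a saving, while runtime-tool-fs-read (119,489 vs 72,381) would classify as a regression and fail the lane. A symmetric-drift policy would instead treat both labels as failures.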
Actual behavior
The token report fails the whole lane with all large absolute deltas flagged, including rows where Codex is substantially cheaper.
Classification
- Verdict: live token-efficiency finding.
- Confirmed product correctness bug: no.
- Possible product efficiency bug: yes, runtime-tool-fs-read needs triage.
- QA/reporting bug: yes, if savings are not supposed to fail the gate.
Links