[QA-lab] Live token-efficiency gate flags Codex savings and exposes fs.read overhead

## TLDR

In a local live run with a real `OPENAI_API_KEY`, `codex-native-live` passed functionally (`11/11` scenarios), but the live token-efficiency report failed. Eight of the ten flagged rows are actually **Codex savings** versus Pi, so the report currently conflates "large token delta" with "token regression." The other two rows are genuine Codex-more-expensive candidates and need triage: `streaming-final-integrity` and especially `runtime-tool-fs-read`.

This is not a confirmed correctness bug in the Codex runner. It is a release-confidence blocker for the live token-efficiency gate and a possible Codex efficiency regression in one native read fixture.

## Impact if OpenClaw moved fully to Codex today

- Product impact: P2 for token/cost risk if the `runtime-tool-fs-read` row reflects real default behavior, because Codex used `119,489` tokens and `40` tool calls versus Pi's `72,381` tokens and `2` tool calls.
- Product correctness impact: P4 from this evidence; all `codex-native-live` functional rows passed.
- QA impact: P1 because strict-global confidence cannot pass live token efficiency until the report semantics are clarified and the Codex-more-expensive rows are classified.

## Reproduction

Local checkout: `/Volumes/LEXAR/repos/openclaw-runtime-parity-rebase`
PR head: `210f900ce81e7cf18f9af921b0f1a31cc7f95c0b`
Env source: `/Users/lume/.openclaw/secrets/openai.env` (`OPENAI_API_KEY`, value not logged)

```bash
set -a
source /Users/lume/.openclaw/secrets/openai.env
set +a

OPENCLAW_BUILD_PRIVATE_QA=1 OPENCLAW_ENABLE_PRIVATE_QA_CLI=1 pnpm openclaw qa suite \
  --provider-mode live-frontier \
  --runtime-pair pi,codex \
  --runtime-suite codex-native-live \
  --codex-tool-loading direct \
  --concurrency 1 \
  --model openai/gpt-5.5 \
  --alt-model openai/gpt-5.5 \
  --output-dir .artifacts/local-live-key-smoke-210f900/instruction-followthrough

OPENCLAW_BUILD_PRIVATE_QA=1 OPENCLAW_ENABLE_PRIVATE_QA_CLI=1 pnpm openclaw qa parity-report \
  --repo-root . \
  --runtime-axis \
  --token-efficiency \
  --summary .artifacts/local-live-key-smoke-210f900/instruction-followthrough/qa-suite-summary.json \
  --output-dir .artifacts/local-live-key-smoke-210f900/codex-native-live-token-report
```

## Artifacts

- Suite summary: `.artifacts/local-live-key-smoke-210f900/instruction-followthrough/qa-suite-summary.json`
- Token summary: `.artifacts/local-live-key-smoke-210f900/codex-native-live-token-report/qa-runtime-token-efficiency-summary.json`

## Functional result

`codex-native-live` passed:

```json
{ "total": 11, "passed": 11, "skipped": 0, "failed": 0 }
```

The API key fixed the earlier Pi `missing-api-key` blocker. Both runtimes produced live assistant-message usage.

## Token report result

The token report had `status=evaluated` with `usageSource=live-usage` rows. It failed because ten rows were flagged against the token-delta threshold.

Rows where Codex was cheaper but still flagged:

- `instruction-followthrough-repo-contract`: Codex `23,142` vs Pi `54,627` (`-57.6%`)
- `approval-turn-tool-followthrough`: Codex `44,413` vs Pi `54,015` (`-17.8%`)
- `compaction-retry-mutating-tool`: Codex `25,195` vs Pi `94,676` (`-73.4%`)
- `runtime-tool-apply-patch`: Codex `44,584` vs Pi `72,465` (`-38.5%`)
- `runtime-tool-bash`: Codex `44,955` vs Pi `92,394` (`-51.3%`)
- `runtime-tool-exec`: Codex `49,639` vs Pi `111,369` (`-55.4%`)
- `runtime-tool-fs-write`: Codex `46,318` vs Pi `72,374` (`-36.0%`)
- `runtime-tool-grep`: Codex `50,828` vs Pi `92,200` (`-44.9%`)

Rows where Codex was more expensive:

- `streaming-final-integrity`: Codex `21,946` vs Pi `17,887` (`+22.7%`), no tools.
- `runtime-tool-fs-read`: Codex `119,489` vs Pi `72,381` (`+65.1%`), Codex made `40` tool calls vs Pi `2`.
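The percentages above are consistent with a signed delta measured against Pi as the baseline. A minimal sketch of that arithmetic, using the totals from the two lists (the function name `signed_delta_pct` and the `ROWS` table are illustrative, not part of the report tooling):

```python
# Token totals copied from the rows above: (codex_tokens, pi_tokens).
ROWS = {
    "instruction-followthrough-repo-contract": (23_142, 54_627),
    "approval-turn-tool-followthrough": (44_413, 54_015),
    "compaction-retry-mutating-tool": (25_195, 94_676),
    "runtime-tool-apply-patch": (44_584, 72_465),
    "runtime-tool-bash": (44_955, 92_394),
    "runtime-tool-exec": (49_639, 111_369),
    "runtime-tool-fs-write": (46_318, 72_374),
    "runtime-tool-grep": (50_828, 92_200),
    "streaming-final-integrity": (21_946, 17_887),
    "runtime-tool-fs-read": (119_489, 72_381),
}


def signed_delta_pct(codex: int, pi: int) -> float:
    """Codex tokens relative to the Pi baseline, as a signed percentage.

    Negative means Codex was cheaper; positive means Codex was more expensive.
    """
    return round((codex - pi) / pi * 100, 1)


if __name__ == "__main__":
    for name, (codex, pi) in ROWS.items():
        print(f"{name}: {signed_delta_pct(codex, pi):+.1f}%")
```

The sign is what separates the two lists: every row in the first list comes out negative, and the two rows in the second come out positive.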

## Expected behavior

The live token-efficiency gate should distinguish at least these cases:

- Codex savings versus Pi should probably be reported as savings, not a failing regression, unless the intended policy is symmetric drift rather than Codex-regression detection.
- Codex-more-expensive rows should fail or warn according to a clear threshold policy.
- `runtime-tool-fs-read` should be investigated as a possible native-tool loop or inefficient live behavior because the tool-call and token delta are both large.
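To make the policy question concrete, here is a rough sketch of how the two candidate policies would classify the same deltas. Everything in it is an assumption for illustration: the `classify` function, the policy names, and the `15.0` threshold are hypothetical and do not reflect the gate's actual configuration or code.

```python
from typing import Literal

Policy = Literal["symmetric", "codex-regression"]

# Hypothetical threshold for illustration; the real gate's value is not stated here.
THRESHOLD_PCT = 15.0


def classify(delta_pct: float, policy: Policy, threshold_pct: float = THRESHOLD_PCT) -> str:
    """Classify a signed Codex-vs-Pi token delta under a given policy.

    "symmetric" fails any large drift in either direction.
    "codex-regression" fails only when Codex is more expensive and
    reports large negative deltas as savings.
    """
    if abs(delta_pct) <= threshold_pct:
        return "pass"
    if delta_pct > 0:
        return "fail"  # Codex more expensive than Pi under either policy
    return "fail" if policy == "symmetric" else "savings"


if __name__ == "__main__":
    # Deltas taken from the token report rows in this issue.
    deltas = {
        "compaction-retry-mutating-tool": -73.4,
        "approval-turn-tool-followthrough": -17.8,
        "streaming-final-integrity": 22.7,
        "runtime-tool-fs-read": 65.1,
    }
    for name, delta in deltas.items():
        print(name, classify(delta, "symmetric"), classify(delta, "codex-regression"))
```

Under the symmetric policy all four sample rows fail, which matches what the report does today. Under the Codex-regression policy only `streaming-final-integrity` and `runtime-tool-fs-read` fail, and the cheaper rows are reported as savings.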

## Actual behavior

The token report fails the whole lane and flags every large absolute delta, including rows where Codex is substantially cheaper than Pi.

## Classification

- Verdict: live token-efficiency finding.
- Confirmed product correctness bug: no.
- Possible product efficiency bug: yes, `runtime-tool-fs-read` needs triage.
- QA/reporting bug: yes, if savings are not supposed to fail the gate.

## Links

- Umbrella beta.5 confidence issue: #80936
- Live proof tracker: #80397
- Previous live zero-usage guard issue: #80411

