Parent: #80171
Related PR: #80323
Related confidence proof: #80936
Related live proof tracker: #80397
TLDR
Still open. The beta.5 confidence proof did not run live/OAuth lanes, so this live-token guard is not closed by the mock/static proof.
The current mock token report is correctly labeled as an estimate:
{
  "status": "estimated",
  "providerMode": "mock-openai",
  "usageSources": ["mock-estimate"],
  "pass": true
}
This issue is specifically about a failed live-frontier run that previously produced live-usage with 0 vs 0 and still looked like a pass.
Priority: P1 for live proof trust, P4 for mock/static lanes. It does not block the beta.5 mock confidence result in #80936, but it must be fixed or proven fixed before claiming live token-efficiency proof.
What Happened
After a failed live runtime-pair probe, qa parity-report --runtime-axis --token-efficiency emitted a token-efficiency report with:
- Provider mode: live-frontier
- Usage source: live-usage
- Verdict: pass
- Pi total tokens: 0
- Codex total tokens: 0
The runtime parity report correctly failed with failure-mode, but the token report looked like a valid live token-efficiency pass even though no assistant-message usage was captured.
Repro
First run the failing live probe:
OPENCLAW_QA_SUITE_PROGRESS=1 OPENCLAW_BUILD_PRIVATE_QA=1 OPENCLAW_ENABLE_PRIVATE_QA_CLI=1 pnpm openclaw qa suite \
--provider-mode live-frontier \
--runtime-pair pi,codex \
--model openai-codex/gpt-5.5 \
--alt-model openai-codex/gpt-5.5 \
--scenario channel-chat-baseline \
--concurrency 1 \
--allow-failures \
--output-dir .artifacts/qa-e2e/live-runtime-pair-channel-baseline
Then render reports:
OPENCLAW_BUILD_PRIVATE_QA=1 OPENCLAW_ENABLE_PRIVATE_QA_CLI=1 pnpm openclaw qa parity-report \
--repo-root . \
--runtime-axis \
--token-efficiency \
--summary .artifacts/qa-e2e/live-runtime-pair-channel-baseline/qa-suite-summary.json \
--output-dir .artifacts/qa-e2e/live-runtime-pair-channel-baseline/report
Expected
If either runtime cell fails before producing assistant-message usage, the token-efficiency report should be marked unavailable/skipped/fail-for-token-proof rather than reporting a pass with live-usage and 0 vs 0 tokens.
Suggested rule: live token-efficiency can pass only for rows where both runtime cells completed and at least one assistant-message usage record was captured, or where zero usage is explicitly classified as an intentional no-assistant-output scenario.
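A minimal sketch of how the suggested rule could be expressed as a guard, for discussion only. The type and field names (TokenRow, RuntimeCell, assistantUsageRecords, intentionalNoAssistantOutput) and the function canPassLiveTokenProof are hypothetical and are not taken from the actual report schema. It assumes the stricter reading of the rule: both cells must complete even in the intentional zero-usage case, and each cell needs at least one usage record.

```typescript
// Hypothetical shapes; the real report schema may differ.
interface RuntimeCell {
  runtime: string;               // e.g. "pi" or "codex"
  completed: boolean;            // false if the cell failed before finishing
  assistantUsageRecords: number; // captured assistant-message usage records
  totalTokens: number;
}

interface TokenRow {
  scenario: string;
  usageSource: "live-usage" | "mock-estimate";
  cells: RuntimeCell[];
  // Assumed opt-in flag for scenarios that intentionally emit no assistant output.
  intentionalNoAssistantOutput?: boolean;
}

interface Eligibility {
  eligible: boolean;
  reason: string;
}

// Decides whether a row may count toward a live token-efficiency pass.
// It does not compare token totals; it only gates whether a pass is allowed.
function canPassLiveTokenProof(row: TokenRow): Eligibility {
  if (row.usageSource !== "live-usage") {
    return { eligible: false, reason: "not a live-usage row" };
  }
  // Both runtime cells must be present and must have completed.
  if (row.cells.length < 2 || row.cells.some((c) => !c.completed)) {
    return { eligible: false, reason: "runtime cell did not complete" };
  }
  // Zero usage is acceptable only when explicitly classified as intentional.
  if (row.intentionalNoAssistantOutput === true) {
    return { eligible: true, reason: "zero usage is intentional" };
  }
  if (row.cells.some((c) => c.assistantUsageRecords < 1)) {
    return { eligible: false, reason: "no assistant-message usage captured" };
  }
  return { eligible: true, reason: "usage captured for every cell" };
}

// The failed probe from this issue: both cells failed with 0 vs 0 tokens.
const failedProbe: TokenRow = {
  scenario: "channel-chat-baseline",
  usageSource: "live-usage",
  cells: [
    { runtime: "pi", completed: false, assistantUsageRecords: 0, totalTokens: 0 },
    { runtime: "codex", completed: false, assistantUsageRecords: 0, totalTokens: 0 },
  ],
};

console.log(canPassLiveTokenProof(failedProbe));
```

With a guard like this, the 0 vs 0 row above would be rejected before any token comparison runs, so the report could emit unavailable/skipped instead of pass.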
Links
Parent: #80171
Related PR: #80323
Related confidence proof: #80936
Related live proof tracker: #80397