[QA-lab] Token-efficiency report marks failed live zero-usage run as pass #80411

@100yenadmin

Parent: #80171
Related PR: #80323
Related confidence proof: #80936
Related live proof tracker: #80397

TLDR

Still open. The beta.5 confidence proof did not run live/OAuth lanes, so this live-token guard is not closed by the mock/static proof.

The current mock token report is correctly labeled as an estimate:

{
  "status": "estimated",
  "providerMode": "mock-openai",
  "usageSources": ["mock-estimate"],
  "pass": true
}

This issue is specifically about a failed live-frontier run that previously produced a report with usage source live-usage and total tokens of 0 vs 0, and that still looked like a pass.

Priority: P1 for live proof trust, P4 for mock/static lanes. It does not block the beta.5 mock confidence result in #80936, but it must be fixed or proven fixed before claiming live token-efficiency proof.

What Happened

After a failed live runtime-pair probe, qa parity-report --runtime-axis --token-efficiency emitted a token-efficiency report with:

  • Provider mode: live-frontier
  • Usage source: live-usage
  • Verdict: pass
  • Pi total tokens: 0
  • Codex total tokens: 0

The runtime parity report correctly failed with failure-mode, but the token report looked like a valid live token-efficiency pass even though no assistant-message usage was captured.
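
The symptom above could be caught by a sanity check on the rendered report. The following is only a sketch: the `providerMode`, `usageSources`, and `pass` fields mirror the JSON shown earlier, while `piTotalTokens`, `codexTotalTokens`, and the function name are hypothetical and not taken from the actual report schema.

```typescript
// Hypothetical report shape. Only providerMode, usageSources, and pass
// appear in the JSON quoted in this issue; the token fields are assumed names.
interface TokenReport {
  providerMode: string;
  usageSources: string[];
  pass: boolean;
  piTotalTokens: number; // assumed field name
  codexTotalTokens: number; // assumed field name
}

// Returns true when a report claims a live-usage pass but recorded no tokens
// for either runtime, which is the combination described in this issue.
function isSuspiciousLivePass(report: TokenReport): boolean {
  const claimsLiveUsage =
    report.providerMode === "live-frontier" &&
    report.usageSources.includes("live-usage");
  const noTokens =
    report.piTotalTokens === 0 && report.codexTotalTokens === 0;
  return claimsLiveUsage && report.pass && noTokens;
}
```

A check like this only flags the bad combination after the fact; it does not decide what the verdict should have been.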

Repro

First run the failing live probe:

OPENCLAW_QA_SUITE_PROGRESS=1 OPENCLAW_BUILD_PRIVATE_QA=1 OPENCLAW_ENABLE_PRIVATE_QA_CLI=1 pnpm openclaw qa suite \
  --provider-mode live-frontier \
  --runtime-pair pi,codex \
  --model openai-codex/gpt-5.5 \
  --alt-model openai-codex/gpt-5.5 \
  --scenario channel-chat-baseline \
  --concurrency 1 \
  --allow-failures \
  --output-dir .artifacts/qa-e2e/live-runtime-pair-channel-baseline

Then render reports:

OPENCLAW_BUILD_PRIVATE_QA=1 OPENCLAW_ENABLE_PRIVATE_QA_CLI=1 pnpm openclaw qa parity-report \
  --repo-root . \
  --runtime-axis \
  --token-efficiency \
  --summary .artifacts/qa-e2e/live-runtime-pair-channel-baseline/qa-suite-summary.json \
  --output-dir .artifacts/qa-e2e/live-runtime-pair-channel-baseline/report

Expected

If either runtime cell fails before producing assistant-message usage, the token-efficiency report should be marked unavailable/skipped/fail-for-token-proof rather than reported as a pass with live-usage and 0 vs 0 tokens.

Suggested rule: live token-efficiency can pass only for rows where both runtime cells completed and at least one assistant-message usage record was captured, or where zero usage is explicitly classified as an intentional no-assistant-output scenario.
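
A minimal sketch of the suggested rule, under stated assumptions: the types, field names, and function name here are hypothetical and do not come from the QA-lab code. It also reads "at least one assistant-message usage record" as a per-cell requirement, which is a stricter interpretation than the wording strictly demands.

```typescript
// Hypothetical per-runtime cell; all field names are assumptions.
interface RuntimeCell {
  runtime: string; // e.g. "pi" or "codex"
  completed: boolean; // the cell finished without a runtime failure
  assistantUsageRecords: number; // count of captured assistant-message usage records
}

// Hypothetical row of a live token-efficiency report.
interface LiveRow {
  cells: RuntimeCell[];
  // Set only when the scenario is explicitly classified as producing
  // no assistant output, so zero usage is expected.
  intentionalNoAssistantOutput?: boolean;
}

type Eligibility = "eligible" | "unavailable";

// A row may be considered for a live pass only if every cell completed and
// usage was captured, or zero usage is explicitly intentional.
function liveTokenEligibility(row: LiveRow): Eligibility {
  if (row.cells.length === 0) return "unavailable";
  const allCompleted = row.cells.every((cell) => cell.completed);
  if (!allCompleted) return "unavailable";
  if (row.intentionalNoAssistantOutput === true) return "eligible";
  const usageCaptured = row.cells.every(
    (cell) => cell.assistantUsageRecords > 0,
  );
  return usageCaptured ? "eligible" : "unavailable";
}
```

With a gate like this, the failed probe in the repro would yield "unavailable" for the row, and the pass/fail comparison would run only on eligible rows.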
