[QA-lab] Token-efficiency report marks failed live zero-usage run as pass #80411

Parent: #80171
Related PR: #80323
Related confidence proof: #80936
Related live proof tracker: #80397

# TLDR

**Still open.** The beta.5 confidence proof did not run live/OAuth lanes, so this live-token guard is not closed by the mock/static proof.

The current mock token report is correctly labeled as an estimate:

```json
{
  "status": "estimated",
  "providerMode": "mock-openai",
  "usageSources": ["mock-estimate"],
  "pass": true
}
```

This issue is specifically about a failed **live-frontier** run whose token report previously showed `live-usage` with `0 vs 0` total tokens and still looked like a pass.

**Priority:** P1 for live proof trust, P4 for mock/static lanes. It does not block the beta.5 mock confidence result in #80936, but it must be fixed or proven fixed before claiming live token-efficiency proof.

# What Happened

After a failed live runtime-pair probe, `qa parity-report --runtime-axis --token-efficiency` emitted a token-efficiency report with:

- Provider mode: `live-frontier`
- Usage source: `live-usage`
- Verdict: `pass`
- Pi total tokens: `0`
- Codex total tokens: `0`

The runtime parity report correctly failed with `failure-mode`, but the token report looked like a valid live token-efficiency pass even though no assistant-message usage was captured.

# Repro

First run the failing live probe:

```bash
OPENCLAW_QA_SUITE_PROGRESS=1 OPENCLAW_BUILD_PRIVATE_QA=1 OPENCLAW_ENABLE_PRIVATE_QA_CLI=1 pnpm openclaw qa suite \
  --provider-mode live-frontier \
  --runtime-pair pi,codex \
  --model openai-codex/gpt-5.5 \
  --alt-model openai-codex/gpt-5.5 \
  --scenario channel-chat-baseline \
  --concurrency 1 \
  --allow-failures \
  --output-dir .artifacts/qa-e2e/live-runtime-pair-channel-baseline
```

Then render reports:

```bash
OPENCLAW_BUILD_PRIVATE_QA=1 OPENCLAW_ENABLE_PRIVATE_QA_CLI=1 pnpm openclaw qa parity-report \
  --repo-root . \
  --runtime-axis \
  --token-efficiency \
  --summary .artifacts/qa-e2e/live-runtime-pair-channel-baseline/qa-suite-summary.json \
  --output-dir .artifacts/qa-e2e/live-runtime-pair-channel-baseline/report
```

# Expected

If either runtime cell fails before producing assistant-message usage, the token-efficiency report should be marked unavailable, skipped, or fail-for-token-proof rather than `pass live-usage 0 vs 0`.

Suggested rule: live token-efficiency can pass only for rows where both runtime cells completed and at least one assistant-message usage record was captured, or where zero usage is explicitly classified as an intentional no-assistant-output scenario.
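
A minimal sketch of the suggested rule, to make the gating order concrete. Everything here is hypothetical: the `RuntimeCell` and `TokenRow` shapes, the field names, and `liveTokenEligibility` are illustrative and are not taken from the actual QA-lab report schema. It also assumes the stricter reading of the rule, where each runtime cell (not just the row as a whole) must have at least one usage record, and where a failed cell is never eligible even in an intentional no-output scenario.

```typescript
// Hypothetical shapes for illustration only; not the real report schema.
interface RuntimeCell {
  runtime: string;          // e.g. "pi" or "codex"
  completed: boolean;       // true if the runtime cell finished without failure
  usageRecordCount: number; // assistant-message usage records captured
}

interface TokenRow {
  cells: RuntimeCell[];
  // Explicit opt-in for scenarios that intentionally produce no assistant output.
  intentionalNoAssistantOutput?: boolean;
}

interface Eligibility {
  eligible: boolean;
  reason: string;
}

// Decides whether a live row may be scored for token efficiency at all.
// It does not compute the efficiency verdict itself.
function liveTokenEligibility(row: TokenRow): Eligibility {
  // Assumption: a failed cell disqualifies the row before anything else.
  const failed = row.cells.filter((c) => !c.completed);
  if (failed.length > 0) {
    const names = failed.map((c) => c.runtime).join(", ");
    return { eligible: false, reason: `runtime cell did not complete: ${names}` };
  }

  // Zero usage is acceptable only when explicitly classified as intentional.
  if (row.intentionalNoAssistantOutput === true) {
    return { eligible: true, reason: "intentional no-assistant-output scenario" };
  }

  // Assumption: every cell needs at least one usage record.
  const missing = row.cells.filter((c) => c.usageRecordCount < 1);
  if (missing.length > 0) {
    const names = missing.map((c) => c.runtime).join(", ");
    return { eligible: false, reason: `no assistant-message usage captured: ${names}` };
  }

  return { eligible: true, reason: "usage captured for all runtime cells" };
}

// The 0 vs 0 case from this issue: one cell failed, neither captured usage.
const zeroVsZero = liveTokenEligibility({
  cells: [
    { runtime: "pi", completed: false, usageRecordCount: 0 },
    { runtime: "codex", completed: true, usageRecordCount: 0 },
  ],
});
console.log(zeroVsZero.eligible); // false
```

With a gate like this, a row that is not eligible would be rendered as unavailable or failed for token proof instead of falling through to a token comparison that trivially passes on `0 vs 0`.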

# Links

- Mock/static confidence proof: #80936
- Live proof tracker: #80397

