[QA-lab] Complete live-frontier token-efficiency and Testbox parity proof #80397

@100yenadmin

Parent: #80171
Related PR: #80323
Confidence proof tracker: #80936
Related plugin wrapper issue: #80365
Related harness correction: #80319
Related scheduled soak tracker: #80433
Related live token guard: #80411

TLDR

Still open. The beta.5 mock/static confidence proof for PR #80323 is green, but live-frontier token efficiency and scheduled/Testbox proof remain incomplete.

Latest proof:

OpenClaw baseline: v2026.5.10-beta.5
PR head: 3336dec6419c9cc9a87dc7cfa6f48118ca2d838e
Remote proof run: https://github.com/electricsheephq/openclaw-local-test/actions/runs/25719383976
Strict confidence: pass=true, zeroUnknowns=true

The confidence report classified live lanes as environment-blocked, not passed.

Why This Issue Exists

The runtime/prompt/tool parity harness now has artifact-backed mock/static proof across the implemented suites. That does not replace the live/Testbox proof requested by the expansion plan.

This issue tracks the remaining validation gap so the project does not accidentally treat mock-estimate token efficiency as live token truth.

Completed Beta.5 Mock/Static Proof

From run 25719383976:

{
  "tool-defaults-direct": { "total": 20, "passed": 20, "skipped": 0, "failed": 0 },
  "openclaw-dynamic-tools-direct": { "total": 8, "passed": 8, "skipped": 0, "failed": 0 },
  "tool-defaults-searchable": { "total": 20, "passed": 15, "skipped": 5, "failed": 0 },
  "first-hour-20-direct": { "total": 18, "passed": 15, "skipped": 3, "failed": 0 },
  "fault-injection-mock": { "total": 5, "passed": 3, "skipped": 2, "failed": 0 },
  "jsonl-expanded": { "curatedTranscripts": 7, "turnsCompared": 15, "driftedTurns": 0 },
  "confidence-self-test": { "pass": true, "detectedCanaries": "7/7" }
}
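The per-suite counts above can be cross-checked mechanically. The following is a minimal Python sketch, not the harness's own code: it assumes the summary is available as JSON in exactly the shape shown above, and the helper name `tally_suites` is hypothetical. It checks that passed + skipped + failed equals total for each counted suite and aggregates the counts, so that skipped scenarios stay visible rather than being read as passes.

```python
import json

# Summary copied verbatim from run 25719383976 as quoted in this issue.
SUMMARY_JSON = """
{
  "tool-defaults-direct": { "total": 20, "passed": 20, "skipped": 0, "failed": 0 },
  "openclaw-dynamic-tools-direct": { "total": 8, "passed": 8, "skipped": 0, "failed": 0 },
  "tool-defaults-searchable": { "total": 20, "passed": 15, "skipped": 5, "failed": 0 },
  "first-hour-20-direct": { "total": 18, "passed": 15, "skipped": 3, "failed": 0 },
  "fault-injection-mock": { "total": 5, "passed": 3, "skipped": 2, "failed": 0 },
  "jsonl-expanded": { "curatedTranscripts": 7, "turnsCompared": 15, "driftedTurns": 0 },
  "confidence-self-test": { "pass": true, "detectedCanaries": "7/7" }
}
"""


def tally_suites(summary):
    """Hypothetical helper: aggregate counts for suites that report a 'total'.

    Returns (totals, inconsistent), where inconsistent lists suites whose
    passed + skipped + failed does not add up to total.
    """
    totals = {"total": 0, "passed": 0, "skipped": 0, "failed": 0}
    inconsistent = []
    for name, counts in summary.items():
        if "total" not in counts:
            # jsonl-expanded and confidence-self-test use a different shape.
            continue
        accounted = counts["passed"] + counts["skipped"] + counts["failed"]
        if accounted != counts["total"]:
            inconsistent.append(name)
        for key in totals:
            totals[key] += counts[key]
    return totals, inconsistent


summary = json.loads(SUMMARY_JSON)
totals, inconsistent = tally_suites(summary)
print(totals)
print("inconsistent suites:", inconsistent)
```

On the numbers quoted above, this yields 71 scenarios in total: 61 passed, 10 skipped, 0 failed, with no inconsistent suites. The 10 skipped scenarios are worth keeping in view, since a skip is not proof.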

Token-efficiency artifact from the mock lane:

{
  "status": "estimated",
  "providerMode": "mock-openai",
  "usageSources": ["mock-estimate"],
  "rows": 18,
  "pass": true
}
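To make the distinction between estimated and live token data concrete, here is a minimal Python sketch of a check that refuses to treat the artifact above as live proof. It is an illustration under assumptions, not the project's implementation: the function name `is_live_token_proof` is hypothetical, the `live-usage` value is taken from the "Remaining Proof Needed" list in this issue, and it is assumed that a live artifact would use the same top-level fields as the mock one.

```python
def is_live_token_proof(artifact):
    """Hypothetical check: True only when every usage source is live.

    Assumes a live artifact shares the mock artifact's field names
    ("status", "usageSources", "pass"); that layout is not confirmed.
    """
    sources = artifact.get("usageSources", [])
    return (
        artifact.get("pass") is True
        and artifact.get("status") != "estimated"
        and bool(sources)  # an empty source list proves nothing
        and all(source == "live-usage" for source in sources)
    )


# The mock-lane artifact quoted in this issue.
mock_artifact = {
    "status": "estimated",
    "providerMode": "mock-openai",
    "usageSources": ["mock-estimate"],
    "rows": 18,
    "pass": True,
}

print(is_live_token_proof(mock_artifact))
```

The mock artifact is rejected even though its `pass` field is true, because both its `status` and its `usageSources` mark it as an estimate.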

Remaining Proof Needed

  • Run codex-native-live with live/OAuth credentials and attach qa-suite-summary.json.
  • Run first-hour-live with live/OAuth credentials and attach qa-suite-summary.json.
  • Generate live token-efficiency from assistant-message usage and confirm usageSource=live-usage.
  • Run or schedule soak-100 in Testbox/scheduled infrastructure and attach artifacts.
  • Keep #80411 ("[QA-lab] Token-efficiency report marks failed live zero-usage run as pass") open until a failed live run with zero usage can no longer be reported as a valid token-efficiency pass.
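The live token-efficiency items above amount to a row-level guard. The sketch below shows one way such a guard could look in Python. It is not the QA-lab code: the function name `validate_live_rows` and the row fields `inputTokens` and `outputTokens` are hypothetical, and only the `usageSource=live-usage` requirement comes from this issue.

```python
def validate_live_rows(rows):
    """Hypothetical guard: return a list of problems; empty means acceptable.

    Assumes each row is a dict with "usageSource", "inputTokens", and
    "outputTokens". The token field names are illustrative only.
    """
    problems = []
    if not rows:
        problems.append("no rows: an empty report proves nothing")
    for index, row in enumerate(rows):
        source = row.get("usageSource")
        if source != "live-usage":
            problems.append(
                f"row {index}: usageSource is {source!r}, expected 'live-usage'"
            )
        tokens = row.get("inputTokens", 0) + row.get("outputTokens", 0)
        if tokens <= 0:
            # A failed live run that reports zero usage must not count as a pass.
            problems.append(f"row {index}: zero token usage")
    return problems


# Illustrative rows, not real run data.
example_rows = [
    {"usageSource": "live-usage", "inputTokens": 1200, "outputTokens": 340},
    {"usageSource": "live-usage", "inputTokens": 0, "outputTokens": 0},
    {"usageSource": "mock-estimate", "inputTokens": 900, "outputTokens": 100},
]

for problem in validate_live_rows(example_rows):
    print(problem)
```

Returning a list of problems rather than a single boolean keeps the failure reason attached to the report, which is the gap #80411 describes: a zero-usage run should fail loudly instead of passing silently.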

Guardrail

Mock-mode token efficiency must remain clearly labeled as an estimate. Do not use mock-mode token rows as live-token proof.

Labels: stale (marked as stale due to inactivity)
