[QA-lab] Codex runtime parity beta.5 confidence proof for PR #80323 #80936

@100yenadmin

TLDR

PR #80323 has a beta.5 confidence proof run with zero unknowns in the defined mock/static matrix.

  • OpenClaw baseline: v2026.5.10-beta.5
  • PR head validated by workflow: 3336dec6419c9cc9a87dc7cfa6f48118ca2d838e
  • Remote proof run: https://github.com/electricsheephq/openclaw-local-test/actions/runs/25719383976
  • Strict confidence report: pass=true, zeroUnknowns=true
  • Current product-bug verdict: no confirmed Codex runner product bug from the mock proof lanes
  • Remaining proof gaps: live/OAuth Codex-native lanes, live token efficiency, and scheduled/Testbox soak-100

Why This Issue Exists

This is the maintainer-facing confidence tracker for PR #80323. It ties the beta.5 proof artifacts back to the original RFC (#80171), the corrected tool-defaults harness issue (#80319), and the remaining live/Testbox proof trackers (#80397, #80433).

The goal is not to claim that every possible OpenClaw behavior is proven. The bar is narrower but strict: every lane in the defined confidence manifest must either pass or carry a classified, artifact-backed verdict. This run met that bar for the mock/static matrix.
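The pass-or-classified rule above can be sketched as a small check. This is only an illustration: the `Lane` record, its field names, and the `strict_confidence` helper are hypothetical and do not reflect the real confidence report schema, which this issue does not show. The sketch also assumes that any failed lane fails the report even when it is classified.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Lane:
    # Hypothetical lane record; field names are assumptions, not the real schema.
    name: str
    status: str                           # "passed", "blocked", or "failed"
    classification: Optional[str] = None  # e.g. "environment-blocked"
    artifact: Optional[str] = None        # path or URL of the backing artifact


def is_resolved(lane: Lane) -> bool:
    """A lane is resolved if it passed, or if its verdict is classified and artifact-backed."""
    if lane.status == "passed":
        return True
    return bool(lane.classification) and bool(lane.artifact)


def strict_confidence(lanes: List[Lane]) -> dict:
    """Summarize a manifest: unresolved lanes count as unknown; failed lanes fail the report."""
    unknown = [lane.name for lane in lanes if not is_resolved(lane)]
    failed = [lane.name for lane in lanes if lane.status == "failed"]
    return {
        "pass": not unknown and not failed,
        "zeroUnknowns": not unknown,
        "unknown": unknown,
        "failed": failed,
    }
```

Under these assumptions, a skipped live lane still counts as resolved as long as it carries both a classification and an artifact reference; a blocked lane with neither would surface as an unknown and flip `zeroUnknowns` to false.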

Evidence

Remote workflow:

QA Runtime Confidence Proof
run: 25719383976
repo: electricsheephq/openclaw-local-test
target_ref: codex-vs-pi-runtime-parity-tools
expected_sha: 3336dec6419c9cc9a87dc7cfa6f48118ca2d838e
run_soak: false
run_live: false

Remote job results:

Static and targeted QA unit proof: success
- pnpm check:test-types: success
- pnpm lint --threads=8: success
- targeted QA-lab/Codex dynamic-tools tests: success

Mock confidence proof bundle: success
- tool-defaults direct: success
- openclaw-dynamic-tools direct: success
- tool-defaults searchable: success
- first-hour-20 direct: success
- first-hour-20 token report: success
- fault-injection mock: success
- expanded JSONL replay: success
- confidence negative controls: success
- strict confidence report: success

Live confidence proof lanes: skipped at dispatch (run_live: false); classified as environment-blocked in the confidence report.

Downloaded artifact root used for inspection:

/Volumes/LEXAR/Codex/qa-runtime-confidence-artifacts-25719383976/qa-runtime-confidence-mock-3336dec6419c9cc9a87dc7cfa6f48118ca2d838e/

Machine-Readable Results

{
  "tool-defaults-direct": { "total": 20, "passed": 20, "skipped": 0, "failed": 0 },
  "openclaw-dynamic-tools-direct": { "total": 8, "passed": 8, "skipped": 0, "failed": 0 },
  "tool-defaults-searchable": { "total": 20, "passed": 15, "skipped": 5, "failed": 0 },
  "first-hour-20-direct": { "total": 18, "passed": 15, "skipped": 3, "failed": 0 },
  "fault-injection-mock": { "total": 5, "passed": 3, "skipped": 2, "failed": 0 },
  "jsonl-expanded": { "curatedTranscripts": 7, "turnsCompared": 15, "driftedTurns": 0 },
  "confidence-self-test": { "pass": true, "detectedCanaries": "7/7" },
  "confidence-report": {
    "pass": true,
    "zeroUnknowns": true,
    "lanes": 12,
    "passed": 8,
    "blocked": 4,
    "unknown": 0,
    "failed": 0
  }
}
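The counts above can be cross-checked mechanically. The sketch below copies the numbers from this issue and verifies that each lane's total equals passed + skipped + failed, and that the report's lane count equals passed + blocked + unknown + failed. The helper names are hypothetical and not part of the QA-lab tooling.

```python
# Per-lane counts copied from the results in this issue.
LANE_RESULTS = {
    "tool-defaults-direct": {"total": 20, "passed": 20, "skipped": 0, "failed": 0},
    "openclaw-dynamic-tools-direct": {"total": 8, "passed": 8, "skipped": 0, "failed": 0},
    "tool-defaults-searchable": {"total": 20, "passed": 15, "skipped": 5, "failed": 0},
    "first-hour-20-direct": {"total": 18, "passed": 15, "skipped": 3, "failed": 0},
    "fault-injection-mock": {"total": 5, "passed": 3, "skipped": 2, "failed": 0},
}

# Summary counts copied from the "confidence-report" entry.
REPORT = {"lanes": 12, "passed": 8, "blocked": 4, "unknown": 0, "failed": 0}


def inconsistent_lanes(results: dict) -> list:
    """Return the names of lanes whose total is not passed + skipped + failed."""
    return [
        name
        for name, row in results.items()
        if row["total"] != row["passed"] + row["skipped"] + row["failed"]
    ]


def report_consistent(report: dict) -> bool:
    """Check that every lane is counted in exactly one outcome bucket."""
    buckets = report["passed"] + report["blocked"] + report["unknown"] + report["failed"]
    return report["lanes"] == buckets


if __name__ == "__main__":
    print("inconsistent lanes:", inconsistent_lanes(LANE_RESULTS))
    print("report consistent:", report_consistent(REPORT))
```

A check like this only confirms that the reported numbers add up; it says nothing about whether the underlying lanes exercised the right behavior.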

Classification

Product Impact If OpenClaw Moved Fully To Codex Today

  • P4 for the old broad tool-defaults claims: OpenClaw does not expose duplicate dynamic read/write/edit/apply_patch/exec/update_plan tools because Codex owns those natively. The missing exposure is intentional, not a Codex product bug.
  • P1 proof gap for live Codex-native behavior: approval/read/write/compaction rows still need native/live proof before being used as product evidence.
  • P3 proof gap for soak-100: optional long-run coverage needs scheduled/Testbox artifacts, but it is not part of the default maintainer gate.

QA Impact

  • P0 resolved for deterministic mock CI gate: tool-defaults direct, openclaw-dynamic-tools direct, and first-hour-20 direct have 0 hard failures.
  • P1 still open for live proof: token efficiency from the mock lanes is labeled mock-estimate; real live usage is tracked separately.
  • P2 searchable/deferred mock limitation: searchable rows are report-only until the mock provider can model deferred Codex tool discovery honestly.

Important Boundaries

  • Mock-only failures are not Codex runner product bugs unless reproduced through native/live Codex behavior or source-level proof independent of the mock provider.
  • Codex-native workspace tools remain native-owned and must not be duplicated as OpenClaw dynamic tools in production.
  • OpenClaw integration tools are still tested through the dynamic openclaw bridge and passed the direct mock lane.
  • Token efficiency in this proof is mock-estimate, not live usage.
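The first boundary above amounts to a triage rule, which can be sketched as follows. The function name, parameters, and verdict labels are hypothetical and chosen for illustration; they are not taken from the confidence report.

```python
def classify_failure(seen_in_mock: bool, reproduced_live: bool, source_level_proof: bool) -> str:
    """Triage a failure according to the mock-only boundary.

    A failure counts as a Codex runner product bug only if it is reproduced
    through native/live Codex behavior or backed by source-level proof that
    does not depend on the mock provider.
    """
    if reproduced_live or source_level_proof:
        return "product-bug"
    if seen_in_mock:
        return "mock-only"  # harness or mock-provider limitation until proven otherwise
    return "no-failure"
```

For example, `classify_failure(True, False, False)` yields `"mock-only"`, so a row that fails only under the mock provider stays out of the product-bug column until a live lane or a source-level argument reproduces it.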
