[QA-lab] Codex runtime parity beta.5 confidence proof for PR #80323

# TLDR

The beta.5 confidence proof run for PR #80323 finished with **zero unknowns in the defined mock/static matrix**.

- OpenClaw baseline: `v2026.5.10-beta.5`
- PR head validated by workflow: `3336dec6419c9cc9a87dc7cfa6f48118ca2d838e`
- Remote proof run: https://github.com/electricsheephq/openclaw-local-test/actions/runs/25719383976
- Strict confidence report: `pass=true`, `zeroUnknowns=true`
- Current product-bug verdict: **no confirmed Codex runner product bug from the mock proof lanes**
- Remaining proof gaps: live/OAuth Codex-native lanes, live token efficiency, and scheduled/Testbox `soak-100`

# Why This Issue Exists

This is the maintainer-facing confidence tracker for PR #80323. It ties the beta.5 proof artifacts back to the original RFC (#80171), the corrected tool-defaults harness issue (#80319), and the remaining live/Testbox proof trackers (#80397, #80433).

The goal is not to claim that every possible OpenClaw behavior is proven. The goal is narrower and stricter: every lane in the defined confidence manifest must either pass or carry a classified, artifact-backed verdict. This run achieved that for the mock/static matrix.
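The rule above can be sketched as a small check. This is a minimal illustration, not the actual confidence-report implementation: the `Lane` type, the verdict strings, and the `artifact` field are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Lane:
    """One row of a confidence manifest (hypothetical shape)."""
    name: str
    verdict: str                    # assumed labels: "pass", "blocked", "failed", "unknown"
    artifact: Optional[str] = None  # path or URL backing a non-pass verdict


def is_accounted_for(lane: Lane) -> bool:
    """A lane counts if it passed, or if it has a classified verdict with an artifact."""
    if lane.verdict == "pass":
        return True
    return lane.verdict != "unknown" and bool(lane.artifact)


def zero_unknowns(lanes: List[Lane]) -> bool:
    """True only when every lane in the manifest is accounted for."""
    return all(is_accounted_for(lane) for lane in lanes)


# Lane names and artifact path below are placeholders for illustration.
lanes = [
    Lane("mock-lane", "pass"),
    Lane("live-lane", "blocked", artifact="confidence-report.json"),
]
print(zero_unknowns(lanes))  # prints True
```

Under this sketch, a skipped live lane does not break the zero-unknowns property as long as it is classified (for example as environment-blocked) and points at an artifact. A lane with no verdict, or a classification with nothing backing it, does.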

# Evidence

Remote workflow:

```text
QA Runtime Confidence Proof
run: 25719383976
repo: electricsheephq/openclaw-local-test
target_ref: codex-vs-pi-runtime-parity-tools
expected_sha: 3336dec6419c9cc9a87dc7cfa6f48118ca2d838e
run_soak: false
run_live: false
```

Remote job results:

```text
Static and targeted QA unit proof: success
- pnpm check:test-types: success
- pnpm lint --threads=8: success
- targeted QA-lab/Codex dynamic-tools tests: success

Mock confidence proof bundle: success
- tool-defaults direct: success
- openclaw-dynamic-tools direct: success
- tool-defaults searchable: success
- first-hour-20 direct: success
- first-hour-20 token report: success
- fault-injection mock: success
- expanded JSONL replay: success
- confidence negative controls: success
- strict confidence report: success

Live confidence proof lanes: skipped by dispatch; classified as environment-blocked in the confidence report.
```

Downloaded artifact root used for inspection:

```text
/Volumes/LEXAR/Codex/qa-runtime-confidence-artifacts-25719383976/qa-runtime-confidence-mock-3336dec6419c9cc9a87dc7cfa6f48118ca2d838e/
```

# Machine-Readable Results

```json
{
  "tool-defaults-direct": { "total": 20, "passed": 20, "skipped": 0, "failed": 0 },
  "openclaw-dynamic-tools-direct": { "total": 8, "passed": 8, "skipped": 0, "failed": 0 },
  "tool-defaults-searchable": { "total": 20, "passed": 15, "skipped": 5, "failed": 0 },
  "first-hour-20-direct": { "total": 18, "passed": 15, "skipped": 3, "failed": 0 },
  "fault-injection-mock": { "total": 5, "passed": 3, "skipped": 2, "failed": 0 },
  "jsonl-expanded": { "curatedTranscripts": 7, "turnsCompared": 15, "driftedTurns": 0 },
  "confidence-self-test": { "pass": true, "detectedCanaries": "7/7" },
  "confidence-report": {
    "pass": true,
    "zeroUnknowns": true,
    "lanes": 12,
    "passed": 8,
    "blocked": 4,
    "unknown": 0,
    "failed": 0
  }
}
```
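As a sanity check on the numbers above, the counts in each row should add up. The sketch below copies the JSON from this issue and verifies that `total = passed + skipped + failed` for scenario lanes and `lanes = passed + blocked + unknown + failed` for the confidence report. The helper name `count_mismatches` is hypothetical and not part of the QA-lab tooling.

```python
import json

# Copied from the machine-readable results in this issue.
RESULTS = json.loads("""
{
  "tool-defaults-direct": { "total": 20, "passed": 20, "skipped": 0, "failed": 0 },
  "openclaw-dynamic-tools-direct": { "total": 8, "passed": 8, "skipped": 0, "failed": 0 },
  "tool-defaults-searchable": { "total": 20, "passed": 15, "skipped": 5, "failed": 0 },
  "first-hour-20-direct": { "total": 18, "passed": 15, "skipped": 3, "failed": 0 },
  "fault-injection-mock": { "total": 5, "passed": 3, "skipped": 2, "failed": 0 },
  "jsonl-expanded": { "curatedTranscripts": 7, "turnsCompared": 15, "driftedTurns": 0 },
  "confidence-self-test": { "pass": true, "detectedCanaries": "7/7" },
  "confidence-report": {
    "pass": true, "zeroUnknowns": true,
    "lanes": 12, "passed": 8, "blocked": 4, "unknown": 0, "failed": 0
  }
}
""")


def count_mismatches(results):
    """Return the names of rows whose counts do not sum to their stated total."""
    problems = []
    for name, row in results.items():
        if "total" in row:
            expected = row["passed"] + row["skipped"] + row["failed"]
            if row["total"] != expected:
                problems.append(name)
        elif "lanes" in row:
            expected = row["passed"] + row["blocked"] + row["unknown"] + row["failed"]
            if row["lanes"] != expected:
                problems.append(name)
        # Rows without count fields (replay and self-test) are skipped here.
    return problems


print(count_mismatches(RESULTS))  # prints []
```

The four blocked lanes in the confidence report are what keep `passed` at 8 of 12 while `unknown` and `failed` both stay at 0.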

# Classification

## Product Impact If OpenClaw Moved Fully To Codex Today

- **P4 for the old broad tool-defaults claims**: the absence of duplicate OpenClaw dynamic `read/write/edit/apply_patch/exec/update_plan` tools reflects intentional Codex-native ownership, not a Codex product bug.
- **P1 proof gap for live Codex-native behavior**: approval/read/write/compaction rows still need native/live proof before being used as product evidence.
- **P3 proof gap for `soak-100`**: optional long-run coverage needs scheduled/Testbox artifacts, but it is not part of the default maintainer gate.

## QA Impact

- **P0 resolved for deterministic mock CI gate**: `tool-defaults direct`, `openclaw-dynamic-tools direct`, and `first-hour-20 direct` have `0` hard failures.
- **P1 still open for live proof**: mock token efficiency is labeled `mock-estimate`; real `live-usage` is tracked separately in #80397.
- **P2 searchable/deferred mock limitation**: searchable rows are report-only until the mock provider can model deferred Codex tool discovery honestly.
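The split between gating and report-only lanes can be sketched as follows. This is an illustration under assumptions: the lane groupings follow the bullets above, the `failed` counts are taken from the machine-readable results, and the `gate` function and its return shape are hypothetical.

```python
# Lanes named above as having 0 hard failures in the deterministic mock gate.
GATING_LANES = (
    "tool-defaults-direct",
    "openclaw-dynamic-tools-direct",
    "first-hour-20-direct",
)
# Searchable rows are report-only until the mock provider can model
# deferred Codex tool discovery.
REPORT_ONLY_LANES = ("tool-defaults-searchable",)


def gate(failed_counts):
    """Fail the gate only on hard failures in gating lanes.

    failed_counts maps lane name -> number of failed rows (assumed shape).
    """
    blocking = {n: failed_counts[n] for n in GATING_LANES if failed_counts[n] > 0}
    report_only = {n: failed_counts[n] for n in REPORT_ONLY_LANES if failed_counts[n] > 0}
    return {"pass": not blocking, "blocking": blocking, "reportOnly": report_only}


# Failure counts from this run.
this_run = {
    "tool-defaults-direct": 0,
    "openclaw-dynamic-tools-direct": 0,
    "first-hour-20-direct": 0,
    "tool-defaults-searchable": 0,
}
print(gate(this_run)["pass"])  # prints True
```

In this sketch a failure in a report-only lane is surfaced in the result but does not flip `pass`, which matches the intent of keeping searchable rows out of the maintainer gate for now.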

# Important Boundaries

- Mock-only failures are not Codex runner product bugs unless reproduced through native/live Codex behavior or source-level proof independent of the mock provider.
- Codex-native workspace tools remain native-owned and must not be duplicated as OpenClaw dynamic tools in production.
- OpenClaw integration tools are still tested through the dynamic `openclaw` bridge and passed the direct mock lane.
- Token efficiency in this proof is `mock-estimate`, not live usage.
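The first boundary above amounts to a triage rule, sketched below. The `Failure` type, its field names, and the returned labels are hypothetical and exist only to make the rule concrete.

```python
from dataclasses import dataclass


@dataclass
class Failure:
    """A failing row observed in some lane (hypothetical shape)."""
    lane: str
    reproduced_live: bool = False     # seen again through native/live Codex behavior
    source_level_proof: bool = False  # shown from source, independent of the mock provider


def classify(failure: Failure) -> str:
    """A mock failure alone is never treated as a Codex runner product bug."""
    if failure.reproduced_live or failure.source_level_proof:
        return "product-bug-candidate"
    return "mock-only"


print(classify(Failure("fault-injection-mock")))                         # prints mock-only
print(classify(Failure("fault-injection-mock", reproduced_live=True)))   # prints product-bug-candidate
```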

# Linked Work

- RFC/tracking: #80171
- PR: #80323
- Tool-defaults correction: #80319
- Live token/Testbox proof: #80397
- Scheduled/Testbox soak: #80433
- Token live zero-usage guard: #80411
- Stale first-hour-20 tracker updated by this proof: #80434

