TLDR
PR #80323 has a beta.5 confidence proof run with zero unknowns in the defined mock/static matrix.
- OpenClaw baseline: v2026.5.10-beta.5
- PR head validated by workflow: 3336dec6419c9cc9a87dc7cfa6f48118ca2d838e
- Remote proof run: https://github.com/electricsheephq/openclaw-local-test/actions/runs/25719383976
- Strict confidence report: pass=true, zeroUnknowns=true
- Current product-bug verdict: no confirmed Codex runner product bug from the mock proof lanes
- Remaining proof gaps: live/OAuth Codex-native lanes, live token efficiency, and scheduled/Testbox soak-100
Why This Issue Exists
This is the maintainer-facing confidence tracker for PR #80323. It ties the beta.5 proof artifacts back to the original RFC (#80171), the corrected tool-defaults harness issue (#80319), and the remaining live/Testbox proof trackers (#80397, #80433).
The goal is not to claim every possible OpenClaw behavior is proven. The goal is stricter: every lane in the defined confidence manifest must either pass or have a classified, artifact-backed verdict. This run achieved that for the mock/static matrix.
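The strictness rule above can be sketched as a small classifier. This is a minimal illustration, not the actual report generator: the `Lane` type and its `classification` and `artifact` fields are hypothetical names, and only the output keys (`pass`, `zeroUnknowns`, `lanes`, `passed`, `blocked`, `unknown`, `failed`) are taken from the confidence report in this issue.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lane:
    # Hypothetical lane record; field names are assumptions for illustration.
    name: str
    status: str  # "passed", "blocked", or "failed"; anything else is unknown
    classification: Optional[str] = None  # e.g. "environment-blocked"
    artifact: Optional[str] = None  # path or URL backing the verdict


def strict_report(lanes):
    """Count lanes; a blocked lane only counts if it is classified and artifact-backed."""
    counts = {"passed": 0, "blocked": 0, "unknown": 0, "failed": 0}
    for lane in lanes:
        if lane.status == "passed":
            counts["passed"] += 1
        elif lane.status == "failed":
            counts["failed"] += 1
        elif lane.status == "blocked" and lane.classification and lane.artifact:
            counts["blocked"] += 1
        else:
            # Unclassified or unbacked verdicts are treated as unknown.
            counts["unknown"] += 1
    zero_unknowns = counts["unknown"] == 0
    return {
        "pass": zero_unknowns and counts["failed"] == 0,
        "zeroUnknowns": zero_unknowns,
        "lanes": len(lanes),
        **counts,
    }
```

Under this sketch, a lane marked blocked without a classification or artifact falls into `unknown`, which is what makes the gate stricter than a plain pass/fail count.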
Evidence
Remote workflow:
QA Runtime Confidence Proof
run: 25719383976
repo: electricsheephq/openclaw-local-test
target_ref: codex-vs-pi-runtime-parity-tools
expected_sha: 3336dec6419c9cc9a87dc7cfa6f48118ca2d838e
run_soak: false
run_live: false
Remote job results:
Static and targeted QA unit proof: success
- pnpm check:test-types: success
- pnpm lint --threads=8: success
- targeted QA-lab/Codex dynamic-tools tests: success
Mock confidence proof bundle: success
- tool-defaults direct: success
- openclaw-dynamic-tools direct: success
- tool-defaults searchable: success
- first-hour-20 direct: success
- first-hour-20 token report: success
- fault-injection mock: success
- expanded JSONL replay: success
- confidence negative controls: success
- strict confidence report: success
Live confidence proof lanes: skipped by dispatch; classified as environment-blocked in the confidence report.
Downloaded artifact root used for inspection:
/Volumes/LEXAR/Codex/qa-runtime-confidence-artifacts-25719383976/qa-runtime-confidence-mock-3336dec6419c9cc9a87dc7cfa6f48118ca2d838e/
Machine-Readable Results
{
  "tool-defaults-direct": { "total": 20, "passed": 20, "skipped": 0, "failed": 0 },
  "openclaw-dynamic-tools-direct": { "total": 8, "passed": 8, "skipped": 0, "failed": 0 },
  "tool-defaults-searchable": { "total": 20, "passed": 15, "skipped": 5, "failed": 0 },
  "first-hour-20-direct": { "total": 18, "passed": 15, "skipped": 3, "failed": 0 },
  "fault-injection-mock": { "total": 5, "passed": 3, "skipped": 2, "failed": 0 },
  "jsonl-expanded": { "curatedTranscripts": 7, "turnsCompared": 15, "driftedTurns": 0 },
  "confidence-self-test": { "pass": true, "detectedCanaries": "7/7" },
  "confidence-report": {
    "pass": true,
    "zeroUnknowns": true,
    "lanes": 12,
    "passed": 8,
    "blocked": 4,
    "unknown": 0,
    "failed": 0
  }
}
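As a quick sanity check, the counts above can be verified for internal consistency. The sketch below assumes only the JSON shape shown in this section; the helper name `inconsistent_entries` is hypothetical and not part of the QA tooling.

```python
import json

# Results copied from the Machine-Readable Results section of this issue.
RESULTS = json.loads("""
{
  "tool-defaults-direct": {"total": 20, "passed": 20, "skipped": 0, "failed": 0},
  "openclaw-dynamic-tools-direct": {"total": 8, "passed": 8, "skipped": 0, "failed": 0},
  "tool-defaults-searchable": {"total": 20, "passed": 15, "skipped": 5, "failed": 0},
  "first-hour-20-direct": {"total": 18, "passed": 15, "skipped": 3, "failed": 0},
  "fault-injection-mock": {"total": 5, "passed": 3, "skipped": 2, "failed": 0},
  "jsonl-expanded": {"curatedTranscripts": 7, "turnsCompared": 15, "driftedTurns": 0},
  "confidence-self-test": {"pass": true, "detectedCanaries": "7/7"},
  "confidence-report": {"pass": true, "zeroUnknowns": true, "lanes": 12,
                        "passed": 8, "blocked": 4, "unknown": 0, "failed": 0}
}
""")


def inconsistent_entries(results):
    """Return the names of entries whose counts do not add up."""
    problems = []
    for name, entry in results.items():
        if "total" in entry:
            # Suite rows: passed + skipped + failed must equal total.
            if entry["passed"] + entry["skipped"] + entry["failed"] != entry["total"]:
                problems.append(name)
        elif "lanes" in entry:
            # Report row: every lane must land in exactly one bucket.
            buckets = entry["passed"] + entry["blocked"] + entry["unknown"] + entry["failed"]
            if buckets != entry["lanes"]:
                problems.append(name)
    return problems


if __name__ == "__main__":
    print(inconsistent_entries(RESULTS))  # prints []
```

The entries without a `total` or `lanes` key (`jsonl-expanded`, `confidence-self-test`) are left unchecked here because they carry no additive counts.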
Classification
Product Impact If OpenClaw Moved Fully To Codex Today
- P4 for the old broad tool-defaults claims: the absence of duplicate OpenClaw dynamic read/write/edit/apply_patch/exec/update_plan exposure reflects intentional Codex-native ownership, not a Codex product bug.
- P1 proof gap for live Codex-native behavior: approval/read/write/compaction rows still need native/live proof before being used as product evidence.
- P3 proof gap for soak-100: optional long-run coverage needs scheduled/Testbox artifacts, but it is not part of the default maintainer gate.
QA Impact
- P0 resolved for deterministic mock CI gate: tool-defaults direct, openclaw-dynamic-tools direct, and first-hour-20 direct have 0 hard failures.
- P1 still open for live proof: mock token efficiency is labeled mock-estimate; real live-usage is tracked separately.
- P2 searchable/deferred mock limitation: searchable rows are report-only until the mock provider can model deferred Codex tool discovery honestly.
Important Boundaries
- Mock-only failures are not Codex runner product bugs unless reproduced through native/live Codex behavior or source-level proof independent of the mock provider.
- Codex-native workspace tools remain native-owned and must not be duplicated as OpenClaw dynamic tools in production.
- OpenClaw integration tools are still tested through the dynamic openclaw bridge and passed the direct mock lane.
- Token efficiency in this proof is mock-estimate, not live usage.
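The first boundary above amounts to a simple decision rule, sketched here under stated assumptions. The `FailureEvidence` type and its field names are hypothetical; the real triage process may weigh more signals than these two.

```python
from dataclasses import dataclass


@dataclass
class FailureEvidence:
    # Hypothetical evidence record for one observed failure.
    failed_in_mock: bool
    reproduced_live: bool = False  # reproduced through native/live Codex behavior
    source_level_proof: bool = False  # proof independent of the mock provider


def is_confirmed_product_bug(evidence: FailureEvidence) -> bool:
    """A mock failure alone never confirms a Codex runner product bug."""
    return evidence.reproduced_live or evidence.source_level_proof
```

For example, `is_confirmed_product_bug(FailureEvidence(failed_in_mock=True))` returns `False`, while adding `reproduced_live=True` flips it to `True`.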
Linked Work