[qa-lab] Complete Codex vs Pi runtime parity harness phases 2-5 #80323
100yenadmin wants to merge 82 commits into
|
Codex review: needs real behavior proof before merge. Reviewed June 7, 2026, 1:18 AM ET / 05:18 UTC.
Summary
Reproducibility: not applicable. This is a broad QA harness feature PR rather than a single reported runtime bug, and current-head real-behavior proof is still missing.
Review metrics: 2 noteworthy metrics.
Merge readiness
Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.
Rank-up moves:
Proof guidance:
Risk before merge
Maintainer options:
Next step before merge
Security
Review details
Best possible solution: Keep the PR as a source branch until maintainers either extract the remaining current-main-compatible slices or rebase it to the
Do we have a high-confidence way to reproduce the issue? Not applicable: this is a broad QA harness feature PR rather than a single reported runtime bug, and current-head real-behavior proof is still missing.
Is this the best way to solve the issue? No for direct merge: the sliced current-main direction is better than landing the stale branch wholesale because the branch predates the
AGENTS.md: found and applied where relevant.
Codex review notes: model gpt-5.5, reasoning high; reviewed against 7e7ea0fed17c.
Label changes
Label justifications:
Evidence reviewed
Acceptance criteria:
What I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.
How this review workflow works
|
…-parity-tools # Conflicts: # docs/.generated/config-baseline.sha256 # docs/.generated/plugin-sdk-api-baseline.sha256 # extensions/acpx/package.json # extensions/alibaba/package.json # extensions/amazon-bedrock-mantle/package.json # extensions/amazon-bedrock/package.json # extensions/anthropic-vertex/package.json # extensions/anthropic/package.json # extensions/arcee/package.json # extensions/azure-speech/package.json # extensions/bonjour/package.json # extensions/brave/package.json # extensions/browser/package.json # extensions/byteplus/package.json # extensions/canvas/package.json # extensions/cerebras/package.json # extensions/chutes/package.json # extensions/clickclack/package.json # extensions/cloudflare-ai-gateway/package.json # extensions/codex/package.json # extensions/comfy/package.json # extensions/copilot-proxy/package.json # extensions/deepgram/package.json # extensions/deepinfra/package.json # extensions/deepseek/package.json # extensions/diagnostics-otel/package.json # extensions/diagnostics-prometheus/package.json # extensions/diffs/package.json # extensions/discord/package.json # extensions/document-extract/package.json # extensions/duckduckgo/package.json # extensions/elevenlabs/package.json # extensions/exa/package.json # extensions/fal/package.json # extensions/feishu/package.json # extensions/file-transfer/package.json # extensions/firecrawl/package.json # extensions/fireworks/package.json # extensions/github-copilot/package.json # extensions/google-meet/package.json # extensions/google/package.json # extensions/googlechat/package.json # extensions/gradium/package.json # extensions/groq/package.json # extensions/huggingface/package.json # extensions/image-generation-core/package.json # extensions/imessage/package.json # extensions/inworld/package.json # extensions/irc/package.json # extensions/kilocode/package.json # extensions/kimi-coding/package.json # extensions/line/package.json # extensions/litellm/package.json # 
extensions/llm-task/package.json # extensions/lmstudio/package.json # extensions/lobster/package.json # extensions/matrix/package.json # extensions/mattermost/package.json # extensions/media-understanding-core/package.json # extensions/memory-core/package.json # extensions/memory-lancedb/package.json # extensions/memory-wiki/package.json # extensions/microsoft-foundry/package.json # extensions/microsoft/package.json # extensions/migrate-claude/package.json # extensions/migrate-hermes/package.json # extensions/minimax/package.json # extensions/mistral/package.json # extensions/moonshot/package.json # extensions/msteams/package.json # extensions/nextcloud-talk/package.json # extensions/nostr/package.json # extensions/nvidia/package.json # extensions/oc-path/package.json # extensions/ollama/package.json # extensions/open-prose/package.json # extensions/openai/package.json # extensions/opencode-go/package.json # extensions/opencode/package.json # extensions/openrouter/package.json # extensions/openshell/package.json # extensions/perplexity/package.json # extensions/qa-channel/package.json # extensions/qa-lab/package.json # extensions/qa-matrix/package.json # extensions/qianfan/package.json # extensions/qqbot/package.json # extensions/qqbot/src/bridge/tools/remind.test.ts # extensions/qqbot/src/engine/gateway/outbound-dispatch.test.ts # extensions/qwen/package.json # extensions/runway/package.json # extensions/searxng/package.json # extensions/senseaudio/package.json # extensions/sglang/package.json # extensions/signal/package.json # extensions/skill-workshop/package.json # extensions/slack/package.json # extensions/speech-core/package.json # extensions/stepfun/package.json # extensions/synology-chat/package.json # extensions/synthetic/package.json # extensions/tavily/package.json # extensions/tavily/src/tavily-tools.test.ts # extensions/telegram/package.json # extensions/tencent/package.json # extensions/tlon/package.json # extensions/together/package.json # 
extensions/tokenjuice/package.json # extensions/tts-local-cli/package.json # extensions/twitch/package.json # extensions/venice/package.json # extensions/vercel-ai-gateway/package.json # extensions/video-generation-core/package.json # extensions/vllm/package.json # extensions/voice-call/package.json # extensions/volcengine/package.json # extensions/voyage/package.json # extensions/vydra/package.json # extensions/web-readability/package.json # extensions/webhooks/package.json # extensions/whatsapp/package.json # extensions/xai/package.json # extensions/xiaomi/package.json # extensions/zai/package.json # extensions/zalo/package.json # extensions/zalouser/package.json # package.json # pnpm-lock.yaml # src/agents/provider-transport-fetch.test.ts # src/config/bundled-channel-config-metadata.generated.ts
|
|
Landed the current-main-compatible heartbeat capture slice from this PR on
Commit: 55edadf
What landed:
Proof:
I am not closing this PR wholesale: the remaining phase 2-5 branch delta is still too broad/stale against current |
|
Landed a current-main slice from this PR:
Scope landed:
Verification:
Leaving this PR open. The remaining phase 2-5 branch delta is still broad/stale and should continue landing as smaller current-main slices with separate proof. |
|
Landed another current-main slice from this PR:
Scope landed:
Verification:
Review note:
Leaving this PR open for the remaining runtime-parity harness/reporting/tool-matrix slices. |
|
Landed another current-main slice from this PR:
Scope landed:
Verification:
Leaving this PR open for the remaining harness/report/tool-matrix slices. |
|
Landed another scoped slice from this PR on
What landed:
Verification:
PR remains open for the remaining QA-Lab coverage slices. |
|
Landed another current-main slice from this PR:
Scope landed:
Companion baseline fix landed:
That second commit fixes a current-main type/schema drift where
Verification:
Post-push exact SHA state for
Leaving this PR open for the remaining broad runtime-parity harness/reporting deltas. |
|
Feel free to wrap this up and take over from here. It takes a lot of inference to get right and keep updated with betas. It works well: I built it over 3 days and ran it manually and with API keys to find the bugs I found last week, and I'm sure more will pop up as we keep betas rolling 🫡 @vincentkoc |
|
Follow-up after the post-landing deadcode failure:
Verification:
|
|
Slice landed on
This pulls the #80339 hard-gate piece out of the broad PR: required OpenClaw dynamic direct runtime-tool rows are now evaluated by a blocking release-check verifier, while the Codex-native/searchable fidelity work remains tracked under #80319 and adjacent issues.
Proof summary: local focused QA-Lab/runtime tests passed (7 files / 181 tests), autoreview clean, Testbox-through-Crabbox |
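As a rough illustration of what a blocking verifier over required tool rows could look like, here is a minimal sketch. The `ToolRow` shape, the `verifyRequiredToolRows` name, and the placeholder tool names are all hypothetical; they are not taken from the landed commit, whose actual row schema is not shown in this thread.

```typescript
// Hypothetical row shape for illustration only; the real QA-Lab rows may differ.
interface ToolRow {
  tool: string;
  lane: string;
  status: "passed" | "failed" | "skipped";
}

interface VerifierResult {
  ok: boolean;
  blockers: string[];
}

// A required tool blocks the release check when its row is missing
// or when its row has any status other than "passed".
function verifyRequiredToolRows(required: string[], rows: ToolRow[]): VerifierResult {
  const blockers: string[] = [];
  for (const tool of required) {
    const row = rows.find((candidate) => candidate.tool === tool);
    if (row === undefined) {
      blockers.push(`${tool}: missing row`);
    } else if (row.status !== "passed") {
      blockers.push(`${tool}: ${row.status}`);
    }
  }
  return { ok: blockers.length === 0, blockers };
}
```

Treating a skipped required row as a blocker is an assumption here; it is what distinguishes a hard gate from a report-only lane in this sketch.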
|
Narrow token-efficiency slice related to #81093 has landed on
That commit keeps the main-compatible part from this area small:
Proof is on the issue closeout: #81093 (comment). This PR is still open and currently reports |
|
Phase 4 token-efficiency report and scheduled artifact slice has now landed on
For any rebase of this PR, drop or reconcile the Phase 4/token-efficiency pieces against |
|
Landed the scoped JSONL replay slice on
What landed:
Verification:
Known proof gap: Blacksmith Testbox direct
I am leaving this PR open for the remaining real runtime-cell replay work unless a maintainer wants this broad branch closed as superseded by landed slices. |
|
|
Landed the confidence-report slice from this PR on
Credited slice: @100yenadmin's QA-Lab confidence/reporting work from #80323. I kept this PR open because the branch still contains broader runtime-parity work beyond the slice that is now on
Behavior addressed: QA-Lab now has
Real environment tested: Blacksmith Testbox through Crabbox, provider
Exact steps or command run after this patch: local
Evidence after fix: Testbox focused tests passed 5 files / 322 tests, confidence self-test verdict was
Observed result after fix: the confidence gate now rejects missing proof rows, malformed suite summaries, count/scenario mismatches, missing JSONL drift data, empty self-test canaries, optional-lane blockers, skipped rows without backfill, and absolute local artifact paths in persisted summaries.
What was not tested: true live-provider QA confidence manifests; this slice only adds the confidence gate/self-test plumbing and mock/proof validation path. |
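Two of the rejections listed above (count mismatches and absolute local artifact paths) can be sketched as a small validator. This is only an illustration under stated assumptions: the `LaneSummary` shape and the `findSummaryProblems` helper are hypothetical names, and the real confidence gate may check these conditions differently.

```typescript
// Hypothetical summary shape for illustration; not the actual QA-Lab schema.
interface LaneCounts {
  total: number;
  passed: number;
  skipped: number;
  failed: number;
}

interface LaneSummary {
  lane: string;
  counts: LaneCounts;
  artifactPaths: string[];
}

// Returns human-readable problems; an empty array means no problem was detected.
function findSummaryProblems(summary: LaneSummary): string[] {
  const problems: string[] = [];
  const { total, passed, skipped, failed } = summary.counts;

  // Count mismatch: the per-status counts must add up to the total.
  if (passed + skipped + failed !== total) {
    problems.push(`${summary.lane}: counts do not add up to total`);
  }

  // Absolute local paths (POSIX root or Windows drive letter) should not be persisted.
  for (const artifactPath of summary.artifactPaths) {
    if (artifactPath.startsWith("/") || /^[A-Za-z]:[\\/]/.test(artifactPath)) {
      problems.push(`${summary.lane}: absolute artifact path ${artifactPath}`);
    }
  }
  return problems;
}
```

A gate built this way would treat any non-empty result as a blocker rather than a warning, which matches the "rejects" wording in the comment above.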
|
This assigned pull request has been automatically marked as stale after being open for 27 days. |
Summary
Adds the Codex-vs-Pi runtime parity QA harness across extensions/qa-lab, including runtime-pair execution, runtime suite selectors, harness-prompt parity, token-efficiency reporting, tool-default fixtures, JSONL replay, confidence self-tests, strict confidence reports, and release-check wiring.
This branch now tests Codex at the right layer:
read, write, edit, apply_patch, exec, process, update_plan) are not expected to appear as duplicate OpenClaw dynamic tools.
direct mock lane.
counts.skipped in qa-suite-summary.json.
mock-estimate from live-usage.
Why
OpenClaw needs a maintainer-runnable gate that compares the same scenario/model under Pi and Codex before Codex becomes the default runtime. The gate must surface real runtime drift without turning mock-provider limitations or intentional Codex-native tool ownership into production bug reports.
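One way to read this requirement is as a lane classification in which only failed or unknown lanes fail the gate, while environment-blocked lanes are reported without being counted as product bugs. The sketch below assumes that rule; the `LaneResult` type and `summarizeLanes` function are hypothetical, and only the status labels and verdict field names mirror the confidence report quoted in this PR description.

```typescript
// Status labels follow the confidence report in this PR; the types are illustrative.
type LaneStatus = "passed" | "environment-blocked" | "unknown" | "failed";

interface LaneResult {
  lane: string;
  status: LaneStatus;
}

interface ConfidenceVerdict {
  pass: boolean;
  zeroUnknowns: boolean;
  passedLanes: number;
  blockedLanes: number;
  unknown: number;
  failed: number;
}

// Assumption: blocked lanes are classified, not failed, so they do not flip `pass`.
function summarizeLanes(results: LaneResult[]): ConfidenceVerdict {
  const count = (status: LaneStatus): number =>
    results.filter((result) => result.status === status).length;
  const unknown = count("unknown");
  const failed = count("failed");
  return {
    pass: failed === 0 && unknown === 0,
    zeroUnknowns: unknown === 0,
    passedLanes: count("passed"),
    blockedLanes: count("environment-blocked"),
    unknown,
    failed,
  };
}
```

Under this rule, a run with passed and environment-blocked lanes only still yields a passing verdict, while a single unclassified lane is enough to fail it.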
Current Verification
Latest validated branch head: 3336dec6419c9cc9a87dc7cfa6f48118ca2d838e
OpenClaw baseline merged into this branch: v2026.5.10-beta.5
Remote proof run: https://github.com/electricsheephq/openclaw-local-test/actions/runs/25719383976
Static/unit proof passed remotely:
pnpm check:test-types
pnpm lint --threads=8
extensions/qa-lab/src/providers/mock-openai/server.test.ts, runtime-tool-fixture.test.ts, runtime-parity.test.ts, runtime-suite.test.ts, suite.test.ts, cli.runtime.test.ts, tool-coverage-report.test.ts, token-efficiency-report.test.ts, harness-parity.test.ts, jsonl-replay.test.ts, codex-plugin-lifecycle.test.ts, scenario-catalog.test.ts, confidence-report.test.ts, and extensions/codex/src/app-server/dynamic-tools.test.ts.
Local surgical checks after the beta.5 merge also passed:
pnpm test extensions/minimax/index.test.ts extensions/telegram/src/bot.create-telegram-bot.test.ts extensions/whatsapp/src/login.coverage.test.ts
OPENCLAW_VITEST_MAX_WORKERS=1 pnpm test extensions/qa-lab/src/confidence-report.test.ts extensions/qa-lab/src/runtime-parity.test.ts extensions/qa-lab/src/cli.runtime.test.ts
OPENCLAW_VITEST_MAX_WORKERS=1 pnpm test extensions/qa-lab/src/scenario-catalog.test.ts extensions/qa-lab/src/suite.test.ts
OPENCLAW_VITEST_MAX_WORKERS=1 pnpm test extensions/qa-lab/src/suite-summary.test.ts extensions/qa-lab/src/suite.summary-json.test.ts extensions/qa-lab/src/confidence-report.test.ts
Real Behavior Proof
3336dec6419c9cc9a87dc7cfa6f48118ca2d838e, with OpenClaw v2026.5.10-beta.5 merged into the branch. Artifacts were downloaded and inspected from a local macOS OpenClaw checkout at /Volumes/LEXAR/repos/openclaw-1.

gh workflow run qa-runtime-confidence-proof.yml \
  --repo electricsheephq/openclaw-local-test \
  --ref main \
  -f target_ref=codex-vs-pi-runtime-parity-tools \
  -f expected_sha=3336dec6419c9cc9a87dc7cfa6f48118ca2d838e \
  -f run_soak=false \
  -f run_live=false
gh run view 25719383976 --repo electricsheephq/openclaw-local-test --json status,conclusion,jobs
gh run download 25719383976 \
  --repo electricsheephq/openclaw-local-test \
  --dir /Volumes/LEXAR/Codex/qa-runtime-confidence-artifacts-25719383976
jq '{pass, zeroUnknowns, counts, failures}' \
  /Volumes/LEXAR/Codex/qa-runtime-confidence-artifacts-25719383976/qa-runtime-confidence-mock-3336dec6419c9cc9a87dc7cfa6f48118ca2d838e/confidence-report/qa-confidence-summary.json

{
  "tool-defaults-direct": { "total": 20, "passed": 20, "skipped": 0, "failed": 0 },
  "openclaw-dynamic-tools-direct": { "total": 8, "passed": 8, "skipped": 0, "failed": 0 },
  "tool-defaults-searchable": { "total": 20, "passed": 15, "skipped": 5, "failed": 0 },
  "first-hour-20-direct": { "total": 18, "passed": 15, "skipped": 3, "failed": 0 },
  "fault-injection-mock": { "total": 5, "passed": 3, "skipped": 2, "failed": 0 },
  "jsonl-expanded": { "curatedTranscripts": 7, "turnsCompared": 15, "driftedTurns": 0 },
  "confidence-self-test": { "pass": true, "detectedCanaries": "7/7" },
  "confidence-report": { "pass": true, "zeroUnknowns": true, "passedLanes": 8, "blockedLanes": 4, "unknown": 0, "failed": 0 }
}

Token-efficiency artifact excerpt:

{
  "status": "estimated",
  "providerMode": "mock-openai",
  "usageSources": ["mock-estimate"],
  "rows": 18,
  "pass": true,
  "piTotalTokens": 245125,
  "codexTotalTokens": 130286,
  "deltaPercent": -46.849158592554815
}

pass=true, zeroUnknowns=true, 8 passed lanes, 4 classified environment-blocked lanes, 0 unknown lanes, and 0 failed lanes. The deterministic OpenClaw dynamic integration gate is green, and mock-only native/searchable limitations are explicit report-only rows rather than product bug claims.
usage, live/OAuth Codex-native approval/read/write/compaction proof, and scheduled/Testbox soak-100. These are explicitly tracked in [QA-lab] Complete live-frontier token-efficiency and Testbox parity proof #80397, [QA-lab] Token-efficiency report marks failed live zero-usage run as pass #80411, and Wire optional soak-100 runtime parity lane to scheduled or Testbox proof #80433.
Current Bug Classification
Confirmed fixed or corrected here:
Current product-bug verdict:
Not Claimed Complete
usage is still tracked in [QA-lab] Complete live-frontier token-efficiency and Testbox parity proof #80397 and guarded by [QA-lab] Token-efficiency report marks failed live zero-usage run as pass #80411.
soak-100 proof is still tracked in Wire optional soak-100 runtime parity lane to scheduled or Testbox proof #80433.
Linked Issues