[qa-lab] Complete Codex vs Pi runtime parity harness phases 2-5 by 100yenadmin · Pull Request #80323 · openclaw/openclaw

100yenadmin · 2026-05-10T15:36:26Z

Summary

Adds the Codex-vs-Pi runtime parity QA harness across extensions/qa-lab, including runtime-pair execution, runtime suite selectors, harness-prompt parity, token-efficiency reporting, tool-default fixtures, JSONL replay, confidence self-tests, strict confidence reports, and release-check wiring.

This branch now tests Codex at the right layer:

Codex-native workspace tools (read, write, edit, apply_patch, exec, process, update_plan) are not expected to appear as duplicate OpenClaw dynamic tools.
OpenClaw-owned integration tools remain dynamic-tool parity rows and pass the deterministic direct mock lane.
Provider-plan intent is reported separately from actual runtime transcript tool calls.
Report-only/native-live/mock-limitation rows are explicit skips, including counts.skipped in qa-suite-summary.json.
Token-efficiency rows distinguish mock-estimate from live-usage.
Seeded confidence canaries prove the harness catches prompt, tool schema, tool-call, tool-result, failure-mode, token, and JSONL replay regressions.
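
The ownership split described above can be pictured with a small sketch. The tool names come from the list in this summary; the `RowKind` labels, the `classifyToolRow` helper, and the `example_integration_tool` name are hypothetical and only illustrate the idea. They are not the harness's real types.

```typescript
// Tool names listed in the PR summary as Codex-native workspace tools.
const CODEX_NATIVE_TOOLS = new Set<string>([
  "read",
  "write",
  "edit",
  "apply_patch",
  "exec",
  "process",
  "update_plan",
]);

// Hypothetical row kinds for illustration only.
type RowKind = "codex-native-report-only" | "openclaw-dynamic-parity";

// Codex-native tools are not expected as duplicate dynamic tools,
// so they are reported rather than gated; everything else stays a parity row.
function classifyToolRow(toolName: string): RowKind {
  return CODEX_NATIVE_TOOLS.has(toolName)
    ? "codex-native-report-only"
    : "openclaw-dynamic-parity";
}

// "example_integration_tool" is a made-up name standing in for an OpenClaw-owned tool.
const exampleRows = ["apply_patch", "example_integration_tool"].map((name) => ({
  name,
  kind: classifyToolRow(name),
}));
```

Keeping the native set explicit means a newly added tool defaults to the stricter parity row until someone deliberately marks it as Codex-native.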

Why

OpenClaw needs a maintainer-runnable gate that compares the same scenario/model under Pi and Codex before Codex becomes the default runtime. The gate must surface real runtime drift without turning mock-provider limitations or intentional Codex-native tool ownership into production bug reports.

Current Verification

Latest validated branch head: 3336dec6419c9cc9a87dc7cfa6f48118ca2d838e

OpenClaw baseline merged into this branch: v2026.5.10-beta.5

Remote proof run: https://github.com/electricsheephq/openclaw-local-test/actions/runs/25719383976

Static/unit proof passed remotely:

pnpm check:test-types
pnpm lint --threads=8
Targeted QA-lab/Codex dynamic-tools tests:
extensions/qa-lab/src/providers/mock-openai/server.test.ts, runtime-tool-fixture.test.ts, runtime-parity.test.ts, runtime-suite.test.ts, suite.test.ts, cli.runtime.test.ts, tool-coverage-report.test.ts, token-efficiency-report.test.ts, harness-parity.test.ts, jsonl-replay.test.ts, codex-plugin-lifecycle.test.ts, scenario-catalog.test.ts, confidence-report.test.ts, and extensions/codex/src/app-server/dynamic-tools.test.ts.

Local surgical checks after the beta.5 merge also passed:

pnpm test extensions/minimax/index.test.ts extensions/telegram/src/bot.create-telegram-bot.test.ts extensions/whatsapp/src/login.coverage.test.ts
OPENCLAW_VITEST_MAX_WORKERS=1 pnpm test extensions/qa-lab/src/confidence-report.test.ts extensions/qa-lab/src/runtime-parity.test.ts extensions/qa-lab/src/cli.runtime.test.ts
OPENCLAW_VITEST_MAX_WORKERS=1 pnpm test extensions/qa-lab/src/scenario-catalog.test.ts extensions/qa-lab/src/suite.test.ts
OPENCLAW_VITEST_MAX_WORKERS=1 pnpm test extensions/qa-lab/src/suite-summary.test.ts extensions/qa-lab/src/suite.summary-json.test.ts extensions/qa-lab/src/confidence-report.test.ts

Real Behavior Proof

Behavior or issue addressed: Running the new QA CLI against a real OpenClaw checkout should produce a trustworthy Codex-vs-Pi parity signal: Codex-native workspace tools are not false-failed for missing duplicate OpenClaw dynamic exposure, OpenClaw dynamic integration tools still execute through the dynamic bridge, report-only rows are explicit skips, token-efficiency output is labeled by usage source, JSONL replay artifacts are emitted, and seeded canaries prove the harness can catch regressions.
Real environment tested: GitHub Actions Ubuntu 24.04 remote runner checking out this PR head 3336dec6419c9cc9a87dc7cfa6f48118ca2d838e, with OpenClaw v2026.5.10-beta.5 merged into the branch. Artifacts were downloaded and inspected from a local macOS OpenClaw checkout at /Volumes/LEXAR/repos/openclaw-1.
Exact steps or command run after this patch:

gh workflow run qa-runtime-confidence-proof.yml \
  --repo electricsheephq/openclaw-local-test \
  --ref main \
  -f target_ref=codex-vs-pi-runtime-parity-tools \
  -f expected_sha=3336dec6419c9cc9a87dc7cfa6f48118ca2d838e \
  -f run_soak=false \
  -f run_live=false

gh run view 25719383976 --repo electricsheephq/openclaw-local-test --json status,conclusion,jobs

gh run download 25719383976 \
  --repo electricsheephq/openclaw-local-test \
  --dir /Volumes/LEXAR/Codex/qa-runtime-confidence-artifacts-25719383976

jq '{pass, zeroUnknowns, counts, failures}' \
  /Volumes/LEXAR/Codex/qa-runtime-confidence-artifacts-25719383976/qa-runtime-confidence-mock-3336dec6419c9cc9a87dc7cfa6f48118ca2d838e/confidence-report/qa-confidence-summary.json

Evidence after fix: Terminal capture from the downloaded artifact summaries:

{
  "tool-defaults-direct": { "total": 20, "passed": 20, "skipped": 0, "failed": 0 },
  "openclaw-dynamic-tools-direct": { "total": 8, "passed": 8, "skipped": 0, "failed": 0 },
  "tool-defaults-searchable": { "total": 20, "passed": 15, "skipped": 5, "failed": 0 },
  "first-hour-20-direct": { "total": 18, "passed": 15, "skipped": 3, "failed": 0 },
  "fault-injection-mock": { "total": 5, "passed": 3, "skipped": 2, "failed": 0 },
  "jsonl-expanded": { "curatedTranscripts": 7, "turnsCompared": 15, "driftedTurns": 0 },
  "confidence-self-test": { "pass": true, "detectedCanaries": "7/7" },
  "confidence-report": { "pass": true, "zeroUnknowns": true, "passedLanes": 8, "blockedLanes": 4, "unknown": 0, "failed": 0 }
}
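
As a sanity check on the lane summaries above, each lane's total should equal passed plus skipped plus failed. The sketch below copies the five lane rows from the capture; the `LaneCounts` shape and helper names are assumptions made for illustration, not the real `qa-suite-summary.json` schema.

```typescript
// Hypothetical shape; only the numeric fields are taken from the capture above.
interface LaneCounts {
  name: string;
  total: number;
  passed: number;
  skipped: number;
  failed: number;
}

// Values copied from the artifact summary above.
const lanes: LaneCounts[] = [
  { name: "tool-defaults-direct", total: 20, passed: 20, skipped: 0, failed: 0 },
  { name: "openclaw-dynamic-tools-direct", total: 8, passed: 8, skipped: 0, failed: 0 },
  { name: "tool-defaults-searchable", total: 20, passed: 15, skipped: 5, failed: 0 },
  { name: "first-hour-20-direct", total: 18, passed: 15, skipped: 3, failed: 0 },
  { name: "fault-injection-mock", total: 5, passed: 3, skipped: 2, failed: 0 },
];

// A lane is consistent when every row is accounted for exactly once.
function countsAreConsistent(lane: LaneCounts): boolean {
  return lane.total === lane.passed + lane.skipped + lane.failed;
}

// Names of lanes whose counts do not add up (empty when all rows are accounted for).
const inconsistentLanes = lanes
  .filter((lane) => !countsAreConsistent(lane))
  .map((lane) => lane.name);

// Explicit skips summed across lanes, so report-only rows stay visible.
const totalSkipped = lanes.reduce((sum, lane) => sum + lane.skipped, 0);
```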

Token-efficiency artifact excerpt:

{
  "status": "estimated",
  "providerMode": "mock-openai",
  "usageSources": ["mock-estimate"],
  "rows": 18,
  "pass": true,
  "piTotalTokens": 245125,
  "codexTotalTokens": 130286,
  "deltaPercent": -46.849158592554815
}
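
The `deltaPercent` value in the excerpt is consistent with a relative difference measured against the Pi total. A minimal sketch of that arithmetic follows; the function name and the guard on a non-positive baseline are assumptions, and the report's actual implementation is not shown in this thread.

```typescript
// Relative change of Codex tokens versus the Pi baseline, in percent.
// Negative values mean Codex used fewer tokens than Pi.
function deltaPercent(piTotalTokens: number, codexTotalTokens: number): number {
  if (piTotalTokens <= 0) {
    // Assumed guard: a zero baseline would make the ratio meaningless.
    throw new Error("Pi baseline must be positive");
  }
  return ((codexTotalTokens - piTotalTokens) / piTotalTokens) * 100;
}

// Totals copied from the artifact excerpt above: (130286 - 245125) / 245125 * 100,
// which is roughly -46.85.
const delta = deltaPercent(245125, 130286);
```

Because the source is `mock-estimate`, the figure describes the mock lane only and says nothing yet about live usage.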

Observed result after fix: The remote proof workflow completed successfully. The strict confidence report has pass=true, zeroUnknowns=true, 8 passed lanes, 4 classified environment-blocked lanes, 0 unknown lanes, and 0 failed lanes. The deterministic OpenClaw dynamic integration gate is green, and mock-only native/searchable limitations are explicit report-only rows rather than product bug claims.
What was not tested: Live frontier token-efficiency with real assistant-message usage, live/OAuth Codex-native approval/read/write/compaction proof, and scheduled/Testbox soak-100. These are explicitly tracked in "[QA-lab] Complete live-frontier token-efficiency and Testbox parity proof" (#80397), "[QA-lab] Token-efficiency report marks failed live zero-usage run as pass" (#80411), and "Wire optional soak-100 runtime parity lane to scheduled or Testbox proof" (#80433).
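
One way to read the strict confidence result above is that environment-blocked lanes are classified and tolerated, while any unknown or failed lane breaks the gate. The sketch below encodes that reading. The rule is inferred from the reported numbers and is an assumption, as are the `LaneStatus` and `ConfidenceSummary` names.

```typescript
// Hypothetical lane statuses mirroring the words used in the report.
type LaneStatus = "passed" | "blocked" | "unknown" | "failed";

interface ConfidenceSummary {
  pass: boolean;
  zeroUnknowns: boolean;
  passedLanes: number;
  blockedLanes: number;
  unknown: number;
  failed: number;
}

// Assumed rule: blocked lanes do not fail the gate, unknown or failed lanes do.
function summarize(statuses: LaneStatus[]): ConfidenceSummary {
  const count = (wanted: LaneStatus): number =>
    statuses.filter((status) => status === wanted).length;
  const unknown = count("unknown");
  const failed = count("failed");
  return {
    pass: unknown === 0 && failed === 0,
    zeroUnknowns: unknown === 0,
    passedLanes: count("passed"),
    blockedLanes: count("blocked"),
    unknown,
    failed,
  };
}

// Helper to build n copies of one status.
const repeat = (status: LaneStatus, n: number): LaneStatus[] =>
  Array.from({ length: n }, () => status);

// 8 passed and 4 environment-blocked lanes, as in the observed result.
const observed = summarize([...repeat("passed", 8), ...repeat("blocked", 4)]);
```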

Current Bug Classification

Confirmed fixed or corrected here:

QA harness conflated Codex-native workspace tools with OpenClaw dynamic tool parity.
QA harness/reporting conflated provider-plan intent with runtime transcript tool calls.
Mock/searchable rows lacked clear report-only classification.
Report summaries omitted skipped/report-only counts.
The confidence gate lacked seeded negative controls.
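
The idea behind seeded negative controls can be sketched as follows: mutate one dimension of a known-good snapshot at a time and confirm the comparison notices each mutation. The seven canary names follow the regression kinds listed in the summary. The `CellSnapshot` shape, the sample values, and the JSON-based comparison are assumptions for illustration, not the harness's implementation.

```typescript
// Hypothetical, simplified snapshot of one runtime cell.
interface CellSnapshot {
  prompt: string;
  toolSchema: string;
  toolCalls: string[];
  toolResults: string[];
  failureMode: string;
  totalTokens: number;
  replayTurns: string[];
}

// Made-up baseline values used only to drive the self-test.
const baseline: CellSnapshot = {
  prompt: "You are a QA agent.",
  toolSchema: '{"name":"example_tool"}',
  toolCalls: ["example_tool"],
  toolResults: ["ok"],
  failureMode: "none",
  totalTokens: 100,
  replayTurns: ["turn-1"],
};

// Returns the fields whose serialized values differ between two snapshots.
function driftedFields(a: CellSnapshot, b: CellSnapshot): string[] {
  return (Object.keys(a) as Array<keyof CellSnapshot>).filter(
    (key) => JSON.stringify(a[key]) !== JSON.stringify(b[key]),
  );
}

// One seeded mutation per regression kind named in the summary.
const canaries: Array<{ name: string; mutate: (s: CellSnapshot) => CellSnapshot }> = [
  { name: "prompt", mutate: (s) => ({ ...s, prompt: s.prompt + " (mutated)" }) },
  { name: "tool-schema", mutate: (s) => ({ ...s, toolSchema: "{}" }) },
  { name: "tool-call", mutate: (s) => ({ ...s, toolCalls: [] }) },
  { name: "tool-result", mutate: (s) => ({ ...s, toolResults: ["error"] }) },
  { name: "failure-mode", mutate: (s) => ({ ...s, failureMode: "timeout" }) },
  { name: "token", mutate: (s) => ({ ...s, totalTokens: s.totalTokens * 2 }) },
  { name: "jsonl-replay", mutate: (s) => ({ ...s, replayTurns: [...s.replayTurns, "extra-turn"] }) },
];

// A canary counts as detected when the comparison reports at least one drifted field.
const detected = canaries.filter(
  (canary) => driftedFields(baseline, canary.mutate(baseline)).length > 0,
).length;

const selfTest = {
  pass: detected === canaries.length,
  detectedCanaries: `${detected}/${canaries.length}`,
};
```

A gate that passes on the baseline but misses any canary would be reporting green without actually being able to see that class of regression.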

Current product-bug verdict:

No confirmed Codex runner product bug remains from the mock proof lanes.
Native/live proof is still required before filing product bugs for Codex-native approval/read/write/compaction behavior.

Not Claimed Complete

Live frontier token-efficiency proof with real assistant-message usage is still tracked in "[QA-lab] Complete live-frontier token-efficiency and Testbox parity proof" (#80397) and guarded by "[QA-lab] Token-efficiency report marks failed live zero-usage run as pass" (#80411).
Scheduled/Testbox soak-100 proof is still tracked in "Wire optional soak-100 runtime parity lane to scheduled or Testbox proof" (#80433).
Searchable/deferred mock-provider fidelity remains tracked in "QA tool-defaults suite conflates Codex-native tools with OpenClaw dynamic tool parity" (#80319), but the deterministic direct OpenClaw dynamic tool gate is green.

clawsweeper · 2026-05-10T15:39:48Z

Codex review: needs real behavior proof before merge. Reviewed June 7, 2026, 1:18 AM ET / 05:18 UTC.

Summary
The PR adds a broad QA-Lab Codex-vs-Pi runtime parity harness, tool coverage, Codex lifecycle checks, token-efficiency reporting, JSONL replay, release workflows, and supporting tests/docs.

Reproducibility: not applicable. This is a broad QA harness feature PR rather than a single reported runtime bug, and current-head real-behavior proof is still missing.

Review metrics: 2 noteworthy metrics.

Changed surface breadth: 137 files changed in the PR branch. The branch spans QA runtime code, scenarios, reports, scripts, tests, and workflows, so stale-base reconciliation is a maintainer-visible merge concern.
Runtime axis drift: 1 renamed runtime pair contract. Current main uses openclaw,codex while the PR branch still uses pi,codex, which must be resolved before merging automation or CLI surfaces.

Merge readiness
Overall: 🦪 silver shellfish
Proof: 🦪 silver shellfish
Patch quality: 🦐 gold shrimp
Result: blocked until stronger real behavior proof is added.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

Rebase or split the remaining branch work onto current main and the openclaw,codex runtime axis.
[P1] Add current-head real behavior proof for the exact branch SHA after reconciliation.
Limit the next merge candidate to the remaining unique QA-Lab slice rather than the full stale branch.

Proof guidance:

[P1] Needs stronger real behavior proof before merge: The PR body supplies terminal/remote proof for older heads, but not for the current PR head, so contributor-visible after-fix proof is still insufficient. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Risk before merge

[P1] The branch is broad and stale against current main; direct merge would require resolving the old pi,codex runtime naming against the landed openclaw,codex contract.
[P1] The supplied real-behavior proof is for older heads, not the current PR head 7755e898bc3e1696032f7adc7561a94f02e68778.
[P1] The PR changes release and QA automation, so stale workflow wiring could break or duplicate maintainer proof lanes if merged without a fresh current-main reconciliation.

Maintainer options:

Continue Slice-First Integration (recommended)
Extract or rebase only the remaining QA-Lab pieces that still differ from current main, then verify them at the current PR head before merge.
Accept Maintainer Takeover Risk
A maintainer can take over the broad branch manually, but should treat the release workflow and runtime-axis rename as explicit reconciliation work.
Close After Remaining Work Is Tracked
If no unique branch remainder is still wanted after the already-landed slices, close this PR with links to the landed commits and any narrower follow-up issues.

Next step before merge

[P1] The remaining work is maintainer-owned slicing, stale-branch reconciliation, and exact-head proof review rather than a narrow autonomous fix.

Security
Cleared: The security pass found workflow-sensitive automation changes but no concrete secret, dependency, permission, or supply-chain regression in the inspected PR diff.

Review details

Best possible solution:

Keep the PR as a source branch until maintainers either extract the remaining current-main-compatible slices or rebase it to the openclaw,codex contract with exact-head proof.

Do we have a high-confidence way to reproduce the issue?

Not applicable: this is a broad QA harness feature PR rather than a single reported runtime bug, and current-head real-behavior proof is still missing.

Is this the best way to solve the issue?

No for direct merge: the sliced current-main direction is better than landing the stale branch wholesale because the branch predates the openclaw,codex runtime contract and current release-check layout.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 7e7ea0fed17c.

Label changes

Label justifications:

P2: This is important QA/release infrastructure, but it is not a current user-facing outage or security emergency.
merge-risk: 🚨 automation: The PR changes release and QA workflow lanes, and stale runtime-pair wiring could break or duplicate maintainer automation after merge.
rating: 🦪 silver shellfish: Overall readiness is 🦪 silver shellfish; proof is 🦪 silver shellfish and patch quality is 🦐 gold shrimp.
feature: ✨ showcase: ClawSweeper spotlight: unusually compelling feature idea for maintainer attention. A runtime parity harness that compares OpenClaw and Codex behavior, tool coverage, lifecycle stress, and token efficiency is strategically useful release infrastructure.
status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs stronger real behavior proof before merge: The PR body supplies terminal/remote proof for older heads, but not for the current PR head, so contributor-visible after-fix proof is still insufficient. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Evidence reviewed

Acceptance criteria:

[P1] git diff --check 7b4fd3d 7755e89.
[P1] After rebase or slicing, rerun the relevant QA-Lab runtime parity proof at the exact PR head.
[P1] Verify release workflow changes against current .github/workflows/openclaw-release-checks.yml runtime parity lanes.

What I checked:

Current main uses the renamed runtime-pair contract: Current main parses runtime pairs through normalizeRuntimePairValue, accepts openclaw,codex, and builds RuntimeId values as openclaw or codex, so the implementation has moved past the PR's original pi,codex naming. (extensions/qa-lab/src/cli.runtime.ts:183, 61bb7d5523b8)
PR head still carries the older runtime-pair names: The PR branch version of the runtime CLI still reports --runtime-pair must be pi,codex or codex,pi and defaults to ['pi','codex'], which needs reconciliation before any current-main merge path. (extensions/qa-lab/src/cli.runtime.ts:186, 7755e898bc3e)
Current main has release-check runtime parity lanes: The release workflow already contains runtime parity jobs that run --runtime-pair openclaw,codex, showing several central CI slices from the PR's intent have landed on main. (.github/workflows/openclaw-release-checks.yml:1009, 61bb7d5523b8)
PR head workflow still uses old parity wiring: The PR workflow adds runtime parity release checks using --runtime-pair pi,codex, so it cannot be taken as-is over the current workflow contract. (.github/workflows/openclaw-release-checks.yml:872, 7755e898bc3e)
Current main includes JSONL replay on the renamed runtime axis: Current jsonl-replay.ts defines runtimePair: ['openclaw','codex'] and cells keyed by openclaw and codex, covering a major requested Phase 5 surface but with the current contract. (extensions/qa-lab/src/jsonl-replay.ts:12, cf0657852f65)
Codex dynamic-tool contract checked directly: Sibling Codex source defines thread-start dynamic tool specs and dynamic tool request/response plumbing, supporting that OpenClaw's Codex bridge and deferred-tool behavior are the relevant dependency contract for this review. (../codex/codex-rs/app-server-protocol/src/protocol/v2/thread.rs:39, b89ce9a2bced)
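
To make the runtime-axis mismatch concrete, here is a minimal sketch of a parser that accepts only the `openclaw` and `codex` ids and therefore rejects the older `pi,codex` value. It is not the `normalizeRuntimePairValue` function on main. The `parseRuntimePair` name, the error text, and the choice to accept either order are assumptions.

```typescript
// Runtime ids as described for current main in the review above.
type RuntimeId = "openclaw" | "codex";

function isRuntimeId(value: string): value is RuntimeId {
  return value === "openclaw" || value === "codex";
}

// Hypothetical parser: accepts "openclaw,codex" or "codex,openclaw" (order flexibility is assumed).
function parseRuntimePair(raw: string): [RuntimeId, RuntimeId] {
  const parts = raw.split(",").map((part) => part.trim().toLowerCase());
  const [first, second] = parts;
  if (
    parts.length !== 2 ||
    first === undefined ||
    second === undefined ||
    !isRuntimeId(first) ||
    !isRuntimeId(second) ||
    first === second
  ) {
    throw new Error(
      `--runtime-pair must be openclaw,codex or codex,openclaw (got "${raw}")`,
    );
  }
  return [first, second];
}

// The landed contract parses cleanly.
const currentPair = parseRuntimePair("openclaw,codex");

// The PR branch's older value fails fast instead of silently running the wrong axis.
let legacyRejected = false;
try {
  parseRuntimePair("pi,codex");
} catch {
  legacyRejected = true;
}
```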

Likely related people:

vincentkoc: Current main history and the provided PR discussion show repeated landing and maintenance of runtime parity, token efficiency, confidence report, JSONL replay, and release workflow slices. (role: recent area contributor and QA-Lab runtime parity follow-up owner; confidence: high; commits: 61bb7d5523b8, 03f1bf9a4df0, f6a49a4e8a13; files: extensions/qa-lab/src/runtime-parity.ts, extensions/qa-lab/src/cli.runtime.ts, extensions/qa-lab/src/jsonl-replay.ts)

What the crustacean ranks mean

🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works

ClawSweeper keeps one durable marker-backed review comment per issue or PR.
Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
Maintainers can also comment @clawsweeper review to request a fresh review only.
Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

…-parity-tools

…-parity-tools # Conflicts: # docs/.generated/config-baseline.sha256 # docs/.generated/plugin-sdk-api-baseline.sha256 # extensions/acpx/package.json # extensions/alibaba/package.json # extensions/amazon-bedrock-mantle/package.json # extensions/amazon-bedrock/package.json # extensions/anthropic-vertex/package.json # extensions/anthropic/package.json # extensions/arcee/package.json # extensions/azure-speech/package.json # extensions/bonjour/package.json # extensions/brave/package.json # extensions/browser/package.json # extensions/byteplus/package.json # extensions/canvas/package.json # extensions/cerebras/package.json # extensions/chutes/package.json # extensions/clickclack/package.json # extensions/cloudflare-ai-gateway/package.json # extensions/codex/package.json # extensions/comfy/package.json # extensions/copilot-proxy/package.json # extensions/deepgram/package.json # extensions/deepinfra/package.json # extensions/deepseek/package.json # extensions/diagnostics-otel/package.json # extensions/diagnostics-prometheus/package.json # extensions/diffs/package.json # extensions/discord/package.json # extensions/document-extract/package.json # extensions/duckduckgo/package.json # extensions/elevenlabs/package.json # extensions/exa/package.json # extensions/fal/package.json # extensions/feishu/package.json # extensions/file-transfer/package.json # extensions/firecrawl/package.json # extensions/fireworks/package.json # extensions/github-copilot/package.json # extensions/google-meet/package.json # extensions/google/package.json # extensions/googlechat/package.json # extensions/gradium/package.json # extensions/groq/package.json # extensions/huggingface/package.json # extensions/image-generation-core/package.json # extensions/imessage/package.json # extensions/inworld/package.json # extensions/irc/package.json # extensions/kilocode/package.json # extensions/kimi-coding/package.json # extensions/line/package.json # extensions/litellm/package.json # 
extensions/llm-task/package.json # extensions/lmstudio/package.json # extensions/lobster/package.json # extensions/matrix/package.json # extensions/mattermost/package.json # extensions/media-understanding-core/package.json # extensions/memory-core/package.json # extensions/memory-lancedb/package.json # extensions/memory-wiki/package.json # extensions/microsoft-foundry/package.json # extensions/microsoft/package.json # extensions/migrate-claude/package.json # extensions/migrate-hermes/package.json # extensions/minimax/package.json # extensions/mistral/package.json # extensions/moonshot/package.json # extensions/msteams/package.json # extensions/nextcloud-talk/package.json # extensions/nostr/package.json # extensions/nvidia/package.json # extensions/oc-path/package.json # extensions/ollama/package.json # extensions/open-prose/package.json # extensions/openai/package.json # extensions/opencode-go/package.json # extensions/opencode/package.json # extensions/openrouter/package.json # extensions/openshell/package.json # extensions/perplexity/package.json # extensions/qa-channel/package.json # extensions/qa-lab/package.json # extensions/qa-matrix/package.json # extensions/qianfan/package.json # extensions/qqbot/package.json # extensions/qqbot/src/bridge/tools/remind.test.ts # extensions/qqbot/src/engine/gateway/outbound-dispatch.test.ts # extensions/qwen/package.json # extensions/runway/package.json # extensions/searxng/package.json # extensions/senseaudio/package.json # extensions/sglang/package.json # extensions/signal/package.json # extensions/skill-workshop/package.json # extensions/slack/package.json # extensions/speech-core/package.json # extensions/stepfun/package.json # extensions/synology-chat/package.json # extensions/synthetic/package.json # extensions/tavily/package.json # extensions/tavily/src/tavily-tools.test.ts # extensions/telegram/package.json # extensions/tencent/package.json # extensions/tlon/package.json # extensions/together/package.json # 
extensions/tokenjuice/package.json # extensions/tts-local-cli/package.json # extensions/twitch/package.json # extensions/venice/package.json # extensions/vercel-ai-gateway/package.json # extensions/video-generation-core/package.json # extensions/vllm/package.json # extensions/voice-call/package.json # extensions/volcengine/package.json # extensions/voyage/package.json # extensions/vydra/package.json # extensions/web-readability/package.json # extensions/webhooks/package.json # extensions/whatsapp/package.json # extensions/xai/package.json # extensions/xiaomi/package.json # extensions/zai/package.json # extensions/zalo/package.json # extensions/zalouser/package.json # package.json # pnpm-lock.yaml # src/agents/provider-transport-fetch.test.ts # src/config/bundled-channel-config-metadata.generated.ts

socket-security · 2026-05-12T22:29:46Z

Review the following changes in direct dependencies.

Packages reported (columns: Diff, Package, Supply Chain Security, Vulnerability, Quality, Maintenance, License):

npm/jszip@3.10.1
npm/ws@8.20.0
npm/zod@4.4.3
npm/kysely@0.29.0

vincentkoc · 2026-05-16T20:27:13Z

Landed the current-main-compatible heartbeat capture slice from this PR on main.

Commit: 55edadf

What landed:

Runtime parity capture now scans newest root-session transcripts and skips heartbeat-only operational transcripts before selecting the scenario reply.
Covered heartbeat shapes include legacy HEARTBEAT_OK, [OpenClaw heartbeat poll], heartbeat_respond, due-task heartbeats that run ordinary tools first, and user-role tool_result rows.
Added the changelog entry crediting @100yenadmin.
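
The capture behaviour described above can be sketched roughly as: walk transcripts newest first and pick the first one that is not purely heartbeat traffic. This simplified version checks only the two text markers quoted in the list. The transcript shape and function names are assumptions, and the landed code also handles `heartbeat_respond`, due-task heartbeats, and `tool_result` rows, which are not modelled here.

```typescript
// Hypothetical, simplified transcript shape; the real session format is not shown in this thread.
interface TranscriptMessage {
  role: "user" | "assistant";
  text: string;
}

interface Transcript {
  id: string;
  messages: TranscriptMessage[];
}

// Text markers quoted in the comment above.
const HEARTBEAT_MARKERS = ["HEARTBEAT_OK", "[OpenClaw heartbeat poll]"];

// A transcript is heartbeat-only when every message carries a heartbeat marker.
function isHeartbeatOnly(transcript: Transcript): boolean {
  return (
    transcript.messages.length > 0 &&
    transcript.messages.every((message) =>
      HEARTBEAT_MARKERS.some((marker) => message.text.includes(marker)),
    )
  );
}

// Transcripts are assumed to be ordered newest first.
function selectScenarioTranscript(newestFirst: Transcript[]): Transcript | undefined {
  return newestFirst.find((transcript) => !isHeartbeatOnly(transcript));
}

// Made-up sample data: the newest transcript is a heartbeat exchange, the older one is the scenario.
const selected = selectScenarioTranscript([
  {
    id: "heartbeat",
    messages: [
      { role: "user", text: "[OpenClaw heartbeat poll]" },
      { role: "assistant", text: "HEARTBEAT_OK" },
    ],
  },
  {
    id: "scenario",
    messages: [
      { role: "user", text: "Summarize the workspace." },
      { role: "assistant", text: "Here is the summary." },
    ],
  },
]);
```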

Proof:

Codex review: codex review --uncommitted final pass reported no actionable correctness issues.
Local after final rebase: git diff --check HEAD~1..HEAD; node scripts/run-vitest.mjs extensions/qa-lab/src/runtime-parity.test.ts (16 tests passed).
Testbox pre-push changed gate: tbx_01krs7152m0yjpbfmkc86yz4s5, run https://github.com/openclaw/openclaw/actions/runs/25971893086, focused runtime parity test plus pnpm check:changed exited 0.
Exact landed-SHA Testbox: tbx_01krs7cs45aws71c5yk14bfj7r, run https://github.com/openclaw/openclaw/actions/runs/25972036173, confirmed HEAD=55edadf86fdf0b3238137b0f7a10a73ded8352ae, focused runtime parity test passed.

I am not closing this PR wholesale: the remaining phase 2-5 branch delta is still too broad/stale against current main and needs more current-main slices rather than a direct merge.

vincentkoc · 2026-05-17T03:07:47Z

Landed a current-main slice from this PR:
d801d27

Scope landed:

Added QA-Lab gateway log sentinels for plugin hook failures, plugin contract errors, Codex app-server stalls/timeouts, stalled agent runs, cron allowlist drift, live quota/subscription blockers, and direct-reply self-message transcripts.
Wired sentinel findings into runtime parity cell capture/reporting so self-health regressions become hard QA-Lab failures instead of only unit-tested helpers.
Added changelog credit for this slice.

Verification:

Local focused tests: node scripts/run-vitest.mjs extensions/qa-lab/src/gateway-log-sentinel.test.ts extensions/qa-lab/src/runtime-parity.test.ts passed, 2 files / 23 tests.
Local formatting/diff hygiene: oxfmt --check on touched QA-Lab files passed; git diff --check --cached passed before commit.
Testbox focused proof: tbx_01krsw71w04w0efg7nxvs388w7, Actions run https://github.com/openclaw/openclaw/actions/runs/25979142410, focused QA-Lab tests passed, 2 files / 23 tests.

pnpm check:changed --base HEAD^ --head HEAD was attempted after push but is not claimed green. Remote gate attempts were blocked by runner/provider issues:

tbx_01krsx6stvtvjfd4y7mx8s5z3r: discarded; cleanup removed Testbox deps before command.
tbx_01krsxbbahpf0dra97mm2nvm92: Blacksmith status API TLS handshake timeout before user command.
tbx_01krsxn9r1qsymgb8xcsgw6v60: Blacksmith warmup API deadline exceeded.
cbx_174304a42148 / run_103bffbe0f42: AWS Crabbox synced, then SSH dropped before command execution.
Reuse of tbx_01krsw71w04w0efg7nxvs388w7: SSH/rsync refused, likely expired/unreachable.

Leaving this PR open. The remaining phase 2-5 branch delta is still broad/stale and should continue landing as smaller current-main slices with separate proof.

vincentkoc · 2026-05-17T05:51:02Z

Landed another current-main slice from this PR:
e66a6c8

Scope landed:

Added runtime-first-hour-20-turn and runtime-soak-100-turn scenario-pack entries.
Added runtimeParityTier metadata parsing for standard/optional/live-only/soak lanes.
Documented runtime parity tier metadata in the scenario-pack index.
Added an Unreleased changelog entry with PR credit.
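
A minimal sketch of tier parsing for the four lane names mentioned above might look like this. The function name, the choice to return `undefined` when the field is absent, and throwing on unknown values are all assumptions rather than the landed behaviour.

```typescript
// Tier names taken from the comment above.
const RUNTIME_PARITY_TIERS = ["standard", "optional", "live-only", "soak"] as const;
type RuntimeParityTier = (typeof RUNTIME_PARITY_TIERS)[number];

// Hypothetical parser for a scenario's runtimeParityTier metadata value.
function parseRuntimeParityTier(value: unknown): RuntimeParityTier | undefined {
  if (value === undefined || value === null) {
    // Assumed: scenarios without the field simply have no tier.
    return undefined;
  }
  const match = RUNTIME_PARITY_TIERS.find((tier) => tier === value);
  if (match === undefined) {
    // Assumed: unknown tiers are rejected loudly rather than ignored.
    throw new Error(`Unknown runtimeParityTier: ${String(value)}`);
  }
  return match;
}

const soakTier = parseRuntimeParityTier("soak");
const missingTier = parseRuntimeParityTier(undefined);
```

Rejecting unknown values keeps a typo in scenario metadata from silently moving a lane out of its intended gate.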

Verification:

node scripts/run-vitest.mjs extensions/qa-lab/src/scenario-catalog.test.ts extensions/qa-lab/src/scenario-packs.test.ts passed, 2 files / 22 tests.
oxfmt --check on touched TS files passed.
git diff --check --cached passed before commit.
Verified new scenario docsRefs and codeRefs point at current-main paths after the stale PR refs were corrected.

Review note:

Codex review started, but wandered into broad repo exploration. The actionable issue it surfaced was a stale docs/code ref from the PR branch; that was fixed before landing and retested.

Leaving this PR open for the remaining runtime-parity harness/reporting/tool-matrix slices.

vincentkoc · 2026-05-17T06:13:13Z

Landed another current-main slice from this PR:
826c2f4

Scope landed:

Added codex-pi-shaped-read-vocabulary, a live-only runtime parity canary for Codex-native workspace reads when prompts use legacy Pi-shaped Read tool wording.
Added catalog assertions for the scenario, marker, unavailable-tool needles, and runtimeParityTier.
Added an Unreleased changelog entry with PR credit.

Verification:

node scripts/run-vitest.mjs extensions/qa-lab/src/scenario-catalog.test.ts extensions/qa-lab/src/scenario-packs.test.ts passed, 2 files / 23 tests.
oxfmt --check on the touched TS test file passed.
git diff --check --cached passed before commit.
Verified docs/code refs point at current-main files.

Leaving this PR open for the remaining harness/report/tool-matrix slices.

vincentkoc · 2026-05-17T08:20:27Z

Landed another scoped slice from this PR on main:

Commit: 2547e35
Source PR head reviewed: https://github.com/electricsheephq/openclaw/commit/7755e898bc3e1696032f7adc7561a94f02e68778

What landed:

Added live-only harness self-health scenarios for plugin hook crash sentinels, plugin manifest contracts.tools diagnostics, and WebChat direct-reply self-message routing.
Exposed gateway-log sentinel helpers and session transcript summaries to QA flow scenarios.
Widened the manifest-contract sentinel to catch the actual runtime diagnostic shape: plugin must declare contracts.tools for: ....
Added the matching changelog entry crediting @100yenadmin.
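
For the manifest-contract case, a log sentinel can be as simple as a line scan for the diagnostic prefix quoted above. The sketch below uses only that quoted text. The `SentinelFinding` shape, the `scanGatewayLog` name, and the sample log lines are hypothetical and do not reflect the real helper's API or real gateway output.

```typescript
// Hypothetical finding shape for illustration.
interface SentinelFinding {
  kind: "plugin-manifest-contract";
  line: string;
}

// Diagnostic prefix quoted in the comment above; text after the colon is treated as opaque.
const MANIFEST_CONTRACT_PATTERN = /plugin must declare contracts\.tools for:/;

// Scans log text line by line and returns one finding per matching line.
function scanGatewayLog(logText: string): SentinelFinding[] {
  return logText
    .split(/\r?\n/)
    .filter((line) => MANIFEST_CONTRACT_PATTERN.test(line))
    .map((line) => ({ kind: "plugin-manifest-contract" as const, line }));
}

// Made-up log lines; only the diagnostic prefix comes from the thread.
const sampleLog = [
  "gateway started",
  "plugin must declare contracts.tools for: example_tool",
  "gateway idle",
].join("\n");

const findings = scanGatewayLog(sampleLog);
```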

Verification:

node scripts/run-vitest.mjs extensions/qa-lab/src/gateway-log-sentinel.test.ts extensions/qa-lab/src/suite-runtime-agent-session.test.ts extensions/qa-lab/src/scenario-runtime-api.test.ts extensions/qa-lab/src/suite-runtime-flow.test.ts extensions/qa-lab/src/scenario-catalog.test.ts extensions/qa-lab/src/scenario-packs.test.ts passed: 6 files / 39 tests.
git diff --check HEAD~1..HEAD passed after rebase.
Blacksmith Testbox pnpm check:changed passed on tbx_01krtg3agjpbvm4813t2cr7xk9; Actions run: https://github.com/openclaw/openclaw/actions/runs/25985619752

PR remains open for the remaining QA-Lab coverage slices.

vincentkoc · 2026-05-17T08:56:44Z

Landed another current-main slice from this PR:
d217fd7

Scope landed:

Added 20 runtime tool fixture scenarios under qa/scenarios/runtime/tools/ covering Codex-native workspace tools, OpenClaw dynamic tools, and optional plugin-backed tools.
Added runtime tool fixture execution and coverage-report helpers for happy/failure-path mock planning, known harness gaps, optional/profile rows, and Codex-native workspace report-only rows.
Wired runRuntimeToolFixture into QA-Lab markdown-flow runtime APIs and expanded mock OpenAI tool-search planning for explicit fixture targets and denied-input failure probes.
Added the Unreleased changelog entry crediting @100yenadmin.

Companion baseline fix landed:
9ca98a6

That second commit fixes a current-main type/schema drift where ModelCompatSchema accepted thinkingFormat: "together" but ModelCompatConfig did not, which blocked the final changed gate after main moved.

Verification:

Testbox tbx_01krtj5654stgcbg8e39300mqg, Actions run https://github.com/openclaw/openclaw/actions/runs/25986354006, exited 0.
In that Testbox run, focused QA-Lab tests passed: runtime-tool-fixture.test.ts, tool-coverage-report.test.ts, scenario-runtime-api.test.ts, suite-runtime-flow.test.ts, providers/mock-openai/server.test.ts, and scenario-catalog.test.ts => 6 files / 102 tests.
The same run then executed pnpm check:changed; because full sync widened the detected surface, it ran lanes=all, including all typecheck, oxlint, and runtime import-cycle checks, and exited 0.
Local narrow check while debugging the moving-main baseline: OPENCLAW_LOCAL_CHECK_MODE=throttled node scripts/run-tsgo.mjs -p tsconfig.extensions.json --incremental --tsBuildInfoFile .artifacts/tsgo-cache/extensions.tsbuildinfo passed after the thinkingFormat type fix.

Post-push exact SHA state for 9ca98a6d399b96a7adecad23693fe1585a40a222 when checked: Workflow Sanity and ClawSweeper Dispatch were green; CI/Docs/Plugin NPM Release were still in progress or pending.

Leaving this PR open for the remaining broad runtime-parity harness/reporting deltas.

100yenadmin · 2026-05-17T08:57:53Z

Feel free to wrap and take over from here. It takes a lot of inference to get right and keep updated with betas. It works well: I built it over 3 days and ran it manually + with API keys to find the bugs I found last week, and I'm sure more will pop up as we keep betas rolling 🫡 @vincentkoc

vincentkoc · 2026-05-17T09:20:08Z

Follow-up after the post-landing deadcode failure:

1f9d8c1e9d559e2cc6e1d38de00953bc057b9e94 wires the runtime tool coverage report into openclaw qa coverage --tools [--summary], so extensions/qa-lab/src/tool-coverage-report.ts is now production-reachable instead of test-only.
2c9f68f42b1f41d9a6d9ef140ca141954e88693c backfills the missing changelog entry for that QA-Lab operator surface.

Verification:

Exact SHA 1f9d8c1e9d559e2cc6e1d38de00953bc057b9e94: CI passed, including CI run https://github.com/openclaw/openclaw/actions/runs/25986824400 plus Workflow Sanity and Plugin NPM Release.
Exact SHA 2c9f68f42b1f41d9a6d9ef140ca141954e88693c: changelog-only follow-up; Workflow Sanity, Docs, and ClawSweeper Dispatch passed.

vincentkoc · 2026-05-18T02:14:06Z

Slice landed on main in 58e1351.

This pulls the #80339 hard-gate piece out of the broad PR: required OpenClaw dynamic direct runtime-tool rows are now evaluated by a blocking release-check verifier, while the Codex-native/searchable fidelity work remains tracked under #80319 and adjacent issues.
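As a rough illustration of what a blocking verifier over required rows could look like, the sketch below blocks only on required OpenClaw dynamic rows and keeps everything else report-only. The row shape, the `owner`/`required` fields, and `verifyRequiredToolRows` are assumptions made for this example, not the verifier that landed in 58e1351.

```typescript
// Hypothetical row shape for a runtime-tool coverage report.
type ToolOwner = "openclaw-dynamic" | "codex-native";
type RowStatus = "pass" | "fail" | "skipped";

interface RuntimeToolRow {
  tool: string;
  owner: ToolOwner;
  required: boolean;
  status: RowStatus;
}

interface GateResult {
  ok: boolean;
  blocking: string[];
  reportOnly: string[];
}

// Only required OpenClaw dynamic rows can block; other non-passing rows are
// collected as report-only notes so they stay visible without failing the gate.
function verifyRequiredToolRows(rows: RuntimeToolRow[]): GateResult {
  const blocking: string[] = [];
  const reportOnly: string[] = [];
  for (const row of rows) {
    if (row.status === "pass") continue;
    const note = `${row.tool}: ${row.status}`;
    if (row.owner === "openclaw-dynamic" && row.required) {
      blocking.push(note);
    } else {
      reportOnly.push(note);
    }
  }
  return { ok: blocking.length === 0, blocking, reportOnly };
}
```

In this sketch a skipped required row blocks just like a failed one; whether the real verifier treats skips that way is not stated in the thread.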

Proof summary: local focused QA-Lab/runtime tests passed (7 files / 181 tests), autoreview clean, Testbox-through-Crabbox pnpm check:changed passed on tbx_01krwcck260em48avwzfs7kf65, and exact-SHA CI/Workflow Sanity/Docs are green for 58e13518633f6df8fe6be304a95eef3ab485bebc.

vincentkoc · 2026-05-18T03:16:44Z

Narrow token-efficiency slice related to #81093 has landed on main in 1300b22.

That commit keeps the main-compatible part from this area small:

adds qa parity-report --runtime-axis --token-efficiency reports and JSON summaries
classifies Codex savings separately from regressions
fails only positive Codex-over-Pi live token deltas above threshold
wires the source fallback wrapper and changelog
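The classification rule in the list above can be sketched as follows. This is a minimal illustration under stated assumptions: the row fields, the verdict names, and the use of a relative (rather than absolute) threshold are all hypothetical and not taken from 1300b22.

```typescript
// Hypothetical row shape; "mock-estimate" and "live-usage" mirror the two
// token sources the PR summary distinguishes.
type TokenSource = "mock-estimate" | "live-usage";

interface TokenRow {
  scenario: string;
  source: TokenSource;
  piTokens: number;
  codexTokens: number;
}

type TokenVerdict = "savings" | "within-threshold" | "regression" | "report-only";

// Relative Codex-over-Pi delta: positive means Codex used more tokens than Pi.
function relativeDelta(row: TokenRow): number {
  if (row.piTokens <= 0) return 0; // assumed guard against a missing baseline
  return (row.codexTokens - row.piTokens) / row.piTokens;
}

// Only live rows can fail; negative deltas are savings, never regressions.
function classifyTokenRow(row: TokenRow, threshold: number): TokenVerdict {
  if (row.source !== "live-usage") return "report-only";
  const delta = relativeDelta(row);
  if (delta < 0) return "savings";
  return delta > threshold ? "regression" : "within-threshold";
}
```

Treating savings as their own verdict keeps a large negative delta from tripping a gate that is only meant to catch Codex costing more than Pi.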

Proof is on the issue closeout: #81093 (comment).

This PR is still open and currently reports CONFLICTING at head 7755e898bc3e1696032f7adc7561a94f02e68778, so any rebase should drop/adjust the token-efficiency semantics already on main and keep only the broader runtime harness pieces that are still distinct.

vincentkoc · 2026-05-18T03:38:08Z

Phase 4 token-efficiency report and scheduled artifact slice has now landed on main:

report/CLI/source-wrapper: 1300b22
schedule-only live-frontier artifact lane: a642ca9
closeout proof: [Codex×Pi parity Phase 4] Token-efficiency report #80175 (comment)

For any rebase of this PR, drop or reconcile the Phase 4/token-efficiency pieces against main; the remaining useful scope is the distinct runtime harness/proof work not covered by those two commits.

vincentkoc · 2026-05-21T15:08:00Z

Landed the scoped JSONL replay slice on main in cf06578.

What landed:

qa jsonl-replay CLI wiring for curated mock JSONL transcript replay.
Seven synthetic replay fixtures under qa/scenarios/jsonl-replay/.
First-drift reporting through the existing runtime-parity drift classes.
Changelog credit for @100yenadmin.
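To make "first-drift reporting" concrete, here is a minimal sketch of comparing two JSONL transcripts and stopping at the first differing event. `parseJsonl`, `findFirstDrift`, and the event fields used in the usage note are hypothetical; the sketch does not map the drift onto the runtime-parity drift classes, and it assumes both transcripts serialize events with a stable key order.

```typescript
// Parse a JSONL transcript into one value per non-empty line.
function parseJsonl(text: string): unknown[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as unknown);
}

interface FirstDrift {
  index: number;
  expected: unknown;
  actual: unknown;
}

// Walk both transcripts in order and report the first event that differs.
// A missing event on either side (length mismatch) also counts as drift.
function findFirstDrift(expected: unknown[], actual: unknown[]): FirstDrift | null {
  const length = Math.max(expected.length, actual.length);
  for (let i = 0; i < length; i++) {
    // Assumption: stable key order, so serialized equality is good enough.
    if (JSON.stringify(expected[i]) !== JSON.stringify(actual[i])) {
      return { index: i, expected: expected[i] ?? null, actual: actual[i] ?? null };
    }
  }
  return null;
}
```

For example, replaying a fixture whose second event is `{"type":"tool_result","ok":true}` against a run that produced `{"type":"tool_result","ok":false}` would report drift at index 1 and skip everything after it.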

Verification:

node_modules/.bin/oxfmt --check CHANGELOG.md extensions/qa-lab/src/cli.ts extensions/qa-lab/src/cli.runtime.ts extensions/qa-lab/src/cli.test.ts extensions/qa-lab/src/cli.runtime.test.ts extensions/qa-lab/src/jsonl-replay.ts extensions/qa-lab/src/jsonl-replay.test.ts
git diff --check origin/main...HEAD
node scripts/run-vitest.mjs extensions/qa-lab/src/jsonl-replay.test.ts extensions/qa-lab/src/cli.runtime.test.ts extensions/qa-lab/src/cli.test.ts -> 3 files, 89 tests passed.

Known proof gap: Blacksmith Testbox direct run failed before command execution with rsync: unexpected end of file; local pnpm check:changed was aborted because pnpm tried to reconcile the shared node_modules from the Codex worktree. Post-push CI is running for the landed SHA: https://github.com/openclaw/openclaw/actions/runs/26234573078.

I am leaving this PR open for the remaining real runtime-cell replay work unless a maintainer wants this broad branch closed as superseded by landed slices.

clawsweeper · 2026-05-23T00:00:01Z

ClawSweeper PR egg

🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat.

Where did the egg go?

The egg game starts only after the PR passes the real-behavior proof check.
Before that, no creature or rarity is rolled. The treat waits for real proof.
This is still just collectible flavor: proof affects review readiness, not creature quality.

vincentkoc · 2026-05-25T21:01:31Z

Landed the confidence-report slice from this PR on main in f6a49a4.

Credited slice: @100yenadmin's QA-Lab confidence/reporting work from #80323. I kept this PR open because the branch still contains broader runtime-parity work beyond the slice that is now on main.

Behavior addressed: QA-Lab now has qa confidence-report and qa confidence-self-test, a codex-100 confidence profile, harness-parity helpers, production prompt/tool content hashes for Pi and Codex report producers, and stricter artifact classification so empty, malformed, partial, skipped, inconsistent, or privacy-leaky proof artifacts do not false-green the confidence gate.

Real environment tested: Blacksmith Testbox through Crabbox, provider blacksmith-testbox, id tbx_01ksgef032e049f2zqjstdwybe, Actions run https://github.com/openclaw/openclaw/actions/runs/26419284334.

Exact steps or command run after this patch:

local: node scripts/run-vitest.mjs extensions/qa-lab/src/confidence-report.test.ts extensions/qa-lab/src/harness-parity.test.ts extensions/qa-lab/src/cli.runtime.test.ts src/agents/system-prompt-report.test.ts extensions/codex/src/app-server/run-attempt.test.ts
local: node_modules/.bin/oxfmt --check ...
local: git diff --check
Testbox: pnpm test extensions/qa-lab/src/confidence-report.test.ts extensions/qa-lab/src/harness-parity.test.ts extensions/qa-lab/src/cli.runtime.test.ts src/agents/system-prompt-report.test.ts extensions/codex/src/app-server/run-attempt.test.ts
Testbox: pnpm openclaw qa confidence-self-test --output-dir .artifacts/qa-e2e/confidence-self-test-testbox
Testbox: pnpm check:changed

Evidence after fix: Testbox focused tests passed 5 files / 322 tests, confidence self-test verdict was pass, and pnpm check:changed exited 0.

Observed result after fix: the confidence gate now rejects missing proof rows, malformed suite summaries, count/scenario mismatches, missing JSONL drift data, empty self-test canaries, optional-lane blockers, skipped rows without backfill, and absolute local artifact paths in persisted summaries.
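A few of the rejections listed above can be sketched as a single pass over a suite summary. Everything here is a hedged illustration: apart from `counts.skipped`, which the PR summary mentions for qa-suite-summary.json, the field names (`scenarios`, `artifactPath`, `backfilledBy`) and the `collectBlockers` helper are assumptions for the example, not the landed gate.

```typescript
// Hypothetical summary shape for illustration only.
interface ScenarioRow {
  id: string;
  status: "pass" | "fail" | "skipped";
  artifactPath?: string;
  backfilledBy?: string;
}

interface SuiteSummary {
  counts: { passed: number; failed: number; skipped: number };
  scenarios: ScenarioRow[];
}

// Treat POSIX-rooted and Windows drive-letter paths as absolute.
function isAbsolutePath(path: string): boolean {
  return path.startsWith("/") || /^[A-Za-z]:[\\/]/.test(path);
}

// Collect every reason the summary should not count as green proof.
function collectBlockers(summary: SuiteSummary): string[] {
  const blockers: string[] = [];
  if (summary.scenarios.length === 0) blockers.push("missing proof rows");

  const tally = { passed: 0, failed: 0, skipped: 0 };
  for (const row of summary.scenarios) {
    if (row.status === "pass") tally.passed += 1;
    else if (row.status === "fail") tally.failed += 1;
    else tally.skipped += 1;

    if (row.status === "skipped" && !row.backfilledBy) {
      blockers.push(`skipped row without backfill: ${row.id}`);
    }
    if (row.artifactPath !== undefined && isAbsolutePath(row.artifactPath)) {
      blockers.push(`absolute artifact path: ${row.id}`);
    }
  }

  const { counts } = summary;
  if (counts.passed !== tally.passed || counts.failed !== tally.failed || counts.skipped !== tally.skipped) {
    blockers.push("count/scenario mismatch");
  }
  return blockers;
}
```

Returning a list of blockers rather than a boolean keeps each rejection reason visible in the report, so an empty or partial artifact fails loudly instead of passing by default.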

What was not tested: true live-provider QA confidence manifests; this slice only adds the confidence gate/self-test plumbing and mock/proof validation path.

openclaw-barnacle · 2026-06-07T05:07:48Z

This assigned pull request has been automatically marked as stale after being open for 27 days.
Please add updates or it will be closed.

Eva (agent) added 3 commits May 10, 2026 18:33

test(qa-lab): add runtime parity harness

dc50911

ci(qa-lab): always publish runtime parity report

249e90b

test(qa-lab): expand runtime parity phases

8348be8

openclaw-barnacle Bot added the agents (Agent runtime and tooling), extensions: qa-lab, and size: XL labels May 10, 2026

100yenadmin mentioned this pull request May 10, 2026

Codex-vs-Pi runtime parity QA harness (RFC + tracking) #80171

Closed

openclaw-barnacle Bot added the triage: needs-real-behavior-proof label (Candidate: external PR needs after-fix proof from a real setup.) May 10, 2026

fix(qa-lab): satisfy runtime parity CI guards

900df28

openclaw-barnacle Bot added the scripts (Repository scripts), docker (Docker and sandbox tooling), and proof: supplied (External PR includes structured after-fix real behavior proof.) labels and removed the triage: needs-real-behavior-proof (Candidate: external PR needs after-fix proof from a real setup.) label May 10, 2026

Merge remote-tracking branch 'upstream/main' into codex-vs-pi-runtime-parity-tools

2ba224f

clawsweeper Bot added the proof: sufficient label (ClawSweeper judged the real behavior proof convincing.) May 10, 2026

test: clear current-main lint and type guards

76bb348

100yenadmin requested a review from a team as a code owner May 10, 2026 16:07

openclaw-barnacle Bot added the channel: discord (Channel integration: discord), channel: googlechat (Channel integration: googlechat), and channel: line (Channel integration: line) labels and removed the docker (Docker and sandbox tooling) label May 10, 2026

clawsweeper Bot mentioned this pull request May 12, 2026

[QA-lab] Live token-efficiency gate flags Codex savings and exposes fs.read overhead #81093

Closed

Eva (agent) added 9 commits May 13, 2026 00:05

qa: harden runtime confidence proof

d10b03a

qa: add pi-shaped codex read canary

0ce6029

Add QA gateway log sentinels

baab536

Cover log sentinel tests in proof workflow

2dd53f4

Regenerate protocol Swift baseline

78ac4fb

Fix extension ANSI helper boundary

f441664

Ignore heartbeat transcripts in runtime parity capture

b87554d

Guard compaction test message content narrowing

7755e89

Kaspre mentioned this pull request May 14, 2026

Cross-runtime model changes (Codex→Pi or Pi→Codex) silently degrades agent tool use capability #81734

Closed

clawsweeper Bot mentioned this pull request May 16, 2026

docs(qa-lab): runtime-parity gate design (Pi vs Codex harness) #80179

Closed

4 tasks

kilo-code-bot Bot mentioned this pull request May 17, 2026

[pull] main from openclaw:main qq958691165/openclaw#4

Open

nbarthelemy mentioned this pull request May 26, 2026

[Bug]: Codex OAuth profile in main agentDir does not propagate to subagent sessions — silent fallback hallucinates tool calls in production #87051

Open

6 participants
