[qa-lab] Complete Codex vs Pi runtime parity harness phases 2-5 #80323

Closed
100yenadmin wants to merge 82 commits into openclaw:main from electricsheephq:codex-vs-pi-runtime-parity-tools

Conversation

@100yenadmin

100yenadmin commented May 10, 2026

Contributor

Summary

Adds the Codex-vs-Pi runtime parity QA harness across extensions/qa-lab, including runtime-pair execution, runtime suite selectors, harness-prompt parity, token-efficiency reporting, tool-default fixtures, JSONL replay, confidence self-tests, strict confidence reports, and release-check wiring.

This branch now tests Codex at the right layer:

  • Codex-native workspace tools (read, write, edit, apply_patch, exec, process, update_plan) are not expected to appear as duplicate OpenClaw dynamic tools.
  • OpenClaw-owned integration tools remain dynamic-tool parity rows and pass the deterministic direct mock lane.
  • Provider-plan intent is reported separately from actual runtime transcript tool calls.
  • Report-only/native-live/mock-limitation rows are explicit skips, including counts.skipped in qa-suite-summary.json.
  • Token-efficiency rows distinguish mock-estimate from live-usage.
  • Seeded confidence canaries prove the harness catches prompt, tool schema, tool-call, tool-result, failure-mode, token, and JSONL replay regressions.
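
The first two bullets amount to a classification rule. A minimal sketch follows, assuming a hypothetical `classifyToolRow` helper and made-up result labels (neither is taken from the PR's code); only the seven tool names come from the list above.

```typescript
// Hypothetical sketch: the tool names come from the PR summary, but the
// function and the result labels are illustrative, not the harness API.
const CODEX_NATIVE_TOOLS: ReadonlySet<string> = new Set([
  "read",
  "write",
  "edit",
  "apply_patch",
  "exec",
  "process",
  "update_plan",
]);

type ToolRowKind = "codex-native-report-only" | "dynamic-parity";

function classifyToolRow(toolName: string): ToolRowKind {
  // Codex-native workspace tools are owned by the Codex runtime, so a missing
  // duplicate OpenClaw dynamic tool is not treated as a parity failure.
  return CODEX_NATIVE_TOOLS.has(toolName)
    ? "codex-native-report-only"
    : "dynamic-parity";
}

console.log(classifyToolRow("apply_patch")); // "codex-native-report-only"
// "example_integration_tool" is a made-up name standing in for an OpenClaw-owned tool.
console.log(classifyToolRow("example_integration_tool")); // "dynamic-parity"
```

Keeping the native set explicit means a new Codex-native tool has to be added deliberately before it stops being checked as a dynamic parity row.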

Why

OpenClaw needs a maintainer-runnable gate that compares the same scenario/model under Pi and Codex before Codex becomes the default runtime. The gate must surface real runtime drift without turning mock-provider limitations or intentional Codex-native tool ownership into production bug reports.

Current Verification

Latest validated branch head: 3336dec6419c9cc9a87dc7cfa6f48118ca2d838e

OpenClaw baseline merged into this branch: v2026.5.10-beta.5

Remote proof run: https://github.com/electricsheephq/openclaw-local-test/actions/runs/25719383976

Static/unit proof passed remotely:

  • pnpm check:test-types
  • pnpm lint --threads=8
  • Targeted QA-lab/Codex dynamic-tools tests:
    extensions/qa-lab/src/providers/mock-openai/server.test.ts, runtime-tool-fixture.test.ts, runtime-parity.test.ts, runtime-suite.test.ts, suite.test.ts, cli.runtime.test.ts, tool-coverage-report.test.ts, token-efficiency-report.test.ts, harness-parity.test.ts, jsonl-replay.test.ts, codex-plugin-lifecycle.test.ts, scenario-catalog.test.ts, confidence-report.test.ts, and extensions/codex/src/app-server/dynamic-tools.test.ts.

Local surgical checks after the beta.5 merge also passed:

  • pnpm test extensions/minimax/index.test.ts extensions/telegram/src/bot.create-telegram-bot.test.ts extensions/whatsapp/src/login.coverage.test.ts
  • OPENCLAW_VITEST_MAX_WORKERS=1 pnpm test extensions/qa-lab/src/confidence-report.test.ts extensions/qa-lab/src/runtime-parity.test.ts extensions/qa-lab/src/cli.runtime.test.ts
  • OPENCLAW_VITEST_MAX_WORKERS=1 pnpm test extensions/qa-lab/src/scenario-catalog.test.ts extensions/qa-lab/src/suite.test.ts
  • OPENCLAW_VITEST_MAX_WORKERS=1 pnpm test extensions/qa-lab/src/suite-summary.test.ts extensions/qa-lab/src/suite.summary-json.test.ts extensions/qa-lab/src/confidence-report.test.ts

Real Behavior Proof

  • Behavior or issue addressed: Running the new QA CLI against a real OpenClaw checkout should produce a trustworthy Codex-vs-Pi parity signal: Codex-native workspace tools are not false-failed for missing duplicate OpenClaw dynamic exposure, OpenClaw dynamic integration tools still execute through the dynamic bridge, report-only rows are explicit skips, token-efficiency output is labeled by usage source, JSONL replay artifacts are emitted, and seeded canaries prove the harness can catch regressions.
  • Real environment tested: GitHub Actions Ubuntu 24.04 remote runner checking out this PR head 3336dec6419c9cc9a87dc7cfa6f48118ca2d838e, with OpenClaw v2026.5.10-beta.5 merged into the branch. Artifacts were downloaded and inspected from a local macOS OpenClaw checkout at /Volumes/LEXAR/repos/openclaw-1.
  • Exact steps or command run after this patch:
gh workflow run qa-runtime-confidence-proof.yml \
  --repo electricsheephq/openclaw-local-test \
  --ref main \
  -f target_ref=codex-vs-pi-runtime-parity-tools \
  -f expected_sha=3336dec6419c9cc9a87dc7cfa6f48118ca2d838e \
  -f run_soak=false \
  -f run_live=false

gh run view 25719383976 --repo electricsheephq/openclaw-local-test --json status,conclusion,jobs

gh run download 25719383976 \
  --repo electricsheephq/openclaw-local-test \
  --dir /Volumes/LEXAR/Codex/qa-runtime-confidence-artifacts-25719383976

jq '{pass, zeroUnknowns, counts, failures}' \
  /Volumes/LEXAR/Codex/qa-runtime-confidence-artifacts-25719383976/qa-runtime-confidence-mock-3336dec6419c9cc9a87dc7cfa6f48118ca2d838e/confidence-report/qa-confidence-summary.json
  • Evidence after fix: Terminal capture from the downloaded artifact summaries:
{
  "tool-defaults-direct": { "total": 20, "passed": 20, "skipped": 0, "failed": 0 },
  "openclaw-dynamic-tools-direct": { "total": 8, "passed": 8, "skipped": 0, "failed": 0 },
  "tool-defaults-searchable": { "total": 20, "passed": 15, "skipped": 5, "failed": 0 },
  "first-hour-20-direct": { "total": 18, "passed": 15, "skipped": 3, "failed": 0 },
  "fault-injection-mock": { "total": 5, "passed": 3, "skipped": 2, "failed": 0 },
  "jsonl-expanded": { "curatedTranscripts": 7, "turnsCompared": 15, "driftedTurns": 0 },
  "confidence-self-test": { "pass": true, "detectedCanaries": "7/7" },
  "confidence-report": { "pass": true, "zeroUnknowns": true, "passedLanes": 8, "blockedLanes": 4, "unknown": 0, "failed": 0 }
}
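
The lane counts in this capture can be sanity-checked mechanically. The sketch below assumes a hypothetical `LaneCounts` shape that mirrors the keys shown above; it confirms that passed, skipped, and failed add up to total and that no lane reports a failure. It is not the harness's own verifier.

```typescript
interface LaneCounts {
  total: number;
  passed: number;
  skipped: number;
  failed: number;
}

// Values copied from the capture above.
const lanes: Array<[string, LaneCounts]> = [
  ["tool-defaults-direct", { total: 20, passed: 20, skipped: 0, failed: 0 }],
  ["openclaw-dynamic-tools-direct", { total: 8, passed: 8, skipped: 0, failed: 0 }],
  ["tool-defaults-searchable", { total: 20, passed: 15, skipped: 5, failed: 0 }],
  ["first-hour-20-direct", { total: 18, passed: 15, skipped: 3, failed: 0 }],
  ["fault-injection-mock", { total: 5, passed: 3, skipped: 2, failed: 0 }],
];

function laneProblems(name: string, c: LaneCounts): string[] {
  const problems: string[] = [];
  if (c.passed + c.skipped + c.failed !== c.total) {
    problems.push(`${name}: counts do not sum to total`);
  }
  if (c.failed > 0) {
    problems.push(`${name}: ${c.failed} failed`);
  }
  return problems;
}

const allProblems: string[] = [];
for (const [name, counts] of lanes) {
  allProblems.push(...laneProblems(name, counts));
}
console.log(allProblems.length === 0 ? "all lanes consistent" : allProblems.join("\n"));
```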

Token-efficiency artifact excerpt:

{
  "status": "estimated",
  "providerMode": "mock-openai",
  "usageSources": ["mock-estimate"],
  "rows": 18,
  "pass": true,
  "piTotalTokens": 245125,
  "codexTotalTokens": 130286,
  "deltaPercent": -46.849158592554815
}
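
The `deltaPercent` value is consistent with the relative change of the Codex total against the Pi total. The sketch below shows that arithmetic; the formula is inferred from the numbers in the excerpt rather than stated by the PR, and `relativeDeltaPercent` is a hypothetical helper name.

```typescript
// Assumed formula: (codex - pi) / pi * 100, so a negative value means Codex used fewer tokens.
function relativeDeltaPercent(piTotalTokens: number, codexTotalTokens: number): number {
  return ((codexTotalTokens - piTotalTokens) / piTotalTokens) * 100;
}

const tokenDelta = relativeDeltaPercent(245125, 130286);
console.log(tokenDelta.toFixed(2)); // "-46.85"
```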

Current Bug Classification

Confirmed fixed or corrected here:

  • QA harness conflated Codex-native workspace tools with OpenClaw dynamic tool parity.
  • QA harness/reporting conflated provider-plan intent with runtime transcript tool calls.
  • Mock/searchable rows lacked clear report-only classification.
  • Report summaries omitted skipped/report-only counts.
  • The confidence gate lacked seeded negative controls.

Current product-bug verdict:

  • No confirmed Codex runner product bug remains from the mock proof lanes.
  • Native/live proof is still required before filing product bugs for Codex-native approval/read/write/compaction behavior.

Not Claimed Complete

Linked Issues

@clawsweeper

clawsweeper Bot commented May 10, 2026

Contributor

Codex review: needs real behavior proof before merge. Reviewed June 7, 2026, 1:18 AM ET / 05:18 UTC.

Summary
The PR adds a broad QA-Lab Codex-vs-Pi runtime parity harness, tool coverage, Codex lifecycle checks, token-efficiency reporting, JSONL replay, release workflows, and supporting tests/docs.

Reproducibility: not applicable. This is a broad QA harness feature PR rather than a single reported runtime bug, and current-head real-behavior proof is still missing.

Review metrics: 2 noteworthy metrics.

  • Changed surface breadth: 137 files changed in the PR branch. The branch spans QA runtime code, scenarios, reports, scripts, tests, and workflows, so stale-base reconciliation is a maintainer-visible merge concern.
  • Runtime axis drift: 1 renamed runtime pair contract. Current main uses openclaw,codex while the PR branch still uses pi,codex, which must be resolved before merging automation or CLI surfaces.

Merge readiness
Overall: 🦪 silver shellfish
Proof: 🦪 silver shellfish
Patch quality: 🦐 gold shrimp
Result: blocked until stronger real behavior proof is added.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.
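
The "weaker of the two" rule can be written as a minimum over an ordered list of ranks. This is a sketch under assumptions: the ordering follows the rank legend in this comment, `overallRank` is a hypothetical name, and the "off-meta tidepool" entry is left out because it marks items the rating does not apply to.

```typescript
// Ranks ordered from weakest to strongest, following the legend in this comment.
const RANKS = [
  "unranked krab",
  "silver shellfish",
  "gold shrimp",
  "platinum hermit",
  "diamond lobster",
  "challenger crab",
] as const;

type Rank = (typeof RANKS)[number];

function overallRank(proof: Rank, patchQuality: Rank): Rank {
  // The lower index is the weaker rank, and the weaker rank caps the overall result.
  return RANKS.indexOf(proof) <= RANKS.indexOf(patchQuality) ? proof : patchQuality;
}

console.log(overallRank("silver shellfish", "gold shrimp")); // "silver shellfish"
```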

Rank-up moves:

  • Rebase or split the remaining branch work onto current main and the openclaw,codex runtime axis.
  • [P1] Add current-head real behavior proof for the exact branch SHA after reconciliation.
  • Limit the next merge candidate to the remaining unique QA-Lab slice rather than the full stale branch.

Proof guidance:

  • [P1] Needs stronger real behavior proof before merge: The PR body supplies terminal/remote proof for older heads, but not for the current PR head, so contributor-visible after-fix proof is still insufficient. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Risk before merge

  • [P1] The branch is broad and stale against current main; direct merge would require resolving the old pi,codex runtime naming against the landed openclaw,codex contract.
  • [P1] The supplied real-behavior proof is for older heads, not the current PR head 7755e898bc3e1696032f7adc7561a94f02e68778.
  • [P1] The PR changes release and QA automation, so stale workflow wiring could break or duplicate maintainer proof lanes if merged without a fresh current-main reconciliation.

Maintainer options:

  1. Continue Slice-First Integration (recommended)
    Extract or rebase only the remaining QA-Lab pieces that still differ from current main, then verify them at the current PR head before merge.
  2. Accept Maintainer Takeover Risk
    A maintainer can take over the broad branch manually, but should treat the release workflow and runtime-axis rename as explicit reconciliation work.
  3. Close After Remaining Work Is Tracked
    If no unique branch remainder is still wanted after the already-landed slices, close this PR with links to the landed commits and any narrower follow-up issues.

Next step before merge

  • [P1] The remaining work is maintainer-owned slicing, stale-branch reconciliation, and exact-head proof review rather than a narrow autonomous fix.

Security
Cleared: The security pass found workflow-sensitive automation changes but no concrete secret, dependency, permission, or supply-chain regression in the inspected PR diff.

Review details

Best possible solution:

Keep the PR as a source branch until maintainers either extract the remaining current-main-compatible slices or rebase it to the openclaw,codex contract with exact-head proof.

Do we have a high-confidence way to reproduce the issue?

Not applicable: this is a broad QA harness feature PR rather than a single reported runtime bug, and current-head real-behavior proof is still missing.

Is this the best way to solve the issue?

No for direct merge: the sliced current-main direction is better than landing the stale branch wholesale because the branch predates the openclaw,codex runtime contract and current release-check layout.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 7e7ea0fed17c.

Label changes

Label justifications:

  • P2: This is important QA/release infrastructure, but it is not a current user-facing outage or security emergency.
  • merge-risk: 🚨 automation: The PR changes release and QA workflow lanes, and stale runtime-pair wiring could break or duplicate maintainer automation after merge.
  • rating: 🦪 silver shellfish: Overall readiness is 🦪 silver shellfish; proof is 🦪 silver shellfish and patch quality is 🦐 gold shrimp.
  • feature: ✨ showcase: ClawSweeper spotlight: unusually compelling feature idea for maintainer attention. A runtime parity harness that compares OpenClaw and Codex behavior, tool coverage, lifecycle stress, and token efficiency is strategically useful release infrastructure.
  • status: 📣 needs proof: The PR needs real behavior proof before ClawSweeper can clear the contributor ask. Needs stronger real behavior proof before merge: The PR body supplies terminal/remote proof for older heads, but not for the current PR head, so contributor-visible after-fix proof is still insufficient. After adding proof, update the PR body; ClawSweeper should re-review automatically. If it does not, the PR author or someone with repository write access can comment @clawsweeper re-review.

Evidence reviewed

Acceptance criteria:

  • [P1] git diff --check 7b4fd3d 7755e89.
  • [P1] After rebase or slicing, rerun the relevant QA-Lab runtime parity proof at the exact PR head.
  • [P1] Verify release workflow changes against current .github/workflows/openclaw-release-checks.yml runtime parity lanes.

What I checked:

  • Current main uses the renamed runtime-pair contract: Current main parses runtime pairs through normalizeRuntimePairValue, accepts openclaw,codex, and builds RuntimeId values as openclaw or codex, so the implementation has moved past the PR's original pi,codex naming. (extensions/qa-lab/src/cli.runtime.ts:183, 61bb7d5523b8)
  • PR head still carries the older runtime-pair names: The PR branch version of the runtime CLI still reports --runtime-pair must be pi,codex or codex,pi and defaults to ['pi','codex'], which needs reconciliation before any current-main merge path. (extensions/qa-lab/src/cli.runtime.ts:186, 7755e898bc3e)
  • Current main has release-check runtime parity lanes: The release workflow already contains runtime parity jobs that run --runtime-pair openclaw,codex, showing several central CI slices from the PR's intent have landed on main. (.github/workflows/openclaw-release-checks.yml:1009, 61bb7d5523b8)
  • PR head workflow still uses old parity wiring: The PR workflow adds runtime parity release checks using --runtime-pair pi,codex, so it cannot be taken as-is over the current workflow contract. (.github/workflows/openclaw-release-checks.yml:872, 7755e898bc3e)
  • Current main includes JSONL replay on the renamed runtime axis: Current jsonl-replay.ts defines runtimePair: ['openclaw','codex'] and cells keyed by openclaw and codex, covering a major requested Phase 5 surface but with the current contract. (extensions/qa-lab/src/jsonl-replay.ts:12, cf0657852f65)
  • Codex dynamic-tool contract checked directly: Sibling Codex source defines thread-start dynamic tool specs and dynamic tool request/response plumbing, supporting that OpenClaw's Codex bridge and deferred-tool behavior are the relevant dependency contract for this review. (../codex/codex-rs/app-server-protocol/src/protocol/v2/thread.rs:39, b89ce9a2bced)
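
The runtime-pair drift described in the first two findings can be illustrated with a small parser. This is a sketch only: `parseRuntimePair` is a hypothetical stand-in and not the `normalizeRuntimePairValue` on main, and treating `pi` as a legacy alias for `openclaw` is an assumption about how a reconciliation might look.

```typescript
type RuntimeId = "openclaw" | "codex";

function toRuntimeId(raw: string): RuntimeId {
  const value = raw.trim().toLowerCase();
  if (value === "codex") return "codex";
  // Assumption: the older "pi" name is accepted as an alias of "openclaw".
  if (value === "openclaw" || value === "pi") return "openclaw";
  throw new Error(`unknown runtime: ${raw}`);
}

function parseRuntimePair(raw: string): [RuntimeId, RuntimeId] {
  const parts = raw.split(",");
  const [first, second] = parts;
  if (parts.length !== 2 || first === undefined || second === undefined) {
    throw new Error("--runtime-pair must name exactly two runtimes");
  }
  const pair: [RuntimeId, RuntimeId] = [toRuntimeId(first), toRuntimeId(second)];
  if (pair[0] === pair[1]) {
    throw new Error("--runtime-pair must name two different runtimes");
  }
  return pair;
}

console.log(parseRuntimePair("pi,codex").join(",")); // "openclaw,codex"
```

Normalizing at the CLI boundary would let workflow files that still pass the older pair keep running while the rest of the code only sees the current names.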

Likely related people:

  • vincentkoc: Current main history and the provided PR discussion show repeated landing and maintenance of runtime parity, token efficiency, confidence report, JSONL replay, and release workflow slices. (role: recent area contributor and QA-Lab runtime parity follow-up owner; confidence: high; commits: 61bb7d5523b8, 03f1bf9a4df0, f6a49a4e8a13; files: extensions/qa-lab/src/runtime-parity.ts, extensions/qa-lab/src/cli.runtime.ts, extensions/qa-lab/src/jsonl-replay.ts)

What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@openclaw-barnacle (bot) added the labels scripts (Repository scripts), docker (Docker and sandbox tooling), and proof: supplied (External PR includes structured after-fix real behavior proof), and removed the label triage: needs-real-behavior-proof (Candidate: external PR needs after-fix proof from a real setup), May 10, 2026
@clawsweeper (bot) added the label proof: sufficient (ClawSweeper judged the real behavior proof convincing), May 10, 2026
@100yenadmin 100yenadmin requested a review from a team as a code owner May 10, 2026 16:07
@openclaw-barnacle (bot) added the labels channel: discord (Channel integration: discord), channel: googlechat (Channel integration: googlechat), and channel: line (Channel integration: line), and removed the label docker (Docker and sandbox tooling), May 10, 2026
Eva (agent) added 9 commits May 13, 2026 00:05
…-parity-tools

# Conflicts:
#	docs/.generated/config-baseline.sha256
#	docs/.generated/plugin-sdk-api-baseline.sha256
#	extensions/acpx/package.json
#	extensions/alibaba/package.json
#	extensions/amazon-bedrock-mantle/package.json
#	extensions/amazon-bedrock/package.json
#	extensions/anthropic-vertex/package.json
#	extensions/anthropic/package.json
#	extensions/arcee/package.json
#	extensions/azure-speech/package.json
#	extensions/bonjour/package.json
#	extensions/brave/package.json
#	extensions/browser/package.json
#	extensions/byteplus/package.json
#	extensions/canvas/package.json
#	extensions/cerebras/package.json
#	extensions/chutes/package.json
#	extensions/clickclack/package.json
#	extensions/cloudflare-ai-gateway/package.json
#	extensions/codex/package.json
#	extensions/comfy/package.json
#	extensions/copilot-proxy/package.json
#	extensions/deepgram/package.json
#	extensions/deepinfra/package.json
#	extensions/deepseek/package.json
#	extensions/diagnostics-otel/package.json
#	extensions/diagnostics-prometheus/package.json
#	extensions/diffs/package.json
#	extensions/discord/package.json
#	extensions/document-extract/package.json
#	extensions/duckduckgo/package.json
#	extensions/elevenlabs/package.json
#	extensions/exa/package.json
#	extensions/fal/package.json
#	extensions/feishu/package.json
#	extensions/file-transfer/package.json
#	extensions/firecrawl/package.json
#	extensions/fireworks/package.json
#	extensions/github-copilot/package.json
#	extensions/google-meet/package.json
#	extensions/google/package.json
#	extensions/googlechat/package.json
#	extensions/gradium/package.json
#	extensions/groq/package.json
#	extensions/huggingface/package.json
#	extensions/image-generation-core/package.json
#	extensions/imessage/package.json
#	extensions/inworld/package.json
#	extensions/irc/package.json
#	extensions/kilocode/package.json
#	extensions/kimi-coding/package.json
#	extensions/line/package.json
#	extensions/litellm/package.json
#	extensions/llm-task/package.json
#	extensions/lmstudio/package.json
#	extensions/lobster/package.json
#	extensions/matrix/package.json
#	extensions/mattermost/package.json
#	extensions/media-understanding-core/package.json
#	extensions/memory-core/package.json
#	extensions/memory-lancedb/package.json
#	extensions/memory-wiki/package.json
#	extensions/microsoft-foundry/package.json
#	extensions/microsoft/package.json
#	extensions/migrate-claude/package.json
#	extensions/migrate-hermes/package.json
#	extensions/minimax/package.json
#	extensions/mistral/package.json
#	extensions/moonshot/package.json
#	extensions/msteams/package.json
#	extensions/nextcloud-talk/package.json
#	extensions/nostr/package.json
#	extensions/nvidia/package.json
#	extensions/oc-path/package.json
#	extensions/ollama/package.json
#	extensions/open-prose/package.json
#	extensions/openai/package.json
#	extensions/opencode-go/package.json
#	extensions/opencode/package.json
#	extensions/openrouter/package.json
#	extensions/openshell/package.json
#	extensions/perplexity/package.json
#	extensions/qa-channel/package.json
#	extensions/qa-lab/package.json
#	extensions/qa-matrix/package.json
#	extensions/qianfan/package.json
#	extensions/qqbot/package.json
#	extensions/qqbot/src/bridge/tools/remind.test.ts
#	extensions/qqbot/src/engine/gateway/outbound-dispatch.test.ts
#	extensions/qwen/package.json
#	extensions/runway/package.json
#	extensions/searxng/package.json
#	extensions/senseaudio/package.json
#	extensions/sglang/package.json
#	extensions/signal/package.json
#	extensions/skill-workshop/package.json
#	extensions/slack/package.json
#	extensions/speech-core/package.json
#	extensions/stepfun/package.json
#	extensions/synology-chat/package.json
#	extensions/synthetic/package.json
#	extensions/tavily/package.json
#	extensions/tavily/src/tavily-tools.test.ts
#	extensions/telegram/package.json
#	extensions/tencent/package.json
#	extensions/tlon/package.json
#	extensions/together/package.json
#	extensions/tokenjuice/package.json
#	extensions/tts-local-cli/package.json
#	extensions/twitch/package.json
#	extensions/venice/package.json
#	extensions/vercel-ai-gateway/package.json
#	extensions/video-generation-core/package.json
#	extensions/vllm/package.json
#	extensions/voice-call/package.json
#	extensions/volcengine/package.json
#	extensions/voyage/package.json
#	extensions/vydra/package.json
#	extensions/web-readability/package.json
#	extensions/webhooks/package.json
#	extensions/whatsapp/package.json
#	extensions/xai/package.json
#	extensions/xiaomi/package.json
#	extensions/zai/package.json
#	extensions/zalo/package.json
#	extensions/zalouser/package.json
#	package.json
#	pnpm-lock.yaml
#	src/agents/provider-transport-fetch.test.ts
#	src/config/bundled-channel-config-metadata.generated.ts
@socket-security


Review the following changes in direct dependencies. Learn more about Socket for GitHub.

Diff   Package            Supply Chain Security  Vulnerability  Quality  Maintenance  License
Added  npm/jszip@3.10.1   99                     100            84       80           70
Added  npm/ws@8.20.0      98                     100            100      92           100
Added  npm/zod@4.4.3      100                    100            100      95           100
Added  npm/kysely@0.29.0  98                     100            100      96           100

@vincentkoc

Member

Landed the current-main-compatible heartbeat capture slice from this PR on main.

Commit: 55edadf

What landed:

  • Runtime parity capture now scans newest root-session transcripts and skips heartbeat-only operational transcripts before selecting the scenario reply.
  • Covered heartbeat shapes include legacy HEARTBEAT_OK, [OpenClaw heartbeat poll], heartbeat_respond, due-task heartbeats that run ordinary tools first, and user-role tool_result rows.
  • Added the changelog entry crediting @100yenadmin.
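
The capture rule in the first bullet (newest transcripts first, heartbeat-only ones skipped) might look like the sketch below. The `Transcript` shape and both function names are hypothetical, and only three of the listed heartbeat shapes are modeled as plain substring markers; due-task heartbeats and user-role tool_result rows are not handled here.

```typescript
// Hypothetical transcript shape for illustration only.
interface Transcript {
  updatedAtMs: number;
  messages: string[];
}

// Marker strings taken from the heartbeat shapes named above.
const HEARTBEAT_MARKERS = ["HEARTBEAT_OK", "[OpenClaw heartbeat poll]", "heartbeat_respond"];

function isHeartbeatOnly(transcript: Transcript): boolean {
  return (
    transcript.messages.length > 0 &&
    transcript.messages.every((message) =>
      HEARTBEAT_MARKERS.some((marker) => message.includes(marker)),
    )
  );
}

function pickScenarioTranscript(transcripts: Transcript[]): Transcript | undefined {
  // Copy before sorting so the caller's array is not reordered.
  return [...transcripts]
    .sort((a, b) => b.updatedAtMs - a.updatedAtMs)
    .find((transcript) => !isHeartbeatOnly(transcript));
}

const picked = pickScenarioTranscript([
  { updatedAtMs: 300, messages: ["HEARTBEAT_OK"] },
  { updatedAtMs: 200, messages: ["user: summarize the repo", "assistant: Here is the summary."] },
  { updatedAtMs: 100, messages: ["assistant: older reply"] },
]);
console.log(picked?.updatedAtMs); // 200
```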

Proof:

  • Codex review: codex review --uncommitted final pass reported no actionable correctness issues.
  • Local after final rebase: git diff --check HEAD~1..HEAD; node scripts/run-vitest.mjs extensions/qa-lab/src/runtime-parity.test.ts (16 tests passed).
  • Testbox pre-push changed gate: tbx_01krs7152m0yjpbfmkc86yz4s5, run https://github.com/openclaw/openclaw/actions/runs/25971893086, focused runtime parity test plus pnpm check:changed exited 0.
  • Exact landed-SHA Testbox: tbx_01krs7cs45aws71c5yk14bfj7r, run https://github.com/openclaw/openclaw/actions/runs/25972036173, confirmed HEAD=55edadf86fdf0b3238137b0f7a10a73ded8352ae, focused runtime parity test passed.

I am not closing this PR wholesale: the remaining phase 2-5 branch delta is still too broad/stale against current main and needs more current-main slices rather than a direct merge.

@vincentkoc

Member

Landed a current-main slice from this PR:
d801d27

Scope landed:

  • Added QA-Lab gateway log sentinels for plugin hook failures, plugin contract errors, Codex app-server stalls/timeouts, stalled agent runs, cron allowlist drift, live quota/subscription blockers, and direct-reply self-message transcripts.
  • Wired sentinel findings into runtime parity cell capture/reporting so self-health regressions become hard QA-Lab failures instead of only unit-tested helpers.
  • Added changelog credit for this slice.

Verification:

  • Local focused tests: node scripts/run-vitest.mjs extensions/qa-lab/src/gateway-log-sentinel.test.ts extensions/qa-lab/src/runtime-parity.test.ts passed, 2 files / 23 tests.
  • Local formatting/diff hygiene: oxfmt --check on touched QA-Lab files passed; git diff --check --cached passed before commit.
  • Testbox focused proof: tbx_01krsw71w04w0efg7nxvs388w7, Actions run https://github.com/openclaw/openclaw/actions/runs/25979142410, focused QA-Lab tests passed, 2 files / 23 tests.

pnpm check:changed --base HEAD^ --head HEAD was attempted after push but is not claimed green. Remote gate attempts were blocked by runner/provider issues:

  • tbx_01krsx6stvtvjfd4y7mx8s5z3r: discarded; cleanup removed Testbox deps before command.
  • tbx_01krsxbbahpf0dra97mm2nvm92: Blacksmith status API TLS handshake timeout before user command.
  • tbx_01krsxn9r1qsymgb8xcsgw6v60: Blacksmith warmup API deadline exceeded.
  • cbx_174304a42148 / run_103bffbe0f42: AWS Crabbox synced, then SSH dropped before command execution.
  • Reuse of tbx_01krsw71w04w0efg7nxvs388w7: SSH/rsync refused, likely expired/unreachable.

Leaving this PR open. The remaining phase 2-5 branch delta is still broad/stale and should continue landing as smaller current-main slices with separate proof.

@vincentkoc

Member

Landed another current-main slice from this PR:
e66a6c8

Scope landed:

  • Added runtime-first-hour-20-turn and runtime-soak-100-turn scenario-pack entries.
  • Added runtimeParityTier metadata parsing for standard/optional/live-only/soak lanes.
  • Documented runtime parity tier metadata in the scenario-pack index.
  • Added an Unreleased changelog entry with PR credit.
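
Parsing the `runtimeParityTier` metadata could be as small as the sketch below. The four tier names come from the bullet above; the function name, the fallback to `standard` when the field is missing, and the error text are assumptions rather than the landed implementation.

```typescript
const RUNTIME_PARITY_TIERS = ["standard", "optional", "live-only", "soak"] as const;
type RuntimeParityTier = (typeof RUNTIME_PARITY_TIERS)[number];

function parseRuntimeParityTier(value: unknown): RuntimeParityTier {
  // Assumption: scenarios that omit the field are treated as "standard".
  if (value === undefined) return "standard";
  const match = RUNTIME_PARITY_TIERS.find((tier) => tier === value);
  if (match === undefined) {
    throw new Error(`invalid runtimeParityTier: ${String(value)}`);
  }
  return match;
}

console.log(parseRuntimeParityTier("live-only")); // "live-only"
console.log(parseRuntimeParityTier(undefined)); // "standard"
```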

Verification:

  • node scripts/run-vitest.mjs extensions/qa-lab/src/scenario-catalog.test.ts extensions/qa-lab/src/scenario-packs.test.ts passed, 2 files / 22 tests.
  • oxfmt --check on touched TS files passed.
  • git diff --check --cached passed before commit.
  • Verified new scenario docsRefs and codeRefs point at current-main paths after the stale PR refs were corrected.

Review note:

  • Codex review started, but wandered into broad repo exploration. The actionable issue it surfaced was a stale docs/code ref from the PR branch; that was fixed before landing and retested.

Leaving this PR open for the remaining runtime-parity harness/reporting/tool-matrix slices.

@vincentkoc

Member

Landed another current-main slice from this PR:
826c2f4

Scope landed:

  • Added codex-pi-shaped-read-vocabulary, a live-only runtime parity canary for Codex-native workspace reads when prompts use legacy Pi-shaped Read tool wording.
  • Added catalog assertions for the scenario, marker, unavailable-tool needles, and runtimeParityTier.
  • Added an Unreleased changelog entry with PR credit.

Verification:

  • node scripts/run-vitest.mjs extensions/qa-lab/src/scenario-catalog.test.ts extensions/qa-lab/src/scenario-packs.test.ts passed, 2 files / 23 tests.
  • oxfmt --check on the touched TS test file passed.
  • git diff --check --cached passed before commit.
  • Verified docs/code refs point at current-main files.

Leaving this PR open for the remaining harness/report/tool-matrix slices.

@vincentkoc

Member

Landed another scoped slice from this PR on main:

What landed:

  • Added live-only harness self-health scenarios for plugin hook crash sentinels, plugin manifest contracts.tools diagnostics, and WebChat direct-reply self-message routing.
  • Exposed gateway-log sentinel helpers and session transcript summaries to QA flow scenarios.
  • Widened the manifest-contract sentinel to catch the actual runtime diagnostic shape: plugin must declare contracts.tools for: ....
  • Added the matching changelog entry crediting @100yenadmin.
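
A log sentinel for the manifest-contract diagnostic can be sketched as a line scan. The matched text follows the diagnostic shape quoted above; the `SentinelFinding` type, the `scanGatewayLog` name, the finding label, and the sample log lines are all made up for illustration and are not the QA-Lab helper API.

```typescript
interface SentinelFinding {
  kind: string;
  line: string;
}

// Matches the diagnostic shape quoted in the comment above.
const MANIFEST_CONTRACT_PATTERN = /plugin must declare contracts\.tools for:/;

function scanGatewayLog(log: string): SentinelFinding[] {
  return log
    .split(/\r?\n/)
    .filter((line) => MANIFEST_CONTRACT_PATTERN.test(line))
    .map((line) => ({ kind: "plugin-manifest-contract", line }));
}

// Invented sample log; real gateway log lines will look different.
const sampleLog = [
  "info gateway started",
  "warn plugin must declare contracts.tools for: example_tool",
].join("\n");

console.log(scanGatewayLog(sampleLog).length); // 1
```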

Verification:

  • node scripts/run-vitest.mjs extensions/qa-lab/src/gateway-log-sentinel.test.ts extensions/qa-lab/src/suite-runtime-agent-session.test.ts extensions/qa-lab/src/scenario-runtime-api.test.ts extensions/qa-lab/src/suite-runtime-flow.test.ts extensions/qa-lab/src/scenario-catalog.test.ts extensions/qa-lab/src/scenario-packs.test.ts passed: 6 files / 39 tests.
  • git diff --check HEAD~1..HEAD passed after rebase.
  • Blacksmith Testbox pnpm check:changed passed on tbx_01krtg3agjpbvm4813t2cr7xk9; Actions run: https://github.com/openclaw/openclaw/actions/runs/25985619752

PR remains open for the remaining QA-Lab coverage slices.

@vincentkoc

Member

Landed another current-main slice from this PR:
d217fd7

Scope landed:

  • Added 20 runtime tool fixture scenarios under qa/scenarios/runtime/tools/ covering Codex-native workspace tools, OpenClaw dynamic tools, and optional plugin-backed tools.
  • Added runtime tool fixture execution and coverage-report helpers for happy/failure-path mock planning, known harness gaps, optional/profile rows, and Codex-native workspace report-only rows.
  • Wired runRuntimeToolFixture into QA-Lab markdown-flow runtime APIs and expanded mock OpenAI tool-search planning for explicit fixture targets and denied-input failure probes.
  • Added the Unreleased changelog entry crediting @100yenadmin.

Companion baseline fix landed:
9ca98a6

That second commit fixes a current-main type/schema drift where ModelCompatSchema accepted thinkingFormat: "together" but ModelCompatConfig did not, which blocked the final changed gate after main moved.

Verification:

  • Testbox tbx_01krtj5654stgcbg8e39300mqg, Actions run https://github.com/openclaw/openclaw/actions/runs/25986354006, exited 0.
  • In that Testbox run, focused QA-Lab tests passed: runtime-tool-fixture.test.ts, tool-coverage-report.test.ts, scenario-runtime-api.test.ts, suite-runtime-flow.test.ts, providers/mock-openai/server.test.ts, and scenario-catalog.test.ts => 6 files / 102 tests.
  • The same run then executed pnpm check:changed; because full sync widened the detected surface, it ran lanes=all, including all typecheck, oxlint, and runtime import-cycle checks, and exited 0.
  • Local narrow check while debugging the moving-main baseline: OPENCLAW_LOCAL_CHECK_MODE=throttled node scripts/run-tsgo.mjs -p tsconfig.extensions.json --incremental --tsBuildInfoFile .artifacts/tsgo-cache/extensions.tsbuildinfo passed after the thinkingFormat type fix.

Post-push exact SHA state for 9ca98a6d399b96a7adecad23693fe1585a40a222 when checked: Workflow Sanity and ClawSweeper Dispatch were green; CI/Docs/Plugin NPM Release were still in progress or pending.

Leaving this PR open for the remaining broad runtime-parity harness/reporting deltas.

@100yenadmin

100yenadmin commented May 17, 2026

Contributor Author

Feel free to wrap up and take over from here. It takes a lot of inference to get right and keep updated with the betas. It works well: I built it over 3 days and ran it manually and with API keys to find the bugs I found last week, and I'm sure more will pop up as we keep the betas rolling 🫡 @vincentkoc

@vincentkoc

Member

Follow-up after the post-landing deadcode failure:

  • 1f9d8c1e9d559e2cc6e1d38de00953bc057b9e94 wires the runtime tool coverage report into openclaw qa coverage --tools [--summary], so extensions/qa-lab/src/tool-coverage-report.ts is now production-reachable instead of test-only.
  • 2c9f68f42b1f41d9a6d9ef140ca141954e88693c backfills the missing changelog entry for that QA-Lab operator surface.

Verification:

  • Exact SHA 1f9d8c1e9d559e2cc6e1d38de00953bc057b9e94: CI passed, including CI run https://github.com/openclaw/openclaw/actions/runs/25986824400 plus Workflow Sanity and Plugin NPM Release.
  • Exact SHA 2c9f68f42b1f41d9a6d9ef140ca141954e88693c: changelog-only follow-up; Workflow Sanity, Docs, and ClawSweeper Dispatch passed.

@vincentkoc

Member

Slice landed on main in 58e1351.

This pulls the #80339 hard-gate piece out of the broad PR: required OpenClaw dynamic direct runtime-tool rows are now evaluated by a blocking release-check verifier, while the Codex-native/searchable fidelity work remains tracked under #80319 and adjacent issues.
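The blocking behavior described above can be sketched as follows. This is an illustration under assumptions, not the landed verifier: the `ToolRow` shape, the `verifyRequiredToolRows` function, and any tool names passed to it are hypothetical.

```typescript
type RowStatus = "pass" | "fail" | "skipped";

// Hypothetical row shape for one runtime-tool coverage entry.
interface ToolRow {
  tool: string;
  status: RowStatus;
}

interface GateResult {
  ok: boolean;
  blockers: string[];
}

// A required tool blocks the gate when its row is missing or is anything
// other than "pass"; rows for non-required tools are ignored.
function verifyRequiredToolRows(rows: ToolRow[], requiredTools: string[]): GateResult {
  const byTool = new Map(rows.map((row) => [row.tool, row] as const));
  const blockers: string[] = [];
  for (const tool of requiredTools) {
    const row = byTool.get(tool);
    if (row === undefined) {
      blockers.push(`${tool}: missing row`);
    } else if (row.status !== "pass") {
      blockers.push(`${tool}: ${row.status}`);
    }
  }
  return { ok: blockers.length === 0, blockers };
}
```

Treating a skipped required row as a blocker is a design choice in this sketch; it keeps a silently skipped lane from reading as a pass.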

Proof summary: local focused QA-Lab/runtime tests passed (7 files / 181 tests), autoreview clean, Testbox-through-Crabbox pnpm check:changed passed on tbx_01krwcck260em48avwzfs7kf65, and exact-SHA CI/Workflow Sanity/Docs are green for 58e13518633f6df8fe6be304a95eef3ab485bebc.

@vincentkoc

Member

Narrow token-efficiency slice related to #81093 has landed on main in 1300b22.

That commit keeps the main-compatible part from this area small:

  • adds qa parity-report --runtime-axis --token-efficiency reports and JSON summaries
  • classifies Codex savings separately from regressions
  • fails only positive Codex-over-Pi live token deltas above threshold
  • wires the source fallback wrapper and changelog
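The classification rule in the list above can be sketched roughly as below. The `TokenRow` shape, the function names, and the use of a relative threshold are assumptions made for illustration; only the `mock-estimate` and `live-usage` labels come from the PR description.

```typescript
type TokenVerdict = "savings" | "within-threshold" | "regression";

// Hypothetical row shape; the source labels match the PR description.
interface TokenRow {
  scenario: string;
  source: "mock-estimate" | "live-usage";
  piTokens: number;
  codexTokens: number;
}

// Codex using fewer tokens than Pi is reported as savings, never as a failure.
// A positive delta only counts as a regression above the relative threshold.
function classifyTokenDelta(row: TokenRow, thresholdRatio: number): TokenVerdict {
  const delta = row.codexTokens - row.piTokens;
  if (delta < 0) {
    return "savings";
  }
  const ratio = row.piTokens > 0 ? delta / row.piTokens : delta > 0 ? Infinity : 0;
  return ratio > thresholdRatio ? "regression" : "within-threshold";
}

// Only live-usage rows can fail the gate; mock estimates are informational.
function tokenGateFails(rows: TokenRow[], thresholdRatio: number): boolean {
  return rows.some(
    (row) => row.source === "live-usage" && classifyTokenDelta(row, thresholdRatio) === "regression",
  );
}
```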

Proof is on the issue closeout: #81093 (comment).

This PR is still open and currently reports CONFLICTING at head 7755e898bc3e1696032f7adc7561a94f02e68778, so any rebase should drop/adjust the token-efficiency semantics already on main and keep only the broader runtime harness pieces that are still distinct.

@vincentkoc

Member

Phase 4 token-efficiency report and scheduled artifact slice has now landed on main:

For any rebase of this PR, drop or reconcile the Phase 4/token-efficiency pieces against main; the remaining useful scope is the distinct runtime harness/proof work not covered by those two commits.

@vincentkoc

Member

Landed the scoped JSONL replay slice on main in cf06578.

What landed:

  • qa jsonl-replay CLI wiring for curated mock JSONL transcript replay.
  • Seven synthetic replay fixtures under qa/scenarios/jsonl-replay/.
  • First-drift reporting through the existing runtime-parity drift classes.
  • Changelog credit for @100yenadmin.
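First-drift reporting over two JSONL transcripts can be illustrated with a small sketch. This is not the landed `jsonl-replay.ts` implementation: `parseJsonl`, `firstDrift`, and the `Drift` shape are hypothetical, and the sketch reports only the first differing line rather than mapping it to a runtime-parity drift class.

```typescript
// Parse JSONL text into one value per non-empty line.
function parseJsonl(text: string): unknown[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line));
}

interface Drift {
  index: number;
  expected: unknown;
  actual: unknown;
}

// Walk both transcripts in order and stop at the first entry that differs.
// Comparison uses JSON.stringify, so it is sensitive to object key order;
// a missing entry on either side also counts as drift.
function firstDrift(expected: unknown[], actual: unknown[]): Drift | null {
  const length = Math.max(expected.length, actual.length);
  for (let index = 0; index < length; index++) {
    if (JSON.stringify(expected[index]) !== JSON.stringify(actual[index])) {
      return {
        index,
        expected: expected[index] ?? null,
        actual: actual[index] ?? null,
      };
    }
  }
  return null;
}
```

Stopping at the first difference keeps the report focused on the earliest point of divergence, since later differences are often consequences of it.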

Verification:

  • node_modules/.bin/oxfmt --check CHANGELOG.md extensions/qa-lab/src/cli.ts extensions/qa-lab/src/cli.runtime.ts extensions/qa-lab/src/cli.test.ts extensions/qa-lab/src/cli.runtime.test.ts extensions/qa-lab/src/jsonl-replay.ts extensions/qa-lab/src/jsonl-replay.test.ts
  • git diff --check origin/main...HEAD
  • node scripts/run-vitest.mjs extensions/qa-lab/src/jsonl-replay.test.ts extensions/qa-lab/src/cli.runtime.test.ts extensions/qa-lab/src/cli.test.ts -> 3 files, 89 tests passed.

Known proof gap: Blacksmith Testbox direct run failed before command execution with rsync: unexpected end of file; local pnpm check:changed was aborted because pnpm tried to reconcile the shared node_modules from the Codex worktree. Post-push CI is running for the landed SHA: https://github.com/openclaw/openclaw/actions/runs/26234573078.

I am leaving this PR open for the remaining real runtime-cell replay work unless a maintainer wants this broad branch closed as superseded by landed slices.

@clawsweeper

clawsweeper Bot commented May 23, 2026

Contributor

ClawSweeper PR egg

🎁 Pass real behavior proof to wake the egg and unlock a hatchable treat.

Where did the egg go?
  • The egg game starts only after the PR passes the real-behavior proof check.
  • Before that, no creature or rarity is rolled. The treat waits for real proof.
  • This is still just collectible flavor: proof affects review readiness, not creature quality.

@vincentkoc

Member

Landed the confidence-report slice from this PR on main in f6a49a4.

Credited slice: @100yenadmin's QA-Lab confidence/reporting work from #80323. I kept this PR open because the branch still contains broader runtime-parity work beyond the slice that is now on main.

Behavior addressed: QA-Lab now has qa confidence-report and qa confidence-self-test, a codex-100 confidence profile, harness-parity helpers, production prompt/tool content hashes for Pi and Codex report producers, and stricter artifact classification so empty, malformed, partial, skipped, inconsistent, or privacy-leaky proof artifacts do not false-green the confidence gate.

Real environment tested: Blacksmith Testbox through Crabbox, provider blacksmith-testbox, id tbx_01ksgef032e049f2zqjstdwybe, Actions run https://github.com/openclaw/openclaw/actions/runs/26419284334.

Exact steps or command run after this patch:

  • Local: node scripts/run-vitest.mjs extensions/qa-lab/src/confidence-report.test.ts extensions/qa-lab/src/harness-parity.test.ts extensions/qa-lab/src/cli.runtime.test.ts src/agents/system-prompt-report.test.ts extensions/codex/src/app-server/run-attempt.test.ts
  • Local: node_modules/.bin/oxfmt --check ...
  • Local: git diff --check
  • Testbox: pnpm test extensions/qa-lab/src/confidence-report.test.ts extensions/qa-lab/src/harness-parity.test.ts extensions/qa-lab/src/cli.runtime.test.ts src/agents/system-prompt-report.test.ts extensions/codex/src/app-server/run-attempt.test.ts
  • Testbox: pnpm openclaw qa confidence-self-test --output-dir .artifacts/qa-e2e/confidence-self-test-testbox
  • Testbox: pnpm check:changed

Evidence after fix: Testbox focused tests passed 5 files / 322 tests, confidence self-test verdict was pass, and pnpm check:changed exited 0.

Observed result after fix: the confidence gate now rejects missing proof rows, malformed suite summaries, count/scenario mismatches, missing JSONL drift data, empty self-test canaries, optional-lane blockers, skipped rows without backfill, and absolute local artifact paths in persisted summaries.
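A few of the rejection rules above (missing proof rows, count/scenario mismatches, absolute local artifact paths) can be sketched as a small validator. The summary shape here is an assumption for illustration: only `counts.skipped` is mentioned in the PR description, and `SuiteSummarySketch`, `findSummaryProblems`, and `isAbsolutePath` are hypothetical names.

```typescript
type ScenarioStatus = "passed" | "failed" | "skipped";

// Hypothetical summary shape; the real qa-suite-summary.json may differ.
interface SuiteSummarySketch {
  counts: Record<ScenarioStatus, number>;
  scenarios: { id: string; status: ScenarioStatus; artifactPath?: string }[];
}

// Treat POSIX-rooted and Windows drive-letter paths as absolute.
function isAbsolutePath(path: string): boolean {
  return path.startsWith("/") || /^[A-Za-z]:[\\/]/.test(path);
}

// Return every problem found; an empty array means the summary may count as proof.
function findSummaryProblems(summary: SuiteSummarySketch): string[] {
  const problems: string[] = [];
  if (summary.scenarios.length === 0) {
    problems.push("no proof rows");
  }
  for (const status of ["passed", "failed", "skipped"] as const) {
    const actual = summary.scenarios.filter((row) => row.status === status).length;
    if (actual !== summary.counts[status]) {
      problems.push(`count mismatch for ${status}: declared ${summary.counts[status]}, found ${actual}`);
    }
  }
  for (const row of summary.scenarios) {
    if (row.artifactPath !== undefined && isAbsolutePath(row.artifactPath)) {
      problems.push(`absolute artifact path in ${row.id}`);
    }
  }
  return problems;
}
```

Collecting all problems instead of stopping at the first makes a single report useful for fixing an artifact in one pass.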

What was not tested: true live-provider QA confidence manifests; this slice only adds the confidence gate/self-test plumbing and mock/proof validation path.

@openclaw-barnacle


This assigned pull request has been automatically marked as stale after being open for 27 days.
Please add updates or it will be closed.


Labels

  • agents (Agent runtime and tooling)
  • channel: qqbot
  • channel: telegram (Channel integration: telegram)
  • channel: whatsapp-web (Channel integration: whatsapp-web)
  • cli (CLI command changes)
  • extensions: codex
  • extensions: lmstudio
  • extensions: minimax
  • extensions: qa-lab
  • feature: ✨ showcase (ClawSweeper spotlight: unusually compelling feature idea for maintainer attention.)
  • merge-risk: 🚨 automation (🚨 May affect CI, automerge, proof capture, label sync, or maintainer automation.)
  • P2 (Normal backlog priority with limited blast radius.)
  • proof: supplied (External PR includes structured after-fix real behavior proof.)
  • rating: 🦪 silver shellfish (Thin PR readiness signal; proof, validation, or implementation needs work.)
  • scripts (Repository scripts)
  • size: XL
  • stale (Marked as stale due to inactivity)
  • status: 📣 needs proof (The PR needs real behavior proof before ClawSweeper can clear the contributor ask.)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants