Tracking parent: #80171
Depends on: Phase 1 #80172
Goal
Build a deterministic per-tool fixture set so the runtime-parity harness can surface "tool X breaks under codex" at tool granularity, not just at session level. This is the deliverable Eva called out: "test all tools and long runs in harness to get to 100% parity and use to debug all the edge cases."
Scope
One fixture per tool family. Each fixture is deterministic: the prompt forces exactly one tool call with predictable arguments. The harness asserts that the tool was invoked, that it completed, and that the result shape matches between runtimes.
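The three assertions above could be sketched roughly as follows. This is a minimal illustration, not the harness's actual API: `ToolCallRecord`, `shapeOf`, and `compareToolCall` are hypothetical names, and only the `tool-result-shape` label is taken from the coverage table in this issue; the other drift labels are assumptions.

```typescript
// Hypothetical record of one tool call as observed by the harness on one runtime.
interface ToolCallRecord {
  tool: string;
  invoked: boolean;
  completed: boolean;
  result: unknown;
}

// Reduce a value to its structural shape: key names and primitive type names,
// with the concrete values discarded. Keys are sorted so ordering does not matter.
function shapeOf(value: unknown): unknown {
  if (value === null) return "null";
  if (Array.isArray(value)) return value.map(shapeOf);
  if (typeof value === "object") {
    const record = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(record).sort()) {
      out[key] = shapeOf(record[key]);
    }
    return out;
  }
  return typeof value;
}

// Return the list of drift labels for one tool call; an empty list means parity.
function compareToolCall(pi: ToolCallRecord, codex: ToolCallRecord): string[] {
  const drift: string[] = [];
  if (!pi.invoked || !codex.invoked) drift.push("tool-not-invoked");
  if (!pi.completed || !codex.completed) drift.push("tool-not-completed");
  if (JSON.stringify(shapeOf(pi.result)) !== JSON.stringify(shapeOf(codex.result))) {
    drift.push("tool-result-shape");
  }
  return drift;
}
```

Comparing shapes rather than values tolerates harmless differences in output text between runtimes while still flagging a renamed or missing field. In this sketch arrays are compared element by element, so arrays of different lengths count as drift; that seems acceptable for deterministic fixtures but is a design choice the real harness may make differently.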
Tool families to cover
(Source: `src/agents/pi-tools.create-openclaw-coding-tools.ts` and the Codex harness contract; finalise the list in the PR by reading both surfaces.)
- `bash`: `bash echo hello`
- `exec`: approval-required `exec "ls -la /tmp"` flow
- `fs.read`, `fs.write`, `fs.list`: read/write/list a temp file
- `grep`: grep for a literal in a fixture file
- `edit` / `apply-patch`: apply a small unified diff
- `web_search`: search for a fixed query (mock provider returns fixed results)
- `web_fetch`: fetch a fixed URL (mock provider returns fixed body)
- `tavily_search`, `tavily_extract`
- `image_generate`: generate against the qa-lab mock image provider
- `tts`: synth a fixed phrase against the mock TTS provider
- `message-tool`: `message-tool send` to a mock channel; media variant
- `session_status`, `sessions_spawn`
- `memory.recall`, `memory.add` (if pi-only, mark as expected drift with a `known-broken` marker)
- `skill_*` invocations

For each tool family, also add one fixture for a failure mode (denied input, oversized payload, etc.) so that error-path drift is captured.
Concrete deliverables
Fixtures
- `qa/scenarios/runtime/tools/<tool>.md`: one file per family. Reuse the existing scenario format already used by `approval-turn-tool-followthrough.md`.
- Each fixture exports both a happy-path and a failure-path scenario.
Code
- Extend `extensions/qa-lab/src/runtime-parity.ts` (from Phase 1): add a `toolBreakdown` field to the report so per-tool drift surfaces alongside per-scenario drift.
- New `extensions/qa-lab/src/tool-coverage-report.ts`: generates a Markdown coverage table:
| tool | pi | codex | drift | tracking |
|------|----|-------|-------|----------|
| bash | ✅ | ✅ | none | |
| exec | ✅ | ❌ | tool-result-shape | #issue |
- Extend `extensions/qa-lab/src/cli.ts`: new `qa tool-coverage --runtime-pair pi,codex` command.
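As a rough illustration of how a per-tool breakdown might feed the coverage table shown in the list above, here is a minimal sketch. The `ToolBreakdownRow` shape and the `renderCoverageTable` function are assumptions made for this example; the real `toolBreakdown` field and report module may look different.

```typescript
// Hypothetical shape of one entry in the report's toolBreakdown field.
interface ToolBreakdownRow {
  tool: string;
  pi: boolean; // true if the fixture passed on the pi runtime
  codex: boolean; // true if the fixture passed on the codex runtime
  drift: string | null; // drift label, or null when the runtimes agree
  tracking: string | null; // tracking issue reference, or null when none is filed
}

// Render the breakdown as a Markdown table with the column layout from this issue.
function renderCoverageTable(rows: ToolBreakdownRow[]): string {
  const mark = (ok: boolean): string => (ok ? "✅" : "❌");
  const lines: string[] = [
    "| tool | pi | codex | drift | tracking |",
    "|------|----|-------|-------|----------|",
  ];
  for (const row of rows) {
    const cells = [
      row.tool,
      mark(row.pi),
      mark(row.codex),
      row.drift ?? "none",
      row.tracking ?? "",
    ];
    lines.push(`| ${cells.join(" | ")} |`);
  }
  return lines.join("\n");
}
```

Keeping the renderer a pure function from rows to a string would make the coverage report rendering test a plain string comparison with no harness dependency.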
Tests
- Each fixture has a self-test that runs it through the mock provider on both runtimes. The self-test has no qa-lab harness dependency, which keeps fixtures portable.
- Coverage report rendering test.
Out of scope
- Plugin-lifecycle stress (Phase 3).
- Token efficiency (Phase 4).
- Live-mode runs — fixtures must be hermetic in this PR.
Acceptance criteria
- Every tool family listed above has a `qa/scenarios/runtime/tools/<tool>.md` fixture.
- Every fixture passes under `--runtime-pair pi,codex` against current main, OR is annotated with a `known-broken` marker pointing at a tracking issue (file the tracking issue as part of this PR if discovered).
- `pnpm openclaw qa tool-coverage --runtime-pair pi,codex` produces a Markdown table suitable for the README of the harness.
- `pnpm check:test-types` and `pnpm exec oxlint` are clean.
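The "passes or is annotated as known-broken" rule could be checked mechanically. The sketch below assumes a hypothetical `FixtureOutcome` record with an optional `knownBrokenIssue` field; how the marker is actually stored in a scenario file is not specified in this issue.

```typescript
// Hypothetical outcome of running one fixture under --runtime-pair pi,codex.
interface FixtureOutcome {
  tool: string;
  passed: boolean;
  // Tracking issue reference taken from a known-broken marker, if one is present.
  knownBrokenIssue?: string;
}

// Return the tools that fail without a known-broken marker pointing at an issue.
// An empty result means the per-fixture acceptance criterion is met.
function unmetFixtures(outcomes: FixtureOutcome[]): string[] {
  return outcomes
    .filter((outcome) => {
      const issue = outcome.knownBrokenIssue;
      const hasMarker = issue !== undefined && issue.trim().length > 0;
      return !outcome.passed && !hasMarker;
    })
    .map((outcome) => outcome.tool);
}
```

A marker with an empty issue reference is treated as missing here, so a failing fixture cannot be waved through without a tracking issue.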
References
- `qa/scenarios/runtime/approval-turn-tool-followthrough.md`
- `src/agents/pi-tools.create-openclaw-coding-tools.ts`
- `extensions/codex/src/`