[Codex×Pi parity Phase 2] Per-tool fixture set #80173

@100yenadmin

Tracking parent: #80171
Depends on: Phase 1 #80172

Goal

Build a deterministic per-tool fixture set so the runtime-parity harness can surface "tool X breaks under codex" at per-tool granularity, not just at the session level. This is the deliverable Eva called out: "test all tools and long runs in harness to get to 100% parity and use to debug all the edge cases."

Scope

One fixture per tool family. Each fixture is deterministic: the prompt forces exactly one tool call with predictable arguments. The harness asserts that the tool was invoked, that it completed, and that the result shape matches between runtimes.
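A minimal sketch of that three-part assertion, assuming a hypothetical `ToolCallRecord` captured per runtime (the real record type in the harness is not specified in this issue). Results are reduced to a structural "shape" before comparison so that differing values do not register as drift. The drift label `tool-result-shape` comes from the coverage table in this issue; the other labels are illustrative.

```typescript
// Hypothetical record of a single tool call as observed on one runtime.
interface ToolCallRecord {
  tool: string;
  invoked: boolean;
  completed: boolean;
  result: unknown;
}

// Reduce a value to its structure: object keys (sorted) and primitive type names.
// Arrays are compared element by element, so a length difference counts as drift.
function shapeOf(value: unknown): unknown {
  if (value === null) return "null";
  if (Array.isArray(value)) return value.map(shapeOf);
  if (typeof value === "object") {
    const source = value as Record<string, unknown>;
    const out: Record<string, unknown> = {};
    for (const key of Object.keys(source).sort()) {
      out[key] = shapeOf(source[key]);
    }
    return out;
  }
  return typeof value;
}

// Returns a list of drift labels; an empty list means the two runtimes agree.
function compareToolCall(pi: ToolCallRecord, codex: ToolCallRecord): string[] {
  const drift: string[] = [];
  for (const [runtime, rec] of [["pi", pi], ["codex", codex]] as const) {
    if (!rec.invoked) drift.push(`${runtime}:not-invoked`);
    else if (!rec.completed) drift.push(`${runtime}:not-completed`);
  }
  const piShape = JSON.stringify(shapeOf(pi.result));
  const codexShape = JSON.stringify(shapeOf(codex.result));
  if (piShape !== codexShape) drift.push("tool-result-shape");
  return drift;
}
```

Comparing shapes rather than values is a design choice in this sketch: it tolerates runtime-specific output such as timestamps, at the cost of missing drift that only changes a value.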

Tool families to cover

(Source: src/agents/pi-tools.create-openclaw-coding-tools.ts and Codex harness contract — finalise the list in the PR by reading both surfaces.)

  • bash — bash echo hello
  • exec — approval-required exec "ls -la /tmp" flow
  • fs.read, fs.write, fs.list — read/write/list a temp file
  • grep — grep for a literal in a fixture file
  • edit / apply-patch — apply a small unified diff
  • web_search — search for a fixed query (mock provider returns fixed results)
  • web_fetch — fetch a fixed URL (mock provider returns fixed body)
  • tavily_search, tavily_extract
  • image_generate — generate against the qa-lab mock image provider
  • tts — synth a fixed phrase against the mock TTS provider
  • message-tool — message-tool send to a mock channel; media variant
  • session_status, sessions_spawn
  • memory.recall, memory.add (if pi-only, mark as expected drift with a known-broken marker)
  • skill_* invocations

For each tool family, also add one fixture for the failure mode (denied input, oversized payload, etc.) so error-path drift is captured.
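To illustrate how a mock provider can serve both the happy path and the failure modes deterministically, here is a rough sketch for a web_fetch-style tool. The names `createMockFetch` and `FetchResult`, the error strings, and the size limit are all hypothetical and are not taken from the qa-lab mock providers.

```typescript
// Hypothetical result shape for a mocked fetch tool.
interface FetchResult {
  ok: boolean;
  body?: string;
  error?: string;
}

// Build a hermetic fetch function backed by a fixed URL -> body map.
// Unknown URLs are denied instead of reaching the network, and bodies over
// maxChars are rejected, giving two reproducible failure paths.
function createMockFetch(
  fixtures: Map<string, string>,
  maxChars = 1024,
): (url: string) => FetchResult {
  return (url: string): FetchResult => {
    const body = fixtures.get(url);
    if (body === undefined) {
      return { ok: false, error: "denied: url not in fixture set" };
    }
    if (body.length > maxChars) {
      return { ok: false, error: "oversized payload" };
    }
    return { ok: true, body };
  };
}
```

Because every response is derived from the fixture map alone, the same prompt yields the same tool result on both runtimes, which is what makes result-shape comparison meaningful.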

Concrete deliverables

Fixtures

  • qa/scenarios/runtime/tools/<tool>.md — one file per family. Reuse the existing scenario format already used by approval-turn-tool-followthrough.md.
  • Each fixture exports both a happy-path and a failure-path scenario.

Code

  • Extend extensions/qa-lab/src/runtime-parity.ts (from Phase 1) — add toolBreakdown field to the report so per-tool drift surfaces alongside per-scenario drift.
  • New extensions/qa-lab/src/tool-coverage-report.ts — generates a Markdown coverage table:
    | tool | pi | codex | drift | tracking |
    |------|----|-------|-------|----------|
    | bash | ✅  | ✅     | none  |          |
    | exec | ✅  | ❌     | tool-result-shape | #issue |
    
  • Extend extensions/qa-lab/src/cli.ts — new qa tool-coverage --runtime-pair pi,codex command.
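As a rough sketch of what the coverage renderer might look like, assuming a hypothetical `ToolBreakdownEntry` shape for the entries of the new toolBreakdown field (the actual field layout is left to the PR):

```typescript
// Hypothetical per-tool entry; field names are assumptions, not the Phase 1 report schema.
interface ToolBreakdownEntry {
  tool: string;
  pi: boolean;             // true if the fixture passed on pi
  codex: boolean;          // true if the fixture passed on codex
  drift: string | null;    // e.g. "tool-result-shape", or null for no drift
  tracking: string | null; // tracking issue reference, if any
}

// Render the entries as a Markdown table with the columns shown above.
function renderCoverageTable(entries: ToolBreakdownEntry[]): string {
  const mark = (ok: boolean): string => (ok ? "✅" : "❌");
  const rows = entries.map(
    (e) =>
      `| ${e.tool} | ${mark(e.pi)} | ${mark(e.codex)} | ${e.drift ?? "none"} | ${e.tracking ?? ""} |`,
  );
  return [
    "| tool | pi | codex | drift | tracking |",
    "|------|----|-------|-------|----------|",
    ...rows,
  ].join("\n");
}
```

Keeping the renderer a pure function from entries to a string should make the coverage report rendering test a plain string comparison, with no harness needed.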

Tests

  • Each fixture has a self-test running it through the mock provider on both runtimes (no qa-lab harness dependency for the self-test — keeps fixtures portable).
  • Coverage report rendering test.

Acceptance criteria

  • Each tool family in the list above has a qa/scenarios/runtime/tools/<tool>.md fixture.
  • Each fixture passes both cells under --runtime-pair pi,codex against current main, OR is annotated with a known-broken marker pointing at a tracking issue (file the tracking issue as part of this PR if discovered).
  • The runtime-parity report enumerates per-tool drift, not just per-scenario drift.
  • pnpm openclaw qa tool-coverage --runtime-pair pi,codex produces a Markdown table suitable for the README of the harness.
  • pnpm check:test-types and pnpm exec oxlint both run clean.
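The first two criteria can be checked mechanically. The sketch below assumes a hypothetical `ToolFixtureResult` record per tool family and a `knownBroken` marker that carries a tracking issue reference; neither name comes from the existing harness.

```typescript
type CellStatus = "pass" | "fail";

// Hypothetical per-fixture outcome under --runtime-pair pi,codex.
interface ToolFixtureResult {
  tool: string;
  pi: CellStatus;
  codex: CellStatus;
  knownBroken?: { trackingIssue: string };
}

// Returns one message per unmet criterion; an empty list means acceptance holds.
function unmetAcceptance(
  required: string[],
  results: ToolFixtureResult[],
): string[] {
  const byTool = new Map<string, ToolFixtureResult>();
  for (const r of results) byTool.set(r.tool, r);

  const problems: string[] = [];
  for (const tool of required) {
    const r = byTool.get(tool);
    if (r === undefined) {
      problems.push(`${tool}: missing fixture`);
      continue;
    }
    const bothPass = r.pi === "pass" && r.codex === "pass";
    const annotated = r.knownBroken !== undefined && r.knownBroken.trackingIssue !== "";
    if (!bothPass && !annotated) {
      problems.push(`${tool}: fails without a known-broken marker`);
    }
  }
  return problems;
}
```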

Out of scope

  • Plugin-lifecycle stress (Phase 3).
  • Token efficiency (Phase 4).
  • Live-mode runs — fixtures must be hermetic in this PR.
