first-hour-20 runtime parity has 3 native-live proof gaps after harness correction #80434

# TLDR

**Resolved: the hard-gate failure was cleared by the beta.5 correction in PR #80323.**

The previous issue body reported `first-hour-20` at **18 total / 6 pass / 12 fail**. Those numbers reflected the pre-correction mock/harness state. The current beta.5 proof shows:

```json
{
  "total": 18,
  "passed": 15,
  "skipped": 3,
  "failed": 0
}
```

**Product impact if OpenClaw moved fully to Codex today: P4 for this issue as filed.** The remaining 3 rows are native/live proof gaps, not Codex product bugs proven by the mock gate.

**QA impact: P0 resolved for the maintainer mock gate.** The gate now has zero hard failures and explicit report-only rows.
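The gate rule stated above (zero hard failures, with skipped rows reported rather than failed) can be sketched as follows. This is a minimal illustration, assuming the summary carries exactly the four counts shown in the JSON above; the function name `evaluate_gate` and the choice to treat every skipped row as report-only are assumptions for illustration, not the actual harness code.

```python
import json

# Counts copied from the beta.5 proof summary shown above.
SUMMARY_JSON = """
{
  "total": 18,
  "passed": 15,
  "skipped": 3,
  "failed": 0
}
"""


def evaluate_gate(summary: dict) -> dict:
    """Hypothetical gate check: every row must be accounted for,
    and the hard gate passes only when no row failed."""
    passed = summary["passed"]
    skipped = summary["skipped"]
    failed = summary["failed"]

    # Sanity check: the three buckets must add up to the total.
    accounted = passed + skipped + failed
    if accounted != summary["total"]:
        raise ValueError(
            f"summary is inconsistent: {accounted} rows accounted for, "
            f"{summary['total']} expected"
        )

    return {
        "hard_gate_ok": failed == 0,   # zero hard failures
        "report_only_rows": skipped,   # assumed: skipped rows are report-only
    }


result = evaluate_gate(json.loads(SUMMARY_JSON))
print(result)  # {'hard_gate_ok': True, 'report_only_rows': 3}
```

Feeding in the pre-correction counts (6 passed, 12 failed, 0 skipped) makes the same check report `hard_gate_ok` as `False`, which matches the state this issue was originally filed against.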

# Latest Evidence

```text
OpenClaw baseline: v2026.5.10-beta.5
PR: #80323
PR head: 3336dec6419c9cc9a87dc7cfa6f48118ca2d838e
Remote proof run: https://github.com/electricsheephq/openclaw-local-test/actions/runs/25719383976
Confidence tracker: #80936
Artifact: first-hour-20-direct/qa-suite-summary.json
```

Remote workflow step `Run first-hour-20 direct lane` completed successfully.

# Report-Only Rows

These rows are intentionally not hard-failed in mock mode:

- `Instruction followthrough repo contract`: mock-openai cannot exercise Codex-native read/write tools.
- `Approval turn tool followthrough`: mock-openai still models approval followthrough as a Pi-style `read` call; Codex-native approval/read behavior needs native/live proof.
- `Compaction retry after mutating tool`: mock-openai cannot create files through Codex-native read/write; compaction replay safety remains a native/live proof lane.
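One way to picture the report-only handling listed above is a small lookup keyed by row name. This is a sketch only: the row names and reasons are taken from the list above, while `row_disposition`, the `"native-live"` mode label, and the assumption that these rows would hard-gate outside mock mode are hypothetical and not confirmed by the harness.

```python
# Row names and reasons paraphrased from the report-only list above.
REPORT_ONLY_IN_MOCK = {
    "Instruction followthrough repo contract":
        "mock-openai cannot exercise Codex-native read/write tools",
    "Approval turn tool followthrough":
        "mock-openai models approval followthrough as a Pi-style read call",
    "Compaction retry after mutating tool":
        "mock-openai cannot create files through Codex-native read/write",
}


def row_disposition(row_name: str, provider_mode: str) -> str:
    """Hypothetical classifier: a listed row is report-only under the
    mock provider; everything else is treated as a hard-gate row."""
    if provider_mode == "mock-openai" and row_name in REPORT_ONLY_IN_MOCK:
        return "report-only"
    return "hard-gate"


for name, reason in REPORT_ONLY_IN_MOCK.items():
    print(f"{name}: {row_disposition(name, 'mock-openai')} ({reason})")
```

Keeping the reason next to each row name means the report can say why a row was not hard-failed, rather than just skipping it silently.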

# Current Verdict

This issue should stay closed: it is the stale failing-gate report. Track the remaining proof work in:

- #80936 for the beta.5 confidence proof summary.
- #80397 for live-frontier token/native proof.
- #80433 for scheduled/Testbox `soak-100` proof.
- #80319 for searchable/deferred mock-provider fidelity.

