
GPT-5.4 / Codex agentic runtime parity in OpenClaw #64227

@100yenadmin

Description


GPT-5.4 / Codex Agentic Parity

Make GPT-5.4 agents work as well as Claude agents in OpenClaw: follow instructions, use tools proactively, execute multi-step tasks without stopping to ask, and stop over-confirming.


How the three open PRs work together

#65219 (runtime rollup)     → strict-agentic auto-activates for GPT-5, widened model ID matching
         ↓
#65224 (parity proof rollup) → 10-scenario test harness, mock auth staging, CI workflow
         ↓
#65257 (behavioral fix)      → imperative prompt, one-action loophole fix, auto-continuation loop

#65219 makes the strict-agentic contract fire by default for GPT-5 without users needing to configure anything. It's the foundation — without it, none of the other fixes activate.

#65224 is the test harness that proves parity. It runs GPT-5.4 and Opus 4.6 through the same 10 scenarios, compares the results, and produces a pass/fail verdict. It includes the CI workflow (.github/workflows/parity-gate.yml) that runs the gate automatically on every PR touching the relevant surface.

#65257 is the behavioral fix that addresses the actual user-facing complaints. It rewrites the GPT-5.4 system prompt to be as directive as Claude's, closes the "one tool call then ask permission" detection loophole, and adds an auto-continuation loop that keeps the model going for up to 5 turns without returning to the user.

All three are independent and can merge in any order. Together they cover the full stack: runtime activation → test proof → behavioral improvement.


Previously merged

| PR | What it does |
| --- | --- |
| #64241 | Strict-agentic execution contract: retries plan-only turns, blocks after 2 failures |
| #64439 | Runtime truthfulness: classifies provider failures, makes /elevated full guidance accurate |
| #64300 | Execution correctness: OpenAI/Codex tool schema compat, replay/liveness state visibility |
| #64441 | First-wave parity harness: 5-scenario QA pack, qa parity-report gate |

Previously closed (consolidated into rollups)

PRs E, H, J, K, L, M, N, and F were individual wave-2 PRs. They were consolidated into #65219 and #65224 to reduce review burden. PR F was closed as superseded (its 3 fixes landed upstream independently).


Open PRs (review these)

#65219 — GPT-5.4 runtime completion rollup

What: Auto-activates the strict-agentic execution contract for unconfigured GPT-5 openai / openai-codex runs. Widens model ID matching to handle gpt-5o, gpt-5-preview, prefixed openai/gpt-5.4. Emits explicit "blocked" liveness state at the strict-agentic exit. Explicit executionContract: "default" opt-out is honored.

Key file: src/agents/execution-contract.ts (resolveEffectiveExecutionContract() and isStrictAgenticSupportedProviderModel()).

Why it matters: Without this, GPT-5.4 users need to manually set agents.defaults.embeddedPi.executionContract: "strict-agentic" to get any of the behavioral improvements. With this, it just works.
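The activation logic above can be sketched as follows. This is a minimal illustration, not the actual OpenClaw implementation: the function names mirror the PR's helpers, but the bodies, the `RunConfig` shape, and the model ID regex are assumptions based on the behavior described (widened matching for gpt-5o, gpt-5-preview, and prefixed openai/gpt-5.4, with an explicit "default" opt-out honored).

```typescript
type ExecutionContract = "default" | "strict-agentic";

// Hypothetical config shape for illustration only.
interface RunConfig {
  provider: string;                       // e.g. "openai", "openai-codex"
  model: string;                          // e.g. "gpt-5.4", "openai/gpt-5.4"
  executionContract?: ExecutionContract;  // explicit user setting, if any
}

// Strip an optional "openai/" prefix, then match the GPT-5 family,
// including variants like "gpt-5o" and "gpt-5-preview".
function isStrictAgenticSupportedProviderModel(provider: string, model: string): boolean {
  if (provider !== "openai" && provider !== "openai-codex") return false;
  const bare = model.replace(/^openai\//, "");
  return /^gpt-5($|[o.\-])/.test(bare);
}

// Auto-activate strict-agentic for unconfigured GPT-5 runs, but honor an
// explicit executionContract: "default" opt-out.
function resolveEffectiveExecutionContract(cfg: RunConfig): ExecutionContract {
  if (cfg.executionContract !== undefined) return cfg.executionContract;
  return isStrictAgenticSupportedProviderModel(cfg.provider, cfg.model)
    ? "strict-agentic"
    : "default";
}
```

The key design point is that an explicit setting always wins, so existing configurations are unaffected; auto-activation only fills the unconfigured case.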

#65224 — GPT-5.4 parity proof rollup

What: Consolidates the test harness work — 10-scenario parity pack, tool-call assertions on 8 scenarios, Anthropic /v1/messages mock route, mock auth staging (so the gate runs without real credentials), run metadata on qa-suite-summary.json, positive-tone fake-success detector, run.primaryProvider label verification, docs + diagrams, and .github/workflows/parity-gate.yml.

Key files: extensions/qa-lab/src/mock-openai-server.ts, extensions/qa-lab/src/agentic-parity-report.ts, qa/scenarios/*.md, .github/workflows/parity-gate.yml.

Why it matters: Provides the repeatable, CI-enforced proof that GPT-5.4 matches Opus 4.6 on the agreed metrics. Before this, parity claims were anecdotal.
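The pass/fail verdict can be sketched roughly like this. The real comparison lives in extensions/qa-lab/src/agentic-parity-report.ts; the types and the "candidate must pass every scenario the baseline passes" rule here are illustrative assumptions about how such a gate could be structured.

```typescript
// Hypothetical result shape for one scenario run.
interface ScenarioResult {
  scenario: string;
  passed: boolean;
  toolCalls: number; // tool-call assertions apply to 8 of the 10 scenarios
}

interface ParityVerdict {
  pass: boolean;
  failures: string[];
}

// GPT-5.4 (candidate) is at parity when it passes every scenario the
// Opus 4.6 baseline passes.
function compareRuns(candidate: ScenarioResult[], baseline: ScenarioResult[]): ParityVerdict {
  const baselinePassed = new Set(
    baseline.filter((r) => r.passed).map((r) => r.scenario),
  );
  const failures = candidate
    .filter((r) => baselinePassed.has(r.scenario) && !r.passed)
    .map((r) => r.scenario);
  return { pass: failures.length === 0, failures };
}
```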

#65257 — GPT-5.4 execution bias + one-action loophole + auto-continuation

What: Three commits addressing the four user-reported behavioral gaps:

  1. Imperative execution bias (extensions/openai/prompt-overlay.ts) — rewrites GPT-5.4's system prompt from passive ("start the real work") to imperative ("Use a real tool call FIRST. Commentary-only turns are incomplete. Do not stop after one step to ask permission."). Adds tool_call_style override with tool-first reinforcement AND approval safety guidance.

  2. One-action-then-narrative loophole (src/agents/pi-embedded-runner/run/incomplete-turn.ts) — the planning-only detector was exempting turns with any non-plan tool call. Now catches turns where exactly 1 tool call + planning prose = retry eligible. Also fixes the startedCount guard bypass flagged by Codex-connector P1 reviews.

  3. Auto-continuation loop (src/agents/pi-embedded-runner/run.ts + config surface) — when GPT-5.4's turn ends with tool results + continuation intent ("I'll analyze next"), the runner injects "Continue. Take the next concrete action." and loops instead of returning to the user. Config: agents.defaults.embeddedPi.continuationMode ("auto" / "prompt" / "off"), budget of 5 turns (configurable). 11-condition safety guard includes budget, abort, timeout, tool errors, messaging, side effects, and completion language detection.
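The auto-continuation mechanism in commit 3 can be sketched as below. This is a simplified model, not the real loop in src/agents/pi-embedded-runner/run.ts: the actual implementation has an 11-condition safety guard, whereas this sketch shows only the turn budget, the mode switch, and a naive continuation-intent heuristic; the `TurnResult` shape and `hasContinuationIntent` are assumptions.

```typescript
type ContinuationMode = "auto" | "prompt" | "off";

// Hypothetical shape of a completed model turn.
interface TurnResult {
  hadToolResults: boolean;
  text: string;
  aborted: boolean;
}

// Naive intent check: tool results plus trailing prose like "I'll analyze
// next", and no completion language like "done" / "finished".
function hasContinuationIntent(turn: TurnResult): boolean {
  if (!turn.hadToolResults || turn.aborted) return false;
  if (/\b(done|complete|finished)\b/i.test(turn.text)) return false;
  return /\b(next|I'll|I will|then)\b/i.test(turn.text);
}

// Loop for up to `budget` extra turns instead of returning to the user.
async function runWithContinuation(
  runTurn: (prompt: string) => Promise<TurnResult>,
  firstPrompt: string,
  mode: ContinuationMode,
  budget = 5, // configurable turn budget
): Promise<TurnResult> {
  let turn = await runTurn(firstPrompt);
  if (mode !== "auto") return turn;
  for (let i = 0; i < budget && hasContinuationIntent(turn); i++) {
    // Inject the continuation nudge instead of handing control back.
    turn = await runTurn("Continue. Take the next concrete action.");
  }
  return turn;
}
```

In "prompt" or "off" mode the runner returns after the first turn, so the nudge only ever fires when the user opted into (or defaulted to) auto mode.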

Why it matters: This is the PR that makes GPT-5.4 actually feel like Claude in practice. The other two PRs build the infrastructure and activate the contract; this one addresses the behaviors users complain about.


Root problems → solutions mapping

| Problem | Root cause | Solution | PR |
| --- | --- | --- | --- |
| GPT-5.4 doesn't follow instructions | OpenAI prompt overlay was passive; missing "commentary-only turns are incomplete" | Imperative execution bias matching Claude's default strength | #65257 |
| GPT-5.4 doesn't use tools proactively | No tool_call_style override for GPT-5.4 | New tool_call_style section: "Call tools directly without narrating" | #65257 |
| GPT-5.4 does one step then asks permission | Planning-only detector exempted turns with any tool call | isSingleActionThenNarrativePattern catches 1-tool-call + planning prose | #65257 |
| GPT-5.4 over-confirms between steps | No auto-continuation mechanism | Auto-continuation loop with 5-turn budget intercepts permission requests | #65257 |
| Strict-agentic didn't auto-activate | Required manual config | Auto-activates for GPT-5 openai/openai-codex runs | #65219 |
| Parity claims were anecdotal | No test harness | 10-scenario pack + CI gate + mock auth staging | #65224 |

Scorecard

| Objective | Before program | After #65219 + #65224 | After #65257 |
| --- | --- | --- | --- |
| Follow instructions | 3/10 | 5/10 | 9/10 |
| Use tools proactively | 3/10 | 5/10 | 9/10 |
| Multi-step execution | 2/10 | 4/10 | 10/10 |
| No over-confirmation | 2/10 | 4/10 | 9/10 |
| Overall | 2.5/10 | 4.5/10 | 9.25/10 |

The remaining 0.75 to a perfect 10 comes from GPT-5.4's own RLHF training behavior, which can't be fixed via system prompt or runtime changes. The auto-continuation loop is the ceiling of what OpenClaw can do at the runtime layer.


Merge order

All three PRs are independent — merge in any order. For best reviewer experience:

  1. agents: GPT-5.4 runtime completion rollup  #65219 first (runtime activation — the prerequisite for everything else working)
  2. agents: GPT-5.4 parity proof rollup  #65224 second (test proof — validates the runtime changes)
  3. agents: strengthen GPT-5.4 execution bias and close the one-action-then-narrative loophole #65257 last (behavioral improvements — builds on the activated contract)
