
GPT-5.4 / Codex agentic runtime parity in OpenClaw #64227

@100yenadmin

Description


GPT-5.4 / Codex Agentic Parity

Make GPT-5.4 agents work as well as Claude agents in OpenClaw: follow instructions, use tools proactively, execute multi-step tasks without stopping to ask, and stop over-confirming.


How the three open PRs work together

#65219 (runtime rollup)     → strict-agentic auto-activates for GPT-5, widened model ID matching
         ↓
#65224 (parity proof rollup) → 10-scenario test harness, mock auth staging, CI workflow
         ↓
#65257 (behavioral fix)      → imperative prompt, one-action loophole fix, auto-continuation loop

#65219 makes the strict-agentic contract fire by default for GPT-5 without users needing to configure anything. It's the foundation — without it, none of the other fixes activate.

#65224 is the test harness that proves parity. It runs GPT-5.4 and Opus 4.6 through the same 10 scenarios, compares the results, and produces a pass/fail verdict. It includes the CI workflow (.github/workflows/parity-gate.yml) that runs the gate automatically on every PR touching the relevant surface.

#65257 is the behavioral fix that addresses the actual user-facing complaints. It rewrites the GPT-5.4 system prompt to be as directive as Claude's, closes the "one tool call then ask permission" detection loophole, and adds an auto-continuation loop that keeps the model going for up to 5 turns without returning to the user.

All three are independent and can merge in any order. Together they cover the full stack: runtime activation → test proof → behavioral improvement.


Previously merged

| PR | What it does |
| --- | --- |
| #64241 | Strict-agentic execution contract: retries plan-only turns, blocks after 2 failures |
| #64439 | Runtime truthfulness: classifies provider failures, makes /elevated full guidance accurate |
| #64300 | Execution correctness: OpenAI/Codex tool schema compat, replay/liveness state visibility |
| #64441 | First-wave parity harness: 5-scenario QA pack, qa parity-report gate |

Previously closed (consolidated into rollups)

PRs E, H, J, K, L, M, N, and F were individual wave-2 PRs. They were consolidated into #65219 and #65224 to reduce review burden. PR F was closed as superseded (its 3 fixes landed upstream independently).


Open PRs (review these)

#65219 — GPT-5.4 runtime completion rollup

What: Auto-activates the strict-agentic execution contract for unconfigured GPT-5 openai / openai-codex runs. Widens model ID matching to handle gpt-5o, gpt-5-preview, prefixed openai/gpt-5.4. Emits explicit "blocked" liveness state at the strict-agentic exit. Explicit executionContract: "default" opt-out is honored.

Key file: src/agents/execution-contract.ts (resolveEffectiveExecutionContract() and isStrictAgenticSupportedProviderModel()).

Why it matters: Without this, GPT-5.4 users need to manually set agents.defaults.embeddedPi.executionContract: "strict-agentic" to get any of the behavioral improvements. With this, it just works.
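The activation logic above can be sketched as follows. This is a minimal illustration, not the actual OpenClaw implementation: the function names mirror the PR's helpers, but the bodies, the `RunConfig` shape, and the model ID regex are assumptions based on the behavior described (widened matching for gpt-5o, gpt-5-preview, and prefixed openai/gpt-5.4, with an explicit "default" opt-out honored).

```typescript
type ExecutionContract = "default" | "strict-agentic";

// Hypothetical config shape for illustration only.
interface RunConfig {
  provider: string;                       // e.g. "openai", "openai-codex"
  model: string;                          // e.g. "gpt-5.4", "openai/gpt-5.4"
  executionContract?: ExecutionContract;  // explicit user setting, if any
}

// Strip an optional "openai/" prefix, then match the GPT-5 family,
// including variants like "gpt-5o" and "gpt-5-preview".
function isStrictAgenticSupportedProviderModel(provider: string, model: string): boolean {
  if (provider !== "openai" && provider !== "openai-codex") return false;
  const bare = model.replace(/^openai\//, "");
  return /^gpt-5($|[o.\-])/.test(bare);
}

// Auto-activate strict-agentic for unconfigured GPT-5 runs, but honor an
// explicit executionContract: "default" opt-out.
function resolveEffectiveExecutionContract(cfg: RunConfig): ExecutionContract {
  if (cfg.executionContract !== undefined) return cfg.executionContract;
  return isStrictAgenticSupportedProviderModel(cfg.provider, cfg.model)
    ? "strict-agentic"
    : "default";
}
```

The key design point is that an explicit setting always wins, so existing configurations are unaffected; auto-activation only fills the unconfigured case.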

#65224 — GPT-5.4 parity proof rollup

What: Consolidates the test harness work — 10-scenario parity pack, tool-call assertions on 8 scenarios, Anthropic /v1/messages mock route, mock auth staging (so the gate runs without real credentials), run metadata on qa-suite-summary.json, positive-tone fake-success detector, run.primaryProvider label verification, docs + diagrams, and .github/workflows/parity-gate.yml.

Key files: extensions/qa-lab/src/mock-openai-server.ts, extensions/qa-lab/src/agentic-parity-report.ts, qa/scenarios/*.md, .github/workflows/parity-gate.yml.

Why it matters: Provides the repeatable, CI-enforced proof that GPT-5.4 matches Opus 4.6 on the agreed metrics. Before this, parity claims were anecdotal.
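The pass/fail verdict can be sketched roughly like this. The real comparison lives in extensions/qa-lab/src/agentic-parity-report.ts; the types and the "candidate must pass every scenario the baseline passes" rule here are illustrative assumptions about how such a gate could be structured.

```typescript
// Hypothetical result shape for one scenario run.
interface ScenarioResult {
  scenario: string;
  passed: boolean;
  toolCalls: number; // tool-call assertions apply to 8 of the 10 scenarios
}

interface ParityVerdict {
  pass: boolean;
  failures: string[];
}

// GPT-5.4 (candidate) is at parity when it passes every scenario the
// Opus 4.6 baseline passes.
function compareRuns(candidate: ScenarioResult[], baseline: ScenarioResult[]): ParityVerdict {
  const baselinePassed = new Set(
    baseline.filter((r) => r.passed).map((r) => r.scenario),
  );
  const failures = candidate
    .filter((r) => baselinePassed.has(r.scenario) && !r.passed)
    .map((r) => r.scenario);
  return { pass: failures.length === 0, failures };
}
```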

#65257 — GPT-5.4 execution bias + one-action loophole + auto-continuation

What: Three commits addressing the four user-reported behavioral gaps:

  1. Imperative execution bias (extensions/openai/prompt-overlay.ts) — rewrites GPT-5.4's system prompt from passive ("start the real work") to imperative ("Use a real tool call FIRST. Commentary-only turns are incomplete. Do not stop after one step to ask permission."). Adds tool_call_style override with tool-first reinforcement AND approval safety guidance.

  2. One-action-then-narrative loophole (src/agents/pi-embedded-runner/run/incomplete-turn.ts) — the planning-only detector was exempting turns with any non-plan tool call. Now catches turns where exactly 1 tool call + planning prose = retry eligible. Also fixes the startedCount guard bypass flagged by Codex-connector P1 reviews.

  3. Auto-continuation loop (src/agents/pi-embedded-runner/run.ts + config surface) — when GPT-5.4's turn ends with tool results + continuation intent ("I'll analyze next"), the runner injects "Continue. Take the next concrete action." and loops instead of returning to the user. Config: agents.defaults.embeddedPi.continuationMode ("auto" / "prompt" / "off"), budget of 5 turns (configurable). 11-condition safety guard includes budget, abort, timeout, tool errors, messaging, side effects, and completion language detection.
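The auto-continuation mechanism in commit 3 can be sketched as below. This is a simplified model, not the real loop in src/agents/pi-embedded-runner/run.ts: the actual implementation has an 11-condition safety guard, whereas this sketch shows only the turn budget, the mode switch, and a naive continuation-intent heuristic; the `TurnResult` shape and `hasContinuationIntent` are assumptions.

```typescript
type ContinuationMode = "auto" | "prompt" | "off";

// Hypothetical shape of a completed model turn.
interface TurnResult {
  hadToolResults: boolean;
  text: string;
  aborted: boolean;
}

// Naive intent check: tool results plus trailing prose like "I'll analyze
// next", and no completion language like "done" / "finished".
function hasContinuationIntent(turn: TurnResult): boolean {
  if (!turn.hadToolResults || turn.aborted) return false;
  if (/\b(done|complete|finished)\b/i.test(turn.text)) return false;
  return /\b(next|I'll|I will|then)\b/i.test(turn.text);
}

// Loop for up to `budget` extra turns instead of returning to the user.
async function runWithContinuation(
  runTurn: (prompt: string) => Promise<TurnResult>,
  firstPrompt: string,
  mode: ContinuationMode,
  budget = 5, // configurable turn budget
): Promise<TurnResult> {
  let turn = await runTurn(firstPrompt);
  if (mode !== "auto") return turn;
  for (let i = 0; i < budget && hasContinuationIntent(turn); i++) {
    // Inject the continuation nudge instead of handing control back.
    turn = await runTurn("Continue. Take the next concrete action.");
  }
  return turn;
}
```

In "prompt" or "off" mode the runner returns after the first turn, so the nudge only ever fires when the user opted into (or defaulted to) auto mode.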

Why it matters: This is the PR that makes GPT-5.4 actually feel like Claude in practice. The other two PRs build the infrastructure and activate the contract; this one addresses the behaviors users complain about.


Root problems → solutions mapping

| Problem | Root cause | Solution | PR |
| --- | --- | --- | --- |
| GPT-5.4 doesn't follow instructions | OpenAI prompt overlay was passive; missing "commentary-only turns are incomplete" | Imperative execution bias matching Claude's default strength | #65257 |
| GPT-5.4 doesn't use tools proactively | No tool_call_style override for GPT-5.4 | New tool_call_style section: "Call tools directly without narrating" | #65257 |
| GPT-5.4 does one step then asks permission | Planning-only detector exempted turns with any tool call | isSingleActionThenNarrativePattern catches 1-tool-call + planning prose | #65257 |
| GPT-5.4 over-confirms between steps | No auto-continuation mechanism | Auto-continuation loop with 5-turn budget intercepts permission requests | #65257 |
| Strict-agentic didn't auto-activate | Required manual config | Auto-activates for GPT-5 openai/openai-codex runs | #65219 |
| Parity claims were anecdotal | No test harness | 10-scenario pack + CI gate + mock auth staging | #65224 |

Scorecard

| Objective | Before program | After #65219 + #65224 | After #65257 |
| --- | --- | --- | --- |
| Follow instructions | 3/10 | 5/10 | 9/10 |
| Use tools proactively | 3/10 | 5/10 | 9/10 |
| Multi-step execution | 2/10 | 4/10 | 10/10 |
| No over-confirmation | 2/10 | 4/10 | 9/10 |
| Overall | 2.5/10 | 4.5/10 | 9.25/10 |

The remaining 0.75 to a perfect 10 comes from GPT-5.4's own RLHF training behavior, which can't be fixed via system prompt or runtime changes. The auto-continuation loop is the ceiling of what OpenClaw can do at the runtime layer.


Merge order

All three PRs are independent — merge in any order. For best reviewer experience:

  1. agents: GPT-5.4 runtime completion rollup  #65219 first (runtime activation — the prerequisite for everything else working)
  2. agents: GPT-5.4 parity proof rollup  #65224 second (test proof — validates the runtime changes)
  3. agents: strengthen GPT-5.4 execution bias and close the one-action-then-narrative loophole #65257 last (behavioral improvements — builds on the activated contract)
