GPT-5.4 / Codex Agentic Parity
Make GPT-5.4 agents work as well as Claude agents in OpenClaw: follow instructions, use tools proactively, execute multi-step tasks without stopping to ask, and stop over-confirming.
How the three open PRs work together
#65219 (runtime rollup) → strict-agentic auto-activates for GPT-5, widened model ID matching
↓
#65224 (parity proof rollup) → 10-scenario test harness, mock auth staging, CI workflow
↓
#65257 (behavioral fix) → imperative prompt, one-action loophole fix, auto-continuation loop
#65219 makes the strict-agentic contract fire by default for GPT-5 without users needing to configure anything. It's the foundation — without it, none of the other fixes activate.
#65224 is the test harness that proves parity. It runs GPT-5.4 and Opus 4.6 through the same 10 scenarios, compares the results, and produces a pass/fail verdict. It includes the CI workflow (.github/workflows/parity-gate.yml) that runs the gate automatically on every PR touching the relevant surface.
#65257 is the behavioral fix that addresses the actual user-facing complaints. It rewrites the GPT-5.4 system prompt to be as directive as Claude's, closes the "one tool call then ask permission" detection loophole, and adds an auto-continuation loop that keeps the model going for up to 5 turns without returning to the user.
All three are independent and can merge in any order. Together they cover the full stack: runtime activation → test proof → behavioral improvement.
Previously closed (consolidated into rollups)

PRs E, H, J, K, L, M, N, and F were individual wave-2 PRs. They were consolidated into #65219 and #65224 to reduce review burden. PR F was closed as superseded (its 3 fixes landed upstream independently).
Open PRs (review these)

#65219 — GPT-5.4 runtime completion rollup

What: Auto-activates the strict-agentic execution contract for unconfigured GPT-5 runs on the openai / openai-codex providers. Widens model ID matching to handle gpt-5o, gpt-5-preview, and prefixed openai/gpt-5.4. Emits an explicit "blocked" liveness state at the strict-agentic exit. An explicit executionContract: "default" opt-out is honored.
Key file: src/agents/execution-contract.ts — resolveEffectiveExecutionContract() and isStrictAgenticSupportedProviderModel().
Why it matters: Without this, GPT-5.4 users need to manually set agents.defaults.embeddedPi.executionContract: "strict-agentic" to get any of the behavioral improvements. With this, it just works.
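As a hedged sketch of how that resolution could work (the function names mirror the PR's exports, but the bodies here are illustrative assumptions, not the actual implementation):

```typescript
// Illustrative sketch only; names mirror #65219's exported functions,
// bodies are assumptions inferred from the PR description.
type ExecutionContract = "default" | "strict-agentic";

// Widened matching: bare IDs (gpt-5o, gpt-5-preview, gpt-5.4) and
// provider-prefixed IDs (openai/gpt-5.4) should all qualify.
function isStrictAgenticSupportedProviderModel(provider: string, model: string): boolean {
  if (provider !== "openai" && provider !== "openai-codex") return false;
  const bare = model.includes("/") ? model.split("/").pop()! : model;
  return /^gpt-5/.test(bare); // covers gpt-5o, gpt-5-preview, gpt-5.4, ...
}

// Auto-activate strict-agentic when the user has configured nothing;
// an explicit executionContract: "default" opt-out is honored.
function resolveEffectiveExecutionContract(
  configured: ExecutionContract | undefined,
  provider: string,
  model: string,
): ExecutionContract {
  if (configured !== undefined) return configured; // explicit setting wins
  return isStrictAgenticSupportedProviderModel(provider, model)
    ? "strict-agentic"
    : "default";
}
```

The key design point from the PR is the ordering: an explicit user setting, including "default", always short-circuits the auto-activation.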
#65224 — GPT-5.4 parity proof rollup

What: Consolidates the test harness work — 10-scenario parity pack, tool-call assertions on 8 scenarios, Anthropic /v1/messages mock route, mock auth staging (so the gate runs without real credentials), run metadata on qa-suite-summary.json, a positive-tone fake-success detector, run.primaryProvider label verification, docs + diagrams, and .github/workflows/parity-gate.yml.

Key files: extensions/qa-lab/src/mock-openai-server.ts, extensions/qa-lab/src/agentic-parity-report.ts, qa/scenarios/*.md, .github/workflows/parity-gate.yml.
Why it matters: Provides the repeatable, CI-enforced proof that GPT-5.4 matches Opus 4.6 on the agreed metrics. Before this, parity claims were anecdotal.
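For illustration, a pass/fail verdict over two runs' summaries could be computed like this. The ScenarioResult/RunSummary shapes and the parityVerdict helper are hypothetical, not #65224's actual qa-suite-summary.json schema:

```typescript
// Hypothetical summary shapes; the real qa-suite-summary.json schema may differ.
interface ScenarioResult {
  scenario: string;
  passed: boolean;
  toolCalls: number;
}

interface RunSummary {
  primaryProvider: string; // the PR verifies this label against the run
  results: ScenarioResult[];
}

// Parity gate sketch: the candidate (GPT-5.4) must pass every scenario
// the baseline (Opus 4.6) passes; any gap fails the gate.
function parityVerdict(baseline: RunSummary, candidate: RunSummary): { pass: boolean; failures: string[] } {
  const candidateByScenario = new Map(candidate.results.map((r) => [r.scenario, r]));
  const failures = baseline.results
    .filter((b) => b.passed && !(candidateByScenario.get(b.scenario)?.passed ?? false))
    .map((b) => b.scenario);
  return { pass: failures.length === 0, failures };
}
```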
#65257 — GPT-5.4 execution bias + one-action loophole + auto-continuation

What: Three commits addressing the four user-reported behavioral gaps:
Imperative execution bias (extensions/openai/prompt-overlay.ts) — rewrites GPT-5.4's system prompt from passive ("start the real work") to imperative ("Use a real tool call FIRST. Commentary-only turns are incomplete. Do not stop after one step to ask permission."). Adds a tool_call_style override with tool-first reinforcement and approval safety guidance.
One-action-then-narrative loophole (src/agents/pi-embedded-runner/run/incomplete-turn.ts) — the planning-only detector was exempting any turn that contained a non-plan tool call. It now flags turns with exactly one tool call followed by planning prose as retry-eligible. Also fixes the startedCount guard bypass flagged by Codex-connector P1 reviews.
Auto-continuation loop (src/agents/pi-embedded-runner/run.ts + config surface) — when GPT-5.4's turn ends with tool results + continuation intent ("I'll analyze next"), the runner injects "Continue. Take the next concrete action." and loops instead of returning to the user. Config: agents.defaults.embeddedPi.continuationMode ("auto" / "prompt" / "off"), budget of 5 turns (configurable). 11-condition safety guard includes budget, abort, timeout, tool errors, messaging, side effects, and completion language detection.
Why it matters: This is the PR that makes GPT-5.4 actually feel like Claude in practice. The other two PRs build the infrastructure and activate the contract; this one addresses the behaviors users complain about.
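A hedged sketch of the auto-continuation loop described above. The shape is inferred from the PR description; the TurnResult type, helper names, and the simplified two-regex intent heuristic are hypothetical stand-ins (the real guard has 11 conditions):

```typescript
// Hypothetical turn shape; the real runner's types are richer.
interface TurnResult {
  text: string;
  hadToolResults: boolean;
  aborted: boolean;
}

const CONTINUATION_BUDGET = 5; // default budget per the PR, configurable

// Continuation intent: the model narrates a next step ("I'll analyze next")
// without using completion language. Simplified stand-in heuristic.
function hasContinuationIntent(text: string): boolean {
  return /\b(I'll|I will|next,? I)\b/i.test(text)
    && !/\b(done|complete|finished)\b/i.test(text);
}

async function runWithAutoContinuation(
  runTurn: (prompt: string) => Promise<TurnResult>,
  initialPrompt: string,
): Promise<TurnResult> {
  let turn = await runTurn(initialPrompt);
  for (let i = 0; i < CONTINUATION_BUDGET; i++) {
    // Safety guard, heavily simplified from the PR's 11 conditions:
    // stop on abort, on a turn without tool results, or on completion language.
    if (turn.aborted || !turn.hadToolResults || !hasContinuationIntent(turn.text)) break;
    turn = await runTurn("Continue. Take the next concrete action.");
  }
  return turn;
}
```

The budget cap is what keeps "auto" mode safe: even if the intent heuristic misfires, the runner returns control to the user after at most 5 injected turns.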
Root problems → solutions mapping

| Problem | Root cause | Solution | PR |
| --- | --- | --- | --- |
| GPT-5.4 doesn't follow instructions | OpenAI prompt overlay was passive; missing "commentary-only turns are incomplete" | Imperative prompt rewrite; tool_call_style section: "Call tools directly without narrating" | #65257 |
| Stops after one step to ask permission | Planning-only detector exempted turns with any non-plan tool call | isSingleActionThenNarrativePattern catches 1 tool call + planning prose | #65257 |
Scorecard

The remaining 0.75 points short of a perfect 10 come down to GPT-5.4's own RLHF training behavior, which can't be fixed via system prompt or runtime changes. The auto-continuation loop is the ceiling of what OpenClaw can do at the runtime layer.
Merge order
All three PRs are independent — merge in any order. For best reviewer experience: