Context
User chatter on X drew enough attention to warrant investigating parity differences between Hermes Agent (open-source, by Nous Research) and OpenClaw on GPT 5.4 agentic performance. Users reported that GPT 5.4 agents on Hermes "just work" — tools get called, tasks get completed, the agent has personality — while OpenClaw GPT 5.4 agents stall in plan mode, answer from training data instead of calling tools, and feel flat.
A full deep dive into both codebases compared every layer: harness architecture, system prompts, tool registration, model-specific adapters, planning infrastructure, error recovery, and prompt engineering.
What OpenClaw Already Has (Corrected Assessment)
The initial hypothesis that OpenClaw lacks GPT-5 infrastructure was wrong. OpenClaw already ships substantial GPT-5.4 support:
| Feature | Status | File |
|---------|--------|------|
| Developer role for system prompt | ✅ | src/agents/openai-transport-stream.ts |
| Strict-agentic execution contract (auto-enabled) | ✅ | src/agents/execution-contract.ts |
| Planning-only detection + retry (up to 2 retries; sketched below) | ✅ | src/agents/pi-embedded-runner/run/incomplete-turn.ts |
| Ack-execution fast path ("ok do it" → skip recap) | ✅ | Same file |
| GPT-5 execution bias prompt | ✅ | extensions/openai/prompt-overlay.ts |
| GPT-5 output contract | ✅ | Same file |
| GPT-5 tool call style guidance | ✅ | Same file |
| Friendly personality overlay | ✅ | Same file |
| Blocked-exit after repeated plan-only turns | ✅ | STRICT_AGENTIC_BLOCKED_TEXT |
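For illustration, a minimal sketch of how a planning-only retry guard of the kind in incomplete-turn.ts can work. The names here (TurnResult, isPlanOnly, runWithRetryGuard) are hypothetical, not OpenClaw's actual API; only the retry limit and the blocked-exit behavior come from the analysis above.

```ts
// Hypothetical sketch of a plan-only retry guard. Detect turns that produced
// prose but no tool calls, retry with a nudge, and fall back to the
// blocked-exit text after repeated plan-only turns.
interface TurnResult {
  text: string;
  toolCalls: unknown[];
}

const MAX_PLAN_ONLY_RETRIES = 2;

function isPlanOnly(turn: TurnResult): boolean {
  // A turn that narrates a plan but executes nothing.
  return turn.toolCalls.length === 0 && /\b(plan|would|steps?)\b/i.test(turn.text);
}

async function runWithRetryGuard(
  runTurn: (nudge?: string) => Promise<TurnResult>,
  blockedText: string,
): Promise<TurnResult> {
  let turn = await runTurn();
  for (let i = 0; i < MAX_PLAN_ONLY_RETRIES && isPlanOnly(turn); i++) {
    turn = await runTurn("Stop planning. Execute the next step with a real tool call now.");
  }
  // After exhausting retries, surface the blocked-exit message instead of more narration.
  return isPlanOnly(turn) ? { text: blockedText, toolCalls: [] } : turn;
}
```

The key design point is that the guard escalates from a nudge to a hard blocked-exit rather than letting the model keep narrating plans indefinitely.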
The gaps are more surgical than architectural.
Parity Scorecard
┌─────┬────────────────────────────────┬─────────┬──────────┬──────────┬──────────┐
│ # │ Dimension │ Hermes │ OpenClaw │ Gap │ Impact │
├─────┼────────────────────────────────┼─────────┼──────────┼──────────┼──────────┤
│ 1 │ Mandatory tool-use categories │ 10/10 │ 0/10 │ CRITICAL │ ■■■■■■■■ │
│ 2 │ Act-don't-ask guidance │ 10/10 │ 5/10 │ HIGH │ ■■■■■■ │
│ 3 │ Tool retry/persistence │ 10/10 │ 4/10 │ HIGH │ ■■■■■■ │
│ 4 │ Tool-use enforcement │ 10/10 │ 6/10 │ MEDIUM │ ■■■■ │
│ 5 │ Context file injection scan │ 10/10 │ 0/10 │ MEDIUM │ ■■■■ │
│ 6 │ Verification checklist depth │ 10/10 │ 7/10 │ LOW │ ■■ │
│ 7 │ Developer role │ 10/10 │ 10/10 │ PARITY │ │
│ 8 │ Planning-only retry guard │ 10/10 │ 10/10 │ PARITY │ │
│ 9 │ Default personality │ 10/10 │ 10/10 │ PARITY+ │ │
│ 10 │ Error recovery / failover │ 10/10 │ 10/10 │ PARITY │ │
│ 11 │ Context compression │ 10/10 │ 10/10 │ PARITY+ │ │
│ 12 │ Reasoning/thinking support │ 10/10 │ 10/10 │ PARITY │ │
└─────┴────────────────────────────────┴─────────┴──────────┴──────────┴──────────┘
Weighted Overall: 7.2/10
Root Cause Analysis
┌──────────────────────────────────────┐
│ GPT 5.4 Agent Turn Flow │
└──────────────┬───────────────────────┘
│
┌──────────────▼───────────────────────┐
│ User asks: "What time is it?" │
└──────────────┬───────────────────────┘
│
┌─────────────────┴─────────────────┐
│ │
┌────────▼────────┐ ┌─────────▼────────┐
│ Hermes │ │ OpenClaw │
│ │ │ │
│ Prompt says: │ │ Prompt says: │
│ "NEVER answer │ │ "Use a real tool │
│ from memory — │ │ call first when │
│ ALWAYS use a │ │ actionable" │
│ tool" │ │ │
│ + explicit list │ │ (no mandatory │
│ of categories │ │ tool categories) │
└────────┬─────────┘ └─────────┬────────┘
│ │
┌────────▼─────────┐ ┌─────────▼────────┐
│ Calls `date` │ │ Answers from │
│ tool → returns │ │ training data → │
│ live timestamp │ │ stale/wrong │
│ ✅ CORRECT │ │ ❌ HALLUCINATED │
└──────────────────┘ └──────────────────┘
Gap #1: No Mandatory Tool-Use Categories (the highest-impact gap)
Hermes (agent/prompt_builder.py:207-218):
NEVER answer these from memory — ALWAYS use a tool:
- Arithmetic, math, calculations → use terminal
- Hashes, encodings, checksums → use terminal
- Current time, date, timezone → use terminal
- File contents, sizes, line counts → use read_file
- Git history, branches, diffs → use terminal
- Current facts (weather, news, versions) → use web_search
OpenClaw: No equivalent. GPT 5.4 answers factual/computational questions from training data.
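PR 1 (see the sprint table below) would port this list into OpenClaw's GPT-5 overlay. A sketch of what that could look like, assuming a string-based overlay in extensions/openai/prompt-overlay.ts; the constant and function names are hypothetical, and only the file path and category list come from the analysis:

```ts
// Hypothetical addition to the GPT-5 prompt overlay. The category list is
// quoted from Hermes's prompt_builder.py; everything else is illustrative.
const MANDATORY_TOOL_CATEGORIES = `
NEVER answer these from memory — ALWAYS use a tool:
- Arithmetic, math, calculations → use terminal
- Hashes, encodings, checksums → use terminal
- Current time, date, timezone → use terminal
- File contents, sizes, line counts → use read_file
- Git history, branches, diffs → use terminal
- Current facts (weather, news, versions) → use web_search
`.trim();

export function withMandatoryToolCategories(overlay: string): string {
  // Append after the existing execution-bias text so the model sees
  // concrete, non-negotiable tool triggers on every GPT-5 turn.
  return `${overlay}\n\n${MANDATORY_TOOL_CATEGORIES}`;
}
```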
Gap #2: Weak Act-Don't-Ask
Hermes provides concrete examples:
- 'Is port 443 open?' → check THIS machine (don't ask 'open where?')
- 'What OS am I running?' → check the live system
- 'What time is it?' → run `date` (don't guess)
OpenClaw only says: "Do prerequisite lookup or discovery before dependent actions." — too vague for GPT-5.
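The fix is to swap the vague directive for Hermes-style concrete examples. A sketch under the same assumptions as above (the constant name is made up; the example lines mirror the Hermes prompt):

```ts
// Hypothetical act-don't-ask overlay section with concrete examples.
const ACT_DONT_ASK = `
Act on the live environment instead of asking clarifying questions:
- "Is port 443 open?" → check THIS machine (don't ask "open where?")
- "What OS am I running?" → check the live system
- "What time is it?" → run \`date\` (don't guess)
`.trim();
```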
Gap #3: No Tool Retry Guidance
Hermes: "If a tool returns empty or partial results, retry with a different query or strategy before giving up."
OpenClaw: No retry-on-failure directive. GPT 5.4 surrenders on first tool failure.
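The Hermes directive is prompt-level, but the same persistence can also be reinforced harness-side. A minimal sketch, with all names hypothetical, of a wrapper that retries a tool once with a broadened query before giving up:

```ts
// Hypothetical harness-side reinforcement of the retry directive: if a tool
// returns an empty result, retry with an alternate strategy before surrendering.
type ToolFn = (query: string) => Promise<string>;

async function callWithRetry(
  tool: ToolFn,
  query: string,
  broadenQuery: (q: string) => string,
): Promise<string> {
  const first = await tool(query);
  if (first.trim().length > 0) return first;
  // Empty result: vary the query rather than giving up on the first failure.
  return tool(broadenQuery(query));
}
```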
v3 Sprint PRs
┌─────────────────────────────────────────────────────────────────┐
│ v3 Sprint Merge Order │
│ │
│ PR 1 ──► PR 2 ──► PR 3 ──► PR 4 ──► PR 5 ──► PR 6 │
│ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ │
│ │ P0 │ │ P0 │ │ P1 │ │ P2 │ │ P2 │ │ P2 │ │
│ │CRIT│ │HIGH│ │ MED│ │ SEC│ │ LOW│ │ LOW│ │
│ └────┘ └────┘ └────┘ └────┘ └────┘ └────┘ │
│ ~35% ~25% arch security polish gemini │
│ gap gap cleanup harden support │
│ close close │
│ │
│ ◄─── PRs 1+2 close ~60% of remaining gap ───► │
└─────────────────────────────────────────────────────────────────┘
| PR | Title | Priority | Impact |
|----|-------|----------|--------|
| 1 | Add mandatory tool-use categories for GPT-5 | P0 | ~35% gap close |
| 2 | Strengthen execution bias: act-don't-ask + tool retry | P0 | ~25% gap close |
| 3 | Add tool_enforcement as first-class prompt section | P1 | Architecture |
| 4 | Context file prompt injection scanning | P2 | Security |
| 5 | Enhanced verification checklist | P2 | Quality |
| 6 | Google Gemini execution guidance | P2 | Gemini support |
Verification Plan
After merging PRs 1-2, re-test GPT 5.4 with tool-forcing prompts drawn from the gap examples above.
Target: plan-only turn rate drops below 5%.
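A sketch of how that target could be checked automatically. The prompt set reuses the gap examples as stand-ins, and runAgentTurn is a placeholder for the real harness entry point:

```ts
// Hypothetical check for the <5% plan-only target. All names are illustrative.
const TEST_PROMPTS = [
  "What time is it?",
  "Is port 443 open?",
  "What OS am I running?",
];

async function measurePlanOnlyRate(
  runAgentTurn: (prompt: string) => Promise<{ toolCalls: unknown[] }>,
): Promise<number> {
  let planOnly = 0;
  for (const prompt of TEST_PROMPTS) {
    const turn = await runAgentTurn(prompt);
    if (turn.toolCalls.length === 0) planOnly++;
  }
  return planOnly / TEST_PROMPTS.length; // pass if below 0.05
}
```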
References
- agent/prompt_builder.py (lines 196-276)
- extensions/openai/prompt-overlay.ts