
GPT 5.4 Enhancement v3: Hermes Parity Sprint — Prompt-Level Tool Enforcement & Execution Discipline #66345

@100yenadmin

Description

Context

User chatter on X drew enough attention to warrant investigating parity differences between Hermes Agent (open-source, by Nous Research) and OpenClaw for GPT 5.4 agentic performance. Users reported that GPT 5.4 agents on Hermes "just work" — tools get called, tasks get completed, the agent has personality — while OpenClaw GPT 5.4 agents stall in plan mode, answer from training data instead of calling tools, and feel flat.

A full deep-dive into both codebases was conducted. The analysis compared every layer: harness architecture, system prompts, tool registration, model-specific adapters, planning infrastructure, error recovery, and prompt engineering.

What OpenClaw Already Has (Corrected Assessment)

The initial hypothesis that OpenClaw lacks GPT-5 infrastructure is wrong. OpenClaw has substantial GPT-5.4 support already:

| Feature | File |
| --- | --- |
| Developer role for system prompt | `src/agents/openai-transport-stream.ts` |
| Strict-agentic execution contract (auto-enabled) | `src/agents/execution-contract.ts` |
| Planning-only detection + retry (up to 2 retries) | `src/agents/pi-embedded-runner/run/incomplete-turn.ts` |
| Ack-execution fast path ("ok do it" → skip recap) | Same file |
| GPT-5 execution bias prompt | `extensions/openai/prompt-overlay.ts` |
| GPT-5 output contract | Same file |
| GPT-5 tool call style guidance | Same file |
| Friendly personality overlay | Same file |
| Blocked-exit after repeated plan-only turns | `STRICT_AGENTIC_BLOCKED_TEXT` |

The gaps are more surgical than architectural.

Parity Scorecard

┌─────┬────────────────────────────────┬─────────┬──────────┬──────────┬──────────┐
│  #  │ Dimension                      │ Hermes  │ OpenClaw │   Gap    │  Impact  │
├─────┼────────────────────────────────┼─────────┼──────────┼──────────┼──────────┤
│  1  │ Mandatory tool-use categories  │  10/10  │   0/10   │ CRITICAL │ ■■■■■■■■ │
│  2  │ Act-don't-ask guidance         │  10/10  │   5/10   │ HIGH     │ ■■■■■■   │
│  3  │ Tool retry/persistence         │  10/10  │   4/10   │ HIGH     │ ■■■■■■   │
│  4  │ Tool-use enforcement           │  10/10  │   6/10   │ MEDIUM   │ ■■■■     │
│  5  │ Context file injection scan    │  10/10  │   0/10   │ MEDIUM   │ ■■■■     │
│  6  │ Verification checklist depth   │  10/10  │   7/10   │ LOW      │ ■■       │
│  7  │ Developer role                 │  10/10  │  10/10   │ PARITY   │          │
│  8  │ Planning-only retry guard      │  10/10  │  10/10   │ PARITY   │          │
│  9  │ Default personality            │  10/10  │  10/10   │ PARITY+  │          │
│ 10  │ Error recovery / failover      │  10/10  │  10/10   │ PARITY   │          │
│ 11  │ Context compression            │  10/10  │  10/10   │ PARITY+  │          │
│ 12  │ Reasoning/thinking support     │  10/10  │  10/10   │ PARITY   │          │
└─────┴────────────────────────────────┴─────────┴──────────┴──────────┴──────────┘

Weighted Overall: 7.2/10

Root Cause Analysis

                    ┌──────────────────────────────────────┐
                    │   GPT 5.4 Agent Turn Flow            │
                    └──────────────┬───────────────────────┘
                                   │
                    ┌──────────────▼───────────────────────┐
                    │  User asks: "What time is it?"       │
                    └──────────────┬───────────────────────┘
                                   │
                 ┌─────────────────┴─────────────────┐
                 │                                   │
        ┌────────▼─────────┐               ┌─────────▼────────┐
        │      Hermes      │               │     OpenClaw     │
        │                  │               │                  │
        │ Prompt says:     │               │ Prompt says:     │
        │ "NEVER answer    │               │ "Use a real tool │
        │  from memory —   │               │  call first when │
        │  ALWAYS use a    │               │  actionable"     │
        │  tool"           │               │                  │
        │ + explicit list  │               │ (no mandatory    │
        │   of categories  │               │  tool categories)│
        └────────┬─────────┘               └─────────┬────────┘
                 │                                   │
        ┌────────▼─────────┐               ┌─────────▼────────┐
        │ Calls `date`     │               │ Answers from     │
        │ tool → returns   │               │ training data →  │
        │ live timestamp   │               │ stale/wrong      │
        │ ✅ CORRECT       │               │ ❌ HALLUCINATED  │
        └──────────────────┘               └──────────────────┘

Gap #1: No Mandatory Tool-Use Categories (THE #1 issue)

Hermes (agent/prompt_builder.py:207-218):

```
NEVER answer these from memory — ALWAYS use a tool:
- Arithmetic, math, calculations → use terminal
- Hashes, encodings, checksums → use terminal
- Current time, date, timezone → use terminal
- File contents, sizes, line counts → use read_file
- Git history, branches, diffs → use terminal
- Current facts (weather, news, versions) → use web_search
```

OpenClaw: No equivalent. GPT 5.4 answers factual/computational questions from training data.
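Closing this gap is a prompt-overlay change. A minimal sketch of how the categories could be appended to the GPT-5 overlay — the function and constant names here are hypothetical, not the actual `prompt-overlay.ts` API:

```typescript
// Illustrative only: hypothetical names, not OpenClaw's real overlay API.
// The category list mirrors Hermes' prompt_builder.py block quoted above.
const MANDATORY_TOOL_CATEGORIES: readonly string[] = [
  "Arithmetic, math, calculations → use terminal",
  "Hashes, encodings, checksums → use terminal",
  "Current time, date, timezone → use terminal",
  "File contents, sizes, line counts → use read_file",
  "Git history, branches, diffs → use terminal",
  "Current facts (weather, news, versions) → use web_search",
];

export function withMandatoryToolCategories(basePrompt: string): string {
  const section = [
    "NEVER answer these from memory — ALWAYS use a tool:",
    ...MANDATORY_TOOL_CATEGORIES.map((c) => `- ${c}`),
  ].join("\n");
  return `${basePrompt}\n\n${section}`;
}
```

Appending at the end of the overlay (rather than mid-prompt) keeps the directive close to the model's final instructions, which is where Hermes places it.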

Gap #2: Weak Act-Don't-Ask

Hermes provides concrete examples:

- 'Is port 443 open?' → check THIS machine (don't ask 'open where?')
- 'What OS am I running?' → check the live system
- 'What time is it?' → run `date` (don't guess)

OpenClaw only says: "Do prerequisite lookup or discovery before dependent actions." — too vague for GPT-5.
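The vague directive could be replaced with example-driven guidance. A sketch of rendering concrete act-don't-ask pairs into the prompt — the constant and function names are assumptions, not existing OpenClaw code:

```typescript
// Illustrative: concrete act-don't-ask examples rendered as a prompt
// section. Pairs mirror the Hermes examples quoted above.
const ACT_DONT_ASK_EXAMPLES: ReadonlyArray<readonly [string, string]> = [
  ["Is port 443 open?", "check THIS machine (don't ask 'open where?')"],
  ["What OS am I running?", "check the live system"],
  ["What time is it?", "run `date` (don't guess)"],
];

export function renderActDontAsk(): string {
  const lines = ACT_DONT_ASK_EXAMPLES.map(([ask, act]) => `- '${ask}' → ${act}`);
  return [
    "Resolve ambiguity by acting on the live environment:",
    ...lines,
  ].join("\n");
}
```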

Gap #3: No Tool Retry Guidance

Hermes: "If a tool returns empty or partial results, retry with a different query or strategy before giving up."

OpenClaw: No retry-on-failure directive. GPT 5.4 surrenders on first tool failure.
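Beyond the prompt directive, the same persistence can be enforced mechanically on the harness side. A minimal sketch — the wrapper and its signature are assumptions, not existing OpenClaw code:

```typescript
// Illustrative retry wrapper (hypothetical API): if a tool call yields an
// empty result, retry once with an alternate query before giving up.
type ToolFn = (query: string) => Promise<string>;

export async function callWithRetry(
  tool: ToolFn,
  query: string,
  fallbackQuery?: string,
): Promise<string> {
  const first = await tool(query);
  if (first.trim().length > 0) return first;
  if (fallbackQuery !== undefined) {
    const second = await tool(fallbackQuery);
    if (second.trim().length > 0) return second;
  }
  return ""; // still empty → caller reports failure instead of silently stopping
}
```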

v3 Sprint PRs

┌─────────────────────────────────────────────────────────────────┐
│                   v3 Sprint Merge Order                         │
│                                                                 │
│  PR 1 ──► PR 2 ──► PR 3 ──► PR 4 ──► PR 5 ──► PR 6           │
│  ┌────┐   ┌────┐   ┌────┐   ┌────┐   ┌────┐   ┌────┐         │
│  │ P0 │   │ P0 │   │ P1 │   │ P2 │   │ P2 │   │ P2 │         │
│  │CRIT│   │HIGH│   │ MED│   │ SEC│   │ LOW│   │ LOW│         │
│  └────┘   └────┘   └────┘   └────┘   └────┘   └────┘         │
│  ~35%     ~25%      arch     security  polish   gemini         │
│  gap      gap       cleanup  harden            support         │
│  close    close                                                 │
│                                                                 │
│  ◄─── PRs 1+2 close ~60% of remaining gap ───►                │
└─────────────────────────────────────────────────────────────────┘
| PR | Title | Priority | Impact |
| --- | --- | --- | --- |
| 1 | Add mandatory tool-use categories for GPT-5 | P0 | ~35% gap close |
| 2 | Strengthen execution bias: act-don't-ask + tool retry | P0 | ~25% gap close |
| 3 | Add tool_enforcement as first-class prompt section | P1 | Architecture |
| 4 | Context file prompt injection scanning | P2 | Security |
| 5 | Enhanced verification checklist | P2 | Quality |
| 6 | Google Gemini execution guidance | P2 | Gemini support |
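For PR 3, promoting tool enforcement to a first-class prompt section could look like the following sketch, assuming a simple ordered-section builder. All names are illustrative, not OpenClaw's actual architecture:

```typescript
// Illustrative: tool_enforcement as a first-class, ordered prompt section
// rather than overlay text. None of these names come from the codebase.
interface PromptSection {
  id: string;
  render: () => string;
}

const sections: PromptSection[] = [
  { id: "identity", render: () => "You are a coding agent." },
  {
    id: "tool_enforcement",
    render: () =>
      "NEVER answer these from memory — ALWAYS use a tool:\n" +
      "- Current time, date, timezone → use terminal",
  },
  { id: "verification", render: () => "Verify results before reporting done." },
];

export function buildSystemPrompt(ss: PromptSection[]): string {
  return ss.map((s) => s.render()).join("\n\n");
}
```

A first-class section gives the enforcement text a stable position and an id that model-specific adapters can override, instead of string-concatenating it per model.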

Verification Plan

After merging PRs 1-2, test GPT 5.4 with these prompts:

  • "What time is it?" → should call terminal, not answer from training data
  • "Is port 8080 open?" → should check, not ask "where?"
  • "What's 2^64?" → should use tool, not estimate
  • Multi-step: "Read package.json and update the version" → should not stall

Target: plan-only turn rate drops below 5%.
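The target can also be checked mechanically over eval transcripts. A sketch with an assumed per-turn record shape (real eval output will differ):

```typescript
// Hypothetical transcript record shape for computing the plan-only rate.
interface TurnRecord {
  planOnly: boolean; // turn produced a plan but no tool call or action
}

export function planOnlyRate(turns: TurnRecord[]): number {
  if (turns.length === 0) return 0;
  return turns.filter((t) => t.planOnly).length / turns.length;
}

// Target from this issue: plan-only turn rate below 5%.
export function meetsTarget(turns: TurnRecord[], threshold = 0.05): boolean {
  return planOnlyRate(turns) < threshold;
}
```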
