
GPT 5.4 Enhancement v3: Hermes Parity Sprint — Prompt-Level Tool Enforcement & Execution Discipline #66345

@100yenadmin

Description

Context

User chatter on X drew enough attention to warrant investigating parity differences between Hermes Agent (open-source, by Nous Research) and OpenClaw for GPT 5.4 agentic performance. Users reported that GPT 5.4 agents on Hermes "just work" — tools get called, tasks get completed, the agent has personality — while OpenClaw GPT 5.4 agents stall in plan mode, answer from training data instead of calling tools, and feel flat.

A full deep-dive into both codebases was conducted. The analysis compared every layer: harness architecture, system prompts, tool registration, model-specific adapters, planning infrastructure, error recovery, and prompt engineering.

What OpenClaw Already Has (Corrected Assessment)

The initial hypothesis that OpenClaw lacks GPT-5 infrastructure is wrong. OpenClaw has substantial GPT-5.4 support already:

| Feature | File |
| --- | --- |
| Developer role for system prompt | `src/agents/openai-transport-stream.ts` |
| Strict-agentic execution contract (auto-enabled) | `src/agents/execution-contract.ts` |
| Planning-only detection + retry (up to 2 retries) | `src/agents/pi-embedded-runner/run/incomplete-turn.ts` |
| Ack-execution fast path ("ok do it" → skip recap) | Same file |
| GPT-5 execution bias prompt | `extensions/openai/prompt-overlay.ts` |
| GPT-5 output contract | Same file |
| GPT-5 tool call style guidance | Same file |
| Friendly personality overlay | Same file |
| Blocked-exit after repeated plan-only turns | `STRICT_AGENTIC_BLOCKED_TEXT` |

The gaps are more surgical than architectural.

Parity Scorecard

┌─────┬────────────────────────────────┬─────────┬──────────┬──────────┬──────────┐
│  #  │ Dimension                      │ Hermes  │ OpenClaw │   Gap    │  Impact  │
├─────┼────────────────────────────────┼─────────┼──────────┼──────────┼──────────┤
│  1  │ Mandatory tool-use categories  │  10/10  │   0/10   │ CRITICAL │ ■■■■■■■■ │
│  2  │ Act-don't-ask guidance         │  10/10  │   5/10   │ HIGH     │ ■■■■■■   │
│  3  │ Tool retry/persistence         │  10/10  │   4/10   │ HIGH     │ ■■■■■■   │
│  4  │ Tool-use enforcement           │  10/10  │   6/10   │ MEDIUM   │ ■■■■     │
│  5  │ Context file injection scan    │  10/10  │   0/10   │ MEDIUM   │ ■■■■     │
│  6  │ Verification checklist depth   │  10/10  │   7/10   │ LOW      │ ■■       │
│  7  │ Developer role                 │  10/10  │  10/10   │ PARITY   │          │
│  8  │ Planning-only retry guard      │  10/10  │  10/10   │ PARITY   │          │
│  9  │ Default personality            │  10/10  │  10/10   │ PARITY+  │          │
│ 10  │ Error recovery / failover      │  10/10  │  10/10   │ PARITY   │          │
│ 11  │ Context compression            │  10/10  │  10/10   │ PARITY+  │          │
│ 12  │ Reasoning/thinking support     │  10/10  │  10/10   │ PARITY   │          │
└─────┴────────────────────────────────┴─────────┴──────────┴──────────┴──────────┘

Weighted Overall: 7.2/10

Root Cause Analysis

                    ┌──────────────────────────────────────┐
                    │   GPT 5.4 Agent Turn Flow            │
                    └──────────────┬───────────────────────┘
                                   │
                    ┌──────────────▼───────────────────────┐
                    │  User asks: "What time is it?"       │
                    └──────────────┬───────────────────────┘
                                   │
                 ┌─────────────────┴─────────────────┐
                 │                                   │
        ┌────────▼─────────┐               ┌─────────▼────────┐
        │      Hermes      │               │     OpenClaw     │
        │                  │               │                  │
        │ Prompt says:     │               │ Prompt says:     │
        │ "NEVER answer    │               │ "Use a real tool │
        │  from memory —   │               │  call first when │
        │  ALWAYS use a    │               │  actionable"     │
        │  tool"           │               │                  │
        │ + explicit list  │               │ (no mandatory    │
        │   of categories  │               │  tool categories)│
        └────────┬─────────┘               └─────────┬────────┘
                 │                                   │
        ┌────────▼─────────┐               ┌─────────▼────────┐
        │ Calls `date`     │               │ Answers from     │
        │ tool → returns   │               │ training data →  │
        │ live timestamp   │               │ stale/wrong      │
        │ ✅ CORRECT       │               │ ❌ HALLUCINATED  │
        └──────────────────┘               └──────────────────┘

Gap #1: No Mandatory Tool-Use Categories (THE #1 issue)

Hermes (agent/prompt_builder.py:207-218):

```
NEVER answer these from memory — ALWAYS use a tool:
- Arithmetic, math, calculations → use terminal
- Hashes, encodings, checksums → use terminal
- Current time, date, timezone → use terminal
- File contents, sizes, line counts → use read_file
- Git history, branches, diffs → use terminal
- Current facts (weather, news, versions) → use web_search
```

OpenClaw: No equivalent. GPT 5.4 answers factual/computational questions from training data.
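Closing this gap is a prompt-overlay change. A minimal sketch of how the categories could be appended to the GPT-5 overlay — the function and constant names here are hypothetical, not the actual `prompt-overlay.ts` API:

```typescript
// Illustrative only: hypothetical names, not OpenClaw's real overlay API.
// The category list mirrors Hermes' prompt_builder.py block quoted above.
const MANDATORY_TOOL_CATEGORIES: readonly string[] = [
  "Arithmetic, math, calculations → use terminal",
  "Hashes, encodings, checksums → use terminal",
  "Current time, date, timezone → use terminal",
  "File contents, sizes, line counts → use read_file",
  "Git history, branches, diffs → use terminal",
  "Current facts (weather, news, versions) → use web_search",
];

export function withMandatoryToolCategories(basePrompt: string): string {
  const section = [
    "NEVER answer these from memory — ALWAYS use a tool:",
    ...MANDATORY_TOOL_CATEGORIES.map((c) => `- ${c}`),
  ].join("\n");
  return `${basePrompt}\n\n${section}`;
}
```

Appending at the end of the overlay (rather than mid-prompt) keeps the directive close to the model's final instructions, which is where Hermes places it.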

Gap #2: Weak Act-Don't-Ask

Hermes provides concrete examples:

- 'Is port 443 open?' → check THIS machine (don't ask 'open where?')
- 'What OS am I running?' → check the live system
- 'What time is it?' → run `date` (don't guess)

OpenClaw only says: "Do prerequisite lookup or discovery before dependent actions." — too vague for GPT-5.
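The vague directive could be replaced with example-driven guidance. A sketch of rendering concrete act-don't-ask pairs into the prompt — the constant and function names are assumptions, not existing OpenClaw code:

```typescript
// Illustrative: concrete act-don't-ask examples rendered as a prompt
// section. Pairs mirror the Hermes examples quoted above.
const ACT_DONT_ASK_EXAMPLES: ReadonlyArray<readonly [string, string]> = [
  ["Is port 443 open?", "check THIS machine (don't ask 'open where?')"],
  ["What OS am I running?", "check the live system"],
  ["What time is it?", "run `date` (don't guess)"],
];

export function renderActDontAsk(): string {
  const lines = ACT_DONT_ASK_EXAMPLES.map(([ask, act]) => `- '${ask}' → ${act}`);
  return [
    "Resolve ambiguity by acting on the live environment:",
    ...lines,
  ].join("\n");
}
```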

Gap #3: No Tool Retry Guidance

Hermes: "If a tool returns empty or partial results, retry with a different query or strategy before giving up."

OpenClaw: No retry-on-failure directive. GPT 5.4 surrenders on first tool failure.
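Beyond the prompt directive, the same persistence can be enforced mechanically on the harness side. A minimal sketch — the wrapper and its signature are assumptions, not existing OpenClaw code:

```typescript
// Illustrative retry wrapper (hypothetical API): if a tool call yields an
// empty result, retry once with an alternate query before giving up.
type ToolFn = (query: string) => Promise<string>;

export async function callWithRetry(
  tool: ToolFn,
  query: string,
  fallbackQuery?: string,
): Promise<string> {
  const first = await tool(query);
  if (first.trim().length > 0) return first;
  if (fallbackQuery !== undefined) {
    const second = await tool(fallbackQuery);
    if (second.trim().length > 0) return second;
  }
  return ""; // still empty → caller reports failure instead of silently stopping
}
```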

v3 Sprint PRs

┌─────────────────────────────────────────────────────────────────┐
│                   v3 Sprint Merge Order                         │
│                                                                 │
│  PR 1 ──► PR 2 ──► PR 3 ──► PR 4 ──► PR 5 ──► PR 6           │
│  ┌────┐   ┌────┐   ┌────┐   ┌────┐   ┌────┐   ┌────┐         │
│  │ P0 │   │ P0 │   │ P1 │   │ P2 │   │ P2 │   │ P2 │         │
│  │CRIT│   │HIGH│   │ MED│   │ SEC│   │ LOW│   │ LOW│         │
│  └────┘   └────┘   └────┘   └────┘   └────┘   └────┘         │
│  ~35%     ~25%      arch     security  polish   gemini         │
│  gap      gap       cleanup  harden            support         │
│  close    close                                                 │
│                                                                 │
│  ◄─── PRs 1+2 close ~60% of remaining gap ───►                │
└─────────────────────────────────────────────────────────────────┘
| PR | Title | Priority | Impact |
| --- | --- | --- | --- |
| 1 | Add mandatory tool-use categories for GPT-5 | P0 | ~35% gap close |
| 2 | Strengthen execution bias: act-don't-ask + tool retry | P0 | ~25% gap close |
| 3 | Add tool_enforcement as first-class prompt section | P1 | Architecture |
| 4 | Context file prompt injection scanning | P2 | Security |
| 5 | Enhanced verification checklist | P2 | Quality |
| 6 | Google Gemini execution guidance | P2 | Gemini support |
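For PR 3, promoting tool enforcement to a first-class prompt section could look like the following sketch, assuming a simple ordered-section builder. All names are illustrative, not OpenClaw's actual architecture:

```typescript
// Illustrative: tool_enforcement as a first-class, ordered prompt section
// rather than overlay text. None of these names come from the codebase.
interface PromptSection {
  id: string;
  render: () => string;
}

const sections: PromptSection[] = [
  { id: "identity", render: () => "You are a coding agent." },
  {
    id: "tool_enforcement",
    render: () =>
      "NEVER answer these from memory — ALWAYS use a tool:\n" +
      "- Current time, date, timezone → use terminal",
  },
  { id: "verification", render: () => "Verify results before reporting done." },
];

export function buildSystemPrompt(ss: PromptSection[]): string {
  return ss.map((s) => s.render()).join("\n\n");
}
```

A first-class section gives the enforcement text a stable position and an id that model-specific adapters can override, instead of string-concatenating it per model.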

Verification Plan

After merging PRs 1-2, test GPT 5.4 with these prompts:

  • "What time is it?" → should call terminal, not answer from training data
  • "Is port 8080 open?" → should check, not ask "where?"
  • "What's 2^64?" → should use tool, not estimate
  • Multi-step: "Read package.json and update the version" → should not stall

Target: plan-only turn rate drops below 5%.
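The target can also be checked mechanically over eval transcripts. A sketch with an assumed per-turn record shape (real eval output will differ):

```typescript
// Hypothetical transcript record shape for computing the plan-only rate.
interface TurnRecord {
  planOnly: boolean; // turn produced a plan but no tool call or action
}

export function planOnlyRate(turns: TurnRecord[]): number {
  if (turns.length === 0) return 0;
  return turns.filter((t) => t.planOnly).length / turns.length;
}

// Target from this issue: plan-only turn rate below 5%.
export function meetsTarget(turns: TurnRecord[], threshold = 0.05): boolean {
  return planOnlyRate(turns) < threshold;
}
```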
