Research: default Hermes system prompt overwhelms tool-emission on mid-tier coding models #89

Long-form research issue, **not a fix-now item**. Stack itself is fully wired post-#77/#79/#81/#83/#85/#87 — this is a model-quality boundary discovered while debugging the residual tool-paralysis symptom.

## What we observed
With the full post-cascade stack deployed, certain models (`coding-groq`, `coding-gpt54`) still emit empty assistant responses when called through Hermes with a tool attached. Symptom: the model produces no `tool_call` and no text, just an empty completion.
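To make the symptom checkable in a script, a completion can be sorted into three buckets. This is a minimal sketch, assuming the provider returns an OpenAI-style chat completion body (`choices[0].message` with optional `content` and `tool_calls`); the function name `classify_completion` is hypothetical and not part of Hermes.

```python
from typing import Any


def classify_completion(response: dict[str, Any]) -> str:
    """Label an OpenAI-style chat completion as 'tool_call', 'text', or 'empty'."""
    choices = response.get("choices") or []
    if not choices:
        return "empty"
    message = choices[0].get("message") or {}
    # A non-empty tool_calls list means the model emitted a tool call.
    if message.get("tool_calls"):
        return "tool_call"
    # content may be None or whitespace when the model returns nothing useful.
    content = message.get("content") or ""
    if content.strip():
        return "text"
    return "empty"
```

The failure described here is the `"empty"` bucket: neither `tool_calls` nor non-blank `content`.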

## What we ruled out
- **Not preamble bloat (G1):** reproduces with NO `HERMES_HOME` profile → no vertical-preamble injection → no grafted-context index → empty preamble path.
- **Not tool-count saturation:** reproduces with **a single tool attached** (well under the 52-tool ceiling we'd previously hypothesized).
- **Not MCP wiring:** reproduces with just a built-in tool — autowire / mcp_serve / G2-G4 paths aren't in the loop for this repro.
- **Not auth / transport:** the same provider, model, and tool produce a correct `tool_call` when called directly over the raw API with curl.

## The actual variable
**The system prompt.** Replacing Hermes' default system prompt (SOUL.md + identity block + capabilities block + behavior guidance + tool surface description, several thousand tokens at boot) with a minimal `"You are Hermes Agent"` produces correct `tool_call` emission, with the same model, tool, provider, and request shape otherwise.

So the default Hermes system prompt is the variable that suppresses tool-call emission on these models. The models have the *capacity* (the raw API call works); they just fail to get from Hermes' prompt structure to a tool-call output.
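The comparison above only holds if the two requests differ in nothing but the system prompt. The following sketch builds such a pair. It assumes the OpenAI-style chat-completions payload shape; `build_ab_payloads`, `ECHO_TOOL`, the user text, and the placeholder default prompt are all hypothetical, not taken from Hermes.

```python
import copy
from typing import Any

MINIMAL_PROMPT = "You are Hermes Agent"  # the minimal prompt quoted in this issue


def build_ab_payloads(
    model: str,
    default_prompt: str,
    user_text: str,
    tools: list[dict[str, Any]],
) -> tuple[dict[str, Any], dict[str, Any]]:
    """Return (default, minimal) payloads that differ only in the system prompt."""
    default_payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": default_prompt},
            {"role": "user", "content": user_text},
        ],
        "tools": tools,
    }
    # Deep copy so editing the minimal variant cannot leak into the default one.
    minimal_payload = copy.deepcopy(default_payload)
    minimal_payload["messages"][0]["content"] = MINIMAL_PROMPT
    return default_payload, minimal_payload


# Hypothetical single tool in OpenAI function-calling format.
ECHO_TOOL = {
    "type": "function",
    "function": {
        "name": "echo",
        "description": "Return the given text unchanged.",
        "parameters": {
            "type": "object",
            "properties": {"text": {"type": "string"}},
            "required": ["text"],
        },
    },
}

default_payload, minimal_payload = build_ab_payloads(
    model="coding-groq",
    default_prompt="<full default Hermes system prompt goes here>",
    user_text="Call the echo tool with the text 'ping'.",
    tools=[ECHO_TOOL],
)
```

Sending both payloads to the same endpoint and comparing the outcomes isolates the system prompt as the only changed input.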

## Why this isn't a fix-now
1. The fusion stack is wired + verified. Workers can spawn, plugins load, MCP discovery works, HERMES_TOOLS_SUBSET narrows (built-in + MCP) — devagentic#203 G1–G4 + autowire are landed.
2. The boundary is model-quality, not engineering: stronger models (Claude/GPT-5-class) don't hit this ceiling with the same prompt.
3. Cutting the prompt is the wrong default — SOUL.md / identity / capabilities serve real purposes for the operator-facing UX.

## Possible directions (research-grade, not committed)

### Direction A — #210 R1/R2 flow-router with per-intent narrowing
The premise of #210 is dynamic per-turn classification of intent → narrowed tool surface + targeted prompt fragment. The same mechanism could narrow the **system-prompt surface** per intent: "this turn is a tool-call invocation" → strip identity/SOUL/behavior guidance for the model call, keep only tool-call-relevant context. Bridges nicely with the existing HERMES_TOOLS_SUBSET hook point (#75/#86 — same place in agent_init.py).
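As a rough illustration of what per-intent prompt narrowing could look like, the sketch below treats the system prompt as named sections and keeps only the ones an intent allows. Everything here is an assumption: the section keys, the intent labels, the `KEEP_BY_INTENT` policy, and `narrow_system_prompt` are hypothetical and do not reflect how `agent_init.py` or #210 are actually structured.

```python
# Section keys follow the prompt parts listed in "The actual variable"; the names are hypothetical.
SECTION_ORDER = ("soul", "identity", "capabilities", "behavior", "tool_surface")

# Hypothetical policy: which sections survive for each classified intent.
KEEP_BY_INTENT = {
    # Strip identity / SOUL / behavior guidance for a tool-call turn.
    "tool_call": {"capabilities", "tool_surface"},
    "conversation": set(SECTION_ORDER),
}


def narrow_system_prompt(sections: dict[str, str], intent: str) -> str:
    """Join only the sections allowed for this intent; unknown intents keep everything."""
    keep = KEEP_BY_INTENT.get(intent, set(SECTION_ORDER))
    parts = [sections[name] for name in SECTION_ORDER if name in keep and name in sections]
    return "\n\n".join(parts)
```

Falling back to the full prompt for unrecognized intents keeps the operator-facing behavior unchanged unless the router is confident the turn is a tool call.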

### Direction B — different upstream model
Confirmed: this is a model-quality boundary. `coding-groq` and `coding-gpt54` hit it; larger-context / instruction-tuned models (Sonnet / GPT-5-class) handle the full prompt without paralysis. Operator-facing deliverable: a tier-classification matrix indicating which models can sustain the full Hermes prompt plus tools and which need narrowing.
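One possible shape for such a matrix is a plain lookup with an explicit "not yet tested" default. This is a sketch only: `PromptTier`, `TIER_MATRIX`, and `prompt_tier` are hypothetical names, the entries mirror only what this issue reports, and `sonnet-class` / `gpt-5-class` are class labels rather than real model IDs.

```python
from enum import Enum


class PromptTier(Enum):
    FULL = "full prompt + tools"
    NARROWED = "needs prompt narrowing"
    UNCLASSIFIED = "not yet tested"


# Entries mirror only the observations in this issue.
TIER_MATRIX = {
    "coding-groq": PromptTier.NARROWED,
    "coding-gpt54": PromptTier.NARROWED,
    "sonnet-class": PromptTier.FULL,   # class label, not a model ID
    "gpt-5-class": PromptTier.FULL,    # class label, not a model ID
}


def prompt_tier(model: str) -> PromptTier:
    """Look up a model's tier; anything untested is reported as UNCLASSIFIED."""
    return TIER_MATRIX.get(model, PromptTier.UNCLASSIFIED)
```

Returning `UNCLASSIFIED` instead of guessing avoids silently treating an untested model as safe for the full prompt.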

### Direction C — prompt-section ablation study
Before committing to A or B, run a structured ablation: what's the smallest subset of hermes' default system prompt that the affected models can sustain with tools? Cross-product (prompt-fragment × model × tool-count) might surface a Pareto frontier — e.g., "strip behavior guidance, keep identity + capabilities" might be enough for tier-2 models.
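The cross-product can be enumerated mechanically before any model is called. The sketch below generates the grid and picks the largest prompt subsets that still pass. The section keys, the tool counts (a single tool and the previously hypothesized 52-tool ceiling), and the function names are assumptions made for illustration.

```python
from itertools import combinations, product

SECTIONS = ("soul", "identity", "capabilities", "behavior", "tool_surface")
MODELS = ("coding-groq", "coding-gpt54")
TOOL_COUNTS = (1, 52)  # assumption: single tool and the previously hypothesized ceiling


def prompt_subsets(sections):
    """Yield every subset of prompt sections, from empty to full."""
    for size in range(len(sections) + 1):
        for combo in combinations(sections, size):
            yield frozenset(combo)


def ablation_grid():
    """Cross-product of (prompt subset, model, tool count) cells to run."""
    return list(product(prompt_subsets(SECTIONS), MODELS, TOOL_COUNTS))


def maximal_passing(passing):
    """Keep passing subsets that are not strictly contained in another passing subset."""
    passing = set(passing)
    return [s for s in passing if not any(s < other for other in passing)]
```

With five sections this yields 32 subsets, so 128 cells for two models and two tool counts. `maximal_passing` would be applied per model and tool count to the subsets that produced a `tool_call`, giving the largest prompts each model can sustain.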

## Reproduction (for whoever picks this up)
```
# Fails (empty response, no tool_call)
HERMES_HOME=/tmp/empty-profile hermes --provider groq --model coding-groq \
  --enable-toolset core --enable-toolset kanban  # narrow surface, default prompt

# Same provider/model/tool, raw API with the minimal system prompt → works
curl https://api.groq.com/openai/v1/chat/completions \
  -H "Authorization: Bearer $GROQ_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"coding-groq","messages":[
    {"role":"system","content":"You are Hermes Agent"},
    {"role":"user","content":"..."}],"tools":[<single tool>]}'
```

## Severity
**Research / backlog.** Not blocking poly-explorer end-to-end with appropriate model choice. Worth opening for:
- Future #210 R1/R2 design — prompt-narrowing should be in scope alongside tool-narrowing
- Operator documentation — "models known to hit the prompt-saturation ceiling"
- A future ablation if/when a maintainer has bandwidth

## Closes the cascade story
This is the **boundary** at the end of the post-#67 cascade:
- #77 (yaml packaging) → #79 (register-ctx) → #81 (mcp_serve packaging) → #83 (autowire TUI path) → #85 (autowire CLI path) → #87 (HERMES_TOOLS_SUBSET extended to MCP) — all merged + deployed.
- Stack is fully wired. What remains is the model-quality boundary documented here.
