Summary
In a long-running gateway process (hermes gateway run, running for ~7+ hours), webhook routes configured with skills: start producing a stub, [Failed to load skill: <name>] (42 characters for the skill name multi-model-review), that silently overwrites the user's webhook payload. Reviewer agents receive the stub instead of the real prompt, lose all context, and either give up or fabricate responses from memory. The gateway logs no warning. Restarting the gateway cures the issue immediately.
Reproduction
- Configure a webhook route in ~/.hermes/profiles/<profile>/config.yaml with skills: [<any-skill>] and prompt: '{message}'
- Start the gateway: hermes gateway run --profile <profile> (or via systemd)
- Run for 24+ hours (possibly less; ~7h was the empirical onset)
- POST a webhook → observe that the inbound message becomes [Failed to load skill: <name>] instead of the user's payload
- Restart the gateway → the next POST succeeds, the real payload arrives, and the skill loads normally
Source-code path (Hermes-agent v0.11.0)
gateway/platforms/webhook.py:401-405 calls build_skill_invocation_message(cmd_key, user_instruction=prompt) synchronously inside the aiohttp request handler.
agent/skill_commands.py:425-427 returns the truthy stub string '[Failed to load skill: <name>]' when _load_skill_payload returns None (exactly 42 chars for the skill name multi-model-review: 24 fixed characters plus the 18-character name).
webhook.py:404-405 checks if skill_content:, which accepts the stub as success, so prompt = skill_content overwrites the user's payload.
agent/skill_commands.py:147-148 (_load_skill_payload) has a bare except Exception: return None that swallows all exceptions, with NO log line.
agent/skill_commands.py:150-151 (_load_skill_payload) has if not loaded_skill.get("success"): return None, also with NO log line.
The outer webhook.py:412 except Exception as e: logger.warning("[webhook] Skill loading failed: %s", e) would fire if build_skill_invocation_message raised — it never does (verified via grep across all gateway log files: zero Skill loading failed warnings).
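The path described above can be reduced to a few lines. The following is a minimal sketch, not the Hermes source: the function bodies are stand-ins written for illustration, the loader is hard-coded to fail, and the way a successful payload is combined with the user instruction is an assumption. It only shows why a non-empty stub passes the if skill_content: check and replaces the payload.

```python
from typing import Optional


def load_skill_payload(name: str) -> Optional[str]:
    """Stand-in for _load_skill_payload; simulates the silent failure path."""
    return None


def build_skill_invocation_message(cmd_key: str, user_instruction: str) -> str:
    """Stand-in for the real builder; only the stub behaviour follows the report."""
    payload = load_skill_payload(cmd_key)
    if payload is None:
        # Non-empty string, so it is truthy for the caller.
        return f"[Failed to load skill: {cmd_key}]"
    # Assumption: how the real code merges payload and instruction is not shown here.
    return f"{payload}\n\n{user_instruction}"


def handle_webhook(prompt: str, skill: str) -> str:
    """Mirrors the 'if skill_content: prompt = skill_content' step from the report."""
    skill_content = build_skill_invocation_message(skill, user_instruction=prompt)
    if skill_content:
        prompt = skill_content
    return prompt


delivered = handle_webhook("real webhook payload " * 1000, "multi-model-review")
print(delivered)       # [Failed to load skill: multi-model-review]
print(len(delivered))  # 42
```

The user's payload never reaches the agent in this sketch, and nothing raises, which matches the absence of "Skill loading failed" warnings in the logs.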
Empirical evidence
Lockstep failure across 3 independent reviewer profiles
| Timestamp UTC | Reviewer1 | Reviewer2 | Reviewer3 |
|--------------------------------|------------|------------|------------|
| 2026-04-28 09:35:36.713 | len=42 | len=42 | len=42 |
| 2026-04-28 10:23:10.451 | len=42 | len=42 | len=42 |
| 2026-04-28 11:36:30.558 | len=42 | len=42 | len=42 |
| 2026-04-29 01:04:34.315 | len=42 | len=42 | len=42 |
| 2026-04-29 03:26:49.629 | len=42 | len=42 | len=42 |
| 2026-04-29 04:13:37.353 | len=42 | len=42 | len=42 |
Identical millisecond timestamps + 6/6 identical events across 3 separate Python processes — kills hypotheses involving per-process state divergence (cache races, SQLite contention, profile session resume).
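The len=42 value in the table can be checked against the stub format directly. This is a small arithmetic sketch; the constant and function names are illustrative and not taken from the Hermes source.

```python
# Stub format as quoted in this report; the skill name is the only variable part.
STUB_TEMPLATE = "[Failed to load skill: {name}]"


def stub_length(skill_name: str) -> int:
    """Length of the failure stub for a given skill name."""
    return len(STUB_TEMPLATE.format(name=skill_name))


fixed_chars = stub_length("")                  # characters that do not depend on the name
observed = stub_length("multi-model-review")   # the skill used by all three reviewers

print(fixed_chars)  # 24
print(observed)     # 42
```

A prompt length of exactly 24 plus the skill-name length is therefore a cheap signature to grep for when other skill names are involved.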
Restart-recovery correlation
- 2026-04-28 01:55:17: reviewer1 last successful delivery (prompt_len=21493), uptime ~25h since prior restart
- 2026-04-28 09:35:36: reviewer1 first failure (~7.5h additional uptime later)
- All subsequent webhooks fail until...
- 2026-04-29 04:25:18: reviewer1 SIGTERM/restart (manual, via systemctl --user restart hermes-gateway-reviewer1.service)
- 2026-04-29 04:25:57: reviewer1 succeeds (prompt_len=23899), 39 seconds post-restart
- 2026-04-29 04:30:06: reviewer1 succeeds again post-second-restart
Process state at investigation
| Profile      | Uptime                 | RSS    | FD count |
|--------------|------------------------|--------|----------|
| reviewer2    | 2d 1h 26m              | 315 MB | < 32     |
| reviewer3    | 2d 1h 26m              | 302 MB | < 32     |
| reviewer1    | 4 min (just restarted) | 84 MB  | < 32     |
| main gateway | 2d 1h 26m              | 441 MB | < 32     |
No OOM, no FD exhaustion. Memory growth ~150 → 441 MB on main gateway across 2 days (within reasonable bounds).
Direct CLI test proves filesystem + skill content are NOT at fault
cd ~/.hermes/hermes-agent && python3 -c "
from tools.skills_tool import skill_view
import json
result = json.loads(skill_view('multi-model-review'))
print(result.get('success')) # → True
print(len(result.get('content'))) # → 17172
"
Works perfectly from a fresh Python process while the long-running gateway returns the stub. Confirms the regression is runtime-deferred state inside the long-running aiohttp request handler context, not file-level or skill-content-level.
Self-recovery within session
Reviewer log 2026-04-29 01:04:48 (after the webhook handler injected the failure stub):
"The skill loaded successfully. It appears the initial 'Failed to load skill' message was a transient error. The multi-model-review skill is now available..."
The model recovers via in-conversation skill_view tool call AFTER the webhook handler failed to inject content. Two calls in the same Python process produce divergent results — confirms deferred-state corruption, not file-level.
Suspected root cause (not verified)
A module-level cache in agent/skill_commands.py (or in one of its callees in the dispatch chain) may accumulate state across long-running aiohttp request handler invocations and eventually break _load_skill_payload. The bare except Exception: return None at lines 147-148 swallows the underlying exception silently, so the actual error has not been observed.
Suggested upstream fix (minimum)
Add diagnostic logging to surface the silent failure even before root cause is identified:
# agent/skill_commands.py:_load_skill_payload
try:
    loaded_skill = ...
except Exception as e:
    logger.warning("_load_skill_payload failed: %s", e, exc_info=True)
    return None
if not loaded_skill.get("success"):
    logger.warning("_load_skill_payload skill_view returned success=False: %r", loaded_skill)
    return None
And in the caller (build_skill_invocation_message):
# agent/skill_commands.py:425-427
if skill_content is None:
    return None  # Don't return truthy stub that overwrites user_instruction
With the caller returning None, webhook.py:404 if skill_content: would correctly fall through to user's prompt. Truthy-stub-as-success is the silent failure mode that masks the underlying bug.
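The combined effect of both changes can be sketched end to end. This is an illustrative stand-in, not a patch against Hermes: load_skill_payload, build_message, resolve_prompt, and the injected loader parameter are hypothetical names, and the way skill content is joined with the user instruction is an assumption.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("skill_sketch")


def load_skill_payload(name: str, loader: Callable[[str], dict]) -> Optional[dict]:
    """Load a skill and log both failure paths instead of swallowing them."""
    try:
        loaded = loader(name)
    except Exception:
        logger.warning("skill load raised for %r", name, exc_info=True)
        return None
    if not loaded.get("success"):
        logger.warning("skill load returned success=False for %r: %r", name, loaded)
        return None
    return loaded


def build_message(name: str, user_instruction: str,
                  loader: Callable[[str], dict]) -> Optional[str]:
    """Return None on failure rather than a truthy stub."""
    loaded = load_skill_payload(name, loader)
    if loaded is None:
        return None
    content = loaded.get("content")
    if not content:
        return None
    # Assumption: the real merge of skill content and instruction may differ.
    return f"{content}\n\n{user_instruction}"


def resolve_prompt(prompt: str, skill: str, loader: Callable[[str], dict]) -> str:
    """Webhook-side step: keep the user's prompt when the skill cannot be loaded."""
    skill_content = build_message(skill, prompt, loader)
    if skill_content:
        return skill_content
    return prompt


def failing_loader(name: str) -> dict:
    raise RuntimeError("simulated loader failure")


fallback = resolve_prompt("user payload", "multi-model-review", failing_loader)
print(fallback)  # user payload
```

In this sketch a failed load costs the agent the skill content but not the webhook payload, and the warning carries the traceback needed to find the underlying cause.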
Project-side mitigation already deployed
- Track B (S131 commit 7e39ced): strict CUID validation at API boundary (/api/agent/multi-model-review). Rejects fabricated IDs from reviewers that hallucinate from holographic memory after losing webhook context. Prevents orphan AgentLearning rows.
- Track A (S131): scripts/refire-panel-review.ts for manual recovery of stuck rows.
- Q1 stall monitor (S131 commit pending): /api/panel-review-health endpoint + Hermes cron */15min. Alerts via Telegram when multi_model_pending rows exceed 30 min.
These mitigate symptoms but not the root cause. Upstream fix is the only durable path.
Environment
- Hermes version: 0.11.0
- Python: 3.11.14
- OS: Linux (long-running production server, systemd-managed)
- Project: motherfish-ai-bot (XAUUSD trading agent, 3-reviewer panel pattern)
- Reviewer profiles: 3 isolated systemd services (hermes-gateway-reviewer{1,2,3}.service)
- Models per profile: kimi-k2.5/bailian (R1), glm-5/bailian (R2), glm-5.1/z.ai (R3)
Cross-reference
- Project's S131 session memory (full investigation): .claude/agent-memory/opus-4-7/sessions/131-track-d-skill-load-failure.md
- Project's Hermes troubleshooting doc: hermes-doc.md § Skill Auto-Load Failure: prompt_len=42 (S131)
- Track B implementation: src/app/api/agent/multi-model-review/route.ts:240-303 (commit 7e39ced)
- Track A recovery script: scripts/refire-panel-review.ts