
[Bug]: Webhook gateway — skill auto-loader returns 'Failed to load skill' stub instead of None, silently dropping user prompt after extended uptime #17283

@DUMPDUMPY


Summary

In a long-running gateway process (hermes gateway run; in the observed case the first failure appeared between roughly 25 and 33 hours of uptime), webhook routes configured with skills: start producing a 42-character stub ([Failed to load skill: <name>]) that silently overwrites the user's webhook payload. Reviewer agents receive the stub instead of the real prompt, lose all context, and either give up or fabricate responses from memory. The gateway logs no warning. Restarting the gateway clears the issue immediately.

Reproduction

  1. Configure a webhook route in ~/.hermes/profiles/<profile>/config.yaml with skills: [<any-skill>] and prompt: '{message}'
  2. Start the gateway: hermes gateway run --profile <profile> (or via systemd)
  3. Run for 24+ hours (in the observed case, the last success came at ~25h uptime and the first failure ~7.5h later)
  4. POST a webhook → observe inbound message becomes [Failed to load skill: <name>] instead of the user's payload
  5. Restart gateway → next POST succeeds, real payload arrives, skill loads normally

Source-code path (Hermes-agent v0.11.0)

  1. gateway/platforms/webhook.py:401-405 calls build_skill_invocation_message(cmd_key, user_instruction=prompt) synchronously inside the aiohttp request handler.
  2. agent/skill_commands.py:425-427 returns the truthy stub string '[Failed to load skill: <name>]' when _load_skill_payload returns None (exactly 42 characters for the skill name multi-model-review).
  3. webhook.py:404-405 checks if skill_content:, which accepts the stub as success, so prompt = skill_content overwrites the user's payload.
  4. agent/skill_commands.py, _load_skill_payload, lines 147-148: a bare except Exception: return None swallows all exceptions with NO log line.
  5. agent/skill_commands.py, _load_skill_payload, lines 150-151: if not loaded_skill.get("success"): return None, also with NO log line.
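The 42-character figure in step 2 can be checked by hand or with a few lines of Python. The stub format is the one quoted in this report; nothing here touches Hermes itself.

```python
# Stub format as quoted in this issue; the skill name is the one used by the affected routes.
skill_name = "multi-model-review"
stub = f"[Failed to load skill: {skill_name}]"

stub_length = len(stub)
stub_is_truthy = bool(stub)  # any non-empty string passes an `if skill_content:` check

print(stub_length)     # 42
print(stub_is_truthy)  # True
```

A different skill name gives a different length, so prompt_len=42 is only a reliable signature for routes that use multi-model-review.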

The outer webhook.py:412 except Exception as e: logger.warning("[webhook] Skill loading failed: %s", e) would fire if build_skill_invocation_message raised — it never does (verified via grep across all gateway log files: zero Skill loading failed warnings).
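The interaction between these code paths can be reproduced in isolation. The sketch below uses stand-in functions that only mimic the behaviour reported above; the function bodies, the simulated exception, and handle_webhook are assumptions for illustration, not the Hermes source.

```python
import logging

logger = logging.getLogger("webhook_sketch")


def _load_skill_payload(skill_name):
    """Stand-in: fails silently, as reported for lines 147-148."""
    try:
        raise RuntimeError("simulated underlying failure")
    except Exception:
        return None  # swallowed, no log line


def build_skill_invocation_message(cmd_key, user_instruction=""):
    """Stand-in for the behaviour reported at skill_commands.py:425-427."""
    payload = _load_skill_payload(cmd_key)
    if payload is None:
        return f"[Failed to load skill: {cmd_key}]"  # truthy stub
    return f"{payload}\n\n{user_instruction}"


def handle_webhook(prompt, skill):
    """Hypothetical handler mirroring the logic reported at webhook.py:401-412."""
    try:
        skill_content = build_skill_invocation_message(skill, user_instruction=prompt)
        if skill_content:           # the stub is truthy, so this branch is taken
            prompt = skill_content  # the user's payload is overwritten
    except Exception as e:          # never reached: nothing above raises
        logger.warning("[webhook] Skill loading failed: %s", e)
    return prompt


delivered = handle_webhook("real payload " * 100, "multi-model-review")
print(repr(delivered), len(delivered))
```

The 1300-character payload is replaced by the 42-character stub and no warning is emitted, which matches the symptom in the reviewer logs.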

Empirical evidence

Lockstep failure across 3 independent reviewer profiles

| Timestamp UTC                  | Reviewer1  | Reviewer2  | Reviewer3  |
|--------------------------------|------------|------------|------------|
| 2026-04-28 09:35:36.713        | len=42     | len=42     | len=42     |
| 2026-04-28 10:23:10.451        | len=42     | len=42     | len=42     |
| 2026-04-28 11:36:30.558        | len=42     | len=42     | len=42     |
| 2026-04-29 01:04:34.315        | len=42     | len=42     | len=42     |
| 2026-04-29 03:26:49.629        | len=42     | len=42     | len=42     |
| 2026-04-29 04:13:37.353        | len=42     | len=42     | len=42     |

Identical millisecond timestamps and 6/6 identical events across 3 separate Python processes rule out hypotheses that depend on per-process state diverging independently (cache races, SQLite contention, profile session resume).
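The lockstep claim can be checked mechanically once the events are in tabular form. This is a minimal sketch over the six rows transcribed from the table above; the tuple layout and the lockstep_failures helper are assumptions, since the gateway log format is not shown in this report.

```python
from collections import defaultdict

STUB_LEN = 42
REVIEWERS = ("Reviewer1", "Reviewer2", "Reviewer3")

# Timestamps (UTC) transcribed from the table; every reviewer logged len=42 at each one.
timestamps = [
    "2026-04-28 09:35:36.713",
    "2026-04-28 10:23:10.451",
    "2026-04-28 11:36:30.558",
    "2026-04-29 01:04:34.315",
    "2026-04-29 03:26:49.629",
    "2026-04-29 04:13:37.353",
]
events = [(ts, reviewer, STUB_LEN) for ts in timestamps for reviewer in REVIEWERS]


def lockstep_failures(rows):
    """Return timestamps at which every reviewer received a stub-length prompt."""
    by_ts = defaultdict(dict)
    for ts, reviewer, length in rows:
        by_ts[ts][reviewer] = length
    return [
        ts
        for ts, lengths in sorted(by_ts.items())
        if set(lengths) == set(REVIEWERS)
        and all(n == STUB_LEN for n in lengths.values())
    ]


failures = lockstep_failures(events)
print(len(failures), "of", len(timestamps), "events failed on all three reviewers")
```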

Restart-recovery correlation

  • 2026-04-28 01:55:17 — reviewer1 last successful delivery (prompt_len=21493), uptime ~25h since prior restart
  • 2026-04-28 09:35:36 — reviewer1 first failure (~7.5h additional uptime later)
  • All subsequent webhooks fail until...
  • 2026-04-29 04:25:18 — reviewer1 SIGTERM/restart (manual, via systemctl --user restart hermes-gateway-reviewer1.service)
  • 2026-04-29 04:25:57 — reviewer1 succeeds (prompt_len=23899), 39 seconds post-restart
  • 2026-04-29 04:30:06 — reviewer1 succeeds again post-second-restart
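The intervals in this timeline follow from the timestamps themselves. A short calculation, using only the values listed above:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

last_success = datetime.strptime("2026-04-28 01:55:17", FMT)
first_failure = datetime.strptime("2026-04-28 09:35:36", FMT)
restart = datetime.strptime("2026-04-29 04:25:18", FMT)
first_recovery = datetime.strptime("2026-04-29 04:25:57", FMT)

onset_window = first_failure - last_success  # the failure began somewhere in this gap
failing_span = restart - first_failure       # how long webhooks kept failing
recovery_delay = first_recovery - restart    # time from restart to first good delivery

print(onset_window)    # 7:40:19
print(failing_span)    # 18:49:42
print(recovery_delay)  # 0:00:39
```

Because no webhook arrived between the last success and the first failure, the exact onset is only known to lie somewhere inside the 7h40m window.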

Process state at investigation

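| Profile      | Uptime                  | RSS    | FD count |
|--------------|-------------------------|--------|----------|
| reviewer2    | 2d 1h 26m               | 315 MB | < 32     |
| reviewer3    | 2d 1h 26m               | 302 MB | < 32     |
| reviewer1    | 4 min (just restarted)  | 84 MB  | < 32     |
| main gateway | 2d 1h 26m               | 441 MB | < 32     |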

No OOM, no FD exhaustion. Memory growth ~150 → 441 MB on main gateway across 2 days (within reasonable bounds).

Direct CLI test proves filesystem + skill content are NOT at fault

cd ~/.hermes/hermes-agent && python3 -c "
from tools.skills_tool import skill_view
import json
result = json.loads(skill_view('multi-model-review'))
print(result.get('success'))    # → True
print(len(result.get('content')))  # → 17172
"

The same call works from a fresh Python process while the long-running gateway returns the stub. This points to state accumulated at runtime inside the long-running gateway process (the aiohttp request handler context), not to the skill files or their content.

Self-recovery within session

Reviewer log 2026-04-29 01:04:48 (after the webhook handler injected the failure stub):

"The skill loaded successfully. It appears the initial 'Failed to load skill' message was a transient error. The multi-model-review skill is now available..."

The model recovers via an in-conversation skill_view tool call AFTER the webhook handler failed to inject content. Two calls in the same Python process produce divergent results, which is consistent with corrupted runtime state on the webhook path rather than a file-level problem.

Suspected root cause (not verified)

A module-level cache in agent/skill_commands.py (or in one of its callees in the dispatch chain) may accumulate state across long-running aiohttp request handler invocations and eventually break _load_skill_payload. The bare except Exception: return None at lines 147-148 swallows the underlying exception silently, so the actual error is currently unknown.

Suggested upstream fix (minimum)

Add diagnostic logging to surface the silent failure even before root cause is identified:

# agent/skill_commands.py:_load_skill_payload
try:
    loaded_skill = ...
except Exception as e:
    logger.warning("_load_skill_payload failed: %s", e, exc_info=True)
    return None

if not loaded_skill.get("success"):
    logger.warning("_load_skill_payload skill_view returned success=False: %r", loaded_skill)
    return None

And in the caller (build_skill_invocation_message):

# agent/skill_commands.py:425-427
if skill_content is None:
    return None  # Don't return truthy stub that overwrites user_instruction

With the caller returning None, the if skill_content: check at webhook.py:404 would correctly fall through to the user's prompt. Treating a truthy stub as success is the silent failure mode that masks the underlying bug.
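Both suggested changes can be exercised together outside the gateway. This is a self-contained sketch under stated assumptions: the loader argument, resolve_prompt, and the two demo loaders are hypothetical stand-ins for the real skill_view call and webhook handler, and the message format is invented for the demo.

```python
import logging

logger = logging.getLogger("skill_commands_sketch")


def _load_skill_payload(skill_name, loader):
    """Load a skill via `loader` (hypothetical injection point); log every failure."""
    try:
        loaded_skill = loader(skill_name)
    except Exception as e:
        logger.warning("_load_skill_payload failed: %s", e, exc_info=True)
        return None
    if not loaded_skill.get("success"):
        logger.warning("_load_skill_payload skill_view returned success=False: %r", loaded_skill)
        return None
    return loaded_skill


def build_skill_invocation_message(cmd_key, user_instruction, loader):
    payload = _load_skill_payload(cmd_key, loader)
    if payload is None:
        return None  # no truthy stub, so the caller keeps the user's prompt
    return f"{payload['content']}\n\n{user_instruction}"


def resolve_prompt(prompt, skill, loader):
    """Hypothetical stand-in for the check reported at webhook.py:404-405."""
    skill_content = build_skill_invocation_message(skill, prompt, loader)
    if skill_content:
        prompt = skill_content
    return prompt


def broken_loader(name):
    raise RuntimeError("simulated failure")


def working_loader(name):
    return {"success": True, "content": f"<skill {name}>"}


kept = resolve_prompt("real payload", "multi-model-review", broken_loader)
merged = resolve_prompt("real payload", "multi-model-review", working_loader)
print(repr(kept))
print(repr(merged))
```

On failure the user's payload survives and a warning with a traceback is logged; on success the skill content is prepended as before.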

Project-side mitigation already deployed

  • Track B (S131 commit 7e39ced): strict CUID validation at API boundary (/api/agent/multi-model-review) — rejects fabricated IDs from reviewers that hallucinate from holographic memory after losing webhook context. Prevents orphan AgentLearning rows.
  • Track A (S131): scripts/refire-panel-review.ts for manual recovery of stuck rows.
  • Q1 stall monitor (S131 commit pending): /api/panel-review-health endpoint + Hermes cron */15min — alerts via Telegram when multi_model_pending rows exceed 30 min.

These mitigate symptoms but not the root cause. Upstream fix is the only durable path.
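The stall check behind the Q1 monitor reduces to an age filter over pending rows. A minimal sketch of that logic follows; the row shape (id, status, created_at) and the stalled_rows helper are assumptions for illustration, and only the multi_model_pending status and the 30-minute threshold come from this report.

```python
from datetime import datetime, timedelta, timezone

STALL_THRESHOLD = timedelta(minutes=30)


def stalled_rows(rows, now):
    """Return ids of rows still pending after the threshold.

    `rows` is an iterable of (row_id, status, created_at) tuples (hypothetical shape).
    """
    return [
        row_id
        for row_id, status, created_at in rows
        if status == "multi_model_pending" and now - created_at > STALL_THRESHOLD
    ]


now = datetime(2026, 4, 29, 4, 0, tzinfo=timezone.utc)
rows = [
    ("a", "multi_model_pending", now - timedelta(minutes=45)),  # stalled
    ("b", "multi_model_pending", now - timedelta(minutes=10)),  # still within budget
    ("c", "done", now - timedelta(hours=3)),                    # not pending
]

stalled = stalled_rows(rows, now)
print(stalled)  # ['a']
```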

Environment

  • Hermes version: 0.11.0
  • Python: 3.11.14
  • OS: Linux (long-running production server, systemd-managed)
  • Project: motherfish-ai-bot (XAUUSD trading agent, 3-reviewer panel pattern)
  • Reviewer profiles: 3 isolated systemd services (hermes-gateway-reviewer{1,2,3}.service)
  • Models per profile: kimi-k2.5/bailian (R1), glm-5/bailian (R2), glm-5.1/z.ai (R3)

Cross-reference

  • Project's S131 session memory (full investigation): .claude/agent-memory/opus-4-7/sessions/131-track-d-skill-load-failure.md
  • Project's Hermes troubleshooting doc: hermes-doc.md § Skill Auto-Load Failure: prompt_len=42 (S131)
  • Track B implementation: src/app/api/agent/multi-model-review/route.ts:240-303 (commit 7e39ced)
  • Track A recovery script: scripts/refire-panel-review.ts

Labels: P1 (High: major feature broken, no workaround), comp/agent (core agent loop, run_agent.py, prompt builder), comp/gateway (gateway runner, session dispatch, delivery), tool/skills (skills system: list, view, manage), type/bug (something isn't working)
