[Bug]: Webhook gateway — skill auto-loader returns 'Failed to load skill' stub instead of None, silently dropping user prompt after extended uptime

### Summary

In a long-running gateway process (`hermes gateway run`), webhook routes configured with `skills:` eventually start producing a stub string, `[Failed to load skill: <name>]` (42 characters for the skill `multi-model-review`), that **silently overwrites the user's webhook payload**. In the observed case the first failure came about 7.5 hours after the last successful delivery, at roughly 32 hours of process uptime. Reviewer agents receive the stub instead of the real prompt, lose all context, and either give up or fabricate responses from memory. The gateway logs no warning. Restarting the gateway clears the issue immediately.

### Reproduction

1. Configure a webhook route in `~/.hermes/profiles/<profile>/config.yaml` with `skills: [<any-skill>]` and `prompt: '{message}'`
2. Start the gateway: `hermes gateway run --profile <profile>` (or via systemd)
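3. Leave it running for 24+ hours. The exact onset is unknown: the first observed failure came ~7.5h after the last successful delivery, at roughly 32h of uptime.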
4. POST a webhook → observe inbound message becomes `[Failed to load skill: <name>]` instead of the user's payload
5. Restart gateway → next POST succeeds, real payload arrives, skill loads normally

### Source-code path (Hermes-agent v0.11.0)

1. **`gateway/platforms/webhook.py:401-405`**: calls `build_skill_invocation_message(cmd_key, user_instruction=prompt)` synchronously inside the aiohttp request handler.
2. **`agent/skill_commands.py:425-427`**: when `_load_skill_payload` returns `None`, the function **returns a truthy stub string**, `'[Failed to load skill: <name>]'`. For the skill name `multi-model-review` that is exactly 42 characters (23 for the fixed prefix, 18 for the name, 1 for the closing bracket).
3. **`gateway/platforms/webhook.py:404-405`**: `if skill_content:` accepts the stub as success, so `prompt = skill_content` overwrites the user's payload.
4. **`agent/skill_commands.py:147-148`** (`_load_skill_payload`): a broad `except Exception: return None` swallows all exceptions with **no log line**.
5. **`agent/skill_commands.py:150-151`** (`_load_skill_payload`): `if not loaded_skill.get("success"): return None`, again with **no log line**.

The outer handler at `gateway/platforms/webhook.py:412`, `except Exception as e: logger.warning("[webhook] Skill loading failed: %s", e)`, would fire only if `build_skill_invocation_message` raised. **It never does**: a grep across all gateway log files finds zero `Skill loading failed` warnings.
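The control flow above can be reproduced in isolation. The following is a minimal sketch, not the Hermes source: `load_skill_payload`, `build_skill_invocation_message`, and `resolve_prompt` are simplified stand-ins with assumed signatures (including the hypothetical `broken` flag), modeled only on the behavior described in steps 1-5.

```python
from typing import Optional


def load_skill_payload(skill_name: str, broken: bool) -> Optional[str]:
    """Stand-in for _load_skill_payload: swallows every error and returns None."""
    try:
        if broken:
            raise RuntimeError("simulated loader failure")
        return f"<content of {skill_name}>"
    except Exception:
        return None  # no log line, mirroring the reported behavior


def build_skill_invocation_message(skill_name: str, user_instruction: str, broken: bool) -> str:
    """Stand-in for the caller: turns a None payload into a truthy stub."""
    payload = load_skill_payload(skill_name, broken)
    if payload is None:
        return f"[Failed to load skill: {skill_name}]"
    return f"{payload}\n\n{user_instruction}"


def resolve_prompt(skill_name: str, prompt: str, broken: bool) -> str:
    """Stand-in for the webhook handler's truthiness check."""
    skill_content = build_skill_invocation_message(skill_name, prompt, broken)
    if skill_content:  # the stub is a non-empty string, so this branch is taken
        prompt = skill_content
    return prompt


stub = resolve_prompt("multi-model-review", "real webhook payload", broken=True)
print(stub, len(stub))  # prints the stub and 42; the payload is gone
```

Because no exception ever leaves `build_skill_invocation_message`, an outer `except` in the handler has nothing to catch, which matches the absence of warnings in the logs.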

### Empirical evidence

#### Lockstep failure across 3 independent reviewer profiles

| Timestamp (UTC)         | Reviewer1 | Reviewer2 | Reviewer3 |
|-------------------------|-----------|-----------|-----------|
| 2026-04-28 09:35:36.713 | len=42    | len=42    | len=42    |
| 2026-04-28 10:23:10.451 | len=42    | len=42    | len=42    |
| 2026-04-28 11:36:30.558 | len=42    | len=42    | len=42    |
| 2026-04-29 01:04:34.315 | len=42    | len=42    | len=42    |
| 2026-04-29 03:26:49.629 | len=42    | len=42    | len=42    |
| 2026-04-29 04:13:37.353 | len=42    | len=42    | len=42    |

**Identical millisecond timestamps and 6/6 identical events across 3 separate Python processes** rule out hypotheses that depend on per-process state divergence (cache races, SQLite contention, profile session resume).

#### Restart-recovery correlation

- 2026-04-28 01:55:17 — reviewer1 last successful delivery (`prompt_len=21493`), uptime ~25h since prior restart
- 2026-04-28 09:35:36 — reviewer1 **first failure** (~7.5h additional uptime later)
- All subsequent webhooks fail until...
- 2026-04-29 04:25:18 — reviewer1 SIGTERM/restart (manual, via `systemctl --user restart hermes-gateway-reviewer1.service`)
- 2026-04-29 04:25:57 — reviewer1 **succeeds** (`prompt_len=23899`), 39 seconds post-restart
- 2026-04-29 04:30:06 — reviewer1 **succeeds again** post-second-restart

#### Process state at investigation

| Profile | Uptime | RSS | FD count |
|---------|--------|-----|----------|
| reviewer2 | 2d 1h 26m | 315 MB | < 32 |
| reviewer3 | 2d 1h 26m | 302 MB | < 32 |
| reviewer1 | 4 min (just restarted) | 84 MB | < 32 |
| main gateway | 2d 1h 26m | 441 MB | < 32 |

No OOM, no FD exhaustion. Memory growth ~150 → 441 MB on main gateway across 2 days (within reasonable bounds).
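For anyone repeating this check on their own gateway, the figures in the table can be sampled from `/proc` on Linux. This is a minimal sketch, not the procedure used for the table above; the helper names `parse_vm_rss_kb` and `sample_process` are illustrative.

```python
import os
from pathlib import Path
from typing import Dict, Optional


def parse_vm_rss_kb(status_text: str) -> Optional[int]:
    """Extract the VmRSS value (in kB) from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # Line looks like "VmRSS:     322560 kB"
            return int(line.split()[1])
    return None  # some processes (e.g. kernel threads) have no VmRSS line


def sample_process(pid: int) -> Dict[str, Optional[int]]:
    """Read resident memory and open-descriptor count for one process (Linux only)."""
    proc_dir = Path("/proc") / str(pid)
    rss_kb = parse_vm_rss_kb((proc_dir / "status").read_text())
    fd_count = len(os.listdir(proc_dir / "fd"))
    return {"pid": pid, "rss_kb": rss_kb, "fd_count": fd_count}
```

Calling `sample_process(os.getpid())` samples the current process. Reading `/proc/<pid>/fd` for another user's process requires the matching permissions.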

#### Direct CLI test proves filesystem + skill content are NOT at fault

```bash
cd ~/.hermes/hermes-agent && python3 -c "
from tools.skills_tool import skill_view
import json
result = json.loads(skill_view('multi-model-review'))
print(result.get('success'))    # → True
print(len(result.get('content')))  # → 17172
"
```

The call works from a fresh Python process while the long-running gateway keeps returning the stub. This points to **state that degrades at runtime inside the long-running gateway process**, not to the skill file or its content.

#### Self-recovery within session

Reviewer log 2026-04-29 01:04:48 (after the webhook handler injected the failure stub):

> *"The skill loaded successfully. It appears the initial 'Failed to load skill' message was a transient error. The multi-model-review skill is now available..."*

The model recovers via an in-conversation `skill_view` tool call AFTER the webhook handler failed to inject the content. Two calls in the same Python process produce divergent results, which is consistent with corrupted runtime state in the webhook path and not with a file-level problem.

### Suspected root cause (not verified)

A module-level cache in `agent/skill_commands.py` (or in one of its callees in the dispatch chain) may accumulate state across long-running aiohttp request handler invocations and eventually break `_load_skill_payload`. The broad `except Exception: return None` at lines 147-148 then swallows the underlying exception silently.
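One way to probe this hypothesis is to log the size of every module-level container in the suspect module at intervals and watch for unbounded growth. The sketch below is an assumption-laden diagnostic, not part of Hermes: `container_sizes` is a hypothetical helper, and the demo module `demo_skill_cache` exists only to make the example runnable.

```python
import sys
import types
from typing import Dict


def container_sizes(module_name: str) -> Dict[str, int]:
    """Return the length of each module-level dict, list, or set in a loaded module."""
    module = sys.modules.get(module_name)
    if module is None:
        return {}  # module has not been imported in this process
    sizes: Dict[str, int] = {}
    for name, value in vars(module).items():
        if isinstance(value, (dict, list, set)):
            sizes[name] = len(value)
    return sizes


# Demo with a throwaway module standing in for the real one.
demo = types.ModuleType("demo_skill_cache")
demo._cache = {"a": 1, "b": 2}
demo._names = ["x"]
sys.modules["demo_skill_cache"] = demo
print(container_sizes("demo_skill_cache"))
```

In the gateway, the same function could be pointed at `agent.skill_commands` (assuming that is the import name) from a periodic task, and the output compared between a healthy and a failing process.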

### Suggested upstream fix (minimum)

Add diagnostic logging to surface the silent failure even before root cause is identified:

```python
# agent/skill_commands.py:_load_skill_payload
try:
    loaded_skill = ...
except Exception as e:
    logger.warning("_load_skill_payload failed: %s", e, exc_info=True)
    return None

if not loaded_skill.get("success"):
    logger.warning("_load_skill_payload skill_view returned success=False: %r", loaded_skill)
    return None
```

And in the caller (`build_skill_invocation_message`):

```python
# agent/skill_commands.py:425-427
if skill_content is None:
    return None  # Don't return truthy stub that overwrites user_instruction
```

With the caller returning `None`, the `if skill_content:` check at `webhook.py:404` would correctly fall through to the user's prompt. Treating a truthy stub as success is the silent failure mode that masks the underlying bug.
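The intended behavior after the fix can be pinned down with a small regression check. This is a sketch under assumptions: `resolve_prompt`, `failing_builder`, and `working_builder` are hypothetical names that mimic the handler and the skill builder, not the actual Hermes functions.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("webhook-sketch")

# A builder takes (skill_name, user_instruction) and returns content or None.
SkillBuilder = Callable[[str, str], Optional[str]]


def resolve_prompt(skill_name: str, prompt: str, build_message: SkillBuilder) -> str:
    """Return skill content when it loads, otherwise keep the original prompt."""
    skill_content = build_message(skill_name, prompt)
    if skill_content is None:
        logger.warning("Skill %r failed to load; delivering original prompt", skill_name)
        return prompt
    return skill_content


def failing_builder(skill_name: str, user_instruction: str) -> Optional[str]:
    """Simulates the fixed builder when the skill cannot be loaded."""
    return None


def working_builder(skill_name: str, user_instruction: str) -> Optional[str]:
    """Simulates a successful skill load."""
    return f"<content of {skill_name}>\n\n{user_instruction}"


kept = resolve_prompt("multi-model-review", "real webhook payload", failing_builder)
print(kept)  # the user's payload survives a failed skill load
```

The explicit `is None` check also logs the failure at the point where the payload would otherwise be lost, so the condition is visible even if the loader itself stays silent.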

### Project-side mitigation already deployed

- **Track B (S131 commit `7e39ced`):** strict CUID validation at API boundary (`/api/agent/multi-model-review`) — rejects fabricated IDs from reviewers that hallucinate from holographic memory after losing webhook context. Prevents orphan AgentLearning rows.
- **Track A (S131):** `scripts/refire-panel-review.ts` for manual recovery of stuck rows.
- **Q1 stall monitor (S131 commit pending):** `/api/panel-review-health` endpoint + Hermes cron */15min — alerts via Telegram when `multi_model_pending` rows exceed 30 min.

These mitigate symptoms but not the root cause. Upstream fix is the only durable path.

### Environment

- Hermes version: **0.11.0**
- Python: 3.11.14
- OS: Linux (long-running production server, systemd-managed)
- Project: motherfish-ai-bot (XAUUSD trading agent, 3-reviewer panel pattern)
- Reviewer profiles: 3 isolated systemd services (`hermes-gateway-reviewer{1,2,3}.service`)
- Models per profile: kimi-k2.5/bailian (R1), glm-5/bailian (R2), glm-5.1/z.ai (R3)

### Cross-reference

- Project's S131 session memory (full investigation): `.claude/agent-memory/opus-4-7/sessions/131-track-d-skill-load-failure.md`
- Project's Hermes troubleshooting doc: `hermes-doc.md § Skill Auto-Load Failure: prompt_len=42 (S131)`
- Track B implementation: `src/app/api/agent/multi-model-review/route.ts:240-303` (commit `7e39ced`)
- Track A recovery script: `scripts/refire-panel-review.ts`

---


