Problem
When a Hermes worker hits an upstream rate-limit (HTTP 429 from OpenAI Codex, Anthropic, etc.) and exits cleanly (rc=0) WITHOUT calling kanban_complete or kanban_block, the dispatcher classifies this as protocol_violation and auto-blocks the task. After 1-2 retries hit the same 429, the task is permanently blocked and the chain stalls.
Evidence
v6.6 chain run 2026-06-07 ~10:55 PDT:
- Tony+Tchalla Block C re-reviews (t_77ac35a7, t_a30a88db)
- Worker stderr:
  HTTP 429: The usage limit has been reached. plan_type: team, resets_at: 1780861905
- Dispatcher event:
  gave_up { error: 'worker exited cleanly (rc=0) without calling kanban_complete or kanban_block — protocol violation' }
- Result: both tasks permanently blocked; v6.6 chain stalled for 2 hours until quota reset
- Recovery required a manual gateway restart and would otherwise also have required manual re-promotion of the blocked tasks
See ~/.claude/projects/-Users-jarvis--hermes/memory/project_marvel_swarm_v6_6_rate_limit.md for full incident write-up.
Why this matters
Banner v1 research (2026-06-07, see ~/.hermes/skills/devops/agent-onboarding/_research/banner-v1/findings.md) confirmed that NO major agent framework (CrewAI, LangGraph, AutoGen, Claude Agent SDK) handles rate limits any better than this. Fixing this is differentiating: it makes us robust to upstream incidents in a way no other framework is.
Also: our profiles now have grok-4.3 as final fallback (config change 2026-06-07), but the dispatcher doesn't actually try the fallback chain on 429; it just auto-blocks.
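A minimal sketch of how the fallback chain could take 429s into account, assuming the dispatcher can see an ordered list of provider names per profile. ProviderHealth, mark_rate_limited, and order are hypothetical names for illustration, not existing Hermes APIs, and the provider labels other than grok-4.3 are placeholders.

```python
import time
from typing import Dict, List, Optional


class ProviderHealth:
    """Tracks providers that are cooling down after a 429 (hypothetical helper)."""

    def __init__(self) -> None:
        # provider name -> epoch seconds until which it is considered unhealthy
        self._unhealthy_until: Dict[str, float] = {}

    def mark_rate_limited(self, provider: str, until: float) -> None:
        self._unhealthy_until[provider] = until

    def is_healthy(self, provider: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        return self._unhealthy_until.get(provider, 0.0) <= now

    def order(self, chain: List[str], now: Optional[float] = None) -> List[str]:
        """Return the chain with healthy providers first, keeping relative order."""
        now = time.time() if now is None else now
        healthy = [p for p in chain if self.is_healthy(p, now)]
        cooling = [p for p in chain if not self.is_healthy(p, now)]
        return healthy + cooling
```

Keeping the cooling-down providers at the end of the list, rather than dropping them, means a task still has somewhere to go if every provider in the chain is rate-limited at once.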
Acceptance criteria
- Worker stderr is scanned post-mortem for known rate-limit signatures (HTTP 429, usage limit reached, rate_limit_exceeded, etc.)
- On match: task is requeued with delayed_retry_until = resets_at_from_stderr_or_default_300s instead of auto-blocked
- Delayed retry counts against max-retries but with a longer window
- Provider that issued the 429 is marked unhealthy for the cooldown period; fallback providers tried first
- Test: simulate 429 in worker stderr → assert dispatcher requeues, doesn't auto-block
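The first two criteria could be sketched roughly as follows. This is an illustration under assumptions, not the dispatcher's actual code: classify_clean_exit, Decision, and the exact regex list are hypothetical, and the resets_at format is inferred from the single stderr line quoted under Evidence.

```python
import re
import time
from dataclasses import dataclass
from typing import Optional

# Signatures named in the acceptance criteria; the full list is an assumption.
RATE_LIMIT_PATTERNS = [
    re.compile(r"HTTP 429", re.IGNORECASE),
    re.compile(r"usage limit (has been )?reached", re.IGNORECASE),
    re.compile(r"rate_limit_exceeded", re.IGNORECASE),
]
# Assumes resets_at is reported as epoch seconds, as in the quoted stderr line.
RESETS_AT_RE = re.compile(r"resets_at:\s*(\d+)")
DEFAULT_RETRY_DELAY_S = 300


@dataclass
class Decision:
    action: str  # "requeue" or "block"
    delayed_retry_until: Optional[float] = None


def classify_clean_exit(stderr: str, now: Optional[float] = None) -> Decision:
    """Decide what to do with a worker that exited rc=0 without completing or blocking."""
    now = time.time() if now is None else now
    if not any(p.search(stderr) for p in RATE_LIMIT_PATTERNS):
        # No rate-limit signature: leave this to the existing protocol_violation path.
        return Decision("block")
    match = RESETS_AT_RE.search(stderr)
    resets_at = float(match.group(1)) if match else 0.0
    # Use the default delay when resets_at is missing or already in the past.
    retry_at = resets_at if resets_at > now else now + DEFAULT_RETRY_DELAY_S
    return Decision("requeue", retry_at)
```

Feeding the stderr line from the incident into classify_clean_exit yields a requeue with delayed_retry_until set to 1780861905, while stderr without any rate-limit signature still falls through to the block path. The same function can back the simulated-429 test in the last criterion.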
Related
- Builds on: existing v6.4 Fix B (no auto-block on keep_running protocol_violation)
- Memory: project_marvel_swarm_v6_6_rate_limit.md
- Banner findings: ~/.hermes/skills/devops/agent-onboarding/_research/banner-v1/