Dispatcher: detect HTTP 429 in worker stderr and delayed-retry instead of protocol_violation auto-block #5

@jarvis-stark-ops

Problem

When a Hermes worker hits an upstream rate-limit (HTTP 429 from OpenAI Codex, Anthropic, etc.) and exits cleanly (rc=0) WITHOUT calling kanban_complete or kanban_block, the dispatcher classifies this as protocol_violation and auto-blocks the task. After 1-2 retries hit the same 429, the task is permanently blocked and the chain stalls.

Evidence

v6.6 chain run 2026-06-07 ~10:55 PDT:

  • Tony+Tchalla Block C re-reviews (t_77ac35a7, t_a30a88db)
  • Worker stderr: HTTP 429: The usage limit has been reached. plan_type: team, resets_at: 1780861905
  • Dispatcher event: gave_up { error: 'worker exited cleanly (rc=0) without calling kanban_complete or kanban_block — protocol violation' }
  • Result: both tasks permanently blocked; v6.6 chain stalled for 2 hours until quota reset
  • Recovery required a manual gateway restart and would also have required manual re-promotion of the blocked tasks
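
The `resets_at` value in the stderr line above looks like a Unix epoch timestamp in seconds, though the issue does not state the unit. Under that assumption, a minimal sketch of pulling it out of worker stderr could look like this; `parse_resets_at` is a hypothetical helper name, not an existing dispatcher function.

```python
import re
from datetime import datetime, timezone
from typing import Optional

# Assumption: resets_at is a Unix epoch timestamp in seconds.
_RESETS_AT = re.compile(r"resets_at:\s*(\d+)")


def parse_resets_at(stderr: str) -> Optional[datetime]:
    """Return the quota reset time found in worker stderr, or None if absent."""
    match = _RESETS_AT.search(stderr)
    if match is None:
        return None
    return datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)


# The stderr line quoted in the evidence above.
line = "HTTP 429: The usage limit has been reached. plan_type: team, resets_at: 1780861905"
print(parse_resets_at(line).isoformat())  # 2026-06-07T19:51:45+00:00
```

Read this way, the reset falls at 19:51 UTC (12:51 PDT) on 2026-06-07, which is consistent with the roughly two-hour stall that began around 10:55 PDT.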

See ~/.claude/projects/-Users-jarvis--hermes/memory/project_marvel_swarm_v6_6_rate_limit.md for full incident write-up.

Why this matters

Banner v1 research (2026-06-07, see ~/.hermes/skills/devops/agent-onboarding/_research/banner-v1/findings.md) found that no major agent framework (CrewAI, LangGraph, AutoGen, Claude Agent SDK) handles rate limits any better than this. Fixing it is a differentiator: it would make us robust to upstream incidents in a way none of those frameworks currently are.

Also, our profiles now have grok-4.3 as the final fallback (config change 2026-06-07), but the dispatcher never tries the fallback chain on a 429; it just auto-blocks.
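
One way the dispatcher could honor the fallback chain is to skip any provider that is still cooling down after a 429. The sketch below is illustrative only: `ProviderHealth`, its methods, and the provider identifiers other than grok-4.3 are hypothetical and do not reflect the real profile config.

```python
import time
from typing import Dict, List, Optional


class ProviderHealth:
    """Tracks providers that are cooling down after a 429 (hypothetical helper)."""

    def __init__(self) -> None:
        # provider name -> epoch seconds at which it becomes usable again
        self._unhealthy_until: Dict[str, float] = {}

    def mark_rate_limited(self, provider: str, until: float) -> None:
        self._unhealthy_until[provider] = until

    def is_healthy(self, provider: str, now: float) -> bool:
        return now >= self._unhealthy_until.get(provider, 0.0)

    def pick(self, chain: List[str], now: Optional[float] = None) -> Optional[str]:
        """Return the first provider in the chain that is not cooling down."""
        now = time.time() if now is None else now
        for provider in chain:
            if self.is_healthy(provider, now):
                return provider
        return None  # every provider is cooling down; caller should delay the retry


# Illustrative chain; only grok-4.3 as the final fallback comes from the issue.
chain = ["openai-codex", "anthropic", "grok-4.3"]
health = ProviderHealth()
health.mark_rate_limited("openai-codex", until=1000.0)
print(health.pick(chain, now=500.0))  # anthropic
```

Returning `None` when the whole chain is cooling down gives the caller a clear signal to fall back to a delayed retry instead of blocking the task.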

Acceptance criteria

  • Worker stderr is scanned post-mortem for known rate-limit signatures (HTTP 429, usage limit reached, rate_limit_exceeded, etc.)
  • On match: task is requeued with delayed_retry_until = resets_at_from_stderr_or_default_300s instead of auto-blocked
  • A delayed retry still counts against max-retries, but with a longer retry window
  • The provider that issued the 429 is marked unhealthy for the cooldown period, and fallback providers are tried first
  • Test: simulate 429 in worker stderr → assert dispatcher requeues, doesn't auto-block
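
The first two criteria and the test could be sketched roughly as follows. This is not the dispatcher's actual code: `classify_clean_exit`, `Decision`, and the action names are hypothetical, the signature list is only the examples given above, and `resets_at` is assumed to be epoch seconds. The sketch also ignores the v6.4 Fix B `keep_running` case.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Signatures named in the acceptance criteria; not an exhaustive list.
RATE_LIMIT_PATTERNS = [
    re.compile(r"HTTP 429\b"),
    re.compile(r"usage limit (has been )?reached", re.IGNORECASE),
    re.compile(r"rate_limit_exceeded"),
]
RESETS_AT = re.compile(r"resets_at:\s*(\d+)")
DEFAULT_DELAY_S = 300  # default delay from the acceptance criteria


@dataclass
class Decision:
    action: str  # "delayed_retry" or "auto_block" (hypothetical names)
    delayed_retry_until: Optional[float] = None


def classify_clean_exit(stderr: str, now: float) -> Decision:
    """Decide what to do with a worker that exited rc=0 without completing or blocking."""
    if not any(p.search(stderr) for p in RATE_LIMIT_PATTERNS):
        return Decision("auto_block")  # unchanged protocol_violation path
    match = RESETS_AT.search(stderr)
    # Assumption: resets_at is epoch seconds; ignore values already in the past.
    if match and float(match.group(1)) > now:
        return Decision("delayed_retry", float(match.group(1)))
    return Decision("delayed_retry", now + DEFAULT_DELAY_S)


# Simulated 429 in worker stderr: the task should be requeued, not auto-blocked.
stderr = "HTTP 429: The usage limit has been reached. plan_type: team, resets_at: 1780861905"
decision = classify_clean_exit(stderr, now=1780854900.0)
assert decision.action == "delayed_retry"
assert decision.delayed_retry_until == 1780861905.0
```

Falling back to `now + DEFAULT_DELAY_S` when `resets_at` is missing or stale keeps the retry from firing immediately against a provider that is still limiting.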

Related

  • Builds on: existing v6.4 Fix B (no auto-block on keep_running protocol_violation)
  • Memory: project_marvel_swarm_v6_6_rate_limit.md
  • Banner findings: ~/.hermes/skills/devops/agent-onboarding/_research/banner-v1/

Labels: bug (Something isn't working)
