Dispatcher: detect HTTP 429 in worker stderr and delayed-retry instead of protocol_violation auto-block

## Problem

When a Hermes worker hits an upstream rate-limit (HTTP 429 from OpenAI Codex, Anthropic, etc.) and exits cleanly (rc=0) WITHOUT calling `kanban_complete` or `kanban_block`, the dispatcher classifies this as `protocol_violation` and auto-blocks the task. After 1-2 retries hit the same 429, the task is permanently blocked and the chain stalls.

## Evidence

v6.6 chain run 2026-06-07 ~10:55 PDT:
- Tony+Tchalla Block C re-reviews (t_77ac35a7, t_a30a88db)
- Worker stderr: `HTTP 429: The usage limit has been reached. plan_type: team, resets_at: 1780861905`
- Dispatcher event: `gave_up { error: 'worker exited cleanly (rc=0) without calling kanban_complete or kanban_block — protocol violation' }`
- Result: both tasks permanently blocked; v6.6 chain stalled for 2 hours until quota reset
- Recovery required a manual gateway restart, and would also have required manual re-promotion of the blocked tasks

See `~/.claude/projects/-Users-jarvis--hermes/memory/project_marvel_swarm_v6_6_rate_limit.md` for full incident write-up.

## Why this matters

Banner v1 research (2026-06-07, see `~/.hermes/skills/devops/agent-onboarding/_research/banner-v1/findings.md`) confirmed that no major agent framework (CrewAI, LangGraph, AutoGen, Claude Agent SDK) handles rate limits any better than this. Fixing it is a differentiator: it makes us robust to upstream incidents in a way the others are not.

Also: our profiles now have grok-4.3 as final fallback (config change 2026-06-07), but the dispatcher doesn't actually try the fallback chain on 429 — it just auto-blocks.
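One way the fallback ordering could work is sketched below. `ProviderHealth` and its method names are hypothetical, not existing dispatcher API, and apart from `grok-4.3` the provider names are illustrative placeholders. The idea is only that a provider which just returned a 429 is moved to the back of the chain until its cooldown expires.

```python
from typing import Dict, List


class ProviderHealth:
    """Tracks per-provider cooldowns (hypothetical helper, not dispatcher API)."""

    def __init__(self) -> None:
        # provider name -> epoch seconds until which it is considered unhealthy
        self._unhealthy_until: Dict[str, float] = {}

    def mark_rate_limited(self, provider: str, until: float) -> None:
        # Keep the later deadline if the provider is already cooling down.
        current = self._unhealthy_until.get(provider, 0.0)
        self._unhealthy_until[provider] = max(current, until)

    def is_healthy(self, provider: str, now: float) -> bool:
        return now >= self._unhealthy_until.get(provider, 0.0)

    def order(self, chain: List[str], now: float) -> List[str]:
        # Healthy providers keep their configured order and go first;
        # cooling-down providers stay available as a last resort.
        healthy = [p for p in chain if self.is_healthy(p, now)]
        cooling = [p for p in chain if not self.is_healthy(p, now)]
        return healthy + cooling


health = ProviderHealth()
chain = ["openai-codex", "anthropic", "grok-4.3"]  # illustrative names
health.mark_rate_limited("openai-codex", until=200.0)
ordered_during_cooldown = health.order(chain, now=100.0)
ordered_after_cooldown = health.order(chain, now=200.0)
```

Keeping rate-limited providers at the end of the list, rather than dropping them, is a design assumption: it avoids an empty chain when every provider is cooling down at once.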

## Acceptance criteria

- Worker stderr is scanned post-mortem for known rate-limit signatures (HTTP 429, `usage limit reached`, `rate_limit_exceeded`, etc.)
- On match: task is requeued with `delayed_retry_until = resets_at_from_stderr_or_default_300s` instead of auto-blocked
- A delayed retry still counts against `max-retries`, but the window before the next attempt is longer than for an ordinary retry
- Provider that issued the 429 is marked unhealthy for the cooldown period; fallback providers tried first
- Test: simulate 429 in worker stderr → assert dispatcher requeues, doesn't auto-block
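The first two criteria could be sketched roughly as follows. This is a minimal sketch, not the dispatcher's actual code: the function names, the pattern list, and the `resets_at:` regex are assumptions based on the stderr line quoted under Evidence, and treating `resets_at` as epoch seconds is also an assumption.

```python
import re
import time
from typing import Optional

# Signatures from the acceptance criteria; the exact list is an assumption.
RATE_LIMIT_PATTERNS = [
    re.compile(r"HTTP 429", re.IGNORECASE),
    re.compile(r"usage limit (has been )?reached", re.IGNORECASE),
    re.compile(r"rate_limit_exceeded", re.IGNORECASE),
]
# Assumes the provider reports the reset time as epoch seconds.
RESETS_AT_RE = re.compile(r"resets_at:\s*(\d+)")
DEFAULT_DELAY_S = 300


def is_rate_limited(stderr: str) -> bool:
    """True if the worker's stderr contains a known rate-limit signature."""
    return any(p.search(stderr) for p in RATE_LIMIT_PATTERNS)


def delayed_retry_until(stderr: str, now: Optional[float] = None) -> Optional[float]:
    """Return the epoch time to retry at, or None if this was not a rate limit."""
    now = time.time() if now is None else now
    if not is_rate_limited(stderr):
        return None  # caller keeps the existing protocol_violation handling
    match = RESETS_AT_RE.search(stderr)
    if match:
        resets_at = float(match.group(1))
        if resets_at > now:
            return resets_at
    # No usable resets_at (missing or already in the past): use the default.
    return now + DEFAULT_DELAY_S


incident_stderr = (
    "HTTP 429: The usage limit has been reached. "
    "plan_type: team, resets_at: 1780861905"
)
retry_at = delayed_retry_until(incident_stderr, now=1780854705.0)
```

A test for the last criterion could feed `incident_stderr` through the dispatcher's exit handling and assert that the task ends up requeued with a retry time rather than blocked.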

## Related

- Builds on: existing v6.4 Fix B (no auto-block on keep_running protocol_violation)
- Memory: `project_marvel_swarm_v6_6_rate_limit.md`
- Banner findings: `~/.hermes/skills/devops/agent-onboarding/_research/banner-v1/`
