fix(cli): non-quiet chat -q rate-limit must exit 75 in kanban workers #10

Merged
jarvis-stark-ops merged 1 commit into main from wt/codex-429-worker-exit-code on Jun 8, 2026

@jarvis-stark-ops

Collaborator

Summary

Why this matters

v6.6 root cause: the Tony and Tchalla re-reviews (t_77ac35a7, t_a30a88db) hit a Codex 429 and exited 0. The dispatcher classified the clean exit as a "protocol violation" and auto-blocked them, so the chain stalled for 2 hours until the quota reset. KANBAN_RATE_LIMIT_EXIT_CODE = 75 and its mapping already existed in cli.py, but only on the quiet (-Q) single-query path. Marvel workers use the non-quiet path (chat -q "work kanban task X"), which never reached the mapping.

Test plan (9/9 passing)

  • None / non-dict / success result → exit 0
  • Failure outside kanban (no env) → exit 1 regardless of reason
  • Rate-limit inside kanban (HERMES_KANBAN_TASK set + failure_reason="rate_limit") → exit 75
  • Billing inside kanban → exit 75 (same recovery story)
  • Other failures inside kanban → exit 1
  • Missing failure_reason field inside kanban → exit 1 (defensive)
  • Rate-limit outside kanban → exit 1 (human CLI gets generic exit)
  • Manual: re-test v6.6 chain after merge, confirm Codex 429 → dispatcher reclassifies as rate_limited (not protocol_violation)

Code-review focus

  1. Behavioral preservation: refactored quiet path's inline mapping into the helper. Tests pin all 4 branches.
  2. _last_run_result stash safety: only consumer is the non-quiet -q branch. Reset at init AND at start of every chat() turn (caught by code review as a must-fix — stale value from previous turn could have leaked).
  3. Exception synthesis in chat(): when chat() catches Exception, sets _last_run_result = {failed:True, error:...} without failure_reason. Helper correctly falls through to exit 1.
  4. sys.exit semantics in non-quiet branch: only call when exit_code != 0 so success cases fall through to normal function return. Asymmetry with quiet branch is intentional (quiet always exits to skip downstream fallthrough at line 16175).
  5. Billing → 75 propagated to non-quiet: desired. Recovery story is identical to rate-limit (dispatcher releases without retry-count increment).
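
Points 3 and 4 can be sketched roughly as follows. This is a minimal illustration, not the code in cli.py: `run_single_query_non_quiet` and its parameters are hypothetical names, and the exit-code helper is passed in as an argument so the snippet stays self-contained.

```python
import sys


def run_single_query_non_quiet(cli, query, exit_code_from_result):
    """Run one chat turn, then exit non-zero only if the stashed result failed.

    `cli` is any object with a chat() method and a `_last_run_result`
    attribute; `exit_code_from_result` is the mapping helper (hypothetical
    parameter, used here instead of a module-level import).
    """
    response = cli.chat(query)
    # chat() returns only the response string, so failure metadata is read
    # from the stash it leaves behind.
    exit_code = exit_code_from_result(getattr(cli, "_last_run_result", None))
    if exit_code != 0:
        sys.exit(exit_code)
    # Success: return normally so the caller's usual shutdown code still runs.
    return response
```

Exiting only on a non-zero code keeps the success path identical to what it was before the change, which is the asymmetry with the quiet branch described in point 4.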

Follow-ups (separate issues, not blocking)

  • _print_exit_summary() shows "Resume this session with:" on rate-limit failure in non-quiet path. Pre-existing UX bug, not introduced by this PR.
  • Non-quiet branch doesn't check HERMES_KANBAN_GOAL_MODE env var — quiet path does. If a goal_mode worker spawns via non-quiet path, goal loop silently skips. Pre-existing.

🤖 Generated with Claude Code

Closes #5.

Problem
The non-quiet `hermes -p <profile> chat -q "..."` path (the actual invocation
pattern Marvel team kanban workers use, e.g. `chat -q "work kanban task X"`)
never applied the kanban EX_TEMPFAIL exit-code mapping. A rate-limited worker
exited 0 by virtue of `cli.chat()` returning cleanly. The dispatcher's reap
classifier then treated rc=0 as a "protocol violation" (worker exited cleanly
without calling `kanban_complete` or `kanban_block`) and auto-blocked the task.
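
The dispatcher code is not part of this change, so the following is only an illustrative sketch of the classification described above. `classify_reap`, the `reported_outcome` flag, and the `ok` / `failed` labels are hypothetical; `rate_limited` and `protocol_violation` are the labels named in the test plan.

```python
KANBAN_RATE_LIMIT_EXIT_CODE = 75  # value stated in this PR (EX_TEMPFAIL)


def classify_reap(returncode, reported_outcome):
    """Classify a finished worker process.

    `reported_outcome` is True if the worker called kanban_complete or
    kanban_block before exiting (hypothetical flag for illustration).
    """
    if returncode == KANBAN_RATE_LIMIT_EXIT_CODE:
        # Temporary failure: requeue without burning the retry counter.
        return "rate_limited"
    if returncode == 0 and not reported_outcome:
        # Clean exit with no completion call: what a rate-limited worker
        # looked like before this fix.
        return "protocol_violation"
    if returncode == 0:
        return "ok"
    return "failed"
```

Under this reading, a rate-limited worker that exits 0 can only land in the `protocol_violation` branch, which is why the fix has to happen on the worker side.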

This was the v6.6 incident root cause. Tony+Tchalla re-reviews (t_77ac35a7,
t_a30a88db) hit Codex 429, exited 0, dispatcher auto-blocked, chain stalled
2 hours until quota reset — even though `KANBAN_RATE_LIMIT_EXIT_CODE = 75`
and the mapping at cli.py existed (just only on the QUIET single-query path).

Solution
1. Extract the exit-code mapping into `_worker_exit_code_from_result(result)`
   at module level. Returns 0 / 1 / 75 per these rules:
     - None or non-dict or success → 0
     - Failure outside a kanban worker → 1
     - Failure inside a worker with failure_reason ∈ {rate_limit, billing} → 75
     - Any other failure inside a worker → 1
2. Refactor the quiet path (~line 16172) to call the helper instead of inline.
3. Make `cli.chat()` stash its run_conversation result on `self._last_run_result`
   so the non-quiet caller can inspect failure metadata after chat() returns
   only the response string. Reset to None at __init__ AND at start of every
   turn — invariant doesn't depend on early-return order (caught by code review).
4. Wire the non-quiet path (~line 16197) to call the helper after chat() returns.
5. Exception path inside chat() synthesizes `{failed:True, error:..., completed:False}`
   so single-query callers still apply mapping (rate-limit branch doesn't fire
   because no failure_reason — correctly falls through to exit 1).

Tests (9/9 passing) — tests/cli/test_worker_exit_code_from_result.py
- None result → 0
- Non-dict result → 0
- Success result → 0
- Failure outside kanban (no HERMES_KANBAN_TASK env) → 1, regardless of reason
- Rate-limit inside kanban → KANBAN_RATE_LIMIT_EXIT_CODE (75)
- Billing inside kanban → 75 (same recovery story)
- Other failures inside kanban → 1
- Missing failure_reason field inside kanban → 1 (defensive)
- Rate-limit without HERMES_KANBAN_TASK env → 1 (human CLI gets generic exit)

Combined with #6 (worker-startup fallback) and #7 (dispatcher heartbeat),
both already merged, the worker-startup → dispatcher-detect → next-retry
loop is now operationally robust:
- #7 — silent stalls detectable via heartbeat JSON
- #6 — primary provider auth crash falls through to fallback chain at startup
- #5 (this) — rate-limit failures exit 75 so dispatcher requeues without
  burning the retry counter

Code-review pre-merge: reviewer caught a stale-stash bug (previous turn's
`_last_run_result` leaking into a downstream consumer if chat() takes an
early return path). Fixed by initializing the stash to None in __init__
AND resetting at the top of each chat() turn — invariant pinned.
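
A rough sketch of the stash invariant, assuming a simplified stand-in class: `ChatSession`, the injected `run_conversation` callable, the empty-message early return, and the `response` key are all hypothetical and only serve to show the two reset points and the synthesized failure dict.

```python
class ChatSession:
    """Hypothetical stand-in for the CLI object; only the stash logic is shown."""

    def __init__(self, run_conversation):
        self._run_conversation = run_conversation
        self._last_run_result = None  # reset point 1: construction

    def chat(self, message):
        self._last_run_result = None  # reset point 2: top of every turn
        if not message.strip():
            # Early return: the stash is already None, so nothing stale leaks.
            return None
        try:
            result = self._run_conversation(message)
        except Exception as exc:
            # Synthesized failure carries no failure_reason, so the exit-code
            # mapping falls through to the generic exit 1.
            self._last_run_result = {
                "failed": True,
                "error": str(exc),
                "completed": False,
            }
            return None
        self._last_run_result = result
        # "response" is an assumed key name for the reply text.
        return result.get("response") if isinstance(result, dict) else None
```

Because the reset happens before any branch, the invariant holds no matter which return path a turn takes.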

Follow-up (separate issues, not blocking)
- `_print_exit_summary()` shows "Resume this session with:" even on rate-limit
  failure in the non-quiet path. Pre-existing; not introduced by this PR.
- Non-quiet branch doesn't check `HERMES_KANBAN_GOAL_MODE` env var (only
  quiet path runs `_run_kanban_goal_loop_q`). If a goal_mode worker ever
  spawns via the non-quiet path, the goal loop silently skips. Pre-existing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jarvis-stark-ops jarvis-stark-ops merged commit bef7089 into main Jun 8, 2026
@jarvis-stark-ops jarvis-stark-ops deleted the wt/codex-429-worker-exit-code branch June 8, 2026 01:16

Development

Successfully merging this pull request may close these issues.

Dispatcher: detect HTTP 429 in worker stderr and delayed-retry instead of protocol_violation auto-block