fix(cli): non-quiet chat -q rate-limit must exit 75 in kanban workers #10
Merged
Closes #5.

Problem

The non-quiet `hermes -p <profile> chat -q "..."` path (the actual invocation pattern Marvel team kanban workers use, e.g. `chat -q "work kanban task X"`) never applied the kanban EX_TEMPFAIL exit-code mapping. A rate-limited worker exited 0 by virtue of `cli.chat()` returning cleanly. The dispatcher's reap classifier then treated rc=0 as a "protocol violation" (worker exited cleanly without calling `kanban_complete` or `kanban_block`) and auto-blocked the task.

This was the v6.6 incident root cause. Tony+Tchalla re-reviews (t_77ac35a7, t_a30a88db) hit Codex 429, exited 0, dispatcher auto-blocked, chain stalled 2 hours until quota reset, even though `KANBAN_RATE_LIMIT_EXIT_CODE = 75` and the mapping at cli.py existed (just only on the QUIET single-query path).

Solution

1. Extract the exit-code mapping into `_worker_exit_code_from_result(result)` at module level. Returns 0 / 1 / 75 per these rules:
   - None or non-dict or success → 0
   - Failure outside a kanban worker → 1
   - Failure inside a worker with failure_reason ∈ {rate_limit, billing} → 75
   - Any other failure inside a worker → 1
2. Refactor the quiet path (~line 16172) to call the helper instead of inline.
3. Make `cli.chat()` stash its run_conversation result on `self._last_run_result` so the non-quiet caller can inspect failure metadata after chat() returns only the response string. Reset to None at `__init__` AND at start of every turn, so the invariant doesn't depend on early-return order (caught by code review).
4. Wire the non-quiet path (~line 16197) to call the helper after chat() returns.
5. Exception path inside chat() synthesizes `{failed:True, error:..., completed:False}` so single-query callers still apply mapping (rate-limit branch doesn't fire because no failure_reason; correctly falls through to exit 1).
Tests (9/9 passing): tests/cli/test_worker_exit_code_from_result.py

- None result → 0
- Non-dict result → 0
- Success result → 0
- Failure outside kanban (no HERMES_KANBAN_TASK env) → 1, regardless of reason
- Rate-limit inside kanban → KANBAN_RATE_LIMIT_EXIT_CODE (75)
- Billing inside kanban → 75 (same recovery story)
- Other failures inside kanban → 1
- Missing failure_reason field inside kanban → 1 (defensive)
- Rate-limit without HERMES_KANBAN_TASK env → 1 (human CLI gets generic exit)

Combined with #6 (worker-startup fallback) and #7 (dispatcher heartbeat), both already merged, the worker-startup → dispatcher-detect → next-retry loop is now operationally robust:

- #7: silent stalls detectable via heartbeat JSON
- #6: primary provider auth crash falls through to fallback chain at startup
- #5 (this): rate-limit failures exit 75 so dispatcher requeues without burning the retry counter

Code-review pre-merge: reviewer caught a stale-stash bug (previous turn's `_last_run_result` leaking into a downstream consumer if chat() takes an early return path). Fixed by initializing the stash to None in `__init__` AND resetting at the top of each chat() turn, which pins the invariant.

Follow-up (separate issues, not blocking)

- `_print_exit_summary()` shows "Resume this session with:" even on rate-limit failure in the non-quiet path. Pre-existing; not introduced by this PR.
- Non-quiet branch doesn't check `HERMES_KANBAN_GOAL_MODE` env var (only quiet path runs `_run_kanban_goal_loop_q`). If a goal_mode worker ever spawns via the non-quiet path, the goal loop silently skips. Pre-existing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
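The stale-stash fix from code review can be illustrated with a small sketch. `ChatSession` and its injected `run_conversation` callable are hypothetical stand-ins for the real CLI object, and the empty-message check is just one example of an early return; only the `_last_run_result` attribute and the shape of the synthesized failure dict come from this PR.

```python
class ChatSession:
    """Hypothetical stand-in for the CLI object, to show the reset invariant."""

    def __init__(self, run_conversation):
        # Assumption: run_conversation returns a dict with result metadata.
        self._run_conversation = run_conversation
        # Reset point 1: the stash exists and is None before any turn runs.
        self._last_run_result = None

    def chat(self, message):
        # Reset point 2: clear at the top of every turn, before any early
        # return, so a previous turn's failure can never leak to the caller.
        self._last_run_result = None
        if not message.strip():
            return None  # early return; the stash stays None
        try:
            result = self._run_conversation(message)
        except Exception as exc:
            # No failure_reason here, so the exit-code helper would treat
            # this as a generic failure rather than a rate limit.
            self._last_run_result = {
                "failed": True,
                "error": str(exc),
                "completed": False,
            }
            return None
        self._last_run_result = result
        # chat() returns only the response string; metadata stays on the stash.
        return result.get("response")
```

Resetting in both places means the caller can read `_last_run_result` after any `chat()` call without reasoning about which return path was taken.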
Summary

- Non-quiet `chat -q "..."` path (the actual invocation pattern Marvel kanban workers use) now applies the kanban EX_TEMPFAIL exit-code mapping. Rate-limited workers exit 75 instead of 0 → dispatcher requeues without burning retry counter or auto-blocking the task.

Why this matters

- v6.6 root cause: Tony+Tchalla re-reviews (t_77ac35a7, t_a30a88db) hit Codex 429, exited 0, dispatcher classified as "protocol violation" → auto-blocked → chain stalled 2 hours until quota reset.
- `KANBAN_RATE_LIMIT_EXIT_CODE = 75` and the mapping existed at cli.py, but only on the QUIET (-Q) single-query path. Marvel workers use the non-quiet path (`chat -q "work kanban task X"`), which never reached the mapping.

Test plan (9/9 passing)

- `rate_limited` (not `protocol_violation`)

Code-review focus

- `_last_run_result` stash safety: only consumer is the non-quiet `-q` branch. Reset at init AND at start of every `chat()` turn (caught by code review as a must-fix; a stale value from the previous turn could have leaked).
- Exception path in `chat()`: when chat() catches Exception, it sets `_last_run_result = {failed:True, error:...}` without `failure_reason`. Helper correctly falls through to exit 1.
- `sys.exit` semantics in non-quiet branch: only call when exit_code != 0 so success cases fall through to normal function return. Asymmetry with quiet branch is intentional (quiet always exits to skip downstream fallthrough at line 16175).

Follow-ups (separate issues, not blocking)

- `_print_exit_summary()` shows "Resume this session with:" on rate-limit failure in non-quiet path. Pre-existing UX bug, not introduced by this PR.
- Non-quiet branch doesn't check the `HERMES_KANBAN_GOAL_MODE` env var; the quiet path does. If a goal_mode worker spawns via the non-quiet path, the goal loop silently skips. Pre-existing.

🤖 Generated with Claude Code
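The `sys.exit` asymmetry noted under Code-review focus can be sketched as below. The function names `finish_non_quiet` and `finish_quiet` are hypothetical and do not appear in cli.py; the sketch only shows the two exit policies side by side, assuming the exit code has already been computed by the helper.

```python
import sys


def finish_non_quiet(exit_code):
    """Hypothetical: exit only on failure; success returns normally."""
    if exit_code != 0:
        sys.exit(exit_code)
    # exit_code == 0: fall through so the caller's normal return path runs.


def finish_quiet(exit_code):
    """Hypothetical: always exit, so no downstream code runs afterwards."""
    sys.exit(exit_code)
```

With this shape, a rate-limited worker on the non-quiet path raises `SystemExit(75)`, while a successful run returns to its caller as it did before the change.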