feat(cli): adaptive retry with model escalation for kanban dispatcher #30620

Open
rodaddy wants to merge 2 commits into NousResearch:main from rodaddy:feat/kanban-adaptive-retry

@rodaddy rodaddy commented May 22, 2026

What does this PR do?

When a kanban task crashes repeatedly, the dispatcher keeps respawning the same agent with the same model, which hits the same wall every time. This PR adds a config-driven model escalation ladder so the dispatcher can try a stronger model on retry (e.g. bump from sonnet4.6-off to sonnet4.6-low on the second attempt).

Also fixes the crash-loop stickiness bug from #30417: tasks that hit the circuit breaker (gave_up) were getting re-promoted by recompute_ready every tick, because all([]) evaluates to True and so the "all parents done" check passes vacuously for parentless tasks. Now parentless gave_up tasks stay blocked until someone explicitly unblocks them. Tasks with parents still auto-recover when their parents finish, which is the intended behavior.
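
The vacuous-truth behavior is easy to reproduce in isolation. The sketch below is illustrative only: `parents_done`, `is_promotable`, the list-of-strings `events` argument, and the `"done"` status value are hypothetical stand-ins, not the actual `recompute_ready()` code or schema. Only the `gave_up` and `unblocked` event names come from this PR's description.

```python
def parents_done(parent_statuses):
    # all() over an empty iterable is True, so a parentless task
    # always looks like "every parent is done".
    return all(status == "done" for status in parent_statuses)


def is_promotable(parent_statuses, events):
    """Hypothetical guard; `events` holds the task's event kinds, oldest first."""
    if not parents_done(parent_statuses):
        return False
    if parent_statuses:
        # Tasks with parents still auto-recover once the parents finish.
        return True
    if "gave_up" not in events:
        return True
    # Parentless task that tripped the circuit breaker: promote only if an
    # explicit "unblocked" event came after the most recent "gave_up".
    last_gave_up = len(events) - 1 - events[::-1].index("gave_up")
    return "unblocked" in events[last_gave_up + 1:]
```

With this guard, `is_promotable([], ["gave_up"])` is false on every tick, while appending an `"unblocked"` event makes the same task eligible again.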

The escalation ladder pattern came from King-Capital/multi-agent-engine's self-healing module, which does model-upgrade-on-retry in a multi-agent orchestration context.

Empty config = no escalation = nothing changes for existing users.

Related Issue

Fixes #30587
Fixes #30417

Type of Change

  • ✨ New feature (non-breaking change that adds functionality)
  • 🐛 Bug fix (non-breaking change that fixes an issue)

Changes Made

  • hermes_cli/config.py -- added kanban.retry_model_escalation config key, defaults to empty dict
  • hermes_cli/kanban_db.py -- dispatch_once() now takes an optional model_escalation dict. When a task has consecutive_failures > 0 and its current model is in the map, the dispatcher writes the escalated model to model_override before spawning
  • hermes_cli/kanban_db.py -- recompute_ready() now checks for a gave_up event on parentless blocked tasks and skips promotion unless there's been an explicit unblocked event since
  • gateway/run.py -- reads retry_model_escalation from kanban config, passes it through to dispatch_once()
  • tests/hermes_cli/test_kanban_db.py -- 8 new tests: 5 for model escalation (empty map, first spawn skipped, top of chain, escalation from None, escalation from set override) and 3 for the crash-loop fix (parentless gave_up stays blocked, unblock re-queues, gave_up with done parents still promotes)
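
For reviewers who want the escalation rule in one place, here is a minimal sketch of the lookup described above. It is not the code in `dispatch_once()`: the function name `next_model` and its signature are assumptions made for illustration, and how the real dispatcher resolves the current model when no override is set is not covered here.

```python
from typing import Optional


def next_model(current_model: Optional[str],
               consecutive_failures: int,
               escalation: dict) -> Optional[str]:
    """Return the model to write to model_override, or None for no change.

    Hypothetical helper mirroring the rule in the PR description.
    """
    if not escalation:
        return None  # empty map: escalation is disabled
    if consecutive_failures <= 0:
        return None  # first spawn: never escalate
    # One step up the ladder; None when the model is not a key in the map
    # (for example, when it is already at the top of the chain).
    return escalation.get(current_model)
```

Each retry moves at most one step, so a task climbs the ladder only as fast as it keeps failing.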

How to Test

  1. Run the kanban tests:
    scripts/run_tests.sh tests/hermes_cli/test_kanban_db.py -q
  2. To test manually, add this to ~/.hermes/config.yaml:
    kanban:
      retry_model_escalation:
        sonnet4.6-off: sonnet4.6-low
        sonnet4.6-low: opus4.6-high
  3. Create a task that will fail and watch the dispatcher logs for kanban dispatcher: model_escalation=...
  4. Check that a gave_up parentless task stays blocked across dispatcher ticks
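
Before testing manually, it can help to see which models a given map will step through. The snippet below is a standalone sketch, not part of this PR: `escalation_path` is a hypothetical helper, and the cycle check is an assumption of the sketch (the PR description does not say whether cyclic maps are rejected).

```python
def escalation_path(start, escalation):
    """Return the models a task would step through, starting from `start`."""
    path = [start]
    seen = {start}
    while path[-1] in escalation:
        nxt = escalation[path[-1]]
        if nxt in seen:
            # Assumption of this sketch: treat a cyclic map as a config error.
            raise ValueError(f"cycle in escalation map at {nxt!r}")
        path.append(nxt)
        seen.add(nxt)
    return path


# The map from step 2 above, written as a Python dict.
ladder = {
    "sonnet4.6-off": "sonnet4.6-low",
    "sonnet4.6-low": "opus4.6-high",
}
print(escalation_path("sonnet4.6-off", ladder))
# ['sonnet4.6-off', 'sonnet4.6-low', 'opus4.6-high']
```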

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: Ubuntu 24.04 (LXC)

Documentation and Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) -- or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys -- or N/A (kanban section not in example config)
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows -- or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide -- or N/A (pure Python dict lookups and SQLite, no platform-specific code)
  • I've updated tool descriptions/schemas if I changed tool behavior -- or N/A

@alt-glitch alt-glitch added labels May 22, 2026: type/feature (New feature or request), type/bug (Something isn't working), P3 (Low: cosmetic, nice to have), comp/cli (CLI entry point, hermes_cli/, setup wizard), comp/gateway (Gateway runner, session dispatch, delivery)
@alt-glitch

Collaborator

Salvage of closed #30608 (same feature + bug fix). Implements #30587, fixes #30417 (crash-loop stickiness). Related: #29732, #29328 (same vacuous-truth root cause, different approaches).

@rodaddy rodaddy force-pushed the feat/kanban-adaptive-retry branch from abd5869 to 0f7d7bd May 24, 2026 14:23
Bilby and others added 2 commits June 1, 2026 14:45
Implements model escalation on consecutive task failures and fixes the
crash-loop stickiness bug (NousResearch#30417).

- Add kanban.retry_model_escalation config key (empty dict default, backward compatible)
- Dispatch logic upgrades model_override per escalation map when consecutive_failures > 0
- Fix recompute_ready so parentless gave_up-blocked tasks stay blocked until explicit unblock
- Tasks with parents still auto-recover when all parents complete
- 8 new tests covering escalation + crash loop scenarios

Closes NousResearch#30587
@rodaddy rodaddy force-pushed the feat/kanban-adaptive-retry branch from 0907949 to 64a8949 June 1, 2026 18:57
rodaddy commented Jun 1, 2026

Author

Maintainer note: I rebased/updated this branch onto current main and resolved the conflicts. Current head is 64a89490077d34dfe5c4147595ef664c869c3600.

Local validation passed:

  • ruff format --check tests/hermes_cli/test_kanban_db.py
  • ruff check tests/hermes_cli/test_kanban_db.py
  • pytest tests/hermes_cli/test_kanban_db.py → 213 passed in 84.73s

GitHub Actions are currently waiting for maintainer approval because this is a fork PR.

2 participants