feat(cli): adaptive retry with model escalation for kanban dispatcher #30620
Open
rodaddy wants to merge 2 commits into
Implements model escalation on consecutive task failures and fixes the crash-loop stickiness bug (NousResearch#30417).

- Add kanban.retry_model_escalation config key (empty dict default, backward compatible)
- Dispatch logic upgrades model_override per escalation map when consecutive_failures > 0
- Fix recompute_ready so parentless gave_up-blocked tasks stay blocked until explicit unblock
- Tasks with parents still auto-recover when all parents complete
- 8 new tests covering escalation + crash loop scenarios

Closes NousResearch#30587
Maintainer note: I rebased/updated this branch onto current Local validation passed:
GitHub Actions are currently waiting for maintainer approval because this is a fork PR.
What does this PR do?
When a kanban task crashes repeatedly, the dispatcher just keeps respawning the same agent with the same model, hitting the same wall. This adds a config-driven model escalation ladder so the dispatcher can try a stronger model on retry (e.g. bump from sonnet4.6-off to sonnet4.6-low on the second attempt).

Also fixes the crash-loop stickiness bug from #30417: tasks that hit the circuit breaker (gave_up) were getting re-promoted by recompute_ready every tick because all([]) == True for parentless tasks. Now parentless gave_up tasks stay blocked until someone explicitly unblocks them. Tasks with parents still auto-recover when their parents finish, which is the intended behavior.

The escalation ladder pattern came from King-Capital/multi-agent-engine's self-healing module, which does model-upgrade-on-retry in a multi-agent orchestration context.

Empty config = no escalation = nothing changes for existing users.
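
For reviewers, the escalation step could be sketched roughly as below. This is an illustration, not the code in hermes_cli/kanban_db.py: the function name next_model, the default_model parameter, and the "placeholder-top-model" rung are all hypothetical, and falling back to a default model when model_override is unset is an assumption about how the "escalation from None" case behaves.

```python
from typing import Dict, Optional

# Example ladder shaped like kanban.retry_model_escalation.
# "sonnet4.6-off" -> "sonnet4.6-low" is the pair named in this PR;
# "placeholder-top-model" is a made-up rung for illustration only.
LADDER: Dict[str, str] = {
    "sonnet4.6-off": "sonnet4.6-low",
    "sonnet4.6-low": "placeholder-top-model",
}


def next_model(
    model_override: Optional[str],
    default_model: str,
    consecutive_failures: int,
    escalation: Dict[str, str],
) -> Optional[str]:
    """Return the model_override to use for the next spawn (hypothetical helper)."""
    # Empty map or first spawn: leave the task untouched.
    if not escalation or consecutive_failures <= 0:
        return model_override
    # Assumption: a task with no override is treated as running the default model.
    current = model_override if model_override is not None else default_model
    # One rung per retry; models missing from the map (top of chain) stay as-is.
    return escalation.get(current, model_override)
```

Because the escalated model is written back to model_override, each further failure climbs one rung from wherever the previous retry left off, and an empty map returns the override unchanged.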
Related Issue
Fixes #30587
Fixes #30417
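
The crash-loop bug in #30417 comes down to all() returning True for an empty list, so a parentless task always looks like "all parents done". A minimal sketch of the fixed promotion check follows, assuming an in-memory Task with a list of parent statuses and an oldest-first event list; the Task class and should_promote name are hypothetical stand-ins, not the actual schema or code in recompute_ready().

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    # Hypothetical in-memory stand-in for a kanban task row.
    status: str
    parent_statuses: List[str] = field(default_factory=list)
    events: List[str] = field(default_factory=list)  # oldest first


def should_promote(task: Task) -> bool:
    """Decide whether a blocked task may move back to ready (illustrative only)."""
    if task.status != "blocked":
        return False
    # all([]) is True, so this check alone lets every parentless task through.
    if not all(s == "done" for s in task.parent_statuses):
        return False
    # Tasks with parents keep auto-recovering once all parents are done.
    if task.parent_statuses:
        return True
    # Parentless task: a gave_up event pins it until a later unblocked event.
    if "gave_up" not in task.events:
        return True
    last_gave_up = len(task.events) - 1 - task.events[::-1].index("gave_up")
    return "unblocked" in task.events[last_gave_up + 1:]
```

The check deliberately looks only at events after the most recent gave_up, so an unblock from an earlier cycle does not release a task that has since tripped the circuit breaker again.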
Type of Change
Changes Made
- hermes_cli/config.py -- added kanban.retry_model_escalation config key, defaults to empty dict
- hermes_cli/kanban_db.py -- dispatch_once() now takes an optional model_escalation dict. When a task has consecutive_failures > 0 and its current model is in the map, the dispatcher writes the escalated model to model_override before spawning
- hermes_cli/kanban_db.py -- recompute_ready() now checks for a gave_up event on parentless blocked tasks and skips promotion unless there's been an explicit unblocked event since
- gateway/run.py -- reads retry_model_escalation from kanban config, passes it through to dispatch_once()
- tests/hermes_cli/test_kanban_db.py -- 8 new tests: 5 for model escalation (empty map, first spawn skipped, top of chain, escalation from None, escalation from set override) and 3 for the crash-loop fix (parentless gave_up stays blocked, unblock re-queues, gave_up with done parents still promotes)

How to Test
- ~/.hermes/config.yaml:
- kanban dispatcher: model_escalation=...
- gave_up parentless task stays blocked across dispatcher ticks

Checklist
Code
- fix(scope):, feat(scope):, etc.)
- pytest tests/ -q and all tests pass

Documentation and Housekeeping
- docs/, docstrings) -- or N/A
- cli-config.yaml.example if I added/changed config keys -- or N/A (kanban section not in example config)
- CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows -- or N/A