fix(kanban): keep gave_up tasks blocked #29328

Open
LaPelice wants to merge 1 commit into NousResearch:main from LaPelice:fix/kanban-gave-up-sticky

Conversation

@LaPelice

Summary

  • Keep gave_up circuit-breaker blocks sticky in recompute_ready() until an explicit unblock.
  • Preserve dependency-only/direct DB blocked rows auto-promoting when parents are done.
  • Add a regression test for the crash -> gave_up -> promoted -> respawn loop seen on Kanban task t_8a40b49a.
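The promotion rule summarized above can be sketched roughly as follows. This is a minimal illustration, not the code in hermes_cli/kanban_db.py: the function names, the event-kind strings other than gave_up, and the oldest-first event ordering are all assumptions made for the example. It models only the gave_up rule and does not attempt to capture how other block events are treated.

```python
from typing import Iterable, Optional

# Hypothetical set of block-state event kinds. Only "gave_up" is named in this PR;
# "blocked" and "unblocked" are assumed names for illustration.
BLOCK_STATE_KINDS = {"blocked", "gave_up", "unblocked"}


def latest_block_state(event_kinds: Iterable[str]) -> Optional[str]:
    """Return the most recent block-state event kind, or None if there is none.

    Assumes event_kinds is ordered oldest-first.
    """
    latest = None
    for kind in event_kinds:
        if kind in BLOCK_STATE_KINDS:
            latest = kind
    return latest


def may_auto_promote(parents_done: bool, event_kinds: Iterable[str]) -> bool:
    """Decide whether a blocked task may be promoted back to ready.

    A task whose latest block-state event is gave_up stays blocked until an
    explicit unblock; a dependency-only block (no block-state events) is
    promoted once all parents are done.
    """
    if not parents_done:
        return False
    return latest_block_state(event_kinds) != "gave_up"
```

Under these assumptions, a task with events ["crashed", "gave_up"] stays blocked even when its parents are done, while a task with no block-state events is promoted as before.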

Verification

  • RED: python -m pytest tests/hermes_cli/test_kanban_db.py::test_recompute_ready_keeps_gave_up_circuit_breaker_blocked -q -> failed before fix (promoted == 1).
  • GREEN: python -m pytest tests/hermes_cli/test_kanban_db.py::test_recompute_ready_keeps_gave_up_circuit_breaker_blocked tests/hermes_cli/test_kanban_db.py::test_recompute_ready_promotes_blocked_with_done_parents -q -> 2 passed.
  • python -m ruff check hermes_cli/kanban_db.py tests/hermes_cli/test_kanban_db.py -> passed.
  • python -m pytest tests/hermes_cli/test_kanban_db.py -q -> 159 passed.

Risks

  • No secrets/config/prod changes.
  • No Supabase/migration impact.
  • Behavior change is limited to blocked tasks whose latest block-state event is gave_up; dependency-only blocks without blocked/gave_up events still auto-promote.

@vice-magus-faolan

This looks like it addresses another real repro I hit as well.

I observed a parented task repeatedly cycling through:

crash/protocol-violation -> gave_up -> promoted -> respawn

for hundreds of runs before intervention.

Notably:

  • the task was not parentless
  • the gave_up events were real circuit-breaker events, with payloads showing values like:
    • failures: 1
    • effective_limit: 1
    • trigger_outcome: crashed
  • despite that, the task was still promoted back to runnable state
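To make the payload values concrete, here is a small sketch of how a circuit breaker with these fields might decide to trip. The comparison `failures >= effective_limit` is an assumption chosen to be consistent with the observed payload (failures: 1, effective_limit: 1); the real threshold logic may differ, and the function name is hypothetical.

```python
def circuit_breaker_trips(failures: int, effective_limit: int) -> bool:
    """Return True when the failure count has reached the limit (assumed rule)."""
    return failures >= effective_limit


# Payload shaped like the gave_up events described above.
payload = {"failures": 1, "effective_limit": 1, "trigger_outcome": "crashed"}
tripped = circuit_breaker_trips(payload["failures"], payload["effective_limit"])
```

With an effective limit of 1, a single crash is enough to trip the breaker, so every respawn that crashed again would emit a fresh gave_up event.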

The crash mix included:

  • nonzero worker exits
  • “worker not alive” outcomes
  • clean exits that became protocol violations because the worker never called kanban_complete or kanban_block
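The three crash categories above could be distinguished along these lines. This is only a sketch: the function, its parameters, and the returned labels are hypothetical, and it assumes that a missing exit code means the worker was found not alive.

```python
from typing import Optional


def classify_worker_outcome(exit_code: Optional[int], called_terminal_tool: bool) -> str:
    """Classify how a worker run ended (illustrative labels only).

    exit_code: process exit status, or None if the worker was found not alive
        without a recorded status (assumption).
    called_terminal_tool: whether the worker called kanban_complete or
        kanban_block before exiting.
    """
    if exit_code is None:
        return "worker_not_alive"
    if exit_code != 0:
        return "crashed"
    if not called_terminal_tool:
        # Clean exit, but the task was never completed or blocked.
        return "protocol_violation"
    return "ok"
```

Only the last branch is a healthy run; the other three all count as failures that can feed the circuit breaker.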

So this PR’s approach — keeping gave_up tasks blocked until explicit unblock — matches the failure pattern I saw pretty closely.

From my side, this seems like the right fix direction, and the regression coverage is valuable because it matches observed field behavior beyond just the original repro.


Labels

  • comp/cli: CLI entry point, hermes_cli/, setup wizard
  • P3: Low (cosmetic, nice to have)
  • type/bug: Something isn't working
