
fix(kanban): retry corrupt-board dispatch after quarantine (salvage #33263) #33412

Merged
teknium1 merged 2 commits into main from hermes/hermes-96bea7da on May 27, 2026


@teknium1

Contributor

Summary

Salvage of #33263 by @donovan-yohan onto current main.

Changes the gateway-embedded Kanban dispatcher's corrupt-board cache from indefinite suppression to a 5-minute quarantine. After the quarantine expires, the dispatcher retries the same board even when the DB file fingerprint hasn't changed, while still avoiding per-tick retry/log/backup spam. Also catches KanbanDbCorruptError (a RuntimeError) in addition to the raw sqlite3.DatabaseError corrupt-board path.

This is a recovery-layer mitigation, scoped to making the gateway un-stick itself instead of staying wedged on a single board fingerprint until restart. It is not a fix for whatever is actually corrupting the DB.
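The quarantine behaviour described above can be sketched roughly as follows. This is an illustrative model only, not the code in gateway/run.py: the `CorruptBoardQuarantine` class, its method names, and the injectable clock are hypothetical, and the "changed fingerprint retries immediately" branch is an assumption inferred from the summary. Only the 300-second constant and the `(fingerprint, monotonic_at)` tuple shape come from this PR.

```python
import time
from typing import Callable, Dict, Tuple

CORRUPT_BOARD_RETRY_AFTER_SECONDS = 300  # value stated in this PR


class CorruptBoardQuarantine:
    """Hypothetical helper; the PR implements this inline in the dispatcher."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # board id -> (fingerprint, monotonic time the board was quarantined)
        self._entries: Dict[str, Tuple[str, float]] = {}

    def quarantine(self, board: str, fingerprint: str) -> None:
        """Record that dispatch failed on this board with this DB fingerprint."""
        self._entries[board] = (fingerprint, self._clock())

    def should_skip(self, board: str, fingerprint: str) -> bool:
        """Return True while the board should be left alone on this tick."""
        entry = self._entries.get(board)
        if entry is None:
            return False
        quarantined_fp, quarantined_at = entry
        if fingerprint != quarantined_fp:
            # Assumption: a changed DB file is retried immediately.
            del self._entries[board]
            return False
        if self._clock() - quarantined_at >= CORRUPT_BOARD_RETRY_AFTER_SECONDS:
            # Quarantine expired: retry even though the fingerprint is unchanged.
            del self._entries[board]
            return False
        return True
```

Using a monotonic clock keeps the expiry immune to wall-clock adjustments, and making the clock injectable lets a test advance time without sleeping.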

Closes #33263. Related: #32543, #32593.

Changes

  • gateway/run.py: disabled_corrupt_boards now stores (fingerprint, monotonic_at) instead of just the fingerprint; the tick retries once CORRUPT_BOARD_RETRY_AFTER_SECONDS = 300 seconds have elapsed, even when the fingerprint is unchanged. _is_corrupt_board_db_error now recognises KanbanDbCorruptError. Added a separate except Exception branch that re-routes corrupt-guard raises through the same quarantine path.
  • tests/hermes_cli/test_kanban_core_functionality.py: parametrised the existing disable-without-traceback test over sqlite3.DatabaseError and KanbanDbCorruptError. Added test_gateway_dispatcher_retries_corrupt_board_after_quarantine.
  • scripts/release.py: AUTHOR_MAP entry for @donovan-yohan.
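As a rough illustration of the gateway/run.py change, the sketch below shows one way the error classification and the two except branches could fit together. It is an assumption-laden stand-in, not the merged code: `KanbanDbCorruptError` is redefined locally (only its RuntimeError base is taken from this PR), matching on SQLite message text is a guess at how the real `_is_corrupt_board_db_error` decides, and `run_board_tick` with its `dispatch` / `on_corrupt` callbacks is hypothetical.

```python
import sqlite3
from typing import Callable


class KanbanDbCorruptError(RuntimeError):
    """Local stand-in for the project's exception type."""


# Assumed markers: the messages SQLite uses for SQLITE_CORRUPT and SQLITE_NOTADB.
_CORRUPT_MARKERS = ("database disk image is malformed", "file is not a database")


def is_corrupt_board_db_error(exc: BaseException) -> bool:
    """Classify an exception as 'this board's DB is corrupt'."""
    if isinstance(exc, KanbanDbCorruptError):
        return True
    if isinstance(exc, sqlite3.DatabaseError):
        message = str(exc).lower()
        return any(marker in message for marker in _CORRUPT_MARKERS)
    return False


def run_board_tick(
    dispatch: Callable[[], None],
    on_corrupt: Callable[[BaseException], None],
) -> None:
    """Run one dispatch attempt, routing corrupt-DB failures to quarantine."""
    try:
        dispatch()
    except sqlite3.DatabaseError as exc:
        # Raw sqlite3 corrupt-board path.
        if not is_corrupt_board_db_error(exc):
            raise
        on_corrupt(exc)
    except Exception as exc:
        # Separate branch: corrupt-guard raises such as KanbanDbCorruptError
        # go through the same quarantine path; anything else propagates.
        if not is_corrupt_board_db_error(exc):
            raise
        on_corrupt(exc)
```

In this sketch, unrelated failures such as a locked database still propagate, so the quarantine only absorbs errors that are classified as corruption.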

Validation

  • scripts/run_tests.sh tests/hermes_cli/test_kanban_core_functionality.py — 167/167 passing.
  • Targeted: pytest tests/hermes_cli/test_kanban_core_functionality.py -k "gateway_dispatcher_disables_corrupt_board_without_traceback or gateway_dispatcher_retries_corrupt_board_after_quarantine" — 3 passing (parametrize × 2 + new test).
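For readers without the test file at hand, the property the new test name suggests (a quarantined board is retried after the window, not on every tick) can be modelled with a small simulation. This is a hypothetical sketch and not the actual test: `simulate_ticks`, the single "board" key, and the always-failing dispatch are illustrative assumptions, and `RETRY_AFTER` simply mirrors the 300-second constant from this PR.

```python
import sqlite3
from typing import Dict, List, Tuple

RETRY_AFTER = 300.0  # mirrors CORRUPT_BOARD_RETRY_AFTER_SECONDS from this PR


def simulate_ticks(tick_times: List[float], fingerprint: str = "fp") -> List[float]:
    """Return the tick times at which a permanently corrupt board was attempted."""
    disabled: Dict[str, Tuple[str, float]] = {}
    attempts: List[float] = []
    for now in tick_times:
        entry = disabled.get("board")
        if entry and entry[0] == fingerprint and now - entry[1] < RETRY_AFTER:
            continue  # still quarantined: no retry on this tick
        attempts.append(now)
        try:
            # Stand-in for a dispatch that keeps hitting a corrupt DB file.
            raise sqlite3.DatabaseError("database disk image is malformed")
        except sqlite3.DatabaseError:
            disabled["board"] = (fingerprint, now)
    return attempts


# Ticks inside the window are skipped; the board is retried at each expiry.
print(simulate_ticks([0, 60, 120, 299, 300, 301, 600]))  # [0, 300, 600]
```

Passing explicit tick times instead of reading a real clock keeps the simulation deterministic.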

Note on the wider cluster

This PR addresses the recovery dimension only. Open separately:

@github-actions

Contributor

🔎 Lint report: hermes/hermes-96bea7da vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 9508 on HEAD, 9507 on base (🆕 +1)

🆕 New issues (1):

| Rule | Count |
| --- | --- |
| unresolved-attribute | 1 |

First entries:

tests/hermes_cli/test_kanban_core_functionality.py:3736: [unresolved-attribute] unresolved-attribute: Attribute `f_back` is not defined on `None` in union `FrameType | None`

✅ Fixed issues: none

Unchanged: 5006 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.
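The unresolved-attribute diagnostic above is the kind a type checker reports when `f_back` is read from a value typed `FrameType | None`, for example the result of `inspect.currentframe()`. What the test at line 3736 actually does is not shown here, so the following is only a generic sketch of how such a warning is usually resolved; `caller_name` and `example_caller` are hypothetical names.

```python
import inspect


def caller_name() -> str:
    """Return the name of the calling function, or a placeholder if unavailable."""
    # inspect.currentframe() is typed as returning FrameType | None.
    frame = inspect.currentframe()
    # Narrow each optional step before reading the next attribute.
    caller = frame.f_back if frame is not None else None
    return caller.f_code.co_name if caller is not None else "<unknown>"


def example_caller() -> str:
    # On CPython this returns "example_caller".
    return caller_name()
```

Binding each frame to a local before the None check lets the checker narrow the type, which attribute chains like `frame.f_back.f_back` do not always allow.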

@teknium1 merged commit 5deb384 into main May 27, 2026
25 checks passed
@teknium1 deleted the hermes/hermes-96bea7da branch May 27, 2026 18:48
@alt-glitch added labels May 27, 2026: type/bug (Something isn't working), comp/gateway (Gateway runner, session dispatch, delivery), P3 (Low: cosmetic, nice to have)
