fix(kanban): retry corrupt-board dispatch after quarantine #33263

Closed
donovan-yohan wants to merge 1 commit into NousResearch:main from donovan-yohan:fix/kanban-corrupt-quarantine-retry

@donovan-yohan

Contributor

Summary

  • change the gateway-embedded Kanban dispatcher corrupt-board cache from indefinite suppression to a 5-minute quarantine
  • retry dispatch after the quarantine even when the DB fingerprint is unchanged, while still avoiding per-tick retry/log/backup spam
  • include the latest per-board DB dispatch issue in stuck-dispatch warnings
  • cover both raw sqlite3.DatabaseError and KanbanDbCorruptError corrupt-board paths
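The quarantine behaviour described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the class name `CorruptBoardQuarantine`, its methods, and the string fingerprint are hypothetical stand-ins for the gateway's `disabled_corrupt_boards` cache, and only the 5-minute window comes from the summary.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict

QUARANTINE_SECONDS = 5 * 60  # the 5-minute window named in the summary


@dataclass
class QuarantineEntry:
    fingerprint: str    # DB fingerprint observed when the failure was recorded
    recorded_at: float  # clock reading at the time of that failure


class CorruptBoardQuarantine:
    """Hypothetical stand-in for the gateway's disabled_corrupt_boards cache."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: Dict[str, QuarantineEntry] = {}

    def record_failure(self, board: str, fingerprint: str) -> None:
        # Start (or restart) the quarantine window for this board.
        self._entries[board] = QuarantineEntry(fingerprint, self._clock())

    def should_skip(self, board: str, fingerprint: str) -> bool:
        entry = self._entries.get(board)
        if entry is None:
            return False
        changed = entry.fingerprint != fingerprint
        expired = self._clock() - entry.recorded_at >= QUARANTINE_SECONDS
        if changed or expired:
            # Retry: either the DB file changed or the quarantine ran out.
            del self._entries[board]
            return False
        # Still quarantined: skip without retrying, logging, or backing up.
        return True
```

Within the window every tick is skipped, so there is no per-tick retry; once the window elapses the next tick retries even if the fingerprint is identical, and a repeated failure would simply call `record_failure` again to start a new window.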

Why

A transient SQLite open/WAL failure can look like “database disk image is malformed”. Today the gateway records the board fingerprint in disabled_corrupt_boards and skips that board until the DB file changes or the gateway restarts. If the file does not change, ready tasks can sit forever with only generic “dispatcher stuck” warnings.

This is a mitigation, not the root-cause fix for the underlying WAL/corruption reports. It limits the blast radius: no hot-looping, but also no permanent process-local suppression.
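To illustrate how both corrupt-board paths could feed the stuck-dispatch warning, here is a hedged sketch. The helpers `try_dispatch` and `stuck_warning`, the `last_issue` mapping, and the warning format are all assumptions made for illustration; `KanbanDbCorruptError` is redefined locally as a placeholder because its real definition and base class are not shown in this PR description.

```python
import sqlite3
from typing import Callable, Dict


class KanbanDbCorruptError(Exception):
    """Local placeholder; the project's real class may be defined differently."""


def try_dispatch(
    board: str,
    dispatch: Callable[[str], None],
    last_issue: Dict[str, str],
) -> bool:
    """Run one dispatch attempt and remember the latest DB issue per board."""
    try:
        dispatch(board)
    except (sqlite3.DatabaseError, KanbanDbCorruptError) as exc:
        # Both the raw sqlite3 error and the wrapped error take the same path.
        last_issue[board] = f"{type(exc).__name__}: {exc}"
        return False
    # A successful dispatch clears any stale issue for this board.
    last_issue.pop(board, None)
    return True


def stuck_warning(board: str, ready_tasks: int, last_issue: Dict[str, str]) -> str:
    """Build a stuck-dispatch warning that names the latest DB issue, if any."""
    msg = f"dispatcher stuck: board={board} ready_tasks={ready_tasks}"
    issue = last_issue.get(board)
    if issue:
        msg += f" last_db_issue={issue!r}"
    return msg
```

Because `sqlite3.OperationalError` is a subclass of `sqlite3.DatabaseError`, catching the latter also covers open and locking failures raised as operational errors.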

Related: #32543, #32593.

Verification

  • python -m pytest tests/hermes_cli/test_kanban_core_functionality.py::test_gateway_dispatcher_disables_corrupt_board_without_traceback tests/hermes_cli/test_kanban_core_functionality.py::test_gateway_dispatcher_retries_corrupt_board_after_quarantine -q -o 'addopts='
  • ./scripts/run_tests.sh tests/hermes_cli/test_kanban_core_functionality.py
  • git diff --check

@donovan-yohan force-pushed the fix/kanban-corrupt-quarantine-retry branch from 6f1c108 to 7a83830 on May 27, 2026 14:12
@alt-glitch added the labels type/bug (Something isn't working), P3 (Low: cosmetic, nice to have), and comp/gateway (Gateway runner, session dispatch, delivery) on May 27, 2026
@alt-glitch

Collaborator

Part of the kanban corruption cluster (#26479 canonical). Related: #32094 (broader quarantine hardening), #31226 (disk I/O quarantine extension), #31740 (fail-closed serialization). This PR narrows the scope to quarantine TTL + KanbanDbCorruptError coverage.

@teknium1

Contributor

Merged via #33412. Your commit was cherry-picked onto current main with your authorship preserved — see c94ad8981 in git log (Donovan Yohan <donovan-yohan@users.noreply.github.com>). The salvage PR rebase-merged so each commit kept its original author.

This addresses the recovery-layer dimension — gateway dispatcher un-sticks itself after the 5-minute quarantine. The deeper corruption-source and backup-file-churn questions you raised are being tracked separately in the wider PR/issue cluster (#32543, #32593, #33169, #32094, #33319 etc.).

Thanks for the careful root-cause analysis in the report and for not over-claiming the TLS-FD-recycle hypothesis — that's exactly the right framing.

mathias3 pushed a commit to mathias3/hermes-agent that referenced this pull request May 28, 2026
Bryce-huang pushed a commit to wbkunlun/hermes-agent that referenced this pull request May 29, 2026
mosaiq-systems pushed a commit to mosaiq-systems/hermes-agent that referenced this pull request May 29, 2026
KKT-OPT pushed a commit to KKT-OPT/hermes-agent that referenced this pull request May 31, 2026
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026

3 participants