fix: quarantine kanban boards after sqlite disk I/O errors #31226

Closed
thebryce15 wants to merge 1 commit into NousResearch:main from thebryce15:t_31bf63cd/dispatch-db-io-reap
Conversation

@thebryce15

Summary

  • quarantine kanban board DB fingerprints on repeated sqlite disk I/O errors in dispatcher ticks
  • suppress per-tick traceback spam while the bad DB fingerprint remains unchanged
  • add regression coverage for repeated disk I/O and corrupt-board handling
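
The first two bullets can be illustrated with a minimal sketch. It is not taken from the PR's diff: `db_fingerprint`, `BoardQuarantine`, and `run_tick` are hypothetical names, the fingerprint is assumed to be (inode, size, mtime), and the failure threshold is an arbitrary choice. A board is latched after repeated "disk I/O error" failures, only the latching failure is logged, and the latch is released once the file's fingerprint changes.

```python
import os
import sqlite3
from typing import Callable, Dict, Optional, Tuple

# Assumed fingerprint shape: (inode, size in bytes, mtime in nanoseconds).
Fingerprint = Tuple[int, int, int]


def db_fingerprint(path: str) -> Optional[Fingerprint]:
    """Return a cheap identity for the DB file, or None if it cannot be stat'ed."""
    try:
        st = os.stat(path)
    except OSError:
        return None
    return (st.st_ino, st.st_size, st.st_mtime_ns)


class BoardQuarantine:
    """Hypothetical latch: quarantine a board after `threshold` consecutive failures."""

    def __init__(self, threshold: int = 2) -> None:
        self.threshold = threshold
        self.failures: Dict[str, int] = {}
        self.latched: Dict[str, Optional[Fingerprint]] = {}

    def is_quarantined(self, path: str) -> bool:
        if path not in self.latched:
            return False
        if self.latched[path] == db_fingerprint(path):
            return True
        # The file changed since it was latched, so give it another chance.
        del self.latched[path]
        self.failures.pop(path, None)
        return False

    def record_failure(self, path: str) -> bool:
        """Count a failure; return True only when this failure latches the board."""
        count = self.failures.get(path, 0) + 1
        self.failures[path] = count
        if count < self.threshold:
            return False
        self.latched[path] = db_fingerprint(path)
        return True


def run_tick(q: BoardQuarantine, path: str,
             work: Callable[[str], None], log: Callable[[str], None]) -> None:
    """One dispatcher tick for one board; quarantined boards are skipped silently."""
    if q.is_quarantined(path):
        return
    try:
        work(path)
    except sqlite3.OperationalError as exc:
        if "disk i/o error" not in str(exc).lower():
            raise
        if q.record_failure(path):
            log(f"quarantined {path}: {exc}")  # one line, no per-tick traceback
    else:
        q.failures.pop(path, None)  # a clean tick resets the failure count
```

With `threshold=2`, five failing ticks against an unchanged file call `work` twice and log once; the remaining ticks are skipped until the fingerprint changes.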

Test Plan

  • pytest -q tests/hermes_cli/test_kanban_core_functionality.py -k "disk_io_error_without_traceback or corrupt_board_without_traceback"
  • pytest -q tests/hermes_cli/test_kanban_core_functionality.py tests/hermes_cli/test_kanban_notify.py

@alt-glitch added the type/bug, P3, comp/gateway, and comp/plugins labels on May 24, 2026
@alt-glitch

Collaborator

Extends merged #28439 (quarantine corrupt kanban DBs for #26479) to also cover disk I/O error exceptions, not just "file is not a database" / "database disk image is malformed". Related: #30908 (index corruption), #31158 (WAL/SHM cache poisoning).

@thebryce15

Author

@ddblue0's testing in #31736 characterized this as "works as a mitigation" rather than a root fix, and that framing is accurate. The persistent-connection refactor in that thread is the real root direction; this is defensive depth for when I/O errors still slip through. I won't be driving review follow-up myself, so I'm deferring disposition to maintainers. Keeping it as a mitigation, closing it in favor of the root fix, or merging it alongside the root fix are all fine on my end.

@kshitijk4poor

Collaborator

Closing as superseded — different direction than the one that landed. This PR proposes latching the dispatcher when SQLite raises "disk i/o error", but #33482 commit 5c49cd0ed (fix(state): never silently downgrade WAL to DELETE on transient EIO) specifically removed EIO from the WAL-incompatibility marker list because it's transient (page-cache pressure, brief lock contention, recoverable storage hiccups) — not a permanent filesystem property. Latching the dispatcher on EIO would re-pause healthy boards after a one-tick I/O blip.

The corrupt-board latch already exists on main via c94ad8981 (5-minute retry quarantine with fingerprint-change retry). A follow-up issue (#33486) refines that with exponential backoff + PRAGMA quick_check confirmation to distinguish transient errors from real corruption before latching. Thanks for tackling the same problem space.
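
As a rough illustration of the idea behind that follow-up (not the design in #33486, whose details are not shown here), a confirmation step could run `PRAGMA quick_check` before latching and retry with a capped exponential delay otherwise. The names `confirm_corruption` and `backoff_delay`, and the base and cap values, are assumptions made for this sketch.

```python
import sqlite3
from typing import Optional


def confirm_corruption(path: str) -> Optional[bool]:
    """Return True if corruption is confirmed, False if the DB checks out,
    and None if the check itself failed in a way that may be transient."""
    try:
        conn = sqlite3.connect(path)
        try:
            rows = conn.execute("PRAGMA quick_check").fetchall()
        finally:
            conn.close()
    except sqlite3.OperationalError:
        # e.g. "disk I/O error" or "database is locked": inconclusive, retry later.
        return None
    except sqlite3.DatabaseError:
        # e.g. "file is not a database" or "database disk image is malformed".
        return True
    # quick_check reports a single row containing "ok" for a healthy database.
    return rows != [("ok",)]


def backoff_delay(attempt: int, base: float = 5.0, cap: float = 300.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based), doubling up to `cap`."""
    return min(cap, base * 2 ** attempt)
```

Under this sketch a dispatcher would latch a board only when `confirm_corruption` returns `True`, and would treat `None` as a reason to wait `backoff_delay(attempt)` seconds and try again rather than pause the board.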
