fix: quarantine kanban boards after sqlite disk I/O errors by thebryce15 · Pull Request #31226 · NousResearch/hermes-agent

thebryce15 · 2026-05-24T01:06:52Z

Summary

quarantine kanban board DB fingerprints on repeated sqlite disk I/O errors in dispatcher ticks
suppress per-tick traceback spam while the bad DB fingerprint remains unchanged
add regression coverage for repeated disk I/O and corrupt-board handling
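
To make the summary concrete, here is a minimal sketch of a fingerprint-keyed quarantine. It is not the code in this PR: `BoardQuarantine`, `db_fingerprint`, `is_disk_io_error`, and `tick` are hypothetical names, the (size, mtime) fingerprint is an assumed choice, and for brevity it latches on the first disk I/O error rather than on repeated ones.

```python
import os
import sqlite3
from typing import Callable, Dict, Optional, Tuple

# Assumption: a fingerprint is (size in bytes, mtime in ns). The real PR may differ.
Fingerprint = Tuple[int, int]


def db_fingerprint(path: str) -> Optional[Fingerprint]:
    """Return a cheap identity for the DB file, or None if it cannot be stat'ed."""
    try:
        st = os.stat(path)
    except OSError:
        return None
    return (st.st_size, st.st_mtime_ns)


def is_disk_io_error(exc: BaseException) -> bool:
    """Match the sqlite3 error whose message contains 'disk I/O error'."""
    return (isinstance(exc, sqlite3.OperationalError)
            and "disk i/o error" in str(exc).lower())


class BoardQuarantine:
    """Remembers the fingerprint each board had when it was quarantined."""

    def __init__(self) -> None:
        self._bad: Dict[str, Optional[Fingerprint]] = {}

    def quarantine(self, path: str) -> None:
        self._bad[path] = db_fingerprint(path)

    def should_skip(self, path: str) -> bool:
        if path not in self._bad:
            return False
        if self._bad[path] == db_fingerprint(path):
            return True  # same bad file: stay quiet, do not retry
        del self._bad[path]  # file changed: release the latch and retry
        return False


def tick(path: str, quarantine: BoardQuarantine,
         work: Callable[[str], None], log: Callable[[str], None]) -> bool:
    """Run one dispatcher tick; log once when a board is quarantined."""
    if quarantine.should_skip(path):
        return False
    try:
        work(path)
    except sqlite3.Error as exc:
        if is_disk_io_error(exc):
            quarantine.quarantine(path)
            log(f"quarantined {path}: {exc}")  # one line, no traceback
            return False
        raise
    return True
```

In this sketch, a tick that hits the same unchanged file again returns early without calling `work` or logging, which is what suppresses the per-tick traceback spam.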

Test Plan

pytest -q tests/hermes_cli/test_kanban_core_functionality.py -k "disk_io_error_without_traceback or corrupt_board_without_traceback"
pytest -q tests/hermes_cli/test_kanban_core_functionality.py tests/hermes_cli/test_kanban_notify.py

alt-glitch · 2026-05-24T01:27:44Z

Extends merged #28439 (quarantine corrupt kanban DBs for #26479) to also cover disk I/O error exceptions, not just "file is not a database" / "database disk image is malformed". Related: #30908 (index corruption), #31158 (WAL/SHM cache poisoning).

thebryce15 · 2026-05-26T20:03:13Z

@ddblue0's testing in #31736 characterized this as "works as a mitigation" rather than a root fix, and that framing is accurate. The persistent-connection refactor in that thread is the real root direction; this is defensive depth for when I/O errors still slip through. I won't be driving review follow-up myself, so I'm deferring disposition to the maintainers. Keeping it as a mitigation, closing it in favor of the root fix, or merging it alongside the root fix are all fine on my end.

kshitijk4poor · 2026-05-28T06:40:02Z

Closing as superseded — different direction than the one that landed. This PR proposes latching the dispatcher when SQLite raises "disk i/o error", but #33482 commit 5c49cd0ed (fix(state): never silently downgrade WAL to DELETE on transient EIO) specifically removed EIO from the WAL-incompatibility marker list because it's transient (page-cache pressure, brief lock contention, recoverable storage hiccups) — not a permanent filesystem property. Latching the dispatcher on EIO would re-pause healthy boards after a one-tick I/O blip.

The corrupt-board latch already exists on main via c94ad8981 (5-minute retry quarantine with fingerprint-change retry). A follow-up issue (#33486) refines that with exponential backoff + PRAGMA quick_check confirmation to distinguish transient errors from real corruption before latching. Thanks for tackling the same problem space.
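
As a rough illustration of the direction described for #33486, the sketch below pairs a `PRAGMA quick_check` confirmation with a capped exponential backoff. It is not the implementation on main or in that issue: `confirm_corruption` and `next_retry_delay` are hypothetical names, and the 1-second base and 300-second cap are assumptions (the cap only echoes the 5-minute retry mentioned above).

```python
import sqlite3


def confirm_corruption(path: str) -> bool:
    """Return True only when the database itself looks damaged.

    A disk I/O error raised during the check is treated as inconclusive
    (possibly transient), so it does not count as confirmed corruption.
    """
    try:
        conn = sqlite3.connect(path)
        try:
            rows = conn.execute("PRAGMA quick_check").fetchall()
        finally:
            conn.close()
    except sqlite3.DatabaseError as exc:
        # e.g. "file is not a database" or "database disk image is malformed"
        return "disk i/o error" not in str(exc).lower()
    # A healthy database reports a single row containing "ok".
    return rows != [("ok",)]


def next_retry_delay(failures: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Seconds to wait before retrying after `failures` consecutive errors."""
    return min(cap, base * 2 ** max(0, failures - 1))
```

Under these assumptions, a dispatcher would latch a board only when `confirm_corruption` returns True, and would otherwise retry after `next_retry_delay(failures)` seconds so that a one-tick I/O blip does not pause a healthy board.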

fix: quarantine kanban boards on sqlite disk io errors

413e7a6

alt-glitch added labels on May 24, 2026: type/bug (Something isn't working), P3 (Low: cosmetic, nice to have), comp/gateway (Gateway runner, session dispatch, delivery), comp/plugins (Plugin system and bundled plugins)

ddblue0 mentioned this pull request May 25, 2026

Gateway embedded Kanban dispatcher opens SQLite WAL connections every tick, causing FD/WAL pressure #31736

Closed

alt-glitch mentioned this pull request May 27, 2026

fix(kanban): retry corrupt-board dispatch after quarantine #33263

Closed

kshitijk4poor closed this May 28, 2026

thebryce15 wants to merge 1 commit into NousResearch:main from thebryce15:t_31bf63cd/dispatch-db-io-reap
