fix: harden kanban corrupt board quarantine by alphathetacoding · Pull Request #32094 · NousResearch/hermes-agent

alphathetacoding · 2026-05-25T13:37:03Z

Summary

Harden Kanban handling for corrupt per-board SQLite databases.

Before this change, each access to the same corrupt board DB could create another kanban.db.corrupt.* backup file. The set of backups could grow without bound across retries and restarts if a dashboard or dispatcher kept touching the same malformed board.

This patch makes corrupt-board handling idempotent and bounded.

Changes

Add durable corrupt-board quarantine state via kanban.db.corrupt-quarantine.json.
Fingerprint corrupt DB files so repeated access to the same corrupt DB does not create new backups.
Add CORRUPT_DB_BACKUP_RETENTION = 3.
Prune old .bak, .bak-wal, and .bak-shm corrupt backup files.
Clear stale quarantine markers after a healthy DB open/recovery.
Extend dashboard/API handling to return clean unreadable-board diagnostics.
Preserve dashboard fallback away from stale hermes.kanban.selectedBoard.
Extend gateway dispatcher handling for quarantined corrupt boards.
Add regression coverage for idempotent quarantine, retention, dashboard diagnostics, and dispatcher handling.
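
For readers without the diff at hand, the fingerprint, marker, and retention items above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the marker file name and the retention value come from the description, while the function names, the SHA-256 fingerprint, the backup naming pattern, and the marker's JSON layout are all assumptions. WAL and SHM sidecar files are left out for brevity.

```python
import hashlib
import json
import shutil
import time
from pathlib import Path

CORRUPT_DB_BACKUP_RETENTION = 3  # value from the PR description
MARKER_NAME = "kanban.db.corrupt-quarantine.json"  # name from the PR description


def fingerprint(db_path: Path) -> str:
    """Hash the file bytes so the same corrupt file always yields the same ID (assumed scheme)."""
    return hashlib.sha256(db_path.read_bytes()).hexdigest()


def quarantine_corrupt_db(db_path: Path) -> bool:
    """Back up a corrupt DB at most once per fingerprint. Returns True if a backup was made."""
    marker = db_path.with_name(MARKER_NAME)
    fp = fingerprint(db_path)
    try:
        already = json.loads(marker.read_text()).get("fingerprint") == fp
    except (OSError, ValueError, AttributeError):
        already = False  # missing or unreadable marker: treat as not yet quarantined
    if already:
        return False  # same corrupt file seen before: no new backup
    # Hypothetical naming pattern: <db name>.corrupt.<timestamp ns>.bak
    backup = db_path.with_name(f"{db_path.name}.corrupt.{time.time_ns()}.bak")
    shutil.copy2(db_path, backup)
    marker.write_text(json.dumps({"fingerprint": fp, "backup": backup.name}))
    prune_corrupt_backups(db_path)
    return True


def prune_corrupt_backups(db_path: Path, keep: int = CORRUPT_DB_BACKUP_RETENTION) -> None:
    """Delete all but the newest `keep` backups, ordered by the timestamp in the name."""
    backups = sorted(
        db_path.parent.glob(f"{db_path.name}.corrupt.*.bak"),
        key=lambda p: int(p.name.split(".")[-2]),
    )
    for old in backups[: max(len(backups) - keep, 0)]:
        old.unlink()


def clear_quarantine_marker(db_path: Path) -> None:
    """Remove a stale marker once the DB opens cleanly again."""
    db_path.with_name(MARKER_NAME).unlink(missing_ok=True)
```

Calling quarantine_corrupt_db twice on an unchanged file produces one backup; a file whose bytes change gets a new backup, and older ones are pruned down to the retention limit.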

Validation

venv/bin/python -m pytest tests/plugins/test_kanban_dashboard_plugin.py
97 passed, 1 warning

venv/bin/python -m pytest tests/hermes_cli/test_kanban_db.py -k "corrupt_db_quarantine or prune_corrupt_db_backups or init_db_refuses_corrupt_existing_file or connect_refuses_corrupt_existing_file or locked_healthy_db_does_not_classify_as_corrupt"
6 passed, 169 deselected

venv/bin/python -m pytest tests/hermes_cli/test_kanban_core_functionality.py -k "gateway_dispatcher_disables_corrupt_board_without_traceback or gateway_dispatcher_disables_quarantined_corrupt_board_without_traceback"
2 passed, 164 deselected

kshitijk4poor · 2026-05-28T06:40:04Z

Closing as superseded. Two pieces here:

KanbanDbCorruptError catch in _is_corrupt_board_db_error → already on main via the defensive getattr lookup that landed in c94ad8981 / #33482 commit fefb4617d series.
Persistent corrupt-board quarantine with JSON markers + backup rotation → the design diverges from the simpler in-memory latch on main (c94ad89's 5-min retry timer with fingerprint-change retry). Persistent markers across gateway restarts would prevent automated recovery once the underlying file changes — the current in-memory approach with fingerprint-based retry handles the recovery case automatically.
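
To make the contrast concrete, an in-memory latch with a retry timer and fingerprint-change retry might look something like the sketch below. It is not the code on main: the class name, method names, and the (size, mtime) fingerprint are hypothetical, and only the 5-minute interval is taken from the comment above.

```python
import os
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

RETRY_AFTER_SECONDS = 300.0  # the 5-minute retry timer mentioned above

Fingerprint = Tuple[int, int]  # (size, mtime_ns): a cheap stand-in, assumed here


def stat_fingerprint(path: str) -> Fingerprint:
    """Cheap file identity; a missing file gets a sentinel value."""
    try:
        st = os.stat(path)
    except OSError:
        return (-1, -1)
    return (st.st_size, st.st_mtime_ns)


@dataclass
class _Latch:
    fingerprint: Fingerprint
    latched_at: float


class CorruptBoardLatch:
    """Process-local record of boards skipped as corrupt. Lost on restart by design."""

    def __init__(
        self,
        clock: Callable[[], float] = time.monotonic,
        fingerprint: Callable[[str], Fingerprint] = stat_fingerprint,
    ) -> None:
        self._clock = clock
        self._fingerprint = fingerprint
        self._latched: Dict[str, _Latch] = {}

    def mark_corrupt(self, path: str) -> None:
        """Record the board as corrupt along with its current fingerprint."""
        self._latched[path] = _Latch(self._fingerprint(path), self._clock())

    def should_attempt(self, path: str) -> bool:
        """True if the board is not latched, the timer expired, or the file changed."""
        entry = self._latched.get(path)
        if entry is None:
            return True
        timer_expired = self._clock() - entry.latched_at >= RETRY_AFTER_SECONDS
        file_changed = self._fingerprint(path) != entry.fingerprint
        if timer_expired or file_changed:
            del self._latched[path]  # release the latch so the caller retries
            return True
        return False
```

Because the latch lives only in process memory, a gateway restart or a replaced file leads to a fresh attempt, which is the automated-recovery property described above.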

The remaining gap (transient-error confirmation before latching) is tracked as a follow-up in #33486, with exponential backoff + PRAGMA quick_check to distinguish real corruption from transient I/O. Thanks for the thorough write-up — the simpler-is-better direction won out but the failure scenarios you documented helped shape the policy.
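
One possible reading of that follow-up, sketched with the standard sqlite3 module: re-run PRAGMA quick_check a few times with exponentially growing delays, and treat the board as corrupt only if every attempt fails the check. The function names, attempt count, and delays are assumptions for illustration; the actual design in #33486 may differ.

```python
import sqlite3
import time
from pathlib import Path
from typing import Callable


def quick_check_ok(db_path: str) -> bool:
    """True if PRAGMA quick_check reports 'ok'.

    Lock, open, and I/O failures surface as sqlite3.OperationalError and are
    re-raised so the caller can treat them as transient rather than corrupt.
    """
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"  # read-only open
    conn = sqlite3.connect(uri, uri=True, timeout=1.0)
    try:
        rows = conn.execute("PRAGMA quick_check").fetchall()
        return rows == [("ok",)]
    except sqlite3.OperationalError:
        raise  # transient class of errors: not evidence of corruption
    except sqlite3.DatabaseError:
        return False  # e.g. "file is not a database" or a malformed disk image
    finally:
        conn.close()


def confirm_corruption(
    db_path: str,
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True only if every attempt fails quick_check with a corruption error."""
    for attempt in range(attempts):
        if attempt:
            sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, then 1s, then 2s, ...
        try:
            if quick_check_ok(db_path):
                return False  # opened and checked cleanly: not corrupt
        except sqlite3.OperationalError:
            return False  # lock or I/O trouble: inconclusive, do not latch
    return True
```

The sleep parameter is injectable so the backoff schedule can be tested without waiting.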

fix: harden kanban corrupt board quarantine

4275249

alt-glitch added labels on May 25, 2026: type/bug (Something isn't working), P3 (Low: cosmetic, nice to have), comp/plugins (Plugin system and bundled plugins), comp/gateway (Gateway runner, session dispatch, delivery)

This was referenced May 26, 2026

fix(gateway): catch KanbanDbCorruptError in kanban dispatcher corrupt-board path #32490

Closed

fix: stabilize gateway recovery and kanban backups #32493

Open

Danuselli mentioned this pull request May 26, 2026

feat(kanban): add busy_timeout PRAGMA to prevent WAL corruption under concurrent writers #32532

Closed

This was referenced May 27, 2026

[Bug] Kanban DB corruption when gateway and dashboard open the same board DB concurrently (WAL mode) #33169

Closed

fix(kanban): retry corrupt-board dispatch after quarantine #33263

Closed

teknium1 mentioned this pull request May 27, 2026

fix(kanban): retry corrupt-board dispatch after quarantine (salvage #33263) #33412

Merged

alt-glitch mentioned this pull request May 27, 2026

fix(kanban): quarantine corrupt DB backups by fingerprint #33529

Closed

19 tasks

kshitijk4poor closed this May 28, 2026

alphathetacoding wants to merge 1 commit into NousResearch:main from alphathetacoding:fix/kanban-corrupt-quarantine
