Summary
Harden Kanban handling for corrupt per-board SQLite databases.
Before this change, each access to the same corrupt board DB could create another kanban.db.corrupt.* backup file. The number of backups could grow without bound across retries and restarts if a dashboard or dispatcher kept touching the same malformed board.
This patch makes corrupt-board handling idempotent and bounded.
Changes
Add durable corrupt-board quarantine state via kanban.db.corrupt-quarantine.json.
Fingerprint corrupt DB files so repeated access to the same corrupt DB does not create new backups.
Add CORRUPT_DB_BACKUP_RETENTION = 3.
Prune old .bak, .bak-wal, and .bak-shm corrupt backup files.
Clear stale quarantine markers after a healthy DB open/recovery.
Extend dashboard/API handling to return clean unreadable-board diagnostics.
Preserve dashboard fallback away from stale hermes.kanban.selectedBoard.
Extend gateway dispatcher handling for quarantined corrupt boards.
Add regression coverage for idempotent quarantine, retention, dashboard diagnostics, and dispatcher handling.
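The quarantine, fingerprint, and retention items above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the marker file name and CORRUPT_DB_BACKUP_RETENTION come from the description, while fingerprint(), quarantine_corrupt_db(), prune_backups(), the size-plus-SHA-256 fingerprint, the time-based backup suffix, and the marker's JSON fields are all assumptions. WAL/SHM sidecar handling is left out for brevity.

```python
import hashlib
import json
import shutil
import time
from pathlib import Path

CORRUPT_DB_BACKUP_RETENTION = 3  # value from the PR description
MARKER_NAME = "kanban.db.corrupt-quarantine.json"  # marker name from the PR description


def fingerprint(db_path: Path) -> str:
    """Hypothetical fingerprint: file size plus SHA-256 of the contents."""
    data = db_path.read_bytes()
    return f"{len(data)}:{hashlib.sha256(data).hexdigest()}"


def prune_backups(db_path: Path) -> None:
    """Keep only the newest CORRUPT_DB_BACKUP_RETENTION backups."""
    backups = sorted(
        db_path.parent.glob(f"{db_path.name}.corrupt.*.bak"),
        key=lambda p: int(p.name.split(".")[-2]),  # assumed numeric timestamp suffix
    )
    for old in backups[:-CORRUPT_DB_BACKUP_RETENTION]:
        old.unlink()


def quarantine_corrupt_db(db_path: Path) -> bool:
    """Back up a corrupt DB once per fingerprint.

    Returns True if a new backup was written, False if this exact
    corrupt file was already quarantined.
    """
    marker = db_path.with_name(MARKER_NAME)
    fp = fingerprint(db_path)
    if marker.exists():
        try:
            if json.loads(marker.read_text()).get("fingerprint") == fp:
                return False  # same corrupt bytes: no new backup
        except (ValueError, AttributeError):
            pass  # unreadable marker: fall through and rewrite it
    backup = db_path.with_name(f"{db_path.name}.corrupt.{time.time_ns()}.bak")
    shutil.copy2(db_path, backup)
    marker.write_text(json.dumps({"fingerprint": fp, "backup": backup.name}))
    prune_backups(db_path)
    return True
```

Calling quarantine_corrupt_db() twice on an unchanged corrupt file writes one backup. A new backup appears only when the file's bytes change, and the pruning step caps the total at the retention limit.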
Validation
venv/bin/python -m pytest tests/plugins/test_kanban_dashboard_plugin.py
97 passed, 1 warning
venv/bin/python -m pytest tests/hermes_cli/test_kanban_db.py -k "corrupt_db_quarantine or prune_corrupt_db_backups or init_db_refuses_corrupt_existing_file or connect_refuses_corrupt_existing_file or locked_healthy_db_does_not_classify_as_corrupt"
6 passed, 169 deselected
venv/bin/python -m pytest tests/hermes_cli/test_kanban_core_functionality.py -k "gateway_dispatcher_disables_corrupt_board_without_traceback or gateway_dispatcher_disables_quarantined_corrupt_board_without_traceback"
2 passed, 164 deselected
KanbanDbCorruptError catch in _is_corrupt_board_db_error → already on main via the defensive getattr lookup that landed in c94ad8981 / #33482 commit fefb4617d series.
Persistent corrupt-board quarantine with JSON markers + backup rotation → the design diverges from the simpler in-memory latch on main (c94ad89's 5-min retry timer with fingerprint-change retry). Persistent markers across gateway restarts would prevent automated recovery once the underlying file changes — the current in-memory approach with fingerprint-based retry handles the recovery case automatically.
The remaining gap (transient-error confirmation before latching) is tracked as a follow-up in #33486, with exponential backoff + PRAGMA quick_check to distinguish real corruption from transient I/O. Thanks for the thorough write-up — the simpler-is-better direction won out but the failure scenarios you documented helped shape the policy.
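The follow-up is described here only as exponential backoff plus PRAGMA quick_check, so the following is one possible reading of that idea rather than the planned implementation. confirm_corruption() and its attempts, base_delay, and sleep parameters are hypothetical names. Treating sqlite3.OperationalError (locked, busy, I/O errors) as transient rather than as evidence of corruption is likewise an assumption.

```python
import sqlite3
import time


def confirm_corruption(db_path, attempts=3, base_delay=0.05, sleep=time.sleep):
    """Return True only if every attempt gives a corruption verdict.

    A clean quick_check returns False immediately. Transient errors
    never count toward the verdict, so a locked but healthy DB is not
    classified as corrupt.
    """
    corrupt_votes = 0
    for attempt in range(attempts):
        try:
            conn = sqlite3.connect(str(db_path))
            try:
                rows = conn.execute("PRAGMA quick_check").fetchall()
            finally:
                conn.close()
            if rows == [("ok",)]:
                return False  # healthy: no need to retry
            corrupt_votes += 1  # quick_check reported problems
        except sqlite3.OperationalError:
            pass  # assumed transient (locked, busy, I/O): not a vote
        except sqlite3.DatabaseError:
            corrupt_votes += 1  # e.g. malformed image or not a database
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)  # exponential backoff
    return corrupt_votes == attempts
```

Requiring a unanimous verdict across spaced-out attempts trades a short delay for fewer false latches. The sleep parameter is injectable so tests can run without waiting.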
Labels: comp/gateway (Gateway runner, session dispatch, delivery); comp/plugins (Plugin system and bundled plugins); P3 (Low: cosmetic, nice to have); type/bug (Something isn't working)