fix: harden kanban corrupt board quarantine #32094

Closed
alphathetacoding wants to merge 1 commit into NousResearch:main from alphathetacoding:fix/kanban-corrupt-quarantine

@alphathetacoding

Summary

Harden Kanban handling for corrupt per-board SQLite databases.

Before this change, every access to the same corrupt board DB could create another kanban.db.corrupt.* backup file. The backups could therefore grow without bound across retries and restarts whenever a dashboard or dispatcher kept touching the same malformed board.

This patch makes corrupt-board handling idempotent and bounded.

Changes

  • Add durable corrupt-board quarantine state via kanban.db.corrupt-quarantine.json.
  • Fingerprint corrupt DB files so repeated access to the same corrupt DB does not create new backups.
  • Add CORRUPT_DB_BACKUP_RETENTION = 3.
  • Prune old .bak, .bak-wal, and .bak-shm corrupt backup files.
  • Clear stale quarantine markers after a healthy DB open/recovery.
  • Extend dashboard/API handling to return clean unreadable-board diagnostics.
  • Preserve dashboard fallback away from stale hermes.kanban.selectedBoard.
  • Extend gateway dispatcher handling for quarantined corrupt boards.
  • Add regression coverage for idempotent quarantine, retention, dashboard diagnostics, and dispatcher handling.
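The fingerprint, marker, and retention items above can be illustrated with a minimal sketch. This is not the patch's code: the function names, the fingerprint recipe (size plus SHA-256 of the contents), the marker's JSON fields, and the exact backup file name are assumptions made for illustration. Only the marker file name and the retention value of 3 come from the description, and the WAL/SHM companion files are left out for brevity.

```python
import hashlib
import json
import shutil
import time
from pathlib import Path

CORRUPT_DB_BACKUP_RETENTION = 3  # value taken from the PR description


def fingerprint(db_path: Path) -> str:
    """Hash size plus contents so an unchanged corrupt file maps to one ID."""
    digest = hashlib.sha256()
    digest.update(str(db_path.stat().st_size).encode())
    with db_path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def prune_backups(db_path: Path) -> None:
    """Keep only the newest CORRUPT_DB_BACKUP_RETENTION backups."""
    # Hypothetical naming: <db>.corrupt.<time_ns>.bak, so a name sort is a time sort.
    backups = sorted(db_path.parent.glob(db_path.name + ".corrupt.*.bak"))
    for old in backups[:-CORRUPT_DB_BACKUP_RETENTION]:
        old.unlink()


def quarantine(db_path: Path) -> bool:
    """Back up a corrupt DB once per fingerprint; True if a backup was made."""
    marker = db_path.with_name(db_path.name + ".corrupt-quarantine.json")
    fp = fingerprint(db_path)
    if marker.exists():
        try:
            if json.loads(marker.read_text()).get("fingerprint") == fp:
                return False  # same corrupt file as last time: do nothing
        except (json.JSONDecodeError, AttributeError):
            pass  # unreadable marker: fall through and rewrite it
    backup = db_path.with_name(f"{db_path.name}.corrupt.{time.time_ns()}.bak")
    shutil.copy2(db_path, backup)
    marker.write_text(json.dumps({"fingerprint": fp, "backup": backup.name}))
    prune_backups(db_path)
    return True
```

Calling `quarantine` twice on an unchanged file produces a single backup, which is the idempotence property the summary describes. A new backup appears only when the file's contents change.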

Validation

venv/bin/python -m pytest tests/plugins/test_kanban_dashboard_plugin.py
97 passed, 1 warning

venv/bin/python -m pytest tests/hermes_cli/test_kanban_db.py -k "corrupt_db_quarantine or prune_corrupt_db_backups or init_db_refuses_corrupt_existing_file or connect_refuses_corrupt_existing_file or locked_healthy_db_does_not_classify_as_corrupt"
6 passed, 169 deselected

venv/bin/python -m pytest tests/hermes_cli/test_kanban_core_functionality.py -k "gateway_dispatcher_disables_corrupt_board_without_traceback or gateway_dispatcher_disables_quarantined_corrupt_board_without_traceback"
2 passed, 164 deselected

@kshitijk4poor

Collaborator

Closing as superseded. Two pieces here:

  1. KanbanDbCorruptError catch in _is_corrupt_board_db_error → already on main via the defensive getattr lookup that landed in c94ad8981 / #33482 commit fefb4617d series.

  2. Persistent corrupt-board quarantine with JSON markers + backup rotation → the design diverges from the simpler in-memory latch on main (c94ad89's 5-min retry timer with fingerprint-change retry). Persistent markers across gateway restarts would prevent automated recovery once the underlying file changes — the current in-memory approach with fingerprint-based retry handles the recovery case automatically.
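For readers comparing the two designs, the in-memory approach described in item 2 might look roughly like the sketch below. It is an illustration under stated assumptions, not the code on main: the class name, method names, and the `(size, mtime_ns)` fingerprint tuple are hypothetical. Only the five-minute retry window and the retry-on-fingerprint-change behaviour are taken from the comment.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

RETRY_INTERVAL_SECONDS = 300.0  # the "5-min retry timer" mentioned above

Fingerprint = Tuple[int, int]  # assumed here: (size, mtime_ns)


@dataclass
class _Latch:
    fingerprint: Fingerprint
    latched_at: float


class CorruptBoardLatch:
    """Process-local record of boards to skip; nothing is written to disk."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._latched: Dict[str, _Latch] = {}

    def mark_corrupt(self, board: str, fingerprint: Fingerprint) -> None:
        self._latched[board] = _Latch(fingerprint, self._clock())

    def mark_healthy(self, board: str) -> None:
        self._latched.pop(board, None)

    def should_skip(self, board: str, fingerprint: Fingerprint) -> bool:
        entry = self._latched.get(board)
        if entry is None:
            return False
        if fingerprint != entry.fingerprint:
            return False  # file changed on disk: retry immediately
        # Same file as before: skip until the retry window has elapsed.
        return self._clock() - entry.latched_at < RETRY_INTERVAL_SECONDS
```

Because the state lives only in the process, a gateway restart clears every latch. That is the trade-off against the persistent JSON markers proposed in this PR.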

The remaining gap (transient-error confirmation before latching) is tracked as a follow-up in #33486, with exponential backoff + PRAGMA quick_check to distinguish real corruption from transient I/O. Thanks for the thorough write-up — the simpler-is-better direction won out but the failure scenarios you documented helped shape the policy.
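A rough sketch of what the follow-up could involve, assuming only the standard-library `sqlite3` module: retry operational errors with exponential backoff, and treat a failed `PRAGMA quick_check` or a non-operational `DatabaseError` as real corruption. The function name, its parameters, and the three result labels are hypothetical, and the actual policy in #33486 may differ.

```python
import sqlite3
import time
from pathlib import Path
from typing import Callable


def classify_board_db(
    db_path: Path,
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return "healthy", "corrupt", or "transient" for a SQLite file."""
    uri = db_path.resolve().as_uri() + "?mode=ro"  # read-only: never create the file
    for attempt in range(attempts):
        try:
            conn = sqlite3.connect(uri, uri=True)
            try:
                rows = conn.execute("PRAGMA quick_check").fetchall()
            finally:
                conn.close()
        except sqlite3.OperationalError:
            # Locked file, disk I/O error, and similar: back off, then retry.
            # Must be caught first because it subclasses DatabaseError.
            sleep(base_delay * (2 ** attempt))
            continue
        except sqlite3.DatabaseError:
            return "corrupt"  # e.g. "file is not a database"
        # quick_check yields a single "ok" row when it finds no problems.
        return "healthy" if rows == [("ok",)] else "corrupt"
    return "transient"
```

Only a "corrupt" result would justify latching the board. A "transient" result means the retries were exhausted without confirming corruption, so the caller can try again later.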

Labels

  • comp/gateway: Gateway runner, session dispatch, delivery
  • comp/plugins: Plugin system and bundled plugins
  • P3: Low (cosmetic, nice to have)
  • type/bug: Something isn't working

3 participants