fix(kanban-db): retry integrity probe before flagging DB as corrupt by emigal · Pull Request #31795 · NousResearch/hermes-agent

emigal · 2026-05-25T02:17:40Z

The kanban gateway dispatcher opens a fresh SQLite connection on every tick, and each connection runs PRAGMA integrity_check via _guard_existing_db_is_healthy. Under WAL with concurrent worker writes, the probe can transiently observe a torn page and either return a non-'ok' integrity row or raise sqlite3.DatabaseError('database disk image is malformed'), even though the file is fine and the very next probe succeeds.

The previous guard treated the first such hit as terminal corruption: it copied the file to a timestamped .corrupt.*.bak and raised KanbanDbCorruptError, which the gateway dispatcher then used to disable dispatch on that board until the file mtime changed or the gateway restarted. In practice this caused the dispatcher to silently stop processing tasks on a perfectly healthy DB, and left dozens of spurious .corrupt backup files on disk.
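The "disabled until the file mtime changed or the gateway restarted" behaviour can be pictured with a small sketch. Everything here is hypothetical (the class name `BoardQuarantine` and its methods are not taken from the repository); it only illustrates why one false positive can silence a board for a long time.

```python
import os


class BoardQuarantine:
    """Hypothetical in-memory record of boards flagged as corrupt.

    State lives only in this process, so a gateway restart clears it.
    """

    def __init__(self):
        self._flagged_mtime_ns = {}

    def flag(self, db_path):
        # Remember the file's mtime at the moment corruption was reported.
        self._flagged_mtime_ns[db_path] = os.stat(db_path).st_mtime_ns

    def should_dispatch(self, db_path):
        flagged = self._flagged_mtime_ns.get(db_path)
        if flagged is None:
            return True  # never flagged
        if os.stat(db_path).st_mtime_ns != flagged:
            # The file changed since it was flagged, so give it another chance.
            del self._flagged_mtime_ns[db_path]
            return True
        return False  # still the same file: keep dispatch disabled
```

Under this assumed model, a board flagged by mistake stays disabled for as long as nothing touches the main database file.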

Retry the integrity probe before declaring corruption: up to 3 attempts, with a short backoff between them (250ms after the first failure, 500ms after the second). A genuinely corrupt file still gets flagged after 3 consistent failures; a transient WAL blip from a concurrent worker write now self-heals.
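A minimal sketch of the retry described above, not the PR's actual diff. The names `probe_integrity`, `is_healthy_with_retry`, and `RETRY_BACKOFF_SECONDS` are hypothetical. Letting sqlite3.OperationalError (for example "database is locked") propagate rather than counting it as a failed probe is also an assumption.

```python
import sqlite3
import time

# Hypothetical constant: sleeps between attempts (3 attempts, so 2 sleeps).
RETRY_BACKOFF_SECONDS = (0.25, 0.5)


def probe_integrity(db_path):
    """Run PRAGMA integrity_check once; healthy means the only row is 'ok'."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    finally:
        conn.close()
    return rows == [("ok",)]


def is_healthy_with_retry(db_path, probe=probe_integrity, sleep=time.sleep):
    """Return True as soon as one probe passes, False after 3 failures."""
    attempts = len(RETRY_BACKOFF_SECONDS) + 1
    for attempt in range(attempts):
        try:
            if probe(db_path):
                return True
        except sqlite3.OperationalError:
            # Assumption: lock/busy errors are not evidence of corruption.
            raise
        except sqlite3.DatabaseError:
            # e.g. 'database disk image is malformed'; may be transient.
            pass
        if attempt < attempts - 1:
            sleep(RETRY_BACKOFF_SECONDS[attempt])
    return False
```

With these values a genuinely corrupt file costs at most 750ms of extra sleep before it is flagged, while a single torn read is absorbed by the second attempt.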

Adds a regression test that injects a single transient DatabaseError on the first probe attempt and asserts:

- connect() succeeds (retry sees a healthy DB on attempt 2)
- no .corrupt backup is produced
- the retry path was actually exercised

Existing tests for genuine corruption and locked-but-healthy DBs continue to pass unchanged.
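The shape of such a regression test might look like the sketch below. It runs against a simplified stand-in for the guarded connect(), not the project's real module: `real_probe`, the `probe` parameter, the fixed `.corrupt.bak` suffix, and the omission of backoff sleeps are all assumptions made to keep the example self-contained.

```python
import glob
import os
import shutil
import sqlite3
import tempfile


class KanbanDbCorruptError(Exception):
    """Stand-in for the error type named in the PR description."""


def real_probe(db_path):
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute("PRAGMA integrity_check").fetchall() == [("ok",)]
    finally:
        conn.close()


def connect(db_path, probe=real_probe, attempts=3):
    """Simplified guard: retry the probe, then back up the file and raise."""
    for _ in range(attempts):
        try:
            if probe(db_path):
                return sqlite3.connect(db_path)
        except sqlite3.DatabaseError:
            pass  # possibly transient; later attempts decide
    shutil.copy2(db_path, db_path + ".corrupt.bak")
    raise KanbanDbCorruptError(db_path)


def run_transient_probe_scenario():
    """Inject one DatabaseError, then report (probe calls, backup files)."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = os.path.join(tmp, "kanban.db")
        setup = sqlite3.connect(db_path)
        setup.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY)")
        setup.commit()
        setup.close()

        calls = []

        def flaky_probe(path):
            calls.append(path)
            if len(calls) == 1:
                raise sqlite3.DatabaseError("database disk image is malformed")
            return real_probe(path)

        conn = connect(db_path, probe=flaky_probe)  # must not raise
        conn.close()
        return len(calls), glob.glob(db_path + ".corrupt*")
```

Counting probe calls is what demonstrates that the retry path ran; checking only that connect() succeeded would also pass if the injected failure never fired.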

alt-glitch added the type/bug (Something isn't working), P3 (Low: cosmetic, nice to have), comp/cli (CLI entry point, hermes_cli/, setup wizard), and comp/plugins (Plugin system and bundled plugins) labels May 25, 2026

github-actions Bot mentioned this pull request May 25, 2026

🦞 OpenClaw Ecosystem Daily 2026-05-25 ivanweng2077/big_model_radar#87

Open

herrschmidt mentioned this pull request May 26, 2026

fix(kanban): remove false-positive corruption detection from separate probe connection #32449

Closed

alt-glitch mentioned this pull request May 26, 2026

fix(gateway): catch KanbanDbCorruptError in kanban dispatcher corrupt-board path #32490

Closed

teknium1 mentioned this pull request May 27, 2026

fix(kanban): retry corrupt-board dispatch after quarantine (salvage #33263) #33412

Merged

emigal wants to merge 1 commit into NousResearch:main from emigal:fix/kanban-db-guard-retry-transient-malformed

2 participants
