
fix(kanban): change synchronous=NORMAL to FULL + add wal_autocheckpoint=100 #31731

Closed
someaka wants to merge 1 commit into NousResearch:main from someaka:fix/31502-kanban-synchronous-full-clean


@someaka commented May 24, 2026


Problem

PRAGMA synchronous=NORMAL (line 1184 of hermes_cli/kanban_db.py) defers fsync in WAL mode, leaving kanban.db vulnerable to corruption when a process is SIGKILL'd mid-transaction or when WAL frames are partially written during concurrent access. This causes "database disk image is malformed" errors after roughly 9-10 rapid task creations.

Root Cause

synchronous=NORMAL means SQLite only syncs at checkpoint boundaries, not after every WAL frame write. If a writer process is killed mid-transaction (SIGKILL from reclaim, gateway shutdown, OOM killer), WAL frames may be written but the main DB is left in an inconsistent state. The next connection then reports a malformed database.

Also, the default 1000-page WAL checkpoint threshold lets the WAL grow large between checkpoints, widening the window where a SIGKILL can leave a huge WAL in a fragile state.

Fix — 2 lines, 1 file

  1. synchronous=NORMAL → FULL: ensures every WAL frame is fsync'd before the write completes, preventing WAL/main-DB inconsistency even after SIGKILL
  2. wal_autocheckpoint=100: caps the WAL at 100 pages between automatic checkpoints, which bounds the checkpoint I/O spike and reduces the window where a large WAL is fragile
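
For readers without the diff at hand, the two settings above could be applied roughly as follows. This is a minimal sketch using Python's standard sqlite3 module; the helper name open_kanban_db is hypothetical and is not the actual code in hermes_cli/kanban_db.py.

```python
import sqlite3


def open_kanban_db(path):
    """Hypothetical helper: open a SQLite file with the PRAGMAs from this PR.

    Not the real kanban_db.py connection code, only an illustration.
    """
    conn = sqlite3.connect(path)
    # WAL mode is assumed to be what the kanban DB already uses.
    conn.execute("PRAGMA journal_mode=WAL")
    # Change 1: sync the WAL on every commit instead of only at checkpoints.
    conn.execute("PRAGMA synchronous=FULL")
    # Change 2: trigger an automatic checkpoint once the WAL reaches 100 pages.
    conn.execute("PRAGMA wal_autocheckpoint=100")
    return conn
```

Both synchronous and wal_autocheckpoint are per-connection settings, so a helper like this would need to run on every new connection rather than once per database file.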

These match the fix proposed in upstream PRs #30969 / #30973 which are not yet merged.

Trade-off

synchronous=FULL adds one fsync per write — ~1ms overhead on SSD, ~5-10ms on HDD. For kanban (create/show/complete operations, not a high-throughput OLTP path), this is negligible vs. the cost of DB corruption and manual recovery.
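
The overhead figures above depend heavily on the storage device and filesystem. A rough way to measure them locally is sketched below; the function name, table schema, and row contents are made up for illustration and do not come from the kanban code.

```python
import os
import sqlite3
import tempfile
import time


def mean_commit_seconds(synchronous, n=50):
    """Return the mean wall-clock time per single-row commit in WAL mode.

    Illustrative micro-benchmark only; schema and workload are assumptions.
    """
    if synchronous not in ("OFF", "NORMAL", "FULL", "EXTRA"):
        raise ValueError(f"unsupported synchronous level: {synchronous!r}")
    with tempfile.TemporaryDirectory() as tmp:
        conn = sqlite3.connect(os.path.join(tmp, "bench.db"))
        conn.execute("PRAGMA journal_mode=WAL")
        conn.execute(f"PRAGMA synchronous={synchronous}")
        conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, body TEXT)")
        conn.commit()
        start = time.perf_counter()
        for i in range(n):
            # One small write per transaction, similar to creating one task.
            conn.execute("INSERT INTO t (body) VALUES (?)", (f"task {i}",))
            conn.commit()
        elapsed = time.perf_counter() - start
        conn.close()
    return elapsed / n
```

Comparing mean_commit_seconds("NORMAL") with mean_commit_seconds("FULL") on the target machine gives a local estimate of the per-write cost; results vary between runs and between devices.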

 hermes_cli/kanban_db.py | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Fixes: #31502
Related: #31618, #30896

…nt=100

PRAGMA synchronous=NORMAL defers fsync in WAL mode, leaving kanban.db
vulnerable to corruption when a process is SIGKILL'd mid-transaction
or when WAL frames are partially written during concurrent access.

- synchronous=NORMAL → FULL: fsync every WAL frame before write completes
- wal_autocheckpoint=100: limit WAL to 100 pages between auto-checkpoints

Fixes: NousResearch#31502
Related: NousResearch#31618, NousResearch#30896
@alt-glitch added the type/bug, P3, and comp/cli labels May 24, 2026
@alt-glitch

Collaborator

Competing with open PRs #30973 (synchronous=FULL + wal_autocheckpoint=100) and #31208 (secure_delete + cell_size_check + synchronous=FULL) — all target the same kanban SQLite corruption under concurrent writes. Clean replacement of stacked #31726. Related issue: #31618 (corruption recurs even with these PRAGMAs under SIGKILL).

@kshitijk4poor

Collaborator

Closing as already fixed on main — landed via #33482 commit 6416dd518 (@steveonjava's batch-salvage). That commit makes the exact same synchronous=NORMAL → FULL + wal_autocheckpoint=100 change you proposed here, and also adds secure_delete=ON + cell_size_check=ON for additional torn-write hardening. Thanks for tackling the same problem — same direction, just landed via the batch.
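
To confirm that a given connection actually has the four settings named above, something like the following sketch could be used. It is not taken from commit 6416dd518; the names HARDENING_PRAGMAS and apply_and_report are hypothetical.

```python
import sqlite3

# Settings named in this thread; the dict and function names are hypothetical.
HARDENING_PRAGMAS = {
    "synchronous": "FULL",
    "wal_autocheckpoint": "100",
    "secure_delete": "ON",
    "cell_size_check": "ON",
}


def apply_and_report(conn):
    """Apply the PRAGMAs, then read back the effective values.

    Also runs PRAGMA integrity_check and records the first result row.
    """
    for name, value in HARDENING_PRAGMAS.items():
        conn.execute(f"PRAGMA {name}={value}")
    report = {
        name: conn.execute(f"PRAGMA {name}").fetchone()[0]
        for name in HARDENING_PRAGMAS
    }
    report["integrity_check"] = conn.execute("PRAGMA integrity_check").fetchone()[0]
    return report
```

SQLite reports these settings as integers when read back, so the returned values are numeric codes rather than the keywords used to set them.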


Development

Successfully merging this pull request may close these issues.

Kanban SQLite database corruption under rapid task creation

3 participants