fix(kanban): change synchronous=NORMAL to FULL + add wal_autocheckpoint=100 #31731
Closed
someaka wants to merge 1 commit into
…nt=100

PRAGMA synchronous=NORMAL defers fsync in WAL mode, leaving kanban.db vulnerable to corruption when a process is SIGKILL'd mid-transaction or when WAL frames are partially written during concurrent access.

- synchronous=NORMAL → FULL: fsync every WAL frame before write completes
- wal_autocheckpoint=100: limit WAL to 100 pages between auto-checkpoints

Fixes: NousResearch#31502
Related: NousResearch#31618, NousResearch#30896
Collaborator
Competing with open PRs #30973 (synchronous=FULL + wal_autocheckpoint=100) and #31208 (secure_delete + cell_size_check + synchronous=FULL); all target the same kanban SQLite corruption under concurrent writes. Clean replacement of stacked #31726. Related issue: #31618 (corruption recurs even with these PRAGMAs under SIGKILL).
Collaborator
Closing as already fixed on main; landed via #33482 commit 6416dd518 (@steveonjava's batch-salvage). That commit makes the exact same
Problem
PRAGMA synchronous=NORMAL (line 1184 of hermes_cli/kanban_db.py) defers fsync in WAL mode, leaving kanban.db vulnerable to corruption when a process is SIGKILL'd mid-transaction or when WAL frames are partially written during concurrent access. This causes "database disk image is malformed" errors after ~9-10 rapid task creations.
Root Cause
synchronous=NORMAL means SQLite only syncs at checkpoint boundaries, not after every WAL frame write. If a writer process is killed mid-transaction (SIGKILL from reclaim, gateway shutdown, OOM killer), WAL frames may be written but the main DB is left in an inconsistent state. The next connection then finds a malformed DB.
Also, the default 1000-page WAL checkpoint threshold lets the WAL grow large between checkpoints, widening the window where a SIGKILL can leave a huge WAL in a fragile state.
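As a rough illustration of how the "database disk image is malformed" state described above could be detected, the sketch below opens a database file and runs SQLite's built-in integrity check. The function name check_db and the dictionary it returns are hypothetical and are not taken from hermes_cli/kanban_db.py.

```python
import sqlite3


def check_db(path):
    """Report the journal mode and integrity status of a SQLite file.

    Hypothetical helper for illustration only; not part of kanban_db.py.
    """
    journal_mode = None
    conn = sqlite3.connect(path)
    try:
        # journal_mode is stored in the file header, so "wal" persists
        # across connections once it has been set.
        journal_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
        # integrity_check yields a single row "ok" when no problems are found.
        integrity = conn.execute("PRAGMA integrity_check").fetchone()[0]
    except sqlite3.DatabaseError as exc:
        # A badly damaged file can raise before the check completes.
        integrity = f"error: {exc}"
    finally:
        conn.close()
    return {"journal_mode": journal_mode, "integrity": integrity}
```

Running a check like this after a forced kill of a writer process is one way to confirm whether the file was actually damaged, assuming the test harness can reproduce the kill.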
Fix — 2 lines, 1 file
- synchronous=NORMAL → FULL: ensures every WAL frame is fsync'd before the write completes, preventing WAL/main-DB inconsistency even after SIGKILL
- wal_autocheckpoint=100: caps WAL at 100 pages between automatic checkpoints, which bounds the checkpoint I/O spike and reduces the window where a large WAL is fragile
These match the fix proposed in upstream PRs #30969 / #30973, which are not yet merged.
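The two settings above could be applied at connection time roughly as follows. This is a minimal sketch, not the actual diff: the helper name connect_kanban is hypothetical, and the real code in hermes_cli/kanban_db.py may open and configure its connection differently.

```python
import sqlite3


def connect_kanban(path):
    """Open a SQLite connection with the PRAGMAs proposed in this PR.

    Hypothetical helper; the name and structure are assumptions.
    """
    conn = sqlite3.connect(path)
    # WAL mode is assumed to be in use already, as the PR description implies.
    conn.execute("PRAGMA journal_mode=WAL")
    # Was NORMAL before this change.
    conn.execute("PRAGMA synchronous=FULL")
    # Auto-checkpoint once the WAL reaches 100 pages (SQLite default: 1000).
    conn.execute("PRAGMA wal_autocheckpoint=100")
    return conn


def wal_threshold_bytes(conn, pages=100):
    """Approximate WAL size, in bytes, at which an auto-checkpoint is attempted."""
    page_size = conn.execute("PRAGMA page_size").fetchone()[0]
    return pages * page_size
```

Both synchronous and wal_autocheckpoint are per-connection settings, so every connection that writes to kanban.db would need to set them. Note also that wal_autocheckpoint is a trigger threshold rather than a hard limit: the WAL can still grow past it if a checkpoint cannot complete, for example while a reader holds an older snapshot.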
Trade-off
synchronous=FULL adds one fsync per write (~1ms overhead on SSD, ~5-10ms on HDD). For kanban (create/show/complete operations, not a high-throughput OLTP path), this is negligible vs. the cost of DB corruption and manual recovery.
Fixes: #31502
Related: #31618, #30896
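The per-write cost quoted under Trade-off depends heavily on the storage device and filesystem, so it may be worth measuring locally. The sketch below times single-row commits under each setting; the function name time_commits, the tasks table, and the row contents are made up for the benchmark and do not reflect the real kanban schema.

```python
import os
import sqlite3
import tempfile
import time


def time_commits(synchronous, n=50):
    """Return the mean seconds per single-row commit in WAL mode.

    Hypothetical micro-benchmark; the schema here is illustrative only.
    """
    if synchronous not in ("NORMAL", "FULL"):
        raise ValueError("synchronous must be 'NORMAL' or 'FULL'")
    with tempfile.TemporaryDirectory() as tmp:
        conn = sqlite3.connect(os.path.join(tmp, "bench.db"))
        conn.execute("PRAGMA journal_mode=WAL")
        # The value is validated above, so interpolating it is safe here.
        conn.execute(f"PRAGMA synchronous={synchronous}")
        conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, title TEXT)")
        conn.commit()
        start = time.perf_counter()
        for i in range(n):
            conn.execute("INSERT INTO tasks (title) VALUES (?)", (f"task {i}",))
            conn.commit()  # one transaction per insert, like a CLI task creation
        elapsed = time.perf_counter() - start
        conn.close()  # close before the temporary directory is removed
    return elapsed / n


if __name__ == "__main__":
    for mode in ("NORMAL", "FULL"):
        print(f"{mode}: {time_commits(mode) * 1000:.3f} ms per commit")
```

Results from a temporary directory may not match the disk that holds kanban.db, particularly if the temporary directory is memory-backed, so the numbers should be treated as indicative only.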