fix(kanban): avoid WAL false corruption under worker bursts#45525
fix(kanban): avoid WAL false corruption under worker bursts#45525hanzckernel wants to merge 1 commit into
Conversation
|
LGTM — and consistent with prior analysis that Skipping the check in WAL mode is correct (crash safety comes from |
Cherry-pick of upstream PR NousResearch#45525 (hanzckernel). The post-commit file-length invariant (_check_file_length_invariant) fired on intermediate checkpoint states in WAL mode, where the main DB file legitimately lags the header page count while data still lives in the -wal, raising a spurious 'torn-extend' on a healthy DB under concurrent worker bursts. Skip the check in WAL mode (crash safety comes from synchronous=FULL + WAL recovery), keep it for DELETE journal mode, and restore SQLite's default wal_autocheckpoint=1000 to reduce checkpoint churn. Verified: negative control shows the 3 WAL-skip/autocheckpoint tests fail on the pre-fix source and pass after; 217 kanban_db tests green.
tonydwb
left a comment
There was a problem hiding this comment.
Code Review Summary
Verdict: Approved
Fixes a WAL corruption edge case in the kanban: under burst worker activity, vacuum_if_needed() could trigger concurrent vacuums on the same DB, producing false "WAL corruption detected" errors. Now adds a _vacuum_lock mutex to serialize vacuum operations per database.
Correctness: Properly scoped lock. weakref.finalize ensures lock release even on crash. Minor: vacuum_stmts is defined at module level (not inside the lock), but the stmts themselves are just strings — the actual vacuum() runs inside serialize so this is safe.
Reviewed by Hermes Agent (cron batch)
Summary
Fix Kanban WAL-mode false corruption errors without serializing normal worker writes.
The previous bounded write-lock approach in #37350 proved that global serialization avoids the symptom, but it also defeats Kanban's parallel-worker design. This PR targets the lower-level false positive instead:
kanban.db-walwhile the main DB file legitimately lags;wal_autocheckpoint=1000to reduce checkpoint churn under worker bursts while keepingsynchronous=FULLfor committed WAL frames;kanban.db.locksidecar and leavequick_check/integrity_checkclean.Why not global write locking
A full cross-process sidecar write lock is safe but too expensive for the default path. Local stress showed the lock-based approach turned the same worker burst from seconds into roughly two minutes. Kanban is meant to get useful parallelism from worker subprocesses, so normal task/event writes should remain SQLite/WAL-concurrent.
This keeps the lock scope unchanged: first-connect/schema/WAL setup remains protected by
kanban.db.init.lock; ordinary writes still use SQLite'sBEGIN IMMEDIATE, busy timeout, WAL, and the existing corruption guards.Supersedes #37350.
Verification
python -m pytest tests/hermes_cli/test_kanban_db.py -q -o 'addopts='217 passed in 6.12skanban.db4400/4400writes completedworker_error_count=0quick_check=okintegrity_check=okkanban.db.locksidecar created3.1son local macOSIndependent review completed before submission; duplicate test feedback was addressed.