fix(kanban): add post-commit page_count invariant check to write_txn#32300
fix(kanban): add post-commit page_count invariant check to write_txn#32300steveonjava wants to merge 2 commits into
Conversation
Reads header bytes 28-31 after every COMMIT and compares against actual file size. Raises sqlite3.DatabaseError on torn-extend (actual_pages < page_count). Also sets PRAGMA wal_autocheckpoint=100 in connect(). Refs: NousResearch#31208 (Bug E - same file, coordinate), NousResearch#30973 (wal_autocheckpoint) Refs: NousResearch#30445, NousResearch#30896, NousResearch#30908 (corruption reports)
|
I hit the same corruption signature in a local Kanban board and this PR is pointed at the right failure class: valid SQLite header, but header page_count greater than the physical file length after the WAL sidecar is gone/empty. One nuance: the current patch detects the torn-extend/page-count mismatch after A complementary fix worth considering here instead of a separate duplicate PR:
So: +1 on this issue shape, but I’d treat post-commit invariant checking as detection/forensics, and a non-blocking post-write checkpoint as the prevention layer. Happy to test or help fold this into this PR rather than spawn yet another Kanban-corruption PR hydra head. |
|
Bundled into #32857 for batch review. This draft remains open as a cherry-pick fallback if maintainers prefer surgical landing. |
|
Merged via #33482 (commit 99c19eb). Cherry-picked with authorship preserved as part of the @steveonjava batch salvage from #32857. Thanks! |
What does this PR do?
This PR adds a post-commit page_count invariant check to
write_txnin the kanban DB connection helper, catching torn-extend corruption at commit time instead of letting it propagate silently to subsequent reads.The kanban DB has seen several production corruption events whose damage class is "torn-extend": SQLite's header
page_countis updated to N after a commit, but the file was only extended to fewer than N pages on disk. The result is apage_count > file_pagesmismatch that surfaces hours later asdatabase disk image is malformed, with no way to attribute it to the specific transaction that caused it. By the time the integrity-check watchdog notices, the malformed page is interleaved with hundreds of unrelated subsequent writes, making recovery painful and forensics impossible.The fix is a cheap, in-transaction sanity check: after every
COMMITinwrite_txn, read the 4-byte page_count field from the SQLite header (bytes 28–31) and compare against the actual file size. Ifactual_pages < page_count(the torn-extend signature), raisesqlite3.DatabaseErrorso the caller's transaction is rolled back instead of silently leaving the DB in an inconsistent state.The check is bounded to a single
pread()of 32 bytes per commit. Negligible cost (sub-millisecond on local SSD), zero behavior change for healthy DBs. Also bumpswal_autocheckpoint=100inconnect()to keep the WAL bounded under heavy write activity, which reduces the inter-commit window during which a torn-extend can land.Related Issues
Refs #31208 (Bug E pragma hardening — same file, coordinate on merge ordering)
Refs #30973 (wal_autocheckpoint hardening — orthogonal)
Refs #30445, #30896, #30908 (related corruption reports)
This PR does NOT close any of the above on its own — they describe larger corruption-resilience programs. This is one detection knob in that program.
Type of Change
Changes Made
hermes_cli/kanban_db.py— Added page_count post-commit check insidewrite_txn. AddedPRAGMA wal_autocheckpoint=100toconnect().tests/hermes_cli/test_kanban_db.py— 5 new tests: healthy commit passes the check, simulated torn-extend (truncate the file post-commit) raisesDatabaseError, the check is skipped for read-only / no-commit code paths, the pragma is set after connect, and the check is robust against concurrent writers (file size grows between header read and stat).How to Test
The new tests assert behavior under the simulated torn-extend by writing a normal commit then truncating the file with
os.truncateto drop the last page; the subsequent assertion is that the nextwrite_txnraises on the check. (Real kernel-level torn-extends are rare and can't be reliably forced in CI — the truncate is a faithful proxy because it produces the exact same header-vs-file-pages mismatch.)Checklist
fix(kanban): ...)scripts/run_tests.shpasses locally