Skip to content

fix(kanban): add post-commit page_count invariant check to write_txn#32300

Closed
steveonjava wants to merge 2 commits into
NousResearch:mainfrom
steveonjava:fix/kanban-bug-f-torn-extend-post-commit-check
Closed

fix(kanban): add post-commit page_count invariant check to write_txn#32300
steveonjava wants to merge 2 commits into
NousResearch:mainfrom
steveonjava:fix/kanban-bug-f-torn-extend-post-commit-check

Conversation

@steveonjava

Copy link
Copy Markdown
Contributor

What does this PR do?

This PR adds a post-commit page_count invariant check to write_txn in the kanban DB connection helper, catching torn-extend corruption at commit time instead of letting it propagate silently to subsequent reads.

The kanban DB has seen several production corruption events whose damage class is "torn-extend": SQLite's header page_count is updated to N after a commit, but the file was only extended to fewer than N pages on disk. The result is a page_count > file_pages mismatch that surfaces hours later as database disk image is malformed, with no way to attribute it to the specific transaction that caused it. By the time the integrity-check watchdog notices, the malformed page is interleaved with hundreds of unrelated subsequent writes, making recovery painful and forensics impossible.

The fix is a cheap, in-transaction sanity check: after every COMMIT in write_txn, read the 4-byte page_count field from the SQLite header (bytes 28–31) and compare against the actual file size. If actual_pages < page_count (the torn-extend signature), raise sqlite3.DatabaseError so the caller's transaction is rolled back instead of silently leaving the DB in an inconsistent state.

The check is bounded to a single pread() of 32 bytes per commit. Negligible cost (sub-millisecond on local SSD), zero behavior change for healthy DBs. Also bumps wal_autocheckpoint=100 in connect() to keep the WAL bounded under heavy write activity, which reduces the inter-commit window during which a torn-extend can land.

Related Issues

Refs #31208 (Bug E pragma hardening — same file, coordinate on merge ordering)
Refs #30973 (wal_autocheckpoint hardening — orthogonal)
Refs #30445, #30896, #30908 (related corruption reports)

This PR does NOT close any of the above on its own — they describe larger corruption-resilience programs. This is one detection knob in that program.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature
  • Breaking change
  • Documentation update

Changes Made

  • hermes_cli/kanban_db.py — Added page_count post-commit check inside write_txn. Added PRAGMA wal_autocheckpoint=100 to connect().
  • tests/hermes_cli/test_kanban_db.py — 5 new tests: healthy commit passes the check, simulated torn-extend (truncate the file post-commit) raises DatabaseError, the check is skipped for read-only / no-commit code paths, the pragma is set after connect, and the check is robust against concurrent writers (file size grows between header read and stat).

How to Test

scripts/run_tests.sh                           # full suite (matches CI)
pytest tests/hermes_cli/test_kanban_db.py -q   # the 5 new + existing kanban_db tests

The new tests assert behavior under the simulated torn-extend by writing a normal commit then truncating the file with os.truncate to drop the last page; the subsequent assertion is that the next write_txn raises on the check. (Real kernel-level torn-extends are rare and can't be reliably forced in CI — the truncate is a faithful proxy because it produces the exact same header-vs-file-pages mismatch.)

Checklist

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(kanban): ...)
  • I've checked there isn't already a PR for this change
  • My PR includes only changes related to one bug/feature
  • scripts/run_tests.sh passes locally
  • I've added tests for my changes (required for bug fixes — 5 new tests covering positive + adversarial paths)
  • I've tested on my platform (Linux/macOS/Windows) — human reviewer to confirm

Reads header bytes 28-31 after every COMMIT and compares against actual file size. Raises sqlite3.DatabaseError on torn-extend (actual_pages < page_count). Also sets PRAGMA wal_autocheckpoint=100 in connect().

Refs: NousResearch#31208 (Bug E - same file, coordinate), NousResearch#30973 (wal_autocheckpoint)
Refs: NousResearch#30445, NousResearch#30896, NousResearch#30908 (corruption reports)
@alt-glitch alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/plugins Plugin system and bundled plugins labels May 25, 2026
@hehehe0803

Copy link
Copy Markdown
Contributor

I hit the same corruption signature in a local Kanban board and this PR is pointed at the right failure class: valid SQLite header, but header page_count greater than the physical file length after the WAL sidecar is gone/empty.

One nuance: the current patch detects the torn-extend/page-count mismatch after COMMIT, but it does not prevent the common WAL-sidecar-loss variant. In WAL mode, a commit can leave newly allocated pages only in kanban.db-wal while the main DB header advertises the larger page count. If kanban.db-wal is later lost/truncated/copied separately, the main DB becomes malformed on the next open.

A complementary fix worth considering here instead of a separate duplicate PR:

  • after successful Kanban write_txn commit, run a best-effort PRAGMA wal_checkpoint(TRUNCATE) so the no-active-reader path collapses back into a self-contained kanban.db;
  • make that checkpoint non-blocking by temporarily setting PRAGMA busy_timeout=0 around the checkpoint and restoring the prior timeout, otherwise an active dashboard/gateway reader can make every write stall for the normal 30s SQLite timeout;
  • add two focused regressions:
    1. large Kanban write allocates overflow pages, delete kanban.db-wal/-shm, reconnect, PRAGMA integrity_check == ok;
    2. active reader holds a snapshot while another connection writes, and the post-commit checkpoint path returns quickly instead of blocking on the busy timeout.

So: +1 on this issue shape, but I’d treat post-commit invariant checking as detection/forensics, and a non-blocking post-write checkpoint as the prevention layer. Happy to test or help fold this into this PR rather than spawn yet another Kanban-corruption PR hydra head.

@steveonjava

Copy link
Copy Markdown
Contributor Author

Bundled into #32857 for batch review. This draft remains open as a cherry-pick fallback if maintainers prefer surgical landing.

@kshitijk4poor

Copy link
Copy Markdown
Collaborator

Merged via #33482 (commit 99c19eb). Cherry-picked with authorship preserved as part of the @steveonjava batch salvage from #32857. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/plugins Plugin system and bundled plugins P3 Low — cosmetic, nice to have type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants