Skip to content

fix(kanban): harden SQLite against torn-write corruption (secure_delete + cell_size_check + synchronous=FULL)#31208

Closed
steveonjava wants to merge 2 commits into
NousResearch:mainfrom
steveonjava:fix/kanban-sqlite-torn-write-hardening
Closed

fix(kanban): harden SQLite against torn-write corruption (secure_delete + cell_size_check + synchronous=FULL)#31208
steveonjava wants to merge 2 commits into
NousResearch:mainfrom
steveonjava:fix/kanban-sqlite-torn-write-hardening

Conversation

@steveonjava

@steveonjava steveonjava commented May 24, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

This PR adds three SQLite PRAGMA settings to the kanban database connect() function to protect against data corruption from torn writes and power loss:

  • PRAGMA secure_delete=ON — zeros freed pages on disk so stale data can't resurface if a later write leaves the page incomplete
  • PRAGMA cell_size_check=ON — catches corrupt cells as errors at read time instead of silently returning wrong data
  • PRAGMA synchronous=FULL — requires full fsync before checkpoint, narrowing the window where a crash between WAL commit and main-db write could corrupt a b-tree page header

These are set on every connection. secure_delete also persists to the DB header on fresh DBs, so the protection survives process restarts.

About disclosure: This is a data integrity fix, not a security vulnerability under the project's SECURITY.md §3.2. Database hardening is explicitly welcome as a public PR here — it doesn't cross OS-level isolation boundaries and doesn't fall under the in-scope categories. No private disclosure needed.

Related Issues

This PR addresses one piece of a broader cluster of open kanban-DB-corruption reports:

The three pragma changes here are defense-in-depth at the page layer; they do not fix the dispatcher-level write amplification or the gateway-restart latch in those issues but they reduce blast radius when those scenarios fire.

Related PRs

Refreshed prior-art snapshot at packaging time (2026-05-23 17:30 PT); these references reflect upstream state at PR open.

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • Add PRAGMA secure_delete=ON to connect() with a comment explaining why (Bug E forensics: torn-write signature, stale-cell-content exposure)
  • Add PRAGMA cell_size_check=ON to connect()
  • Change PRAGMA synchronous=NORMAL to PRAGMA synchronous=FULL
  • Four new tests in tests/hermes_cli/test_kanban_db.py:
    • test_connect_sets_secure_delete_on
    • test_connect_sets_cell_size_check_on
    • test_connect_sets_synchronous_full
    • test_connect_pragmas_applied_on_reconnect

How to Test

Run the kanban DB tests:

scripts/run_tests.sh tests/hermes_cli/test_kanban_db.py -q

All four new tests pass, confirming each pragma is set on fresh and reconnected connections. Full test suite also passes (no behavior changes).

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes)
  • I've tested on my platform

@alt-glitch alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/plugins Plugin system and bundled plugins labels May 24, 2026
…te + cell_size_check + synchronous=FULL)

Production corruption NousResearch#6 left b-tree pages with zeroed headers but intact old cell content — the Bug E pattern. This fix applies three pragma calls on every connect():

- synchronous=FULL (was NORMAL): closes the WAL-checkpoint reordering window where a crash between WAL commit and main-DB write leaves a partially-written b-tree page header. Cost is <1ms per commit on local SSD; negligible at kanban write volume.

- secure_delete=ON: forces SQLite to zero freed page bytes on disk. If a torn write or hardware fault later corrupts a page, the underlying cell content is zero, so corruption is detectable and no stale rows can resurface as live data.

- cell_size_check=ON: adds a read-side guard so corrupt cells surface as errors at read time rather than as silent wrong-data returns.

All three are connection-scoped and re-applied on every connect(). secure_delete also writes a persistent flag into the DB header on the first call against a fresh DB, making the protection durable across processes for new DBs.

Tests added for all four required cases: each pragma active on a fresh connection, and all three re-applied after close+reopen. Also adds the required negative test (migration path does not reset pragmas).
@steveonjava steveonjava force-pushed the fix/kanban-sqlite-torn-write-hardening branch from ce363ac to 3d06ab0 Compare May 24, 2026 01:11
Required by .github/workflows/contributor-check.yml for first-time
contributor PRs whose diff touches *.py files.
@steveonjava steveonjava force-pushed the fix/kanban-sqlite-torn-write-hardening branch from 3d06ab0 to b47e94d Compare May 24, 2026 01:15
@steveonjava steveonjava marked this pull request as ready for review May 24, 2026 01:19
@JoshBolding

Copy link
Copy Markdown

I hit a very similar Kanban DB corruption path on WSL after a local worker retry-storm while the local model endpoint was unavailable. The synchronous=FULL part of this PR matches the local patch that stabilized my install.

One extra setting that helped in my local patch was:

conn.execute("PRAGMA journal_size_limit=8388608")
That was meant to keep a retry-storm from letting the WAL grow huge between checkpoints. Not sure if you want that in this PR or as a separate follow-up, but this PR is aligned with the failure mode I saw.

@steveonjava

Copy link
Copy Markdown
Contributor Author

Thanks for sharing your reproduction on WSL, @JoshBolding. The synchronous=FULL part of this PR is what actually protects against the corruption pattern you saw, so on its own it should stabilize your install. The journal_size_limit=8388608 setting in your local patch is harmless but does not do what its name suggests: it only takes effect when SQLite runs a TRUNCATE checkpoint, which the default wal_autocheckpoint path never does, so the WAL was already bounded around 4 MiB by wal_autocheckpoint=1000 regardless of the limit. I confirmed it with a 200k-insert retry-storm reproducer: max WAL is 3.95 MiB with or without the pragma set.

@JoshBolding

Copy link
Copy Markdown

That makes sense, thanks for checking it with a reproducer. I had assumed journal_size_limit was helping bound the retry-storm case, but if the default WAL autocheckpoint already keeps it around ~4 MiB and the setting only matters for TRUNCATE checkpoints, then I agree it is not necessary for this fix.

The important part for my case was avoiding the corruption pattern under WSL, so synchronous=FULL covering that is exactly what I was hoping for. Appreciate you digging into it.

steveonjava added a commit to steveonjava/hermes-agent that referenced this pull request May 25, 2026
…ite txn

The two functions previously ran in separate IMMEDIATE transactions. The
inter-txn gap is a window where a WAL auto-checkpoint can partially flush —
transferring the tasks-table page to main-db while leaving idx_tasks_status
pages in WAL. If the checkpoint is then interrupted (SIGTERM, EIO, OS
buffer eviction), the index drifts from the table and surfaces on the next
connection as "row N missing from index idx_tasks_status".

Fix: add a `_within_txn=False` kwarg to recompute_ready and to the
`_clear_failure_counter` helper; when True they skip their own write_txn
wrapper and execute inline. complete_task now invokes both with
`_within_txn=True` inside its own write_txn, so the parent's status='done'
UPDATE, the `completed` event, the failure-counter reset, and every child
status='ready' UPDATE land in a single COMMIT. The checkpoint window
closes.

Stress reproducer in tests/stress/ asserts exactly ONE BEGIN IMMEDIATE for
the merged txn and clean PRAGMA integrity_check across 50 sequential
completions and across two concurrent connections.

Cross-references: NousResearch#31208 (Bug E hardening — synchronous=FULL +
secure_delete + cell_size_check, surfaces drift as loud rather than silent)
and NousResearch#30908 (related corruption class triggered by EIO during checkpoint).
@steveonjava

Copy link
Copy Markdown
Contributor Author

Bundled into #32857 for batch review. This draft remains open as a cherry-pick fallback if maintainers prefer surgical landing.

kshitijk4poor pushed a commit that referenced this pull request May 27, 2026
Reads header bytes 28-31 after every COMMIT and compares against actual file size. Raises sqlite3.DatabaseError on torn-extend (actual_pages < page_count). Also sets PRAGMA wal_autocheckpoint=100 in connect().

Refs: #31208 (Bug E - same file, coordinate), #30973 (wal_autocheckpoint)
Refs: #30445, #30896, #30908 (corruption reports)
@kshitijk4poor

Copy link
Copy Markdown
Collaborator

Merged via #33482 (commit 6416dd5). Authorship preserved.

mathias3 pushed a commit to mathias3/hermes-agent that referenced this pull request May 28, 2026
Reads header bytes 28-31 after every COMMIT and compares against actual file size. Raises sqlite3.DatabaseError on torn-extend (actual_pages < page_count). Also sets PRAGMA wal_autocheckpoint=100 in connect().

Refs: NousResearch#31208 (Bug E - same file, coordinate), NousResearch#30973 (wal_autocheckpoint)
Refs: NousResearch#30445, NousResearch#30896, NousResearch#30908 (corruption reports)
Bryce-huang pushed a commit to wbkunlun/hermes-agent that referenced this pull request May 29, 2026
Reads header bytes 28-31 after every COMMIT and compares against actual file size. Raises sqlite3.DatabaseError on torn-extend (actual_pages < page_count). Also sets PRAGMA wal_autocheckpoint=100 in connect().

Refs: NousResearch#31208 (Bug E - same file, coordinate), NousResearch#30973 (wal_autocheckpoint)
Refs: NousResearch#30445, NousResearch#30896, NousResearch#30908 (corruption reports)

#AI commit#
mosaiq-systems pushed a commit to mosaiq-systems/hermes-agent that referenced this pull request May 29, 2026
Reads header bytes 28-31 after every COMMIT and compares against actual file size. Raises sqlite3.DatabaseError on torn-extend (actual_pages < page_count). Also sets PRAGMA wal_autocheckpoint=100 in connect().

Refs: NousResearch#31208 (Bug E - same file, coordinate), NousResearch#30973 (wal_autocheckpoint)
Refs: NousResearch#30445, NousResearch#30896, NousResearch#30908 (corruption reports)
KKT-OPT pushed a commit to KKT-OPT/hermes-agent that referenced this pull request May 31, 2026
Reads header bytes 28-31 after every COMMIT and compares against actual file size. Raises sqlite3.DatabaseError on torn-extend (actual_pages < page_count). Also sets PRAGMA wal_autocheckpoint=100 in connect().

Refs: NousResearch#31208 (Bug E - same file, coordinate), NousResearch#30973 (wal_autocheckpoint)
Refs: NousResearch#30445, NousResearch#30896, NousResearch#30908 (corruption reports)
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
Reads header bytes 28-31 after every COMMIT and compares against actual file size. Raises sqlite3.DatabaseError on torn-extend (actual_pages < page_count). Also sets PRAGMA wal_autocheckpoint=100 in connect().

Refs: NousResearch#31208 (Bug E - same file, coordinate), NousResearch#30973 (wal_autocheckpoint)
Refs: NousResearch#30445, NousResearch#30896, NousResearch#30908 (corruption reports)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/plugins Plugin system and bundled plugins P3 Low — cosmetic, nice to have type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fix: SQLite WAL + BTRFS COW compatibility — busy_timeout + retry logic

4 participants