fix(kanban): harden SQLite against torn-write corruption (secure_delete + cell_size_check + synchronous=FULL) by steveonjava · Pull Request #31208 · NousResearch/hermes-agent

steveonjava · 2026-05-24T00:05:00Z

What does this PR do?

This PR adds three SQLite PRAGMA settings to the kanban database connect() function to protect against data corruption from torn writes and power loss:

PRAGMA secure_delete=ON — zeros freed pages on disk so stale data can't resurface if a later write leaves the page incomplete
PRAGMA cell_size_check=ON — catches corrupt cells as errors at read time instead of silently returning wrong data
PRAGMA synchronous=FULL — requires full fsync before checkpoint, narrowing the window where a crash between WAL commit and main-db write could corrupt a b-tree page header

These are set on every connection. secure_delete also persists to the DB header on fresh DBs, so the protection survives process restarts.

About disclosure: This is a data integrity fix, not a security vulnerability under the project's SECURITY.md §3.2. Database hardening is explicitly welcome as a public PR here — it doesn't cross OS-level isolation boundaries and doesn't fall under the in-scope categories. No private disclosure needed.

Related Issues

This PR addresses one piece of a broader cluster of open kanban-DB-corruption reports:

Refs Kanban DB corruption risk from multi-gateway concurrent SQLite access #30445 (kanban DB corruption risk from multi-gateway concurrent SQLite access)
Refs [Bug]: Kanban: rapid worker spawn-crash loop (sub-2s/crash) corrupts board SQLite B-tree before failure_limit trips #30896 (kanban: rapid worker spawn-crash loop corrupts board SQLite B-tree)
Refs kanban.db index corruption after frequent gateway restarts — dispatcher disables board permanently #30908 (kanban.db index corruption after frequent gateway restarts)

The three pragma changes here are defense-in-depth at the page layer; they do not fix the dispatcher-level write amplification or the gateway-restart latch in those issues but they reduce blast radius when those scenarios fire.

Related PRs

Refs fix: harden kanban sqlite durability #30645 (synchronous=FULL only) — this PR's synchronous=FULL change overlaps; the additional secure_delete and cell_size_check pragmas are unique to this PR. Up to maintainers which to land.
Refs fix(state): add PRAGMA synchronous=FULL + TRUNCATE checkpoint to prevent WAL corruption #30654 (state.db: synchronous=FULL + TRUNCATE checkpoint) — sibling DB, same pragma philosophy. Complementary.
Refs fix(state): retry transient SQLite WAL setup failures (Fixes #30576) #30700 (state.db: retry transient SQLite WAL setup failures, Fixes Fix: SQLite WAL + BTRFS COW compatibility — busy_timeout + retry logic #30576) — adjacent error-recovery angle on the sibling DB.
Refs fix: harden kanban sqlite corruption handling #30969 (DB-backed dispatcher leases, fail-closed corrupt-DB backup, reconcile-completions safeguards) — same area of kanban_db.py, complementary scope (this PR is page-layer hardening; fix: harden kanban sqlite corruption handling #30969 is dispatcher-layer hardening + recovery). The two PRs may merge-conflict on kanban_db.py but address different failure modes.
Refs feat(kanban): opt-in HERMES_KANBAN_SYNCHRONOUS_MODE for synchronous=FULL durability hardening #30973 (synchronous=FULL + wal_autocheckpoint for a different root cause) — wal_autocheckpoint is orthogonal; not included here.

Refreshed prior-art snapshot at packaging time (2026-05-23 17:30 PT); these references reflect upstream state at PR open.

Type of Change

🐛 Bug fix (non-breaking change that fixes an issue)
✨ New feature (non-breaking change that adds functionality)
🔒 Security fix
📝 Documentation update
✅ Tests (adding or improving test coverage)
♻️ Refactor (no behavior change)
🎯 New skill (bundled or hub)

Changes Made

Add PRAGMA secure_delete=ON to connect() with a comment explaining why (Bug E forensics: torn-write signature, stale-cell-content exposure)
Add PRAGMA cell_size_check=ON to connect()
Change PRAGMA synchronous=NORMAL to PRAGMA synchronous=FULL
Four new tests in tests/hermes_cli/test_kanban_db.py:
- test_connect_sets_secure_delete_on
- test_connect_sets_cell_size_check_on
- test_connect_sets_synchronous_full
- test_connect_pragmas_applied_on_reconnect

How to Test

Run the kanban DB tests:

scripts/run_tests.sh tests/hermes_cli/test_kanban_db.py -q

All four new tests pass, confirming each pragma is set on fresh and reconnected connections. Full test suite also passes (no behavior changes).

Checklist

Code

I've read the Contributing Guide
My commit messages follow Conventional Commits
I searched for existing PRs to make sure this isn't a duplicate
My PR contains only changes related to this fix/feature
I've run pytest tests/ -q and all tests pass
I've added tests for my changes (required for bug fixes)
I've tested on my platform

…te + cell_size_check + synchronous=FULL) Production corruption NousResearch#6 left b-tree pages with zeroed headers but intact old cell content — the Bug E pattern. This fix applies three pragma calls on every connect(): - synchronous=FULL (was NORMAL): closes the WAL-checkpoint reordering window where a crash between WAL commit and main-DB write leaves a partially-written b-tree page header. Cost is <1ms per commit on local SSD; negligible at kanban write volume. - secure_delete=ON: forces SQLite to zero freed page bytes on disk. If a torn write or hardware fault later corrupts a page, the underlying cell content is zero, so corruption is detectable and no stale rows can resurface as live data. - cell_size_check=ON: adds a read-side guard so corrupt cells surface as errors at read time rather than as silent wrong-data returns. All three are connection-scoped and re-applied on every connect(). secure_delete also writes a persistent flag into the DB header on the first call against a fresh DB, making the protection durable across processes for new DBs. Tests added for all four required cases: each pragma active on a fresh connection, and all three re-applied after close+reopen. Also adds the required negative test (migration path does not reset pragmas).

Required by .github/workflows/contributor-check.yml for first-time contributor PRs whose diff touches *.py files.

JoshBolding · 2026-05-24T05:05:00Z

I hit a very similar Kanban DB corruption path on WSL after a local worker retry-storm while the local model endpoint was unavailable. The synchronous=FULL part of this PR matches the local patch that stabilized my install.

One extra setting that helped in my local patch was:

conn.execute("PRAGMA journal_size_limit=8388608")
That was meant to keep a retry-storm from letting the WAL grow huge between checkpoints. Not sure if you want that in this PR or as a separate follow-up, but this PR is aligned with the failure mode I saw.

steveonjava · 2026-05-24T05:14:41Z

Thanks for sharing your reproduction on WSL, @JoshBolding. The synchronous=FULL part of this PR is what actually protects against the corruption pattern you saw, so on its own it should stabilize your install. The journal_size_limit=8388608 setting in your local patch is harmless but does not do what its name suggests: it only takes effect when SQLite runs a TRUNCATE checkpoint, which the default wal_autocheckpoint path never does, so the WAL was already bounded around 4 MiB by wal_autocheckpoint=1000 regardless of the limit. I confirmed it with a 200k-insert retry-storm reproducer: max WAL is 3.95 MiB with or without the pragma set.

JoshBolding · 2026-05-24T05:38:12Z

That makes sense, thanks for checking it with a reproducer. I had assumed journal_size_limit was helping bound the retry-storm case, but if the default WAL autocheckpoint already keeps it around ~4 MiB and the setting only matters for TRUNCATE checkpoints, then I agree it is not necessary for this fix.

The important part for my case was avoiding the corruption pattern under WSL, so synchronous=FULL covering that is exactly what I was hoping for. Appreciate you digging into it.

…ite txn The two functions previously ran in separate IMMEDIATE transactions. The inter-txn gap is a window where a WAL auto-checkpoint can partially flush — transferring the tasks-table page to main-db while leaving idx_tasks_status pages in WAL. If the checkpoint is then interrupted (SIGTERM, EIO, OS buffer eviction), the index drifts from the table and surfaces on the next connection as "row N missing from index idx_tasks_status". Fix: add a `_within_txn=False` kwarg to recompute_ready and to the `_clear_failure_counter` helper; when True they skip their own write_txn wrapper and execute inline. complete_task now invokes both with `_within_txn=True` inside its own write_txn, so the parent's status='done' UPDATE, the `completed` event, the failure-counter reset, and every child status='ready' UPDATE land in a single COMMIT. The checkpoint window closes. Stress reproducer in tests/stress/ asserts exactly ONE BEGIN IMMEDIATE for the merged txn and clean PRAGMA integrity_check across 50 sequential completions and across two concurrent connections. Cross-references: NousResearch#31208 (Bug E hardening — synchronous=FULL + secure_delete + cell_size_check, surfaces drift as loud rather than silent) and NousResearch#30908 (related corruption class triggered by EIO during checkpoint).

steveonjava · 2026-05-26T22:36:33Z

Bundled into #32857 for batch review. This draft remains open as a cherry-pick fallback if maintainers prefer surgical landing.

Reads header bytes 28-31 after every COMMIT and compares against actual file size. Raises sqlite3.DatabaseError on torn-extend (actual_pages < page_count). Also sets PRAGMA wal_autocheckpoint=100 in connect(). Refs: #31208 (Bug E - same file, coordinate), #30973 (wal_autocheckpoint) Refs: #30445, #30896, #30908 (corruption reports)

kshitijk4poor · 2026-05-27T21:32:37Z

Merged via #33482 (commit 6416dd5). Authorship preserved.

Reads header bytes 28-31 after every COMMIT and compares against actual file size. Raises sqlite3.DatabaseError on torn-extend (actual_pages < page_count). Also sets PRAGMA wal_autocheckpoint=100 in connect(). Refs: NousResearch#31208 (Bug E - same file, coordinate), NousResearch#30973 (wal_autocheckpoint) Refs: NousResearch#30445, NousResearch#30896, NousResearch#30908 (corruption reports)

Reads header bytes 28-31 after every COMMIT and compares against actual file size. Raises sqlite3.DatabaseError on torn-extend (actual_pages < page_count). Also sets PRAGMA wal_autocheckpoint=100 in connect(). Refs: NousResearch#31208 (Bug E - same file, coordinate), NousResearch#30973 (wal_autocheckpoint) Refs: NousResearch#30445, NousResearch#30896, NousResearch#30908 (corruption reports) #AI commit#

Reads header bytes 28-31 after every COMMIT and compares against actual file size. Raises sqlite3.DatabaseError on torn-extend (actual_pages < page_count). Also sets PRAGMA wal_autocheckpoint=100 in connect(). Refs: NousResearch#31208 (Bug E - same file, coordinate), NousResearch#30973 (wal_autocheckpoint) Refs: NousResearch#30445, NousResearch#30896, NousResearch#30908 (corruption reports)

alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/plugins Plugin system and bundled plugins labels May 24, 2026

steveonjava force-pushed the fix/kanban-sqlite-torn-write-hardening branch from ce363ac to 3d06ab0 Compare May 24, 2026 01:11

chore(release): add steveonjava to AUTHOR_MAP

b47e94d

Required by .github/workflows/contributor-check.yml for first-time contributor PRs whose diff touches *.py files.

steveonjava force-pushed the fix/kanban-sqlite-torn-write-hardening branch from 3d06ab0 to b47e94d Compare May 24, 2026 01:15

steveonjava marked this pull request as ready for review May 24, 2026 01:19

This was referenced May 24, 2026

Kanban SQLite database corruption under rapid task creation #31502

Closed

fix(kanban): change synchronous=NORMAL to FULL + add wal_autocheckpoint=100 #31731

Closed

steveonjava mentioned this pull request May 25, 2026

fix(kanban): merge complete_task and recompute_ready into a single write txn #31891

Closed

8 tasks

steveonjava mentioned this pull request May 25, 2026

fix(kanban): add post-commit page_count invariant check to write_txn #32300

Closed

11 tasks

alt-glitch mentioned this pull request May 26, 2026

[Bug]: Kanban DB intermittent corruption after worker crash — missing WAL checkpoint #32543

Closed

This was referenced May 26, 2026

fix(kanban): batch-salvage 8 SQLite corruption hardening fixes (closes #31158, refs #29610) #32857

Closed

Bug: embedded Kanban dispatcher still leaks sqlite/WAL file descriptors after #28301 #29610

Closed

alt-glitch mentioned this pull request May 27, 2026

[Bug]: Database corruption problem with Kanban causing system crash. #33334

Closed

1 task

kshitijk4poor mentioned this pull request May 27, 2026

fix(kanban): batch-salvage 7 SQLite corruption hardening fixes from #32857 #33482

Merged

kshitijk4poor closed this in #33482 May 27, 2026

kaluluosi mentioned this pull request Jun 12, 2026

[Bug]: _try_wal_checkpoint TRUNCATE silently swallows exceptions, corrupts state.db WAL to zero bytes #44795

Open

liuhao1024 mentioned this pull request Jun 12, 2026

fix(state): log WAL checkpoint failures instead of silently swallowing #44834

Open

13 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(kanban): harden SQLite against torn-write corruption (secure_delete + cell_size_check + synchronous=FULL)#31208

fix(kanban): harden SQLite against torn-write corruption (secure_delete + cell_size_check + synchronous=FULL)#31208
steveonjava wants to merge 2 commits into
NousResearch:mainfrom
steveonjava:fix/kanban-sqlite-torn-write-hardening

steveonjava commented May 24, 2026 •

edited

Loading

Uh oh!

JoshBolding commented May 24, 2026

Uh oh!

steveonjava commented May 24, 2026

Uh oh!

JoshBolding commented May 24, 2026

Uh oh!

steveonjava commented May 26, 2026

Uh oh!

kshitijk4poor commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

steveonjava commented May 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Related Issues

Related PRs

Type of Change

Changes Made

How to Test

Checklist

Code

Uh oh!

JoshBolding commented May 24, 2026

Uh oh!

steveonjava commented May 24, 2026

Uh oh!

JoshBolding commented May 24, 2026

Uh oh!

steveonjava commented May 26, 2026

Uh oh!

kshitijk4poor commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

steveonjava commented May 24, 2026 •

edited

Loading