fix: prevent state.db B-tree corruption from WAL TRUNCATE checkpoint by Sugumaran-Balasubramaniyan · Pull Request #45546 · NousResearch/hermes-agent

Sugumaran-Balasubramaniyan · 2026-06-13T11:12:17Z

Summary

state.db suffered recurring B-tree corruption on large FTS5 databases (255 MB, ~65K pages, 347 sessions, ~25K messages). Observed twice in 24 hours with identical corruption patterns.

Root Cause

_try_wal_checkpoint() in hermes_state.py used PRAGMA wal_checkpoint(TRUNCATE) every 50 writes. TRUNCATE mode:

Acquires an exclusive lock on the database
Transfers all committed WAL frames back to the main DB file
Calls ftruncate() to shrink the WAL file to zero bytes

On large FTS5 shadow tables, the ftruncate() can race with pending page writes, leaving B-tree pages with invalid references — manifesting as btreeInitPage() returns error code 11 and invalid page number 218103808.

Additionally, SQLite's built-in auto-checkpoint (default: PASSIVE every 1000 pages) was running concurrently, creating redundant I/O and lock contention.

Fix

Change	What
TRUNCATE → RESTART	Same frame transfer + WAL-header reset but no ftruncate(). WAL file retains high-water-mark size; new writers reuse already-checkpointed frames. Eliminates the race while preventing unbounded WAL growth.
Disable auto-checkpoint	`PRAGMA wal_autocheckpoint=0` — stops SQLite's internal PASSIVE checkpoints from competing with explicit RESTART.
TRUNCATE retained safely	Still used in `close()` (connection teardown under lock) and `vacuum()` (full DB rewrite).

Changes

File	Change	Lines
`hermes_state.py`	Checkpoint mode + auto-checkpoint disable + docs	+49/-18

Verification

282 tests in test_hermes_state.py + test_hermes_state_wal_fallback.py: ✅
22 tests in test_hermes_state_compression_locks.py + test_state_db_malformed_repair.py: ✅

Closes #45383

state.db suffered recurring B-tree corruption on large FTS5 databases (255 MB, ~65K pages, 347 sessions). Root cause: _try_wal_checkpoint() used PRAGMA wal_checkpoint(TRUNCATE) every 50 writes, which acquires an exclusive lock, transfers all WAL frames, syncs the DB, and calls ftruncate() to shrink the WAL to zero bytes. On large FTS5 shadow tables, the ftruncate() can race with pending page writes, leaving B-tree pages with invalid references (btreeInitPage error 11, invalid page numbers). Three changes: 1. Switch TRUNCATE to RESTART — same frame transfer but without ftruncate(), eliminating the race while still preventing unbounded WAL growth. TRUNCATE retained only in close() and vacuum() where the exclusive lock makes it safe. 2. Disable SQLite auto-checkpoint (PRAGMA wal_autocheckpoint=0) so the internal PASSIVE checkpoints don't compete with our explicit RESTART strategy on large databases. 3. Update stale comments to reflect the new checkpoint strategy. Closes NousResearch#45383

AIalliAI · 2026-06-13T11:46:27Z

The corruption diagnosis is solid — wal_checkpoint(TRUNCATE) racing ftruncate() against pending page writes on large FTS5 shadow tables is a real mechanism, and #45383's evidence (repeated btreeInitPage SQLITE_IOERR on the FTS5 B-trees) fits.

My concern is the fix's PRAGMA wal_autocheckpoint=0: combined with the existing best-effort except Exception: pass around the manual checkpoint, that removes the only fallback — if a manual RESTART checkpoint ever fails (I/O error, lock timeout) the WAL grows unbounded silently, with no log and no recovery path, on the most critical DB in the system. RESTART does avoid the ftruncate race, but disabling autocheckpoint is the risky part. The competing #45396 (which #45383's author points to as the fix) takes the safer route — PASSIVE periodic checkpoints, autocheckpoint kept as a fallback, plus checkpoint-failure logging and dedicated tests — and is CI-green. Suggest consolidating on #45396's approach; if RESTART is preferred over PASSIVE, at minimum keep autocheckpoint enabled and log checkpoint failures rather than swallowing them.

liuhao1024 · 2026-06-13T11:46:34Z

Verification: Clean fix — RESTART checkpoint avoids FTS5 B-tree corruption

Reviewed the checkpoint mode change (TRUNCATE → RESTART) in _try_wal_checkpoint():

Correct rationale — TRUNCATE's ftruncate on large FTS5 databases (65K+ pages) while holding an exclusive lock is the documented corruption vector (state.db recurring B-tree corruption from WAL TRUNCATE checkpoint on large FTS5 databases #45383). RESTART performs the same frame transfer without shrinking the WAL file, avoiding the race.
close() and vacuum() correctly retain TRUNCATE — these are safe because the connection is being torn down (close) or the entire DB is being rewritten (vacuum), so exclusive access is guaranteed.
wal_autocheckpoint=0 — disabling SQLite's internal PASSIVE checkpoint prevents redundant I/O and lock contention with the explicit RESTART checkpoint. Comment clearly explains the rationale.
Docstring updated — the old "PASSIVE was previously used" comment is replaced with a clear explanation of the TRUNCATE→RESTART change and why.

Well-structured, minimal-risk change with good documentation.

tonydwb

Code Review Summary

Verdict: Approved

Important fix for state.db WAL checkpoint corruption (Issue #45383). The periodic TRUNCATE WAL checkpoint could cause B-tree corruption on large FTS5 databases (65K+ pages) due to the ftruncate race. This fix:

Changes periodic checkpoints from TRUNCATE to RESTART mode (same frame transfer and header reset, no ftruncate)
Disables SQLite's built-in PASSIVE auto-checkpoint to avoid redundant I/O
Keeps TRUNCATE only in close() and vacuum() where exclusive access is guaranteed
Adds detailed comments explaining the root cause and the RESTART vs TRUNCATE tradeoffs

Well-researched fix. No concerns.

Reviewed by Hermes Agent

alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/agent Core agent loop, run_agent.py, prompt builder labels Jun 13, 2026

tonydwb approved these changes Jun 13, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: prevent state.db B-tree corruption from WAL TRUNCATE checkpoint#45546

fix: prevent state.db B-tree corruption from WAL TRUNCATE checkpoint#45546
Sugumaran-Balasubramaniyan wants to merge 1 commit into
NousResearch:mainfrom
Sugumaran-Balasubramaniyan:fix/45383-wal-checkpoint-corruption

Sugumaran-Balasubramaniyan commented Jun 13, 2026

Uh oh!

AIalliAI commented Jun 13, 2026

Uh oh!

liuhao1024 commented Jun 13, 2026

Uh oh!

tonydwb left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

Sugumaran-Balasubramaniyan commented Jun 13, 2026

Summary

Root Cause

Fix

Changes

Verification

Uh oh!

AIalliAI commented Jun 13, 2026

Uh oh!

liuhao1024 commented Jun 13, 2026

Verification: Clean fix — RESTART checkpoint avoids FTS5 B-tree corruption

Uh oh!

tonydwb left a comment

Choose a reason for hiding this comment

Code Review Summary

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants