Skip to content

fix: prevent state.db B-tree corruption from WAL TRUNCATE checkpoint#45546

Open
Sugumaran-Balasubramaniyan wants to merge 1 commit into
NousResearch:mainfrom
Sugumaran-Balasubramaniyan:fix/45383-wal-checkpoint-corruption
Open

fix: prevent state.db B-tree corruption from WAL TRUNCATE checkpoint#45546
Sugumaran-Balasubramaniyan wants to merge 1 commit into
NousResearch:mainfrom
Sugumaran-Balasubramaniyan:fix/45383-wal-checkpoint-corruption

Conversation

@Sugumaran-Balasubramaniyan

Copy link
Copy Markdown

Summary

state.db suffered recurring B-tree corruption on large FTS5 databases (255 MB, ~65K pages, 347 sessions, ~25K messages). Observed twice in 24 hours with identical corruption patterns.

Root Cause

_try_wal_checkpoint() in hermes_state.py used PRAGMA wal_checkpoint(TRUNCATE) every 50 writes. TRUNCATE mode:

  1. Acquires an exclusive lock on the database
  2. Transfers all committed WAL frames back to the main DB file
  3. Calls ftruncate() to shrink the WAL file to zero bytes

On large FTS5 shadow tables, the ftruncate() can race with pending page writes, leaving B-tree pages with invalid references — manifesting as btreeInitPage() returns error code 11 and invalid page number 218103808.

Additionally, SQLite's built-in auto-checkpoint (default: PASSIVE every 1000 pages) was running concurrently, creating redundant I/O and lock contention.

Fix

Change What
TRUNCATE → RESTART Same frame transfer + WAL-header reset but no ftruncate(). WAL file retains high-water-mark size; new writers reuse already-checkpointed frames. Eliminates the race while preventing unbounded WAL growth.
Disable auto-checkpoint PRAGMA wal_autocheckpoint=0 — stops SQLite's internal PASSIVE checkpoints from competing with explicit RESTART.
TRUNCATE retained safely Still used in close() (connection teardown under lock) and vacuum() (full DB rewrite).

Changes

File Change Lines
hermes_state.py Checkpoint mode + auto-checkpoint disable + docs +49/-18

Verification

  • 282 tests in test_hermes_state.py + test_hermes_state_wal_fallback.py: ✅
  • 22 tests in test_hermes_state_compression_locks.py + test_state_db_malformed_repair.py: ✅

Closes #45383

state.db suffered recurring B-tree corruption on large FTS5 databases
(255 MB, ~65K pages, 347 sessions). Root cause: _try_wal_checkpoint()
used PRAGMA wal_checkpoint(TRUNCATE) every 50 writes, which acquires
an exclusive lock, transfers all WAL frames, syncs the DB, and calls
ftruncate() to shrink the WAL to zero bytes. On large FTS5 shadow
tables, the ftruncate() can race with pending page writes, leaving
B-tree pages with invalid references (btreeInitPage error 11,
invalid page numbers).

Three changes:
1. Switch TRUNCATE to RESTART — same frame transfer but without
   ftruncate(), eliminating the race while still preventing unbounded
   WAL growth. TRUNCATE retained only in close() and vacuum() where
   the exclusive lock makes it safe.
2. Disable SQLite auto-checkpoint (PRAGMA wal_autocheckpoint=0) so
   the internal PASSIVE checkpoints don't compete with our explicit
   RESTART strategy on large databases.
3. Update stale comments to reflect the new checkpoint strategy.

Closes NousResearch#45383
@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/agent Core agent loop, run_agent.py, prompt builder labels Jun 13, 2026
@AIalliAI

Copy link
Copy Markdown
Contributor

The corruption diagnosis is solid — wal_checkpoint(TRUNCATE) racing ftruncate() against pending page writes on large FTS5 shadow tables is a real mechanism, and #45383's evidence (repeated btreeInitPage SQLITE_IOERR on the FTS5 B-trees) fits.

My concern is the fix's PRAGMA wal_autocheckpoint=0: combined with the existing best-effort except Exception: pass around the manual checkpoint, that removes the only fallback — if a manual RESTART checkpoint ever fails (I/O error, lock timeout) the WAL grows unbounded silently, with no log and no recovery path, on the most critical DB in the system. RESTART does avoid the ftruncate race, but disabling autocheckpoint is the risky part. The competing #45396 (which #45383's author points to as the fix) takes the safer route — PASSIVE periodic checkpoints, autocheckpoint kept as a fallback, plus checkpoint-failure logging and dedicated tests — and is CI-green. Suggest consolidating on #45396's approach; if RESTART is preferred over PASSIVE, at minimum keep autocheckpoint enabled and log checkpoint failures rather than swallowing them.

@liuhao1024

Copy link
Copy Markdown
Contributor

Verification: Clean fix — RESTART checkpoint avoids FTS5 B-tree corruption

Reviewed the checkpoint mode change (TRUNCATE → RESTART) in _try_wal_checkpoint():

  1. Correct rationale — TRUNCATE's ftruncate on large FTS5 databases (65K+ pages) while holding an exclusive lock is the documented corruption vector (state.db recurring B-tree corruption from WAL TRUNCATE checkpoint on large FTS5 databases #45383). RESTART performs the same frame transfer without shrinking the WAL file, avoiding the race.

  2. close() and vacuum() correctly retain TRUNCATE — these are safe because the connection is being torn down (close) or the entire DB is being rewritten (vacuum), so exclusive access is guaranteed.

  3. wal_autocheckpoint=0 — disabling SQLite's internal PASSIVE checkpoint prevents redundant I/O and lock contention with the explicit RESTART checkpoint. Comment clearly explains the rationale.

  4. Docstring updated — the old "PASSIVE was previously used" comment is replaced with a clear explanation of the TRUNCATE→RESTART change and why.

Well-structured, minimal-risk change with good documentation.

@tonydwb tonydwb left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary

Verdict: Approved

Important fix for state.db WAL checkpoint corruption (Issue #45383). The periodic TRUNCATE WAL checkpoint could cause B-tree corruption on large FTS5 databases (65K+ pages) due to the ftruncate race. This fix:

  • Changes periodic checkpoints from TRUNCATE to RESTART mode (same frame transfer and header reset, no ftruncate)
  • Disables SQLite's built-in PASSIVE auto-checkpoint to avoid redundant I/O
  • Keeps TRUNCATE only in close() and vacuum() where exclusive access is guaranteed
  • Adds detailed comments explaining the root cause and the RESTART vs TRUNCATE tradeoffs

Well-researched fix. No concerns.


Reviewed by Hermes Agent

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

state.db recurring B-tree corruption from WAL TRUNCATE checkpoint on large FTS5 databases

5 participants