Skip to content

state.db recurring B-tree corruption from WAL TRUNCATE checkpoint on large FTS5 databases #45383

@lukedd666-sudo

Description

@lukedd666-sudo

Summary

state.db suffers recurring B-tree corruption. We have observed this twice in 24 hours (2026-06-12 and 2026-06-13) on the same database. The root cause is the PRAGMA wal_checkpoint(TRUNCATE) strategy in _try_wal_checkpoint(), combined with unbounded DB growth from FTS5 indexes.

Environment

  • Hermes: current main (d221e36)
  • OS: macOS 15.5 (Apple Silicon)
  • Profiles: 6 gateway profiles (default + 5 named), each with its own state.db
  • Affected DB: default profile only, 255 MB, ~65K pages, 347 sessions, ~25K messages

Observed Corruption

Both incidents produced identical corruption patterns in the same B-trees:

Tree 23 page 63857: btreeInitPage() returns error code 11
Tree 23 page 828 cell 0: invalid page number 218103808
Tree 23 page 21849 cell 0: Rowid 1374389534729 out of order
Tree 20 page 64517-64577: btreeInitPage() returns error code 11
Tree 20 page 63912: 24x "2nd reference to page" errors
  • Tree 20/23 = messages table + FTS5 index B-trees (the largest structures in the DB)
  • sessions table: always clean
  • All other tables (state_meta, schema_version, compression_locks): always clean

Root Cause Analysis

1. Unbounded FTS5 growth

maybe_auto_prune_and_vacuum() had never run on this database (no last_auto_prune in state_meta). With two FTS5 indexes (messages_fts + messages_fts_trigram), every message INSERT triggers multiple FTS index writes. The indexes grew to dominate the page space.

2. Aggressive TRUNCATE checkpoint

_try_wal_checkpoint() runs PRAGMA wal_checkpoint(TRUNCATE) every 50 successful writes (_CHECKPOINT_EVERY_N_WRITES = 50). On a 65K-page database, this means:

  • Copying thousands of WAL frames back into the main DB file
  • Complete WAL truncation (exclusive lock required)
  • High I/O pressure on every 50th write

The code comment explains PASSIVE was removed because "it never truncates the WAL file." But TRUNCATE on a large DB with deep FTS5 B-trees is the corruption trigger.

3. Silent error swallowing

def _try_wal_checkpoint(self) -> None:
    try:
        with self._lock:
            result = self._conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
    except Exception:
        pass  # Best effort — never fatal.

A partially-failed checkpoint corrupts pages silently. The next quick_check (which runs on a cron) discovers it, but by then the damage is done.

4. Timeline evidence

Time Event
2026-06-12 09:17 First corruption detected (errors.log)
2026-06-12 17:38 Manual repair (.dump/reimport)
2026-06-12 18:58 Repair confirmed, quick_check ok
2026-06-13 03:30 Last quick_check ok
2026-06-13 11:01 Corruption detected again (same B-trees)

No macOS sleep/wake events, no process crashes, no disk errors in system log. The machine was awake and idle between 03:30 and 11:01 with only a few cron/feishu sessions writing.

Suggested Fix

Short term

  1. Log checkpoint failures instead of except Exception: pass. A failed TRUNCATE checkpoint should at minimum emit a WARNING so operators can detect the onset of corruption.

  2. Reduce checkpoint frequency for large databases. Consider making _CHECKPOINT_EVERY_N_WRITES adaptive (e.g., scale with DB page count) or increase the default from 50 to 200+.

  3. Use PASSIVE or FULL instead of TRUNCATE for periodic checkpoints. Reserve TRUNCATE for explicit shutdown (close()). This trades WAL file size for safety.

Medium term

  1. Set PRAGMA wal_autocheckpoint = N explicitly so SQLite's built-in auto-checkpoint keeps the WAL manageable between application-level checkpoints.

  2. Ensure maybe_auto_prune_and_vacuum runs at startup for long-lived gateway processes. On our instance, the auto-prune metadata was never set, meaning the DB never got pruned or vacuumed despite months of operation.

Workaround

We are now running a cron quick_check every few hours and restoring from snapshots when corruption is detected. We also ran a manual vacuum() which optimized the FTS5 indexes and reduced the page count slightly.

Reproduction

This is a slow-onset issue. It manifests on any Hermes instance where:

  • The default gateway processes high message volume
  • FTS5 indexes grow large (tens of thousands of pages)
  • _CHECKPOINT_EVERY_N_WRITES = 50 triggers frequent TRUNCATE checkpoints

The corruption may not reproduce on small databases or low-traffic instances.

Metadata

Metadata

Assignees

No one assigned

    Labels

    P2Medium — degraded but workaround existsarea/configConfig system, migrations, profilescomp/agentCore agent loop, run_agent.py, prompt buildertype/bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions