Summary
state.db suffers recurring B-tree corruption. We have observed this twice in 24 hours (2026-06-12 and 2026-06-13) on the same database. The root cause is the PRAGMA wal_checkpoint(TRUNCATE) strategy in _try_wal_checkpoint(), combined with unbounded DB growth from FTS5 indexes.
Environment
- Hermes: current main (d221e36)
- OS: macOS 15.5 (Apple Silicon)
- Profiles: 6 gateway profiles (default + 5 named), each with its own
state.db
- Affected DB: default profile only, 255 MB, ~65K pages, 347 sessions, ~25K messages
Observed Corruption
Both incidents produced identical corruption patterns in the same B-trees:
Tree 23 page 63857: btreeInitPage() returns error code 11
Tree 23 page 828 cell 0: invalid page number 218103808
Tree 23 page 21849 cell 0: Rowid 1374389534729 out of order
Tree 20 page 64517-64577: btreeInitPage() returns error code 11
Tree 20 page 63912: 24x "2nd reference to page" errors
- Tree 20/23 =
messages table + FTS5 index B-trees (the largest structures in the DB)
sessions table: always clean
- All other tables (
state_meta, schema_version, compression_locks): always clean
Root Cause Analysis
1. Unbounded FTS5 growth
maybe_auto_prune_and_vacuum() had never run on this database (no last_auto_prune in state_meta). With two FTS5 indexes (messages_fts + messages_fts_trigram), every message INSERT triggers multiple FTS index writes. The indexes grew to dominate the page space.
2. Aggressive TRUNCATE checkpoint
_try_wal_checkpoint() runs PRAGMA wal_checkpoint(TRUNCATE) every 50 successful writes (_CHECKPOINT_EVERY_N_WRITES = 50). On a 65K-page database, this means:
- Copying thousands of WAL frames back into the main DB file
- Complete WAL truncation (exclusive lock required)
- High I/O pressure on every 50th write
The code comment explains PASSIVE was removed because "it never truncates the WAL file." But TRUNCATE on a large DB with deep FTS5 B-trees is the corruption trigger.
3. Silent error swallowing
def _try_wal_checkpoint(self) -> None:
try:
with self._lock:
result = self._conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
except Exception:
pass # Best effort — never fatal.
A partially-failed checkpoint corrupts pages silently. The next quick_check (which runs on a cron) discovers it, but by then the damage is done.
4. Timeline evidence
| Time |
Event |
| 2026-06-12 09:17 |
First corruption detected (errors.log) |
| 2026-06-12 17:38 |
Manual repair (.dump/reimport) |
| 2026-06-12 18:58 |
Repair confirmed, quick_check ok |
| 2026-06-13 03:30 |
Last quick_check ok |
| 2026-06-13 11:01 |
Corruption detected again (same B-trees) |
No macOS sleep/wake events, no process crashes, no disk errors in system log. The machine was awake and idle between 03:30 and 11:01 with only a few cron/feishu sessions writing.
Suggested Fix
Short term
-
Log checkpoint failures instead of except Exception: pass. A failed TRUNCATE checkpoint should at minimum emit a WARNING so operators can detect the onset of corruption.
-
Reduce checkpoint frequency for large databases. Consider making _CHECKPOINT_EVERY_N_WRITES adaptive (e.g., scale with DB page count) or increase the default from 50 to 200+.
-
Use PASSIVE or FULL instead of TRUNCATE for periodic checkpoints. Reserve TRUNCATE for explicit shutdown (close()). This trades WAL file size for safety.
Medium term
-
Set PRAGMA wal_autocheckpoint = N explicitly so SQLite's built-in auto-checkpoint keeps the WAL manageable between application-level checkpoints.
-
Ensure maybe_auto_prune_and_vacuum runs at startup for long-lived gateway processes. On our instance, the auto-prune metadata was never set, meaning the DB never got pruned or vacuumed despite months of operation.
Workaround
We are now running a cron quick_check every few hours and restoring from snapshots when corruption is detected. We also ran a manual vacuum() which optimized the FTS5 indexes and reduced the page count slightly.
Reproduction
This is a slow-onset issue. It manifests on any Hermes instance where:
- The default gateway processes high message volume
- FTS5 indexes grow large (tens of thousands of pages)
_CHECKPOINT_EVERY_N_WRITES = 50 triggers frequent TRUNCATE checkpoints
The corruption may not reproduce on small databases or low-traffic instances.
Summary
state.dbsuffers recurring B-tree corruption. We have observed this twice in 24 hours (2026-06-12 and 2026-06-13) on the same database. The root cause is thePRAGMA wal_checkpoint(TRUNCATE)strategy in_try_wal_checkpoint(), combined with unbounded DB growth from FTS5 indexes.Environment
state.dbObserved Corruption
Both incidents produced identical corruption patterns in the same B-trees:
messagestable + FTS5 index B-trees (the largest structures in the DB)sessionstable: always cleanstate_meta,schema_version,compression_locks): always cleanRoot Cause Analysis
1. Unbounded FTS5 growth
maybe_auto_prune_and_vacuum()had never run on this database (nolast_auto_pruneinstate_meta). With two FTS5 indexes (messages_fts+messages_fts_trigram), every message INSERT triggers multiple FTS index writes. The indexes grew to dominate the page space.2. Aggressive TRUNCATE checkpoint
_try_wal_checkpoint()runsPRAGMA wal_checkpoint(TRUNCATE)every 50 successful writes (_CHECKPOINT_EVERY_N_WRITES = 50). On a 65K-page database, this means:The code comment explains PASSIVE was removed because "it never truncates the WAL file." But TRUNCATE on a large DB with deep FTS5 B-trees is the corruption trigger.
3. Silent error swallowing
A partially-failed checkpoint corrupts pages silently. The next
quick_check(which runs on a cron) discovers it, but by then the damage is done.4. Timeline evidence
.dump/reimport)No macOS sleep/wake events, no process crashes, no disk errors in system log. The machine was awake and idle between 03:30 and 11:01 with only a few cron/feishu sessions writing.
Suggested Fix
Short term
Log checkpoint failures instead of
except Exception: pass. A failed TRUNCATE checkpoint should at minimum emit a WARNING so operators can detect the onset of corruption.Reduce checkpoint frequency for large databases. Consider making
_CHECKPOINT_EVERY_N_WRITESadaptive (e.g., scale with DB page count) or increase the default from 50 to 200+.Use PASSIVE or FULL instead of TRUNCATE for periodic checkpoints. Reserve TRUNCATE for explicit shutdown (
close()). This trades WAL file size for safety.Medium term
Set
PRAGMA wal_autocheckpoint = Nexplicitly so SQLite's built-in auto-checkpoint keeps the WAL manageable between application-level checkpoints.Ensure
maybe_auto_prune_and_vacuumruns at startup for long-lived gateway processes. On our instance, the auto-prune metadata was never set, meaning the DB never got pruned or vacuumed despite months of operation.Workaround
We are now running a cron
quick_checkevery few hours and restoring from snapshots when corruption is detected. We also ran a manualvacuum()which optimized the FTS5 indexes and reduced the page count slightly.Reproduction
This is a slow-onset issue. It manifests on any Hermes instance where:
_CHECKPOINT_EVERY_N_WRITES = 50triggers frequent TRUNCATE checkpointsThe corruption may not reproduce on small databases or low-traffic instances.