state.db recurring B-tree corruption from WAL TRUNCATE checkpoint on large FTS5 databases

## Summary

`state.db` suffers recurring B-tree corruption. We have observed this **twice in 24 hours** (2026-06-12 and 2026-06-13) on the same database. The root cause is the `PRAGMA wal_checkpoint(TRUNCATE)` strategy in `_try_wal_checkpoint()`, combined with unbounded DB growth from FTS5 indexes.

## Environment

- Hermes: current main (d221e36)
- OS: macOS 15.5 (Apple Silicon)
- Profiles: 6 gateway profiles (default + 5 named), each with its own `state.db`
- Affected DB: default profile only, **255 MB**, ~65K pages, 347 sessions, ~25K messages

## Observed Corruption

Both incidents produced identical corruption patterns in the same B-trees:

```
Tree 23 page 63857: btreeInitPage() returns error code 11
Tree 23 page 828 cell 0: invalid page number 218103808
Tree 23 page 21849 cell 0: Rowid 1374389534729 out of order
Tree 20 page 64517-64577: btreeInitPage() returns error code 11
Tree 20 page 63912: 24x "2nd reference to page" errors
```

- **Tree 20/23** = `messages` table + FTS5 index B-trees (the largest structures in the DB)
- `sessions` table: always clean
- All other tables (`state_meta`, `schema_version`, `compression_locks`): always clean

## Root Cause Analysis

### 1. Unbounded FTS5 growth

`maybe_auto_prune_and_vacuum()` had never run on this database (no `last_auto_prune` in `state_meta`). With two FTS5 indexes (`messages_fts` + `messages_fts_trigram`), every message INSERT triggers multiple FTS index writes. The indexes grew to dominate the page space.

### 2. Aggressive TRUNCATE checkpoint

`_try_wal_checkpoint()` runs `PRAGMA wal_checkpoint(TRUNCATE)` every **50 successful writes** (`_CHECKPOINT_EVERY_N_WRITES = 50`). On a 65K-page database, this means:
- Copying thousands of WAL frames back into the main DB file
- Complete WAL truncation (exclusive lock required)
- High I/O pressure on every 50th write

The code comment explains PASSIVE was removed because "it never truncates the WAL file." But TRUNCATE on a large DB with deep FTS5 B-trees is the corruption trigger.

### 3. Silent error swallowing

```python
def _try_wal_checkpoint(self) -> None:
    try:
        with self._lock:
            result = self._conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
    except Exception:
        pass  # Best effort — never fatal.
```

A partially-failed checkpoint corrupts pages silently. The next `quick_check` (which runs on a cron) discovers it, but by then the damage is done.

### 4. Timeline evidence

| Time | Event |
|------|-------|
| 2026-06-12 09:17 | First corruption detected (errors.log) |
| 2026-06-12 17:38 | Manual repair (`.dump`/reimport) |
| 2026-06-12 18:58 | Repair confirmed, quick_check ok |
| 2026-06-13 03:30 | Last quick_check ok |
| 2026-06-13 11:01 | Corruption detected again (same B-trees) |

No macOS sleep/wake events, no process crashes, no disk errors in system log. The machine was awake and idle between 03:30 and 11:01 with only a few cron/feishu sessions writing.

## Suggested Fix

### Short term

1. **Log checkpoint failures** instead of `except Exception: pass`. A failed TRUNCATE checkpoint should at minimum emit a WARNING so operators can detect the onset of corruption.

2. **Reduce checkpoint frequency** for large databases. Consider making `_CHECKPOINT_EVERY_N_WRITES` adaptive (e.g., scale with DB page count) or increase the default from 50 to 200+.

3. **Use PASSIVE or FULL instead of TRUNCATE** for periodic checkpoints. Reserve TRUNCATE for explicit shutdown (`close()`). This trades WAL file size for safety.

### Medium term

4. **Set `PRAGMA wal_autocheckpoint = N`** explicitly so SQLite's built-in auto-checkpoint keeps the WAL manageable between application-level checkpoints.

5. **Ensure `maybe_auto_prune_and_vacuum` runs at startup** for long-lived gateway processes. On our instance, the auto-prune metadata was never set, meaning the DB never got pruned or vacuumed despite months of operation.

## Workaround

We are now running a cron `quick_check` every few hours and restoring from snapshots when corruption is detected. We also ran a manual `vacuum()` which optimized the FTS5 indexes and reduced the page count slightly.

## Reproduction

This is a slow-onset issue. It manifests on any Hermes instance where:
- The default gateway processes high message volume
- FTS5 indexes grow large (tens of thousands of pages)
- `_CHECKPOINT_EVERY_N_WRITES = 50` triggers frequent TRUNCATE checkpoints

The corruption may not reproduce on small databases or low-traffic instances.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

state.db recurring B-tree corruption from WAL TRUNCATE checkpoint on large FTS5 databases #45383

Summary

Environment

Observed Corruption

Root Cause Analysis

1. Unbounded FTS5 growth

2. Aggressive TRUNCATE checkpoint

3. Silent error swallowing

4. Timeline evidence

Suggested Fix

Short term

Medium term

Workaround

Reproduction

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Time	Event
2026-06-12 09:17	First corruption detected (errors.log)
2026-06-12 17:38	Manual repair (`.dump`/reimport)
2026-06-12 18:58	Repair confirmed, quick_check ok
2026-06-13 03:30	Last quick_check ok
2026-06-13 11:01	Corruption detected again (same B-trees)

state.db recurring B-tree corruption from WAL TRUNCATE checkpoint on large FTS5 databases #45383

Description

Summary

Environment

Observed Corruption

Root Cause Analysis

1. Unbounded FTS5 growth

2. Aggressive TRUNCATE checkpoint

3. Silent error swallowing

4. Timeline evidence

Suggested Fix

Short term

Medium term

Workaround

Reproduction

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions