state.db corruption from SIGTERM during launchd shutdown under high load

## Summary

Hermes Agent v0.14.0 on macOS 15.3 (Darwin 25.3.0), single-user single-machine. `state.db` corrupted three times over a 48h window. Reproducible cause identified from logs.

## Trigger

From `~/.hermes/logs/errors.log`, immediately preceding the first corruption:

```
2026-05-20 01:13:31  WARNING gateway.run: Shutdown context:
                     signal=SIGTERM under_systemd=yes parent_pid=1
                     parent_name=? loadavg_1m=13.69 parent_cmdline='(unknown)'
2026-05-20 01:14:04  WARNING [20260520_010451_ca0b1a] run_agent:
                     Session DB append_message failed: database disk image is malformed
```

launchd sent SIGTERM to the gateway while `loadavg_1m=13.69` (system saturated). 33 s later, WAL pages had not been checkpointed and `state.db` went malformed. From that point on, every gateway start logged `SQLite session store not available: database disk image is malformed` and silently fell back to JSONL sidecar sessions — masking the corruption from the user. Telegram-sourced sessions were invisible from `hermes sessions list` because the JSONL fallback only contains CLI sessions.
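The silent fallback described above could be replaced by an explicit health probe at gateway start. The following is a minimal sketch using only Python's standard `sqlite3` module; `session_db_health` is a hypothetical name, not an existing Hermes function, and how the gateway would surface the result is left open.

```python
import sqlite3


def session_db_health(path):
    """Return (healthy, details) for the SQLite file at `path`.

    Hypothetical helper: runs PRAGMA integrity_check and treats any
    sqlite3.DatabaseError (e.g. "database disk image is malformed")
    as an unhealthy database instead of letting the caller fall back quietly.
    """
    try:
        conn = sqlite3.connect(path)
        try:
            # integrity_check yields a single row "ok" when nothing is wrong,
            # otherwise one row per problem found.
            rows = [row[0] for row in conn.execute("PRAGMA integrity_check")]
        finally:
            conn.close()
    except sqlite3.DatabaseError as exc:
        return False, [str(exc)]
    return rows == ["ok"], rows
```

A caller could log `details` at ERROR level and print a banner before switching to the JSONL sidecar, so the corruption is visible on the first failed start.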

## Root cause hypothesis

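1. **No `PRAGMA synchronous=FULL`**. `hermes_state.py:152` (`apply_wal_with_fallback`) sets `journal_mode=WAL` but never raises `synchronous`. SQLite documents `NORMAL` in WAL mode as consistent but not fully durable: recent commits can roll back after a power loss or OS crash, while a plain process kill should not by itself corrupt the file. This is therefore more likely a contributing factor than the sole cause, but `FULL` syncs the WAL on every commit and narrows the window.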
2. **No pre-exit checkpoint in gateway shutdown**. `gateway/run.py` closes `SessionDB` on shutdown (line ~5819) but doesn't force `PRAGMA wal_checkpoint(TRUNCATE)`. Combined with `HERMES_SIGTERM_GRACE=1.5s`, there's no guarantee the WAL is flushed before launchd escalates to SIGKILL.
3. **FTS5 trigram index amplifies write pressure and is fragile**. Reproduced manually during DB recovery: `INSERT INTO messages SELECT ...` with FTS triggers active returned `database disk image is malformed (11)` despite both source and target passing `PRAGMA integrity_check`. Dropping the FTS triggers, inserting, then rebuilding `messages_fts_trigram` from scratch succeeded. Suggests the trigram index has stability issues under bulk writes.
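Hypotheses 1 and 2 depend on which pragmas are actually in effect on the live connection, which is easy to verify. A minimal sketch, assuming nothing about how `apply_wal_with_fallback` is structured; `open_session_db` and `durability_settings` are hypothetical names used only for illustration.

```python
import sqlite3


def open_session_db(path):
    """Open a connection in WAL mode with synchronous=FULL (hypothetical helper)."""
    conn = sqlite3.connect(path)
    # The pragma returns the journal mode actually in effect; it may not be
    # "wal" if the filesystem cannot support WAL.
    conn.execute("PRAGMA journal_mode=WAL").fetchone()
    # synchronous is per-connection and must be set on every new connection.
    conn.execute("PRAGMA synchronous=FULL")
    return conn


def durability_settings(conn):
    """Return (journal_mode, synchronous) as SQLite reports them.

    synchronous is reported as an integer: 0=OFF, 1=NORMAL, 2=FULL, 3=EXTRA.
    """
    mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
    sync = conn.execute("PRAGMA synchronous").fetchone()[0]
    return mode, sync
```

Logging the result of `durability_settings` once at gateway start would confirm whether the connection really runs with `synchronous=NORMAL`, as assumed in this report.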

## Cascading data loss

Between May 20 and May 22, two automatic recovery attempts silently shrank `state.db` from 167 MB to 26 MB:

| Date | File | Size | Notes |
|---|---|---|---|
| May 21 13:55 | `state.db.corrupt.20260521_135543.bak` | 167 MB | First corruption snapshot |
| May 21 13:55 | `state.db.recovered.20260521_135543` | 178 MB | First `.recover` output, schema OK |
| May 22 11:45 | `state.db.bak.20260522_114556` | 167 MB | Backup before re-corruption |
| May 22 11:46 | `state.db.repaired` | 37 MB | **Failed `.recover`: only `lost_and_found` table** |
| May 22 17:00 | `state.db.corrupt_latest` | 15 MB | **DB shrank — massive data loss** |
| May 23 00:05 | `state.db` (active) | 26 MB | Only 1080 messages remained out of 14k+ |

Manual recovery: 14,104 messages / 241 sessions reconstructed by `.recover` + chunked re-insert with FTS triggers temporarily disabled, then trigram index rebuilt from scratch. Telegram sessions that had been invisible to `hermes sessions list` are back.
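The re-insert workaround can be sketched roughly as follows. This is not the exact recovery SQL used here: the schema, the trigger name `messages_fts_insert`, and the function names are assumptions for illustration, and the real Hermes tables have more columns and triggers. It assumes `messages_fts_trigram` is an external-content FTS5 table.

```python
import sqlite3

# Hypothetical schema: the real Hermes tables, columns, and trigger names may differ.
# The trigram tokenizer needs SQLite 3.34+; fall back so the sketch still runs elsewhere.
TOKENIZER = "trigram" if sqlite3.sqlite_version_info >= (3, 34, 0) else "unicode61"

SCHEMA = f"""
CREATE TABLE IF NOT EXISTS messages (id INTEGER PRIMARY KEY, content TEXT);
CREATE VIRTUAL TABLE IF NOT EXISTS messages_fts_trigram
    USING fts5(content, content='messages', content_rowid='id', tokenize='{TOKENIZER}');
"""

INSERT_TRIGGER = """
CREATE TRIGGER IF NOT EXISTS messages_fts_insert AFTER INSERT ON messages BEGIN
    INSERT INTO messages_fts_trigram(rowid, content) VALUES (new.id, new.content);
END;
"""


def restore_messages(conn, rows, chunk_size=500):
    """Re-insert recovered rows with the FTS trigger disabled, then rebuild the index."""
    rows = list(rows)
    # 1. Drop the trigger so bulk inserts do not touch the FTS index.
    conn.executescript("DROP TRIGGER IF EXISTS messages_fts_insert;")
    # 2. Insert in chunks, one transaction per chunk.
    for start in range(0, len(rows), chunk_size):
        with conn:
            conn.executemany(
                "INSERT INTO messages(id, content) VALUES (?, ?)",
                rows[start:start + chunk_size],
            )
    # 3. Rebuild the external-content FTS index from the messages table.
    with conn:
        conn.execute(
            "INSERT INTO messages_fts_trigram(messages_fts_trigram) VALUES ('rebuild')"
        )
    # 4. Recreate the trigger for normal operation.
    conn.executescript(INSERT_TRIGGER)


def count_matches(conn, term):
    """Count indexed messages matching `term` in the FTS table."""
    sql = "SELECT count(*) FROM messages_fts_trigram WHERE messages_fts_trigram MATCH ?"
    return conn.execute(sql, (term,)).fetchone()[0]
```

Committing per chunk keeps each transaction small, so a failure partway through loses at most one chunk and the offending row range is easy to locate.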

## Suggested fixes

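- In `apply_wal_with_fallback`, also set `PRAGMA synchronous=FULL`. (`EXTRA` is stricter than `FULL`, not weaker, but its additional directory sync applies to rollback-journal mode, so `FULL` should be sufficient under WAL.)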
- In gateway shutdown sequence, force `PRAGMA wal_checkpoint(TRUNCATE)` on all open `SessionDB` connections before exiting the SIGTERM grace window (and possibly extend the grace window if a checkpoint is in flight)
- Surface a periodic `wal_checkpoint(TRUNCATE)` timer (e.g., every 5 min) independent of WAL size — users currently have to set this up themselves via launchd/cron
- Add `hermes db checkpoint`, `hermes db backup`, `hermes db repair` CLI commands. Today, recovery requires hand-rolled SQLite incantations
- When falling back to JSONL sidecar mode in `hermes sessions list`, **make the warning louder** and prefix the listing with `(SHOWING CLI SESSIONS ONLY — telegram/api_server sessions are not in JSONL fallback)`. Right now a 1-line warning is easy to miss and creates the illusion that telegram sessions are gone.
- Investigate FTS5 trigram index robustness under bulk inserts — at minimum document the workaround (disable triggers → insert → rebuild trigram FTS)
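The checkpoint-related suggestions could share one small helper. A minimal sketch, assuming the gateway can enumerate its open `sqlite3` connections at shutdown; `checkpoint_truncate` and `close_session_dbs` are hypothetical names, and wiring them into the actual SIGTERM path in `gateway/run.py` is not shown.

```python
import sqlite3


def checkpoint_truncate(conn):
    """Run PRAGMA wal_checkpoint(TRUNCATE) and return True if it completed.

    The pragma returns one row: (busy, wal_frames, checkpointed_frames).
    busy == 1 means another reader or writer blocked the checkpoint.
    """
    busy, _wal_frames, _checkpointed = conn.execute(
        "PRAGMA wal_checkpoint(TRUNCATE)"
    ).fetchone()
    return busy == 0


def close_session_dbs(connections):
    """Hypothetical shutdown hook: commit, checkpoint, and close each connection."""
    results = []
    for conn in connections:
        conn.commit()  # flush any pending transaction first
        results.append(checkpoint_truncate(conn))
        conn.close()
    return all(results)
```

The same `checkpoint_truncate` call could back a periodic timer or a `hermes db checkpoint` command. A `False` return is worth logging, since it means the WAL was still not fully flushed when the process was about to exit.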

## Environment

- Hermes Agent v0.14.0 (2026.5.16)
- macOS 15.3 (Darwin 25.3.0), Apple Silicon
- Python 3.11.12
- Default `journal_mode=WAL`, default `synchronous=NORMAL`
- Gateway managed by launchd (`ai.hermes.gateway.plist`)
- Heavy workload: ~20 concurrent Hermes-related processes (gateway, 4 TUIs, 5 OpenAI-compatible proxies, cloudflared tunnel, cron jobs)

## Related

- #5563 — broader UX report mentions state.db corruption (Issue 2/4) but no specific trigger identified
- #30445 — Kanban DB corruption from multi-gateway concurrent access
- #29610 — sqlite/WAL fd leaks in Kanban dispatcher

Happy to share the recovery SQL and the launchd plists I used as a userspace workaround if useful.
