Summary
Hermes Agent v0.14.0 on macOS 15.3 (Darwin 25.3.0), single-user single-machine. state.db corrupted three times over a 48h window. Reproducible cause identified from logs.
Trigger
From ~/.hermes/logs/errors.log, immediately preceding the first corruption:
    2026-05-20 01:13:31 WARNING gateway.run: Shutdown context:
        signal=SIGTERM under_systemd=yes parent_pid=1
        parent_name=? loadavg_1m=13.69 parent_cmdline='(unknown)'
    2026-05-20 01:14:04 WARNING [20260520_010451_ca0b1a] run_agent:
        Session DB append_message failed: database disk image is malformed
launchd sent SIGTERM to the gateway while loadavg_1m=13.69 (system saturated). 33 s later, the WAL pages had still not been checkpointed and state.db was malformed. From then on, every gateway start logged "SQLite session store not available: database disk image is malformed" and silently fell back to JSONL sidecar sessions, which masked the corruption from the user. Telegram-sourced sessions were invisible in hermes sessions list because the JSONL fallback only contains CLI sessions.
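To illustrate how the corruption could be detected at startup rather than masked by the fallback, here is a minimal sketch of a health probe. It assumes nothing about Hermes internals: session_db_is_healthy is a hypothetical helper, not an existing Hermes function, and it only uses SQLite's built-in PRAGMA quick_check.

```python
import sqlite3


def session_db_is_healthy(path):
    """Return True only if SQLite's quick_check reports the file as 'ok'.

    Hypothetical helper for illustration; not part of Hermes.
    """
    try:
        conn = sqlite3.connect(path)
        try:
            # quick_check returns a single row ('ok',) for a healthy database,
            # or one row per problem found.
            rows = conn.execute("PRAGMA quick_check").fetchall()
        finally:
            conn.close()
    except sqlite3.DatabaseError:
        # Raised for "database disk image is malformed", "file is not a
        # database", and similar failures.
        return False
    return rows == [("ok",)]
```

A caller could run this before opening the session store and, on False, report the problem prominently instead of falling back without notice.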
Root cause hypothesis
- No PRAGMA synchronous=FULL. hermes_state.py:152 (apply_wal_with_fallback) sets journal_mode=WAL but never raises synchronous. With the default NORMAL in WAL mode, SQLite syncs the WAL only at checkpoints rather than on every commit, which appears to leave it vulnerable to OS-level process kills mid-write.
- No pre-exit checkpoint in gateway shutdown. gateway/run.py closes SessionDB on shutdown (line ~5819) but doesn't force PRAGMA wal_checkpoint(TRUNCATE). Combined with HERMES_SIGTERM_GRACE=1.5s, there's no guarantee the WAL is flushed before launchd escalates to SIGKILL.
- FTS5 trigram index amplifies write pressure and is fragile. Reproduced manually during DB recovery: INSERT INTO messages SELECT ... with FTS triggers active returned "database disk image is malformed (11)" despite both source and target passing PRAGMA integrity_check. Dropping the FTS triggers, inserting, then rebuilding messages_fts_trigram from scratch succeeded. This suggests the trigram index has stability issues under bulk writes.
Cascading data loss
Between May 20 and May 22, two automatic recovery attempts shrank state.db from 167 MB → 26 MB silently:
| Date | File | Size | Notes |
|---|---|---|---|
| May 21 13:55 | state.db.corrupt.20260521_135543.bak | 167 MB | First corruption snapshot |
| May 21 13:55 | state.db.recovered.20260521_135543 | 178 MB | First .recover output, schema OK |
| May 22 11:45 | state.db.bak.20260522_114556 | 167 MB | Backup before re-corruption |
| May 22 11:46 | state.db.repaired | 37 MB | Failed .recover: only lost_and_found table |
| May 22 17:00 | state.db.corrupt_latest | 15 MB | DB shrank: massive data loss |
| May 23 00:05 | state.db (active) | 26 MB | Only 1080 messages remained out of 14k+ |
Manual recovery: 14,104 messages / 241 sessions reconstructed by .recover + chunked re-insert with FTS triggers temporarily disabled, then trigram index rebuilt from scratch. Telegram sessions that had been invisible to hermes sessions list are back.
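For reference, the re-insert step can be sketched roughly as below. This is not the recovery SQL I actually ran, and the schema is an assumption for illustration: a messages(id, content) table and an external-content FTS5 table (content='messages') kept in sync by triggers. The function name and its parameters are hypothetical.

```python
import sqlite3


def bulk_insert_with_fts_rebuild(conn, rows, table="messages",
                                 fts_table="messages_fts_trigram",
                                 chunk_size=500):
    """Insert rows in chunks with the table's triggers dropped, then rebuild FTS.

    Assumes `table` has columns (id, content) and `fts_table` is an
    external-content FTS5 table over it. Table names must be trusted
    identifiers, since they are interpolated into the SQL.
    """
    # Save the CREATE TRIGGER statements so they can be restored afterwards.
    triggers = conn.execute(
        "SELECT name, sql FROM sqlite_master "
        "WHERE type = 'trigger' AND tbl_name = ?",
        (table,),
    ).fetchall()
    for name, _sql in triggers:
        conn.execute(f'DROP TRIGGER "{name}"')

    # Chunked insert: one commit per chunk keeps each transaction small.
    for start in range(0, len(rows), chunk_size):
        conn.executemany(
            f'INSERT INTO "{table}" (id, content) VALUES (?, ?)',
            rows[start:start + chunk_size],
        )
        conn.commit()

    # Restore the triggers exactly as they were defined.
    for _name, sql in triggers:
        conn.execute(sql)

    # FTS5 'rebuild' command: discard the index and recreate it from the
    # content table.
    conn.execute(f"INSERT INTO {fts_table}({fts_table}) VALUES ('rebuild')")
    conn.commit()
```

The 'rebuild' command only re-reads the base table when the FTS5 table is declared with content=..., which is why the sketch assumes an external-content table.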
Suggested fixes
- In apply_wal_with_fallback, also set PRAGMA synchronous=FULL (or the stricter =EXTRA).
- In the gateway shutdown sequence, force PRAGMA wal_checkpoint(TRUNCATE) on all open SessionDB connections before the SIGTERM grace window expires (and possibly extend the grace window if a checkpoint is in flight).
- Surface a periodic wal_checkpoint(TRUNCATE) timer (e.g., every 5 min) independent of WAL size. Users currently have to set this up themselves via launchd/cron.
- Add hermes db checkpoint, hermes db backup, and hermes db repair CLI commands. Today, recovery requires hand-rolled SQLite incantations.
- When falling back to JSONL sidecar mode in hermes sessions list, make the warning louder and prefix the listing with (SHOWING CLI SESSIONS ONLY: telegram/api_server sessions are not in JSONL fallback). Right now a 1-line warning is easy to miss and creates the illusion that telegram sessions are gone.
- Investigate FTS5 trigram index robustness under bulk inserts; at minimum, document the workaround (disable triggers → insert → rebuild trigram FTS).
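The first two fixes might look something like the sketch below. It is written against Python's sqlite3 module only; open_with_full_sync and checkpoint_and_close are hypothetical names, not the existing Hermes functions, and how they would be wired into apply_wal_with_fallback or the gateway shutdown path is left open.

```python
import sqlite3


def open_with_full_sync(path):
    """Open a connection in WAL mode with synchronous raised to FULL.

    Hypothetical helper for illustration; not the Hermes implementation.
    """
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    # FULL adds a sync of the WAL on every commit, instead of only at
    # checkpoints as with NORMAL.
    conn.execute("PRAGMA synchronous=FULL")
    return conn


def checkpoint_and_close(conn):
    """Flush the WAL into the main database file, then close.

    Returns True if the checkpoint ran to completion, False if it was
    blocked (for example by another connection holding a read lock).
    """
    # The pragma returns one row: (busy, wal_frames, checkpointed_frames).
    busy, _wal_frames, _checkpointed = conn.execute(
        "PRAGMA wal_checkpoint(TRUNCATE)"
    ).fetchone()
    conn.close()
    return busy == 0
```

In a shutdown handler, a False return would be the signal to log loudly or to ask for a longer grace window, since the WAL still holds frames that have not reached the main database file.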
Environment
- Hermes Agent v0.14.0 (2026.5.16)
- macOS 15.3 (Darwin 25.3.0), Apple Silicon
- Python 3.11.12
- Default journal_mode=WAL, default synchronous=NORMAL
- Gateway managed by launchd (ai.hermes.gateway.plist)
- Heavy workload: ~20 concurrent Hermes-related processes (gateway, 4 TUIs, 5 OpenAI-compatible proxies, cloudflared tunnel, cron jobs)
Related
Happy to share the recovery SQL and the launchd plists I used as a userspace workaround if useful.