state.db corruption from SIGTERM during launchd shutdown under high load #30636

@mbakkali

Summary

Hermes Agent v0.14.0 on macOS 15.3 (Darwin 25.3.0), single user, single machine. state.db was corrupted three times within a 48-hour window. The logs point to a reproducible cause.

Trigger

From ~/.hermes/logs/errors.log, immediately preceding the first corruption:

2026-05-20 01:13:31  WARNING gateway.run: Shutdown context:
                     signal=SIGTERM under_systemd=yes parent_pid=1
                     parent_name=? loadavg_1m=13.69 parent_cmdline='(unknown)'
2026-05-20 01:14:04  WARNING [20260520_010451_ca0b1a] run_agent:
                     Session DB append_message failed: database disk image is malformed

launchd sent SIGTERM to the gateway while the system was saturated (loadavg_1m=13.69). 33 s later, WAL pages had still not been checkpointed and state.db went malformed. From that point on, every gateway start logged "SQLite session store not available: database disk image is malformed" and silently fell back to JSONL sidecar sessions, which masked the corruption from the user. Because the JSONL fallback contains only CLI sessions, Telegram-sourced sessions no longer appeared in hermes sessions list.
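To illustrate how the silent fallback could be replaced by an explicit check at gateway start, here is a minimal sketch using Python's standard sqlite3 module. The function name check_state_db and its return convention are hypothetical and not taken from the Hermes code base. It only shows that PRAGMA quick_check can turn a malformed file into a visible list of problems.

```python
import os
import sqlite3


def check_state_db(path: str) -> list[str]:
    """Return a list of problems found in the SQLite file at `path`.

    An empty list means PRAGMA quick_check reported "ok".
    Hypothetical helper for illustration, not part of Hermes.
    """
    if not os.path.exists(path):
        # sqlite3.connect() would silently create an empty file otherwise
        return ["file does not exist"]
    conn = sqlite3.connect(path)
    try:
        rows = conn.execute("PRAGMA quick_check").fetchall()
    except sqlite3.DatabaseError as exc:
        # e.g. "database disk image is malformed" or "file is not a database"
        return [str(exc)]
    finally:
        conn.close()
    problems = [row[0] for row in rows]
    return [] if problems == ["ok"] else problems
```

A caller could refuse to start, or print the returned problems prominently, instead of switching to the JSONL sidecar without telling the user.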

Root cause hypothesis

  1. No PRAGMA synchronous=FULL. hermes_state.py:152 (apply_wal_with_fallback) sets journal_mode=WAL but never raises synchronous. Default NORMAL in WAL mode is vulnerable to OS-level process kills mid-write.
  2. No pre-exit checkpoint in gateway shutdown. gateway/run.py closes SessionDB on shutdown (line ~5819) but doesn't force PRAGMA wal_checkpoint(TRUNCATE). Combined with HERMES_SIGTERM_GRACE=1.5s, there's no guarantee the WAL is flushed before launchd escalates to SIGKILL.
  3. FTS5 trigram index amplifies write pressure and is fragile. Reproduced manually during DB recovery: INSERT INTO messages SELECT ... with FTS triggers active returned database disk image is malformed (11) despite both source and target passing PRAGMA integrity_check. Dropping the FTS triggers, inserting, then rebuilding messages_fts_trigram from scratch succeeded. Suggests the trigram index has stability issues under bulk writes.
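Points 1 and 2 above can be sketched as follows. This is an illustration under assumptions, not the actual Hermes code: open_session_db, checkpoint_and_close and install_sigterm_checkpoint are hypothetical names, and the real apply_wal_with_fallback and gateway shutdown path may be structured differently. Only the PRAGMA statements and the standard sqlite3 and signal calls are real.

```python
import signal
import sqlite3


def open_session_db(path: str) -> sqlite3.Connection:
    """Open a connection in WAL mode with synchronous=FULL (hypothetical helper)."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    # With FULL, the WAL is synced on every commit, not only at checkpoints
    conn.execute("PRAGMA synchronous=FULL")
    return conn


def checkpoint_and_close(conn: sqlite3.Connection) -> tuple[int, int, int]:
    """Flush the WAL into the main database file, truncate it, then close.

    Returns SQLite's (busy, wal_frames, checkpointed_frames) result row.
    """
    # Finish any pending transaction first; a checkpoint should not run inside one
    conn.commit()
    row = conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
    conn.close()
    return row


def install_sigterm_checkpoint(conn: sqlite3.Connection) -> None:
    """Checkpoint on SIGTERM before exiting.

    Python runs signal handlers in the main thread, so this assumes
    `conn` was also created in the main thread.
    """
    def _handler(signum, frame):
        checkpoint_and_close(conn)
        raise SystemExit(0)

    signal.signal(signal.SIGTERM, _handler)
```

Whether a checkpoint fits inside the 1.5 s grace window under a load average of 13 is a separate question. The first element of the returned row (busy) is non-zero when the checkpoint could not complete.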

Cascading data loss

Between May 20 and May 22, two automatic recovery attempts silently shrank state.db from 167 MB to 26 MB:

| Date | File | Size | Notes |
| --- | --- | --- | --- |
| May 21 13:55 | state.db.corrupt.20260521_135543.bak | 167 MB | First corruption snapshot |
| May 21 13:55 | state.db.recovered.20260521_135543 | 178 MB | First .recover output, schema OK |
| May 22 11:45 | state.db.bak.20260522_114556 | 167 MB | Backup before re-corruption |
| May 22 11:46 | state.db.repaired | 37 MB | Failed .recover: only lost_and_found table |
| May 22 17:00 | state.db.corrupt_latest | 15 MB | DB shrank: massive data loss |
| May 23 00:05 | state.db (active) | 26 MB | Only 1080 messages remained out of 14k+ |

Manual recovery: 14,104 messages / 241 sessions reconstructed by .recover + chunked re-insert with FTS triggers temporarily disabled, then trigram index rebuilt from scratch. Telegram sessions that had been invisible to hermes sessions list are back.
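The re-insert step can be sketched roughly as below. This is not the exact recovery SQL used here. The function name, the chunk size, and the assumption that messages_fts_trigram is an external-content FTS5 table over messages (so that the FTS5 'rebuild' command applies) are all illustrative, and the real Hermes schema may differ.

```python
import sqlite3


def copy_messages_without_triggers(
    target: sqlite3.Connection,
    source_path: str,
    fts_table: str = "messages_fts_trigram",
    chunk_size: int = 1000,
) -> int:
    """Copy src.messages into target.messages in chunks, with triggers disabled.

    Hypothetical helper; assumes both databases have a compatible `messages`
    table with positive rowids. Returns the number of rows copied.
    """
    # 1. Save and drop every trigger attached to `messages`
    triggers = target.execute(
        "SELECT name, sql FROM sqlite_master "
        "WHERE type = 'trigger' AND tbl_name = 'messages'"
    ).fetchall()
    for name, _sql in triggers:
        target.execute(f'DROP TRIGGER "{name}"')
    target.commit()

    # 2. Chunked copy by rowid range, one transaction per chunk
    target.execute("ATTACH DATABASE ? AS src", (source_path,))
    copied = 0
    last_rowid = 0
    try:
        while True:
            upper = target.execute(
                "SELECT max(rid) FROM (SELECT rowid AS rid FROM src.messages "
                "WHERE rowid > ? ORDER BY rowid LIMIT ?)",
                (last_rowid, chunk_size),
            ).fetchone()[0]
            if upper is None:
                break  # no rows left in the source
            cur = target.execute(
                "INSERT INTO messages SELECT * FROM src.messages "
                "WHERE rowid > ? AND rowid <= ?",
                (last_rowid, upper),
            )
            target.commit()
            copied += cur.rowcount
            last_rowid = upper
    finally:
        target.execute("DETACH DATABASE src")

    # 3. Restore the triggers, then rebuild the FTS index from the content table
    for _name, sql in triggers:
        target.execute(sql)
    target.execute(f"INSERT INTO {fts_table}({fts_table}) VALUES ('rebuild')")
    target.commit()
    return copied
```

Committing per chunk keeps each transaction small, so a failure partway through leaves the already copied rows in place and the copy can be resumed from the last rowid.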

Suggested fixes

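  • In apply_wal_with_fallback, also set PRAGMA synchronous=FULL. (EXTRA is stricter than FULL rather than a lower bar; its additional directory sync applies to DELETE journal mode, so FULL should be sufficient for WAL.)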
  • In the gateway shutdown sequence, force PRAGMA wal_checkpoint(TRUNCATE) on all open SessionDB connections before the SIGTERM grace window expires (and possibly extend the grace window if a checkpoint is in flight)
  • Provide a built-in periodic wal_checkpoint(TRUNCATE) timer (e.g., every 5 min), independent of WAL size. Users currently have to set this up themselves via launchd/cron
  • Add hermes db checkpoint, hermes db backup, hermes db repair CLI commands. Today, recovery requires hand-rolled SQLite incantations
  • When falling back to JSONL sidecar mode in hermes sessions list, make the warning louder and prefix the listing with (SHOWING CLI SESSIONS ONLY — telegram/api_server sessions are not in JSONL fallback). Right now a 1-line warning is easy to miss and creates the illusion that telegram sessions are gone.
  • Investigate FTS5 trigram index robustness under bulk inserts — at minimum document the workaround (disable triggers → insert → rebuild trigram FTS)
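As a rough illustration of the periodic checkpoint and backup suggestions above, the following sketch uses only the standard library. The names start_periodic_checkpoint and backup_db are hypothetical, as is the idea of running the timer in a daemon thread; a real hermes db backup or hermes db checkpoint command might be built quite differently.

```python
import sqlite3
import threading


def backup_db(src_path: str, dest_path: str) -> None:
    """Write a consistent snapshot of src to dest via SQLite's online backup API."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        src.backup(dest)
    finally:
        dest.close()
        src.close()


def start_periodic_checkpoint(path: str, interval_s: float = 300.0) -> threading.Event:
    """Run PRAGMA wal_checkpoint(TRUNCATE) every `interval_s` seconds.

    Returns an Event; call .set() on it to stop the background thread.
    """
    stop = threading.Event()

    def _loop() -> None:
        # The connection is created inside the thread that uses it
        conn = sqlite3.connect(path)
        try:
            # Event.wait returns False on timeout, True once stop is set
            while not stop.wait(interval_s):
                try:
                    conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
                except sqlite3.OperationalError:
                    # Database locked or busy: skip this round, retry next interval
                    pass
        finally:
            conn.close()

    threading.Thread(target=_loop, name="wal-checkpoint", daemon=True).start()
    return stop
```

The default of 300 seconds mirrors the 5 min example in the list. Using a dedicated connection keeps the timer independent of whatever the gateway's own SessionDB connections are doing.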

Environment

  • Hermes Agent v0.14.0 (2026.5.16)
  • macOS 15.3 (Darwin 25.3.0), Apple Silicon
  • Python 3.11.12
  • Default journal_mode=WAL, default synchronous=NORMAL
  • Gateway managed by launchd (ai.hermes.gateway.plist)
  • Heavy workload: ~20 concurrent Hermes-related processes (gateway, 4 TUIs, 5 OpenAI-compatible proxies, cloudflared tunnel, cron jobs)

Related

Happy to share the recovery SQL and the launchd plists I used as a userspace workaround if useful.

Metadata

    Labels

    P1: High, major feature broken, no workaround
    comp/agent: Core agent loop, run_agent.py, prompt builder
    comp/gateway: Gateway runner, session dispatch, delivery
    type/bug: Something isn't working
