fix(state): add PRAGMA synchronous=FULL + TRUNCATE checkpoint to prevent WAL corruption #30654

Open
SimoKiihamaki wants to merge 1 commit into NousResearch:main from SimoKiihamaki:fix/30636-synchronous-full-wal
@SimoKiihamaki

Summary

Fixes #30636 — state.db corruption from SIGTERM during launchd shutdown under high load.

Root Cause

The gateway sets journal_mode=WAL but never raises synchronous from its default (NORMAL). In WAL mode, NORMAL does not sync the WAL on every commit; SQLite syncs it only around checkpoints. If the process is killed before that sync happens, unflushed WAL pages can corrupt the database.

Combined with launchd SIGKILL (1.5s grace), this caused 3 database corruptions in 48h on a busy gateway.

Changes

  1. apply_wal_with_fallback(): Set PRAGMA synchronous=FULL after enabling WAL mode. This ensures WAL frames are flushed to disk before the write transaction returns, closing the corruption window.

  2. SessionDB.close(): Upgrade from PASSIVE to TRUNCATE WAL checkpoint on shutdown. TRUNCATE blocks until all committed WAL frames are written to the database file and then truncates the WAL to zero bytes. Falls back to PASSIVE if TRUNCATE fails (e.g. active readers).
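The first change can be sketched as follows. This is a minimal illustration, not the code in hermes_state.py: `open_wal_connection` and `read_settings` are hypothetical names standing in for `apply_wal_with_fallback()`. It also shows how to read the settings back, which is useful because `journal_mode=WAL` persists in the database file while `synchronous` applies only to the connection that set it.

```python
import sqlite3


def open_wal_connection(path: str) -> sqlite3.Connection:
    """Hypothetical stand-in for apply_wal_with_fallback()."""
    conn = sqlite3.connect(path)
    # The pragma returns the journal mode actually in effect; it is not
    # "wal" when WAL is unavailable (for example, in-memory databases).
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    if mode.lower() == "wal":
        # synchronous is a per-connection setting, so every new
        # connection has to set it again.
        conn.execute("PRAGMA synchronous=FULL")
    return conn


def read_settings(conn: sqlite3.Connection) -> tuple:
    """Return (journal_mode, synchronous) as SQLite reports them."""
    journal_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
    # synchronous is reported as an integer: 0=OFF, 1=NORMAL, 2=FULL, 3=EXTRA.
    synchronous = conn.execute("PRAGMA synchronous").fetchone()[0]
    return journal_mode, synchronous
```

Calling `read_settings()` on a connection opened this way is a cheap check that both pragmas took effect on the SQLite build in use.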

Performance Impact

synchronous=FULL adds ~0.1-0.5ms per write transaction on SSD. For typical gateway patterns (< 100 writes/sec), this is negligible. The trade-off is worthwhile to prevent database corruption.
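The per-transaction cost depends heavily on the disk and filesystem, so it may be worth measuring on the target host. The helper below is a rough sketch under that assumption; `mean_commit_ms` is a hypothetical name, and it times single-row commits against a throwaway database rather than the gateway's real schema.

```python
import os
import sqlite3
import tempfile
import time


def mean_commit_ms(synchronous: str, n_commits: int = 200) -> float:
    """Average wall-clock milliseconds per single-row commit in WAL mode."""
    if synchronous not in {"NORMAL", "FULL"}:
        raise ValueError(f"unsupported synchronous level: {synchronous!r}")
    with tempfile.TemporaryDirectory() as tmp:
        conn = sqlite3.connect(os.path.join(tmp, "bench.db"))
        try:
            conn.execute("PRAGMA journal_mode=WAL")
            # Safe to interpolate: the value was validated above.
            conn.execute(f"PRAGMA synchronous={synchronous}")
            conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, payload TEXT)")
            conn.commit()
            start = time.perf_counter()
            for i in range(n_commits):
                conn.execute("INSERT INTO t (payload) VALUES (?)", (f"row-{i}",))
                conn.commit()  # one transaction per row, as a worst case
            elapsed = time.perf_counter() - start
        finally:
            conn.close()
    return elapsed * 1000 / n_commits
```

Comparing `mean_commit_ms("NORMAL")` with `mean_commit_ms("FULL")` on the gateway's own disk gives a local estimate of the overhead.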

Testing

  • 215 existing test_hermes_state.py tests pass
  • 4 kanban_db WAL fallback tests pass
  • No new regressions

…ose (fixes NousResearch#30636)

WAL mode with default synchronous=NORMAL is vulnerable to database
corruption when the OS kills the process mid-write (SIGTERM/SIGKILL
under high loadavg). This happened 3 times in 48h on a macOS launchd
gateway with loadavg_1m=13.69.

Changes:
- apply_wal_with_fallback(): set PRAGMA synchronous=FULL after WAL
  mode to ensure WAL frames are flushed to disk before write commits
- SessionDB.close(): upgrade from PASSIVE to TRUNCATE checkpoint to
  fully flush and truncate the WAL on clean shutdown
- Fallback to PASSIVE if TRUNCATE fails (e.g. active readers)

The performance cost of synchronous=FULL is minimal for typical
gateway write patterns (< 100 writes/sec).
Copilot AI review requested due to automatic review settings May 22, 2026 23:42

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR strengthens SQLite durability in WAL mode to reduce the risk of corruption when the process is forcibly terminated. It raises the synchronous setting and performs a stronger checkpoint on shutdown.

Changes:

  • Set PRAGMA synchronous=FULL when enabling WAL mode.
  • Switch shutdown checkpoint from wal_checkpoint(PASSIVE) to wal_checkpoint(TRUNCATE) with a PASSIVE fallback.

Comment thread hermes_state.py
Comment on lines 152 to +160
conn.execute("PRAGMA journal_mode=WAL")
# synchronous=FULL ensures WAL frames are flushed to disk before
# the write transaction returns. The default (NORMAL) in WAL mode
# is fast but vulnerable to OS-level process kills (SIGTERM/SIGKILL)
# mid-write, which can corrupt the database on busy systems — see
# issue #30636. The performance cost is minimal for typical gateway
# write patterns (< 100 writes/sec) and eliminates the corruption
# window that launchd's aggressive SIGKILL creates.
conn.execute("PRAGMA synchronous=FULL")
Comment thread hermes_state.py
Comment on lines +460 to +465
Attempts a TRUNCATE WAL checkpoint first so that exiting processes
flush the WAL back into the main DB file. TRUNCATE is stronger than
PASSIVE — it blocks until all committed WAL frames are written to the
database file and then truncates the WAL to zero bytes. This prevents
the corruption scenario described in issue #30636 where SIGTERM under
high load leaves uncheckpointed WAL pages behind.
Comment thread hermes_state.py
Comment on lines +470 to +476
self._conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
except Exception:
pass
# Fallback to PASSIVE if TRUNCATE fails (e.g. active readers)
try:
self._conn.execute("PRAGMA wal_checkpoint(PASSIVE)")
except Exception:
pass
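One point worth checking in the fallback: per the SQLite documentation, `PRAGMA wal_checkpoint` returns a row `(busy, log, checkpointed)`, and a TRUNCATE checkpoint that is blocked by readers is reported through `busy = 1` rather than through an error. A fallback keyed only on exceptions may therefore never run in that case. The sketch below inspects the result row instead; `checkpoint_on_close` is a hypothetical helper, not the code in this PR.

```python
import sqlite3


def checkpoint_on_close(conn: sqlite3.Connection) -> str:
    """Try TRUNCATE, then PASSIVE; return the mode that completed, or 'NONE'.

    Assumes the caller has no open transaction on `conn`.
    """
    for mode in ("TRUNCATE", "PASSIVE"):
        try:
            row = conn.execute(f"PRAGMA wal_checkpoint({mode})").fetchone()
        except sqlite3.Error:
            # Genuine errors (for example, a locked connection) fall through
            # to the next, weaker mode.
            continue
        # row is (busy, wal_frames, checkpointed_frames); busy == 0 means
        # the checkpoint was not blocked. A database that is not in WAL
        # mode also reports busy == 0.
        if row is not None and row[0] == 0:
            return mode
    return "NONE"
```

Returning the mode makes it easy to log when shutdown had to settle for PASSIVE, which leaves the WAL file in place.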
Labels

  • comp/agent: Core agent loop, run_agent.py, prompt builder
  • comp/gateway: Gateway runner, session dispatch, delivery
  • P1: High, major feature broken, no workaround
  • type/bug: Something isn't working

Development

Successfully merging this pull request may close these issues.

state.db corruption from SIGTERM during launchd shutdown under high load

3 participants