fix(state): add PRAGMA synchronous=FULL + TRUNCATE checkpoint to prevent WAL corruption by SimoKiihamaki · Pull Request #30654 · NousResearch/hermes-agent

SimoKiihamaki · 2026-05-22T23:42:23Z

Summary

Fixes #30636 — state.db corruption from SIGTERM during launchd shutdown under high load.

Root Cause

The gateway sets journal_mode=WAL but never raises synchronous from its default (NORMAL). In WAL mode, NORMAL does not sync the WAL on every commit; SQLite syncs it only around checkpoints. If the process is killed before that sync happens, unflushed WAL pages can corrupt the database.

Combined with launchd sending SIGKILL after a 1.5s grace period, this caused 3 database corruptions in 48h on a busy gateway.
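
To confirm which settings a given connection actually runs with, the values can be read back through the same PRAGMAs. The sketch below is a minimal diagnostic, not code from this PR; `describe_durability` and `SYNC_NAMES` are hypothetical names. Note that `synchronous` is a per-connection setting and is not stored in the database file, and that its default depends on how the SQLite library was compiled, so the reported value may differ between hosts.

```python
import os
import sqlite3
import tempfile

# Integer codes returned by "PRAGMA synchronous", per the SQLite documentation.
SYNC_NAMES = {0: "OFF", 1: "NORMAL", 2: "FULL", 3: "EXTRA"}


def describe_durability(conn: sqlite3.Connection) -> dict:
    """Report the journal mode and synchronous level this connection uses."""
    journal_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
    sync_code = conn.execute("PRAGMA synchronous").fetchone()[0]
    return {"journal_mode": journal_mode, "synchronous": SYNC_NAMES[sync_code]}


def demo() -> dict:
    """Open a throwaway WAL database and report its effective settings."""
    with tempfile.TemporaryDirectory() as tmp:
        conn = sqlite3.connect(os.path.join(tmp, "state.db"))
        try:
            conn.execute("PRAGMA journal_mode=WAL")
            # No explicit "PRAGMA synchronous" here: this shows the build default.
            return describe_durability(conn)
        finally:
            conn.close()
```

Calling `demo()` on the affected host would show whether the default really is NORMAL there before the fix is applied.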

Changes

apply_wal_with_fallback(): Set PRAGMA synchronous=FULL after enabling WAL mode. This ensures WAL frames are flushed to disk before the write transaction returns, closing the corruption window.
SessionDB.close(): Upgrade from PASSIVE to TRUNCATE WAL checkpoint on shutdown. TRUNCATE blocks until all committed WAL frames are written to the database file and then truncates the WAL to zero bytes. Falls back to PASSIVE if TRUNCATE fails (e.g. active readers).
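
The difference between the two checkpoint modes can be observed on the `-wal` file itself. The following is a small standalone sketch under stated assumptions (a throwaway database, a single connection, no concurrent readers); `wal_size_after` is a hypothetical helper and not part of the gateway code.

```python
import os
import sqlite3
import tempfile


def wal_size_after(mode: str) -> int:
    """Write some rows in WAL mode, checkpoint with `mode`, return the -wal size in bytes."""
    if mode not in {"PASSIVE", "TRUNCATE"}:
        raise ValueError(f"unsupported checkpoint mode: {mode}")
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")
        # isolation_level=None puts the connection in autocommit mode,
        # so every INSERT below is its own committed transaction.
        conn = sqlite3.connect(path, isolation_level=None)
        try:
            conn.execute("PRAGMA journal_mode=WAL")
            conn.execute("PRAGMA synchronous=FULL")
            conn.execute("CREATE TABLE t (x INTEGER)")
            conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(100)])
            conn.execute(f"PRAGMA wal_checkpoint({mode})").fetchone()
            # Measure while the connection is still open; SQLite may delete
            # the -wal file when the last connection closes.
            return os.path.getsize(path + "-wal")
        finally:
            conn.close()
```

With no other connections attached, `wal_size_after("TRUNCATE")` returns 0, while `wal_size_after("PASSIVE")` returns a non-zero size because PASSIVE copies frames into the main file but leaves the WAL file in place for reuse.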

Performance Impact

synchronous=FULL adds ~0.1-0.5ms per write transaction on SSD. For typical gateway patterns (< 100 writes/sec), this is negligible. The trade-off is worthwhile to prevent database corruption.
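
The per-transaction cost depends heavily on the storage device and filesystem, so it is worth measuring on the target host. The sketch below is one rough way to do that; `mean_commit_ms` is a hypothetical helper, and it writes to the system temp directory, which may be memory-backed on some systems and would then understate the cost of syncing.

```python
import os
import sqlite3
import tempfile
import time


def mean_commit_ms(synchronous: str, n: int = 200) -> float:
    """Return the mean wall-clock time in ms of n single-row commits in WAL mode."""
    if synchronous not in {"NORMAL", "FULL"}:
        raise ValueError(f"unsupported synchronous level: {synchronous}")
    with tempfile.TemporaryDirectory() as tmp:
        # Autocommit mode: each INSERT is committed as its own transaction.
        conn = sqlite3.connect(os.path.join(tmp, "bench.db"), isolation_level=None)
        try:
            conn.execute("PRAGMA journal_mode=WAL")
            conn.execute(f"PRAGMA synchronous={synchronous}")
            conn.execute("CREATE TABLE t (x INTEGER)")
            start = time.perf_counter()
            for i in range(n):
                conn.execute("INSERT INTO t VALUES (?)", (i,))
            elapsed = time.perf_counter() - start
        finally:
            conn.close()
    return elapsed * 1000 / n


if __name__ == "__main__":
    for level in ("NORMAL", "FULL"):
        print(f"synchronous={level}: {mean_commit_ms(level):.3f} ms per commit")
```

Pointing the database path at the same volume that holds state.db would give a more representative number than the temp directory.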

Testing

215 existing test_hermes_state.py tests pass
4 kanban_db WAL fallback tests pass
No new regressions

…ose (fixes NousResearch#30636)

WAL mode with default synchronous=NORMAL is vulnerable to database corruption when the OS kills the process mid-write (SIGTERM/SIGKILL under high loadavg). This happened 3 times in 48h on a macOS launchd gateway with loadavg_1m=13.69.

Changes:
- apply_wal_with_fallback(): set PRAGMA synchronous=FULL after WAL mode to ensure WAL frames are flushed to disk before write commits
- SessionDB.close(): upgrade from PASSIVE to TRUNCATE checkpoint to fully flush and truncate the WAL on clean shutdown
- Fallback to PASSIVE if TRUNCATE fails (e.g. active readers)

The performance cost of synchronous=FULL is minimal for typical gateway write patterns (< 100 writes/sec).

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR strengthens SQLite durability behavior when using WAL mode to mitigate corruption risks under forced process termination by adjusting synchronous settings and performing a stronger checkpoint on shutdown.

Changes:

Set PRAGMA synchronous=FULL when enabling WAL mode.
Switch shutdown checkpoint from wal_checkpoint(PASSIVE) to wal_checkpoint(TRUNCATE) with a PASSIVE fallback.


        conn.execute("PRAGMA journal_mode=WAL")
+        # synchronous=FULL ensures WAL frames are flushed to disk before
+        # the write transaction returns.  The default (NORMAL) in WAL mode
+        # is fast but vulnerable to OS-level process kills (SIGTERM/SIGKILL)
+        # mid-write, which can corrupt the database on busy systems — see
+        # issue #30636.  The performance cost is minimal for typical gateway
+        # write patterns (< 100 writes/sec) and eliminates the corruption
+        # window that launchd's aggressive SIGKILL creates.
+        conn.execute("PRAGMA synchronous=FULL")


+        Attempts a TRUNCATE WAL checkpoint first so that exiting processes
+        flush the WAL back into the main DB file.  TRUNCATE is stronger than
+        PASSIVE — it blocks until all committed WAL frames are written to the
+        database file and then truncates the WAL to zero bytes.  This prevents
+        the corruption scenario described in issue #30636 where SIGTERM under
+        high load leaves uncheckpointed WAL pages behind.


+                    self._conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
                except Exception:
-                    pass
+                    # Fallback to PASSIVE if TRUNCATE fails (e.g. active readers)
+                    try:
+                        self._conn.execute("PRAGMA wal_checkpoint(PASSIVE)")
+                    except Exception:
+                        pass
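
One detail worth checking in the fallback above: per the SQLite documentation for `PRAGMA wal_checkpoint`, a TRUNCATE checkpoint that is blocked by another connection is reported through the first column of the result row (busy = 1) rather than as an error, so an `except`-only fallback may never reach the PASSIVE branch when readers are active. The sketch below shows one way to handle both cases; `checkpoint_on_close` is a hypothetical standalone function, not the actual `SessionDB.close()` implementation.

```python
import sqlite3


def checkpoint_on_close(conn: sqlite3.Connection) -> str:
    """Try a TRUNCATE checkpoint; fall back to PASSIVE if it raises or reports busy.

    Returns "TRUNCATE" if the truncating checkpoint completed, otherwise
    "PASSIVE" to indicate that the fallback was attempted.
    """
    try:
        # Result row is (busy, log_frames, checkpointed_frames).
        busy, _log, _ckpt = conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchone()
        if busy == 0:
            return "TRUNCATE"
    except sqlite3.Error:
        pass  # fall through to the PASSIVE attempt
    try:
        conn.execute("PRAGMA wal_checkpoint(PASSIVE)").fetchone()
    except sqlite3.Error:
        pass  # best effort on shutdown, matching the diff's behavior
    return "PASSIVE"
```

Because TRUNCATE invokes the busy handler, a connection opened with a long `timeout` may wait that long before reporting busy; a shutdown path with a short grace period may want a small busy timeout before calling this.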


Copilot AI review requested due to automatic review settings May 22, 2026 23:42

Copilot AI reviewed May 22, 2026


alt-glitch added labels May 23, 2026: type/bug (Something isn't working), P1 (High: major feature broken, no workaround), comp/agent (Core agent loop, run_agent.py, prompt builder), comp/gateway (Gateway runner, session dispatch, delivery)

deepujain mentioned this pull request May 23, 2026

fix(state): retry transient SQLite WAL setup failures (Fixes #30576) #30700

Open

briandevans mentioned this pull request May 23, 2026

fix(state): wrap DELETE journal_mode fallback in try/except to survive APFS double-failure #30823

Closed

19 tasks

This was referenced May 24, 2026

fix(kanban): harden SQLite against torn-write corruption (secure_delete + cell_size_check + synchronous=FULL) #31208

Closed

fix(state): never silently downgrade WAL to DELETE on transient EIO #31294

Closed

Tranquil-Flow mentioned this pull request May 24, 2026

fix(state): proactively skip WAL journal mode on BTRFS filesystems (#30846) #31586

Open

kylekahraman mentioned this pull request May 27, 2026

state.db corruption from SIGTERM during launchd shutdown under high load #30636

Open

SimoKiihamaki wants to merge 1 commit into NousResearch:main from SimoKiihamaki:fix/30636-synchronous-full-wal


3 participants
