fix(state): add PRAGMA synchronous=FULL + TRUNCATE checkpoint to prevent WAL corruption #30654
Open
SimoKiihamaki wants to merge 1 commit into
…ose (fixes NousResearch#30636)

WAL mode with default synchronous=NORMAL is vulnerable to database corruption when the OS kills the process mid-write (SIGTERM/SIGKILL under high loadavg). This happened 3 times in 48h on a macOS launchd gateway with loadavg_1m=13.69.

Changes:
- apply_wal_with_fallback(): set PRAGMA synchronous=FULL after WAL mode to ensure WAL frames are flushed to disk before write commits
- SessionDB.close(): upgrade from PASSIVE to TRUNCATE checkpoint to fully flush and truncate the WAL on clean shutdown
- Fallback to PASSIVE if TRUNCATE fails (e.g. active readers)

The performance cost of synchronous=FULL is minimal for typical gateway write patterns (< 100 writes/sec).
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR strengthens SQLite durability behavior when using WAL mode to mitigate corruption risks under forced process termination by adjusting synchronous settings and performing a stronger checkpoint on shutdown.
Changes:
- Set `PRAGMA synchronous=FULL` when enabling WAL mode.
- Switch shutdown checkpoint from `wal_checkpoint(PASSIVE)` to `wal_checkpoint(TRUNCATE)` with a PASSIVE fallback.
Comment on lines 152 to +160

    conn.execute("PRAGMA journal_mode=WAL")
    # synchronous=FULL ensures WAL frames are flushed to disk before
    # the write transaction returns. The default (NORMAL) in WAL mode
    # is fast but vulnerable to OS-level process kills (SIGTERM/SIGKILL)
    # mid-write, which can corrupt the database on busy systems; see
    # issue #30636. The performance cost is minimal for typical gateway
    # write patterns (< 100 writes/sec) and eliminates the corruption
    # window that launchd's aggressive SIGKILL creates.
    conn.execute("PRAGMA synchronous=FULL")
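One way to confirm that both pragmas actually took effect is to read them back after setting them. The sketch below is a minimal illustration using Python's standard `sqlite3` module; `open_durable` and `synchronous_level` are hypothetical names, not the project's `apply_wal_with_fallback()`, and the temporary database path is an assumption made for the example.

```python
import os
import sqlite3
import tempfile


def open_durable(path: str) -> sqlite3.Connection:
    """Hypothetical helper: open a connection in WAL mode with synchronous=FULL."""
    conn = sqlite3.connect(path)
    # PRAGMA journal_mode returns the mode that is actually in effect.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    if mode.lower() != "wal":
        conn.close()
        raise RuntimeError(f"WAL not enabled, journal_mode is {mode!r}")
    conn.execute("PRAGMA synchronous=FULL")
    return conn


def synchronous_level(conn: sqlite3.Connection) -> int:
    """Read back the setting: 0=OFF, 1=NORMAL, 2=FULL, 3=EXTRA."""
    return conn.execute("PRAGMA synchronous").fetchone()[0]


# Throwaway database file, used only for this illustration.
db_path = os.path.join(tempfile.mkdtemp(), "state.db")
conn = open_durable(db_path)
level = synchronous_level(conn)
print(level)
conn.close()
```

The `synchronous` setting applies per connection, so every connection that writes to the database would need the same call, while `journal_mode=WAL` persists in the database file.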
Comment on lines +460 to +465

    Attempts a TRUNCATE WAL checkpoint first so that exiting processes
    flush the WAL back into the main DB file. TRUNCATE is stronger than
    PASSIVE: it blocks until all committed WAL frames are written to the
    database file and then truncates the WAL to zero bytes. This prevents
    the corruption scenario described in issue #30636 where SIGTERM under
    high load leaves uncheckpointed WAL pages behind.
Comment on lines +470 to +476

        self._conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
    except Exception:
        pass
    # Fallback to PASSIVE if TRUNCATE fails (e.g. active readers)
    try:
        self._conn.execute("PRAGMA wal_checkpoint(PASSIVE)")
    except Exception:
        pass
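As excerpted, the PASSIVE checkpoint runs whether or not TRUNCATE succeeded. A blocked TRUNCATE is also normally reported through the first column (`busy`) of the pragma's result row rather than as an exception. The sketch below shows one way to make the fallback conditional on that row. It is not the PR's implementation: `checkpoint_on_close` and the `sessions` table are hypothetical names chosen for illustration.

```python
import os
import sqlite3
import tempfile


def checkpoint_on_close(conn: sqlite3.Connection) -> str:
    """Hypothetical helper: return the checkpoint mode that completed.

    Tries TRUNCATE first and falls back to PASSIVE only when TRUNCATE
    raised or reported that it was blocked.
    """
    for mode in ("TRUNCATE", "PASSIVE"):
        try:
            # The result row is (busy, wal_frames, checkpointed_frames).
            row = conn.execute(f"PRAGMA wal_checkpoint({mode})").fetchone()
        except sqlite3.Error:
            continue
        if row is not None and row[0] == 0:
            return mode
    return "NONE"


# Throwaway database, used only for this illustration.
db_path = os.path.join(tempfile.mkdtemp(), "state.db")
conn = sqlite3.connect(db_path)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("CREATE TABLE sessions (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO sessions (body) VALUES ('hello')")
conn.commit()

completed = checkpoint_on_close(conn)
wal_size = os.path.getsize(db_path + "-wal")
print(completed, wal_size)
conn.close()
```

Returning the completed mode gives the caller something to log, which may help when diagnosing shutdowns where the WAL was left non-empty.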
This was referenced May 24, 2026
Summary
Fixes #30636 — state.db corruption from SIGTERM during launchd shutdown under high load.
Root Cause
The gateway sets `journal_mode=WAL` but never raises `synchronous` from its default (NORMAL). In WAL mode, NORMAL only flushes WAL frames after the write lock is released. If the process is killed before that happens, unflushed WAL pages can corrupt the database.

Combined with launchd SIGKILL (1.5s grace), this caused 3 database corruptions in 48h on a busy gateway.
Changes
- `apply_wal_with_fallback()`: Set `PRAGMA synchronous=FULL` after enabling WAL mode. This ensures WAL frames are flushed to disk before the write transaction returns, closing the corruption window.
- `SessionDB.close()`: Upgrade from `PASSIVE` to `TRUNCATE` WAL checkpoint on shutdown. TRUNCATE blocks until all committed WAL frames are written to the database file and then truncates the WAL to zero bytes. Falls back to PASSIVE if TRUNCATE fails (e.g. active readers).

Performance Impact
`synchronous=FULL` adds ~0.1-0.5ms per write transaction on SSD. For typical gateway patterns (< 100 writes/sec), this is negligible. The trade-off is worthwhile to prevent database corruption.

Testing
`test_hermes_state.py` tests pass
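A regression check for the shutdown path could look roughly like the sketch below, which is not part of `test_hermes_state.py`. A child process commits rows with the settings from this PR and then exits abruptly without calling `close()`. The parent reopens the file and runs `PRAGMA integrity_check`. The `sessions` table, the row count, and the use of `os._exit` as a stand-in for a kill signal are all assumptions, and this does not reproduce a SIGKILL under high load.

```python
import os
import sqlite3
import subprocess
import sys
import tempfile
import textwrap

# Child script: commit rows, then exit without close() or a checkpoint.
CHILD = textwrap.dedent("""
    import os, sqlite3, sys
    conn = sqlite3.connect(sys.argv[1])
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA synchronous=FULL")
    conn.execute("CREATE TABLE IF NOT EXISTS sessions "
                 "(id INTEGER PRIMARY KEY, body TEXT)")
    for i in range(50):
        conn.execute("INSERT INTO sessions (body) VALUES (?)", (f"row {i}",))
        conn.commit()
    os._exit(1)  # abrupt exit, stands in for an external kill
""")


def run_abrupt_exit_check(db_path: str):
    """Run the child, then verify the database from a fresh connection."""
    proc = subprocess.run([sys.executable, "-c", CHILD, db_path])
    conn = sqlite3.connect(db_path)
    try:
        integrity = conn.execute("PRAGMA integrity_check").fetchone()[0]
        rows = conn.execute("SELECT COUNT(*) FROM sessions").fetchone()[0]
    finally:
        conn.close()
    return proc.returncode, integrity, rows


db_path = os.path.join(tempfile.mkdtemp(), "state.db")
returncode, integrity, rows = run_abrupt_exit_check(db_path)
print(returncode, integrity, rows)
```

A closer reproduction of issue #30636 would send a real signal to the child while a write is in flight, which is harder to make deterministic in a test suite.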