fix(kanban): flock WAL init and write_txn against concurrent processes #31965
steveonjava wants to merge 2 commits into
SQLite WAL mode initialization and write transactions were not protected against concurrent processes starting simultaneously. Under load (multiple gateway instances or rapid restarts), this could cause database corruption or EIO errors on the `-shm` file.

- flock around `PRAGMA journal_mode=WAL` in `hermes_state.py`
- flock around `BEGIN IMMEDIATE` in `kanban_db.py` `write_txn`

Fixes hangs and EIO errors seen in multi-gateway deployments.
Closing: superseded by #33482 (root-cause WAL-init fix, your own work via the batch salvage) + #33696 (@MoonRay305's first-connect-only flock). After thinking through the layering, your per-`write_txn` flock here would tax every kanban transaction without addressing what's actually broken. The corruption scenario is the first-connect WAL-init race between processes, not concurrent Both your batch salvage and MoonRay305's follow-on landed on this same investigation; credit on the root-cause fix is preserved in #33482. Thanks!
What changes
Two database operations are now serialized with a file lock when multiple processes share the same kanban DB:
- WAL initialization (`PRAGMA journal_mode=WAL`): happens once when a process first opens the DB. Without serialization, two processes racing on first-open can leave one in WAL mode and the other in DELETE mode, causing silent write divergence.
- `BEGIN IMMEDIATE` in write transactions: concurrent `BEGIN IMMEDIATE` calls from different processes can deadlock. The flock ensures only one process enters the write path at a time, letting the others queue instead of failing.

Both locks are file-based (`.wal-init.lock` and `.write-txn.lock` alongside the DB file) and are skipped automatically on Windows and when the DB is in-memory. The `HERMES_WAL_INIT_FLOCK_DISABLE` env var disables the WAL-init lock for environments where file locking is unreliable (e.g. network filesystems).

When you'd see a problem without this
Concurrent gateway/dispatcher processes (e.g. during a restart race, or a multi-process deployment) opening the same kanban DB can produce:
`BEGIN IMMEDIATE`

Both are rare in normal single-process operation. They appear under load or on fast restarts.
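For readers who want a concrete picture of the two locks described under "What changes", here is a minimal Python sketch. It is not the code from `hermes_state.py` or `kanban_db.py`. The helper names (`file_lock`, `_lock_for`, `connect_with_wal`), the way the lock file path is derived (DB path plus suffix), and the rule that any non-empty `HERMES_WAL_INIT_FLOCK_DISABLE` value disables the WAL-init lock are all assumptions made for illustration.

```python
import contextlib
import os
import sqlite3

try:
    import fcntl  # POSIX only
except ImportError:  # e.g. Windows, where the PR skips locking
    fcntl = None


@contextlib.contextmanager
def file_lock(lock_path):
    """Hold an exclusive advisory flock on lock_path for the block."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until other holders release
        yield
    finally:
        os.close(fd)  # closing the descriptor releases the lock


def _lock_for(db_path, suffix):
    """Hypothetical helper: real lock, or a no-op when locking is skipped."""
    if fcntl is None or db_path == ":memory:":
        return contextlib.nullcontext()
    # Assumed naming: lock file sits next to the DB file.
    return file_lock(db_path + suffix)


def connect_with_wal(db_path):
    """Open the DB and switch to WAL while holding the WAL-init lock."""
    # isolation_level=None: sqlite3 will not open transactions implicitly,
    # so the explicit BEGIN IMMEDIATE in write_txn is the only BEGIN issued.
    conn = sqlite3.connect(db_path, isolation_level=None)
    if os.environ.get("HERMES_WAL_INIT_FLOCK_DISABLE"):  # assumed semantics
        lock = contextlib.nullcontext()
    else:
        lock = _lock_for(db_path, ".wal-init.lock")
    with lock:
        mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    return conn, mode


@contextlib.contextmanager
def write_txn(conn, db_path):
    """Serialize writers across processes, then run one write transaction."""
    with _lock_for(db_path, ".write-txn.lock"):
        conn.execute("BEGIN IMMEDIATE")
        try:
            yield conn
        except BaseException:
            conn.execute("ROLLBACK")
            raise
        else:
            conn.execute("COMMIT")
```

In this sketch the WAL-init lock is taken once per connection, while the write lock is taken on every transaction. That difference in cost is the layering question raised in the closing comment above.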
How to Test
195 tests pass. Flock behavior is covered by the `test_apply_wal_with_fallback_*` and `test_write_txn_flock_*` test groups.

Checklist