fix(kanban): flock WAL init and write_txn against concurrent processes #31965

Closed
steveonjava wants to merge 2 commits into NousResearch:main from steveonjava:fix/kanban-flock-wal-and-write-txn

@steveonjava

What changes

Two database operations are now serialized with a file lock when multiple processes share the same kanban DB:

  1. WAL initialization (PRAGMA journal_mode=WAL) — happens once when a process first opens the DB. Without serialization, two processes racing on first-open can leave one in WAL mode and the other in DELETE mode, causing silent write divergence.

  2. BEGIN IMMEDIATE in write transactions — concurrent BEGIN IMMEDIATE calls from different processes can deadlock. The flock ensures only one process enters the write path at a time, letting the others queue instead of failing.

Both locks are file-based (.wal-init.lock and .write-txn.lock alongside the DB file) and are skipped automatically on Windows and when the DB is in-memory. The HERMES_WAL_INIT_FLOCK_DISABLE env var disables the WAL-init lock for environments where file locking is unreliable (e.g. network filesystems).
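For readers who want a picture of the pattern, here is a minimal sketch of the two locks, not the code in this PR. The helper names (file_lock, open_wal, write_txn as written here), the lock-file path built as db_path plus suffix, and the rule that any non-empty value of HERMES_WAL_INIT_FLOCK_DISABLE turns the WAL-init lock off are all assumptions made for illustration.

```python
import contextlib
import os
import sqlite3

try:
    import fcntl  # POSIX only; missing on Windows
except ImportError:
    fcntl = None


@contextlib.contextmanager
def file_lock(db_path, suffix, disable_env=None):
    """Hold an exclusive advisory lock on a sidecar file (hypothetical helper)."""
    skip = (
        fcntl is None                      # Windows: no flock available
        or db_path == ":memory:"           # in-memory DBs are never shared
        or bool(disable_env and os.environ.get(disable_env))
    )
    if skip:
        yield False
        return
    # Assumed naming: the lock file sits next to the DB as <db_path><suffix>.
    fd = os.open(db_path + suffix, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)     # blocks until other holders release
        yield True
    finally:
        os.close(fd)                       # closing the descriptor drops the lock


def open_wal(db_path):
    """Open the DB and switch to WAL while holding the WAL-init lock."""
    conn = sqlite3.connect(db_path, isolation_level=None)
    with file_lock(db_path, ".wal-init.lock", "HERMES_WAL_INIT_FLOCK_DISABLE"):
        mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    return conn, mode


@contextlib.contextmanager
def write_txn(conn, db_path):
    """Run BEGIN IMMEDIATE ... COMMIT while holding the write-txn lock."""
    with file_lock(db_path, ".write-txn.lock"):
        conn.execute("BEGIN IMMEDIATE")
        try:
            yield conn
            conn.execute("COMMIT")
        except BaseException:
            conn.execute("ROLLBACK")
            raise
```

In this sketch the write lock is held for the whole transaction, so a second process blocks in flock() rather than reaching BEGIN IMMEDIATE at all. That is also the cost: every write pays for a lock-file open and a flock call.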

When you'd see a problem without this

Concurrent gateway/dispatcher processes (e.g. during a restart race, or a multi-process deployment) opening the same kanban DB can produce:

  • "database is locked" errors on BEGIN IMMEDIATE
  • One process silently staying in DELETE journal mode after another process set WAL

Both are rare in normal single-process operation. They appear under load or on fast restarts.
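The first symptom can be reproduced in miniature. The sketch below uses two connections in one process as a stand-in for two processes, with the busy timeout set to zero so the second writer fails at once instead of waiting. The function name is hypothetical, and this is only an illustration of the SQLite behavior, not a test from this PR.

```python
import sqlite3


def second_writer_error(db_path):
    """Return the error text the second BEGIN IMMEDIATE raises, or None."""
    # timeout=0 disables busy waiting, so contention surfaces immediately.
    a = sqlite3.connect(db_path, isolation_level=None, timeout=0)
    b = sqlite3.connect(db_path, isolation_level=None, timeout=0)
    try:
        a.execute("BEGIN IMMEDIATE")       # first writer takes the write lock
        try:
            b.execute("BEGIN IMMEDIATE")   # second writer cannot get it
        except sqlite3.OperationalError as exc:
            return str(exc)
        b.execute("ROLLBACK")
        return None
    finally:
        if a.in_transaction:
            a.execute("ROLLBACK")
        a.close()
        b.close()
```

With a non-zero timeout the second writer would wait for up to that long before raising the same error, which is why the failure tends to show up only under load.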

How to Test

python -m pytest tests/hermes_cli/test_kanban_db.py tests/test_hermes_state_wal_fallback.py -v

195 tests pass. Flock behavior is covered by the test_apply_wal_with_fallback_* and test_write_txn_flock_* test groups.

Checklist

  • Tests added for flock paths (both WAL init and write-txn)
  • Flock disabled safely on Windows and in-memory DBs
  • Escape hatch env var documented in code comments

SQLite WAL mode initialization and write transactions were not
protected against concurrent processes starting simultaneously.
Under load (multiple gateway instances or rapid restarts), this
could cause database corruption or EIO errors on the -shm file.

- flock around PRAGMA journal_mode=WAL in hermes_state.py
- flock around BEGIN IMMEDIATE in kanban_db.py write_txn

Fixes hangs and EIO errors seen in multi-gateway deployments.
@alt-glitch added the type/bug (Something isn't working), P3 (Low: cosmetic, nice to have), comp/plugins (Plugin system and bundled plugins), and comp/cli (CLI entry point, hermes_cli/, setup wizard) labels on May 25, 2026
@kshitijk4poor

Closing — superseded by #33482 (root-cause WAL-init fix, your own work via the batch salvage) + #33696 (@MoonRay305's first-connect-only flock).

After thinking through the layering, your per-write_txn flock here would tax every kanban transaction without addressing what's actually broken. The corruption scenario is the first-connect WAL-init race between processes — not concurrent BEGIN IMMEDIATE (SQLite handles that via WAL serialization with the busy_timeout knob #33696 added). #33482 commit 8 (skip redundant WAL pragma on already-WAL connections) eliminates the unlink-and-recreate dance for the common case, and #33696's _cross_process_init_lock covers the residual gap when two processes race on a fresh DB.
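As a rough illustration of the approach described in this comment, the sketch below sets a busy timeout and issues the WAL pragma only when the connection is not already in WAL mode. It is not the code from #33482 or #33696: the function name ensure_wal and the 5000 ms default are assumptions, and the cross-process init lock for the fresh-DB race is left out.

```python
import sqlite3


def ensure_wal(conn, busy_timeout_ms=5000):
    """Switch to WAL only if needed; return True when a switch happened."""
    # busy_timeout makes SQLite wait and retry on a locked DB instead of
    # failing immediately, so concurrent writers queue up to this long.
    conn.execute(f"PRAGMA busy_timeout={int(busy_timeout_ms)}")

    # Reading the mode does not change it; WAL mode is stored in the DB file,
    # so a later connection sees the mode an earlier one set.
    current = conn.execute("PRAGMA journal_mode").fetchone()[0]
    if current.lower() == "wal":
        return False                       # already WAL: skip the pragma

    new_mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    return new_mode.lower() == "wal"
```

Only the first connection to a fresh database pays for the mode switch here; later connections take the read-only check and return early.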

Both your batch salvage and MoonRay305's follow-on landed on this same investigation — credit on the root-cause fix is preserved in #33482. Thanks!
