fix(kanban): flock WAL init and write_txn against concurrent processes by steveonjava · Pull Request #31965 · NousResearch/hermes-agent

steveonjava · 2026-05-25T08:35:47Z

What changes

Two database operations are now serialized with a file lock when multiple processes share the same kanban DB:

WAL initialization (PRAGMA journal_mode=WAL): happens once when a process first opens the DB. Without serialization, two processes racing on first-open can leave one in WAL mode and the other in DELETE mode, causing silent write divergence.
BEGIN IMMEDIATE in write transactions: concurrent BEGIN IMMEDIATE calls from different processes can deadlock. The flock ensures only one process enters the write path at a time, so the others queue instead of failing.

Both locks are file-based (.wal-init.lock and .write-txn.lock alongside the DB file) and are skipped automatically on Windows and when the DB is in-memory. The HERMES_WAL_INIT_FLOCK_DISABLE env var disables the WAL-init lock for environments where file locking is unreliable (e.g. network filesystems).
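The two lock points described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names (sidecar_flock, open_wal, write_txn as written here), the lock-file naming (DB path plus suffix), and treating any non-empty HERMES_WAL_INIT_FLOCK_DISABLE value as "disabled" are all assumptions.

```python
import contextlib
import os
import sqlite3

try:
    import fcntl  # POSIX only
except ImportError:  # e.g. Windows: locking is skipped entirely
    fcntl = None


@contextlib.contextmanager
def sidecar_flock(db_path, suffix):
    """Hold an exclusive flock on a lock file next to db_path (hypothetical helper)."""
    # No-op when flock is unavailable or there is no file to sit alongside.
    if fcntl is None or db_path == ":memory:":
        yield
        return
    # Assumption: the lock file is named <db_path><suffix>.
    fd = os.open(db_path + suffix, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until other holders release
        yield
    finally:
        os.close(fd)  # closing the descriptor releases the lock


def open_wal(db_path):
    """Open the DB and switch to WAL while holding the init lock."""
    # isolation_level=None: autocommit, so explicit BEGIN/COMMIT work as written.
    conn = sqlite3.connect(db_path, isolation_level=None)
    # Assumption: any non-empty value of the env var disables the init lock.
    if os.environ.get("HERMES_WAL_INIT_FLOCK_DISABLE"):
        lock = contextlib.nullcontext()
    else:
        lock = sidecar_flock(db_path, ".wal-init.lock")
    with lock:
        conn.execute("PRAGMA journal_mode=WAL")
    return conn


@contextlib.contextmanager
def write_txn(conn, db_path):
    """Run one write transaction while holding the write-txn lock."""
    with sidecar_flock(db_path, ".write-txn.lock"):
        conn.execute("BEGIN IMMEDIATE")
        try:
            yield conn
            conn.execute("COMMIT")
        except BaseException:
            conn.execute("ROLLBACK")
            raise
```

Note that in this shape every write transaction pays for an open, a flock, and a close on the lock file, which is the per-transaction cost the review comment further down objects to.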

When you'd see a problem without this

Concurrent gateway/dispatcher processes (e.g. during a restart race, or a multi-process deployment) opening the same kanban DB can produce:

"database is locked" errors on BEGIN IMMEDIATE
One process silently staying in DELETE journal mode after another process set WAL

Both are rare in normal single-process operation. They appear under load or on fast restarts.

How to Test

python -m pytest tests/hermes_cli/test_kanban_db.py tests/test_hermes_state_wal_fallback.py -v

195 tests pass. Flock behavior is covered by the test_apply_wal_with_fallback_* and test_write_txn_flock_* test groups.

Checklist

Tests added for flock paths (both WAL init and write-txn)
Flock disabled safely on Windows and in-memory DBs
Escape hatch env var documented in code comments

SQLite WAL mode initialization and write transactions were not protected against concurrent processes starting simultaneously. Under load (multiple gateway instances or rapid restarts), this could cause database corruption or EIO errors on the -shm file.

- flock around PRAGMA journal_mode=WAL in hermes_state.py
- flock around BEGIN IMMEDIATE in kanban_db.py write_txn

Fixes hangs and EIO errors seen in multi-gateway deployments.

kshitijk4poor · 2026-05-28T06:29:38Z

Closing — superseded by #33482 (root-cause WAL-init fix, your own work via the batch salvage) + #33696 (@MoonRay305's first-connect-only flock).

After thinking through the layering, your per-write_txn flock here would tax every kanban transaction without addressing what's actually broken. The corruption scenario is the first-connect WAL-init race between processes — not concurrent BEGIN IMMEDIATE (SQLite handles that via WAL serialization with the busy_timeout knob #33696 added). #33482 commit 8 (skip redundant WAL pragma on already-WAL connections) eliminates the unlink-and-recreate dance for the common case, and #33696's _cross_process_init_lock covers the residual gap when two processes race on a fresh DB.
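For readers comparing the two approaches, the alternative described in this comment might look roughly like the sketch below. It is an illustration only, not the code from #33482 or #33696: the function name connect_kanban and the 5000 ms default are hypothetical, and the cross-process lock around a fresh DB's first connect (_cross_process_init_lock in #33696) is deliberately left out.

```python
import sqlite3


def connect_kanban(db_path, busy_timeout_ms=5000):
    """Open the DB, set a busy timeout, and enable WAL only if needed (hypothetical)."""
    conn = sqlite3.connect(db_path, isolation_level=None)
    # With a busy timeout, a writer that finds the DB locked retries for up to
    # this many milliseconds before raising "database is locked".
    conn.execute(f"PRAGMA busy_timeout = {int(busy_timeout_ms)}")
    # Read the current journal mode first; skip the pragma when already WAL.
    mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
    if mode.lower() != "wal":
        conn.execute("PRAGMA journal_mode=WAL")
    return conn
```

Because WAL mode is persistent in the database file, a second connection to the same file reads "wal" from the first pragma and never issues the mode change, so only the very first open of a fresh DB remains a race candidate.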

Both your batch salvage and MoonRay305's follow-on landed on this same investigation — credit on the root-cause fix is preserved in #33482. Thanks!

steveonjava mentioned this pull request May 25, 2026

fix(kanban): gate notifier watcher and harden WAL/transaction locks #31905

Closed

12 tasks

alt-glitch added labels May 25, 2026: type/bug (Something isn't working), P3 (Low: cosmetic, nice to have), comp/plugins (Plugin system and bundled plugins), comp/cli (CLI entry point, hermes_cli/, setup wizard)

chore(release): add steveonjava to AUTHOR_MAP

506c4eb

steveonjava mentioned this pull request May 26, 2026

fix(gateway): cache kanban DB connections per OS thread in GatewayRunner #32322

Closed

kshitijk4poor mentioned this pull request May 28, 2026

fix(kanban): salvage cross-process init flock + busy_timeout (from #32759) #33696

Merged

kshitijk4poor closed this in #33696 May 28, 2026

steveonjava wants to merge 2 commits into NousResearch:main from steveonjava:fix/kanban-flock-wal-and-write-txn

3 participants
