fix(kanban): salvage cross-process init flock + busy_timeout (from #32759) by kshitijk4poor · Pull Request #33696 · NousResearch/hermes-agent

kshitijk4poor · 2026-05-28T06:24:06Z

Summary

Salvages PR #32759 (@MoonRay305 / Squiddy of 2rook.ai) onto current main with a manual conflict resolution that preserves the SQLite pragma hardening from #33482 (just landed) alongside this PR's new cross-process flock + busy_timeout knob. Authorship preserved on both contributor commits via cherry-pick.

Closes #32759. Closes #32461 (alternative inter-process flock by @WenhuaXia, weaker design — fcntl import at module level breaks Windows; uses id(conn) as dict key with potential leaks). Closes #31965 (@steveonjava's flock-on-every-write_txn variant — see "Why this PR over the alternatives" below). Refs #31158.

What this PR adds (defense-in-depth on top of #33482)

hermes_cli/kanban_db.py:

_cross_process_init_lock(path) — context manager using fcntl.flock(LOCK_EX) (Unix) and msvcrt.locking(LK_LOCK) (Windows, added in commit 2). Wraps the first-connect setup phase (header validation + integrity probe + WAL pragma + schema init + additive migrations) in a process-level exclusive lock via a <db>.init.lock sidecar file. Process-local _INIT_LOCK (the existing threading.RLock) stays nested inside — they're complementary: threading lock for in-process races, file lock for cross-process races.
_resolve_busy_timeout_ms() + DEFAULT_BUSY_TIMEOUT_MS=120_000 — replaces the previously-hardcoded timeout=30 and timeout=5 with an explicit 2-minute busy timeout. Configurable via HERMES_KANBAN_BUSY_TIMEOUT_MS env var, following the established HERMES_KANBAN_* env-var convention (CLAIM_TTL_SECONDS, CRASH_GRACE_SECONDS, etc.). Lets concurrent writers serialize via WAL rather than failing with database is locked during bursts.
_sqlite_connect(path) — centralized sqlite3.connect wrapper that applies the busy timeout consistently across connect() and _guard_existing_db_is_healthy() probes (was inconsistent before: 5s vs 30s).

Manual conflict resolution (what changed from the original PR)

The PR was authored against pre-#33482 code and a verbatim cherry-pick would have reverted the just-landed SQLite hardening pragmas. Specifically, the PR's connect() re-write hardcoded synchronous=NORMAL (the old baseline), missing all four pragmas we just added:

-                conn.execute("PRAGMA synchronous=NORMAL")
+                conn.execute("PRAGMA synchronous=FULL")
+                conn.execute("PRAGMA wal_autocheckpoint=100")
                 conn.execute("PRAGMA foreign_keys=ON")
+                conn.execute("PRAGMA secure_delete=ON")
+                conn.execute("PRAGMA cell_size_check=ON")

The merge resolution keeps both contributions: the PR's new _cross_process_init_lock + _sqlite_connect + busy_timeout infrastructure AND #33482's pragma block (FULL, secure_delete, cell_size_check, wal_autocheckpoint). Verified via E2E test (see below).

Why this PR over the alternatives

Three different open PRs proposed inter-process locking for the same problem class:

PR	Author	Scope	Cost	Windows	Issues
#32759 (THIS)	@MoonRay305 / Squiddy	First-connect only	One-time	✅ msvcrt	Clean, minimal
#32461	@WenhuaXia	First-connect, but `fcntl` import at module level	One-time	❌ breaks on Windows immediately	Uses `id(conn)` dict key, leaks unless tracked
#31965	@steveonjava	First-connect AND every `write_txn` (flock per-transaction)	Per-write	❌ skipped (no-op)	Overkill — SQLite serializes via WAL once busy_timeout is set

#32759 is the cleanest design. The WAL-init race is the actual scenario; once both processes are on WAL with synchronous=FULL + wal_autocheckpoint=100, BEGIN IMMEDIATE should not deadlock — SQLite serializes via WAL with the new busy_timeout knob. #31965's per-write flock would tax every kanban write without addressing the root cause that #33482 commit 8 (skip redundant WAL pragma on already-WAL connections) already fixed at the source.

#33482 commit 8 closes the redundant-WAL-pragma path on already-WAL connections. #32759 covers the remaining gap: the first connection from each process still hits the WAL-init path, and two processes racing on a fresh DB can produce the mixed-journal-mode corruption pattern. #32759's flock serializes that first-open across processes.

Tests + E2E verification

.venv/bin/python -m pytest tests/hermes_cli/test_kanban_db.py -o addopts= --tb=line -q
# 196 passed

Broader sweep (kanban + state + dispatcher functionality):

.venv/bin/python -m pytest tests/hermes_cli/test_kanban_db.py tests/test_hermes_state.py \
  tests/test_hermes_state_wal_fallback.py tests/hermes_cli/test_kanban_core_functionality.py \
  tests/hermes_cli/test_kanban_db_init.py -o addopts= --tb=line -q
# 599 passed, 1 skipped, 3 failed (3 pre-existing on origin/main — live_system_guard
# rejects os.kill(99999, …) on macOS sandboxes; CI is fine)

E2E (4 concurrent subprocesses against a fresh DB):

✅ _cross_process_init_lock fires (<db>.init.lock sidecar created)
✅ journal_mode=wal
✅ synchronous=2 (FULL — fix(kanban): batch-salvage 7 SQLite corruption hardening fixes from #32857 #33482's pragma preserved)
✅ secure_delete=1 (ON — fix(kanban): batch-salvage 7 SQLite corruption hardening fixes from #32857 #33482's pragma preserved)
✅ cell_size_check=1 (ON — fix(kanban): batch-salvage 7 SQLite corruption hardening fixes from #32857 #33482's pragma preserved)
✅ wal_autocheckpoint=100 (fix(kanban): batch-salvage 7 SQLite corruption hardening fixes from #32857 #33482's pragma preserved)
✅ foreign_keys=1 (ON)
✅ busy_timeout=60000 (this PR's HERMES_KANBAN_BUSY_TIMEOUT_MS env var honored)

ruff check on all changed files: clean.

Credit

Both substantive commits authored by @MoonRay305 (Squiddy of 2rook.ai) — squiddy@2rook.ai identity, preserved per-commit. Cherry-picked with git cherry-pick (which preserves author); manual merge resolution applied to the pragma block. Follow-up chore(release) commit (AUTHOR_MAP entry) attributed to the salvager.

… salvage Adds `squiddy@2rook.ai → MoonRay305` to AUTHOR_MAP so contributor_audit.py passes for the salvaged commits in NousResearch#33482-followup PR.

…on under concurrent workers Root cause: PR NousResearch#33696 only locked the init phase, leaving write_txn() unprotected. Under concurrent multi-worker writes (task_events + task_runs + status transitions all within the same gateway tick), BEGIN IMMEDIATE transactions from different processes can race and corrupt the SQLite WAL — proven by 7 corruptions in 33 min. Fix: Acquire an exclusive flock (.db.lock) around every write_txn() call. DDL migrations in connect() also use the same lock file, so schema init cannot collide with in-flight DML from other processes. Graceful no-op on platforms without fcntl (Windows) or when lock open fails. Reference: Original PR NousResearch#32461 was closed as duplicate of NousResearch#33696, but NousResearch#33696's 'SQLite serializes via WAL once busy_timeout is set' assumption was proven wrong by real-world corruption data.

Squiddy and others added 3 commits May 28, 2026 11:49

fix(kanban): harden sqlite connection concurrency

709e0d9

fix(kanban): add Windows init lock guard

938823c

chore(release): map MoonRay305 contributor email for NousResearch#32759…

8ba7c5a

… salvage Adds `squiddy@2rook.ai → MoonRay305` to AUTHOR_MAP so contributor_audit.py passes for the salvaged commits in NousResearch#33482-followup PR.

kshitijk4poor merged commit 7b778db into NousResearch:main May 28, 2026
18 of 19 checks passed

This was referenced May 28, 2026

fix(kanban): harden SQLite connection concurrency #32759

Closed

fix(kanban): add inter-process file lock to prevent DB corruption under concurrent writes #32461

Closed

fix(kanban): flock WAL init and write_txn against concurrent processes #31965

Closed

alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/plugins Plugin system and bundled plugins labels May 28, 2026

kshitijk4poor mentioned this pull request May 28, 2026

fix(kanban): remove false-positive corruption detection from separate probe connection #32449

Closed

WenhuaXia mentioned this pull request May 31, 2026

fix(kanban): add inter-process file lock on write_txn() to prevent DB corruption #35770

Closed

WenhuaXia mentioned this pull request May 31, 2026

fix(kanban): add inter-process flock to prevent SQLite WAL corruption #35787

Closed

hanzckernel mentioned this pull request Jun 2, 2026

fix(kanban): serialize cross-process SQLite writes #37350

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(kanban): salvage cross-process init flock + busy_timeout (from #32759)#33696

fix(kanban): salvage cross-process init flock + busy_timeout (from #32759)#33696
kshitijk4poor merged 3 commits into
NousResearch:mainfrom
kshitijk4poor:salvage/pr-32759-kanban-concurrency

kshitijk4poor commented May 28, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

kshitijk4poor commented May 28, 2026

Summary

What this PR adds (defense-in-depth on top of #33482)

Manual conflict resolution (what changed from the original PR)

Why this PR over the alternatives

Tests + E2E verification

Credit

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants