Skip to content

fix(kanban): salvage cross-process init flock + busy_timeout (from #32759)#33696

Merged
kshitijk4poor merged 3 commits into
NousResearch:mainfrom
kshitijk4poor:salvage/pr-32759-kanban-concurrency
May 28, 2026
Merged

fix(kanban): salvage cross-process init flock + busy_timeout (from #32759)#33696
kshitijk4poor merged 3 commits into
NousResearch:mainfrom
kshitijk4poor:salvage/pr-32759-kanban-concurrency

Conversation

@kshitijk4poor

Copy link
Copy Markdown
Collaborator

Summary

Salvages PR #32759 (@MoonRay305 / Squiddy of 2rook.ai) onto current main with a manual conflict resolution that preserves the SQLite pragma hardening from #33482 (just landed) alongside this PR's new cross-process flock + busy_timeout knob. Authorship preserved on both contributor commits via cherry-pick.

Closes #32759. Closes #32461 (alternative inter-process flock by @WenhuaXia, weaker design — fcntl import at module level breaks Windows; uses id(conn) as dict key with potential leaks). Closes #31965 (@steveonjava's flock-on-every-write_txn variant — see "Why this PR over the alternatives" below). Refs #31158.

What this PR adds (defense-in-depth on top of #33482)

hermes_cli/kanban_db.py:

  1. _cross_process_init_lock(path) — context manager using fcntl.flock(LOCK_EX) (Unix) and msvcrt.locking(LK_LOCK) (Windows, added in commit 2). Wraps the first-connect setup phase (header validation + integrity probe + WAL pragma + schema init + additive migrations) in a process-level exclusive lock via a <db>.init.lock sidecar file. Process-local _INIT_LOCK (the existing threading.RLock) stays nested inside — they're complementary: threading lock for in-process races, file lock for cross-process races.
  2. _resolve_busy_timeout_ms() + DEFAULT_BUSY_TIMEOUT_MS=120_000 — replaces the previously-hardcoded timeout=30 and timeout=5 with an explicit 2-minute busy timeout. Configurable via HERMES_KANBAN_BUSY_TIMEOUT_MS env var, following the established HERMES_KANBAN_* env-var convention (CLAIM_TTL_SECONDS, CRASH_GRACE_SECONDS, etc.). Lets concurrent writers serialize via WAL rather than failing with database is locked during bursts.
  3. _sqlite_connect(path) — centralized sqlite3.connect wrapper that applies the busy timeout consistently across connect() and _guard_existing_db_is_healthy() probes (was inconsistent before: 5s vs 30s).

Manual conflict resolution (what changed from the original PR)

The PR was authored against pre-#33482 code and a verbatim cherry-pick would have reverted the just-landed SQLite hardening pragmas. Specifically, the PR's connect() re-write hardcoded synchronous=NORMAL (the old baseline), missing all four pragmas we just added:

-                conn.execute("PRAGMA synchronous=NORMAL")
+                conn.execute("PRAGMA synchronous=FULL")
+                conn.execute("PRAGMA wal_autocheckpoint=100")
                 conn.execute("PRAGMA foreign_keys=ON")
+                conn.execute("PRAGMA secure_delete=ON")
+                conn.execute("PRAGMA cell_size_check=ON")

The merge resolution keeps both contributions: the PR's new _cross_process_init_lock + _sqlite_connect + busy_timeout infrastructure AND #33482's pragma block (FULL, secure_delete, cell_size_check, wal_autocheckpoint). Verified via E2E test (see below).

Why this PR over the alternatives

Three different open PRs proposed inter-process locking for the same problem class:

PR Author Scope Cost Windows Issues
#32759 (THIS) @MoonRay305 / Squiddy First-connect only One-time ✅ msvcrt Clean, minimal
#32461 @WenhuaXia First-connect, but fcntl import at module level One-time ❌ breaks on Windows immediately Uses id(conn) dict key, leaks unless tracked
#31965 @steveonjava First-connect AND every write_txn (flock per-transaction) Per-write ❌ skipped (no-op) Overkill — SQLite serializes via WAL once busy_timeout is set

#32759 is the cleanest design. The WAL-init race is the actual scenario; once both processes are on WAL with synchronous=FULL + wal_autocheckpoint=100, BEGIN IMMEDIATE should not deadlock — SQLite serializes via WAL with the new busy_timeout knob. #31965's per-write flock would tax every kanban write without addressing the root cause that #33482 commit 8 (skip redundant WAL pragma on already-WAL connections) already fixed at the source.

#33482 commit 8 closes the redundant-WAL-pragma path on already-WAL connections. #32759 covers the remaining gap: the first connection from each process still hits the WAL-init path, and two processes racing on a fresh DB can produce the mixed-journal-mode corruption pattern. #32759's flock serializes that first-open across processes.

Tests + E2E verification

.venv/bin/python -m pytest tests/hermes_cli/test_kanban_db.py -o addopts= --tb=line -q
# 196 passed

Broader sweep (kanban + state + dispatcher functionality):

.venv/bin/python -m pytest tests/hermes_cli/test_kanban_db.py tests/test_hermes_state.py \
  tests/test_hermes_state_wal_fallback.py tests/hermes_cli/test_kanban_core_functionality.py \
  tests/hermes_cli/test_kanban_db_init.py -o addopts= --tb=line -q
# 599 passed, 1 skipped, 3 failed (3 pre-existing on origin/main — live_system_guard
# rejects os.kill(99999, …) on macOS sandboxes; CI is fine)

E2E (4 concurrent subprocesses against a fresh DB):

ruff check on all changed files: clean.

Credit

Both substantive commits authored by @MoonRay305 (Squiddy of 2rook.ai)squiddy@2rook.ai identity, preserved per-commit. Cherry-picked with git cherry-pick (which preserves author); manual merge resolution applied to the pragma block. Follow-up chore(release) commit (AUTHOR_MAP entry) attributed to the salvager.

Squiddy and others added 3 commits May 28, 2026 11:49
… salvage

Adds `squiddy@2rook.ai → MoonRay305` to AUTHOR_MAP so contributor_audit.py
passes for the salvaged commits in NousResearch#33482-followup PR.
@kshitijk4poor kshitijk4poor merged commit 7b778db into NousResearch:main May 28, 2026
18 of 19 checks passed
@alt-glitch alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/plugins Plugin system and bundled plugins labels May 28, 2026
WenhuaXia added a commit to WenhuaXia/hermes-agent that referenced this pull request May 31, 2026
…on under concurrent workers

Root cause: PR NousResearch#33696 only locked the init phase, leaving write_txn() unprotected.
Under concurrent multi-worker writes (task_events + task_runs + status transitions
all within the same gateway tick), BEGIN IMMEDIATE transactions from different
processes can race and corrupt the SQLite WAL — proven by 7 corruptions in 33 min.

Fix: Acquire an exclusive flock (.db.lock) around every write_txn() call.
DDL migrations in connect() also use the same lock file, so schema init
cannot collide with in-flight DML from other processes.

Graceful no-op on platforms without fcntl (Windows) or when lock open fails.

Reference: Original PR NousResearch#32461 was closed as duplicate of NousResearch#33696, but
NousResearch#33696's 'SQLite serializes via WAL once busy_timeout is set' assumption
was proven wrong by real-world corruption data.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/plugins Plugin system and bundled plugins P3 Low — cosmetic, nice to have type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants