kanban.db corruption recurs under concurrent reclaim-SIGKILL even with synchronous=FULL + wal_autocheckpoint=100
Summary
~/.hermes/kanban.db returns "database disk image is malformed" on the affected profile when multiple long-running workers run concurrently and the reclaim path force-kills a worker mid-transaction. The PRAGMA settings from #30973 (synchronous=FULL + wal_autocheckpoint=100) protect against clean-shutdown durability races but do not protect against SIGKILL during a WAL frame write.
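Related prior incidents:
- The synchronous=NORMAL→FULL + tight wal_autocheckpoint fix (the PRAGMA change from #30973): resolved the #30896 case ("[Bug]: Kanban: rapid worker spawn-crash loop (sub-2s/crash) corrupts board SQLite B-tree before failure_limit trips") but did not generalise to the kill-mid-write path described here.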
Symptom
After ~45 minutes of a workload that fires a new ready task every ~5 minutes (with some skill runs exceeding max_runtime_seconds), kanban operations on the affected profile start failing:
$ hermes kanban heartbeat <task_id>
Error: database disk image is malformed
$ hermes kanban complete <task_id>
Error: database disk image is malformed
$ sqlite3 ~/.hermes/kanban.db "PRAGMA integrity_check;"
*** in database main ***
Page <N>: btreeInitPage() returns error code 11
... (multiple damaged pages)
The actual skill work completes successfully (skills write back to their downstream system of record via API); only the kanban metadata layer is poisoned.
Repro recipe
Workload that surfaces this reliably:
- A profile with a poller-style dispatcher that enqueues one new ready task every N minutes (in our case, every 5).
- Skills with long runtimes (we saw it at 1800s, 2490s, 41min, exceeding the profile's max_runtime_seconds).
- Heartbeat interval of ~30s during skill execution.
- No max_concurrent_workers ceiling (every ready task spawns its own worker subprocess; up to ~5 workers ran concurrently in our incident).
- After M concurrent kill -9s on workers (reclaim path), kanban.db corrupts.
Re-creating in isolation should be possible by running the existing kanban concurrent-writer stress test with a SIGKILL injection partway through a write transaction.
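A minimal sketch of what such a SIGKILL injection could look like, independent of the existing stress test. The table name `tasks`, the helper `kill_mid_transaction`, and the row sizes are hypothetical and do not reflect the real kanban schema. It only covers the single-writer case: one child opens a write transaction large enough to spill uncommitted pages, and the parent kills it with SIGKILL before any COMMIT. A single run may well report "ok"; the incident involved several concurrent writers, so a faithful repro would need to add those.

```python
import os
import signal
import sqlite3
import subprocess
import sys
import tempfile

# Child process: same PRAGMAs as the report, then a write transaction that
# is never committed. It prints one line once it is inside the transaction.
CHILD_SRC = r"""
import sqlite3, sys, time
conn = sqlite3.connect(sys.argv[1], isolation_level=None)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA synchronous=FULL")
conn.execute("PRAGMA wal_autocheckpoint=100")
conn.execute("BEGIN IMMEDIATE")
for _ in range(1000):
    conn.execute("INSERT INTO tasks(payload) VALUES (?)", ("x" * 4096,))
print("in-txn", flush=True)
while True:
    conn.execute("INSERT INTO tasks(payload) VALUES (?)", ("y" * 4096,))
    time.sleep(0.001)
"""


def kill_mid_transaction(db_path):
    """SIGKILL a writer inside an open transaction; return integrity_check rows."""
    setup = sqlite3.connect(db_path)
    setup.execute("PRAGMA journal_mode=WAL")
    setup.execute(
        "CREATE TABLE IF NOT EXISTS tasks (id INTEGER PRIMARY KEY, payload TEXT)"
    )
    setup.commit()
    setup.close()

    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_SRC, db_path],
        stdout=subprocess.PIPE,
        text=True,
    )
    try:
        child.stdout.readline()  # blocks until the child is mid-transaction
        os.kill(child.pid, signal.SIGKILL)  # no SIGTERM first: worst case
    finally:
        child.kill()  # no-op if the child is already gone
        child.wait()
        child.stdout.close()

    check = sqlite3.connect(db_path)
    try:
        return [row[0] for row in check.execute("PRAGMA integrity_check")]
    finally:
        check.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(kill_mid_transaction(os.path.join(tmp, "kanban-test.db")))
```

Running the function in a loop, with extra writer and reader processes attached to the same file, would be the natural next step towards the workload described above.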
Root-cause hypothesis
Workers don't install a SIGTERM handler that closes the SQLite connection. Reclaim path:
# hermes_cli/kanban_db.py:_terminate_reclaimed_worker
kill(pid, signal.SIGTERM)
for _ in range(10):
    if not _pid_alive(pid):
        return
    time.sleep(0.5)
# 5s elapsed
kill(pid, signal.SIGKILL)
5 seconds is tight when the child is mid-LLM-call. The child receives SIGTERM but doesn't drain its DB connection before SIGKILL lands. If the SIGKILL hits:
- between a BEGIN-allocated WAL frame and the matching COMMIT, or
- during a wal_autocheckpoint=100 rollover (frequent because the threshold is tight), or
- while another worker is in a read transaction holding a shared lock
… the WAL header can desync from the main DB pages. Once desynced, every subsequent open returns "disk image is malformed".
synchronous=FULL fsyncs committed WAL frames. It cannot rescue an in-flight transaction that gets killed before commit. WAL is normally rollback-safe for crashes, but the combination of:
- concurrent writers holding write locks
- one writer killed mid-transaction
- another writer trying to checkpoint at wal_autocheckpoint=100
…seems to produce a state SQLite's recovery can't reconcile.
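When narrowing this down, it may help to confirm which settings each worker connection is actually running with, since in SQLite the synchronous level and the autocheckpoint threshold are set per connection rather than stored in the database file. A small sketch; the helper name `effective_pragmas` is hypothetical and not part of hermes:

```python
import sqlite3


def effective_pragmas(conn):
    """Return the journal, sync and autocheckpoint settings seen by this connection."""
    names = ("journal_mode", "synchronous", "wal_autocheckpoint")
    return {name: conn.execute(f"PRAGMA {name}").fetchone()[0] for name in names}
```

SQLite reports synchronous as an integer (2 corresponds to FULL), so a worker that skipped the hardening PRAGMAs would show a different value here even though the file is in WAL mode.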
Suggested fix directions
In rough order of effort / impact:
1. Worker-side SIGTERM handler that flushes the kanban DB connection:
    import signal
    import sys
    def _on_sigterm(signum, frame):
        try:
            if _kanban_conn is not None:
                # Fold the WAL back into the main DB file, then release the connection.
                _kanban_conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
                _kanban_conn.close()
        finally:
            sys.exit(143)  # 128 + SIGTERM (15)
    signal.signal(signal.SIGTERM, _on_sigterm)
   Install this in the worker entry point that opens the kanban connection.
2. Configurable reclaim grace window: HERMES_KANBAN_RECLAIM_GRACE_SECONDS, default 30s. 5s is too aggressive for children that are mid-LLM-call.
3. BEGIN IMMEDIATE for kanban writes: moves the lock acquisition to the start of the transaction rather than the first write, so contention is serialised at acquire time. This reduces the window in which a kill leaves the WAL inconsistent.
4. max_concurrent_workers in dispatcher config: bound the number of concurrent worker subprocess spawns per profile. Currently _default_spawn is fire-and-forget per ready task, so a flood of ready tasks leads to unbounded subprocess concurrency. A bounded pool (semaphore) keeps the write-contention surface area small.
5. Optional, longer-term: route all kanban writes through a single coordinator process (dispatcher) and have workers send write intents via a pipe/socket. This pushes the SQLite single-writer principle to its logical conclusion. Higher complexity, but it eliminates the multi-writer correctness burden entirely.
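For the configurable grace window in direction (2), a rough sketch of how the reclaim loop could read the proposed environment variable. Only the variable name and the 30s default come from this report; `reclaim_grace_seconds`, `pid_alive` and `terminate_with_grace` are hypothetical stand-ins, not the existing `_terminate_reclaimed_worker` or `_pid_alive`.

```python
import os
import signal
import time

DEFAULT_GRACE_SECONDS = 30.0  # proposed default from direction (2)


def reclaim_grace_seconds(environ=os.environ):
    """Read HERMES_KANBAN_RECLAIM_GRACE_SECONDS; fall back to 30s if unset or invalid."""
    raw = environ.get("HERMES_KANBAN_RECLAIM_GRACE_SECONDS", "")
    try:
        value = float(raw)
    except ValueError:
        return DEFAULT_GRACE_SECONDS
    return value if value > 0 else DEFAULT_GRACE_SECONDS


def pid_alive(pid):
    """Best-effort probe: signal 0 checks existence without delivering a signal.

    An unreaped (zombie) child still counts as alive here; the real
    _pid_alive may behave differently.
    """
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True
    return True


def terminate_with_grace(pid, grace=None, poll=0.5):
    """Send SIGTERM, wait up to `grace` seconds, then SIGKILL. Returns the last signal sent."""
    grace = reclaim_grace_seconds() if grace is None else grace
    os.kill(pid, signal.SIGTERM)
    deadline = time.monotonic() + grace
    while time.monotonic() < deadline:
        if not pid_alive(pid):
            return signal.SIGTERM
        time.sleep(poll)
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        return signal.SIGTERM  # exited between the last probe and the kill
    return signal.SIGKILL
```

Using a monotonic deadline rather than a fixed iteration count keeps the window accurate if the poll interval changes; a longer window only helps if the worker also handles SIGTERM, as in direction (1).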
Happy to send a PR for (1) and (2) if those directions are agreeable; they're the smallest deltas with the highest blast-radius reduction. (3) and (4) deserve their own design discussions.
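As input to the design discussion on (3), a minimal sketch of a BEGIN IMMEDIATE wrapper using Python's sqlite3 module. The helper `immediate_txn` and the `tasks` table are hypothetical; the sketch assumes the connection is opened with isolation_level=None so that the module does not issue its own implicit BEGIN.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def immediate_txn(conn):
    """Take the write lock up front; commit on success, roll back on any error."""
    conn.execute("BEGIN IMMEDIATE")
    try:
        yield conn
    except BaseException:
        # Some errors make SQLite roll back on its own, so check first.
        if conn.in_transaction:
            conn.execute("ROLLBACK")
        raise
    else:
        conn.execute("COMMIT")


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:", isolation_level=None, timeout=5.0)
    conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, status TEXT)")
    with immediate_txn(conn):
        conn.execute("INSERT INTO tasks(status) VALUES ('ready')")
    print(conn.execute("SELECT COUNT(*) FROM tasks").fetchone()[0])
```

With this shape, a second writer blocks (up to the connection's busy timeout) or fails with "database is locked" at BEGIN rather than partway through its writes.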
Workaround for affected users
While a fix is in flight:
# When corruption is suspected:
sqlite3 ~/.hermes/kanban.db "PRAGMA integrity_check;"
# Recovery:
sqlite3 ~/.hermes/kanban.db ".recover" > /tmp/recover.sql
sqlite3 ~/.hermes/kanban-new.db < /tmp/recover.sql
mv ~/.hermes/kanban.db ~/.hermes/kanban.db.broken-$(date +%Y-%m-%d)
mv ~/.hermes/kanban-new.db ~/.hermes/kanban.db
# restart the gateway
# Mitigation: throttle concurrent dispatches at the producer layer
# (our poller now caps at 2 concurrent running tasks per profile).
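The producer-side cap mentioned in the mitigation can be approximated with a bounded semaphore around the spawn call. This is a sketch only: `BoundedSpawner` is a hypothetical name, and it is neither our poller's actual code nor hermes's `_default_spawn`.

```python
import subprocess
import sys
import threading


class BoundedSpawner:
    """Allow at most `limit` worker subprocesses to run at once."""

    def __init__(self, limit):
        self._slots = threading.BoundedSemaphore(limit)

    def try_spawn(self, argv):
        """Start argv if a slot is free; return the Popen, or None when at the cap."""
        if not self._slots.acquire(blocking=False):
            return None
        try:
            proc = subprocess.Popen(argv)
        except Exception:
            self._slots.release()
            raise
        # Free the slot once the worker exits, without blocking the caller.
        threading.Thread(target=self._reap, args=(proc,), daemon=True).start()
        return proc

    def _reap(self, proc):
        proc.wait()
        self._slots.release()


if __name__ == "__main__":
    spawner = BoundedSpawner(2)  # cap of 2, as in the mitigation above
    worker = [sys.executable, "-c", "import time; time.sleep(1)"]
    started = [spawner.try_spawn(worker) for _ in range(3)]
    print([p is not None for p in started])
```

A task that is refused a slot simply stays ready and is retried on the next poll, which keeps the number of concurrent kanban writers bounded.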
Environment
- Commit ca63746f3 (one commit ahead of 7cd1f6e2e from #30973, "feat(kanban): opt-in HERMES_KANBAN_SYNCHRONOUS_MODE for synchronous=FULL durability hardening")
- PRAGMAs: synchronous=FULL, wal_autocheckpoint=100 (per #30973)