kanban.db corruption recurs under concurrent reclaim-SIGKILL even with synchronous=FULL + wal_autocheckpoint=100 #31618

@julio-cloudvisor

Summary

~/.hermes/kanban.db returns "database disk image is malformed" on the affected profile when multiple long-running workers run concurrently and the reclaim path force-kills a worker mid-transaction. The PRAGMA settings from #30973 (synchronous=FULL + wal_autocheckpoint=100) protect against clean-shutdown durability races but do not protect against SIGKILL during a WAL frame write.
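For reference, the settings named above can be applied and read back with Python's standard sqlite3 module. This is a minimal sketch, not the project's actual connection code: `open_kanban` and `describe` are hypothetical names, and WAL journal mode is assumed because the rest of this report discusses WAL frames.

```python
import sqlite3


def open_kanban(path: str) -> sqlite3.Connection:
    """Open a SQLite file with the pragmas discussed in this issue (hypothetical helper)."""
    conn = sqlite3.connect(path, timeout=30.0)
    conn.execute("PRAGMA journal_mode=WAL")        # persisted in the database file
    conn.execute("PRAGMA synchronous=FULL")        # applies to this connection only
    conn.execute("PRAGMA wal_autocheckpoint=100")  # applies to this connection only
    return conn


def describe(conn: sqlite3.Connection) -> dict:
    """Read the effective values back so they can be logged or asserted."""
    names = ("journal_mode", "synchronous", "wal_autocheckpoint")
    return {name: conn.execute(f"PRAGMA {name}").fetchone()[0] for name in names}
```

Because synchronous and wal_autocheckpoint are per-connection settings, every worker process that opens the file has to apply them itself; reading them back with `describe` is one way to confirm that a given worker actually did.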

Related prior incidents:

Symptom

After ~45 minutes of a workload that fires a new ready task every ~5 minutes (with some skill runs exceeding max_runtime_seconds), kanban operations on the affected profile start failing:

$ hermes kanban heartbeat <task_id>
Error: database disk image is malformed

$ hermes kanban complete <task_id>
Error: database disk image is malformed

$ sqlite3 ~/.hermes/kanban.db "PRAGMA integrity_check;"
*** in database main ***
Page <N>: btreeInitPage() returns error code 11
... (multiple damaged pages)

The actual skill work completes successfully (skills write back to their downstream system of record via API); only the kanban metadata layer is poisoned.

Repro recipe

Workload that surfaces this reliably:

  1. A profile with a poller-style dispatcher that enqueues one new ready task every N minutes (in our case, every 5).
  2. Skills with long runtime (we saw it at 1800s, 2490s, 41min — exceeding the profile's max_runtime_seconds).
  3. Heartbeat interval ~30s during the skill execution.
  4. No max_concurrent_workers ceiling (every ready task spawns its own worker subprocess; up to ~5 workers ran concurrently in our incident).
  5. After M concurrent kill-9s on workers (reclaim path), kanban.db corrupts.

Re-creating in isolation should be possible by running the existing kanban concurrent-writer stress test with a SIGKILL injection partway through a write transaction.
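A starting point for such an injection harness might look like the sketch below. It is POSIX-only, uses a hypothetical `tasks` table rather than the real kanban schema, and covers only a single writer killed with an open transaction. On its own it is not expected to reproduce the corruption; the concurrent writers and checkpoint pressure described in this issue would have to be layered on top.

```python
import os
import signal
import sqlite3
import subprocess
import sys

# Child process: opens a write transaction, signals readiness, then waits to be killed.
CHILD = r"""
import sqlite3, sys, time
conn = sqlite3.connect(sys.argv[1], isolation_level=None)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA synchronous=FULL")
conn.execute("PRAGMA wal_autocheckpoint=100")
conn.execute("BEGIN IMMEDIATE")
# Large enough that some uncommitted pages are likely to spill into the WAL file.
conn.executemany("INSERT INTO tasks(body) VALUES (?)", [("x" * 512,)] * 20000)
print("READY", flush=True)  # the transaction is open and uncommitted
time.sleep(60)              # the parent sends SIGKILL during this sleep
"""


def kill_mid_transaction(db_path: str):
    """SIGKILL a writer that holds an open transaction, then inspect the database."""
    setup = sqlite3.connect(db_path)
    setup.execute("PRAGMA journal_mode=WAL")
    setup.execute("CREATE TABLE IF NOT EXISTS tasks (id INTEGER PRIMARY KEY, body TEXT)")
    setup.commit()
    setup.close()

    child = subprocess.Popen(
        [sys.executable, "-c", CHILD, db_path],
        stdout=subprocess.PIPE,
        text=True,
    )
    assert child.stdout.readline().strip() == "READY"
    os.kill(child.pid, signal.SIGKILL)  # same signal the reclaim path escalates to
    child.wait()

    check = sqlite3.connect(db_path)
    result = check.execute("PRAGMA integrity_check").fetchone()[0]
    rows = check.execute("SELECT COUNT(*) FROM tasks").fetchone()[0]
    check.close()
    return result, rows
```

Calling `kill_mid_transaction` in a loop while other processes write and checkpoint against the same file would be the next step toward matching the incident workload.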

Root-cause hypothesis

Workers don't install a SIGTERM handler that closes the SQLite connection. Reclaim path:

# hermes_cli/kanban_db.py:_terminate_reclaimed_worker
kill(pid, signal.SIGTERM)
for _ in range(10):
    if not _pid_alive(pid):
        return
    time.sleep(0.5)
# 5s elapsed
kill(pid, signal.SIGKILL)

5 seconds is tight when the child is mid-LLM-call. The child receives SIGTERM but doesn't drain its DB connection before SIGKILL lands. If the SIGKILL hits:

  • between the first WAL frame a transaction writes and its matching COMMIT, or
  • during a wal_autocheckpoint=100 rollover (frequent because the threshold is so low), or
  • while another worker is in a read transaction holding a shared lock

… the WAL header can desync from the main DB pages. Once desynced, every subsequent open returns "disk image is malformed".

synchronous=FULL fsyncs committed WAL frames. It cannot rescue an in-flight transaction that gets killed before commit. WAL is normally rollback-safe for crashes, but the combination of:

  • concurrent writers holding write locks
  • one writer killed mid-transaction
  • another writer trying to checkpoint at wal_autocheckpoint=100

…seems to produce a state SQLite's recovery can't reconcile.

Suggested fix directions

In rough order of effort / impact:

  1. Worker-side SIGTERM handler that flushes the kanban DB connection:

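    import signal
    import sys

    # _kanban_conn is assumed to be the module-level connection the worker opened.
    def _on_sigterm(signum, frame):
        # Best effort: checkpoint and close the connection before exiting.
        try:
            if _kanban_conn is not None:
                _kanban_conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
                _kanban_conn.close()
        finally:
            sys.exit(143)  # 128 + SIGTERM (15)
    signal.signal(signal.SIGTERM, _on_sigterm)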

    Install this in the worker entry point that opens the kanban connection.

  2. Configurable reclaim grace window: HERMES_KANBAN_RECLAIM_GRACE_SECONDS, default 30s. 5s is too aggressive for children that are mid-LLM-call.

  3. BEGIN IMMEDIATE for kanban writes — moves the lock acquisition to the start of the transaction rather than the first write, so contention is serialised at acquire time. Reduces the window in which a kill leaves the WAL inconsistent.

  4. max_concurrent_workers in dispatcher config — bound the number of concurrent worker subprocess spawns per profile. Currently _default_spawn is fire-and-forget per ready task; a flood of ready tasks → unbounded subprocess concurrency. A bounded pool (semaphore) keeps the write-contention surface area small.

  5. Optional, longer-term: route all kanban writes through a single coordinator process (dispatcher) and have workers send write intents via a pipe/socket. Pushes the SQLite single-writer principle to its logical conclusion. Higher complexity, but eliminates the multi-writer correctness burden entirely.

Happy to send a PR for (1) and (2) if those directions are agreeable; they're the smallest deltas with the highest blast-radius reduction. (3) and (4) deserve their own design discussions.
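To make direction (3) concrete for that discussion, here is a minimal sketch of a write-transaction wrapper. `write_txn` is a hypothetical name, and the sketch assumes connections are opened with `isolation_level=None` so that Python's sqlite3 module does not issue its own implicit BEGIN.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def write_txn(conn: sqlite3.Connection):
    """Run the enclosed block inside a BEGIN IMMEDIATE transaction.

    The write lock is taken at BEGIN, so a competing writer fails or waits
    (per the connection's busy timeout) before it has written anything.
    """
    conn.execute("BEGIN IMMEDIATE")
    try:
        yield conn
    except BaseException:
        conn.execute("ROLLBACK")
        raise
    else:
        conn.execute("COMMIT")
```

A caller would wrap each group of kanban writes in `with write_txn(conn): ...`. With a busy timeout of zero, a second writer gets sqlite3.OperationalError at the BEGIN rather than partway through its statements.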

Workaround for affected users

While a fix is in flight:

# When corruption is suspected:
sqlite3 ~/.hermes/kanban.db "PRAGMA integrity_check;"

# Recovery:
sqlite3 ~/.hermes/kanban.db ".recover" > /tmp/recover.sql
sqlite3 ~/.hermes/kanban-new.db < /tmp/recover.sql
mv ~/.hermes/kanban.db ~/.hermes/kanban.db.broken-$(date +%Y-%m-%d)
mv ~/.hermes/kanban-new.db ~/.hermes/kanban.db
# restart the gateway

# Mitigation: throttle concurrent dispatches at the producer layer
# (our poller now caps at 2 concurrent running tasks per profile).
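The cap mentioned in the mitigation comment (and proposed as direction (4)) can be approximated with a semaphore around subprocess spawning. This is a sketch with hypothetical names; it is neither our poller's code nor the dispatcher's `_default_spawn`.

```python
import subprocess
import threading


class BoundedSpawner:
    """Cap how many worker subprocesses run at the same time."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def try_spawn(self, argv):
        """Start argv if a slot is free; return the Popen, or None when at capacity."""
        if not self._slots.acquire(blocking=False):
            return None
        try:
            proc = subprocess.Popen(argv)
        except Exception:
            self._slots.release()  # spawning failed, so give the slot back
            raise
        # A daemon thread waits for the child and frees the slot when it exits.
        threading.Thread(target=self._reap, args=(proc,), daemon=True).start()
        return proc

    def _reap(self, proc):
        try:
            proc.wait()
        finally:
            self._slots.release()
```

With `BoundedSpawner(2)`, a third ready task gets `None` back and stays queued until one of the running workers exits, which bounds how many processes can be writing to kanban.db at once.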

Environment

Metadata

Labels: P3 (Low; cosmetic, nice to have), comp/tools (Tool registry, model_tools, toolsets), type/bug (Something isn't working)
