# kanban.db corruption recurs under concurrent reclaim-SIGKILL even with synchronous=FULL + wal_autocheckpoint=100 #31618
---

## Summary

`~/.hermes/kanban.db` starts returning `database disk image is malformed` on the affected profile when several long-running workers run concurrently and the reclaim path force-kills one of them mid-transaction. The PRAGMA settings from #30973 (`synchronous=FULL` + `wal_autocheckpoint=100`) protect against clean-shutdown durability races, but they do not appear to protect against a `SIGKILL` that lands during a WAL frame write.
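
For reference, a minimal sketch of what the #30973 settings look like at connection time, and how to read them back to confirm they are in effect. The helper names `open_kanban_db` and `describe_settings` are hypothetical and not taken from `hermes_cli`. Note that `synchronous` and `wal_autocheckpoint` are per-connection settings in SQLite, so every worker connection would need to apply them.

```python
import sqlite3


def open_kanban_db(path):
    """Open a SQLite database with the settings described in this report."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")        # persistent, stored in the file
    conn.execute("PRAGMA synchronous=FULL")        # per-connection
    conn.execute("PRAGMA wal_autocheckpoint=100")  # per-connection
    return conn


def describe_settings(conn):
    """Read the effective settings back from an open connection."""
    return {
        "journal_mode": conn.execute("PRAGMA journal_mode").fetchone()[0],
        "synchronous": conn.execute("PRAGMA synchronous").fetchone()[0],
        "wal_autocheckpoint": conn.execute("PRAGMA wal_autocheckpoint").fetchone()[0],
    }
```

`PRAGMA synchronous` reports a number rather than a name; `FULL` reads back as `2`.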

Related prior incidents:
- #30896 — initial concurrent-writer corruption report
- #30973 — `synchronous=NORMAL` → `FULL` + tight `wal_autocheckpoint` fix; resolved the #30896 case but did not generalise to the kill-mid-write path described here

## Symptom

After ~45 minutes of a workload that fires a new ready task every ~5 minutes (with some skill runs exceeding `max_runtime_seconds`), kanban operations on the affected profile start failing:

```
$ hermes kanban heartbeat <task_id>
Error: database disk image is malformed

$ hermes kanban complete <task_id>
Error: database disk image is malformed

$ sqlite3 ~/.hermes/kanban.db "PRAGMA integrity_check;"
*** in database main ***
Page <N>: btreeInitPage() returns error code 11
... (multiple damaged pages)
```

The actual skill work completes successfully (skills write back to their downstream system of record via API); only the kanban metadata layer is poisoned.
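
The same check can be run from Python, for example from a poller that wants to stop dispatching once the metadata layer is damaged. This is a minimal sketch; `check_kanban_db` is a hypothetical helper, not an existing Hermes function.

```python
import sqlite3


def check_kanban_db(path):
    """Return (healthy, messages) for the SQLite file at `path`."""
    try:
        conn = sqlite3.connect(path)
        try:
            rows = conn.execute("PRAGMA integrity_check").fetchall()
        finally:
            conn.close()
    except sqlite3.DatabaseError as exc:
        # With severe damage the PRAGMA itself can fail instead of
        # returning a list of problems.
        return False, [str(exc)]
    messages = [row[0] for row in rows]
    # A healthy database yields exactly one row containing "ok".
    return messages == ["ok"], messages
```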

## Repro recipe

Workload that surfaces this reliably:
1. A profile with a poller-style dispatcher that enqueues one new ready task every N minutes (in our case, every 5).
2. Skills with long runtimes (we saw 1800 s, 2490 s, and 41 min, exceeding the profile's `max_runtime_seconds`).
3. Heartbeat interval ~30s during the skill execution.
4. No `max_concurrent_workers` ceiling (every ready task spawns its own worker subprocess; up to ~5 workers ran concurrently in our incident).
5. After some number M of reclaim-path `SIGKILL`s (`kill -9`) against concurrently running workers, `kanban.db` corrupts.

Re-creating in isolation should be possible by running the existing kanban concurrent-writer stress test with a SIGKILL injection partway through a write transaction.
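
Until that test is adapted, a standalone harness along these lines might serve as a starting point. It is a sketch under assumptions: it does not use the existing stress test, the `heartbeats` table and the function name `run_kill_stress` are made up for illustration, and nothing here guarantees that the corruption actually reproduces. It starts several writer subprocesses with the settings from this report, hard-kills them while they are writing, and returns the `integrity_check` output.

```python
import signal
import sqlite3
import subprocess
import sys
import time

# Child process body: open the DB with the report's settings and write in a
# loop, holding each transaction open briefly so a kill can land inside it.
WRITER_SRC = r"""
import sqlite3, sys, time
conn = sqlite3.connect(sys.argv[1], timeout=30, isolation_level=None)
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA synchronous=FULL")
conn.execute("PRAGMA wal_autocheckpoint=100")
while True:
    conn.execute("BEGIN")
    for _ in range(50):
        conn.execute("INSERT INTO heartbeats(worker, ts) VALUES (?, ?)",
                     (sys.argv[2], time.time()))
    time.sleep(0.01)
    conn.execute("COMMIT")
"""


def run_kill_stress(db_path, workers=4, rounds=5, run_seconds=0.5):
    """Run writers, SIGKILL them mid-run, and return integrity_check rows."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("CREATE TABLE IF NOT EXISTS heartbeats (worker TEXT, ts REAL)")
    conn.commit()
    conn.close()

    for _ in range(rounds):
        procs = [
            subprocess.Popen(
                [sys.executable, "-c", WRITER_SRC, db_path, str(i)],
                stderr=subprocess.DEVNULL,
            )
            for i in range(workers)
        ]
        time.sleep(run_seconds)  # let the writers get into their loops
        for proc in procs:
            proc.send_signal(signal.SIGKILL)  # no chance to close the connection
        for proc in procs:
            proc.wait()

    conn = sqlite3.connect(db_path)
    try:
        return [row[0] for row in conn.execute("PRAGMA integrity_check")]
    finally:
        conn.close()
```

Raising `rounds` and `workers`, and killing only a subset of writers per round while the others keep checkpointing, would bring the harness closer to the incident workload.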

## Root-cause hypothesis

Workers don't install a SIGTERM handler that closes the SQLite connection. Reclaim path:

```python
# hermes_cli/kanban_db.py:_terminate_reclaimed_worker
kill(pid, signal.SIGTERM)
for _ in range(10):
    if not _pid_alive(pid):
        return
    time.sleep(0.5)
# 5s elapsed
kill(pid, signal.SIGKILL)
```

5 seconds is tight when the child is mid-LLM-call. The child receives SIGTERM but doesn't drain its DB connection before SIGKILL lands. If the SIGKILL hits:
- between a transaction's first WAL frame write (after `BEGIN`) and its matching `COMMIT`, or
- during an auto-checkpoint (frequent, because `wal_autocheckpoint=100` is tight), or
- while another worker is in a read transaction holding a shared lock

… the WAL header can desync from the main DB pages. Once desynced, every subsequent open returns "disk image is malformed".

`synchronous=FULL` fsyncs *committed* WAL frames. It cannot rescue an in-flight transaction that gets killed before commit. WAL is normally rollback-safe for crashes, but the combination of:
- concurrent writers holding write locks
- one writer killed mid-transaction
- another writer trying to checkpoint at `wal_autocheckpoint=100`

…seems to produce a state SQLite's recovery can't reconcile.

## Suggested fix directions

In rough order of effort / impact:

1. **Worker-side SIGTERM handler** that flushes the kanban DB connection:
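   ```python
   import signal
   import sqlite3
   import sys

   def _on_sigterm(signum, frame):
       try:
           if _kanban_conn is not None:  # the worker's module-level kanban connection
               _kanban_conn.rollback()  # a checkpoint fails while a write transaction is open
               _kanban_conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")
               _kanban_conn.close()
       except sqlite3.Error:
           pass  # best effort: exit even if the flush fails
       finally:
           sys.exit(143)  # 128 + SIGTERM (15)
   signal.signal(signal.SIGTERM, _on_sigterm)
   ```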
   Install this in the worker entry point that opens the kanban connection.

2. **Configurable reclaim grace window** — `HERMES_KANBAN_RECLAIM_GRACE_SECONDS`, default 30s. 5s is too aggressive for children that are mid-LLM-call.

3. **`BEGIN IMMEDIATE`** for kanban writes — moves the lock acquisition to the start of the transaction rather than the first write, so contention is serialised at acquire time. Reduces the window in which a kill leaves the WAL inconsistent.

4. **`max_concurrent_workers`** in dispatcher config — bound the number of concurrent worker subprocess spawns per profile. Currently `_default_spawn` is fire-and-forget per ready task; a flood of ready tasks → unbounded subprocess concurrency. A bounded pool (semaphore) keeps the write-contention surface area small.

5. **Optional, longer-term**: route all kanban writes through a single coordinator process (dispatcher) and have workers send write intents via a pipe/socket. Pushes the SQLite single-writer principle to its logical conclusion. Higher complexity, but eliminates the multi-writer correctness burden entirely.
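
For direction (3), a minimal sketch of what `BEGIN IMMEDIATE` writes could look like with the standard `sqlite3` module. The names `connect_autocommit` and `write_txn` are hypothetical, and the sketch assumes the kanban code can switch its connections to `isolation_level=None` so that transaction boundaries are explicit.

```python
import contextlib
import sqlite3


def connect_autocommit(path, busy_timeout_s=30.0):
    """Open a connection whose transactions are managed explicitly."""
    # isolation_level=None stops the sqlite3 module from issuing its own BEGIN.
    return sqlite3.connect(path, timeout=busy_timeout_s, isolation_level=None)


@contextlib.contextmanager
def write_txn(conn):
    """Run the enclosed statements inside a BEGIN IMMEDIATE transaction."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock up front
    try:
        yield conn
    except BaseException:
        conn.execute("ROLLBACK")
        raise
    else:
        conn.execute("COMMIT")
```

A caller would wrap each kanban write as `with write_txn(conn): conn.execute(...)`. A second writer then waits (up to `busy_timeout_s`) at `BEGIN IMMEDIATE` instead of partway through its transaction.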

Happy to send a PR for (1) and (2) if those directions are agreeable; they're the smallest deltas with the highest blast-radius reduction. (3) and (4) deserve their own design discussions.
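
To make direction (2) concrete, here is a rough sketch of a reclaim loop with a configurable grace window. It is not the current `_terminate_reclaimed_worker`; `reclaim_grace_seconds`, `pid_alive`, and `terminate_with_grace` are hypothetical names, and `pid_alive` is only a stand-in for the existing `_pid_alive` helper, whose implementation is not shown in this report.

```python
import os
import signal
import time

GRACE_ENV = "HERMES_KANBAN_RECLAIM_GRACE_SECONDS"  # name proposed in this issue
DEFAULT_GRACE_SECONDS = 30.0


def reclaim_grace_seconds(environ=os.environ):
    """Read the grace window, falling back to the default on bad input."""
    try:
        value = float(environ.get(GRACE_ENV, ""))
    except ValueError:
        return DEFAULT_GRACE_SECONDS
    return value if value > 0 else DEFAULT_GRACE_SECONDS


def pid_alive(pid):
    """Stand-in liveness probe; reaps the pid first if it is our own child."""
    try:
        reaped, _ = os.waitpid(pid, os.WNOHANG)
        if reaped == pid:
            return False  # child exited and has now been reaped
    except ChildProcessError:
        pass  # not our child; fall through to the signal-0 probe
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user
    return True


def terminate_with_grace(pid, grace_seconds=None, poll_interval=0.5):
    """Send SIGTERM, wait up to the grace window, then fall back to SIGKILL."""
    if grace_seconds is None:
        grace_seconds = reclaim_grace_seconds()
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return "gone"
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        if not pid_alive(pid):
            return "SIGTERM"  # exited within the grace window
        time.sleep(poll_interval)
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        return "SIGTERM"  # exited just as the window closed
    return "SIGKILL"
```

The return value tells the caller whether the hard kill was needed, which would be useful to log when correlating kills with later corruption.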

## Workaround for affected users

While a fix is in flight:

```bash
# When corruption is suspected:
sqlite3 ~/.hermes/kanban.db "PRAGMA integrity_check;"

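# Recovery (stop the gateway and any running workers first, so nothing writes mid-recovery):
sqlite3 ~/.hermes/kanban.db ".recover" > /tmp/recover.sql
sqlite3 ~/.hermes/kanban-new.db < /tmp/recover.sql
stamp=$(date +%Y-%m-%d)
mv ~/.hermes/kanban.db ~/.hermes/kanban.db.broken-$stamp
# Move stale WAL/shm sidecar files aside too, so they are never applied to the rebuilt DB.
for ext in wal shm; do
  if [ -e ~/.hermes/kanban.db-$ext ]; then mv ~/.hermes/kanban.db-$ext ~/.hermes/kanban.db-$ext.broken-$stamp; fi
done
mv ~/.hermes/kanban-new.db ~/.hermes/kanban.db
# restart the gateway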

# Mitigation: throttle concurrent dispatches at the producer layer
# (our poller now caps at 2 concurrent running tasks per profile).
```
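
The producer-side cap mentioned in the mitigation can be approximated with a small wrapper around `subprocess.Popen`. This is an illustrative sketch only: `BoundedSpawner` is a hypothetical class, not our poller's actual code and not how `_default_spawn` works today.

```python
import subprocess


class BoundedSpawner:
    """Spawn worker subprocesses, refusing to exceed `max_workers` at once."""

    def __init__(self, max_workers=2):
        self.max_workers = max_workers
        self._running = []

    def _reap(self):
        # poll() returns None while a child is still running.
        self._running = [p for p in self._running if p.poll() is None]

    def running_count(self):
        """Number of spawned workers that have not exited yet."""
        self._reap()
        return len(self._running)

    def try_spawn(self, argv):
        """Start `argv` if a slot is free; return the Popen, or None if full."""
        self._reap()
        if len(self._running) >= self.max_workers:
            return None  # caller leaves the task in the ready queue
        proc = subprocess.Popen(argv)
        self._running.append(proc)
        return proc
```

A dispatcher tick would call `try_spawn` once per ready task and simply retry on the next tick when it returns `None`.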

## Environment

- hermes-agent: based on `ca63746f3` (one commit ahead of `7cd1f6e2e` from #30973)
- Python 3.12, Linux x86_64
- SQLite 3.45.x (system default)
- WAL mode, `synchronous=FULL`, `wal_autocheckpoint=100` (per #30973)
- Workload: profile with 5-minute poller-driven dispatch cadence, long-running skills (>30 min in some cases), 5+ concurrent workers during peak

