kanban dispatcher wedges under multi-thread + subprocess concurrency due to WAL/SHM cache poisoning

## Summary

The kanban dispatcher embedded in `hermes gateway run` wedges after 3-4 verifier-subprocess completions, requiring a gateway restart to recover. The evidence points to a race between the gateway's multi-threaded SQLite connection cycling and verifier-subprocess connection lifecycles in WAL mode. Switching the kanban DB to `journal_mode=DELETE` eliminates the wedge in our environment; we've validated the patch under load. **Submitting as an issue first so the maintainer can decide on the right fix shape — a more surgical WAL-preserving fix may be preferable.**

## Environment

- Hermes Agent v0.14.0 (2026.5.16) — commit `d61785889`
- Linux 6.17.0-29-generic, Ubuntu 24.04
- Filesystem: local XFS (rw,noatime) — not NFS/SMB
- SQLite via Python 3.11 `sqlite3` module
- Affected: any board with multi-process write access (gateway dispatcher + verifier subprocesses)

## Symptom (measured)

The kanban dispatcher embedded in `hermes gateway run` ticks every 60s. After 3-4 verifier-subprocess completions on a single board, every subsequent tick fails with `sqlite3.OperationalError: disk I/O error` in `release_stale_claims` (`hermes_cli/kanban_db.py:2399`). Failures continue indefinitely until `hermes gateway restart`.

## Repro

1. Run `hermes gateway start` against a board with active dispatch (`kanban.dispatch_in_gateway: true`).
2. Dispatch ~4 worker tasks whose verifier subprocesses each open their own kanban DB connection (e.g., a custom verifier profile that calls `kanban_block` from inside a skill).
3. After 3-4 completions, dispatcher ticks start failing with the I/O error.
4. Every subsequent tick fails until `hermes gateway restart`.

## Direct evidence (measured)

**lsof on wedged gateway:**
```
python <PID> orion DEL-r REG ... /<HERMES_HOME>/kanban/boards/<board>/kanban.db-shm
python <PID> orion  26u  REG ... /<HERMES_HOME>/kanban/boards/<board>/kanban.db-wal (deleted)
python <PID> orion  27ur REG ... /<HERMES_HOME>/kanban/boards/<board>/kanban.db-shm (deleted)
python <PID> orion  31u  REG ... /<HERMES_HOME>/kanban/boards/<board>/kanban.db-wal (deleted)  [different inode]
python <PID> orion  35u  REG ... /<HERMES_HOME>/kanban/boards/<board>/kanban.db-wal (deleted)  [different inode]
```

Three different deleted `-wal` inodes in one gateway lifetime. Meanwhile:

- `PRAGMA integrity_check` and `quick_check`: both `ok` throughout
- Fresh `sqlite3.connect()` from a different process: same SELECT runs cleanly
- `hermes kanban dispatch --dry-run` from a fresh CLI process: works
- Only the gateway's long-lived process's connections wedge
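
For anyone who wants to collect the same evidence without `lsof`, here is a minimal Linux-only sketch that walks `/proc/<pid>/fd` and reports unlinked `-wal` / `-shm` files. The helper names (`is_deleted_sidecar`, `deleted_sidecar_fds`) are ours for illustration and are not part of Hermes. It only covers open file descriptors; the mmap-only `DEL` row in the lsof output above would show up in `/proc/<pid>/maps` instead.

```python
import os
from pathlib import Path

# Sidecar files SQLite creates next to a database in WAL mode.
SIDECAR_SUFFIXES = ("-wal", "-shm")
DELETED_MARKER = " (deleted)"


def is_deleted_sidecar(target: str) -> bool:
    """True if a /proc/<pid>/fd link target is an unlinked -wal or -shm file."""
    if not target.endswith(DELETED_MARKER):
        return False
    return target[: -len(DELETED_MARKER)].endswith(SIDECAR_SUFFIXES)


def deleted_sidecar_fds(pid: int) -> list[tuple[int, str]]:
    """List (fd, target) pairs for deleted WAL/SHM files held open by pid."""
    found = []
    for entry in Path(f"/proc/{pid}/fd").iterdir():
        try:
            target = os.readlink(entry)
        except OSError:
            continue  # fd was closed between listing and readlink
        if is_deleted_sidecar(target):
            found.append((int(entry.name), target))
    return sorted(found)
```

Calling `deleted_sidecar_fds(<gateway PID>)` periodically (as the same user or root) should make it possible to see when each deleted `-wal` FD first appears relative to verifier completions.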

## Root cause (our analysis — measured wedge + lsof evidence + inferred mechanism)

The gateway has ~7 threads. All 7 kanban DB connect sites use proper `try/finally: close()` patterns — no individual connection leak (we checked all of them):

- `gateway/run.py:4590` — notifier-watcher (~5s loop)
- `gateway/run.py:4842, 4857, 4878` — sub/unsub/advance helpers
- `gateway/run.py:5164` — dispatcher tick (60s)
- `gateway/run.py:5241` — spawn-budget watcher
- `gateway/run.py:9321` — auto-subscribe handler

The evidence is consistent with this mechanism: in WAL mode, SQLite shares a per-process `unixShmNode` and SHM mmap across all connections to the same DB within one process. The lsof output above shows the gateway holding multiple deleted-inode FDs on `-wal` / `-shm` — three different deleted `-wal` inodes per gateway lifetime, matching the 3-4 verifier-subprocess completions per wedge cycle.

Our inferred mechanism: when (a) a verifier subprocess opens its own connection to call `kanban_block`, (b) all gateway threads happen to be momentarily between connect()s, and (c) the verifier exits as the last DB-holder, SQLite's "last connection cleanup" unlinks `-wal` and `-shm`. New gateway connections see fresh files on disk, but the gateway's still-cached `unixShmNode` references the deleted inode. Subsequent SELECTs in any gateway connection fail with what we believe is `SQLITE_IOERR_SHMMAP` surfacing as the visible `disk I/O error`.

The wedge, the deleted-inode FDs, and the fix's effectiveness are **measured**. The specific `unixShmNode` poisoning mechanism is **inferred** from those measurements + SQLite's documented per-process WAL state semantics. We did not directly capture the SQLite internal error code (would require strace + a recompile to expose).
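
The "last connection cleanup" step in the inferred mechanism can be observed in isolation. The sketch below is not a repro of the wedge: it only shows, on a throwaway database in a temp directory, that the `-wal` file exists while a WAL-mode connection is open and is removed when the last connection closes. The function name and table are illustrative.

```python
import os
import sqlite3
import tempfile


def wal_presence_around_close() -> tuple[bool, bool]:
    """Return (wal exists while open, wal exists after the last close)."""
    with tempfile.TemporaryDirectory() as tmp:
        db = os.path.join(tmp, "kanban.db")  # throwaway file, not a real board
        conn = sqlite3.connect(db)
        conn.execute("PRAGMA journal_mode=WAL").fetchone()
        conn.execute("CREATE TABLE t (x INTEGER)")
        conn.execute("INSERT INTO t VALUES (1)")
        conn.commit()
        while_open = os.path.exists(db + "-wal")
        # This is the only connection, so closing it triggers the
        # checkpoint-and-unlink of the WAL sidecar files.
        conn.close()
        after_close = os.path.exists(db + "-wal")
        return while_open, after_close


if __name__ == "__main__":
    print(wal_presence_around_close())
```

On a local filesystem we would expect `(True, False)`. In the gateway scenario the closing connection belongs to a verifier subprocess, which is the part this sketch does not model.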

## Fix we applied (works in our environment, may not be the right one for upstream)

A small change in `hermes_cli/kanban_db.py` (around line 1049) replaces the `apply_wal_with_fallback` call with an unconditional `PRAGMA journal_mode=DELETE`:

```diff
         with _INIT_LOCK:
-            # WAL doesn't work on network filesystems (NFS/SMB/FUSE). Shared helper
-            # falls back to DELETE with one WARNING so kanban stays usable there.
-            from hermes_state import apply_wal_with_fallback
-            apply_wal_with_fallback(conn, db_label=f"kanban.db ({path.name})")
+            conn.execute("PRAGMA journal_mode=DELETE")
             conn.execute("PRAGMA synchronous=NORMAL")
             conn.execute("PRAGMA foreign_keys=ON")
```

Other Hermes DBs that use `apply_wal_with_fallback` (state.db, memory_store.db, response_store.db) are unaffected — only kanban DBs switch.

## Tradeoff this fix accepts

DELETE mode trades away WAL's reader/writer concurrency: a writer takes an exclusive lock on the DB, so readers and writers block each other for the duration of each write. For the kanban DB this looks invisible: small DB (~120KB in our case), low write rate (60s dispatcher tick + event-driven helpers, all sub-second), and the multi-process write pattern is exactly what makes WAL fragile here. The diff also removes the previous `apply_wal_with_fallback` call (which exists to fall back to DELETE on WAL-incompatible filesystems like NFS). DELETE mode is unconditional with this diff, so the NFS-fallback log message no longer fires.

**A more surgical fix may preserve WAL.** Options the maintainer might prefer:

1. **Shared long-lived gateway connection per board**, used by all gateway threads (would need locking) — keeps WAL but eliminates the multi-connection-per-process pattern.
2. **Checkpoint discipline**: pin the WAL/SHM lifecycle so verifier subprocesses can't trigger the "last connection cleanup" unlink (e.g., have the gateway hold a sentinel connection open for the lifetime of the process).
3. **Detect-and-reopen on I/O error**: catches the symptom and refreshes connections, but does not fix the underlying race.

We picked DELETE because it removes the failure class by construction; we don't have visibility into upstream's wider design considerations (NFS support, expected write concurrency, etc.). The patch above is offered as one resolution; happy to defer to the maintainer's preferred approach.
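
To make option 2 concrete, here is a minimal single-process sketch of the sentinel-connection idea. `open_sentinel` and `sentinel_pins_wal` are hypothetical names, not existing Hermes functions, and the "worker" here is just a second connection in the same process rather than a real verifier subprocess. It shows only the file-lifecycle effect: while the sentinel is open, a short-lived connection closing does not remove `-wal`. We have not validated this approach against the gateway.

```python
import os
import sqlite3
import tempfile


def open_sentinel(db_path: str) -> sqlite3.Connection:
    """Open a connection intended to stay open for the life of the process."""
    sentinel = sqlite3.connect(db_path, check_same_thread=False)
    sentinel.execute("PRAGMA journal_mode=WAL").fetchone()
    # Read once so this connection actually attaches to the database and WAL.
    sentinel.execute("SELECT count(*) FROM sqlite_master").fetchone()
    return sentinel


def sentinel_pins_wal() -> tuple[bool, bool]:
    """Return (wal exists after worker close, wal exists after sentinel close)."""
    with tempfile.TemporaryDirectory() as tmp:
        db = os.path.join(tmp, "kanban.db")  # throwaway file, not a real board
        sentinel = open_sentinel(db)

        # Stand-in for a short-lived writer such as a verifier.
        worker = sqlite3.connect(db)
        worker.execute("CREATE TABLE t (x INTEGER)")
        worker.execute("INSERT INTO t VALUES (1)")
        worker.commit()
        worker.close()  # not the last connection, so no cleanup is expected
        pinned = os.path.exists(db + "-wal")

        sentinel.close()  # now the last connection closes
        released = os.path.exists(db + "-wal")
        return pinned, released


if __name__ == "__main__":
    print(sentinel_pins_wal())
```

The design cost is one extra long-lived connection per board in the gateway, plus making sure it is opened before any verifier can run and is never closed by the existing `try/finally: close()` paths.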

## Validation (measured)

After applying the patch + gateway restart:
- Existing `-wal` file auto-removed on first connect (PRAGMA migration worked)
- 3-task stress test: all dispatched in one tick, completed cleanly, zero I/O errors
- lsof shows 0 `-wal`/`-shm` FDs on the gateway across multiple verifier-subprocess cycles
- 0 dispatcher tick failures across the validation window
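
The first and third checks can be scripted. This is a rough sketch under the assumption that the caller passes the path to a board's `kanban.db`; `check_delete_mode` is an illustrative name, not an existing Hermes helper. It covers sidecar files on disk, not FDs held by the gateway.

```python
import os
import sqlite3


def check_delete_mode(db_path: str) -> dict:
    """Report the journal mode and whether WAL sidecar files exist on disk."""
    conn = sqlite3.connect(db_path)
    try:
        # Querying the pragma without a value reports the current mode.
        mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
    finally:
        conn.close()
    return {
        "journal_mode": mode,
        "wal_present": os.path.exists(db_path + "-wal"),
        "shm_present": os.path.exists(db_path + "-shm"),
    }
```

After the patch we would expect `journal_mode` to be `delete` and both `*_present` flags to be `False`.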

Happy to open a PR with the change shown above if the DELETE-mode direction is what you want — or to take a different cut at it if you'd prefer one of the surgical alternatives.

