[Bug]: Kanban DB intermittent corruption after worker crash — missing WAL checkpoint

## Bug Description

The Gateway Kanban dispatcher intermittently reports `kanban.db is not a valid SQLite database` and disables dispatch for the board. The DB recovers on its own after some time (about 18 minutes in the log below), but while dispatch is disabled, ready tasks sit unprocessed.

**Gateway log pattern (every few minutes):**
```
16:31:34 kanban dispatcher: spawned=5    ← dispatch succeeds
16:33:35 kanban.db is not a valid SQLite database; disabling dispatch
16:51:37 kanban dispatcher: spawned=1    ← DB recovers, dispatch resumes  
16:53:38 kanban.db is not a valid SQLite database; disabling dispatch  ← recurs
```

The corruption always appears AFTER worker subprocesses complete (spawned=N → next tick: corrupted).

## Root Cause

Three contributing factors in `hermes_cli/kanban_db.py` + `gateway/run.py`:

**1. No explicit WAL checkpoint management.** `kanban_db.py` contains no `PRAGMA wal_checkpoint` calls. In contrast, `hermes_state.py` (SessionDB) manages the WAL explicitly, calling `_try_wal_checkpoint()` every 50 writes and again in `close()`. Suspected sequence: a worker crashes without closing its connection → WAL frames are left partially written → the next `connect()` reads an inconsistent WAL → `sqlite3.DatabaseError`.

**2. `synchronous=NORMAL`** in `kanban_db.py` `connect()`. In WAL mode with NORMAL, SQLite does not fsync the WAL on each commit; it syncs only at checkpoint time. If a worker process crashes after writing WAL frames but before a checkpoint runs, the WAL can be left with partially written frames.

**3. Fingerprint only tracks the `.db` file, not `-wal`** (`gateway/run.py` `_board_db_fingerprint()`). If only the `-wal` file is corrupted while the `.db` mtime and size are unchanged → fingerprint unchanged → the board stays disabled until the `.db` file changes or the gateway restarts.
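The fingerprint gap in item 3 can be illustrated with a minimal sketch. This is not the actual `_board_db_fingerprint()` from `gateway/run.py`; the function name `board_db_fingerprint`, the helper `_stat_pair`, and the return shape are assumptions made for illustration. It shows one way a fingerprint could cover both the main file and the `-wal` sidecar, so that a change to either one is noticed.

```python
from pathlib import Path
from typing import Optional, Tuple

# (mtime_ns, size) for a file, or None when the file does not exist.
Stat = Optional[Tuple[int, int]]


def _stat_pair(path: Path) -> Stat:
    """Return (mtime_ns, size) for path, or None if it is absent."""
    try:
        st = path.stat()
    except FileNotFoundError:
        return None
    return (st.st_mtime_ns, st.st_size)


def board_db_fingerprint(db_path: str) -> Tuple[Stat, Stat]:
    """Hypothetical fingerprint covering the .db file and its -wal sidecar.

    SQLite names the WAL file by appending "-wal" to the database path.
    """
    db = Path(db_path)
    wal = Path(db_path + "-wal")
    return (_stat_pair(db), _stat_pair(wal))
```

With this shape, a WAL that is rewritten, truncated, or deleted changes the second element even when the `.db` mtime and size stay the same, which gives the dispatcher a reason to retry the board.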

## Steps to Reproduce

1. Run Hermes Gateway with Kanban dispatch enabled
2. Create Kanban tasks that cause worker protocol violations (worker exits without `kanban_complete()`)
3. Gateway spawns workers → some crash
4. Next dispatcher tick: `connect()` fails with `sqlite3.DatabaseError` ("database disk image is malformed"), and the gateway logs `kanban.db is not a valid SQLite database`
5. Board disabled until `.db` file mtime changes or gateway restarts

## Expected Behavior

Worker crashes should not leave the Kanban DB unreadable. The WAL should be checkpointed so that partially written frames cannot block the dispatcher.

## Actual Behavior

After worker crashes, the Gateway sees the DB as corrupted and disables dispatch. It cannot recover until the `.db` file is externally modified or the gateway restarts.

## Proposed Fix

1. **Add a WAL checkpoint on connection close.** In `gateway/run.py`, run `conn.execute("PRAGMA wal_checkpoint(PASSIVE)")` before `conn.close()` (mirrors `SessionDB.close()` at `hermes_state.py:458`).

2. **Include the `-wal` file in the fingerprint.** Track `(wal_mtime_ns, wal_size)` so the dispatcher auto-recovers when only the WAL is corrupted.

3. **Consider `synchronous=FULL`.** In WAL mode this syncs the WAL on every commit, narrowing the window in which a crash can leave unsynced frames (trade-off: slightly slower writes).
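A rough sketch of fixes 1 and 3, using only the standard-library `sqlite3` module. The names `connect_kanban` and `close_with_checkpoint` are hypothetical and do not reflect the real `connect()` in `kanban_db.py`; the sketch only assumes the DB is opened in WAL mode, as the report implies.

```python
import sqlite3

_ALLOWED_SYNC = {"OFF", "NORMAL", "FULL", "EXTRA"}


def connect_kanban(db_path: str, synchronous: str = "NORMAL") -> sqlite3.Connection:
    """Hypothetical connect helper: WAL mode plus a configurable sync level."""
    level = synchronous.upper()
    if level not in _ALLOWED_SYNC:
        raise ValueError(f"unsupported synchronous level: {synchronous!r}")
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")
    # Safe to interpolate: level was validated against a fixed allow-list.
    conn.execute(f"PRAGMA synchronous={level}")
    return conn


def close_with_checkpoint(conn: sqlite3.Connection) -> None:
    """Attempt a PASSIVE checkpoint, then always close the connection."""
    try:
        # PASSIVE copies as many WAL frames into the main DB as it can
        # without waiting on other readers or writers.
        conn.execute("PRAGMA wal_checkpoint(PASSIVE)").fetchone()
    except sqlite3.Error:
        # Best effort only: a failed checkpoint must not block the close.
        pass
    finally:
        conn.close()
```

A caller would commit its work, then call `close_with_checkpoint(conn)` in place of a bare `conn.close()`. Note that this only helps on clean exits; a worker that is killed never reaches the close path, which is why the fingerprint change in fix 2 is still needed for recovery.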

## Environment

- Hermes Agent v0.14.0
- macOS 15.7.4, Python 3.11.11, SQLite 3.47.1
