[Bug]: Kanban: rapid worker spawn-crash loop (sub-2s/crash) corrupts board SQLite B-tree before failure_limit trips #30896

## Summary

When a kanban task is dispatched to a profile that doesn't exist on the board (or hits any other deterministic spawn-time crash that returns very quickly), the embedded dispatcher can fire spawn attempts faster than `consecutive_failures` is incremented and committed, so `DEFAULT_FAILURE_LIMIT` does not trip in time. We observed **11 spawn attempts in a handful of seconds** and ended with a corrupted board DB (`PRAGMA integrity_check` reported B-tree errors), recoverable only via `sqlite3 kanban.db .dump | sed 's/ROLLBACK; -- due to errors$/COMMIT;/' | sqlite3 fresh.db`. Reproduced twice in one session on the same board (2026-05-22).

Distinct from existing issues:
- **#30417 Bug 1** (slow variant): that case is slow (60s ticks, ~1500 crashes over a long period) and non-corrupting. Ours is fast (11 crashes in seconds) and DOES corrupt.
- **#30445** (multi-gateway corruption): our case involves neither multiple gateways nor inode reuse; it is single-host, single-gateway.
- **#30687** (corruption recovery): corruption originates HERE; #30687 covers what happens after the DB is already corrupt.
- **#30678** (board override ignored in worker env): the misrouting root cause that fed our case. A worker called `kanban_create(board="<other-board>")` from worker context and the `board` arg was silently dropped. The assigned profile didn't exist on the worker's pinned board, so the dispatcher hit a deterministic crash loop on the misrouted card.

## Steps to reproduce

1. Create a kanban task on board `B` with `assignee=<profile-that-does-not-exist-on-B>` (e.g. the board exists but the profile is bound to a different board).
2. Let the embedded dispatcher tick. A worker process is spawned and immediately exits with a profile-not-found / skill-not-found error.
3. Observe rapid re-spawns: in our case 11 in a handful of seconds, much faster than the 60s tick suggests. The dispatcher appears not to wait a full tick between consecutive crash-retries for the same task.
4. After ~10 rapid cycles, `sqlite3 kanban.db "PRAGMA integrity_check"` reports B-tree corruption.
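The check in step 4 can also be scripted, which is handy when polling the board DB during a repro. The following is a minimal sketch using only Python's standard `sqlite3` module; the helper name `integrity_errors` is hypothetical and not part of Hermes.

```python
import sqlite3


def integrity_errors(db_path: str) -> list[str]:
    """Return integrity_check findings; an empty list means the check passed.

    Hypothetical helper for repro scripts, not a Hermes API.
    """
    # Open read-only via a URI so the check itself cannot modify the board DB.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    except sqlite3.DatabaseError as exc:
        # Badly damaged files can fail before the check produces any rows.
        return [f"integrity_check could not run: {exc}"]
    finally:
        conn.close()
    findings = [row[0] for row in rows]
    # SQLite reports a single row containing "ok" when no problems are found.
    return [] if findings == ["ok"] else findings
```

Calling `integrity_errors("kanban.db")` after each burst of spawns would show roughly when the B-tree errors first appear.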

## Expected behavior

- `consecutive_failures` increments and `failure_limit` (default 2) **hard-stops** spawning *before* corruption can occur.
- DB writes that count crashes are transactional and durably committed, so concurrent rapid spawn attempts can't race the failure counter.
- At a minimum, the dispatcher should rate-limit re-spawns for the same task (exponential backoff) so deterministic crashes can't loop faster than the failure counter can commit.
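To make the first two expectations concrete, here is a rough sketch of a failure counter that increments and blocks inside a single write transaction. The `tasks` table, its columns, and the function names are assumptions made for illustration; the real Hermes schema and dispatcher code may look quite different.

```python
import sqlite3

FAILURE_LIMIT = 2  # the issue cites a default of 2; this constant name is illustrative


def record_spawn_failure(conn: sqlite3.Connection, task_id: str,
                         limit: int = FAILURE_LIMIT) -> bool:
    """Count one spawn crash and block the task if the limit is reached.

    Returns True when the task is now blocked. Assumes `conn` was opened
    with isolation_level=None so the explicit BEGIN/COMMIT below are the
    only transaction control in play.
    """
    # BEGIN IMMEDIATE takes the write lock up front, so two writers cannot
    # interleave their read-modify-write of the counter.
    conn.execute("BEGIN IMMEDIATE")
    try:
        conn.execute(
            "UPDATE tasks SET consecutive_failures = consecutive_failures + 1 "
            "WHERE id = ?",
            (task_id,),
        )
        row = conn.execute(
            "SELECT consecutive_failures FROM tasks WHERE id = ?", (task_id,)
        ).fetchone()
        blocked = row is not None and row[0] >= limit
        if blocked:
            conn.execute(
                "UPDATE tasks SET status = 'blocked' WHERE id = ?", (task_id,)
            )
        conn.execute("COMMIT")
        return blocked
    except Exception:
        conn.execute("ROLLBACK")
        raise


def may_spawn(conn: sqlite3.Connection, task_id: str) -> bool:
    """Gate the spawn path: unknown or blocked tasks must not be spawned."""
    row = conn.execute(
        "SELECT status FROM tasks WHERE id = ?", (task_id,)
    ).fetchone()
    return row is not None and row[0] != "blocked"
```

The intent is that the dispatcher calls `may_spawn` before every spawn attempt and `record_spawn_failure` after every crash, so the second crash is committed as a block before a third spawn can be considered.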

## Actual behavior

- ~11 spawns complete before any block is registered. The `task_runs` table + dispatcher state-writes appear to be racing with each other (and/or with the worker's own DB handles in WAL mode), producing B-tree corruption.
- Recovery requires offline `.dump`/`.read` rebuild; the in-tree recovery path (#30687) silently recreates an empty DB on top of the corruption.

## Suggested fix

1. **Enforce `failure_limit` in the spawn path** with a transactional `consecutive_failures += 1; if >= limit: block; commit` BEFORE the next spawn attempt is allowed. This should close the race between spawning and counting.
2. **Per-task exponential backoff** between spawn-crash retries (e.g. 1s → 4s → 16s). Even if (1) misses an edge case, this caps the loop frequency well below the rate at which we saw corruption.
3. **Defense in depth**: validate `assignee` against `kanban_known_profiles_for_board` BEFORE the first spawn, returning `auto_blocked` with a clear `last_failure_error` like *"profile X is not registered for board Y; check dispatch routing"*. (Fixing #30678 also closes most of the inputs to this path, but defense-in-depth here is cheap.)
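Fixes (2) and (3) are small enough to sketch. The snippet below is illustrative only: the function names, the 300s cap, and the idea of passing the known profiles in as a set are assumptions, since the actual signature of `kanban_known_profiles_for_board` is not shown in this issue.

```python
from typing import Optional


def backoff_delay(consecutive_failures: int, base: float = 1.0,
                  factor: float = 4.0, cap: float = 300.0) -> float:
    """Seconds to wait before re-spawning a task that has crashed N times.

    With the defaults this yields 1s, 4s, 16s, ... up to `cap`
    (the cap value is an assumption, not taken from the issue).
    """
    if consecutive_failures <= 0:
        return 0.0
    return min(cap, base * factor ** (consecutive_failures - 1))


def precheck_assignee(assignee: str, board: str,
                      known_profiles: set[str]) -> Optional[str]:
    """Return a last_failure_error-style message if the assignee is unknown.

    `known_profiles` stands in for whatever the board's profile lookup
    returns; None means the task may proceed to its first spawn.
    """
    if assignee in known_profiles:
        return None
    return (f"profile {assignee} is not registered for board {board}; "
            "check dispatch routing")
```

A dispatcher could call `precheck_assignee` once before the first spawn and mark the task `auto_blocked` when it returns a message, then use `backoff_delay` to schedule any retry that does get through.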

## Update — partial mitigation submitted

PR #30973 adds `PRAGMA synchronous=FULL` + `PRAGMA wal_autocheckpoint=100` to `connect()`. This addresses the durability gap that lets the WAL race in the first place. It doesn't fix the root cause (the failure limit is not enforced in the spawn path), but it raises the bar enough that we observed no further corruption under the same workload that produced 5 corruptions in succession on the unpatched build. The three suggested fixes above are still the proper structural fix.
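For readers unfamiliar with these pragmas, the following sketch shows how they might be applied when opening a connection. It is not the code from PR #30973; the function name `connect_board_db` is hypothetical, and enabling WAL here is an assumption based on the WAL mode mentioned under "Actual behavior".

```python
import sqlite3


def connect_board_db(db_path: str) -> sqlite3.Connection:
    """Open a board DB with the durability settings described above.

    Hypothetical stand-in for the patched connect(); only the two PRAGMAs
    named in the PR description are taken from the issue.
    """
    conn = sqlite3.connect(db_path)
    # Assumption: the board DB runs in WAL mode, as the issue implies.
    conn.execute("PRAGMA journal_mode=WAL")
    # FULL syncs the WAL on every commit rather than only at checkpoints.
    conn.execute("PRAGMA synchronous=FULL")
    # Checkpoint after 100 WAL pages instead of SQLite's default of 1000.
    conn.execute("PRAGMA wal_autocheckpoint=100")
    return conn
```

Both settings are per-connection, so every process that opens the board DB (dispatcher and workers alike) would need to go through the same helper for them to apply consistently.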

## Environment

- **OS:** Amazon Linux 2023 (AArch64)
- **Python:** 3.11.15
- **Hermes version:** v0.14.0 (2026.5.16); commit `ba9964ff0` on 2026-05-21
- **Affected component:** comp/gateway (kanban dispatcher; embedded gateway-dispatcher path)

## Related issues

- #30678 (misrouting root cause)
- #30417 (slow variant, no corruption)
- #30445 (multi-gateway corruption)
- #30687 (corruption recovery)
- #29320 (circuit-breaker for repeated bails — different signal)
