[Bug]: Kanban: rapid worker spawn-crash loop (sub-2s/crash) corrupts board SQLite B-tree before failure_limit trips #30896

@julio-cloudvisor

Summary

When a kanban task is dispatched to a profile that doesn't exist on the board (or hits any other deterministic spawn-time crash that returns very quickly), the embedded dispatcher can fire spawn attempts faster than consecutive_failures is incremented and committed, so the failure limit (DEFAULT_FAILURE_LIMIT) does not trip in time. We observed 11 spawn attempts in a handful of seconds and ended up with a corrupted board DB: PRAGMA integrity_check reported B-tree errors, and the DB was recoverable only via sqlite3 .dump | sed 's/ROLLBACK; -- due to errors$/COMMIT;/' | sqlite3 fresh.db. Reproduced twice in one session on the same board (2026-05-22).

Distinct from existing issues:

Steps to reproduce

  1. Create a kanban task on board B with assignee=<profile-that-does-not-exist-on-B> (e.g. board exists but profile is bound to a different board).
  2. Let the embedded dispatcher tick. The worker process is spawned and immediately exits with profile-not-found / skill-not-found.
  3. Observe rapid re-spawns (in our case 11 in a handful of seconds, much faster than the 60s tick suggests; the dispatcher appears not to wait a full tick between consecutive crash-retries for the same task).
  4. After ~10 rapid cycles, sqlite3 kanban.db "PRAGMA integrity_check" reports B-tree corruption.
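The check in step 4 can also be scripted, which is handy when re-running the reproduction several times. This is a minimal sketch using only the standard-library sqlite3 module; the helper name integrity_errors is hypothetical and not part of Hermes.

```python
import sqlite3


def integrity_errors(db_path: str) -> list[str]:
    """Return the messages from PRAGMA integrity_check, or [] if the DB is healthy.

    Hypothetical helper for this reproduction; not part of Hermes.
    """
    conn = sqlite3.connect(db_path)
    try:
        # A healthy database yields a single row containing the string "ok".
        rows = conn.execute("PRAGMA integrity_check").fetchall()
    finally:
        conn.close()
    messages = [row[0] for row in rows]
    return [] if messages == ["ok"] else messages
```

On a badly damaged file, sqlite3 may raise sqlite3.DatabaseError instead of returning rows, so a caller would probably want to treat that exception as a failed check too.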

Expected behavior

  • consecutive_failures increments and failure_limit (default 2) hard-stops spawning before corruption can occur.
  • DB writes that count crashes are transactional with strong WAL fences so concurrent rapid spawn attempts can't race the failure counter.
  • At minimum, the dispatcher should rate-limit re-spawns for the same task (exponential backoff) so deterministic crashes can't loop faster than the failure counter can commit.
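To make the first two expectations concrete, here is a rough sketch of a failure counter that is incremented, checked against the limit, and committed in a single transaction. The tasks table, its id and status columns, and the function name are assumptions made for illustration; only consecutive_failures, last_failure_error, and the default limit of 2 come from this report. It expects a connection opened with isolation_level=None so that the explicit BEGIN/COMMIT statements are the only transaction control.

```python
import sqlite3

FAILURE_LIMIT = 2  # mirrors the default failure_limit quoted in this issue


def record_spawn_failure(conn: sqlite3.Connection, task_id: int, error: str,
                         limit: int = FAILURE_LIMIT) -> bool:
    """Count one spawn crash and block the task once the limit is reached.

    Returns True if the task may be retried, False if it is now blocked.
    Schema and names are hypothetical; this is not the Hermes implementation.
    """
    # BEGIN IMMEDIATE takes the write lock up front, so two dispatch paths
    # cannot both read the old counter value and both decide to respawn.
    conn.execute("BEGIN IMMEDIATE")
    try:
        conn.execute(
            "UPDATE tasks SET consecutive_failures = consecutive_failures + 1, "
            "last_failure_error = ? WHERE id = ?",
            (error, task_id),
        )
        (failures,) = conn.execute(
            "SELECT consecutive_failures FROM tasks WHERE id = ?", (task_id,)
        ).fetchone()
        if failures >= limit:
            conn.execute("UPDATE tasks SET status = 'blocked' WHERE id = ?", (task_id,))
        # The commit happens before the caller learns whether it may respawn.
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return failures < limit
```

The point of the design is ordering: the dispatcher would only attempt another spawn after this function has returned True, which by construction is after the incremented counter is committed.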

Actual behavior

Suggested fix

  1. Enforce failure_limit in the spawn path with a transactional consecutive_failures += 1; if >= limit: block; commit BEFORE the next spawn attempt is allowed. This prevents the race entirely.
  2. Per-task exponential backoff between spawn-crash retries (e.g. 1s → 4s → 16s). Even if (1) misses an edge case, this caps the loop frequency below corruption threshold.
  3. Defense in depth: validate assignee against kanban_known_profiles_for_board BEFORE the first spawn, returning auto_blocked with a clear last_failure_error like "profile X is not registered for board Y; check dispatch routing". (Fixing #30678, "kanban board override is ignored in worker env because HERMES_KANBAN_DB outranks explicit board", also closes most of the inputs to this path, but defense in depth here is cheap.)

Update — partial mitigation submitted

PR #30973 adds PRAGMA synchronous=FULL + PRAGMA wal_autocheckpoint=100 to connect(). This addresses the durability gap that allows the WAL race in the first place. It doesn't fix the root cause (failure_limit is not enforced in the spawn path), but it raises the bar enough that we observed no further corruption under the same workload that produced 5 corruptions in succession on the unpatched build. The 3 suggested fixes above are still the proper structural fix.
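For readers who want to try the same settings on a scratch database, a minimal sketch of a connection helper that applies the two pragmas follows. It is not the code from PR #30973; the function name is hypothetical, and enabling WAL mode here is an assumption based on the WAL behaviour described above.

```python
import sqlite3


def connect_board_db(db_path: str) -> sqlite3.Connection:
    """Open a SQLite DB with the durability pragmas named in the update.

    Hypothetical stand-in for the project's connect(); not the PR's code.
    """
    conn = sqlite3.connect(db_path)
    # Assumption: the board DB runs in WAL mode, as the report implies.
    conn.execute("PRAGMA journal_mode=WAL")
    # FULL makes SQLite sync the WAL on every commit instead of only at checkpoints.
    conn.execute("PRAGMA synchronous=FULL")
    # Checkpoint after roughly 100 WAL pages rather than the default 1000.
    conn.execute("PRAGMA wal_autocheckpoint=100")
    return conn
```

Both settings are per-connection, so every process that opens the board DB (dispatcher and workers) would need to go through the same helper for the mitigation to apply consistently.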

Environment

  • OS: Amazon Linux 2023 (AArch64)
  • Python: 3.11.15
  • Hermes version: v0.14.0 (2026.5.16); commit ba9964ff0 on 2026-05-21
  • Affected component: comp/gateway (kanban dispatcher; embedded gateway-dispatcher path)

Related issues

Metadata

Labels: P2 (Medium: degraded but workaround exists), comp/cron (Cron scheduler and job management), type/bug (Something isn't working)
