Summary
When a kanban task is dispatched to a profile that doesn't exist on the board (or hits any other deterministic spawn-time crash that returns very quickly), the embedded dispatcher can fire spawn attempts faster than the consecutive_failures / DEFAULT_FAILURE_LIMIT accounting can increment and commit. We observed 11 spawn attempts within a few seconds and ended up with a corrupted board DB (PRAGMA integrity_check reported B-tree errors), recoverable only via sqlite3 .dump | sed 's/ROLLBACK; -- due to errors$/COMMIT;/' | sqlite3 fresh.db. Reproduced twice in one session on the same board (2026-05-22).
Distinct from existing issues:
#30678 (kanban board override is ignored in worker env because HERMES_KANBAN_DB outranks explicit board): the misrouting root cause that fed our case. A worker called kanban_create(board="<other-board>") from worker context, the board arg was silently dropped, and the assigned profile didn't exist on the worker's pinned board, so the dispatcher hit a deterministic crash loop on the misrouted card.
Steps to reproduce
Create a kanban task on board B with assignee=<profile-that-does-not-exist-on-B> (e.g. board exists but profile is bound to a different board).
Let the embedded dispatcher tick. The worker process is spawned and immediately exits with profile-not-found / skill-not-found.
Observe rapid re-spawns (in our case 11 within a few seconds, much faster than the 60s tick suggests; the dispatcher does not appear to wait a full tick between consecutive crash-retries for the same task).
sqlite3 kanban.db "PRAGMA integrity_check" reports B-tree corruption.
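The integrity check used during reproduction can also be scripted. The following is a minimal sketch using only the standard-library sqlite3 module; the helper name integrity_errors is hypothetical and not part of Hermes.

```python
import sqlite3


def integrity_errors(db_path: str) -> list[str]:
    """Return the problems reported by PRAGMA integrity_check.

    An empty list means SQLite answered with the single row "ok".
    (Hypothetical helper, not a Hermes API.)
    """
    conn = sqlite3.connect(db_path)
    try:
        rows = [row[0] for row in conn.execute("PRAGMA integrity_check")]
    finally:
        conn.close()
    return [] if rows == ["ok"] else rows
```

On a badly damaged file the PRAGMA itself may raise sqlite3.DatabaseError instead of returning rows, so a caller would probably want to treat that exception as a failed check as well.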
Expected behavior
consecutive_failures increments and failure_limit (default 2) hard-stops spawning before corruption can occur.
DB writes that count crashes are transactional, with strong WAL fences, so rapid concurrent spawn attempts can't race the failure counter.
At worst, the dispatcher should rate-limit re-spawns of the same task (exponential backoff) so that deterministic crashes can't loop faster than the failure counter can commit.
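To make the expected counter behavior concrete, here is a rough sketch of a failure write that increments, checks the limit, and blocks the task inside one transaction. The tasks table, its status column, and the 'blocked' value are assumptions for illustration; only consecutive_failures, last_failure_error, and the default limit of 2 come from this report. The connection is assumed to be opened with isolation_level=None so that the explicit BEGIN/COMMIT statements are the only transaction control.

```python
import sqlite3

FAILURE_LIMIT = 2  # the report states that failure_limit defaults to 2


def record_spawn_failure(conn: sqlite3.Connection, task_id: str,
                         error: str, limit: int = FAILURE_LIMIT) -> bool:
    """Count one spawn crash; return True if the task is now blocked.

    Assumes a hypothetical schema:
      tasks(id, status, consecutive_failures, last_failure_error)
    and a connection opened with isolation_level=None.
    """
    # BEGIN IMMEDIATE takes the write lock up front, so two dispatcher
    # paths cannot both read the old counter value.
    conn.execute("BEGIN IMMEDIATE")
    try:
        conn.execute(
            "UPDATE tasks SET consecutive_failures = consecutive_failures + 1, "
            "last_failure_error = ? WHERE id = ?",
            (error, task_id),
        )
        row = conn.execute(
            "SELECT consecutive_failures FROM tasks WHERE id = ?", (task_id,)
        ).fetchone()
        if row is None:
            raise KeyError(task_id)
        blocked = row[0] >= limit
        if blocked:
            conn.execute(
                "UPDATE tasks SET status = 'blocked' WHERE id = ?", (task_id,)
            )
        conn.execute("COMMIT")
    except BaseException:
        conn.execute("ROLLBACK")
        raise
    return blocked
```

The idea is that the dispatcher would call this and wait for it to return before considering another spawn for the same task, so a blocked task is never re-spawned.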
Actual behavior
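~11 spawns complete before any block is registered. The task_runs table and the dispatcher's state writes appear to race with each other (and/or with the worker's own DB handles in WAL mode), producing B-tree corruption.
Recovery required a manual .dump/.read rebuild; the in-tree recovery path (Kanban: corrupted board DB + empty top-level DB → silent recreation with total data loss #30687) silently recreates an empty DB on top of the corruption.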
Suggested fix
1. Enforce failure_limit in the spawn path with a transactional consecutive_failures += 1; if >= limit: block; commit BEFORE the next spawn attempt is allowed. This prevents the race entirely.
2. Add per-task exponential backoff between spawn-crash retries (e.g. 1s → 4s → 16s). Even if (1) misses an edge case, this caps the loop frequency below the corruption threshold.
3. Defense in depth: validate assignee against kanban_known_profiles_for_board BEFORE the first spawn, returning auto_blocked with a clear last_failure_error like "profile X is not registered for board Y; check dispatch routing". (Fixing #30678 also closes most of the inputs to this path, but defense in depth here is cheap.)
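The backoff and pre-spawn validation ideas could be combined into a single gate that the dispatcher consults before each spawn. This is only a sketch: pre_spawn_gate, SpawnDecision, and the known_profiles argument (standing in for whatever kanban_known_profiles_for_board returns) are hypothetical names, and the 1s, 4s, 16s schedule is the example given in this report rather than a tuned value.

```python
from dataclasses import dataclass
from typing import Optional

BACKOFF_BASE_S = 1.0   # first retry waits 1s
BACKOFF_FACTOR = 4.0   # each further crash multiplies the wait by 4


def respawn_delay(consecutive_failures: int) -> float:
    """Seconds to wait after the Nth consecutive crash (1s, 4s, 16s, ...)."""
    if consecutive_failures <= 0:
        return 0.0
    return BACKOFF_BASE_S * BACKOFF_FACTOR ** (consecutive_failures - 1)


@dataclass
class SpawnDecision:
    allowed: bool
    reason: str = ""


def pre_spawn_gate(assignee: str, board: str, known_profiles: set[str],
                   consecutive_failures: int,
                   last_failure_at: Optional[float],
                   now: float) -> SpawnDecision:
    """Decide whether a spawn may proceed (hypothetical helper)."""
    # Validation: an unknown profile can never succeed, so refuse before
    # spawning instead of discovering it through repeated crashes.
    if assignee not in known_profiles:
        return SpawnDecision(
            False,
            f"profile {assignee} is not registered for board {board}; "
            "check dispatch routing",
        )
    # Backoff: refuse while the per-task delay has not yet elapsed.
    if last_failure_at is not None:
        remaining = respawn_delay(consecutive_failures) - (now - last_failure_at)
        if remaining > 0:
            return SpawnDecision(False, f"backing off for {remaining:.1f}s")
    return SpawnDecision(True)
```

A refused decision caused by the validation branch would map to auto_blocked with the reason stored as last_failure_error; a refusal from the backoff branch would simply skip the task until a later tick.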
Update — partial mitigation submitted
PR #30973 adds PRAGMA synchronous=FULL and PRAGMA wal_autocheckpoint=100 to connect(). This addresses the durability gap that allows the WAL race in the first place. It doesn't fix the root cause (failure_limit is not enforced in the spawn path), but it raises the bar enough that we observed no further corruption under the same workload that produced 5 corruptions in succession on the unpatched build. The 3 suggested fixes above are still the proper structural fix.
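For readers unfamiliar with these settings, a connection helper applying them might look roughly like the sketch below. It is not the diff from PR #30973; the function body is an illustration built on the standard-library sqlite3 module, and setting journal_mode=WAL here is an assumption based on the report's statement that the board DB runs in WAL mode.

```python
import sqlite3


def connect(db_path: str) -> sqlite3.Connection:
    """Open the board DB with the durability settings named in the report.

    Illustrative sketch only, not the actual Hermes connect().
    """
    conn = sqlite3.connect(db_path)
    # Assumption: WAL is enabled at connect time.
    conn.execute("PRAGMA journal_mode=WAL")
    # FULL makes SQLite sync the WAL on every commit instead of only at
    # checkpoints, which is what NORMAL does in WAL mode.
    conn.execute("PRAGMA synchronous=FULL")
    # Checkpoint once the WAL reaches 100 pages (SQLite's default is 1000),
    # keeping the WAL short.
    conn.execute("PRAGMA wal_autocheckpoint=100")
    return conn
```

Both synchronous and wal_autocheckpoint are per-connection settings, so every process that opens the DB, including workers, would need to go through the same helper for the mitigation to apply uniformly.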
Environment
OS: Amazon Linux 2023 (AArch64)
Python: 3.11.15
Hermes version: v0.14.0 (2026.5.16); commit ba9964ff0 on 2026-05-21
Related issues