Summary
Running v0.14.0 (commit ba9964ff0, 2026-05-21) on macOS, single-host install with a multi-profile kanban setup. A transient outbound-network blip on the host triggered a cascade that surfaced three independent dispatcher/runtime resilience gaps. None caused data loss (DB integrity stayed ok throughout), but each turns a recoverable hiccup into a state that needs manual intervention. Filing together because they share a theme: the dispatcher/runtime doesn't degrade gracefully from transient or deterministic failures.
Environment: Hermes Agent v0.14.0 / commit ba9964ff0, Python 3.11, macOS, gateway as launchd service with embedded 60s-tick dispatcher, multiple profiles (each with its own skills), SQLite kanban board in WAL mode.
Bug 1 — Deterministic spawn-time crash loops forever (circuit breaker doesn't trip on spawn failure)
What happened: A task was pinned to a skill (skills column / --skill) that its assigned profile did not have. Every dispatch tick spawned a worker that died immediately at startup with:
Error: Unknown skill(s): atlas-email-intake-poller
The dispatcher re-spawned it every 60s tick. It accumulated ~1,500 crashed runs in task_runs before being found and manually archived.
Why it matters: This is a deterministic failure — the worker fails identically on every spawn, before doing any work. Retrying can never succeed. The consecutive_failures / DEFAULT_FAILURE_LIMIT circuit breaker (and the auto_blocked path in the dispatcher) exists, but it did not stop this loop — dispatcher output showed spawned=1 crashed=1 auto_blocked=1 promoted=1 repeating indefinitely, i.e. the task was being auto-blocked yet re-promoted and re-spawned each tick.
Expected: After N consecutive spawn-time crashes (e.g. the existing failure limit), the task should be set to blocked (with last_failure_error populated) and stop being re-spawned — not cycle block→promote→spawn→crash forever. A deterministic spawn-time crash (skill missing, bad config) should trip the breaker at least as readily as a runtime timeout does.
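A minimal sketch of the expected latching behaviour, not the actual Hermes dispatcher code. The field names consecutive_failures, last_failure_error and the blocked status come from the report above; the Task class, the auto_blocked flag, the limit value of 3, and the helper functions are assumptions made for illustration. The simulation ignores parent dependencies and only shows that a breaker-blocked task must be excluded from the promotion pass.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_FAILURE_LIMIT = 3  # assumed value; the real limit is whatever Hermes configures


@dataclass
class Task:
    status: str = "ready"
    consecutive_failures: int = 0
    last_failure_error: Optional[str] = None
    auto_blocked: bool = False  # hypothetical flag: blocked by the breaker, not by a dependency


def record_spawn_crash(task: Task, error: str, limit: int = DEFAULT_FAILURE_LIMIT) -> None:
    """Count a spawn-time crash like any other failure and latch the task at the limit."""
    task.consecutive_failures += 1
    task.last_failure_error = error
    if task.consecutive_failures >= limit:
        task.status = "blocked"
        task.auto_blocked = True


def may_promote(task: Task) -> bool:
    """The promotion pass skips tasks that the breaker latched off."""
    return task.status == "blocked" and not task.auto_blocked


def tick(task: Task) -> str:
    """One simulated dispatcher tick for a task whose worker always dies at startup."""
    if may_promote(task):
        task.status = "ready"
    if task.status != "ready":
        return "skipped"
    record_spawn_crash(task, "Unknown skill(s): atlas-email-intake-poller")
    return "crashed"


task = Task()
results = [tick(task) for _ in range(6)]
print(results)
```

With this shape the task crashes up to the limit and is then skipped on every later tick, instead of cycling block→promote→spawn→crash.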
Repro:
- Create a skill that exists only in profile A.
- Create a kanban task assigned to profile B, pinned to that skill (--skill <name>).
- Start the gateway; watch the dispatcher spawn+crash the task every tick without ever latching it off.
Bug 2 — A single transient SQLite I/O error permanently disables board dispatch until manual gateway restart
What happened: During the network blip (heavy concurrent access, ~4MB -wal mid-write), a momentary I/O error occurred:
sqlite3.OperationalError: disk I/O error
ERROR kanban dispatcher: board default database /…/kanban.db is not a valid SQLite
database; disabling dispatch for this board until the file changes or the
gateway restarts.
The DB was not corrupt — PRAGMA integrity_check returned ok, header was valid, all tables readable, disk had 379GB free, and a separate read-only handle kept serving fine. But the dispatcher latched the "disabled" state and stayed disabled even after the file was healthy again, requiring a manual hermes gateway restart to resume.
(Source: gateway/run.py:5182 — "disabling dispatch for this board until the file changes or the gateway restarts.")
Why it matters: Permanently disabling a board on a single transient I/O error means a momentary disk/FS hiccup silently halts all automation until a human notices and restarts. Disabling on genuine corruption is reasonable; latching forever on a transient error is not.
Expected: On the next tick after a disable, cheaply re-validate (e.g. PRAGMA quick_check or a trivial read) and auto-re-enable if it passes — only stay disabled if the DB is actually persistently bad. At minimum, retry-with-backoff rather than latch-until-restart.
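One way the re-validation could look, sketched with the standard sqlite3 module. PRAGMA quick_check is a real SQLite pragma; the BoardGate class, its method names, and the tick-based backoff are hypothetical and are not taken from gateway/run.py.

```python
import os
import sqlite3
import tempfile


def db_is_healthy(path: str) -> bool:
    """Return True if a fresh connection passes PRAGMA quick_check."""
    try:
        conn = sqlite3.connect(path, timeout=1.0)
        try:
            row = conn.execute("PRAGMA quick_check").fetchone()
        finally:
            conn.close()
    except sqlite3.Error:
        return False
    return row is not None and row[0] == "ok"


class BoardGate:
    """Hypothetical per-board latch that re-probes the DB instead of staying disabled."""

    def __init__(self, path: str, max_backoff_ticks: int = 8):
        self.path = path
        self.disabled = False
        self.backoff = 1  # ticks to wait before the next probe
        self.wait = 0
        self.max_backoff_ticks = max_backoff_ticks

    def report_error(self) -> None:
        """Called when a tick hits an error such as 'disk I/O error'."""
        self.disabled = True
        self.wait = self.backoff

    def should_dispatch(self) -> bool:
        """Call once per tick; probes the DB once the backoff window has elapsed."""
        if not self.disabled:
            return True
        self.wait -= 1
        if self.wait > 0:
            return False
        if db_is_healthy(self.path):
            self.disabled = False
            self.backoff = 1
            return True
        # Still bad: double the wait, capped, and stay disabled.
        self.backoff = min(self.backoff * 2, self.max_backoff_ticks)
        self.wait = self.backoff
        return False


# Demo: a healthy DB recovers on the first tick after a simulated transient error.
tmp_dir = tempfile.mkdtemp()
db_path = os.path.join(tmp_dir, "kanban.db")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, status TEXT)")
conn.commit()
conn.close()

gate = BoardGate(db_path)
gate.report_error()
recovered = gate.should_dispatch()
```

A board whose file keeps failing the probe stays disabled, with probes spaced further apart, so genuine corruption is still handled conservatively.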
Bug 3 — Archiving a parent task silently promotes its children (treats archive as success)
What happened / found in code: Child-task promotion treats an archived parent the same as a done parent:
# hermes_cli/kanban_db.py ~line 2131
if all(p["status"] in ("done", "archived") for p in parents):
    # promote child to ready
Why it matters: Archiving is used to cancel/retire a task (e.g. we archived a crash-looping task whose work was moot). But because archived counts as a satisfied dependency, archiving a parent silently advances its children as if the parent had succeeded. A user cancelling a broken/abandoned parent would not expect its dependents to suddenly become ready and run on the assumption the parent's output exists. This can launch downstream work whose precondition was never actually met.
Expected: Distinguish "parent completed successfully (done)" from "parent was cancelled/retired (archived)" for promotion purposes. An archived parent should likely block its children (or require explicit re-parenting), not auto-promote them. At minimum this should be documented and configurable.
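A possible shape for the distinction, as a sketch only. The done and archived status strings and the parents structure come from the snippet above; the child_next_status function, the archived_satisfies option, and the "waiting" result are hypothetical names, not the existing kanban_db.py API.

```python
from typing import Iterable, Mapping


def child_next_status(
    parents: Iterable[Mapping[str, str]],
    archived_satisfies: bool = False,
) -> str:
    """Decide a child's next status from its parents' statuses.

    archived_satisfies=True reproduces the behaviour reported above, where an
    archived parent counts as a satisfied dependency.
    """
    statuses = [p["status"] for p in parents]
    if not archived_satisfies and "archived" in statuses:
        # A cancelled or retired parent never produced its output: hold the child.
        return "blocked"
    satisfied = {"done", "archived"} if archived_satisfies else {"done"}
    if all(s in satisfied for s in statuses):
        return "ready"
    return "waiting"  # hypothetical label for "some parents are still pending"


parents = [{"status": "done"}, {"status": "archived"}]
print(child_next_status(parents))
print(child_next_status(parents, archived_satisfies=True))
```

Defaulting the option to False makes cancellation safe by default, while boards that rely on the existing promotion behaviour could opt back in.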
Why these are grouped
All three are "transient or deterministic failure → unrecoverable/unexpected state without human intervention":
- Bug 1: deterministic crash → infinite retry (should latch off).
- Bug 2: transient I/O → permanent disable (should auto-recover).
- Bug 3: cancellation (archive) → silent promotion (should not treat cancel as success).
Happy to provide full gateway logs, the task_runs history (~1,500 crashed rows for the looping task), or test against a patch. Thanks for the framework — multi-profile + kanban dispatch is genuinely great to build on.