Dispatcher resilience: deterministic spawn-crash loop, transient SQLite I/O latches dispatch off, archived-parent silently promotes children

## Summary

Running v0.14.0 (commit `ba9964ff0`, 2026-05-21) on macOS: a single-host install with a multi-profile kanban setup. A transient outbound-network blip on the host triggered a cascade that surfaced **three independent dispatcher/runtime resilience gaps**. None caused data loss (DB integrity stayed `ok` throughout), but each turns a recoverable hiccup into a state that needs manual intervention. Filing them together because they share a theme: *the dispatcher/runtime doesn't recover gracefully from transient or deterministic failures.*

Environment: Hermes Agent v0.14.0 / commit `ba9964ff0`, Python 3.11, macOS, gateway as launchd service with embedded 60s-tick dispatcher, multiple profiles (each with its own skills), SQLite kanban board in WAL mode.

---

## Bug 1 — Deterministic spawn-time crash loops forever (circuit breaker doesn't trip on spawn failure)

**What happened:** A task was pinned to a skill (`skills` column / `--skill`) that its **assigned profile did not have**. Every dispatch tick spawned a worker that died immediately at startup with:

```
Error: Unknown skill(s): atlas-email-intake-poller
```

The dispatcher re-spawned it every 60s tick. It accumulated **~1,500 crashed runs** in `task_runs` before being found and manually archived.

**Why it matters:** This is a *deterministic* failure: the worker fails identically on every spawn, before doing any work, so retrying can never succeed. The `consecutive_failures` / `DEFAULT_FAILURE_LIMIT` circuit breaker (and the `auto_blocked` path in the dispatcher) exists, but it did not stop this loop. Dispatcher output showed `spawned=1 crashed=1 auto_blocked=1 promoted=1` repeating indefinitely, i.e. the task was being auto-blocked, then re-promoted and re-spawned on the same tick, every tick.

**Expected:** After N consecutive spawn-time crashes (e.g. the existing failure limit), the task should be set to `blocked` (with `last_failure_error` populated) and **stop being re-spawned** — not cycle block→promote→spawn→crash forever. A deterministic spawn-time crash (skill missing, bad config) should trip the breaker at least as readily as a runtime timeout does.
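A minimal sketch of the expected behavior, not the actual dispatcher code. The `Task` dataclass, `record_spawn_crash`, `may_promote`, and the `FAILURE_LIMIT` value are hypothetical stand-ins; only the field names `consecutive_failures` and `last_failure_error` come from the report above. The idea is that a spawn-time crash counts toward the same limit as any other failure, and that the promotion step refuses to revive a task the breaker has blocked:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical value; stands in for the real DEFAULT_FAILURE_LIMIT.
FAILURE_LIMIT = 3


@dataclass
class Task:
    # Hypothetical in-memory stand-in for a kanban task row.
    status: str = "ready"
    consecutive_failures: int = 0
    last_failure_error: Optional[str] = None


def record_spawn_crash(task: Task, error: str, limit: int = FAILURE_LIMIT) -> None:
    """Count a spawn-time crash toward the breaker, like any other failure."""
    task.consecutive_failures += 1
    task.last_failure_error = error
    if task.consecutive_failures >= limit:
        task.status = "blocked"


def may_promote(task: Task, limit: int = FAILURE_LIMIT) -> bool:
    """Refuse to auto-promote a task that the breaker has blocked."""
    tripped = task.status == "blocked" and task.consecutive_failures >= limit
    return not tripped
```

With this shape, the block→promote→spawn→crash cycle stops after `limit` crashes, and a human (or an explicit unblock command) has to reset `consecutive_failures` before the task runs again.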

**Repro:**
1. Create a skill that exists only in profile A.
2. Create a kanban task assigned to profile B, pinned to that skill (`--skill <name>`).
3. Start the gateway; watch the dispatcher spawn+crash the task every tick without ever latching it off.

---

## Bug 2 — A single transient SQLite I/O error permanently disables board dispatch until manual gateway restart

**What happened:** During the network blip (heavy concurrent access, ~4MB `-wal` mid-write), a momentary I/O error occurred:

```
sqlite3.OperationalError: disk I/O error
ERROR kanban dispatcher: board default database /…/kanban.db is not a valid SQLite
  database; disabling dispatch for this board until the file changes or the
  gateway restarts.
```

The DB was **not** corrupt — `PRAGMA integrity_check` returned `ok`, header was valid, all tables readable, disk had 379GB free, and a separate read-only handle kept serving fine. But the dispatcher latched the "disabled" state and **stayed disabled even after the file was healthy again**, requiring a manual `hermes gateway restart` to resume.

(Source: `gateway/run.py:5182` — "disabling dispatch for this board until the file changes or the gateway restarts.")

**Why it matters:** Permanently disabling a board on a *single transient* I/O error means a momentary disk/FS hiccup silently halts all automation until a human notices and restarts. Disabling on genuine corruption is reasonable; latching forever on a transient error is not.

**Expected:** On the next tick after a disable, cheaply re-validate (e.g. `PRAGMA quick_check` or a trivial read) and **auto-re-enable if it passes** — only stay disabled if the DB is actually persistently bad. At minimum, retry-with-backoff rather than latch-until-restart.
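A rough sketch of the re-validation step, assuming the dispatcher keeps a per-board "disabled" flag. `board_is_healthy` and `should_dispatch` are hypothetical names, not functions from the codebase; the sketch uses only the standard-library `sqlite3` module and a read-only `PRAGMA quick_check`:

```python
import sqlite3
from pathlib import Path


def board_is_healthy(db_path: str, timeout: float = 1.0) -> bool:
    """Return True if a read-only quick_check on the board DB reports 'ok'."""
    # Read-only URI so the probe itself cannot modify the database.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    try:
        conn = sqlite3.connect(uri, uri=True, timeout=timeout)
    except sqlite3.Error:
        return False
    try:
        row = conn.execute("PRAGMA quick_check").fetchone()
        return row is not None and row[0] == "ok"
    except sqlite3.Error:
        # Covers "file is not a database", I/O errors, lock timeouts, etc.
        return False
    finally:
        conn.close()


def should_dispatch(disabled: bool, db_path: str) -> bool:
    """On each tick, re-enable a disabled board once the probe passes."""
    if not disabled:
        return True
    return board_is_healthy(db_path)
```

Calling `should_dispatch` at the top of every tick would turn the latch into a retry: the board stays off only while the probe keeps failing. A backoff between probes could be layered on top if a full `quick_check` per tick is too expensive for large boards.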

---

## Bug 3 — Archiving a parent task silently *promotes* its children (treats archive as success)

**What happened / found in code:** Child-task promotion treats an **archived** parent the same as a `done` parent:

```python
# hermes_cli/kanban_db.py ~line 2131
if all(p["status"] in ("done", "archived") for p in parents):
    # promote child to ready
```

**Why it matters:** Archiving is used to *cancel/retire* a task (e.g. we archived a crash-looping task whose work was moot). But because `archived` counts as a satisfied dependency, **archiving a parent silently advances its children as if the parent had succeeded.** A user cancelling a broken/abandoned parent would not expect its dependents to suddenly become `ready` and run on the assumption the parent's output exists. This can launch downstream work whose precondition was never actually met.

**Expected:** Distinguish "parent completed successfully (`done`)" from "parent was cancelled/retired (`archived`)" for promotion purposes. An archived parent should likely **block** its children (or require explicit re-parenting), not auto-promote them. At minimum this should be documented and configurable.
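One possible shape for the promotion check, as a sketch only. `child_next_status` and the `archived_satisfies` flag are hypothetical; the status strings and the `p["status"]` access mirror the snippet quoted above. By default an archived parent blocks the child, and the current behavior is available as an opt-in:

```python
from typing import Iterable, Mapping, Optional


def child_next_status(
    parents: Iterable[Mapping[str, str]],
    archived_satisfies: bool = False,
) -> Optional[str]:
    """Decide a child's next status from its parents' statuses.

    Returns "ready" when every parent satisfies the dependency, "blocked"
    when a parent was archived and archives do not count as success, and
    None when the child should simply keep waiting.
    """
    statuses = [p["status"] for p in parents]
    satisfied = ("done", "archived") if archived_satisfies else ("done",)
    if all(s in satisfied for s in statuses):
        return "ready"
    if not archived_satisfies and "archived" in statuses:
        return "blocked"
    return None
```

For example, `child_next_status([{"status": "done"}, {"status": "archived"}])` returns `"blocked"`, while passing `archived_satisfies=True` for the same parents returns `"ready"`, matching today's behavior.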

---

## Why these are grouped

All three are "transient or deterministic failure → unrecoverable/unexpected state without human intervention":
- Bug 1: deterministic crash → infinite retry (should latch off).
- Bug 2: transient I/O → permanent disable (should auto-recover).
- Bug 3: cancellation (archive) → silent promotion (should not treat cancel as success).

Happy to provide full gateway logs, the `task_runs` history (~1,500 crashed rows for the looping task), or test against a patch. Thanks for the framework — multi-profile + kanban dispatch is genuinely great to build on.
