kanban dispatcher: 'duplicate column name: consecutive_failures' on first tick after gateway restart

## Summary

`kanban dispatcher` fails with `sqlite3.OperationalError: duplicate column name: consecutive_failures` on the **first tick after every gateway restart**, on a kanban DB that has been migrated by a prior 0.12.x → 0.13 release. Subsequent ticks succeed, so the only visible effect is one error per restart in `errors.log`.

## Version

```
Hermes Agent v0.13.0 (2026.5.7)
Python: 3.11.15 (macOS 15, Apple Silicon)
OpenAI SDK: 2.32.0
```

Local main is at `origin/main` + 3 unrelated local patches (none touch kanban). The DB was created and last-migrated under 0.12.x.

## Symptom

`~/.hermes/logs/errors.log` after gateway restart:

```
2026-05-08 14:21:53,349 ERROR gateway.run: kanban dispatcher: tick failed on board default
Traceback (most recent call last):
  File "/Users/leon/.hermes/hermes-agent/gateway/run.py", line 3931, in _tick_once_for_board
    conn = _kb.connect(board=slug)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/leon/.hermes/hermes-agent/hermes_cli/kanban_db.py", line 928, in connect
    _migrate_add_optional_columns(conn)
  File "/Users/leon/.hermes/hermes-agent/hermes_cli/kanban_db.py", line 996, in _migrate_add_optional_columns
    conn.execute(
sqlite3.OperationalError: duplicate column name: consecutive_failures
```

Only one ERROR is logged per gateway restart; subsequent dispatcher ticks (every 60s after) succeed silently. Gateway core, Telegram, Weixin, and cron are all healthy.

## Database state

`~/.hermes/kanban.db` has 0 tasks. The schema already includes `consecutive_failures`, `last_failure_error`, `max_retries` (added during a prior 0.12.x migration) **plus** the legacy `spawn_failures`, `last_spawn_error` columns:

```
$ sqlite3 ~/.hermes/kanban.db "PRAGMA table_info(tasks);" | tail -10
17|spawn_failures|INTEGER|1|0|0
18|worker_pid|INTEGER|0||0
19|last_spawn_error|TEXT|0||0
...
25|skills|TEXT|0||0
26|consecutive_failures|INTEGER|1|0|0
27|last_failure_error|TEXT|0||0
28|max_retries|INTEGER|0||0
```

Note that all the columns the migration wants to add are already present (cids 26-28).

## Reproduction (does NOT reproduce in isolation)

A direct reproduction from a fresh Python process **succeeds** — the migration's column-existence guard (`if "consecutive_failures" not in cols`) correctly skips the ALTER TABLE:

```python
import sys, os, sqlite3
sys.path.insert(0, '/Users/leon/.hermes/hermes-agent')
os.chdir('/Users/leon/.hermes/hermes-agent')
from hermes_cli.kanban_db import connect

c = connect(board='default')   # succeeds, no error
c.close()
```

But when invoked from the gateway dispatcher's `_tick_once_for_board` (worker thread via `asyncio.to_thread`), the same call **fails**. There appears to be a context-dependent difference in what `PRAGMA table_info(tasks)` returns at the moment `_migrate_add_optional_columns` queries it.

## Speculation on cause

Two possibilities I can think of:

1. **Concurrent connections during gateway startup**: the dispatcher tick races with another path that also opens the kanban DB (e.g., gateway notifier, board init). One connection sees mid-migration state.

2. **Connection-local schema cache**: under WAL mode + `synchronous=NORMAL`, schema visibility across connections may have fence ordering subtleties on first concurrent open.
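To make the first hypothesis concrete, here is a minimal, self-contained sketch of the check-then-`ALTER` hazard. It does not use any Hermes code: the function names and the throwaway database are illustrative assumptions, and the interleaving is forced by hand rather than produced by real threads. It only shows that two connections which both snapshot the column list before either one alters the table will end with the second `ALTER TABLE` raising the same error as in the traceback. It is not a confirmed root cause for this issue.

```python
import os
import sqlite3
import tempfile


def column_names(conn, table):
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    return {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}


def demo_stale_snapshot_race():
    """Return the error message hit by the second connection, or None."""
    ddl = (
        "ALTER TABLE tasks "
        "ADD COLUMN consecutive_failures INTEGER NOT NULL DEFAULT 0"
    )
    error = None
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")  # throwaway DB, not kanban.db

        setup = sqlite3.connect(path)
        setup.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY)")
        setup.commit()
        setup.close()

        conn_a = sqlite3.connect(path)
        conn_b = sqlite3.connect(path)
        try:
            # Both connections take their snapshot before either migrates.
            cols_a = column_names(conn_a, "tasks")
            cols_b = column_names(conn_b, "tasks")

            # Connection A passes the guard and adds the column.
            if "consecutive_failures" not in cols_a:
                conn_a.execute(ddl)
                conn_a.commit()

            # Connection B's snapshot is now stale, so its guard passes too.
            if "consecutive_failures" not in cols_b:
                try:
                    conn_b.execute(ddl)
                except sqlite3.OperationalError as exc:
                    error = str(exc)
        finally:
            conn_a.close()
            conn_b.close()
    return error


if __name__ == "__main__":
    print(demo_stale_snapshot_race())
```

The guard itself is correct for a single connection; the failure only needs a second writer between the snapshot and the `ALTER TABLE`.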

The dispatcher path at `gateway/run.py:3931` does:
```python
conn = _kb.connect(board=slug)            # ← line 3931, the failing call
try:
    _kb.init_db(board=slug)               # opens another conn that re-runs init
except Exception:
    pass
```

`init_db()` `discard`s the path from `_INITIALIZED_PATHS` and re-opens, forcing the migration to re-run on a second connection. So per dispatcher tick, the migration is invoked twice on two different connections.

## Suggested fix

Either:

- **Idempotency wrap**: catch `sqlite3.OperationalError` whose message contains `"duplicate column name"` around each `ALTER TABLE` in `_migrate_add_optional_columns` and ignore it. The end state is what we want.
- **Re-query**: refresh `cols` from `PRAGMA table_info(tasks)` immediately before each guard check. The existing comment notes this is intentionally not done, but its assumption that "no step depends on a column added by a previous step in the same call" doesn't protect against another connection mutating the schema between snapshot and check.

I lean toward the idempotency-wrap fix as the simplest robust solution.
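A rough sketch of what the idempotency wrap could look like. `add_column_if_missing` is a hypothetical helper name, not something that exists in `hermes_cli/kanban_db.py`, and the real migration may structure its `ALTER TABLE` calls differently. The idea is only to swallow the one error that means "the column is already there" and to re-raise everything else.

```python
import sqlite3


def add_column_if_missing(conn, table, column, decl):
    """Hypothetical helper: add a column, tolerating a concurrent add.

    Returns True if this call added the column, False if it already existed.
    `table`, `column` and `decl` must be trusted literals, since they are
    interpolated into the DDL (SQLite cannot bind identifiers).
    """
    try:
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")
        return True
    except sqlite3.OperationalError as exc:
        # Only the "already present" case is safe to ignore.
        if "duplicate column name" in str(exc).lower():
            return False
        raise


# Usage against an in-memory table standing in for the kanban schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY)")

first = add_column_if_missing(
    conn, "tasks", "consecutive_failures", "INTEGER NOT NULL DEFAULT 0"
)
second = add_column_if_missing(
    conn, "tasks", "consecutive_failures", "INTEGER NOT NULL DEFAULT 0"
)
columns = [row[1] for row in conn.execute("PRAGMA table_info(tasks)")]
```

Matching on the message text is a little brittle, but `sqlite3.OperationalError` carries no more specific signal for this case, and unrelated failures (locked database, missing table) still propagate.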

## Workaround for affected users

None needed if you don't actively use kanban — the error fires once per restart and doesn't affect anything else. If kanban is in use, the second tick (60s later) succeeds and the dispatcher continues normally.

## Relevant recent commits

- `24d48ffb8 feat(kanban): add specify — auxiliary LLM fleshes out triage tasks (#21435)`
- `ac51c4c1a feat(kanban): per-task max_retries override (#21330)` — added `max_retries` column
- `a2ff19305 chore: follow-up cleanup for Kanban migration fix`

The migration handler in `hermes_cli/kanban_db.py:_migrate_add_optional_columns` is the relevant code path.

