fix(kanban): self-heal schema drift and rate-limit dispatcher tick errors by splashkes · Pull Request #30410 · NousResearch/hermes-agent

splashkes · 2026-05-22T11:54:15Z

What changed

Two defenses inside _kanban_dispatcher_watcher:

Schema-drift self-heal. no such column / no such table errors trigger one init_db-based repair attempt per slug per gateway lifetime, gated by a new schema_repair_attempted set. On retry success the board stays enabled; on retry failure the slug moves into disabled_schema_boards (same mtime-fingerprint shape as disabled_corrupt_boards) and is skipped on subsequent ticks until the file changes or the gateway restarts.
Rate-limited exception logging. logger.exception on the tick path is throttled to one line per (slug, exc_class) per _TICK_EXC_LOG_WINDOW_SECONDS (60s). Suppressed counts surface in the next permitted line.
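The throttling rule in the second defense can be sketched as follows. This is a minimal illustration, not the code from the patch: the class `TickExceptionThrottle`, the helper `log_tick_failure`, and the injectable clock are hypothetical names introduced here; only `_TICK_EXC_LOG_WINDOW_SECONDS` (60s) and the `(slug, exc_class)` key come from the description above.

```python
import logging
import time
from typing import Callable, Dict, Tuple

_TICK_EXC_LOG_WINDOW_SECONDS = 60.0  # window length stated in the PR description


class TickExceptionThrottle:
    """Hypothetical helper: one log line per (slug, exc_class) per window."""

    def __init__(
        self,
        window: float = _TICK_EXC_LOG_WINDOW_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._window = window
        self._clock = clock
        # key -> (time of the last emitted line, errors suppressed since then)
        self._state: Dict[Tuple[str, str], Tuple[float, int]] = {}

    def should_log(self, slug: str, exc: BaseException) -> Tuple[bool, int]:
        """Return (emit, suppressed_count_to_report)."""
        key = (slug, type(exc).__name__)
        now = self._clock()
        entry = self._state.get(key)
        if entry is not None and now - entry[0] < self._window:
            # Still inside the window: count the error but stay quiet.
            self._state[key] = (entry[0], entry[1] + 1)
            return False, 0
        suppressed = entry[1] if entry is not None else 0
        self._state[key] = (now, 0)
        return True, suppressed


def log_tick_failure(
    logger: logging.Logger,
    throttle: TickExceptionThrottle,
    slug: str,
    exc: BaseException,
) -> bool:
    """Log the failure with its traceback if the throttle permits it."""
    emit, suppressed = throttle.should_log(slug, exc)
    if emit:
        logger.error(
            "kanban tick failed for board %r (%d similar errors suppressed)",
            slug,
            suppressed,
            exc_info=exc,
        )
    return emit
```

Keying on the exception class rather than the message keeps the table small even when messages embed varying details, while still letting a new class of failure on the same board log immediately.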

Why

The session_id migration trap from #28464 — fixed at the migration layer in #28781 and the follow-up 7552e0f — caused the dispatcher to log a full logger.exception traceback every tick on multi-board installs. Reproduced in the wild: 20+ tracebacks per second per gateway, journald saturated, gateway RSS climbed to 3.5 GB over 12h, ssh sessions degraded.

The schema-layer fix closes that specific bug, but the dispatcher's lack of any defense makes every future additive-column bug of the same shape catastrophic. This patch is the dispatcher-layer guard so the next one self-heals instead of taking the gateway down.

#21378 compatibility

The existing regression test test_dispatcher_tick_does_not_call_init_db is updated to allow exactly one init_db(board=slug) call site, and only when it's accompanied by the schema_repair_attempted.add(slug) marker in the source. The marker is the contract that the call is gated to at most once per slug per gateway lifetime — the per-tick race that #21378 fixed cannot reappear. Any additional call site fails the test.
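A source-level check of this contract might look like the sketch below. It is an assumption about the shape of the test, not its actual body: `check_init_db_contract`, the regex, and the choice to scan raw source text are all hypothetical; only the `init_db` call and the `schema_repair_attempted.add(slug)` marker come from the description.

```python
import re

# Matches a call such as init_db(board=slug); "_init_db(" does not match
# because "_" is a word character and blocks the \b boundary.
INIT_DB_CALL = re.compile(r"\binit_db\s*\(")
REPAIR_MARKER = "schema_repair_attempted.add(slug)"


def check_init_db_contract(source: str) -> None:
    """Raise AssertionError unless init_db usage matches the gating contract.

    Allowed: zero call sites, or exactly one call site accompanied by the
    repair marker somewhere in the same source text.
    """
    call_sites = INIT_DB_CALL.findall(source)
    assert len(call_sites) <= 1, (
        f"expected at most one init_db call site, found {len(call_sites)}"
    )
    if call_sites:
        assert REPAIR_MARKER in source, (
            "init_db call site present without the schema-repair marker"
        )
```

A plain text scan like this also counts occurrences inside comments or strings, so a real test would need either a stricter pattern or an AST walk if the dispatcher source mentions `init_db(` in prose.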

How to test

scripts/run_tests.sh tests/hermes_cli/test_kanban_db.py \
                     tests/hermes_cli/test_kanban_db_init.py \
                     tests/hermes_cli/test_kanban_blocked_sticky.py \
                     tests/hermes_cli/test_kanban_core_functionality.py \
                     tests/hermes_cli/test_kanban_dispatcher_resilience.py \
                     tests/hermes_cli/test_kanban_notify.py

Result on this branch: 347/347 pass, 9.2s.

New tests

tests/hermes_cli/test_kanban_dispatcher_resilience.py:

test_schema_drift_triggers_init_db_repair_and_retry_succeeds — first tick raises no such column: session_id, second connect succeeds; init_db called exactly once, no traceback logged, board stays enabled.
test_schema_drift_persistent_failure_disables_board — every connect raises drift; init_db called exactly once across 5 ticks (not per-tick), board moves into disabled_schema_boards, error log names the missing column so operators know which migration entry to add.
test_persistent_unknown_sql_error_is_rate_limited — database is locked raised on 10 ticks; ≤1 tick-failed log line in the window.
test_schema_drift_recovery_state_clears_on_fingerprint_change — disabled board re-enables when its DB file mtime advances, mirroring the corrupt-board path.
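The behaviour these tests pin down can be illustrated with a small state machine. This is a sketch under stated assumptions, not the patch itself: `BoardGuard`, `is_schema_drift`, the string return values, and the injected `tick` / `init_db` / `fingerprint` callables are hypothetical. The set and dict names mirror `schema_repair_attempted` and `disabled_schema_boards` from the description. Whether the real code also clears the repair-attempt marker when the fingerprint changes is not stated, so the sketch leaves it set.

```python
import sqlite3
from typing import Callable, Dict, Set

SCHEMA_DRIFT_MARKERS = ("no such column", "no such table")


def is_schema_drift(exc: BaseException) -> bool:
    """True for SQLite errors whose message indicates a missing column/table."""
    return isinstance(exc, sqlite3.OperationalError) and any(
        marker in str(exc).lower() for marker in SCHEMA_DRIFT_MARKERS
    )


class BoardGuard:
    """Hypothetical wrapper around one board's dispatcher tick."""

    def __init__(
        self,
        tick: Callable[[str], None],           # runs one dispatcher pass
        init_db: Callable[[str], None],        # stand-in for init_db(board=slug)
        fingerprint: Callable[[str], float],   # e.g. the DB file mtime
    ) -> None:
        self.tick = tick
        self.init_db = init_db
        self.fingerprint = fingerprint
        self.schema_repair_attempted: Set[str] = set()
        self.disabled_schema_boards: Dict[str, float] = {}

    def run(self, slug: str) -> str:
        if slug in self.disabled_schema_boards:
            if self.fingerprint(slug) == self.disabled_schema_boards[slug]:
                return "skipped"  # file unchanged: stay disabled
            del self.disabled_schema_boards[slug]  # file changed: retry
        try:
            self.tick(slug)
            return "ok"
        except sqlite3.OperationalError as exc:
            if not is_schema_drift(exc):
                raise  # other SQL errors go to the normal (throttled) path
        if slug not in self.schema_repair_attempted:
            # Mark first so the repair runs at most once per slug.
            self.schema_repair_attempted.add(slug)
            try:
                self.init_db(slug)
                self.tick(slug)
                return "repaired"
            except sqlite3.OperationalError:
                pass  # repair did not help; fall through and disable
        self.disabled_schema_boards[slug] = self.fingerprint(slug)
        return "disabled"
```

Recording the marker before calling `init_db` is the design point that matters here: even if the repair itself raises, a later tick cannot trigger a second attempt.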

Platforms tested

Linux (WSL2 Ubuntu, Python 3.11.15). The patch touches only Python dispatcher logic; no OS-specific surfaces.

Related

Refs #28464 (the originating bug), #28781 (schema-layer fix, merged), 7552e0f (follow-up index hoist), #21378 (the per-tick init race this PR must not reintroduce). Does not close any open issue: the schema bug is already resolved, and this is the orthogonal dispatcher hardening to prevent recurrence.

…rors

The dispatcher's per-board tick had no resilience against persistent SQL errors. When the session_id migration trap from NousResearch#28464 hit a long-running gateway, the exception handler ran ``logger.exception`` on every tick: multi-board installs surfaced 20+ tracebacks per second per gateway, saturating journald and pushing the gateway RSS to multi-GB over hours.

The schema bug itself was fixed at the migration layer in NousResearch#28781 (and the follow-up 7552e0f), but the dispatcher's lack of defense made every future additive-column bug of the same shape catastrophic.

Two changes inside ``_kanban_dispatcher_watcher``:

* Schema-drift self-heal. ``no such column`` / ``no such table`` errors trigger one ``init_db``-based repair attempt per slug per gateway lifetime, gated by a new ``schema_repair_attempted`` set. If the retry succeeds the board stays enabled; if it fails the slug moves into ``disabled_schema_boards`` (same mtime-fingerprint shape as ``disabled_corrupt_boards``) and is skipped on subsequent ticks until the file changes or the gateway restarts. The ``init_db`` call is the only one in the dispatcher; the gating set is the contract that the per-tick race fixed in NousResearch#21378 cannot reappear.
* Rate-limited exception logging. ``logger.exception`` calls on the tick path are throttled to one line per ``(slug, exc_class)`` per ``_TICK_EXC_LOG_WINDOW_SECONDS`` (60s). Suppressed counts surface in the next permitted line so steady failures stay visible without saturating journald. This caps the worst-case log volume regardless of which future SQL or contention bug shows up.

The NousResearch#21378 regression test ``test_dispatcher_tick_does_not_call_init_db`` is updated to allow exactly one ``init_db`` call site, and only when the schema-drift recovery marker (``schema_repair_attempted.add(slug)``) is present in the source. The marker is the contract that the call is gated to at most once per slug per gateway lifetime; any additional call site fails the test.

New regression tests in tests/hermes_cli/test_kanban_dispatcher_resilience.py:

* ``test_schema_drift_triggers_init_db_repair_and_retry_succeeds``
* ``test_schema_drift_persistent_failure_disables_board``
* ``test_persistent_unknown_sql_error_is_rate_limited``
* ``test_schema_drift_recovery_state_clears_on_fingerprint_change``

Refs NousResearch#28464, NousResearch#28781, NousResearch#21378.

splashkes mentioned this pull request May 22, 2026

kanban: SCHEMA_SQL creates idx_tasks_session_id before _migrate_add_optional_columns adds the column #28844

Closed

alt-glitch added the type/bug (Something isn't working), P3 (Low: cosmetic, nice to have), and comp/gateway (Gateway runner, session dispatch, delivery) labels May 22, 2026

alt-glitch mentioned this pull request May 23, 2026

kanban.db index corruption after frequent gateway restarts — dispatcher disables board permanently #30908

Closed

steveonjava mentioned this pull request May 25, 2026

fix(gateway): replace permanent corrupt-board latch with exponential backoff #31932

Closed

9 tasks

alt-glitch mentioned this pull request May 26, 2026

fix(gateway): catch KanbanDbCorruptError in kanban dispatcher corrupt-board path #32490

Closed

splashkes wants to merge 1 commit into NousResearch:main from splashkes:fix/kanban-dispatcher-resilience
