fix(kanban): self-heal schema drift and rate-limit dispatcher tick errors #30410

Open
splashkes wants to merge 1 commit into NousResearch:main from splashkes:fix/kanban-dispatcher-resilience

What changed

Two defenses inside _kanban_dispatcher_watcher:

  1. Schema-drift self-heal. no such column / no such table errors trigger one init_db-based repair attempt per slug per gateway lifetime, gated by a new schema_repair_attempted set. On retry success the board stays enabled; on retry failure the slug moves into disabled_schema_boards (same mtime-fingerprint shape as disabled_corrupt_boards) and is skipped on subsequent ticks until the file changes or the gateway restarts.
  2. Rate-limited exception logging. logger.exception on the tick path is throttled to one line per (slug, exc_class) per _TICK_EXC_LOG_WINDOW_SECONDS (60s). Suppressed counts surface in the next permitted line.
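The gating described in item 1 can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: `tick_board`, `run_tick`, `is_schema_drift`, and the `db_mtime` argument are hypothetical names, the two state containers only borrow the names `schema_repair_attempted` and `disabled_schema_boards` from the description, and `init_db` is passed in as a callable so the sketch stays self-contained. It also assumes the drift shows up as `sqlite3.OperationalError`.

```python
import sqlite3

# Hypothetical stand-ins for the dispatcher's state; only the names come from the PR text.
schema_repair_attempted: set = set()   # slugs that already used their one repair attempt
disabled_schema_boards: dict = {}      # slug -> DB file mtime recorded when disabled

_DRIFT_MARKERS = ("no such column", "no such table")


def is_schema_drift(exc: Exception) -> bool:
    """True for the SQLite errors the PR classifies as schema drift."""
    return isinstance(exc, sqlite3.OperationalError) and any(
        marker in str(exc).lower() for marker in _DRIFT_MARKERS
    )


def tick_board(slug, run_tick, init_db, db_mtime):
    """Run one tick for a board; returns 'ok', 'repaired', 'disabled' or 'skipped'."""
    if slug in disabled_schema_boards:
        if disabled_schema_boards[slug] == db_mtime:
            return "skipped"              # file unchanged: stay disabled
        del disabled_schema_boards[slug]  # file changed: give the board another chance
    try:
        run_tick(slug)
        return "ok"
    except sqlite3.OperationalError as exc:
        if not is_schema_drift(exc):
            raise                         # other SQL errors go to the normal handler
        if slug in schema_repair_attempted:
            # The single repair attempt for this gateway lifetime is already spent.
            disabled_schema_boards[slug] = db_mtime
            return "disabled"
        schema_repair_attempted.add(slug)  # mark BEFORE calling init_db
        try:
            init_db(board=slug)
            run_tick(slug)                 # retry once after the repair
            return "repaired"
        except sqlite3.OperationalError:
            disabled_schema_boards[slug] = db_mtime
            return "disabled"
```

Adding the slug to the set before calling `init_db` is what keeps the repair at one attempt per slug, even if the repair itself raises.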

Why

The session_id migration trap from #28464 — fixed at the migration layer in #28781 and the follow-up 7552e0f — caused the dispatcher to log a full logger.exception traceback every tick on multi-board installs. Reproduced in the wild: 20+ tracebacks per second per gateway, journald saturated, gateway RSS climbed to 3.5 GB over 12h, ssh sessions degraded.

The schema-layer fix closes that specific bug, but the dispatcher's lack of any defense makes every future additive-column bug of the same shape catastrophic. This patch is the dispatcher-layer guard so the next one self-heals instead of taking the gateway down.

#21378 compatibility

The existing regression test test_dispatcher_tick_does_not_call_init_db is updated to allow exactly one init_db(board=slug) call site, and only when it's accompanied by the schema_repair_attempted.add(slug) marker in the source. The marker is the contract that the call is gated to at most once per slug per gateway lifetime — the per-tick race that #21378 fixed cannot reappear. Any additional call site fails the test.
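One way such a source-level check could look is sketched below. The helper `check_init_db_call_sites` is hypothetical and the actual test in the repository may be structured differently; the sketch only illustrates the rule stated above (at most one `init_db` call site, and only alongside the `schema_repair_attempted.add(slug)` marker), applied to a string holding the watcher's source.

```python
import re


def check_init_db_call_sites(source: str) -> None:
    """Raise AssertionError if `source` violates the one-gated-call-site rule.

    `source` is assumed to be the text of the dispatcher watcher function,
    which calls init_db but does not define it.
    """
    call_sites = re.findall(r"\binit_db\s*\(", source)
    if not call_sites:
        return  # no call at all is always acceptable
    if len(call_sites) != 1:
        raise AssertionError(
            f"expected at most one init_db call site, found {len(call_sites)}"
        )
    if "schema_repair_attempted.add(slug)" not in source:
        raise AssertionError(
            "init_db call site is not accompanied by the repair-gating marker"
        )
```

A text-based check like this cannot prove the call is actually guarded at runtime; it only makes an ungated or duplicated call site fail loudly in review.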

How to test

scripts/run_tests.sh tests/hermes_cli/test_kanban_db.py \
                     tests/hermes_cli/test_kanban_db_init.py \
                     tests/hermes_cli/test_kanban_blocked_sticky.py \
                     tests/hermes_cli/test_kanban_core_functionality.py \
                     tests/hermes_cli/test_kanban_dispatcher_resilience.py \
                     tests/hermes_cli/test_kanban_notify.py

Result on this branch: 347/347 pass, 9.2s.

New tests

tests/hermes_cli/test_kanban_dispatcher_resilience.py:

  • test_schema_drift_triggers_init_db_repair_and_retry_succeeds: first tick raises no such column: session_id, second connect succeeds; init_db called exactly once, no traceback logged, board stays enabled.
  • test_schema_drift_persistent_failure_disables_board: every connect raises drift; init_db called exactly once across 5 ticks (not per-tick), board moves into disabled_schema_boards, error log names the missing column so operators know which migration entry to add.
  • test_persistent_unknown_sql_error_is_rate_limited: database is locked raised on 10 ticks; ≤1 tick-failed log line in the window.
  • test_schema_drift_recovery_state_clears_on_fingerprint_change: disabled board re-enables when its DB file mtime advances, mirroring the corrupt-board path.
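The throttling that test_persistent_unknown_sql_error_is_rate_limited exercises can be illustrated with a small sketch. The class `TickErrorThrottle` and its `should_log` method are hypothetical names, not taken from the patch; only the `(slug, exc_class)` key, the 60s window constant, and the "suppressed count surfaces in the next permitted line" behavior come from the description. The clock is injectable so the window can be tested without sleeping.

```python
import time

_TICK_EXC_LOG_WINDOW_SECONDS = 60.0  # window length named in the PR description


class TickErrorThrottle:
    """Allow one log line per (slug, exception class) per window."""

    def __init__(self, window=_TICK_EXC_LOG_WINDOW_SECONDS, clock=time.monotonic):
        self._window = window
        self._clock = clock
        self._last_logged = {}  # key -> time of the last emitted line
        self._suppressed = {}   # key -> lines dropped since the last emitted one

    def should_log(self, slug, exc):
        """Return (emit, suppressed_count) for this failure."""
        key = (slug, type(exc).__name__)
        now = self._clock()
        last = self._last_logged.get(key)
        if last is not None and now - last < self._window:
            self._suppressed[key] = self._suppressed.get(key, 0) + 1
            return False, 0
        # Window elapsed (or first occurrence): emit and report what was dropped.
        suppressed = self._suppressed.pop(key, 0)
        self._last_logged[key] = now
        return True, suppressed
```

A caller would log only when `emit` is true and append the suppressed count to that line, so a steady failure still produces one visible line per window.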

Platforms tested

Linux (WSL2 Ubuntu, Python 3.11.15). The patch touches only Python dispatcher logic; no OS-specific surfaces.

Related

Refs #28464 (the originating bug), #28781 (schema-layer fix, merged), 7552e0f (follow-up index hoist), #21378 (the per-tick init race this PR must not reintroduce). Does not close any open issue — the schema bug is already resolved; this is the orthogonal dispatcher hardening to prevent recurrence.

…rors

The dispatcher's per-board tick had no resilience against persistent
SQL errors. When the session_id migration trap from NousResearch#28464 hit a
long-running gateway, the exception handler ran ``logger.exception``
on every tick — multi-board installs surfaced 20+ tracebacks per second
per gateway, saturating journald and pushing the gateway RSS to multi-
GB over hours. The schema bug itself was fixed at the migration layer
in NousResearch#28781 (and the follow-up 7552e0f), but the dispatcher's lack of
defense made every future additive-column bug of the same shape
catastrophic.

Two changes inside ``_kanban_dispatcher_watcher``:

* Schema-drift self-heal. ``no such column`` / ``no such table`` errors
  trigger one ``init_db``-based repair attempt per slug per gateway
  lifetime, gated by a new ``schema_repair_attempted`` set. If the
  retry succeeds the board stays enabled; if it fails the slug moves
  into ``disabled_schema_boards`` (same mtime-fingerprint shape as
  ``disabled_corrupt_boards``) and is skipped on subsequent ticks
  until the file changes or the gateway restarts. The ``init_db``
  call is the only one in the dispatcher — the gating set is the
  contract that the per-tick race fixed in NousResearch#21378 cannot reappear.

* Rate-limited exception logging. ``logger.exception`` calls on the
  tick path are throttled to one line per ``(slug, exc_class)`` per
  ``_TICK_EXC_LOG_WINDOW_SECONDS`` (60s). Suppressed counts surface in
  the next permitted line so steady failures stay visible without
  saturating journald. This caps the worst-case log volume regardless
  of which future SQL or contention bug shows up.

The NousResearch#21378 regression test ``test_dispatcher_tick_does_not_call_init_db``
is updated to allow exactly one ``init_db`` call site, and only when
the schema-drift recovery marker (``schema_repair_attempted.add(slug)``)
is present in the source. The marker is the contract that the call is
gated to at most once per slug per gateway lifetime; any additional
call site fails the test.

New regression tests in tests/hermes_cli/test_kanban_dispatcher_resilience.py:

* ``test_schema_drift_triggers_init_db_repair_and_retry_succeeds``
* ``test_schema_drift_persistent_failure_disables_board``
* ``test_persistent_unknown_sql_error_is_rate_limited``
* ``test_schema_drift_recovery_state_clears_on_fingerprint_change``

Refs NousResearch#28464, NousResearch#28781, NousResearch#21378.

Labels

  • comp/gateway: Gateway runner, session dispatch, delivery
  • P3: Low (cosmetic, nice to have)
  • type/bug: Something isn't working
