fix(kanban): self-heal schema drift and rate-limit dispatcher tick errors #30410
Open
splashkes wants to merge 1 commit into
…rors

The dispatcher's per-board tick had no resilience against persistent SQL errors. When the session_id migration trap from NousResearch#28464 hit a long-running gateway, the exception handler ran ``logger.exception`` on every tick: multi-board installs surfaced 20+ tracebacks per second per gateway, saturating journald and pushing the gateway RSS to multi-GB over hours.

The schema bug itself was fixed at the migration layer in NousResearch#28781 (and the follow-up 7552e0f), but the dispatcher's lack of defense made every future additive-column bug of the same shape catastrophic.

Two changes inside ``_kanban_dispatcher_watcher``:

* Schema-drift self-heal. ``no such column`` / ``no such table`` errors trigger one ``init_db``-based repair attempt per slug per gateway lifetime, gated by a new ``schema_repair_attempted`` set. If the retry succeeds the board stays enabled; if it fails the slug moves into ``disabled_schema_boards`` (same mtime-fingerprint shape as ``disabled_corrupt_boards``) and is skipped on subsequent ticks until the file changes or the gateway restarts. The ``init_db`` call is the only one in the dispatcher; the gating set is the contract that the per-tick race fixed in NousResearch#21378 cannot reappear.
* Rate-limited exception logging. ``logger.exception`` calls on the tick path are throttled to one line per ``(slug, exc_class)`` per ``_TICK_EXC_LOG_WINDOW_SECONDS`` (60s). Suppressed counts surface in the next permitted line so steady failures stay visible without saturating journald. This caps the worst-case log volume regardless of which future SQL or contention bug shows up.

The NousResearch#21378 regression test ``test_dispatcher_tick_does_not_call_init_db`` is updated to allow exactly one ``init_db`` call site, and only when the schema-drift recovery marker (``schema_repair_attempted.add(slug)``) is present in the source. The marker is the contract that the call is gated to at most once per slug per gateway lifetime; any additional call site fails the test.

New regression tests in tests/hermes_cli/test_kanban_dispatcher_resilience.py:

* ``test_schema_drift_triggers_init_db_repair_and_retry_succeeds``
* ``test_schema_drift_persistent_failure_disables_board``
* ``test_persistent_unknown_sql_error_is_rate_limited``
* ``test_schema_drift_recovery_state_clears_on_fingerprint_change``

Refs NousResearch#28464, NousResearch#28781, NousResearch#21378.
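
To make the rate-limited logging concrete, here is a minimal sketch of a per-``(slug, exc_class)`` throttle that carries the suppressed count into the next permitted line. Only ``_TICK_EXC_LOG_WINDOW_SECONDS`` and its 60s value come from this PR; ``TickExceptionThrottle``, ``should_log``, ``log_tick_failure`` and the log message are hypothetical names chosen for illustration, and the actual patch may be structured differently.

```python
import logging
import time
from typing import Callable, Dict, Tuple

# Window length as stated in the PR; all other names here are hypothetical.
_TICK_EXC_LOG_WINDOW_SECONDS = 60.0


class TickExceptionThrottle:
    """Permit one log line per (slug, exception class) per window."""

    def __init__(
        self,
        window: float = _TICK_EXC_LOG_WINDOW_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._window = window
        self._clock = clock  # injectable so tests need not sleep
        # key -> (time of last emitted line, occurrences suppressed since)
        self._state: Dict[Tuple[str, str], Tuple[float, int]] = {}

    def should_log(self, slug: str, exc: BaseException) -> Tuple[bool, int]:
        """Return (emit, suppressed_since_last_emitted_line)."""
        key = (slug, type(exc).__name__)
        now = self._clock()
        entry = self._state.get(key)
        if entry is None or now - entry[0] >= self._window:
            suppressed = entry[1] if entry is not None else 0
            self._state[key] = (now, 0)
            return True, suppressed
        # Still inside the window: count the occurrence, emit nothing.
        self._state[key] = (entry[0], entry[1] + 1)
        return False, 0


def log_tick_failure(
    logger: logging.Logger,
    throttle: TickExceptionThrottle,
    slug: str,
    exc: BaseException,
) -> bool:
    """Log a tick failure with traceback unless throttled; return True if logged."""
    emit, suppressed = throttle.should_log(slug, exc)
    if emit:
        logger.error(
            "kanban tick failed for board %s (%d similar errors suppressed)",
            slug,
            suppressed,
            exc_info=exc,
        )
    return emit
```

Keying on the exception class rather than the message keeps the state bounded even when messages embed varying details, at the cost of folding distinct errors of the same class into one line per window.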
What changed
Two defenses inside ``_kanban_dispatcher_watcher``:
* Schema-drift self-heal. ``no such column`` / ``no such table`` errors trigger one ``init_db``-based repair attempt per slug per gateway lifetime, gated by a new ``schema_repair_attempted`` set. On retry success the board stays enabled; on retry failure the slug moves into ``disabled_schema_boards`` (same mtime-fingerprint shape as ``disabled_corrupt_boards``) and is skipped on subsequent ticks until the file changes or the gateway restarts.
* Rate-limited exception logging. ``logger.exception`` on the tick path is throttled to one line per ``(slug, exc_class)`` per ``_TICK_EXC_LOG_WINDOW_SECONDS`` (60s). Suppressed counts surface in the next permitted line.
Why
The session_id migration trap from #28464 (fixed at the migration layer in #28781 and the follow-up 7552e0f) caused the dispatcher to log a full ``logger.exception`` traceback every tick on multi-board installs. Reproduced in the wild: 20+ tracebacks per second per gateway, journald saturated, gateway RSS climbed to 3.5 GB over 12h, ssh sessions degraded.
The schema-layer fix closes that specific bug, but the dispatcher's lack of any defense makes every future additive-column bug of the same shape catastrophic. This patch is the dispatcher-layer guard so the next one self-heals instead of taking the gateway down.
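
The self-heal behaviour can be sketched roughly as follows. This is an illustrative reduction, not the code in the patch: ``tick_board``, ``is_schema_drift``, the ``run_tick`` / ``init_db`` / ``fingerprint`` callables and the returned status strings are hypothetical, and the sketch assumes the errors surface as ``sqlite3.OperationalError``. It also assumes that a fingerprint change re-enables the board without granting a second repair attempt, which is one reading of "per gateway lifetime".

```python
import sqlite3
from typing import Callable, Dict, Hashable, Set

_DRIFT_MARKERS = ("no such column", "no such table")


def is_schema_drift(exc: BaseException) -> bool:
    """True for SQLite errors that indicate a missing column or table."""
    return isinstance(exc, sqlite3.OperationalError) and any(
        marker in str(exc).lower() for marker in _DRIFT_MARKERS
    )


def tick_board(
    slug: str,
    run_tick: Callable[[str], None],         # hypothetical: one dispatcher tick
    init_db: Callable[[str], None],          # hypothetical: schema repair hook
    fingerprint: Callable[[str], Hashable],  # hypothetical: e.g. DB file mtime
    schema_repair_attempted: Set[str],
    disabled_schema_boards: Dict[str, Hashable],
) -> str:
    """Run one tick; return 'ok', 'repaired', 'disabled' or 'skipped'."""
    if slug in disabled_schema_boards:
        if disabled_schema_boards[slug] == fingerprint(slug):
            return "skipped"  # file unchanged since the board was disabled
        # File changed: re-enable the board. The repair gate stays set
        # (assumption), so init_db still runs at most once per lifetime.
        del disabled_schema_boards[slug]
    try:
        run_tick(slug)
        return "ok"
    except sqlite3.OperationalError as exc:
        if not is_schema_drift(exc):
            raise  # other SQL errors go to the normal (throttled) handler
        if slug not in schema_repair_attempted:
            schema_repair_attempted.add(slug)  # gate: one attempt per slug
            try:
                init_db(slug)
                run_tick(slug)
                return "repaired"
            except sqlite3.OperationalError:
                pass  # repair did not help; fall through and disable
        disabled_schema_boards[slug] = fingerprint(slug)
        return "disabled"
```

Recording the fingerprint only after the failed repair matters in this sketch: ``init_db`` may itself touch the file, and a fingerprint taken earlier would re-enable the board on the very next tick.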
#21378 compatibility
The existing regression test ``test_dispatcher_tick_does_not_call_init_db`` is updated to allow exactly one ``init_db(board=slug)`` call site, and only when it's accompanied by the ``schema_repair_attempted.add(slug)`` marker in the source. The marker is the contract that the call is gated to at most once per slug per gateway lifetime, so the per-tick race that #21378 fixed cannot reappear. Any additional call site fails the test.
How to test
Result on this branch: 347/347 pass, 9.2s.
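
For readers unfamiliar with the source-inspection style of the updated #21378 guard, the check can be approximated as below. This is a hedged sketch, not the test in the repository: ``check_init_db_call_sites`` and the regular expression are hypothetical, and treating zero call sites as passing is an assumption.

```python
import re

# Matches call sites such as init_db(board=slug) but not "def init_db(".
_INIT_DB_CALL = re.compile(r"(?<!def )\binit_db\s*\(")
_REPAIR_MARKER = "schema_repair_attempted.add(slug)"


def check_init_db_call_sites(source: str) -> None:
    """Raise AssertionError unless init_db is called at most once, and only
    alongside the schema-repair marker."""
    call_count = len(_INIT_DB_CALL.findall(source))
    if call_count == 0:
        return  # assumption: a dispatcher with no init_db call still passes
    if call_count > 1:
        raise AssertionError(
            f"expected at most one init_db call site, found {call_count}"
        )
    if _REPAIR_MARKER not in source:
        raise AssertionError(
            "init_db call site is not gated by schema_repair_attempted.add(slug)"
        )
```

In a real test the ``source`` string would typically come from ``inspect.getsource`` on the watcher function; a textual check like this is coarse, but it fails loudly the moment a second call site is added.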
New tests
tests/hermes_cli/test_kanban_dispatcher_resilience.py:
* ``test_schema_drift_triggers_init_db_repair_and_retry_succeeds``: first tick raises ``no such column: session_id``, second connect succeeds; ``init_db`` called exactly once, no traceback logged, board stays enabled.
* ``test_schema_drift_persistent_failure_disables_board``: every connect raises drift; ``init_db`` called exactly once across 5 ticks (not per-tick), board moves into ``disabled_schema_boards``, error log names the missing column so operators know which migration entry to add.
* ``test_persistent_unknown_sql_error_is_rate_limited``: ``database is locked`` raised on 10 ticks; ≤1 tick-failed log line in the window.
* ``test_schema_drift_recovery_state_clears_on_fingerprint_change``: disabled board re-enables when its DB file mtime advances, mirroring the corrupt-board path.
Platforms tested
Linux (WSL2 Ubuntu, Python 3.11.15). The patch touches only Python dispatcher logic; no OS-specific surfaces.
Related
Refs #28464 (the originating bug), #28781 (schema-layer fix, merged), 7552e0f (follow-up index hoist), #21378 (the per-tick init race this PR must not reintroduce). Does not close any open issue — the schema bug is already resolved; this is the orthogonal dispatcher hardening to prevent recurrence.