fix(gateway): replace permanent corrupt-board latch with exponential backoff #31932

Closed
steveonjava wants to merge 1 commit into NousResearch:main from steveonjava:fix/gateway-disabled-boards-overaggressive-latch

Conversation

@steveonjava

Contributor

What does this PR do?

This PR replaces a permanent-fingerprint latch in the kanban dispatcher's corrupt-board handler with time-bounded exponential backoff and a PRAGMA quick_check confirmation step. The latch is an in-process resilience heuristic (SECURITY.md §2.4); this change makes it less aggressive on transient I/O errors. Under the project's security policy, improvements to in-process heuristics are welcome as regular PRs (SECURITY.md §3.2).

Changes in scope

  • gateway/run.py: Replace fingerprint-only latch in disabled_corrupt_boards with exponential backoff state dict (disabled_until_ts, backoff_seconds, fingerprint).
  • Add _confirm_corruption(slug, exc) helper that runs PRAGMA quick_check before latching — if quick_check returns ok, the error was transient and no latch is applied.
  • Add DEBUG-level log on every silent-skip tick (currently silent).
  • Update error log on latch: include backoff duration + next-retry hint.
  • Add success path: clear latch state on successful dispatch (disabled_corrupt_boards.pop).
  • Add constants: INITIAL_BACKOFF_SEC = 30.0, MAX_BACKOFF_SEC = 900.0.
  • New tests in tests/hermes_cli/test_kanban_dispatcher_resilience.py.
  • AUTHOR_MAP entry in scripts/release.py.
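
The confirmation step described above can be sketched roughly as follows. This is a minimal illustration, not the code in gateway/run.py: the PR's helper is `_confirm_corruption(slug, exc)`, while the hypothetical `confirm_corruption(db_path)` below takes a database path directly so that it is self-contained. It assumes that any result other than a single `ok` row, or a failure to run the check at all, counts as corruption.

```python
import sqlite3


def confirm_corruption(db_path: str) -> bool:
    """Return True if the database looks genuinely corrupt.

    Hypothetical stand-in for the PR's _confirm_corruption(slug, exc);
    the real helper's signature and lookup logic may differ.
    """
    try:
        conn = sqlite3.connect(db_path)
        try:
            rows = conn.execute("PRAGMA quick_check").fetchall()
        finally:
            conn.close()
    except sqlite3.DatabaseError:
        # The check itself could not run, so treat the board as corrupt.
        return True
    # A healthy database yields exactly one row containing "ok";
    # anything else (including a malformed result) counts as corruption.
    return rows != [("ok",)]
```

When this returns False the original error is treated as transient and no latch is applied; when it returns True the caller would go on to record a backoff window.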

Prior art & coordination

This change coordinates with PR #30410 (schema-drift fix in the same code block). If #30410 merges first, the implementer must rebase and handle the merge conflict carefully — keep #30410's disabled_schema_boards logic intact and apply the backoff change only to the disabled_corrupt_boards path.

Related Issue

Fixes #30417 (Bug 2: dispatcher resilience, transient SQLite I/O error causing permanent latch)

Refs #30410 (schema drift fix, same code block — coordinate on merge order)

Type of Change

  • Bug fix (non-breaking change that fixes an issue)

Changes Made

  • Modified gateway/run.py lines ~5118–5189 (_kanban_dispatcher_watcher / _tick_once_for_board):

    • Changed disabled_corrupt_boards dict type from dict[str, tuple] to dict[str, dict] with fields: disabled_until_ts, backoff_seconds, fingerprint
    • Added INITIAL_BACKOFF_SEC = 30.0 and MAX_BACKOFF_SEC = 900.0 constants
    • Added _confirm_corruption(slug, exc) function to run PRAGMA quick_check before latching
    • Updated _tick_once_for_board backoff logic to double backoff on repeated failures, reset on fingerprint change or successful dispatch
    • Added DEBUG log on silent-skip path
    • Added success path to clear latch state
    • Added import math if not present
  • Added comprehensive test suite in tests/hermes_cli/test_kanban_dispatcher_resilience.py:

    • Test transient errors are not latched when quick_check passes
    • Test genuine corruption latches with initial backoff (30s)
    • Test backoff doubles on repeated failures, caps at 900s
    • Test fingerprint changes clear the latch immediately
    • Test successful dispatch clears latch state
    • Test backoff resets when fingerprint changes between failures
    • Test non-corruption errors bypass the latch mechanism
    • Test boards are not dispatched during active backoff window
    • Test malformed quick_check results are treated as corruption
  • Added AUTHOR_MAP entry in scripts/release.py for steveonjava
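
The backoff behaviour listed above (double on repeated failure, cap at 900s, reset on fingerprint change, clear on success) might look something like the sketch below. The constants and the three state fields come from this PR; the function names `record_failure`, `should_skip`, and `record_success` are hypothetical, and passing `now` explicitly is an assumption made to keep the example deterministic.

```python
INITIAL_BACKOFF_SEC = 30.0
MAX_BACKOFF_SEC = 900.0

# slug -> {"disabled_until_ts", "backoff_seconds", "fingerprint"}
disabled_corrupt_boards: dict[str, dict] = {}


def record_failure(slug: str, fingerprint: str, now: float) -> float:
    """Latch a board after confirmed corruption and return the backoff applied."""
    state = disabled_corrupt_boards.get(slug)
    if state is None or state["fingerprint"] != fingerprint:
        # First failure, or the file changed since the last one: start over.
        backoff = INITIAL_BACKOFF_SEC
    else:
        # Same fingerprint failing again: double, up to the cap.
        backoff = min(state["backoff_seconds"] * 2, MAX_BACKOFF_SEC)
    disabled_corrupt_boards[slug] = {
        "disabled_until_ts": now + backoff,
        "backoff_seconds": backoff,
        "fingerprint": fingerprint,
    }
    return backoff


def should_skip(slug: str, fingerprint: str, now: float) -> bool:
    """Return True while the board is inside its backoff window."""
    state = disabled_corrupt_boards.get(slug)
    if state is None:
        return False
    if state["fingerprint"] != fingerprint:
        # The file changed, so clear the latch and retry immediately.
        disabled_corrupt_boards.pop(slug, None)
        return False
    return now < state["disabled_until_ts"]


def record_success(slug: str) -> None:
    """Clear any latch state after a successful dispatch."""
    disabled_corrupt_boards.pop(slug, None)
```

Under these assumptions, repeated failures with an unchanged fingerprint produce backoffs of 30, 60, 120, 240, 480, and then 900 seconds, since the next doubling (960) exceeds the cap.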

How to Test

scripts/run_tests.sh tests/hermes_cli/test_kanban_dispatcher_resilience.py -v

All 9 new tests should pass. For full validation:

scripts/run_tests.sh
ruff check . && ruff format --check .
scripts/check-windows-footguns.py

Checklist

  • Read CONTRIBUTING.md
  • Conventional Commits
  • Searched for existing PRs
  • PR contains only related changes
  • pytest tests/ -q passes
  • Tests added for changes (required for bug fixes)
  • Tested on platform
  • Updated docs/config if needed

…backoff

Transient SQLite I/O errors that match the corruption pattern permanently disabled board dispatch. Add PRAGMA quick_check confirmation before latching, and replace the fingerprint-only latch with exponential backoff (30s initial, doubles per failure, 900s cap). Clear latch on fingerprint change or successful dispatch.

Refs: NousResearch#30417 (Bug 2)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@alt-glitch added the type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), and comp/gateway (Gateway runner, session dispatch, delivery) labels on May 25, 2026
@steveonjava steveonjava marked this pull request as ready for review May 25, 2026 23:07
@steveonjava

Contributor Author

Bundled into #32857 for batch review. This draft remains open as a cherry-pick fallback if maintainers prefer surgical landing.

@steveonjava

Contributor Author

Closing — superseded by upstream c94ad89 "fix(kanban): retry corrupt-board dispatch after quarantine" (merged 2026-05-27). That commit fixes the same root cause (permanent latch after one EIO/transient corrupt read) with a 5-minute fixed quarantine retry. The exponential-backoff variant in this PR is a refinement, not a needed fix, and isn't worth a competing PR. If quarantine tuning becomes an issue in practice I'll open a follow-up against the new mechanism rather than rebase this one.

Labels

comp/gateway: Gateway runner, session dispatch, delivery
P2: Medium, degraded but workaround exists
type/bug: Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Dispatcher resilience: deterministic spawn-crash loop, transient SQLite I/O latches dispatch off, archived-parent silently promotes children

2 participants