fix(gateway): replace permanent corrupt-board latch with exponential backoff #31932
Closed
steveonjava wants to merge 1 commit into
…backoff Transient SQLite I/O errors that match the corruption pattern permanently disabled board dispatch. Add PRAGMA quick_check confirmation before latching, and replace the fingerprint-only latch with exponential backoff (30s initial, doubles per failure, 900s cap). Clear latch on fingerprint change or successful dispatch. Refs: NousResearch#30417 (Bug 2) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
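The confirmation step named in the commit message can be sketched as below. This is a minimal illustration under stated assumptions, not the PR's code: the function name `quick_check_confirms_corruption` and its `db_path` parameter are hypothetical (the PR's helper is `_confirm_corruption(slug, exc)`, whose body is not shown here), and treating a failing check as corruption is an assumption of this sketch.

```python
import sqlite3


def quick_check_confirms_corruption(db_path: str) -> bool:
    """Return True if PRAGMA quick_check does not report a clean database.

    Hypothetical stand-in for the PR's _confirm_corruption(slug, exc).
    """
    try:
        conn = sqlite3.connect(db_path)
        try:
            rows = conn.execute("PRAGMA quick_check").fetchall()
        finally:
            conn.close()
    except sqlite3.DatabaseError:
        # Assumption: if the check itself cannot run, treat it as corruption.
        return True
    # A healthy database yields exactly one row containing the string "ok";
    # anything else (including an empty result) is treated as corruption.
    return rows != [("ok",)]
```

A caller would latch the board only when this returns `True`; a `False` result suggests the original error was transient.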
This was referenced May 26, 2026
Contributor
Author
Bundled into #32857 for batch review. This draft remains open as a cherry-pick fallback if maintainers prefer surgical landing.
Contributor
Author
Closing: superseded by upstream c94ad89 "fix(kanban): retry corrupt-board dispatch after quarantine" (merged 2026-05-27). That commit fixes the same root cause (permanent latch after one EIO/transient corrupt read) with a 5-minute fixed quarantine retry. The exponential-backoff variant in this PR is a refinement, not a needed fix, and isn't worth a competing PR. If quarantine tuning becomes an issue in practice I'll open a follow-up against the new mechanism rather than rebase this one.
What does this PR do?
This PR replaces a permanent, fingerprint-only latch in the kanban dispatcher's corrupt-board handler with time-bounded exponential backoff and a `PRAGMA quick_check` confirmation step. The latch is an in-process resilience heuristic (SECURITY.md §2.4); this change makes it less aggressive on transient I/O errors. Under the project's security policy, improvements to in-process heuristics are welcome as regular PRs (SECURITY.md §3.2).

Changes in scope

- `gateway/run.py`: Replace the fingerprint-only latch in `disabled_corrupt_boards` with an exponential backoff state dict (`disabled_until_ts`, `backoff_seconds`, `fingerprint`).
- `_confirm_corruption(slug, exc)` helper that runs `PRAGMA quick_check` before latching; if quick_check returns `ok`, the error was transient and no latch is applied.
- Latch cleared on successful dispatch (`disabled_corrupt_boards.pop`).
- Constants: `INITIAL_BACKOFF_SEC = 30.0`, `MAX_BACKOFF_SEC = 900.0`.
- Tests: `tests/hermes_cli/test_kanban_dispatcher_resilience.py`.
- AUTHOR_MAP entry: `scripts/release.py`.

Prior art & coordination
This change coordinates with PR #30410 (schema-drift fix in the same code block). If #30410 merges first, the implementer must rebase and handle the merge conflict carefully: keep #30410's `disabled_schema_boards` logic intact and apply the backoff change only to the `disabled_corrupt_boards` path.

Related Issue
Fixes #30417 (Bug 2: dispatcher resilience, transient SQLite I/O error causing permanent latch)
Refs #30410 (schema drift fix, same code block — coordinate on merge order)
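For readers following the issue, the backoff state this PR describes (30 s initial delay, doubling per repeated failure, 900 s cap, cleared on fingerprint change or successful dispatch) can be sketched as below. This is an illustration under stated assumptions rather than the code in `gateway/run.py`: the function names `record_failure`, `should_skip`, and `record_success` are hypothetical, and the fingerprint is treated as an opaque value.

```python
import time

INITIAL_BACKOFF_SEC = 30.0
MAX_BACKOFF_SEC = 900.0

# slug -> {"disabled_until_ts", "backoff_seconds", "fingerprint"}
disabled_corrupt_boards: dict[str, dict] = {}


def record_failure(slug, fingerprint, now=None):
    """Latch a board after confirmed corruption; return the backoff applied."""
    now = time.time() if now is None else now
    prev = disabled_corrupt_boards.get(slug)
    if prev is not None and prev["fingerprint"] == fingerprint:
        # Repeated failure on the same fingerprint: double, up to the cap.
        backoff = min(prev["backoff_seconds"] * 2, MAX_BACKOFF_SEC)
    else:
        backoff = INITIAL_BACKOFF_SEC
    disabled_corrupt_boards[slug] = {
        "disabled_until_ts": now + backoff,
        "backoff_seconds": backoff,
        "fingerprint": fingerprint,
    }
    return backoff


def should_skip(slug, fingerprint, now=None):
    """Return True while the board is still inside its backoff window."""
    now = time.time() if now is None else now
    entry = disabled_corrupt_boards.get(slug)
    if entry is None:
        return False
    if entry["fingerprint"] != fingerprint:
        # The file changed since the failure: clear the latch and retry.
        disabled_corrupt_boards.pop(slug, None)
        return False
    return now < entry["disabled_until_ts"]


def record_success(slug):
    """Clear any latch after a successful dispatch."""
    disabled_corrupt_boards.pop(slug, None)
```

With these constants, consecutive failures on an unchanged fingerprint yield delays of 30, 60, 120, 240, 480, and then 900 seconds, where the cap holds.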
Type of Change
Changes Made
Modified `gateway/run.py` lines ~5118–5189 (`_kanban_dispatcher_watcher` / `_tick_once_for_board`):

- Changed `disabled_corrupt_boards` dict type from `dict[str, tuple]` to `dict[str, dict]` with fields: `disabled_until_ts`, `backoff_seconds`, `fingerprint`
- Added `INITIAL_BACKOFF_SEC = 30.0` and `MAX_BACKOFF_SEC = 900.0` constants
- Added `_confirm_corruption(slug, exc)` function to run `PRAGMA quick_check` before latching
- Updated `_tick_once_for_board` backoff logic to double backoff on repeated failures, reset on fingerprint change or successful dispatch
- Added `import math` if not present

Added comprehensive test suite in `tests/hermes_cli/test_kanban_dispatcher_resilience.py`:

- … `quick_check` passes
- … `quick_check` results are treated as corruption

Added AUTHOR_MAP entry in `scripts/release.py` for steveonjava

How to Test
All 9 new tests should pass. For full validation:
Checklist