
fix(kanban): align failure diagnostics with retry limit #25591

Closed
qWaitCrypto wants to merge 1 commit into NousResearch:main from qWaitCrypto:fix/kanban-diagnostics-failure-limit



qWaitCrypto (Contributor) commented on May 14, 2026

What does this PR do?

Fixes Kanban repeated-failure diagnostics so they line up with the dispatcher's actual circuit-breaker threshold.

On current main, the dispatcher defaults to kanban.failure_limit: 2, but kanban_diagnostics.py still defaulted failure_threshold to 3 and told operators the circuit breaker default was 5. That means a task can auto-block after two consecutive non-success attempts while the repeated-failure diagnostic remains silent.

This PR:

  • Changes the default repeated-failure diagnostic threshold from 3 to 2 to match the dispatcher default.
  • Derives runtime CLI/dashboard diagnostics from kanban.failure_limit unless the user explicitly sets kanban.diagnostics.failure_threshold or the legacy spawn_failure_threshold.
  • Removes the stale hard-coded "default 5" diagnostic text and reports the configured failure limit instead.
  • Adds regression coverage for the default mismatch, custom kanban.failure_limit, and explicit diagnostics-threshold override.
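The threshold precedence described in the list above can be sketched as follows. This is a minimal illustration, not the code in this PR: `resolve_failure_threshold` is a hypothetical name, the config is assumed to be a plain nested mapping, and both the location of the legacy `spawn_failure_threshold` key under `diagnostics` and its priority relative to `failure_threshold` are assumptions.

```python
from typing import Any, Mapping

DEFAULT_FAILURE_LIMIT = 2  # dispatcher default stated in this PR


def resolve_failure_threshold(kanban_cfg: Mapping[str, Any]) -> int:
    """Hypothetical helper: choose the repeated-failure diagnostics threshold."""
    diagnostics = kanban_cfg.get("diagnostics") or {}
    # Explicit diagnostics settings win over the runtime limit.
    # Assumption: both keys live under `diagnostics`, checked in this order.
    for key in ("failure_threshold", "spawn_failure_threshold"):
        value = diagnostics.get(key)
        if value is not None:
            return int(value)
    # No override: follow the dispatcher's circuit-breaker limit.
    limit = kanban_cfg.get("failure_limit")
    return int(limit) if limit is not None else DEFAULT_FAILURE_LIMIT


print(resolve_failure_threshold({}))                    # 2
print(resolve_failure_threshold({"failure_limit": 4}))  # 4
print(resolve_failure_threshold(
    {"failure_limit": 4, "diagnostics": {"failure_threshold": 3}}
))                                                      # 3
```

The three calls mirror the three regression cases named above: the default, a custom `failure_limit`, and an explicit diagnostics override.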

Related Issue

#25641

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • Added config_from_kanban_config() in hermes_cli/kanban_diagnostics.py to translate the runtime kanban config into diagnostics config.
  • Updated hermes_cli/kanban.py and plugins/kanban/dashboard/plugin_api.py to pass the active Kanban config into diagnostics.
  • Updated repeated-failure diagnostic detail/data to include the effective failure_threshold and configured failure_limit.
  • Added tests proving:
    • the default two-failure auto-block threshold now surfaces repeated_failures;
    • custom failure_limit values control diagnostics when no diagnostics override is set;
    • explicit diagnostics thresholds still win over runtime failure_limit.

Reproduction

Before the fix, current main produced no repeated-failure diagnostic at the default dispatcher limit:

kanban.failure_limit = 2
diagnostics.failure_threshold = 3
diagnostic kinds = []
repeated_failures_present = False

After the fix, the same state surfaces the diagnostic and no longer mentions the stale default of 5:

kanban.failure_limit = 2
diagnostics.failure_threshold = 2
diagnostic kinds = ['repeated_failures']
repeated_failures_present = True
detail = The dispatcher circuit breaker is configured for 2 consecutive non-success attempts. Fix the root cause and reclaim or unblock the task to retry.
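The before/after difference comes down to a single comparison, which the sketch below reproduces. It is an illustrative stand-in rather than the logic in `kanban_diagnostics.py`: `repeated_failure_diagnostic` is a hypothetical function, the returned dict shape is assumed, and the `>=` comparison is inferred from the reproduction output (two failures surface the diagnostic at threshold 2 but not at threshold 3).

```python
from typing import Optional


def repeated_failure_diagnostic(
    consecutive_failures: int, failure_threshold: int, failure_limit: int
) -> Optional[dict]:
    """Hypothetical stand-in for the repeated-failure diagnostic rule."""
    # Assumption: the diagnostic fires once failures reach the threshold.
    if consecutive_failures < failure_threshold:
        return None
    return {
        "kind": "repeated_failures",
        "data": {
            "failure_threshold": failure_threshold,
            "failure_limit": failure_limit,
        },
        # Detail text taken from the post-fix output above, with the
        # configured limit substituted for the old hard-coded default.
        "detail": (
            f"The dispatcher circuit breaker is configured for {failure_limit} "
            "consecutive non-success attempts. Fix the root cause and reclaim "
            "or unblock the task to retry."
        ),
    }


# A task with two consecutive failures, before and after the fix.
before = repeated_failure_diagnostic(2, failure_threshold=3, failure_limit=2)
after = repeated_failure_diagnostic(2, failure_threshold=2, failure_limit=2)
print(before)         # None
print(after["kind"])  # repeated_failures
```

With the old threshold of 3, the task is already auto-blocked by the dispatcher at two failures while the diagnostic stays silent, which is the mismatch this PR addresses.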

How to Test

python -m py_compile hermes_cli/kanban_diagnostics.py hermes_cli/kanban.py plugins/kanban/dashboard/plugin_api.py

Result: passed.

/tmp/hermes-provider-refresh-venv/bin/python -m pytest -o addopts= tests/hermes_cli/test_kanban_diagnostics.py

Result: 35 passed in 9.53s.

timeout 60 /tmp/hermes-provider-refresh-venv/bin/python - <<'PY'
# CLI diagnostics smoke: creates a task with two consecutive failures and
# verifies `/kanban diagnostics` surfaces repeated_failures with the default
# failure_limit-derived threshold.
PY

Result: cli diagnostics smoke passed.

timeout 60 /tmp/hermes-provider-refresh-venv/bin/python - <<'PY'
# Dashboard diagnostics smoke: calls plugins.kanban.dashboard.plugin_api
# _compute_task_diagnostics() against the same two-failure state.
PY

Result: dashboard diagnostics smoke passed: repeated_failures.

/tmp/hermes-provider-refresh-venv/bin/python -m pip install fastapi==0.133.1

Result: installed fastapi==0.133.1, starlette==1.0.0, and annotated-doc==0.0.4 into the local test venv so dashboard imports are available.

git diff --check

Result: passed.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: Linux (WSL-style dev environment)

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Screenshots / Logs

See the reproduction and test logs above.

alt-glitch added labels on May 14, 2026: type/bug (Something isn't working), P3 (Low: cosmetic, nice to have), comp/plugins (Plugin system and bundled plugins)
teknium1 (Contributor) commented

Merged via #27868 onto current main. Your commit was cherry-picked with authorship preserved. Thanks!

