fix(kanban): align failure diagnostics with retry limit by qWaitCrypto · Pull Request #25591 · NousResearch/hermes-agent

qWaitCrypto · 2026-05-14T09:32:29Z

What does this PR do?

Fixes Kanban repeated-failure diagnostics so they line up with the dispatcher's actual circuit-breaker threshold.

On current main, the dispatcher defaults to kanban.failure_limit: 2, but kanban_diagnostics.py still defaulted failure_threshold to 3 and told operators the circuit breaker default was 5. That means a task can auto-block after two consecutive non-success attempts while the repeated-failure diagnostic remains silent.

This PR:

Changes the default repeated-failure diagnostic threshold from 3 to 2 to match the dispatcher default.
Derives runtime CLI/dashboard diagnostics from kanban.failure_limit unless the user explicitly sets kanban.diagnostics.failure_threshold or the legacy spawn_failure_threshold.
Removes the stale hard-coded "default 5" diagnostic text and reports the configured failure limit instead.
Adds regression coverage for the default mismatch, custom kanban.failure_limit, and explicit diagnostics-threshold override.
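The precedence described in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in hermes_cli/kanban_diagnostics.py: the function name resolve_failure_threshold is hypothetical, the config is assumed to be a plain nested mapping, the legacy spawn_failure_threshold key is assumed to live in the same diagnostics section, and the order between the two explicit keys is a guess.

```python
from typing import Any, Mapping, Optional

# Dispatcher default stated in this PR description.
DEFAULT_FAILURE_LIMIT = 2


def resolve_failure_threshold(kanban_cfg: Optional[Mapping[str, Any]]) -> int:
    """Hypothetical helper: pick the diagnostics threshold for repeated failures.

    Assumed precedence:
      1. kanban.diagnostics.failure_threshold (explicit override)
      2. legacy spawn_failure_threshold (assumed to sit under diagnostics)
      3. kanban.failure_limit (the dispatcher's circuit-breaker limit)
      4. the dispatcher default of 2
    """
    cfg = kanban_cfg or {}
    diag = cfg.get("diagnostics") or {}

    # An explicit diagnostics setting wins over the runtime failure limit.
    for key in ("failure_threshold", "spawn_failure_threshold"):
        value = diag.get(key)
        if value is not None:
            return int(value)

    # Otherwise follow the dispatcher so diagnostics fire when it auto-blocks.
    return int(cfg.get("failure_limit", DEFAULT_FAILURE_LIMIT))
```

With an empty config the sketch falls back to the dispatcher default, so the diagnostic threshold and the auto-block threshold cannot drift apart unless an operator sets an override on purpose.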

Related Issue

#25641

Type of Change

🐛 Bug fix (non-breaking change that fixes an issue)
✨ New feature (non-breaking change that adds functionality)
🔒 Security fix
📝 Documentation update
✅ Tests (adding or improving test coverage)
♻️ Refactor (no behavior change)
🎯 New skill (bundled or hub)

Changes Made

Added config_from_kanban_config() in hermes_cli/kanban_diagnostics.py to translate the runtime kanban config into diagnostics config.
Updated hermes_cli/kanban.py and plugins/kanban/dashboard/plugin_api.py to pass the active Kanban config into diagnostics.
Updated repeated-failure diagnostic detail/data to include the effective failure_threshold and configured failure_limit.
Added tests proving:
- the default two-failure auto-block threshold now surfaces repeated_failures;
- custom failure_limit values control diagnostics when no diagnostics override is set;
- explicit diagnostics thresholds still win over runtime failure_limit.

Reproduction

Before the fix, current main produced no repeated-failure diagnostic at the default dispatcher limit:

kanban.failure_limit = 2
diagnostics.failure_threshold = 3
diagnostic kinds = []
repeated_failures_present = False

After the fix, the same state surfaces the diagnostic and no longer mentions the stale default of 5:

kanban.failure_limit = 2
diagnostics.failure_threshold = 2
diagnostic kinds = ['repeated_failures']
repeated_failures_present = True
detail = The dispatcher circuit breaker is configured for 2 consecutive non-success attempts. Fix the root cause and reclaim or unblock the task to retry.
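The before/after states above can be reproduced with a small stand-in for the diagnostic rule. This is only a sketch under stated assumptions: repeated_failure_diagnostic is a hypothetical function, the ">=" comparison is inferred from the two logs, and the dictionary layout is illustrative. Only the kind name and the detail wording come from the output shown above.

```python
from typing import Any, Dict, Optional


def repeated_failure_diagnostic(
    consecutive_failures: int,
    failure_limit: int = 2,
    failure_threshold: Optional[int] = None,
) -> Optional[Dict[str, Any]]:
    """Hypothetical stand-in: return a diagnostic once the threshold is reached."""
    # Without an explicit override, the threshold follows the dispatcher limit.
    threshold = failure_limit if failure_threshold is None else failure_threshold

    # Assumption: the diagnostic fires when failures reach the threshold.
    if consecutive_failures < threshold:
        return None

    return {
        "kind": "repeated_failures",
        "detail": (
            "The dispatcher circuit breaker is configured for "
            f"{failure_limit} consecutive non-success attempts. "
            "Fix the root cause and reclaim or unblock the task to retry."
        ),
        "data": {"failure_threshold": threshold, "failure_limit": failure_limit},
    }


# Old behaviour: threshold pinned at 3 while the dispatcher blocks at 2.
before = repeated_failure_diagnostic(2, failure_limit=2, failure_threshold=3)
# New behaviour: threshold derived from failure_limit.
after = repeated_failure_diagnostic(2, failure_limit=2)
```

In this sketch `before` is None, matching the silent state on main, while `after` carries the repeated_failures kind with the configured limit in its detail text.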

How to Test

python -m py_compile hermes_cli/kanban_diagnostics.py hermes_cli/kanban.py plugins/kanban/dashboard/plugin_api.py

Result: passed.

/tmp/hermes-provider-refresh-venv/bin/python -m pytest -o addopts= tests/hermes_cli/test_kanban_diagnostics.py

Result: 35 passed in 9.53s.

timeout 60 /tmp/hermes-provider-refresh-venv/bin/python - <<'PY'
# CLI diagnostics smoke: creates a task with two consecutive failures and
# verifies `/kanban diagnostics` surfaces repeated_failures with the default
# failure_limit-derived threshold.
PY

Result: cli diagnostics smoke passed.

timeout 60 /tmp/hermes-provider-refresh-venv/bin/python - <<'PY'
# Dashboard diagnostics smoke: calls plugins.kanban.dashboard.plugin_api
# _compute_task_diagnostics() against the same two-failure state.
PY

Result: dashboard diagnostics smoke passed: repeated_failures.

/tmp/hermes-provider-refresh-venv/bin/python -m pip install fastapi==0.133.1

Result: installed fastapi==0.133.1, starlette==1.0.0, and annotated-doc==0.0.4 into the local test venv so dashboard imports are available.

git diff --check

Result: passed.

Checklist

Code

I've read the Contributing Guide
My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
I searched for existing PRs to make sure this isn't a duplicate
My PR contains only changes related to this fix/feature (no unrelated commits)
I've run pytest tests/ -q and all tests pass
I've added tests for my changes (required for bug fixes, strongly encouraged for features)
I've tested on my platform: Linux (WSL-style dev environment)

Documentation & Housekeeping

I've updated relevant documentation (README, docs/, docstrings) — or N/A
I've updated cli-config.yaml.example if I added/changed config keys — or N/A
I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
I've considered cross-platform impact (Windows, macOS) per the compatibility guide
I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Screenshots / Logs

See the reproduction and test logs above.

teknium1 · 2026-05-18T08:22:20Z

Merged via #27868 onto current main. Your commit was cherry-picked with authorship preserved. Thanks!

alt-glitch added the type/bug (Something isn't working), P3 (Low: cosmetic, nice to have), and comp/plugins (Plugin system and bundled plugins) labels on May 14, 2026

qWaitCrypto mentioned this pull request May 14, 2026

[Bug]: Kanban long-running workers lack actionable diagnostics and configurable log retention #25641

Closed

1 task

fix(kanban): align failure diagnostics with retry limit

5d91493

qWaitCrypto force-pushed the fix/kanban-diagnostics-failure-limit branch from 454ee15 to 5d91493 Compare May 18, 2026 06:23

teknium1 mentioned this pull request May 18, 2026

fix(kanban): align failure diagnostics with retry limit #27868

Merged

teknium1 closed this May 18, 2026

teknium1 mentioned this pull request May 18, 2026

fix(kanban): surface unusable triage auxiliary model (auto-decompose aware) #27871

Merged

qWaitCrypto wants to merge 1 commit into NousResearch:main from qWaitCrypto:fix/kanban-diagnostics-failure-limit
