Bug Description
I saw a community write-up describing several pain points when running Kanban for long-running, multi-worker workflows. After checking the current implementation, I found three small but concrete operational issues that can be fixed independently:
- Repeated-failure diagnostics can lag behind the dispatcher circuit breaker.
  The dispatcher uses kanban.failure_limit to decide when a task should auto-block, but the diagnostics layer had its own default threshold. In the default setup, a task could hit the dispatcher's failure limit and become blocked before repeated_failures was surfaced to the operator.
- Kanban worker log retention is hard-coded.
  Worker logs rotate at a fixed 2 MiB threshold and keep only one backup generation (.log.1). This is acceptable as a default, but long-running workers can overwrite early failure evidence before the operator has a chance to inspect it. There is currently no config knob to increase retention.
- Triage tasks can remain stuck without a clear diagnostic when the specifier helper is missing.
  hermes kanban specify depends on auxiliary.triage_specifier. If that helper cannot be resolved, the command reports "no auxiliary client configured", but board/task diagnostics do not surface this configuration gap. A rough task can therefore remain in triage without an obvious operator-facing reason.
Steps to Reproduce
Issue 1, repeated-failure diagnostics:
- Use the default Kanban dispatcher failure limit.
- Create a task that records two consecutive non-success worker attempts.
- Check task diagnostics before the fix.
- The dispatcher can already be at its block threshold while repeated_failures is still silent.
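The lag in Issue 1 can be illustrated with a minimal sketch. The constant names, the two helper functions, and the values 2 and 3 are hypothetical stand-ins chosen for illustration, not the actual defaults in hermes_cli. The only point is that two independently defaulted thresholds can disagree, and that reading a single limit removes the gap.

```python
# Hypothetical values for illustration only; the real defaults live in the
# Hermes config and in hermes_cli/kanban_diagnostics.py.
DISPATCHER_FAILURE_LIMIT = 2        # stand-in for kanban.failure_limit
DIAGNOSTICS_DEFAULT_THRESHOLD = 3   # stand-in for the diagnostics layer's own default


def dispatcher_blocks(consecutive_failures: int, failure_limit: int) -> bool:
    """Return True once the dispatcher would auto-block the task."""
    return consecutive_failures >= failure_limit


def repeated_failures_fires(consecutive_failures: int, threshold: int) -> bool:
    """Return True once the repeated_failures diagnostic would be surfaced."""
    return consecutive_failures >= threshold


failures = 2
blocked = dispatcher_blocks(failures, DISPATCHER_FAILURE_LIMIT)
# Separate default: the diagnostic stays silent although the task is blocked.
surfaced_before = repeated_failures_fires(failures, DIAGNOSTICS_DEFAULT_THRESHOLD)
# Aligned: the diagnostic reads the same limit the dispatcher uses.
surfaced_after = repeated_failures_fires(failures, DISPATCHER_FAILURE_LIMIT)
print(blocked, surfaced_before, surfaced_after)
```

Passing the dispatcher's limit into the diagnostic check, rather than keeping a second default, is one way to keep the two from drifting apart again.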
Issue 2, worker log retention:
- Run a long-lived Kanban worker that writes more than 2 MiB of logs.
- Let it rotate multiple times.
- Inspect the worker log directory.
- Only the active log and .log.1 are retained.
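The retention behaviour in Issue 2 can be reproduced in miniature. This is a sketch under stated assumptions: `rotate_log` and `surviving_files` are hypothetical helpers written for this illustration, not the code in hermes_cli/kanban_db.py, and the size threshold is shrunk to 1 byte so the example runs instantly. With `backup_count=1` it mirrors the behaviour described above; a larger value shows what a configurable knob would preserve.

```python
import tempfile
from pathlib import Path


def rotate_log(path: Path, max_bytes: int, backup_count: int) -> bool:
    """Rotate `path` once it reaches `max_bytes`, keeping `backup_count` backups.

    Hypothetical helper: backup_count=1 keeps only the .1 generation.
    """
    if not path.exists() or path.stat().st_size < max_bytes:
        return False
    if backup_count < 1:
        path.unlink()  # no backups requested: just drop the old log
        return True
    # Drop the oldest generation, then shift .N -> .N+1, highest first.
    oldest = path.with_name(f"{path.name}.{backup_count}")
    if oldest.exists():
        oldest.unlink()
    for n in range(backup_count - 1, 0, -1):
        src = path.with_name(f"{path.name}.{n}")
        if src.exists():
            src.rename(path.with_name(f"{path.name}.{n + 1}"))
    path.rename(path.with_name(f"{path.name}.1"))
    return True


def surviving_files(backup_count: int, rotations: int = 4) -> list[str]:
    """Simulate a worker that fills and rotates its log `rotations` times."""
    with tempfile.TemporaryDirectory() as tmp:
        log = Path(tmp) / "worker.log"
        for i in range(rotations):
            log.write_text(f"attempt {i}\n" * 10)
            rotate_log(log, max_bytes=1, backup_count=backup_count)
        log.write_text("active\n")
        return sorted(p.name for p in Path(tmp).iterdir())


print(surviving_files(backup_count=1))
print(surviving_files(backup_count=3))
```

After four rotations, the one-backup run keeps only the active log and the most recent generation, so output from the earliest attempts is gone. The three-backup run still holds three earlier generations.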
Issue 3, missing triage specifier:
- Use config without a visible auxiliary.triage_specifier helper.
- Create a task in the triage column.
- Run board/task diagnostics.
- No diagnostic explains that the triage specifier helper is missing.
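A diagnostic rule for Issue 3 could look roughly like the following. This is a minimal sketch, not the actual Hermes implementation: the `Diagnostic` dataclass, the dict-shaped task with a `status` key, and the plain-mapping config are all assumptions made for illustration. Only the auxiliary.triage_specifier key and the triage_missing_specifier diagnostic name come from this issue.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass
class Diagnostic:
    """Hypothetical container for one operator-facing diagnostic."""
    code: str
    message: str


def triage_missing_specifier(
    task: Mapping[str, Any], config: Mapping[str, Any]
) -> Optional[Diagnostic]:
    """Return a diagnostic for a triage task when no specifier is configured."""
    # Assumed task shape: a mapping whose "status" names the board column.
    if task.get("status") != "triage":
        return None
    auxiliary = config.get("auxiliary") or {}
    if auxiliary.get("triage_specifier"):
        return None  # helper is visibly configured; nothing to report
    return Diagnostic(
        code="triage_missing_specifier",
        message=(
            "Task is in triage but auxiliary.triage_specifier is not "
            "configured, so `hermes kanban specify` has no helper to use."
        ),
    )


print(triage_missing_specifier({"status": "triage"}, {}))
```

The rule only checks whether the helper is visibly present in config. It does not try to resolve or call the helper, which keeps board/task diagnostics cheap to run.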
Expected Behavior
Kanban should provide actionable operator signals for these cases:
- repeated-failure diagnostics should align with the configured dispatcher failure limit;
- worker log retention should remain conservative by default but be configurable for long-running jobs;
- triage tasks should surface a clear diagnostic when the triage specifier is visibly missing.
Actual Behavior
- repeated_failures could remain silent until after the dispatcher threshold had already been reached.
- Worker log retention was fixed at 2 MiB plus one backup generation.
- Missing triage specifier configuration was only visible when manually running hermes kanban specify; it was not surfaced in board/task diagnostics.
Affected Component
CLI (interactive chat), Tools (terminal, file ops, web, code execution, etc.), Agent Core (conversation loop, context compression, memory), Configuration (config.yaml, .env, hermes setup)
Messaging Platform (if gateway-related)
N/A (CLI only)
Debug Report
N/A. These are code-level Kanban diagnostics/configuration issues reproduced against the current source tree.
Operating System
Linux / WSL-style development environment
Python Version
Python 3.13.9
Hermes Version
Current main
Additional Logs / Traceback (optional)
N/A
Root Cause Analysis (optional)
The three issues are independent but related operationally:
- hermes_cli/kanban_diagnostics.py had repeated-failure defaults and messaging that did not consistently follow the runtime kanban.failure_limit.
- hermes_cli/kanban_db.py rotated worker logs with a hard-coded backup policy.
- Kanban diagnostics did not have a rule for triage tasks whose auxiliary.triage_specifier helper is visibly missing.
Proposed Fix (optional)
I split the fixes into focused PRs so each one can be reviewed independently:
- … kanban.failure_limit.
- … 2 MiB / one-backup default.
- … triage_missing_specifier diagnostic for CLI/dashboard diagnostics.
These PRs do not attempt a larger Kanban orchestration redesign. They address the small actionable subset of the reported long-running Kanban pain points in the current implementation.
Are you willing to submit a PR for this?