[Bug]: Kanban long-running workers lack actionable diagnostics and configurable log retention #25641

@qWaitCrypto

Bug Description

I saw a community write-up describing several pain points when running Kanban for long-running, multi-worker workflows. After checking the current implementation, I found three small but concrete operational issues that can be fixed independently:

  1. Repeated-failure diagnostics can lag behind the dispatcher circuit breaker.

    The dispatcher uses kanban.failure_limit to decide when a task should auto-block, but the diagnostics layer had its own default threshold. In the default setup, a task could hit the dispatcher's failure limit and become blocked before repeated_failures was surfaced to the operator.

  2. Kanban worker log retention is hard-coded.

    Worker logs rotate at a fixed 2 MiB threshold and keep only one backup generation (.log.1). This is acceptable as a default, but long-running workers can overwrite early failure evidence before the operator has a chance to inspect it. There is currently no config knob to increase retention.

  3. Triage tasks can remain stuck without a clear diagnostic when the specifier helper is missing.

    hermes kanban specify depends on auxiliary.triage_specifier. If that helper cannot be resolved, the command reports that no auxiliary client is configured, but board/task diagnostics do not surface this configuration gap. A rough task can therefore remain in triage without an obvious operator-facing reason.

Steps to Reproduce

Issue 1, repeated-failure diagnostics:

  1. Use the default Kanban dispatcher failure limit.
  2. Create a task that records two consecutive non-success worker attempts.
  3. Check task diagnostics before the fix.
  4. The dispatcher can already be at its block threshold while repeated_failures is still silent.

Issue 2, worker log retention:

  1. Run a long-lived Kanban worker that writes more than 2 MiB of logs.
  2. Let it rotate multiple times.
  3. Inspect the worker log directory.
  4. Only the active log and .log.1 are retained.
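
For Issue 2, a configurable rotation policy could look roughly like the sketch below. It assumes a size-triggered rotation with numbered suffixes, as described above. The function name `rotate_worker_log` and the `max_bytes` / `backup_count` parameters are hypothetical and are not taken from hermes_cli/kanban_db.py.

```python
import os
import tempfile
from pathlib import Path

DEFAULT_MAX_BYTES = 2 * 1024 * 1024  # 2 MiB threshold described in this issue
DEFAULT_BACKUP_COUNT = 1             # one backup generation (.log.1)


def rotate_worker_log(
    log_path: Path,
    max_bytes: int = DEFAULT_MAX_BYTES,
    backup_count: int = DEFAULT_BACKUP_COUNT,
) -> bool:
    """Rotate log_path once it reaches max_bytes; return True if rotated."""
    if backup_count < 1 or not log_path.exists():
        return False
    if log_path.stat().st_size < max_bytes:
        return False
    # Shift older generations up by one, oldest first, so nothing is
    # overwritten early: .log.2 -> .log.3, then .log.1 -> .log.2.
    for index in range(backup_count - 1, 0, -1):
        older = log_path.with_name(f"{log_path.name}.{index}")
        if older.exists():
            os.replace(older, log_path.with_name(f"{log_path.name}.{index + 1}"))
    # The active log becomes generation 1; the oldest generation is dropped.
    os.replace(log_path, log_path.with_name(f"{log_path.name}.1"))
    return True


# Demonstration with a tiny threshold so every write triggers a rotation.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "worker.log"
    for generation in range(4):
        log.write_text(f"run {generation}\n" * 4)
        rotate_worker_log(log, max_bytes=16, backup_count=3)
    retained = sorted(p.name for p in Path(tmp).iterdir())
    oldest = (Path(tmp) / "worker.log.3").read_text()
```

With `backup_count=1` the sketch behaves like the current policy (only `.log.1` survives). Raising it keeps earlier failure evidence around for longer at the cost of disk space, which is why a conservative default with an opt-in knob seems reasonable.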

Issue 3, missing triage specifier:

  1. Use config without a visible auxiliary.triage_specifier helper.
  2. Create a task in the triage column.
  3. Run board/task diagnostics.
  4. No diagnostic explains that the triage specifier helper is missing.
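
For Issue 3, the missing rule could be as small as the following sketch. It assumes tasks and config are available as plain dictionaries. The field names (`status`, `kind`, `severity`, `message`), the diagnostic kind `triage_specifier_missing`, and the function itself are hypothetical. Only the config key auxiliary.triage_specifier comes from this issue.

```python
from typing import Any, Optional


def triage_specifier_diagnostic(
    task: dict[str, Any], config: dict[str, Any]
) -> Optional[dict[str, str]]:
    """Return a diagnostic for triage tasks when no specifier helper is visible."""
    if task.get("status") != "triage":
        return None  # rule only applies to tasks waiting in triage
    auxiliary = config.get("auxiliary") or {}
    if auxiliary.get("triage_specifier"):
        return None  # a helper is configured; nothing to report
    return {
        "kind": "triage_specifier_missing",
        "severity": "warning",
        "message": (
            "Task is in triage but auxiliary.triage_specifier is not "
            "configured, so 'hermes kanban specify' cannot process it."
        ),
    }


stuck = triage_specifier_diagnostic({"id": "t1", "status": "triage"}, {})
configured = triage_specifier_diagnostic(
    {"id": "t1", "status": "triage"},
    {"auxiliary": {"triage_specifier": {"model": "example"}}},
)
not_triage = triage_specifier_diagnostic({"id": "t2", "status": "ready"}, {})
```

The rule only checks whether the helper is visibly absent from config. It does not try to verify that a configured helper actually resolves at runtime, which keeps the diagnostic cheap and free of side effects.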

Expected Behavior

Kanban should provide actionable operator signals for these cases:

  • repeated-failure diagnostics should align with the configured dispatcher failure limit;
  • worker log retention should remain conservative by default but be configurable for long-running jobs;
  • triage tasks should surface a clear diagnostic when the triage specifier is visibly missing.

Actual Behavior

  • repeated_failures could remain silent until after the dispatcher threshold had already been reached.
  • Worker log retention was fixed at 2 MiB plus one backup generation.
  • Missing triage specifier configuration was only visible when manually running hermes kanban specify; it was not surfaced in board/task diagnostics.

Affected Component

CLI (interactive chat), Tools (terminal, file ops, web, code execution, etc.), Agent Core (conversation loop, context compression, memory), Configuration (config.yaml, .env, hermes setup)

Messaging Platform (if gateway-related)

N/A (CLI only)

Debug Report

N/A. These are code-level Kanban diagnostics/configuration issues reproduced against the current source tree.

Operating System

Linux / WSL-style development environment

Python Version

Python 3.13.9

Hermes Version

Current main

Additional Logs / Traceback (optional)

N/A

Root Cause Analysis (optional)

The three issues are independent but related operationally:

  • hermes_cli/kanban_diagnostics.py had repeated-failure defaults and messaging that did not consistently follow the runtime kanban.failure_limit.
  • hermes_cli/kanban_db.py rotated worker logs with a hard-coded backup policy.
  • Kanban diagnostics did not have a rule for triage tasks whose auxiliary.triage_specifier helper is visibly missing.

Proposed Fix (optional)

I split the fixes into focused PRs so each one can be reviewed independently:

These PRs do not attempt a larger Kanban orchestration redesign. They address the small actionable subset of the reported long-running Kanban pain points in the current implementation.

Are you willing to submit a PR for this?

  • I'd like to fix this myself and submit a PR

Metadata

Assignees

No one assigned

    Labels

    P3 (Low: cosmetic, nice to have)
    comp/plugins (Plugin system and bundled plugins)
    sweeper:implemented-on-main (Sweeper: behavior already present on current main)
    type/feature (New feature or request)
