[Bug]: Kanban long-running workers lack actionable diagnostics and configurable log retention

## Bug Description

I saw a community write-up describing several pain points when running Kanban for long-running, multi-worker workflows. After checking the current implementation, I found three small but concrete operational issues that can be fixed independently:

1. Repeated-failure diagnostics can lag behind the dispatcher circuit breaker.

   The dispatcher uses `kanban.failure_limit` to decide when a task should auto-block, but the diagnostics layer applied its own default threshold instead of reading that setting. In the default setup, a task could therefore hit the dispatcher's failure limit and become blocked before `repeated_failures` was surfaced to the operator.

2. Kanban worker log retention is hard-coded.

   Worker logs rotate at a fixed 2 MiB threshold and keep only one backup generation (`.log.1`). This is acceptable as a default, but long-running workers can overwrite early failure evidence before the operator has a chance to inspect it. There is currently no config knob to increase retention.

3. Triage tasks can remain stuck without a clear diagnostic when the specifier helper is missing.

   `hermes kanban specify` depends on `auxiliary.triage_specifier`. If that helper cannot be resolved, the command reports `no auxiliary client configured`, but board/task diagnostics do not surface this configuration gap. A rough task can therefore remain in `triage` without an obvious operator-facing reason.

## Steps to Reproduce

Issue 1, repeated-failure diagnostics:

1. Use the default Kanban dispatcher failure limit.
2. Create a task that records two consecutive non-success worker attempts.
3. Check task diagnostics before the fix.
4. The dispatcher can already be at its block threshold while `repeated_failures` is still silent.
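
A minimal sketch of the mismatch, not the Hermes implementation. The function names and the numeric values `2` and `3` are assumptions chosen only to mirror the steps above; the real defaults live in the dispatcher and in `hermes_cli/kanban_diagnostics.py`.

```python
from typing import Optional

# Illustrative values only (assumptions, not the real Hermes defaults).
DISPATCHER_FAILURE_LIMIT = 2   # stands in for `kanban.failure_limit`
DIAGNOSTICS_OWN_THRESHOLD = 3  # stands in for the diagnostics layer's separate default


def dispatcher_blocks(consecutive_failures: int, failure_limit: int) -> bool:
    """Circuit breaker: auto-block once the failure limit is reached."""
    return consecutive_failures >= failure_limit


def repeated_failures_fires(
    consecutive_failures: int,
    failure_limit: int,
    own_threshold: Optional[int] = None,
) -> bool:
    """Diagnostic rule: use its own threshold if given, else follow the limit."""
    threshold = failure_limit if own_threshold is None else own_threshold
    return consecutive_failures >= threshold


failures = 2  # two consecutive non-success attempts

# The dispatcher has already tripped...
blocked = dispatcher_blocks(failures, DISPATCHER_FAILURE_LIMIT)

# ...while a diagnostic with an independent threshold stays silent.
silent_before = not repeated_failures_fires(
    failures, DISPATCHER_FAILURE_LIMIT, DIAGNOSTICS_OWN_THRESHOLD
)

# Deriving the threshold from the same limit removes the lag.
aligned_after = repeated_failures_fires(failures, DISPATCHER_FAILURE_LIMIT)
```

Deriving the diagnostic threshold from the same setting means the operator sees `repeated_failures` no later than the auto-block.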

Issue 2, worker log retention:

1. Run a long-lived Kanban worker that writes more than 2 MiB of logs.
2. Let it rotate multiple times.
3. Inspect the worker log directory.
4. Only the active log and `.log.1` are retained.
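
The effect of the backup count can be reproduced in isolation. The sketch below is a standalone rotation routine written for illustration; it is not the code in `hermes_cli/kanban_db.py`, and `rotate`, `surviving_generations`, and the tiny 32-byte threshold are assumptions used to keep the demo fast.

```python
import os
import tempfile


def rotate(log_path: str, max_bytes: int, backups: int) -> None:
    """Shift log -> .1 -> .2 ... once log_path reaches max_bytes."""
    if not os.path.exists(log_path) or os.path.getsize(log_path) < max_bytes:
        return
    if backups < 1:
        os.remove(log_path)
        return
    # Drop the oldest generation, then shift the remaining ones up by one.
    oldest = f"{log_path}.{backups}"
    if os.path.exists(oldest):
        os.remove(oldest)
    for n in range(backups - 1, 0, -1):
        src = f"{log_path}.{n}"
        if os.path.exists(src):
            os.replace(src, f"{log_path}.{n + 1}")
    os.replace(log_path, f"{log_path}.1")


def surviving_generations(backups: int, rotations: int = 4) -> list[str]:
    """Write `rotations` oversized logs and report which ones survive."""
    with tempfile.TemporaryDirectory() as tmp:
        log_path = os.path.join(tmp, "worker.log")
        for i in range(rotations):
            with open(log_path, "w") as fh:
                fh.write(f"generation {i}\n" + "x" * 64)
            rotate(log_path, max_bytes=32, backups=backups)
        survivors = []
        for name in sorted(os.listdir(tmp)):
            with open(os.path.join(tmp, name)) as fh:
                survivors.append(fh.readline().strip())
        return survivors


print(surviving_generations(backups=1))  # only the most recent rotated log
print(surviving_generations(backups=3))  # three most recent rotated logs
```

With one backup, every rotation discards the previous generation, so the earliest failure output is gone after the second rotation. A larger backup count delays that loss but does not prevent it.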

Issue 3, missing triage specifier:

1. Use a config that does not define a visible `auxiliary.triage_specifier` helper.
2. Create a task in the `triage` column.
3. Run board/task diagnostics.
4. No diagnostic explains that the triage specifier helper is missing.
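
A rule covering this case could look roughly like the sketch below. It is an assumption-heavy illustration: the task fields (`id`, `status`), the shape of the returned dict, and the function signature are hypothetical, and only the `auxiliary.triage_specifier` key and the `triage_missing_specifier` rule name come from this issue.

```python
from typing import Any, Mapping, Optional


def triage_missing_specifier(
    task: Mapping[str, Any], config: Mapping[str, Any]
) -> Optional[dict]:
    """Return a diagnostic for a triage task with no visible specifier helper.

    Returns None when the task is not in triage or the helper is configured.
    """
    if task.get("status") != "triage":  # hypothetical field name
        return None
    auxiliary = config.get("auxiliary") or {}
    if auxiliary.get("triage_specifier"):
        return None
    return {
        "rule": "triage_missing_specifier",
        "task_id": task.get("id"),  # hypothetical field name
        "message": (
            "Task is in triage but auxiliary.triage_specifier is not "
            "configured, so `hermes kanban specify` cannot process it."
        ),
    }


stuck = triage_missing_specifier({"id": "t1", "status": "triage"}, {})
configured = triage_missing_specifier(
    {"id": "t1", "status": "triage"},
    {"auxiliary": {"triage_specifier": {"model": "example"}}},
)
```

The rule is deliberately conservative: it fires only when the helper is visibly absent from the config, and it says nothing about a helper that is configured but fails at runtime.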

## Expected Behavior

Kanban should provide actionable operator signals for these cases:

- repeated-failure diagnostics should align with the configured dispatcher failure limit;
- worker log retention should remain conservative by default but be configurable for long-running jobs;
- triage tasks should surface a clear diagnostic when the triage specifier is visibly missing.

## Actual Behavior

- `repeated_failures` could remain silent until after the dispatcher threshold had already been reached.
- Worker log retention was fixed at `2 MiB` plus one backup generation.
- Missing triage specifier configuration was only visible when manually running `hermes kanban specify`; it was not surfaced in board/task diagnostics.

## Affected Component

CLI (interactive chat), Tools (terminal, file ops, web, code execution, etc.), Agent Core (conversation loop, context compression, memory), Configuration (config.yaml, .env, hermes setup)

## Messaging Platform (if gateway-related)

N/A (CLI only)

## Debug Report

N/A. These are code-level Kanban diagnostics/configuration issues reproduced against the current source tree.

## Operating System

Linux / WSL-style development environment

## Python Version

Python 3.13.9

## Hermes Version

Current `main`

## Additional Logs / Traceback (optional)

N/A

## Root Cause Analysis (optional)

The three issues are independent but related operationally:

- `hermes_cli/kanban_diagnostics.py` had repeated-failure defaults and messaging that did not consistently follow the runtime `kanban.failure_limit`.
- `hermes_cli/kanban_db.py` rotated worker logs with a hard-coded backup policy.
- Kanban diagnostics did not have a rule for triage tasks whose `auxiliary.triage_specifier` helper is visibly missing.

## Proposed Fix (optional)

I split the fixes into focused PRs so each one can be reviewed independently:

- #25591 aligns repeated-failure diagnostics with `kanban.failure_limit`.
- #25639 makes Kanban worker log retention configurable while preserving the existing `2 MiB` / one-backup default.
- #25640 adds a conservative `triage_missing_specifier` diagnostic for CLI/dashboard diagnostics.
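
To illustrate what "configurable while preserving the default" means for log retention, here is a hedged sketch of how such settings might be resolved. The key names `worker_log_max_bytes` and `worker_log_backups` are hypothetical placeholders, not the names used in #25639; only the `2 MiB` / one-backup default is taken from this issue.

```python
from dataclasses import dataclass
from typing import Any, Mapping

DEFAULT_MAX_BYTES = 2 * 1024 * 1024  # 2 MiB, the existing default
DEFAULT_BACKUPS = 1                  # one backup generation (`.log.1`)


@dataclass(frozen=True)
class LogRetention:
    max_bytes: int
    backups: int


def _positive_int(value: Any, default: int) -> int:
    """Accept only positive ints; anything else falls back to the default."""
    if isinstance(value, bool) or not isinstance(value, int) or value < 1:
        return default
    return value


def resolve_log_retention(config: Mapping[str, Any]) -> LogRetention:
    """Read retention settings from a config mapping, keeping safe defaults."""
    kanban = config.get("kanban") or {}
    return LogRetention(
        # Hypothetical key names, used here for illustration only.
        max_bytes=_positive_int(kanban.get("worker_log_max_bytes"), DEFAULT_MAX_BYTES),
        backups=_positive_int(kanban.get("worker_log_backups"), DEFAULT_BACKUPS),
    )


default_policy = resolve_log_retention({})
long_running = resolve_log_retention({"kanban": {"worker_log_backups": 5}})
```

Falling back to the defaults on missing or invalid values keeps existing installations unchanged unless an operator opts in.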

These PRs do not attempt a larger Kanban orchestration redesign. They address the small actionable subset of the reported long-running Kanban pain points in the current implementation.

## Are you willing to submit a PR for this?

- [x] I'd like to fix this myself and submit a PR

