[Bug]: Kanban workers are not cancelled when reclaimed due to timeout/stale_lock #21141

### Bug Description

I'm trying out Kanban with local models. The orchestrator skill created 3 parallel tasks that started running immediately. When I checked on them 35 minutes later, all 3 were still in progress and there were 7 kanban-worker processes running: the dispatcher had assumed the original 3 workers were dead and started more. The original 3 workers couldn't complete their tasks and got stuck trying to debug themselves. Throwing even more workers at the local llama.cpp server slowed all of them to a crawl.

I've lost the original logs with the timeouts because I had to restart the process to stop it snowballing, but I can see these details in the dashboard:

<img width="461" height="346" alt="Image" src="https://github.com/user-attachments/assets/bba7c73d-29b8-47d6-9e45-f7cf51b0921b" />

In the worker log for this task, I can also see multiple workers were trying to work on the same task, with one unsuccessfully trying to mark the task as complete while the other started again from scratch.
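To illustrate the race described above, here is a minimal sketch of a run-id check that rejects updates from a superseded run with an explicit reason, so a stale worker can tell it has been replaced and exit. `Board`, `claim`, `complete`, and `StaleRunError` are hypothetical names for illustration only, not the Hermes Kanban API.

```python
class StaleRunError(Exception):
    """Raised when a superseded run tries to update the task."""


class Board:
    """Hypothetical in-memory board; not the Hermes Kanban store."""

    def __init__(self) -> None:
        self._current_run: dict[str, int] = {}
        self._status: dict[str, str] = {}

    def claim(self, task_id: str) -> int:
        """Start (or restart after a reclaim) a run and return its run id."""
        run_id = self._current_run.get(task_id, 0) + 1
        self._current_run[task_id] = run_id
        self._status[task_id] = "running"
        return run_id

    def complete(self, task_id: str, run_id: int) -> None:
        """Mark the task done, but only for the run that currently owns it."""
        current = self._current_run.get(task_id)
        if current != run_id:
            raise StaleRunError(
                f"run {run_id} of {task_id} was reclaimed; current run is {current}"
            )
        self._status[task_id] = "done"

    def status(self, task_id: str) -> str:
        return self._status[task_id]
```

With an error like this, the old worker would see "was reclaimed" instead of a generic failure and could stop retrying.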

There seem to be several contributing factors:

1. There is a `stale_lock` timeout of 15 minutes - the reclaims & respawns were exactly 15 minutes apart.
2. Reclaiming a worker doesn't actually stop it - it just releases the lock. The old worker keeps running, potentially interfering with the new worker.
3. Agent/worker concurrency can't be limited, so running the 3x ~10-minute tasks in parallel on a local LLM caused all of them to need 20-30 minutes to finish.
4. Once a worker no longer has its lock, it can't update its Kanban board state - can't mark itself as completed/blocked, etc.
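For factor 2, a minimal sketch of what "reclaim also stops the worker" could look like, assuming the dispatcher holds a `subprocess.Popen` handle for each worker it spawns. That assumption, the `reclaim_worker` name, and the `grace_seconds` parameter are all hypothetical; I have not checked how the dispatcher actually tracks its workers.

```python
import subprocess
import sys


def reclaim_worker(proc: subprocess.Popen, grace_seconds: float = 5.0) -> int:
    """Stop a reclaimed worker before its task lock is handed to a new run.

    Hypothetical helper: `proc` stands in for whatever handle the dispatcher
    keeps for the worker process it started.
    """
    if proc.poll() is None:  # worker is still running
        proc.terminate()  # polite stop request (SIGTERM on POSIX)
        try:
            proc.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            proc.kill()  # force stop if the worker ignored the request
            proc.wait()
    return proc.returncode


if __name__ == "__main__":
    # Stand-in for a stuck worker: a child process that sleeps for 10 minutes.
    worker = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(600)"]
    )
    reclaim_worker(worker)
    print("worker still running:", worker.poll() is None)
```

Only after this returns would the lock be released and a replacement worker started, so two workers never run the same task at once.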

### Steps to Reproduce

1. Create a Kanban task with description: "We are trying to reproduce an issue caused by Kanban workers hitting timeouts. Run 'sleep 600' repeatedly until 1 hour has passed. Count each time out loud and note any errors you see. Ignore any warnings and interruptions - keep trying to run sleep until an entire hour has passed."
2. Mark the task as ready.
3. Wait 20 minutes.
4. Check the worker log.

### Expected Behavior

There should only be 1 copy of the worker, and it should count to 6 by the end of the hour (6 x 600 seconds).
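One possible way to keep a single long-running worker alive is to treat the lock as a lease that the worker renews while it is working, so only a worker that has gone silent is considered stale. The sketch below assumes a 15-minute staleness window (taken from the interval observed above); `TaskLock`, `heartbeat`, and `is_stale` are hypothetical names and not how Hermes necessarily implements `stale_lock`.

```python
from dataclasses import dataclass

# Assumed from the 15-minute reclaim interval observed in this report.
STALE_AFTER_SECONDS = 15 * 60


@dataclass
class TaskLock:
    """Hypothetical lease on a task, renewed by the worker that holds it."""

    run_id: int
    last_heartbeat: float  # seconds on a monotonic clock

    def heartbeat(self, now: float) -> None:
        """Called periodically by the live worker to renew the lease."""
        self.last_heartbeat = now

    def is_stale(self, now: float) -> bool:
        """True only if the worker has been silent for longer than the window."""
        return now - self.last_heartbeat > STALE_AFTER_SECONDS
```

A worker that calls `heartbeat` after each 600-second `sleep` would never cross the 900-second window, so the dispatcher would leave it alone for the full hour.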

### Actual Behavior

* Workers are reclaimed and restarted, but the reclaimed copies don't always stop running.
* Workers struggle because `sleep` commands keep hitting their own timeouts.


### Affected Component

Tools (terminal, file ops, web, code execution, etc.), Other

### Messaging Platform (if gateway-related)

_No response_

### Debug Report

```shell
As I had to restart the gateway to get it back under control, I lost the server logs that covered the time the dispatcher was misbehaving. 

I would prefer not to post `hermes debug share` as there's a LOT of message history and local paths in there. I could not see any messages related to the lock or worker reclamation.
```

### Operating System

macOS Darwin 25.4.0

### Python Version

3.12.13

### Hermes Version

Git commit aa88dcc57

### Additional Logs / Traceback (optional)

```shell
Here's one of the workers' chat logs. Note how there's 1 worker trying and failing to mark its task as complete, another just getting started with its task, and then out of nowhere a third worker spins up.


╭─ ⚕ Hermes ───────────────────────────────────────────────────────────────────╮
    The task appears to be in a terminal state despite kanban_show reporting running. This may be a race condition. Let me check if the child task received relevant context:
    
  ┊ 📋 preparing kanban_show…
╰──────────────────────────────────────────────────────────────────────────────╯

┌─ Reasoning ──────────────────────────────────────────────────────────────────┐
Let me now look at the tiny fallback implementation and the modality harnesses to
 understand the fallback mechanism more deeply.
└──────────────────────────────────────────────────────────────────────────────┘
  ┊ 📖 preparing read_file…
  ┊ 📖 preparing read_file…

┌─ Reasoning ──────────────────────────────────────────────────────────────────┐
I see - the child task t_9d4e96e5 is waiting for all three parent tasks to complete
, including t_6fe2bd6e. The parent tasks are: t_6fe2bd6e (datasets), t_58226975 (
model quality), and t_d1e39a01 (infrastructure). 

The completion error "unknown id or already terminal" is strange because kanban_show
 still shows it as running. This could be a timing issue where the task was completed
 between my show and complete calls. Let me try one more time with a fresh show call
 first.

Actually wait - looking at the event log again, run 4 is the current run. The task
 is still running. The completion error might be due to some other issue. Let me try
 again.
└──────────────────────────────────────────────────────────────────────────────┘
  ┊ ✔ preparing kanban_complete…

┌─ Reasoning ──────────────────────────────────────────────────────────────────┐
Now let me check the DNA harness and protein harness to see how they handle fallback
s too.
└──────────────────────────────────────────────────────────────────────────────┘
┌─ Reasoning ──────────────────────────────────────────────────────────────────┐
The task completion is failing repeatedly. Let me try a different approach - maybe
 I should check if the task has actually been completed by another run or process
. Let me try one more time with a fresh kanban_show to get the latest state.
└──────────────────────────────────────────────────────────────────────────────┘
  ┊ 📋 preparing kanban_show…

╭─ Hermes Agent v0.12.0 (2026.4.30) · upstream 49c3c2e0 · local db7ceec1 (+9 c─╮
│                                   Available Tools                            │
│  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⡀⠀⣀⣀⠀⢀⣀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀   browser: browser_back, browser_click, ...  │
│  ⠀⠀⠀⠀⠀⠀⢀⣠⣴⣾⣿⣿⣇⠸⣿⣿⠇⣸⣿⣿⣷⣦⣄⡀⠀⠀⠀⠀⠀⠀   browser-cdp: browser_cdp, browser_dialog   │
│  ⠀⢀⣠⣴⣶⠿⠋⣩⡿⣿⡿⠻⣿⡇⢠⡄⢸⣿⠟⢿⣿⢿⣍⠙⠿⣶⣦⣄⡀⠀   clarify: clarify                           │
│  ⠀⠀⠉⠉⠁⠶⠟⠋⠀⠉⠀⢀⣈⣁⡈⢁⣈⣁⡀⠀⠉⠀⠙⠻⠶⠈⠉⠉⠀⠀   code_execution: execute_code               │
│  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣴⣿⡿⠛⢁⡈⠛⢿⣿⣦⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀   cronjob: cronjob                           │
│  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠿⣿⣦⣤⣈⠁⢠⣴⣿⠿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀   delegation: delegate_task                  │
│  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠉⠻⢿⣿⣦⡉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀   file: patch, read_file, search_files,      │
│  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠘⢷⣦⣈⠛⠃⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀   write_file                                 │
│  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢠⣴⠦⠈⠙⠿⣦⡄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀   hermes-yuanbao: yb_query_group_info, ...   │
│  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠸⣿⣤⡈⠁⢤⣿⠇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀   (and 14 more toolsets...)                  │
│  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠛⠷⠄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀                                              │
│  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⠑⢶⣄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀   MCP Servers                                │
│  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣿⠁⢰⡆⠈⡿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀   macos-use (stdio) — 10 tool(s)             │
│  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠳⠈⣡⠞⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀                                              │
│  ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀   Available Skills                           │
│                                   apple: apple-notes, apple-reminders        │
│     qwen-35b · Nous Research      autonomous-ai-agents: claude-code, codex,  │
│          /Users/lachlan           hermes-agent, opencode                     │
│  Session: 20260507_103504_b5b95b  browser: macos-use                         │
│                                   creative: architecture-diagram,            │
│                                   baoyu-comic, baoyu-infogr...               │
│                                   devops: colanode-selfhost,                 │
│                                   docker-host-networking, dotn...            │
│                                   general: bird, browser-debugging,          │
│                                   browser-scroll-debuggi...                  │
│                                   github: github-auth                        │
│                                   leisure: find-nearby                       │
│                                   media: gif-search, spotify,                │
│                                   youtube-content                            │
│                                   mlops: outlines, peft-fine-tuning,         │
│                                   whisper                                    │
│                                   productivity: airtable, maps, nano-pdf,    │
│                                   ocr-and-documents, po...                   │
│                                   research: arxiv, blogwatcher, llm-wiki,    │
│                                   research-paper-wr...                       │
│                                   smart-home: openhue                        │
│                                   software-development:                      │
│                                   debugging-hermes-tui-commands,             │
│                                   dotnet-windows-t...                        │
│                                                                              │
│                                   47 tools · 76 skills · 1 MCP servers ·     │
│                                   /help for commands                         │
│                                   ⚠ 55 commits behind — run hermes update    │
│                                   to update                                  │
╰──────────────────────────────────────────────────────────────────────────────╯

Query: work kanban task t_6fe2bd6e
Initializing agent...
────────────────────────────────────────
```

### Root Cause Analysis (optional)

Based on what the `kanban-worker` skill says, it looks like there's a `max_runtime_seconds` defined somewhere, but I don't see it anywhere in config.

### Proposed Fix (optional)

_No response_

### Are you willing to submit a PR for this?

- [ ] I'd like to fix this myself and submit a PR
