Bug Description
I'm trying out Kanban with local models. The orchestrator skill created 3 parallel tasks that immediately started running. 35 minutes later I went to check on them and found they were all still in-progress and there were 7 kanban-worker processes running. The dispatcher assumed the original 3 workers were dead and started more. The original 3 workers couldn't complete their tasks and got stuck trying to debug themselves. Throwing even more workers at the local llama.cpp server caused them all to grind to a crawl.
I've lost the original logs with the timeouts because I had to restart the process to stop it snowballing, but I can see these details in the dashboard:
In the worker log for this task, I can also see multiple workers were trying to work on the same task, with one unsuccessfully trying to mark the task as complete while the other started again from scratch.
There seem to be several contributing factors:
- There is a `stale_lock` timeout of 15 minutes - the reclaims & respawns were exactly 15 minutes apart.
- Reclaiming a worker doesn't actually stop it - it just releases the lock. The old worker keeps running, potentially interfering with the new worker.
- Agent/worker concurrency can't be limited, so running the 3x ~10-minute tasks in parallel on a local LLM caused all of them to need 20-30 minutes to finish.
- Once a worker no longer has its lock, it can't update its Kanban board state - can't mark itself as completed/blocked, etc.
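The factors above interact. As a minimal sketch of one way a dispatcher could avoid them (this is not Hermes's actual implementation; `TaskLeases`, `Lease`, and all method names here are hypothetical), a lease with heartbeats keeps a slow but live worker from being treated as dead, and a fencing token lets the board reject writes from a worker that has been replaced while telling the dispatcher which owner it needs to stop:

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    owner: str         # id of the worker holding the task
    token: int         # fencing token, bumped on every (re)claim
    expires_at: float  # deadline on the injected clock; heartbeats push it forward


class TaskLeases:
    """Hypothetical in-memory lease table, for illustration only."""

    def __init__(self, ttl_seconds: float = 900.0, clock=time.monotonic):
        self.ttl = ttl_seconds  # 900 s mirrors the 15-minute stale-lock window
        self.clock = clock
        self._leases: dict[str, Lease] = {}
        self._next_token = 0

    def claim(self, task_id: str, worker_id: str) -> tuple[Optional[Lease], Optional[str]]:
        """Return (lease, evicted_owner); lease is None while a live lease exists."""
        now = self.clock()
        current = self._leases.get(task_id)
        if current is not None and current.expires_at > now:
            return None, None
        self._next_token += 1
        lease = Lease(worker_id, self._next_token, now + self.ttl)
        self._leases[task_id] = lease
        # The caller is expected to terminate the evicted owner's process.
        return lease, (current.owner if current else None)

    def heartbeat(self, task_id: str, token: int) -> bool:
        """Extend the lease; False means the lease was lost and the worker should exit."""
        current = self._leases.get(task_id)
        if current is None or current.token != token:
            return False
        current.expires_at = self.clock() + self.ttl
        return True

    def complete(self, task_id: str, token: int) -> bool:
        """Only the current lease holder may complete the task."""
        current = self._leases.get(task_id)
        if current is None or current.token != token:
            return False
        del self._leases[task_id]
        return True
```

In this sketch a worker that is merely slow keeps calling `heartbeat` and is never reclaimed, while a worker whose `heartbeat` returns `False` knows it has been replaced and can exit instead of retrying a completion call that will never succeed.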
Steps to Reproduce
- Create a Kanban task with description: "We are trying to reproduce an issue caused by Kanban workers hitting timeouts. Run 'sleep 600' repeatedly until 1 hour has passed. Count each time out loud and note any errors you see. Ignore any warnings and interruptions - keep trying to run sleep until an entire hour has passed.".
- Mark the task as ready.
- Wait 20 minutes.
- Check the worker log.
Expected Behavior
There should only be 1 copy of the worker, and it should have counted to 6.
Actual Behavior
- Workers are reclaimed and restarted, but the reclaimed copies don't always stop running.
- Workers struggle because `sleep` commands keep hitting their own timeouts.
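To put rough numbers on this, here is a back-of-the-envelope sketch. It assumes the dispatcher respawns a task every 15 minutes (the interval reported above), that the task never finishes within the window, and that no reclaimed worker ever exits; the function name is made up for illustration.

```python
def max_live_workers(tasks: int, elapsed_min: int, reclaim_interval_min: int = 15) -> int:
    """Upper bound on worker processes if no reclaimed worker ever exits."""
    # One initial spawn per task, plus one respawn per elapsed reclaim interval.
    spawns_per_task = 1 + elapsed_min // reclaim_interval_min
    return tasks * spawns_per_task


print(max_live_workers(tasks=3, elapsed_min=35))  # 9
print(max_live_workers(tasks=1, elapsed_min=20))  # 2
```

The 7 processes seen after 35 minutes sit below the bound of 9, which fits the observation that reclaimed copies don't always stop: some had exited, others had not. For the reproduction steps, the same assumptions predict up to 2 copies of the worker at the 20-minute mark instead of the expected 1.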
Affected Component
Tools (terminal, file ops, web, code execution, etc.), Other
Messaging Platform (if gateway-related)
No response
Debug Report
As I had to restart the gateway to get it back under control, I lost the server logs that covered the time the dispatcher was misbehaving.
I would prefer not to post `hermes debug share` as there's a LOT of message history and local paths in there. I could not see any messages related to the lock or worker reclamation.
Operating System
macOS Darwin 25.4.0
Python Version
3.12.13
Hermes Version
Git commit aa88dcc
Additional Logs / Traceback (optional)
Here's one of the workers' chat logs. Note how there's 1 worker trying and failing to mark its task as complete, another just getting started with its task, and then out of nowhere a third worker spins up.
╭─ ⚕ Hermes ───────────────────────────────────────────────────────────────────╮
The task appears to be in a terminal state despite kanban_show reporting running. This may be a race condition. Let me check if the child task received relevant context:
┊ 📋 preparing kanban_show…
╰──────────────────────────────────────────────────────────────────────────────╯
┌─ Reasoning ──────────────────────────────────────────────────────────────────┐
Let me now look at the tiny fallback implementation and the modality harnesses to
understand the fallback mechanism more deeply.
└──────────────────────────────────────────────────────────────────────────────┘
┊ 📖 preparing read_file…
┊ 📖 preparing read_file…
┌─ Reasoning ──────────────────────────────────────────────────────────────────┐
I see - the child task t_9d4e96e5 is waiting for all three parent tasks to complete
, including t_6fe2bd6e. The parent tasks are: t_6fe2bd6e (datasets), t_58226975 (
model quality), and t_d1e39a01 (infrastructure).
The completion error "unknown id or already terminal" is strange because kanban_show
still shows it as running. This could be a timing issue where the task was completed
between my show and complete calls. Let me try one more time with a fresh show call
first.
Actually wait - looking at the event log again, run 4 is the current run. The task
is still running. The completion error might be due to some other issue. Let me try
again.
└──────────────────────────────────────────────────────────────────────────────┘
┊ ✔ preparing kanban_complete…
┌─ Reasoning ──────────────────────────────────────────────────────────────────┐
Now let me check the DNA harness and protein harness to see how they handle fallback
s too.
└──────────────────────────────────────────────────────────────────────────────┘
┌─ Reasoning ──────────────────────────────────────────────────────────────────┐
The task completion is failing repeatedly. Let me try a different approach - maybe
I should check if the task has actually been completed by another run or process
. Let me try one more time with a fresh kanban_show to get the latest state.
└──────────────────────────────────────────────────────────────────────────────┘
┊ 📋 preparing kanban_show…
╭─ Hermes Agent v0.12.0 (2026.4.30) · upstream 49c3c2e0 · local db7ceec1 (+9 c─╮
│ Available Tools │
│ ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⡀⠀⣀⣀⠀⢀⣀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ browser: browser_back, browser_click, ... │
│ ⠀⠀⠀⠀⠀⠀⢀⣠⣴⣾⣿⣿⣇⠸⣿⣿⠇⣸⣿⣿⣷⣦⣄⡀⠀⠀⠀⠀⠀⠀ browser-cdp: browser_cdp, browser_dialog │
│ ⠀⢀⣠⣴⣶⠿⠋⣩⡿⣿⡿⠻⣿⡇⢠⡄⢸⣿⠟⢿⣿⢿⣍⠙⠿⣶⣦⣄⡀⠀ clarify: clarify │
│ ⠀⠀⠉⠉⠁⠶⠟⠋⠀⠉⠀⢀⣈⣁⡈⢁⣈⣁⡀⠀⠉⠀⠙⠻⠶⠈⠉⠉⠀⠀ code_execution: execute_code │
│ ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣴⣿⡿⠛⢁⡈⠛⢿⣿⣦⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ cronjob: cronjob │
│ ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠿⣿⣦⣤⣈⠁⢠⣴⣿⠿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ delegation: delegate_task │
│ ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠉⠻⢿⣿⣦⡉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ file: patch, read_file, search_files, │
│ ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠘⢷⣦⣈⠛⠃⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ write_file │
│ ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢠⣴⠦⠈⠙⠿⣦⡄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ hermes-yuanbao: yb_query_group_info, ... │
│ ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠸⣿⣤⡈⠁⢤⣿⠇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ (and 14 more toolsets...) │
│ ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠛⠷⠄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ │
│ ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⠑⢶⣄⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ MCP Servers │
│ ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣿⠁⢰⡆⠈⡿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ macos-use (stdio) — 10 tool(s) │
│ ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠳⠈⣡⠞⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ │
│ ⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀ Available Skills │
│ apple: apple-notes, apple-reminders │
│ qwen-35b · Nous Research autonomous-ai-agents: claude-code, codex, │
│ /Users/lachlan hermes-agent, opencode │
│ Session: 20260507_103504_b5b95b browser: macos-use │
│ creative: architecture-diagram, │
│ baoyu-comic, baoyu-infogr... │
│ devops: colanode-selfhost, │
│ docker-host-networking, dotn... │
│ general: bird, browser-debugging, │
│ browser-scroll-debuggi... │
│ github: github-auth │
│ leisure: find-nearby │
│ media: gif-search, spotify, │
│ youtube-content │
│ mlops: outlines, peft-fine-tuning, │
│ whisper │
│ productivity: airtable, maps, nano-pdf, │
│ ocr-and-documents, po... │
│ research: arxiv, blogwatcher, llm-wiki, │
│ research-paper-wr... │
│ smart-home: openhue │
│ software-development: │
│ debugging-hermes-tui-commands, │
│ dotnet-windows-t... │
│ │
│ 47 tools · 76 skills · 1 MCP servers · │
│ /help for commands │
│ ⚠ 55 commits behind — run hermes update │
│ to update │
╰──────────────────────────────────────────────────────────────────────────────╯
Query: work kanban task t_6fe2bd6e
Initializing agent...
────────────────────────────────────────
Root Cause Analysis (optional)
Based on what the kanban-worker skill says, it looks like there's a max_runtime_seconds defined somewhere, but I don't see it anywhere in config.
Proposed Fix (optional)
No response
Are you willing to submit a PR for this?