recurring stall loop: kanban workers repeatedly reclaimed with stale_lock, zero output across respawns #23025

@fwends

Bug Description

Kanban workers on the default board repeatedly enter a stall loop:

  1. Worker spawns (status: running)
  2. Worker produces no output — workspace stays empty, no files written
  3. Worker is reclaimed after ~15 min (status: reclaimed, error: stale_lock=gregs-MacBook-Pro.local:68653)
  4. Dispatcher immediately respawns a new worker
  5. New worker immediately hits the same pattern
  6. This repeats 3-5 times with zero progress before the task either completes or is abandoned

Affected tasks: t_805fc503, t_57dabea4 (both on kimicoder / kimi-k2.6)

Profile config (kimicoder):

  • max_turns: 200
  • gateway_timeout: 1800
  • terminal.backend: local
  • terminal.timeout: 180

Evidence

Run history for t_805fc503:

  • Run 81: reclaimed (stale_lock)
  • Run 114: reclaimed (stale_lock)
  • Run 115: reclaimed (stale_lock)
  • Run 116: running (current, workspace empty after 10+ min)

Gateway log cycling:

2026-05-10 12:53:33 spawned=1 reclaimed=1
2026-05-10 13:08:35 spawned=1 reclaimed=1
2026-05-10 13:23:37 spawned=1 reclaimed=1
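
The three timestamps above are evenly spaced. As a quick check (plain Python, no project APIs involved), the gap between consecutive entries works out to 902 seconds, i.e. 15 minutes plus about 2 seconds. That would be consistent with a fixed threshold of roughly 900 seconds plus a small polling delay, though the actual threshold is not confirmed anywhere in this report. Note that 900 s matches neither gateway_timeout (1800) nor terminal.timeout (180) from the profile config.

```python
from datetime import datetime

# Timestamps copied verbatim from the gateway log excerpt above.
LOG_LINES = [
    "2026-05-10 12:53:33 spawned=1 reclaimed=1",
    "2026-05-10 13:08:35 spawned=1 reclaimed=1",
    "2026-05-10 13:23:37 spawned=1 reclaimed=1",
]


def cycle_intervals(lines):
    """Return the gap in seconds between consecutive log entries."""
    stamps = [datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S") for line in lines]
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]


print(cycle_intervals(LOG_LINES))  # [902, 902]
```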

Questions / Investigation needed

  1. What is "stale_lock" actually checking? The lock appears stale even when the worker is actively running (PID alive, model calls being made). Is the lock TTL shorter than the time between gateway heartbeats?

  2. Why does a live worker get marked stale? If the worker is alive and processing, the lock shouldn't be considered stale. Is there a race condition where the gateway marks a lock stale before the worker has a chance to heartbeat?

  3. Why does the respawned worker immediately hit the same stale lock? If the previous worker was killed for being "stale" but the new worker starts fresh, what causes the new worker to also be marked stale within minutes?

  4. Is kimi-k2.6 specifically affected? The pattern consistently shows kimi-k2.6 workers stalling. Could be model-specific (slow token generation causing lock TTL to expire between heartbeats), or could be a generic issue with long-running tasks.
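
To make questions 1, 2 and 4 concrete, here is a minimal simulation of one possible explanation: a staleness check that looks only at lock age. Everything in it is an assumption for illustration. `LOCK_TTL_S`, `is_stale`, `first_reclaim_time`, the 60-second check cadence, and the idea that the lock is written once at spawn are all hypothetical and not taken from the Kanban source. Under those assumptions, a worker that is alive but never refreshes the lock, or whose first heartbeat only arrives after one long model call, is reclaimed at the same point as a dead worker.

```python
LOCK_TTL_S = 900  # assumption: inferred from the ~15 min reclaim interval


def is_stale(last_heartbeat_s, now_s, ttl_s=LOCK_TTL_S):
    """Hypothetical age-only check: it never asks whether the PID is alive."""
    return now_s - last_heartbeat_s > ttl_s


def first_reclaim_time(heartbeats_s, check_every_s=60, horizon_s=3600, ttl_s=LOCK_TTL_S):
    """Return the first dispatcher check (seconds since spawn) at which the
    lock looks stale, or None if that never happens within the horizon."""
    beats = sorted(heartbeats_s)
    for now in range(0, horizon_s + 1, check_every_s):
        # Assumption: the lock is written at spawn (t=0), then refreshed by heartbeats.
        last = max((b for b in beats if b <= now), default=0)
        if is_stale(last, now, ttl_s):
            return now
    return None


# Alive worker that never refreshes the lock: reclaimed at the first check past the TTL.
print(first_reclaim_time([]))                      # 960
# Worker that heartbeats every 5 minutes: not reclaimed within the horizon.
print(first_reclaim_time(range(300, 3600, 300)))   # None
# Worker whose first heartbeat only comes after a 20-minute model call.
print(first_reclaim_time([1200]))                  # 960
```

If the real check behaves like this, the model-specific theory in question 4 and the generic long-running-task theory are the same failure: any gap between heartbeats longer than the TTL triggers a reclaim, regardless of worker health.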

Suggested fixes

  1. Increase lock TTL or make it adaptive: If a worker is actively making model calls (input_tokens > 0 in last heartbeat), extend the lock TTL dynamically.

  2. Add stall detection before lock expiry: If a task has been running for >X minutes with zero tool calls or zero output, trigger a mid-flight warning rather than waiting for the lock to expire.

  3. Log why lock is considered stale: The stale_lock error message should include the actual TTL value and the timestamp of the last heartbeat, so we can diagnose whether it's a timing issue or a logic bug.

  4. Consider: if worker_pid is alive and responsive, don't reclaim. The current logic seems to reclaim based on lock file age alone, not actual worker health.
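
Fixes 3 and 4 could be combined roughly as follows. This is a sketch under stated assumptions, not the existing reclaim logic: `parse_lock_owner`, `pid_alive` and `reclaim_decision` are hypothetical names, and the only thing taken from this report is the `stale_lock=<host>:<pid>` error format. The liveness probe uses `os.kill(pid, 0)`, which is POSIX-only (fine for the macOS host here, not safe on Windows).

```python
import os
import socket


def parse_lock_owner(error):
    """Split 'stale_lock=<host>:<pid>' into (host, pid)."""
    _, _, owner = error.partition("=")
    host, _, pid = owner.rpartition(":")
    return host, int(pid)


def pid_alive(pid):
    """POSIX probe: signal 0 checks that the process exists without signalling it."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def reclaim_decision(error, last_heartbeat_s, now_s, ttl_s):
    """Return (reclaim, reason). Reclaims only if the lock is past its TTL
    and the owner is not a live process on this host."""
    host, pid = parse_lock_owner(error)
    age = now_s - last_heartbeat_s
    # The PID can only be probed when the lock owner is on the local machine.
    alive = pid_alive(pid) if host == socket.gethostname() else None
    reclaim = age > ttl_s and alive is not True
    reason = (
        f"stale_lock owner={host}:{pid} ttl={ttl_s}s "
        f"last_heartbeat_age={age:.0f}s pid_alive={alive} reclaim={reclaim}"
    )
    return reclaim, reason


print(parse_lock_owner("stale_lock=gregs-MacBook-Pro.local:68653"))
```

A live PID only shows that the process exists, not that it is making progress, and PIDs can be reused, so a check like this would still need the stall detection from fix 2 to catch a worker that is alive but hung.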

Impact

This recurs constantly: every significant task on kimicoder hits the stall loop. It wastes compute, blocks progress, and makes Kanban unreliable. The current workaround is to manually reclaim and nudge the task repeatedly, but the pattern always recurs.

Metadata

  • Labels: P3 (Low: cosmetic, nice to have), comp/plugins (Plugin system and bundled plugins), type/bug (Something isn't working)