recurring stall loop: kanban workers repeatedly reclaimed with stale_lock, zero output across respawns

## Bug Description

Kanban workers on the default board repeatedly enter a stall loop:

1. Worker spawns (status: running)
2. Worker produces no output — workspace stays empty, no files written
3. Worker is reclaimed after ~15 min (status: reclaimed, error: stale_lock=gregs-MacBook-Pro.local:68653)
4. Dispatcher immediately respawns a new worker
5. The new worker follows the same pattern: no output, then reclaimed with stale_lock
6. This repeats 3-5 times with zero progress before the task either completes or is abandoned

**Affected tasks:** t_805fc503, t_57dabea4 (both on kimicoder / kimi-k2.6)

**Profile config (kimicoder):**
- max_turns: 200
- gateway_timeout: 1800
- terminal.backend: local
- terminal.timeout: 180

## Evidence

Run history for t_805fc503:
- Run 81:  reclaimed (stale_lock)
- Run 114: reclaimed (stale_lock)
- Run 115: reclaimed (stale_lock)
- Run 116: running (current, workspace empty after 10+ min)

Gateway log cycling:
```
2026-05-10 12:53:33 spawned=1 reclaimed=1
2026-05-10 13:08:35 spawned=1 reclaimed=1
2026-05-10 13:23:37 spawned=1 reclaimed=1
```
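The spacing of those entries can be checked directly. The sketch below only parses the three timestamps quoted above and prints the gap between consecutive cycles; nothing else about the gateway is assumed.

```python
from datetime import datetime

# Timestamps copied verbatim from the gateway log excerpt in this issue.
log_lines = [
    "2026-05-10 12:53:33 spawned=1 reclaimed=1",
    "2026-05-10 13:08:35 spawned=1 reclaimed=1",
    "2026-05-10 13:23:37 spawned=1 reclaimed=1",
]


def cycle_intervals(lines):
    """Return the number of seconds between consecutive log entries."""
    stamps = [datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S") for line in lines]
    return [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]


print(cycle_intervals(log_lines))  # [902, 902]
```

Both gaps are 902 seconds, so the reclaim/respawn cycle looks like a fixed ~900 s period rather than something that varies with worker activity. If the profile values are in seconds (an assumption, the units are not stated), this matches neither gateway_timeout (1800) nor terminal.timeout (180).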

## Questions / Investigation needed

1. **What is "stale_lock" actually checking?** The lock appears stale even when the worker is actively running (PID alive, model calls being made). Is the lock TTL shorter than the time between gateway heartbeats?

2. **Why does a live worker get marked stale?** If the worker is alive and processing, the lock shouldn't be considered stale. Is there a race condition where the gateway marks a lock stale before the worker has a chance to heartbeat?

3. **Why does the respawned worker immediately hit the same stale lock?** If the previous worker was killed for being "stale" but the new worker starts fresh, what causes the new worker to also be marked stale within minutes?

4. **Is kimi-k2.6 specifically affected?** Both affected tasks ran on kimi-k2.6. This could be model-specific (slow token generation causing the lock TTL to expire between heartbeats), or it could be a generic issue with long-running tasks.
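To make the timing hypothesis in questions 1, 2 and 4 concrete, here is a minimal simulation. Everything in it is an assumption for illustration: that staleness is judged by heartbeat age alone, that the worker only heartbeats when a model call finishes, and that the gateway checks on a fixed interval. The function names and numbers are hypothetical and do not come from the actual dispatcher code.

```python
def is_stale(last_heartbeat, now, ttl):
    """Hypothetical age-only rule: stale once the last heartbeat is older than ttl."""
    return now - last_heartbeat > ttl


def first_false_reclaim(ttl, check_every, call_durations):
    """Return the time (seconds since spawn) at which a live worker is first
    judged stale, or None if it never is.

    Assumptions: the worker heartbeats at spawn and whenever a model call
    finishes; the gateway inspects the lock every `check_every` seconds.
    """
    heartbeats = [0.0]
    for duration in call_durations:
        heartbeats.append(heartbeats[-1] + duration)

    t = check_every
    while t <= heartbeats[-1]:
        last = max(h for h in heartbeats if h <= t)
        if is_stale(last, t, ttl):
            return t
        t += check_every
    return None


# Illustrative numbers only: one slow 200 s call against a 60 s TTL.
print(first_false_reclaim(ttl=60, check_every=30, call_durations=[20, 200, 20]))  # 90
# Same TTL, but every call finishes well inside it.
print(first_false_reclaim(ttl=60, check_every=30, call_durations=[20, 40, 20]))   # None
```

Under these assumptions a single long model call is enough to get a healthy worker reclaimed, and a respawned worker running the same task on the same model would hit the same slow call again, which would be consistent with the loop described above. Whether the real heartbeat works this way is exactly what needs confirming.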

## Suggested fixes

1. **Increase lock TTL or make it adaptive:** If a worker is actively making model calls (input_tokens > 0 in last heartbeat), extend the lock TTL dynamically.

2. **Add stall detection before lock expiry:** If a task has been running for >X minutes with zero tool calls or zero output, trigger a mid-flight warning rather than waiting for the lock to expire.

3. **Log why lock is considered stale:** The stale_lock error message should include the actual TTL value and the timestamp of the last heartbeat, so we can diagnose whether it's a timing issue or a logic bug.

4. **Consider: if worker_pid is alive and responsive, don't reclaim.** The current logic seems to reclaim based on lock file age alone, not actual worker health.
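Fixes 3 and 4 could be combined into a single decision function, sketched below. This is not the existing reclaim logic: `LockInfo`, `reclaim_decision`, the field names, and the message format are all hypothetical, chosen only to show a reclaim check that consults PID liveness and reports the TTL and heartbeat age it used. The liveness probe uses `os.kill(pid, 0)`, which is a POSIX-only technique and only works for workers on the same host.

```python
import os
from dataclasses import dataclass


@dataclass
class LockInfo:
    # Hypothetical fields; the real lock format is not shown in this issue.
    host: str
    pid: int
    last_heartbeat: float  # Unix timestamp of the last heartbeat


def pid_alive(pid):
    """POSIX-only liveness probe: signal 0 checks existence without signalling."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def reclaim_decision(lock, now, ttl, local_host, alive=pid_alive):
    """Return (should_reclaim, message). The message always carries the
    TTL and heartbeat age so a stale_lock error can be diagnosed later."""
    age = now - lock.last_heartbeat
    detail = (
        f"stale_lock={lock.host}:{lock.pid} "
        f"ttl={ttl:.0f}s last_heartbeat_age={age:.0f}s"
    )
    if age <= ttl:
        return False, detail + " verdict=fresh"
    # Only trust the PID probe when the lock was taken on this machine.
    if lock.host == local_host and alive(lock.pid):
        return False, detail + " verdict=expired_but_pid_alive"
    return True, detail + " verdict=reclaim"
```

A live PID only shows that the process exists, not that it is making progress, so a check like this would still need the stall detection from fix 2 to catch workers that are alive but producing nothing.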

## Impact

This is a constantly recurring problem: every significant task on kimicoder hits this stall loop. It wastes compute, blocks progress, and makes Kanban unreliable. The current workaround is to manually reclaim and nudge the task repeatedly, but the pattern always recurs.

recurring stall loop: kanban workers repeatedly reclaimed with stale_lock, zero output across respawns #23025
