Summary
When a kanban worker process dies unexpectedly (OOM, segfault, SIGKILL, system reboot), its claim on the task remains in the DB (claim_lock, claim_expires, worker_pid). The claim TTL (~15 minutes) is supposed to handle this, but in practice:
- The dispatcher doesn't always re-check expired claims on its tick
- The claim expiry timestamp may not be set correctly for all failure modes
- There is no watchdog that periodically scans for and clears stale claims
The result: tasks get permanently stuck in running status with a dead worker's claim. They are invisible to hermes kanban dispatch and don't appear in the blocked/crashed dashboard view. The only fix is manual DB surgery:
UPDATE tasks SET
claim_lock = NULL, claim_expires = NULL, worker_pid = NULL, current_run_id = NULL,
status = 'ready'
WHERE id = 't_xxx';
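Before that UPDATE can be run, the affected task IDs have to be found by hand. A minimal sketch of the discovery query follows. It assumes the kanban DB is SQLite and that claim_expires holds a Unix epoch timestamp in seconds; neither is stated in this report. find_stale_claims is a hypothetical helper, not an existing Hermes function.

```python
import sqlite3
import time


def find_stale_claims(conn, now=None):
    """Return (id, worker_pid, claim_expires) for running tasks with an expired claim.

    Assumes claim_expires is stored as epoch seconds (an assumption, see above).
    """
    now = time.time() if now is None else now
    rows = conn.execute(
        "SELECT id, worker_pid, claim_expires FROM tasks "
        "WHERE status = 'running' "
        "AND claim_expires IS NOT NULL AND claim_expires < ? "
        "ORDER BY claim_expires",
        (now,),
    )
    return rows.fetchall()
```

This is read-only, so it can be run against a live DB to confirm which tasks are stuck before anything is reset.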
Steps to Reproduce
- Start a kanban task with a long-running worker
- Kill the worker process with SIGKILL (or let it OOM)
- Wait 15+ minutes
- Task is still in running with the dead claim; dispatcher will not pick it up
Expected Behavior
- The dispatcher should check claim_expires on every tick and clear expired claims
- A periodic watchdog (or gateway startup check) should scan for status = 'running' tasks with expired claims and reset them to ready
- The dashboard should show "stale claim" as a recovery option with a one-click "reclaim" button
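Because claim_expires may not be set correctly for every failure mode, a watchdog could also check whether the recorded worker_pid still exists. The sketch below shows one way to do that. It is POSIX-only, pid_alive is a hypothetical helper, and it assumes the worker ran on the same host as the check.

```python
import os


def pid_alive(pid):
    """Best-effort check that a process with this PID exists (POSIX only)."""
    if pid is None or pid <= 0:
        return False
    try:
        # Signal 0 performs the existence and permission checks
        # without delivering any signal to the target process.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # process exists but is owned by another user
    return True
```

A PID can be reused by an unrelated process after a reboot or a long uptime, so a "live" result is only a hint. A "dead" result is the reliable one, which is the case that matters for clearing a claim.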
Suggested Fix
- Add a claim-expiry sweep in the dispatcher's main loop
- On gateway startup, run a one-time scan for orphaned claims
- Add hermes kanban reclaim --stale to bulk-reclaim all expired-claim tasks
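The sweep could look roughly like the sketch below, which applies the manual UPDATE from the summary to every running task whose claim has expired. It assumes a SQLite DB and claim_expires stored as Unix epoch seconds. sweep_expired_claims, run_dispatcher, and dispatch_once are hypothetical names, not the actual dispatcher code.

```python
import sqlite3
import time

RESET_SQL = """
UPDATE tasks
SET claim_lock = NULL, claim_expires = NULL, worker_pid = NULL,
    current_run_id = NULL, status = 'ready'
WHERE status = 'running'
  AND claim_expires IS NOT NULL
  AND claim_expires < ?
"""


def sweep_expired_claims(conn, now=None):
    """Reset every running task whose claim has expired; return how many were reset."""
    now = time.time() if now is None else now
    with conn:  # commits on success, rolls back if the UPDATE raises
        cur = conn.execute(RESET_SQL, (now,))
    return cur.rowcount


def run_dispatcher(conn, dispatch_once, tick_seconds=5.0, max_ticks=None):
    """Hypothetical main loop: sweep first on every tick, then dispatch."""
    ticks = 0
    while max_ticks is None or ticks < max_ticks:
        sweep_expired_claims(conn)  # the first tick doubles as the startup scan
        dispatch_once(conn)
        ticks += 1
        time.sleep(tick_seconds)
```

The same function could back a gateway startup check or a bulk reclaim command. Rows whose claim_expires was never written are not matched by this query and would still need a separate check.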
Environment
- Hermes Agent v2.x
- The hermes kanban reclaim <task_id> command exists for single tasks but requires manual discovery
- The kanban-orchestrator skill documents this under "Recovering stuck workers"