[Bug]: Kanban stale claim locks from dead workers have no auto-cleanup — tasks permanently stuck until manual intervention #22926

@clickmonkey

Summary

When a kanban worker process dies unexpectedly (OOM, segfault, SIGKILL, system reboot), its claim on the task stays in the DB (the claim_lock, claim_expires, and worker_pid columns). The claim TTL (~15 minutes) is supposed to release it, but in practice:

  1. The dispatcher doesn't always re-check expired claims on its tick
  2. The claim expiry timestamp may not be set correctly for all failure modes
  3. There is no watchdog that periodically scans for and clears stale claims

The result: tasks get permanently stuck in running status with a dead worker's claim. They are invisible to hermes kanban dispatch and don't appear in the blocked/crashed dashboard view. The only fix is manual DB surgery:

UPDATE tasks SET
  claim_lock = NULL, claim_expires = NULL, worker_pid = NULL, current_run_id = NULL,
  status = 'ready'
WHERE id = 't_xxx';
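
Before running that UPDATE, the affected task ids have to be found by hand. A minimal sketch of a discovery query follows. It assumes a SQLite database, a tasks table with the columns named above, and claim_expires stored as Unix epoch seconds; none of that is confirmed by the Hermes source, and find_stale_claims is a hypothetical helper name.

```python
import sqlite3
import time


def find_stale_claims(conn, now=None):
    """List running tasks whose claim has lapsed or has no expiry at all.

    Assumptions (not confirmed against Hermes): the table is named `tasks`
    and `claim_expires` holds Unix epoch seconds.
    """
    now = time.time() if now is None else now
    return conn.execute(
        """
        SELECT id, worker_pid, claim_expires
        FROM tasks
        WHERE status = 'running'
          AND claim_lock IS NOT NULL
          AND (claim_expires IS NULL OR claim_expires < ?)
        ORDER BY id
        """,
        (now,),
    ).fetchall()
```

Rows with a NULL claim_expires are listed because the expiry may not be set in every failure mode, but a NULL expiry alone does not prove the worker is dead, so those rows are candidates for inspection rather than automatic reset.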

Steps to Reproduce

  1. Start a kanban task with a long-running worker
  2. Kill the worker process with SIGKILL (or let it OOM)
  3. Wait 15+ minutes
  4. The task is still in running with the dead claim, and the dispatcher does not pick it up
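
Step 2 can be simulated without Hermes at all. The sketch below spawns a stand-in "worker" (a sleeping Python subprocess, not a real kanban worker), kills it with SIGKILL, and checks the recorded PID with signal 0, which is one way a watchdog could decide that a worker_pid no longer exists. pid_alive and spawn_and_kill_worker are hypothetical names, and the check is POSIX-only: on Windows, os.kill with signal 0 does not behave as a probe.

```python
import os
import subprocess
import sys


def pid_alive(pid):
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check, sends nothing
    except ProcessLookupError:
        return False     # no such process
    except PermissionError:
        return True      # process exists but belongs to another user
    return True


def spawn_and_kill_worker():
    """Start a stand-in worker, SIGKILL it, and report liveness before/after."""
    proc = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    pid = proc.pid
    alive_before = pid_alive(pid)
    proc.kill()  # sends SIGKILL on POSIX
    proc.wait()  # reap the child; an unreaped zombie still answers signal 0
    return pid, alive_before, pid_alive(pid)
```

A PID check is only a hint: PIDs can be reused by unrelated processes, and after a reboot the stored worker_pid may belong to something else entirely, so it complements claim_expires rather than replacing it.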

Expected Behavior

  1. The dispatcher should check claim_expires on every tick and clear expired claims
  2. A periodic watchdog (or gateway startup check) should scan for status = 'running' tasks with expired claims and reset them to ready
  3. The dashboard should flag stale claims and offer a one-click "reclaim" button as the recovery option
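
Items 1 and 2 could share a single sweep that the dispatcher calls on every tick and the gateway calls once at startup. This is a sketch under assumptions, not the actual dispatcher code: it assumes a SQLite tasks table with the columns from the manual UPDATE above and claim_expires as Unix epoch seconds, and sweep_expired_claims is a hypothetical function name.

```python
import sqlite3
import time


def sweep_expired_claims(conn, now=None):
    """Reset running tasks with an expired claim to 'ready'; return the count.

    Assumptions (not confirmed against Hermes): table `tasks`, and
    `claim_expires` stored as Unix epoch seconds.
    """
    now = time.time() if now is None else now
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            """
            UPDATE tasks
            SET claim_lock = NULL, claim_expires = NULL,
                worker_pid = NULL, current_run_id = NULL,
                status = 'ready'
            WHERE status = 'running'
              AND claim_expires IS NOT NULL
              AND claim_expires < ?
            """,
            (now,),
        )
    return cur.rowcount
```

The expiry comparison lives in the WHERE clause of the UPDATE itself, so a claim that a live worker renews between a separate read and write is not reset by mistake. Claims with a NULL claim_expires are deliberately left alone here and would need a separate rule.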

Suggested Fix

  • Add a claim-expiry sweep in the dispatcher's main loop
  • On gateway startup, run a one-time scan for orphaned claims
  • Add hermes kanban reclaim --stale to bulk-reclaim all expired-claim tasks

Environment

  • Hermes Agent v2.x
  • The hermes kanban reclaim <task_id> command exists for single tasks but requires manual discovery
  • The kanban-orchestrator skill documents this under "Recovering stuck workers"

Metadata

    Labels

    P3 (Low: cosmetic, nice to have), comp/plugins (Plugin system and bundled plugins), type/bug (Something isn't working)
