Public termination endpoint for kanban runs (POST /runs/{run_id}/terminate) #23762

@Interstellar-code

Motivation

_terminate_reclaimed_worker in hermes_cli/kanban_db.py (~line 3057) already implements
SIGTERM → grace period → SIGKILL, but no HTTP route exposes it. The only user-facing
termination path today is POST /tasks/{task_id}/reclaim, which is a recovery action
for stuck/dead workers — not a clean "stop this running task" API. Operators who need to
cancel a live, well-behaved worker have no dashboard or API surface to do so without
SSHing into the host.
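
For readers unfamiliar with the escalation pattern, here is a minimal sketch of SIGTERM → grace period → SIGKILL. It is not the code of `_terminate_reclaimed_worker`; the function name `terminate_with_grace`, its return values, and the use of `subprocess.Popen` are assumptions made for illustration, and the signal mapping holds on POSIX hosts only.

```python
import subprocess


def terminate_with_grace(proc: subprocess.Popen, grace_seconds: float = 5.0) -> str:
    """Stop a child process politely, then forcefully if it does not exit.

    Hypothetical helper: returns which step ended the process.
    """
    if proc.poll() is not None:
        return "already_exited"  # nothing to signal
    proc.terminate()  # sends SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_seconds)  # give the worker time to clean up
        return "sigterm"
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL on POSIX; cannot be caught or ignored
        proc.wait()  # reap the process so it does not linger as a zombie
        return "sigkill"
```

A well-behaved worker exits during the grace window; only a worker that ignores or blocks SIGTERM reaches the SIGKILL branch.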

Adjacent evidence: issue #22176 (CLI interrupt /stop not working) shows user demand for
a stop primitive; a public terminate endpoint would satisfy the same need for tasks already
claimed and running.

Design options

Option A — Open endpoint
POST /runs/{run_id}/terminate sends SIGTERM (→ SIGKILL after grace) immediately. Any
authenticated dashboard caller can terminate any run. Simple; matches the "no RBAC layer"
reality of all other dashboard routes today. Downside: no audit trail, no signal to the
dispatcher that the task was deliberately cancelled vs. crashed.

Option B — Soft-cancel flag (proposed default)
POST /runs/{run_id}/terminate returns 202 immediately and sets a
runs.cancel_requested = 1 flag. The dispatcher's next tick reads the flag, sends SIGTERM,
waits for grace period, SIGKILLs if needed, and closes the run with outcome=cancelled.
?force=true skips the flag and sends SIGKILL directly. Advantages: dispatcher-mediated
semantics match how reclaim/claim work elsewhere; ?force documents destructive intent
explicitly; the flag survives a dashboard restart.
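
A rough sketch of the Option B flow, under stated assumptions: the `runs` table layout (`status`, `outcome` columns), the 404 response for an unknown or finished run, and the function names `request_cancel` and `dispatcher_tick` are all hypothetical. Only `cancel_requested`, the 202 response, and `outcome=cancelled` come from the proposal. The `terminate` callback stands in for the SIGTERM/grace/SIGKILL step, and the `?force=true` path is left out.

```python
import sqlite3
from typing import Callable


def init_db(conn: sqlite3.Connection) -> None:
    # Hypothetical schema; the real runs table may differ.
    conn.execute(
        """CREATE TABLE runs (
               id INTEGER PRIMARY KEY,
               status TEXT NOT NULL DEFAULT 'running',
               cancel_requested INTEGER NOT NULL DEFAULT 0,
               outcome TEXT
           )"""
    )


def request_cancel(conn: sqlite3.Connection, run_id: int) -> int:
    """Body of POST /runs/{run_id}/terminate: set the flag, return an HTTP status."""
    cur = conn.execute(
        "UPDATE runs SET cancel_requested = 1 WHERE id = ? AND status = 'running'",
        (run_id,),
    )
    conn.commit()
    # 202: accepted, the dispatcher acts later. 404 is an assumed choice.
    return 202 if cur.rowcount == 1 else 404


def dispatcher_tick(conn: sqlite3.Connection, terminate: Callable[[int], None]) -> list[int]:
    """One dispatcher pass: stop flagged runs and close them as cancelled."""
    rows = conn.execute(
        "SELECT id FROM runs WHERE status = 'running' AND cancel_requested = 1 ORDER BY id"
    ).fetchall()
    cancelled = []
    for (run_id,) in rows:
        terminate(run_id)  # SIGTERM, grace period, SIGKILL if needed
        conn.execute(
            "UPDATE runs SET status = 'closed', outcome = 'cancelled' WHERE id = ?",
            (run_id,),
        )
        cancelled.append(run_id)
    conn.commit()
    return cancelled
```

Because the flag lives in the database rather than in dashboard memory, a request accepted before a dashboard restart is still picked up on the next tick.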

Option C — Scoped admin token
Destructive ops (terminate, kill) require a separate HERMES_ADMIN_TOKEN env var
distinct from the dashboard read token. Safer for shared deployments; adds operational
overhead for solo installs.
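
One way the Option C check could look, as a sketch only: the helper name `is_admin_request` and the rule that an unset token denies all destructive calls are assumptions, not part of the proposal. Only the `HERMES_ADMIN_TOKEN` variable name comes from the text above.

```python
import hmac
import os
from typing import Mapping, Optional


def is_admin_request(presented_token: Optional[str],
                     env: Mapping[str, str] = os.environ) -> bool:
    """Return True only if the caller presented the configured admin token.

    Hypothetical helper. If HERMES_ADMIN_TOKEN is unset or empty, every
    destructive request is refused (an assumed fail-closed policy).
    """
    expected = env.get("HERMES_ADMIN_TOKEN", "")
    if not expected or not presented_token:
        return False
    # Constant-time comparison avoids leaking the token through timing.
    return hmac.compare_digest(presented_token.encode(), expected.encode())
```

A terminate handler would call this before touching the run and return 403 on `False`; solo installs would then have to set the extra variable, which is the operational overhead mentioned above.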

Proposed default

Option B. Soft-cancel + ?force escape hatch is the right trade-off: it preserves
dispatcher-mediated semantics (everything goes through the loop), gives the worker a clean
shutdown path, and the ?force flag makes SIGKILL an explicit opt-in rather than the
default. Option C can layer on top later if multi-user RBAC becomes a requirement.

Next steps

Will follow up with a PR implementing Option B after design preference is confirmed in this
thread. Read-only sibling endpoints (GET /workers/active, GET /runs/{run_id},
GET /runs/{run_id}/inspect) land in the companion PR (link to be added once opened).

Labels: P3 (Low: cosmetic, nice to have), comp/cron (Cron scheduler and job management), comp/plugins (Plugin system and bundled plugins), type/feature (New feature or request)
