Public termination endpoint for kanban runs (POST /runs/{run_id}/terminate)
Motivation
_terminate_reclaimed_worker in hermes_cli/kanban_db.py (~line 3057) already implements
SIGTERM → grace period → SIGKILL, but no HTTP route exposes it. The only user-facing
termination path today is POST /tasks/{task_id}/reclaim, which is a recovery action
for stuck/dead workers — not a clean "stop this running task" API. Operators who need to
cancel a live, well-behaved worker have no dashboard or API surface to do so without
SSHing into the host.
Adjacent evidence: issue #22176 (CLI interrupt /stop not working) shows user demand for
a stop primitive; a public terminate endpoint would satisfy the same need for tasks already
claimed and running.
Design options
Option A — Open endpoint
POST /runs/{run_id}/terminate sends SIGTERM immediately and escalates to SIGKILL if the
worker is still alive after the grace period. Any authenticated dashboard caller can
terminate any run. Simple; matches the "no RBAC layer" reality of all other dashboard
routes today. Downside: no audit trail, and no signal to the dispatcher that the task was
deliberately cancelled rather than crashed.
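For reference, the SIGTERM → grace period → SIGKILL escalation that both Option A and Option B rely on can be sketched roughly as follows. This is an illustration only, not the body of _terminate_reclaimed_worker: the function name terminate_with_grace, the subprocess.Popen handle, and the 5-second default are all assumptions made for the example.

```python
import subprocess


def terminate_with_grace(proc: subprocess.Popen, grace_seconds: float = 5.0) -> str:
    """Ask a child process to stop, then force-kill it if it ignores the request.

    Hypothetical helper: name, signature, and default grace are illustrative.
    Returns "already_exited", "terminated", or "killed".
    """
    if proc.poll() is not None:
        # Nothing to do: the process is already gone.
        return "already_exited"

    proc.terminate()  # SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_seconds)
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX
        proc.wait()  # reap the child so it does not linger as a zombie
        return "killed"
```

The return value distinguishes a clean shutdown from a forced one, which is the kind of information an audit trail would need to record.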
Option B — Soft-cancel flag (proposed default)
POST /runs/{run_id}/terminate returns 202 immediately and sets a
runs.cancel_requested = 1 flag. The dispatcher's next tick reads the flag, sends SIGTERM,
waits for the grace period, sends SIGKILL if the worker is still alive, and closes the run
with outcome=cancelled. ?force=true skips the flag and sends SIGKILL directly. Advantages:
dispatcher-mediated semantics match how reclaim/claim work elsewhere; ?force documents
destructive intent explicitly; the flag survives a dashboard restart.
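A minimal sketch of the soft-cancel flow, assuming a SQLite-backed runs table. Only the cancel_requested column and the outcome=cancelled value come from this proposal; the other columns, the function names request_cancel and dispatcher_tick, and the injected terminate callable are hypothetical and stand in for whatever kanban_db.py actually provides.

```python
import sqlite3
from typing import Callable, List

# Assumed schema for illustration; the real runs table likely differs.
SCHEMA = """
CREATE TABLE runs (
    id INTEGER PRIMARY KEY,
    pid INTEGER NOT NULL,
    status TEXT NOT NULL DEFAULT 'running',
    cancel_requested INTEGER NOT NULL DEFAULT 0,
    outcome TEXT
)
"""


def request_cancel(conn: sqlite3.Connection, run_id: int) -> int:
    """HTTP handler core: set the flag and return a status code (202 or 404)."""
    cur = conn.execute(
        "UPDATE runs SET cancel_requested = 1 WHERE id = ? AND status = 'running'",
        (run_id,),
    )
    conn.commit()
    return 202 if cur.rowcount == 1 else 404


def dispatcher_tick(conn: sqlite3.Connection, terminate: Callable[[int], None]) -> List[int]:
    """One dispatcher pass: terminate flagged runs and close them as cancelled."""
    rows = conn.execute(
        "SELECT id, pid FROM runs WHERE cancel_requested = 1 AND status = 'running'"
    ).fetchall()
    closed = []
    for run_id, pid in rows:
        terminate(pid)  # expected to do SIGTERM, grace period, then SIGKILL
        conn.execute(
            "UPDATE runs SET status = 'closed', outcome = 'cancelled' WHERE id = ?",
            (run_id,),
        )
        closed.append(run_id)
    conn.commit()
    return closed
```

Because the flag lives in the database rather than in dashboard memory, a request made just before a dashboard restart is still picked up on the dispatcher's next tick. The ?force=true path is left out here, since it bypasses this loop entirely.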
Option C — Scoped admin token
Destructive ops (terminate, kill) require a separate HERMES_ADMIN_TOKEN env var
distinct from the dashboard read token. Safer for shared deployments; adds operational
overhead for solo installs.
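If Option C were adopted, the check on destructive routes might look something like the sketch below. HERMES_ADMIN_TOKEN is the variable named above; the helper name is_admin_authorized, the env parameter, and the choice to reject every request when the variable is unset are assumptions for illustration.

```python
import hmac
import os
from typing import Mapping, Optional


def is_admin_authorized(
    presented_token: Optional[str],
    env: Mapping[str, str] = os.environ,
) -> bool:
    """Return True only if the caller presented the configured admin token.

    Hypothetical helper. Fails closed: if HERMES_ADMIN_TOKEN is unset or
    empty, no token is accepted.
    """
    expected = env.get("HERMES_ADMIN_TOKEN", "")
    if not expected or not presented_token:
        return False
    # Constant-time comparison avoids leaking token contents through timing.
    return hmac.compare_digest(presented_token.encode(), expected.encode())
```

A solo install that never sets the variable would, under this assumption, have terminate disabled rather than open, which is part of the operational overhead noted above.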
Proposed default
Option B. Soft-cancel + ?force escape hatch is the right trade-off: it preserves
dispatcher-mediated semantics (everything goes through the loop), gives the worker a clean
shutdown path, and the ?force flag makes SIGKILL an explicit opt-in rather than the
default. Option C can layer on top later if multi-user RBAC becomes a requirement.
Next steps
I will follow up with a PR implementing Option B once the design preference is confirmed
in this thread. Read-only sibling endpoints (GET /workers/active, GET /runs/{run_id},
GET /runs/{run_id}/inspect) land in the companion PR (link to be added once opened).