minions: durable SQLite-backed job queue for subagents, ACP, CLI, and cron #68718

garrytan wants to merge 11 commits into openclaw:main
Conversation
This is not quite ready for review yet! I have another phase or two left before it is complete.

Thank you @garrytan! I have bumped this to @mbelinky and @vincentkoc, who worked on Tasks and Task Flows. Also triaging this now, will update here!
Greptile Summary

Confidence Score: 4/5. Safe to merge with one P1 defect: parents using the ignore/continue child-fail policy will be permanently stuck in waiting-children when all children fail. Four previously identified issues were correctly resolved. One genuine P1 remains in applyChildFailPolicy — the parent-resolution call is missing for ignore/continue fail policies. The migration docstring mismatch is P2 (safe in practice). Overall system design and test coverage are solid.

src/minions/queue.ts (applyChildFailPolicy function, lines 827–848)

This is a comment left during a code review.
Path: src/minions/queue.ts
Line: 827-848
Comment:
**`ignore`/`continue` child-fail policies never resolve the parent**
`applyChildFailPolicy` calls `resolveParentIfDone` only for the `remove_dep` branch. For `ignore` and `continue` policies the function returns without doing anything, so the parent stays in `'waiting-children'` permanently if all of its children end in `'failed'` or `'dead'`.
The previous thread's fix (adding `'failed'` to the `NOT IN` clause of `resolveParentIfDone`) is necessary but not sufficient: it only matters when `resolveParentIfDone` is actually called. When the last child fails with `ignore` or `continue` policy, `resolveParentIfDone` is never reached, leaving the parent stuck forever.
```ts
// fix: add the missing branch
} else if (policy === "ignore" || policy === "continue") {
  resolveParentIfDone(db, parentId, now);
}
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: src/minions/migration.ts
Line: 111-133
Comment:
**Migration docstring says `status='attached'` but code inserts `status='waiting'`**
The function-level docstring explicitly states:
> "Imported rows get status='attached' (non-claimable) so the orphan-detection sweep at gateway startup can inspect liveness before marking them claimable or dead."
But the INSERT hardcodes `"waiting"` — meaning migrated live tasks are immediately claimable by the worker and will be re-spawned as soon as the worker starts, bypassing any orphan-liveness check.
In practice this is safe for the intended upgrade scenario (full gateway restart, old subagents dead), but it also means the `'attached'` status and the described orphan-detection sweep are dead code / unimplemented design. The docstring should either reflect the real behavior (`'waiting'`) or the insert should use `'attached'` and the orphan sweep should be wired up in `gateway-startup.ts`.
How can I resolve this? If you propose a fix, please make it concise.

Reviews (2). Last reviewed commit: "fix(minions): third-round Codex review (…"
```ts
// Track the spawned subagent in minions for crash recovery.
// Best-effort: if minions is unavailable, the spawn already succeeded.
if (result.status !== "error") {
  try {
    const { resolveDurabilityMode } = await import("../../minions/durability-config.js");
    if (resolveDurabilityMode() === "minions") {
      const { MinionQueue } = await import("../../minions/queue.js");
      const { MinionStore } = await import("../../minions/store.js");
      const store = MinionStore.openDefault();
      const minionQueue = new MinionQueue(store);
      minionQueue.add("subagent.spawn", {
        task,
        childSessionKey: result.childSessionKey,
        runId: result.runId,
        label: label || undefined,
        agentId: requestedAgentId,
        model: modelOverride,
        thinking: thinkingOverrideRaw,
        runTimeoutSeconds,
        cleanup,
        mode,
        lightContext,
        requesterSessionKey: opts?.agentSessionKey,
      }, {
        idempotencyKey: result.runId ? `spawn:${result.runId}` : undefined,
        timeoutMs: runTimeoutSeconds ? runTimeoutSeconds * 1000 : undefined,
      });
    }
  } catch {
    // Best-effort: minion tracking failure doesn't fail the spawn.
  }
}
```
**Duplicate subagent spawn when durability mode is `"minions"`**

`spawnSubagentDirect` is called on line 313, and then a `"waiting"` job is immediately added to the queue (line 358). The MinionWorker will claim that `"waiting"` job and call `spawnSubagentDirect` **again** inside `subagentSpawnHandler`, spawning a second subagent for every call to `sessions_spawn` when durability is enabled.

The `isRetry` flag inside the handler (based on `previousSessionKey` being set) only changes a progress-phase label — it does not skip the spawn. Both the "fresh spawn" and "crash recovery" branches of the handler unconditionally call `spawnSubagentDirect`.

The PR description's "spawn-then-track" intent requires that the tracking row be inserted as `status = 'active'` (with a valid `lock_token` and `lock_until`) so the worker's normal claim loop cannot pick it up — only stall detection can re-queue it after a crash. Using `minionQueue.add()` (which always inserts `status = 'waiting'`) makes every newly-spawned subagent immediately claimable and re-processed by the worker.
Fixed in 64fbe36. The tracking row is now inserted as status='active' with a valid lock_token and lock_until via direct SQL, so the worker's claim loop (which only claims 'waiting') skips it. Only stall detection can re-queue the row after a crash.

```diff
- minionQueue.add("subagent.spawn", {...}) // 'waiting' = worker claims = double-spawn
+ store.db.prepare('INSERT ... status=active, lock_token=?, lock_until=?') // not claimable
```

```ts
removeJob(id: number): boolean {
  const terminalList = MINION_TERMINAL_STATUSES.join("','");
  const rows = this.store.db
    .prepare(
      `DELETE FROM minion_jobs WHERE id = ? AND status IN ('${terminalList}') RETURNING id`,
    )
    .all(id);
  return rows.length > 0;
}
```
**String-interpolated SQL in `removeJob`**

`MINION_TERMINAL_STATUSES.join("','")` is spliced directly into the SQL string. The values are hardcoded compile-time constants so there is no real injection risk today, but the pattern diverges from every other query in this file (which use `?` placeholders). If the constant is ever extended or the join is copy-pasted elsewhere, the invariant breaks silently.

Prefer building placeholder lists the same way `claim`, `cancelJob`, and `prune` do:
```suggestion
removeJob(id: number): boolean {
  const placeholders = MINION_TERMINAL_STATUSES.map(() => "?").join(",");
  const rows = this.store.db
    .prepare(
      `DELETE FROM minion_jobs WHERE id = ? AND status IN (${placeholders}) RETURNING id`,
    )
    .all(id, ...MINION_TERMINAL_STATUSES);
  return rows.length > 0;
}
```
Fixed in 64fbe36. Switched to ? placeholders matching every other query in the file:

```ts
const placeholders = MINION_TERMINAL_STATUSES.map(() => "?").join(",");
const rows = this.store.db
  .prepare(`DELETE FROM minion_jobs WHERE id = ? AND status IN (${placeholders}) RETURNING id`)
  .all(id, ...MINION_TERMINAL_STATUSES);
```

```ts
const shutdown = () => {
  this.running = false;
};
process.on("SIGTERM", shutdown);
process.on("SIGINT", shutdown);
```
**SIGTERM/SIGINT listener accumulates on multiple `start()` calls**

`shutdown` is registered on the global `process` each time `start()` is called, but only removed in the `finally` block. If `start()` throws before the `while` loop (e.g., the first `promoteDelayed()` throws), the `finally` block still runs and removes the listener — so that path is fine. The real concern is testing: if tests create multiple `MinionWorker` instances and call `start()`, Node.js emits `MaxListenersExceededWarning`. The `maybeStartMinionWorker` singleton guard prevents this in production, but unit tests that construct workers directly will hit it. Consider using a `once` flag or calling `process.setMaxListeners` in the worker constructor to surface the issue clearly.
Fixed in 64fbe36. Two changes:
- `process.on` → `process.once` so listeners don't accumulate across test instances.
- Added a re-entry guard at the top of `start()`: `if (this.running) return;` prevents double-start entirely.
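The combined pattern can be sketched on a simplified worker shape (an illustrative sketch only; the real class lives in src/minions/worker.ts and also runs the claim loop, which is omitted here):

```typescript
// Hedged sketch: re-entry guard plus process.once, as described in the fix.
class SketchWorker {
  private running = false;

  start(): void {
    if (this.running) return; // re-entry guard: double-start is a no-op
    this.running = true;
    const shutdown = () => {
      this.running = false;
    };
    // once() self-removes after firing, so repeated start() calls across
    // test instances do not accumulate listeners on the global process.
    process.once("SIGTERM", shutdown);
    process.once("SIGINT", shutdown);
  }

  get isRunning(): boolean {
    return this.running;
  }
}
```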
```ts
function resolveParentIfDone(db: DatabaseSync, parentId: number, now: number): void {
  db.prepare(
    `UPDATE minion_jobs SET status = 'waiting', updated_at = ?
     WHERE id = ? AND status = 'waiting-children'
     AND NOT EXISTS (
       SELECT 1 FROM minion_jobs
       WHERE parent_job_id = ?
       AND status NOT IN ('completed', 'dead', 'cancelled')
     )`,
  ).run(now, parentId, parentId);
}
```
resolveParentIfDone excludes 'failed' from terminal child statuses
The NOT EXISTS clause checks status NOT IN ('completed', 'dead', 'cancelled'). The 'failed' status is in MINION_TERMINAL_STATUSES and is produced by applyChildFailPolicy's direct SQL update when on_child_fail = 'fail_parent'. A parent whose child just failed via that policy will already be set to 'failed' itself, so the resolution is moot in that branch — but for on_child_fail = 'ignore' or 'continue', a 'failed' child leaves the parent stuck in 'waiting-children' forever because 'failed' is not treated as terminal here. Including 'failed' in the exclusion list (alongside 'dead' and 'cancelled') aligns this with the MINION_TERMINAL_STATUSES definition.
The same gap exists in the public resolveParent method (line 722).
Fixed in 64fbe36. Added 'failed' to the NOT IN clause in all three locations: resolveParentIfDone, the inline resolution in completeJob, and the public resolveParent method. A parent with on_child_fail='ignore' or 'continue' no longer gets stuck in 'waiting-children' when a child fails.

```diff
- AND status NOT IN ('completed', 'dead', 'cancelled')
+ AND status NOT IN ('completed', 'failed', 'dead', 'cancelled')
```
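The invariant behind this fix can be stated as a pure predicate over child statuses (a sketch; in the real code this check runs inside the SQL NOT EXISTS clause):

```typescript
// Terminal statuses per MINION_TERMINAL_STATUSES, now including 'failed'.
const TERMINAL_CHILD_STATUSES = new Set(["completed", "failed", "dead", "cancelled"]);

// A 'waiting-children' parent may resolve only when every child is terminal.
function parentCanResolve(childStatuses: string[]): boolean {
  return childStatuses.every((s) => TERMINAL_CHILD_STATUSES.has(s));
}
```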
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5088268ba6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```ts
export function maybeStartMinionWorker(): void {
  if (activeWorker) {
    return;
  }
```
Call minion worker startup during gateway initialization
This adds maybeStartMinionWorker() but nothing in the repository invokes it, so the worker loop never starts at runtime. In that state, enqueued minion jobs are never claimed and the new durability behaviors (stall/timeout recovery, retries, handler execution) do not run, which means crash-recovery is effectively disabled despite enabling minion writes.
Fixed in 604c942. Wired maybeStartMinionWorker() into src/gateway/server-startup-early.ts via dynamic import, right after startTaskRegistryMaintenance():

```ts
import("../minions/gateway-startup.js")
  .then((m) => m.maybeStartMinionWorker())
  .catch(() => {});
```

Dynamic import keeps the gateway startup path lazy (the minions module only loads if the import succeeds). The worker now actually starts alongside the gateway.
```ts
const existingId = taskToMinionId.get(task.taskId);

if (existingId != null) {
```
Persist task-to-minion identity instead of process-local map
The upsert path depends on the in-memory taskToMinionId map to decide whether to UPDATE or INSERT. After any process restart this map is empty, so existing tasks are inserted again instead of updated, creating duplicate shadow rows and leaving older rows stale (and undeletable via syncTaskDeleteToMinions, which also relies on that map). This will skew minion status/stats over time.
Fixed in 604c942. Replaced the process-local Map<string, number> with DB-level dedup via idempotency_key = 'task:{taskId}'. On upsert, the sync module queries SELECT id FROM minion_jobs WHERE idempotency_key = ? before deciding INSERT vs UPDATE. Survives restarts, no duplicate rows.
Also switched syncTaskDeleteToMinions to delete by idempotency_key instead of the old map lookup.
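The restart-safe upsert decision can be sketched with the SQLite queries abstracted behind a small interface (the interface and method names here are stand-ins for illustration, not the PR's actual API):

```typescript
// The dedup key survives restarts because it lives in the minion_jobs row,
// unlike the old process-local Map<string, number>.
interface ShadowStore {
  findIdByKey(key: string): number | undefined; // SELECT id ... WHERE idempotency_key = ?
  insert(key: string): number;                  // INSERT ... with idempotency_key
  update(id: number): void;                     // UPDATE ... WHERE id = ?
}

function upsertTaskShadow(store: ShadowStore, taskId: string): number {
  const key = `task:${taskId}`;
  const existingId = store.findIdByKey(key);
  if (existingId !== undefined) {
    store.update(existingId); // found after restart: UPDATE, not a duplicate INSERT
    return existingId;
  }
  return store.insert(key);
}
```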
Fixing Greptile issues!
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 604c942ab8
```ts
}
storeInitAttempted = true;
try {
  const { MinionStore: Store } = require("../minions/store.js") as typeof import("../minions/store.js");
```
Replace bare require in ESM minion sync loader
This module is executed in an ESM package (type: module), so require(...) throws ReferenceError; the error is caught, storeInitAttempted is set, and all subsequent syncTaskToMinions/syncTaskDeleteToMinions calls are permanently no-ops for that process. In practice, task-registry shadow writes to minion_jobs never happen in production after the first call path.
Fixed in e318c34. Replaced require() with await import() and made getStore(), syncTaskToMinions(), and syncTaskDeleteToMinions() async. Shadow writes now actually work in the ESM runtime.
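The loader shape can be sketched with the dynamic import abstracted as a parameter (`loader` stands in for `() => import("../minions/store.js")`); the failure semantics here mirror the old `storeInitAttempted` flag: one attempt, then callers no-op:

```typescript
// ESM-safe lazy init: await import() instead of require(), cached so the
// module loads at most once per process.
function makeStoreLoader<T>(loader: () => Promise<T>) {
  let cached: Promise<T | null> | undefined;
  return async function getStore(): Promise<T | null> {
    if (!cached) {
      // Best-effort: a failed load resolves to null, mirroring the
      // original storeInitAttempted semantics (no retry, no throw).
      cached = loader().catch(() => null);
    }
    return cached;
  };
}
```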
```ts
  lightContext,
  expectsCompletionMessage: true,
},
{},
```
Forward requester context when respawning subagent jobs
Crash-recovery respawns call spawnSubagentDirect with an empty context object, so requester identity metadata (session/channel/account/thread) is dropped on retry. Any recovered job can be relaunched under default/main-session context instead of the original caller, which changes ownership checks and routing behavior for resumed subagents.
Fixed in e318c34. The handler now destructures requesterSessionKey, requesterChannel, requesterAccountId, requesterTo, requesterThreadId from job.data and forwards them to spawnSubagentDirect's second argument. Crash-recovered subagents retain their original requester identity.
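The forwarding can be sketched as a small helper (field names are taken from the reply above; the rest of the payload shape is assumed for illustration):

```typescript
// Requester identity persisted in the job payload; respawns forward it so
// crash-recovered subagents keep their original ownership/routing context.
interface SpawnJobData {
  requesterSessionKey?: string;
  requesterChannel?: string;
  requesterAccountId?: string;
  requesterTo?: string;
  requesterThreadId?: string;
  [key: string]: unknown;
}

function requesterContext(data: SpawnJobData) {
  const { requesterSessionKey, requesterChannel, requesterAccountId, requesterTo, requesterThreadId } = data;
  return { requesterSessionKey, requesterChannel, requesterAccountId, requesterTo, requesterThreadId };
}
```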
```ts
// status='active' with a lock so the worker's claim loop (which only
// claims 'waiting') won't touch it. Only stall detection can re-queue
// the row if the gateway crashes and the lock expires.
if (result.status !== "error") {
```
Track only accepted spawn results in minion_jobs
The tracking insert runs for every non-error result, including status: "forbidden". Forbidden responses typically have no runnable run/session identifiers, yet this still writes an active minion row, which later appears as stalled/dead noise and can trigger incorrect recovery behavior if stall policy is relaxed. Restricting tracking to accepted spawns avoids poisoning queue state.
Fixed in e318c34. Changed result.status !== "error" to result.status === "accepted". Forbidden/rejected spawns no longer write tracking rows.
```ts
.run(
  `legacy.${row.runtime}`,
  "default",
  minionStatus,
  JSON.stringify({
```
Import legacy tasks into claimable handler-compatible states
Migration writes live legacy tasks as attached with legacy.<runtime> names, but worker claims only waiting jobs whose names match registered handlers (subagent.spawn, acp.spawn, etc.). In this commit there is no runtime transition path that makes these imported rows claimable, so archived in-flight legacy tasks are stranded instead of recoverable.
Fixed in e318c34. Two changes:
- Migration now maps runtime → registered handler name (`subagent` → `subagent.spawn`, `acp` → `acp.spawn`, `cli` → `cli.spawn`, `cron` → `cron.tick`) so imported rows match what the worker claims.
- Status changed from `attached` to `waiting` so imported rows are immediately claimable by the worker for recovery.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e318c340bf
```ts
const maxStalled =
  typeof row.max_stalled === "bigint" ? Number(row.max_stalled) : row.max_stalled;

if (counter + 1 < maxStalled) {
```
Requeue first stalled attempt before dead-lettering jobs
handleStalled uses counter + 1 < maxStalled, which means jobs with the default max_stalled = 1 are marked dead on their very first expired lock instead of being re-queued once. This breaks the advertised crash-recovery path for default-tracked subagent rows (for example those inserted from sessions-spawn-tool), because a single gateway crash never gets a rescue attempt.
Fixed in 54dbce3. Changed max_stalled default from 1 to 2 in schema DDL. With the old default, 0+1 < 1 was false, so the first stall went straight to dead-letter with no re-queue attempt. Now: first stall re-queues (0+1 < 2 = true), second stall dead-letters (1+1 < 2 = false). Crash recovery works on first attempt.
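The off-by-one is easiest to see as a pure function of the counter and the limit (a sketch of the comparison quoted above):

```typescript
// Decide what handleStalled does with a job whose lock expired.
// counter = stalls already recorded for the job before this one.
function stalledAction(counter: number, maxStalled: number): "requeue" | "dead" {
  return counter + 1 < maxStalled ? "requeue" : "dead";
}
```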
```ts
`UPDATE minion_jobs SET
   status = 'dead',
   error_text = 'timeout exceeded',
   lock_token = NULL, lock_until = NULL,
   finished_at = ?, updated_at = ?
```
Propagate parent policy when timeout dead-letters a child
handleTimeouts directly updates matching rows to dead and returns them, but unlike failJob it never applies on_child_fail handling or parent-resolution logic. When a child job with parent_job_id times out, the parent can remain stuck in waiting-children indefinitely (or miss fail_parent behavior), which blocks fan-out workflows from completing correctly.
Fixed in 54dbce3. handleTimeouts now runs inside a transaction and calls applyChildFailPolicy() for each timed-out job that has a parentJobId. Parents with on_child_fail='fail_parent' get marked failed; parents with 'ignore'/'continue' get resolution checks. Fan-out workflows no longer get stuck.
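A sketch of the sweep's post-processing step: collect the distinct parents of timed-out children so the policy can be applied once per parent inside the same transaction (the row shape is assumed from the discussion):

```typescript
interface DeadLetteredJob {
  id: number;
  parentJobId: number | null;
}

// Distinct parent ids needing on_child_fail handling after a timeout sweep.
function parentsNeedingPolicy(jobs: DeadLetteredJob[]): number[] {
  const parents = new Set<number>();
  for (const job of jobs) {
    if (job.parentJobId != null) parents.add(job.parentJobId);
  }
  return [...parents];
}
```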
```diff
@@ -305,7 +320,7 @@ export async function runTaskRegistryMaintenance(): Promise<TaskRegistryMaintena
     if (!current) {
       continue;
     }
-    if (shouldMarkLost(current, now)) {
+    if (!minionsOwnLostDetection && shouldMarkLost(current, now)) {
```
Keep task lost-reconciliation enabled for non-minion runs
This guard disables shouldMarkLost entirely once minionsOwnLostDetection is enabled at gateway startup, so the task-registry sweep no longer marks any active tasks as lost. However, shadow-synced registry rows are written as plain active records without lock heartbeats (task-minion-sync), so minion stall/timeout sweeps cannot replace this for ACP/CLI/cron tasks; those tasks can stay running forever after their backing session disappears.
Fixed in 54dbce3. Removed the minionsOwnLostDetection flag entirely. The existing task-registry sweep now runs as a safety net alongside minions (belt + suspenders). Minion stall detection is faster (sub-second heartbeat vs 5-min grace), but the old sweep catches ACP/CLI/cron tasks that don't have minion lock heartbeats yet.
Hi @garrytan thanks for this!
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 54dbce3b5f
```ts
`UPDATE minion_jobs SET
   status = 'dead', stalled_counter = stalled_counter + 1,
   error_text = 'max stalled count exceeded',
   lock_token = NULL, lock_until = NULL,
   finished_at = ?, updated_at = ?
```
Run parent-failure handling for stalled jobs marked dead
When handleStalled() dead-letters an expired active job, it only updates the child row and returns it; it never invokes applyChildFailPolicy/parent resolution for rows with parent_job_id. In a fan-out tree, a stalled child that hits max stalled count will become dead while the parent remains waiting-children, so fail_parent/remove_dep behavior is skipped and the parent can remain blocked indefinitely after crash recovery.
```ts
`UPDATE minion_jobs SET
   status = 'cancelled', lock_token = NULL, lock_until = NULL,
   finished_at = ?, updated_at = ?
 WHERE id IN (${placeholders})
   AND status IN ('waiting','active','delayed','waiting-children','paused','attached')`,
```
Resolve parent state after cancelling child subtrees
cancelJob() marks descendants as cancelled but does not trigger parent resolution afterward. If an operator cancels a non-root child that is the last unfinished dependency, the parent stays in waiting-children forever because no later child completion will call resolveParentIfDone, effectively wedging that workflow.
Thanks for the work here. I agree with the problem statement: detached work in OpenClaw needs stronger execution semantics than "track it and later mark it lost". Crash recovery, retry, stall detection, and cascade cancel are all real needs, and this PR clearly puts serious thought into that layer. That said, I do not think merging this into core in its current shape would be the right next step. The main concern here is architectural, rather than whether the queue itself works. OpenClaw already has core product-level abstractions for detached work:
This PR introduces a second orchestration/execution plane in core. A more robust direction would be to:
Concretely, it seems like the next step should be a seam-extraction PR, rather than landing
The kinds of capabilities that may belong in core are things like:
In recent internal discussions we have agreed to keep the core as lean as possible, and as such there are things that currently would sit better outside of it, such as:
So this is not a rejection of the problem being solved. It is more a request to re-scope the implementation so the solution fits the architecture we want to preserve. What would be most helpful next:
If there is a subset of this that truly needs to remain in core, it would help to isolate the smallest necessary subset and make the case for that subset specifically. But I do not think the full
BullMQ-inspired queue with 9-state lifecycle, parent-child DAGs, atomic cascade-cancel via recursive CTE, lock-heartbeat stall detection, idempotency via partial unique index, token accounting with parent roll-up, inbox messaging. Backed by node:sqlite (zero new runtime deps). WAL + BEGIN IMMEDIATE for writer serialization. Grounded in Wintermute production evidence: 46,713 pages, 98 zombie processes over 18 days, ~40% subagent timeout rate during peak load.
Four runtime handler stubs (subagent/ACP/CLI/cron) with contract validation. Real subagent handler calls spawnSubagentDirect with crash recovery. Gateway startup wiring (maybeStartMinionWorker) with durability config flag. openclaw minions CLI (list/get/cancel/stats/retry/prune/smoke). Gateway protocol schemas (submit/list/cancel/stats/inbox). Plugin SDK seam at openclaw/plugin-sdk/minions-runtime.
TaskRegistry shadow-write: every create/update/delete mirrors to minion_jobs. Status-map module (bidirectional TaskStatus <-> MinionJobStatus). Lost-detection gated by minionsOwnLostDetection flag. Doctor migration contribution for legacy tasks/*.sqlite -> minion_jobs. sessions-spawn-tool.ts: tracks every subagent spawn in minion_jobs when durability mode is enabled, with runId as idempotency key.
…on data Unit: schema, queue CRUD, parent-child, idempotency, cascade-kill, stall, timeout, worker lifecycle, CAS guard, fan-out, handler registry, status-map. E2E: concurrent claim, timeout dead-letter, crash rescue, deep fan-in, cascade kill, durability, throughput, memory, cron pile-up, idempotent respawn, autopilot multi-task. All benchmarks annotated with real data from a 46,713-page Wintermute deployment.
User-facing: docs/concepts/minions.md, docs/cli/minions.md, CHANGELOG entry. Unrelated bugfix: install security scan now allows pnpm workspace symlinks (node_modules/@openclaw/* -> sibling extension dirs) by verifying the target is a recognized workspace member path.
1. Fix double-spawn bug: insert tracking row as status='active' with lock (not 'waiting') so worker claim loop skips it. Only stall detection re-queues after crash.
2. Fix removeJob SQL: use ? placeholders instead of string interpolation.
3. Fix worker listener leak: use process.once + re-entry guard.
4. Fix resolveParentIfDone: include 'failed' in terminal child statuses so on_child_fail='ignore'/'continue' doesn't leave parent stuck.
…x shadow-sync dedup)
1. Wire maybeStartMinionWorker() into gateway early startup so the worker
actually runs (Codex P1: worker defined but never called).
2. Fix shadow-sync dedup: use idempotency_key='task:{taskId}' instead of
process-local Map so restarts don't create duplicate rows (Codex P2).
…r context, accepted-only tracking, migration handler names)
…switch defaults, doc format, floating promises)
…parent policy, remove lost-detection flag)

1. Change max_stalled default from 1 to 2 so first stall re-queues instead of dead-lettering (crash recovery actually works on first attempt).
2. handleTimeouts now applies on_child_fail parent policy after dead-lettering, preventing parents from getting stuck in waiting-children.
3. Remove minionsOwnLostDetection flag — keep the existing task-registry sweep running as a safety net alongside minions. Belt + suspenders.
Thanks, I've never done this before. I will try to do that!
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b45d69976d
```sql
INSERT INTO minion_jobs (
  name, queue, status, data, lock_token, lock_until,
  idempotency_key, timeout_ms, created_at, updated_at, started_at, attempts_started
) VALUES (?, ?, 'active', ?, ?, ?, ?, ?, ?, ?, ?, 1)
```
Stop tracking accepted spawns as expiring active jobs
In minions durability mode, this insert creates an active job with a 30s lock, but there is no corresponding heartbeat or completion path for the original live subagent run. Once lock_until expires, handleStalled() will treat it as orphaned and requeue it, so normal long-running (or even already-finished) spawns get retried as crash recovery and a duplicate subagent.spawn is launched without any gateway crash. This tracking row needs either lock renewal/terminal updates from the live run lifecycle or a non-claimable status until true orphan reconciliation.
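The lock-renewal path the reviewer asks for can be sketched as follows. This is a hedged illustration of the heartbeat idea, not the PR's implementation; `renewLock`, `isStalled`, and the job shape are illustrative names.

```typescript
interface TrackedJob { id: number; lockUntil: number; status: string; }

const LOCK_MS = 30_000;

// Heartbeat from the live run: extend lock_until while still healthy,
// so a long-running spawn never looks expired.
function renewLock(job: TrackedJob, now: number): void {
  if (job.status === "active") job.lockUntil = now + LOCK_MS;
}

// Stall sweep: only an actually expired lock counts as orphaned.
function isStalled(job: TrackedJob, now: number): boolean {
  return job.status === "active" && job.lockUntil <= now;
}

const job: TrackedJob = { id: 1, lockUntil: 30_000, status: "active" };
renewLock(job, 25_000);              // heartbeat before expiry
console.log(isStalled(job, 40_000)); // false: lock extended to 55_000
console.log(isStalled(job, 60_000)); // true: no further heartbeat arrived
```

Without this renewal loop (or a non-claimable status), every healthy spawn longer than 30s is indistinguishable from a crash, which is exactly the duplicate-respawn defect described above.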
DM me on X or discord - happy to talk you through it. I can probably do the seam bit as it's a good addition to core anyway.
We landed the first extraction step in #68886. That PR keeps the current Tasks behavior intact, but adds a narrow detached-task lifecycle seam in core and tests that callers route through it. That is the direction we want to build on. Given that, I don’t think it makes sense to take this PR as-is. It still introduces a second core execution substrate in parallel with Tasks/TaskFlow, which is the main architectural concern we discussed. The more robust path from here is:
I’m going to close this PR in favor of that direction. If you want to keep pursuing the durability/recovery work, the next step should be a smaller follow-up rebased on current `main`. If there is a narrow slice in here that you think still needs to be salvaged immediately, feel free to open a follow-up against `main`.
```ts
function applyChildFailPolicy(
  db: DatabaseSync,
  child: MinionJob,
  errorText: string,
  now: number,
): void {
  const parentId = child.parentJobId!;
  const policy: ChildFailPolicy = child.onChildFail;

  if (policy === "fail_parent") {
    db.prepare(
      `UPDATE minion_jobs SET status = 'failed',
         error_text = ?, finished_at = ?, updated_at = ?
       WHERE id = ? AND status = 'waiting-children'`,
    ).run(`child job ${child.id} failed: ${errorText}`, now, now, parentId);
  } else if (policy === "remove_dep") {
    db.prepare(
      "UPDATE minion_jobs SET parent_job_id = NULL, updated_at = ? WHERE id = ?",
    ).run(now, child.id);
    resolveParentIfDone(db, parentId, now);
  }
}
```
**`ignore`/`continue` child-fail policies never resolve the parent**

`applyChildFailPolicy` calls `resolveParentIfDone` only for the `remove_dep` branch. For `ignore` and `continue` policies the function returns without doing anything, so the parent stays in `'waiting-children'` permanently if all of its children end in `'failed'` or `'dead'`.

The previous thread's fix (adding `'failed'` to the `NOT IN` clause of `resolveParentIfDone`) is necessary but not sufficient: it only matters when `resolveParentIfDone` is actually called. When the last child fails with `ignore` or `continue` policy, `resolveParentIfDone` is never reached, leaving the parent stuck forever.

```ts
// fix: add the missing branch
} else if (policy === "ignore" || policy === "continue") {
  resolveParentIfDone(db, parentId, now);
}
```
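The terminal-child check implied by that fix can be sketched as a predicate. The status names mirror the discussion, but `parentIsDone` and the exact set are illustrative, not the real `resolveParentIfDone` implementation.

```typescript
type ChildStatus = "waiting" | "active" | "completed" | "failed" | "dead";

// Per the suggested fix, 'failed' and 'dead' both count as terminal:
// a parent in 'waiting-children' resolves once no child remains in a
// non-terminal state.
const TERMINAL: ReadonlySet<ChildStatus> = new Set<ChildStatus>([
  "completed",
  "failed",
  "dead",
]);

function parentIsDone(children: ChildStatus[]): boolean {
  return children.every((s) => TERMINAL.has(s));
}

console.log(parentIsDone(["completed", "failed"])); // true: parent resolves
console.log(parentIsDone(["completed", "active"])); // false: still waiting
```

The missing branch matters because this predicate is only evaluated when some code path calls it; a terminal set that includes `'failed'` does nothing for a policy branch that never runs the check.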
The core contract work this needed is now landed in #68915.

What landed:
What did not land:
So this PR is not closed as superseded, but it should now be reworked on top of the landed core seam instead of carrying its own overlapping core substrate. The expected follow-up shape is: keep
Codex review: found issues before merge.

Summary

Reproducibility: Do we have a high-confidence way to reproduce the issue? Yes for the patch-level defects: the diff shows deterministic state-machine paths for expiring tracked spawns and stuck parent jobs. The broader durable-executor request is a design change rather than a single current-main bug reproduction.

Next step before merge

Security Review findings
Review details

Best possible solution: Keep Tasks and Task Flow as the canonical ledger/orchestration layer, keep the landed detached-runtime registration seam, and rework durable queue mechanics as a smaller executor behind that contract while splitting scanner changes into their own PR.

Do we have a high-confidence way to reproduce the issue? Yes for the patch-level defects: the diff shows deterministic state-machine paths for expiring tracked spawns and stuck parent jobs. The broader durable-executor request is a design change rather than a single current-main bug reproduction.

Is this the best way to solve the issue? No. The branch introduces a second core substrate beside Tasks/Task Flow; the narrower maintainable path is an executor implementation behind the existing detached-runtime seam.

Full review comments:
Overall correctness: patch is incorrect.

Security concerns:
What I checked:
Likely related people:
Remaining risk / open question:
Codex review notes: model gpt-5.5, reasoning high; reviewed against 89f73a5ef2b0.
Summary
Problem: OpenClaw's subagent dispatch has no persistence layer. Gateway crashes lose all in-flight work. Sub-agents that timeout or crash leave orphaned processes and session state. There is no cascade cancel (grandchildren keep burning tokens). No idempotency (duplicate spawns race). No token accounting across parent-child trees. Production evidence from a 46,713-page deployment: 98 zombie processes over 18 days, ~40% sub-agent timeout failure rate during peak load, 4,330 GitHub issues mentioning zombie/orphan/daemon (RFC: Process Registry).
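The cascade-cancel gap can be made concrete: cancelling a parent has to reach grandchildren too. Below is a minimal in-memory sketch of the descendant walk (the PR does this in SQL with a recursive CTE); the row shape, `descendants` name, and depth cap are illustrative.

```typescript
interface Row { id: number; parentJobId: number | null; }

// Breadth-first walk over parent_job_id links with a depth cap,
// mirroring what a recursive CTE would collect before a bulk UPDATE.
function descendants(rows: Row[], rootId: number, maxDepth = 32): number[] {
  const out: number[] = [];
  let frontier = [rootId];
  for (let depth = 0; depth < maxDepth && frontier.length > 0; depth++) {
    const next = rows
      .filter((r) => r.parentJobId !== null && frontier.includes(r.parentJobId))
      .map((r) => r.id);
    out.push(...next);
    frontier = next;
  }
  return out;
}

const rows: Row[] = [
  { id: 1, parentJobId: null },
  { id: 2, parentJobId: 1 },
  { id: 3, parentJobId: 2 }, // grandchild: must be cancelled too
];
console.log(descendants(rows, 1)); // collects 2 and then 3
```

Without this transitive collection, cancelling job 1 leaves job 3 running, which is the "grandchildren keep burning tokens" failure mode.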
Why it matters: Every OpenClaw deployment that graduates from "chatbot" to "always-on agent with background jobs" hits this wall. The failure modes compound: cron pile-up starves user messages, crashed workers leave unreclaimable work, and there is no single command that shows "what is running right now."
What changed: New `src/minions/` subsystem — a BullMQ-inspired durable job queue backed by `node:sqlite` (zero new runtime dependencies):

- 9-state lifecycle, parent-child DAGs with atomic cascade-cancel via recursive CTE, lock-heartbeat with sub-second stall detection, idempotency via partial unique index, token accounting with parent roll-up, inbox messaging for mid-flight steering.
- Real subagent handler that calls `spawnSubagentDirect` with crash recovery via stall re-queue.
- Gateway startup wiring (`maybeStartMinionWorker`) with `minions.durability` config flag.
- `sessions-spawn-tool.ts` integration — every subagent spawn is tracked in `minion_jobs` for crash recovery when durability mode is enabled.
- Fan-out helper for parallel child dispatch.
- Plugin SDK seam at `openclaw/plugin-sdk/minions-runtime`.
- Gateway protocol schemas (submit/list/cancel/stats/inbox).
- `openclaw minions` CLI for queue inspection.
- TaskRegistry shadow-write facade. Doctor migration contribution.

What did NOT change (scope boundary):
- `TaskRegistry` public API is unchanged — all 40+ consuming files keep calling the same functions with the same shapes.
- `sessions-spawn-tool.ts` still calls `spawnSubagentDirect` first, then writes the tracking record to minion_jobs (best-effort, never fails the spawn).
- `node:sqlite` (`DatabaseSync`), which is already the pattern in `src/tasks/task-registry.store.sqlite.ts`.
- If `minions.durability` is not set or set to `"legacy"`, the worker doesn't start and behavior is identical to today.

How it works end-to-end
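A minimal in-memory sketch of the claimed lifecycle (waiting, then active on claim) with idempotent submission. This mimics the semantics the PR attributes to `minion_jobs`; the `submit`/`claim` names, the job shape, and the reduced status set are illustrative, not the real API.

```typescript
type Status = "waiting" | "active" | "completed";
interface Job { id: number; name: string; idempotencyKey: string; status: Status; }

const jobs: Job[] = [];
let nextId = 1;

// Idempotent submit: a duplicate spawn with the same key collapses
// onto the existing non-completed row instead of creating a second job.
function submit(name: string, idempotencyKey: string): Job {
  const existing = jobs.find(
    (j) => j.idempotencyKey === idempotencyKey && j.status !== "completed",
  );
  if (existing) return existing;
  const job: Job = { id: nextId++, name, idempotencyKey, status: "waiting" };
  jobs.push(job);
  return job;
}

// Claim: the worker takes one waiting job and marks it active.
function claim(): Job | undefined {
  const job = jobs.find((j) => j.status === "waiting");
  if (job) job.status = "active";
  return job;
}

const a = submit("subagent.spawn", "run:abc");
const b = submit("subagent.spawn", "run:abc"); // idempotent: same row
console.log(a.id === b.id); // true
console.log(claim()?.id);   // 1: the single row is claimed once
```

In the real system the same idea is enforced by a partial unique index on `idempotency_key` and a locked claim UPDATE rather than in-process state.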
Evidence from production
Data queried live from a 46,713-page Wintermute deployment (via GBrain MCP, 2026-04-19):
`stalled_counter=1`, `attempts_started=2` — crashed mid-sync, stall-detected, rescued

Benchmark results (from e2e test suite)
Test coverage
94 test cases across 10 test files:
All tests pass. Full repo suite: 38,571 passed, 0 failed (61 Vitest shards).
Key design decisions
- `node:sqlite`, not pglite — zero new deps; matches the existing `src/tasks/` pattern. WAL + `BEGIN IMMEDIATE` for writer serialization.
- Spawn-then-track, not queue-first — `sessions-spawn-tool.ts` still calls `spawnSubagentDirect` first (preserving the synchronous return contract), then writes a tracking record to `minion_jobs`. The worker's stall detection is the safety net that catches crashed subagents and re-spawns them.
- `minions.durability` config flag — default `"minions"` (active). Set to `"legacy"` to disable the worker and revert to today's behavior. One config line rollback.
- CAS guard on complete/fail — `UPDATE ... RETURNING` with `WHERE attempts_made = ?` prevents double-execution when lock renewal starves under event-loop pressure.
- 2-step recursive CTE for cascade-cancel — `WITH RECURSIVE` collects descendant IDs, then chunked `UPDATE WHERE id IN (...)`. Uses `UNION` + depth cap + `CHECK (parent_job_id != id)` for cycle protection.

Change Type (select all)
Scope (select all touched areas)
Linked Issue/PR
`claude` processes, swap exhaustion #44790 (ACP session orphaned processes)

Root Cause (if applicable)
OpenClaw's subagent dispatch is fire-and-forget with no durable state. The gateway runs sessions in-process; when the process dies, all in-flight work is lost. The reconcile loop in `task-registry.maintenance.ts` polls every 60s with a 5-minute grace period, but it can only mark tasks as `lost` (terminal) — it cannot resume them. Minions replaces this with sub-second lock-heartbeat renewal and durable SQLite state that survives process death, with a real handler that re-spawns crashed subagents.
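The contrast in that root cause can be reduced to two outcomes. This is a hedged sketch of the described semantics only; `legacyReconcile` and `minionsStall` are illustrative names, not functions in the codebase.

```typescript
type Outcome = "lost" | "requeued" | null;

// Legacy sweep: after the grace period it can only mark the task
// terminal; the work itself is not resumed.
function legacyReconcile(graceExpired: boolean): Outcome {
  return graceExpired ? "lost" : null;
}

// Minions stall path: the durable row goes back to waiting and is
// re-claimed by a worker, so the work is retried instead of abandoned.
function minionsStall(lockExpired: boolean): Outcome {
  return lockExpired ? "requeued" : null;
}

console.log(legacyReconcile(true)); // lost
console.log(minionsStall(true));    // requeued
```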