
minions: durable SQLite-backed job queue for subagents, ACP, CLI, and cron #68718

Open

garrytan wants to merge 11 commits into openclaw:main from garrytan:minions

Conversation

@garrytan garrytan (Contributor) commented Apr 18, 2026

Summary

  • Problem: OpenClaw's subagent dispatch has no persistence layer. Gateway crashes lose all in-flight work. Sub-agents that time out or crash leave orphaned processes and session state. There is no cascade cancel (grandchildren keep burning tokens). No idempotency (duplicate spawns race). No token accounting across parent-child trees. Production evidence from a 46,713-page deployment: 98 zombie processes over 18 days, ~40% sub-agent timeout failure rate during peak load, 4,330 GitHub issues mentioning zombie/orphan/daemon (RFC: Process Registry).

  • Why it matters: Every OpenClaw deployment that graduates from "chatbot" to "always-on agent with background jobs" hits this wall. The failure modes compound: cron pile-up starves user messages, crashed workers leave unreclaimable work, and there is no single command that shows "what is running right now."

  • What changed: New src/minions/ subsystem — a BullMQ-inspired durable job queue backed by node:sqlite (zero new runtime dependencies). 9-state lifecycle, parent-child DAGs with atomic cascade-cancel via recursive CTE, lock-heartbeat with sub-second stall detection, idempotency via partial unique index, token accounting with parent roll-up, inbox messaging for mid-flight steering. Real subagent handler that calls spawnSubagentDirect with crash recovery via stall re-queue. Gateway startup wiring (maybeStartMinionWorker) with minions.durability config flag. sessions-spawn-tool.ts integration — every subagent spawn is tracked in minion_jobs for crash recovery when durability mode is enabled. Fan-out helper for parallel child dispatch. Plugin SDK seam at openclaw/plugin-sdk/minions-runtime. Gateway protocol schemas (submit/list/cancel/stats/inbox). openclaw minions CLI for queue inspection. TaskRegistry shadow-write facade. Doctor migration contribution.

  • What did NOT change (scope boundary):

    • TaskRegistry public API is unchanged — all 40+ consuming files keep calling the same functions with the same shapes.
    • The subagent spawn path works exactly as before: sessions-spawn-tool.ts still calls spawnSubagentDirect first, then writes the tracking record to minion_jobs (best-effort, never fails the spawn).
    • No new runtime dependencies. Uses node:sqlite (DatabaseSync) which is already the pattern in src/tasks/task-registry.store.sqlite.ts.
    • No changes to gateway protocol wire compatibility — new schemas are additive only.
    • When minions.durability is not set or set to "legacy", the worker doesn't start and behavior is identical to today.

How it works end-to-end

1. User/agent calls sessions_spawn
2. spawnSubagentDirect launches the subagent (same as today)
3. NEW: spawn tool writes tracking record to minion_jobs (with runId as idempotency key)
4. MinionWorker runs in background with stall detection every 10s
5. If gateway crashes:
   - minion_jobs row has status=active, lock_until in the past
   - New gateway starts, MinionWorker detects stalled row
   - Handler re-spawns the subagent via spawnSubagentDirect
   - User sees: subagent resumed, not lost
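
For orientation, here is a minimal sketch of the stall-detection step (4–5) against the minion_jobs columns discussed in this PR. This is illustrative only, not the actual MinionWorker code; the real handleStalled also dead-letters rows that exceed max_stalled.

```ts
import { DatabaseSync } from "node:sqlite";

// Illustrative sketch: re-queue 'active' rows whose lock expired because the
// owning gateway died. Column names follow this PR's schema; the function name
// and exact shape are assumptions, not the real worker implementation.
function requeueStalledJobs(db: DatabaseSync, now = Date.now()): number[] {
  const rows = db
    .prepare(
      `UPDATE minion_jobs SET
         status = 'waiting',
         stalled_counter = stalled_counter + 1,
         lock_token = NULL, lock_until = NULL,
         updated_at = ?
       WHERE status = 'active' AND lock_until IS NOT NULL AND lock_until < ?
       RETURNING id`,
    )
    .all(now, now) as Array<{ id: number }>;
  return rows.map((row) => row.id);
}
```

On the next claim cycle the subagent handler picks these rows back up and re-spawns via spawnSubagentDirect, which is the "resumed, not lost" behavior in step 5.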

Evidence from production

Data queried live from a 46,713-page Wintermute deployment (via GBrain MCP, 2026-04-19):

| Metric | Value |
| --- | --- |
| Brain scale | 46,713 pages, 12,264 people, 50,601 links, 72,080 timeline entries |
| Sync job #1 | stalled_counter=1, attempts_started=2 — crashed mid-sync, stall-detected, rescued |
| Sync job #5 | 802 pages modified, 1,126 chunks created, 14 min wall time |
| Zombie processes | 98 accumulated over 18 days (RFC data) |
| Sub-agent timeout rate | ~40% during peak load (RFC data) |
| Cron pile-up | 22 cron jobs, 5 firing simultaneously at :00 |

Benchmark results (from e2e test suite)

| Scenario | Minions | Raw subagent |
| --- | --- | --- |
| Durability (SIGKILL rescue) | 10/10 rescued | 0/10 (total loss) |
| Throughput (queue claim) | >100 claims/sec | ~0.1 ops/sec (10s/spawn) |
| Fan-out (50 children) | 50/50 completed, tokens rolled up | ~500s serial, no roll-up |
| Memory (50 concurrent) | <200MB delta, shared process | ~4000MB (80MB x 50) |
| Cascade cancel (51 nodes) | 51/51 atomic, sub-ms | Per-process SIGKILL, orphans survive |
| Cron pile-up (5 simultaneous) | All 5 parallel, no backlog | Sequential, messages timeout |
| Idempotent respawn | 1 job created from 2 submits | 2 full runs, duplicate tokens |
| Autopilot multi-task | 3 heterogeneous children, parent auto-resolves | 3 uncoordinated cron invocations |

Test coverage

94 test cases across 10 test files:

  • Unit (81 tests): Schema DDL + constraints, MinionQueue CRUD + parent-child + idempotency + cascade-kill + stall + timeout, MinionWorker lifecycle + CAS guard + retry, fan-out + token roll-up, handler registry + contract validation, status-map bidirectional + round-trip, CAS guard regression, idempotency null semantics, cascade-cancel depth-3
  • E2E (13 tests): Concurrent claim (2 workers, 20 jobs, 0 double-claim), runaway handler timeout dead-letter, crash rescue via stall detection, deep tree fan-in (1->3->6, child_done propagation), cascade kill (10 active children, AbortSignal observed), durability/throughput/fan-out/memory/cascade-cancel/cron-pile-up/idempotent-respawn/autopilot benchmarks grounded in real Wintermute production data

All tests pass. Full repo suite: 38,571 passed, 0 failed (61 Vitest shards).

Key design decisions

  1. node:sqlite not pglite — zero new deps; matches existing src/tasks/ pattern. WAL + BEGIN IMMEDIATE for writer serialization.

  2. Spawn-then-track, not queue-first — sessions-spawn-tool.ts still calls spawnSubagentDirect first (preserving the synchronous return contract), then writes a tracking record to minion_jobs. The worker's stall detection is the safety net that catches crashed subagents and re-spawns them.

  3. minions.durability config flag — default "minions" (active). Set to "legacy" to disable the worker and revert to today's behavior. One config line rollback.

  4. CAS guard on complete/fail — UPDATE ... RETURNING with WHERE attempts_made = ? prevents double-execution when lock renewal starves under event-loop pressure.

  5. 2-step recursive CTE for cascade-cancel — WITH RECURSIVE collects descendant IDs, then chunked UPDATE WHERE id IN (...). Uses UNION + depth cap + CHECK (parent_job_id != id) for cycle protection.
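
To make decision 5 concrete, here is a minimal sketch of the 2-step cascade-cancel, assuming the column and status names used elsewhere in this thread; the real cancelJob in src/minions/queue.ts handles more cases.

```ts
import { DatabaseSync } from "node:sqlite";

// Step 1: WITH RECURSIVE collects the root plus all descendants (UNION plus a
// depth cap guards against cycles). Step 2: chunked UPDATE cancels every
// collected id that is still in a non-terminal status.
// The depth cap of 32 and chunk size of 500 are illustrative assumptions.
function cancelTreeSketch(db: DatabaseSync, rootId: number, now = Date.now()): number {
  const ids = (
    db
      .prepare(
        `WITH RECURSIVE tree(id, depth) AS (
           SELECT id, 0 FROM minion_jobs WHERE id = ?
           UNION
           SELECT j.id, t.depth + 1
           FROM minion_jobs j JOIN tree t ON j.parent_job_id = t.id
           WHERE t.depth < 32
         )
         SELECT id FROM tree`,
      )
      .all(rootId) as Array<{ id: number }>
  ).map((row) => row.id);

  let cancelled = 0;
  for (let i = 0; i < ids.length; i += 500) {
    const chunk = ids.slice(i, i + 500);
    const placeholders = chunk.map(() => "?").join(",");
    const result = db
      .prepare(
        `UPDATE minion_jobs SET
           status = 'cancelled', lock_token = NULL, lock_until = NULL,
           finished_at = ?, updated_at = ?
         WHERE id IN (${placeholders})
           AND status IN ('waiting','active','delayed','waiting-children','paused','attached')`,
      )
      .run(now, now, ...chunk);
    cancelled += Number(result.changes);
  }
  return cancelled;
}
```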

Change Type (select all)

  • Feature
  • Bug fix (pnpm workspace symlink scan + subagent crash recovery)
  • Docs

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Memory / storage
  • API / contracts
  • UI / DX

Linked Issue/PR

Root Cause (if applicable)

OpenClaw's subagent dispatch is fire-and-forget with no durable state. The gateway runs sessions in-process; when the process dies, all in-flight work is lost. The reconcile loop in task-registry.maintenance.ts polls every 60s with a 5-minute grace period, but it can only mark tasks as lost (terminal) — it cannot resume them. Minions replaces this with sub-second lock-heartbeat renewal and durable SQLite state that survives process death, with a real handler that re-spawns crashed subagents.

@openclaw-barnacle openclaw-barnacle Bot added docs Improvements or additions to documentation app: web-ui App: web-ui gateway Gateway runtime cli CLI command changes scripts Repository scripts size: XL labels Apr 18, 2026
@garrytan (Contributor Author)

This is not quite ready for review yet! I have another phase or two left before it is complete.

@openclaw-barnacle openclaw-barnacle Bot added the agents Agent runtime and tooling label Apr 18, 2026
@onutc onutc (Contributor) commented Apr 19, 2026

Thank you @garrytan!

I have bumped this to @mbelinky and @vincentkoc who worked on tasks and Task Flows

Also triaging this now, will update here!

@garrytan garrytan marked this pull request as ready for review April 19, 2026 00:35
@greptile-apps greptile-apps Bot commented Apr 19, 2026

Greptile Summary

This PR introduces src/minions/, a BullMQ-inspired durable SQLite job queue that adds persistence, stall detection, cascade-cancel, and token accounting to OpenClaw's subagent dispatch. The four previously-flagged issues (double-spawn on claim, string-interpolated SQL, SIGTERM listener accumulation, and resolveParentIfDone missing 'failed') are all addressed in the current HEAD.

One P1 remains: applyChildFailPolicy for 'ignore'/'continue' child-failure policies never calls resolveParentIfDone, so a parent whose all children terminate via failure (not completion) will be stuck in 'waiting-children' forever. The prior thread's NOT IN clause fix only helps when resolveParentIfDone is actually invoked by a subsequent completing child.

Confidence Score: 4/5

Safe to merge with one P1 defect: parents using ignore/continue child-fail policy will be permanently stuck in waiting-children when all children fail.

Four previously-identified issues were correctly resolved. One genuine P1 remains in applyChildFailPolicy — the parent-resolution call is missing for ignore/continue fail policies. The migration docstring mismatch is P2 (safe in practice). Overall system design and test coverage are solid.

src/minions/queue.ts (applyChildFailPolicy function, lines 827–848)

src/minions/migration.ts (migration docstring, lines 111–133)
**Migration docstring says `status='attached'` but code inserts `status='waiting'`**

The function-level docstring explicitly states:
> "Imported rows get status='attached' (non-claimable) so the orphan-detection sweep at gateway startup can inspect liveness before marking them claimable or dead."

But the INSERT hardcodes `"waiting"` — meaning migrated live tasks are immediately claimable by the worker and will be re-spawned as soon as the worker starts, bypassing any orphan-liveness check.

In practice this is safe for the intended upgrade scenario (full gateway restart, old subagents dead), but it also means the `'attached'` status and the described orphan-detection sweep are dead code / unimplemented design. The docstring should either reflect the real behavior (`'waiting'`) or the insert should use `'attached'` and the orphan sweep should be wired up in `gateway-startup.ts`.



Comment thread src/agents/tools/sessions-spawn-tool.ts Outdated
Comment on lines +348 to +379
```ts
// Track the spawned subagent in minions for crash recovery.
// Best-effort: if minions is unavailable, the spawn already succeeded.
if (result.status !== "error") {
  try {
    const { resolveDurabilityMode } = await import("../../minions/durability-config.js");
    if (resolveDurabilityMode() === "minions") {
      const { MinionQueue } = await import("../../minions/queue.js");
      const { MinionStore } = await import("../../minions/store.js");
      const store = MinionStore.openDefault();
      const minionQueue = new MinionQueue(store);
      minionQueue.add("subagent.spawn", {
        task,
        childSessionKey: result.childSessionKey,
        runId: result.runId,
        label: label || undefined,
        agentId: requestedAgentId,
        model: modelOverride,
        thinking: thinkingOverrideRaw,
        runTimeoutSeconds,
        cleanup,
        mode,
        lightContext,
        requesterSessionKey: opts?.agentSessionKey,
      }, {
        idempotencyKey: result.runId ? `spawn:${result.runId}` : undefined,
        timeoutMs: runTimeoutSeconds ? runTimeoutSeconds * 1000 : undefined,
      });
    }
  } catch {
    // Best-effort: minion tracking failure doesn't fail the spawn.
  }
}
```
Contributor

P1 Duplicate subagent spawn when durability mode is "minions"

spawnSubagentDirect is called on line 313, and then a "waiting" job is immediately added to the queue (line 358). The MinionWorker will claim that "waiting" job and call spawnSubagentDirect again inside subagentSpawnHandler, spawning a second subagent for every call to sessions_spawn when durability is enabled.

The isRetry flag inside the handler (based on previousSessionKey being set) only changes a progress-phase label — it does not skip the spawn. Both the "fresh spawn" and "crash recovery" branches of the handler unconditionally call spawnSubagentDirect.

The PR description's "spawn-then-track" intent requires that the tracking row be inserted as status = 'active' (with a valid lock_token and lock_until) so the worker's normal claim loop cannot pick it up — only stall detection can re-queue it after a crash. Using minionQueue.add() (which always inserts status = 'waiting') makes every newly-spawned subagent immediately claimable and re-processed by the worker.


Contributor Author

Fixed in 64fbe36. The tracking row is now inserted as status='active' with a valid lock_token and lock_until via direct SQL, so the worker's claim loop (which only claims 'waiting') skips it. Only stall detection can re-queue the row after a crash.

```diff
- minionQueue.add("subagent.spawn", {...})  // 'waiting' = worker claims = double-spawn
+ store.db.prepare('INSERT ... status=active, lock_token=?, lock_until=?')  // not claimable
```

Comment thread src/minions/queue.ts
Comment on lines +418 to +426
```ts
removeJob(id: number): boolean {
  const terminalList = MINION_TERMINAL_STATUSES.join("','");
  const rows = this.store.db
    .prepare(
      `DELETE FROM minion_jobs WHERE id = ? AND status IN ('${terminalList}') RETURNING id`,
    )
    .all(id);
  return rows.length > 0;
}
```
Contributor

P2 String-interpolated SQL in removeJob

MINION_TERMINAL_STATUSES.join("','") is spliced directly into the SQL string. The values are hardcoded compile-time constants so there is no real injection risk today, but the pattern diverges from every other query in this file (which use ? placeholders). If the constant is ever extended or the join is copy-pasted elsewhere, the invariant breaks silently.

Prefer building placeholder lists the same way claim, cancelJob, and prune do:

Suggested change:

```ts
removeJob(id: number): boolean {
  const placeholders = MINION_TERMINAL_STATUSES.map(() => "?").join(",");
  const rows = this.store.db
    .prepare(
      `DELETE FROM minion_jobs WHERE id = ? AND status IN (${placeholders}) RETURNING id`,
    )
    .all(id, ...MINION_TERMINAL_STATUSES);
  return rows.length > 0;
}
```

Contributor Author

Fixed in 64fbe36. Switched to ? placeholders matching every other query in the file:

```ts
const placeholders = MINION_TERMINAL_STATUSES.map(() => "?").join(",");
const rows = this.store.db
  .prepare(`DELETE FROM minion_jobs WHERE id = ? AND status IN (${placeholders}) RETURNING id`)
  .all(id, ...MINION_TERMINAL_STATUSES);
```

Comment thread src/minions/worker.ts Outdated
Comment on lines +67 to +71
```ts
const shutdown = () => {
  this.running = false;
};
process.on("SIGTERM", shutdown);
process.on("SIGINT", shutdown);
```
Contributor

P2 SIGTERM/SIGINT listener accumulates on multiple start() calls

shutdown is registered on the global process each time start() is called, but only removed in the finally block. If start() throws before the while loop (e.g., the first promoteDelayed() throws), the finally block still runs and removes the listener — so that path is fine. The real concern is testing: if tests create multiple MinionWorker instances and call start(), Node.js emits MaxListenersExceededWarning. The maybeStartMinionWorker singleton guard prevents this in production, but unit tests that construct workers directly will hit it. Consider using a once flag or calling process.setMaxListeners in the worker constructor to surface the issue clearly.


Contributor Author

Fixed in 64fbe36. Two changes:

  1. process.on → process.once so listeners don't accumulate across test instances.
  2. Added re-entry guard at the top of start(): if (this.running) return; prevents double-start entirely.
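
For clarity, the shape described above looks roughly like this (a sketch, not the exact worker code):

```ts
// Sketch: re-entry guard plus process.once so repeated start() calls in tests
// cannot stack SIGTERM/SIGINT listeners. The loop body is elided.
class MinionWorkerSketch {
  private running = false;

  async start(): Promise<void> {
    if (this.running) return; // re-entry guard: double-start is a no-op
    this.running = true;

    const shutdown = () => {
      this.running = false;
    };
    process.once("SIGTERM", shutdown);
    process.once("SIGINT", shutdown);

    try {
      while (this.running) {
        // claim / heartbeat / stall-detection cycle would run here
        await new Promise((resolve) => setTimeout(resolve, 1_000));
      }
    } finally {
      process.removeListener("SIGTERM", shutdown);
      process.removeListener("SIGINT", shutdown);
    }
  }
}
```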

Comment thread src/minions/queue.ts
Comment on lines +806 to +816
```ts
function resolveParentIfDone(db: DatabaseSync, parentId: number, now: number): void {
  db.prepare(
    `UPDATE minion_jobs SET status = 'waiting', updated_at = ?
     WHERE id = ? AND status = 'waiting-children'
       AND NOT EXISTS (
         SELECT 1 FROM minion_jobs
         WHERE parent_job_id = ?
           AND status NOT IN ('completed', 'dead', 'cancelled')
       )`,
  ).run(now, parentId, parentId);
}
```
Contributor

P2 resolveParentIfDone excludes 'failed' from terminal child statuses

The NOT EXISTS clause checks status NOT IN ('completed', 'dead', 'cancelled'). The 'failed' status is in MINION_TERMINAL_STATUSES and is produced by applyChildFailPolicy's direct SQL update when on_child_fail = 'fail_parent'. A parent whose child just failed via that policy will already be set to 'failed' itself, so the resolution is moot in that branch — but for on_child_fail = 'ignore' or 'continue', a 'failed' child leaves the parent stuck in 'waiting-children' forever because 'failed' is not treated as terminal here. Including 'failed' in the exclusion list (alongside 'dead' and 'cancelled') aligns this with the MINION_TERMINAL_STATUSES definition.

The same gap exists in the public resolveParent method (line 722).


Contributor Author

Fixed in 64fbe36. Added 'failed' to the NOT IN clause in all three locations: resolveParentIfDone, the inline resolution in completeJob, and the public resolveParent method. A parent with on_child_fail='ignore' or 'continue' no longer gets stuck in 'waiting-children' when a child fails.

```diff
- AND status NOT IN ('completed', 'dead', 'cancelled')
+ AND status NOT IN ('completed', 'failed', 'dead', 'cancelled')
```

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5088268ba6


Comment on lines +17 to +20
```ts
export function maybeStartMinionWorker(): void {
  if (activeWorker) {
    return;
  }
```

P1: Call minion worker startup during gateway initialization

This adds maybeStartMinionWorker() but nothing in the repository invokes it, so the worker loop never starts at runtime. In that state, enqueued minion jobs are never claimed and the new durability behaviors (stall/timeout recovery, retries, handler execution) do not run, which means crash-recovery is effectively disabled despite enabling minion writes.

Contributor Author

Fixed in 604c942. Wired maybeStartMinionWorker() into src/gateway/server-startup-early.ts via dynamic import, right after startTaskRegistryMaintenance():

import("../minions/gateway-startup.js")
  .then((m) => m.maybeStartMinionWorker())
  .catch(() => {});

Dynamic import keeps the gateway startup path lazy (minions module only loads if the import succeeds). The worker now actually starts alongside the gateway.

Comment thread src/tasks/task-minion-sync.ts Outdated
Comment on lines +41 to +43
```ts
const existingId = taskToMinionId.get(task.taskId);

if (existingId != null) {
```

P2: Persist task-to-minion identity instead of process-local map

The upsert path depends on the in-memory taskToMinionId map to decide whether to UPDATE or INSERT. After any process restart this map is empty, so existing tasks are inserted again instead of updated, creating duplicate shadow rows and leaving older rows stale (and undeletable via syncTaskDeleteToMinions, which also relies on that map). This will skew minion status/stats over time.

Contributor Author

Fixed in 604c942. Replaced the process-local Map<string, number> with DB-level dedup via idempotency_key = 'task:{taskId}'. On upsert, the sync module queries SELECT id FROM minion_jobs WHERE idempotency_key = ? before deciding INSERT vs UPDATE. Survives restarts, no duplicate rows.

Also switched syncTaskDeleteToMinions to delete by idempotency_key instead of the old map lookup.
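
Roughly, the restart-safe dedup looks like this (a sketch under the assumptions noted in the comments, not the exact sync code):

```ts
import { DatabaseSync } from "node:sqlite";

// Sketch: decide INSERT vs UPDATE by looking up idempotency_key = `task:{taskId}`
// in the DB instead of a process-local Map. The job name "task.shadow" and the
// abbreviated column set are placeholders; the real sync writes more fields.
function upsertTaskShadow(db: DatabaseSync, taskId: string, status: string, data: string): void {
  const key = `task:${taskId}`;
  const now = Date.now();
  const existing = db
    .prepare("SELECT id FROM minion_jobs WHERE idempotency_key = ?")
    .get(key) as { id: number } | undefined;

  if (existing) {
    db.prepare("UPDATE minion_jobs SET status = ?, data = ?, updated_at = ? WHERE id = ?").run(
      status,
      data,
      now,
      existing.id,
    );
  } else {
    db.prepare(
      `INSERT INTO minion_jobs (name, queue, status, data, idempotency_key, created_at, updated_at)
       VALUES ('task.shadow', 'default', ?, ?, ?, ?, ?)`,
    ).run(status, data, key, now, now);
  }
}
```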

@garrytan (Contributor Author)

Fixing Greptile issues!

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 604c942ab8


Comment thread src/tasks/task-minion-sync.ts Outdated
```ts
  }
  storeInitAttempted = true;
  try {
    const { MinionStore: Store } = require("../minions/store.js") as typeof import("../minions/store.js");
```

P1: Replace bare require in ESM minion sync loader

This module is executed in an ESM package (type: module), so require(...) throws ReferenceError; the error is caught, storeInitAttempted is set, and all subsequent syncTaskToMinions/syncTaskDeleteToMinions calls are permanently no-ops for that process. In practice, task-registry shadow writes to minion_jobs never happen in production after the first call path.

Contributor Author

Fixed in e318c34. Replaced require() with await import() and made getStore(), syncTaskToMinions(), and syncTaskDeleteToMinions() async. Shadow writes now actually work in the ESM runtime.
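
The resulting loader is roughly this (a sketch, assuming the module path and openDefault() used elsewhere in this PR):

```ts
// Sketch of the ESM-safe lazy loader: await import() replaces the bare require()
// that throws in a "type": "module" package.
let store: import("../minions/store.js").MinionStore | null = null;
let storeInitAttempted = false;

async function getStore() {
  if (store || storeInitAttempted) return store;
  storeInitAttempted = true;
  try {
    const { MinionStore } = await import("../minions/store.js");
    store = MinionStore.openDefault();
  } catch {
    // Best-effort: shadow writes stay disabled for this process if minions is unavailable.
  }
  return store;
}
```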

```ts
    lightContext,
    expectsCompletionMessage: true,
  },
  {},
```

P1: Forward requester context when respawning subagent jobs

Crash-recovery respawns call spawnSubagentDirect with an empty context object, so requester identity metadata (session/channel/account/thread) is dropped on retry. Any recovered job can be relaunched under default/main-session context instead of the original caller, which changes ownership checks and routing behavior for resumed subagents.

Contributor Author

Fixed in e318c34. The handler now destructures requesterSessionKey, requesterChannel, requesterAccountId, requesterTo, requesterThreadId from job.data and forwards them to spawnSubagentDirect's second argument. Crash-recovered subagents retain their original requester identity.

Comment thread src/agents/tools/sessions-spawn-tool.ts Outdated
```ts
// status='active' with a lock so the worker's claim loop (which only
// claims 'waiting') won't touch it. Only stall detection can re-queue
// the row if the gateway crashes and the lock expires.
if (result.status !== "error") {
```

P2: Track only accepted spawn results in minion_jobs

The tracking insert runs for every non-error result, including status: "forbidden". Forbidden responses typically have no runnable run/session identifiers, yet this still writes an active minion row, which later appears as stalled/dead noise and can trigger incorrect recovery behavior if stall policy is relaxed. Restricting tracking to accepted spawns avoids poisoning queue state.

Contributor Author

Fixed in e318c34. Changed result.status !== "error" to result.status === "accepted". Forbidden/rejected spawns no longer write tracking rows.

Comment thread src/minions/migration.ts
Comment on lines +113 to +117
```ts
.run(
  `legacy.${row.runtime}`,
  "default",
  minionStatus,
  JSON.stringify({
```

P1: Import legacy tasks into claimable handler-compatible states

Migration writes live legacy tasks as attached with legacy.<runtime> names, but worker claims only waiting jobs whose names match registered handlers (subagent.spawn, acp.spawn, etc.). In this commit there is no runtime transition path that makes these imported rows claimable, so archived in-flight legacy tasks are stranded instead of recoverable.

Contributor Author

Fixed in e318c34. Two changes:

  1. Migration now maps runtime → registered handler name (subagent→subagent.spawn, acp→acp.spawn, cli→cli.spawn, cron→cron.tick) so imported rows match what the worker claims.
  2. Status changed from attached to waiting so imported rows are immediately claimable by the worker for recovery.
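
For reference, the mapping in point 1 amounts to something like this (a sketch; the real table lives in migration.ts):

```ts
// Runtime -> registered handler name, so imported legacy rows match what the
// worker claims. Unknown runtimes keep the legacy.<runtime> prefix, which no
// handler registers, so they stay visible but unclaimable.
const RUNTIME_TO_HANDLER: Record<string, string> = {
  subagent: "subagent.spawn",
  acp: "acp.spawn",
  cli: "cli.spawn",
  cron: "cron.tick",
};

function handlerNameForRuntime(runtime: string): string {
  return RUNTIME_TO_HANDLER[runtime] ?? `legacy.${runtime}`;
}
```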

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e318c340bf


Comment thread src/minions/queue.ts
```ts
const maxStalled =
  typeof row.max_stalled === "bigint" ? Number(row.max_stalled) : row.max_stalled;

if (counter + 1 < maxStalled) {
```

P1: Requeue first stalled attempt before dead-lettering jobs

handleStalled uses counter + 1 < maxStalled, which means jobs with the default max_stalled = 1 are marked dead on their very first expired lock instead of being re-queued once. This breaks the advertised crash-recovery path for default-tracked subagent rows (for example those inserted from sessions-spawn-tool), because a single gateway crash never gets a rescue attempt.

Contributor Author

Fixed in 54dbce3. Changed max_stalled default from 1 to 2 in schema DDL. With the old default, 0+1 < 1 was false, so the first stall went straight to dead-letter with no re-queue attempt. Now: first stall re-queues (0+1 < 2 = true), second stall dead-letters (1+1 < 2 = false). Crash recovery works on first attempt.
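
The decision boils down to the following (illustrative only):

```ts
// Stall decision as described above: with maxStalled = 2 the first expired lock
// re-queues the job, the second dead-letters it.
function stallDecision(stalledCounter: number, maxStalled: number): "requeue" | "dead" {
  return stalledCounter + 1 < maxStalled ? "requeue" : "dead";
}

// Old default maxStalled = 1: first stall -> 0 + 1 < 1 is false -> dead immediately.
// New default maxStalled = 2: first stall -> 0 + 1 < 2 -> requeue; second -> 1 + 1 < 2 is false -> dead.
```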

Comment thread src/minions/queue.ts Outdated
Comment on lines +589 to +593
```ts
`UPDATE minion_jobs SET
   status = 'dead',
   error_text = 'timeout exceeded',
   lock_token = NULL, lock_until = NULL,
   finished_at = ?, updated_at = ?
```

P1: Propagate parent policy when timeout dead-letters a child

handleTimeouts directly updates matching rows to dead and returns them, but unlike failJob it never applies on_child_fail handling or parent-resolution logic. When a child job with parent_job_id times out, the parent can remain stuck in waiting-children indefinitely (or miss fail_parent behavior), which blocks fan-out workflows from completing correctly.

Contributor Author

Fixed in 54dbce3. handleTimeouts now runs inside a transaction and calls applyChildFailPolicy() for each timed-out job that has a parentJobId. Parents with on_child_fail='fail_parent' get marked failed; parents with 'ignore'/'continue' get resolution checks. Fan-out workflows no longer get stuck.
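
Sketched out, that fix has roughly this shape (illustrative; the WHERE clause and the applyChildFailPolicy signature are simplified assumptions, not the PR's exact code):

```ts
import { DatabaseSync } from "node:sqlite";

// Dead-letter timed-out jobs and apply the parent fail policy for each child,
// inside one BEGIN IMMEDIATE transaction so parent and child updates are atomic.
function handleTimeoutsSketch(
  db: DatabaseSync,
  applyChildFailPolicy: (db: DatabaseSync, childId: number, parentId: number, now: number) => void,
  now = Date.now(),
): void {
  db.exec("BEGIN IMMEDIATE");
  try {
    const timedOut = db
      .prepare(
        `UPDATE minion_jobs SET
           status = 'dead', error_text = 'timeout exceeded',
           lock_token = NULL, lock_until = NULL,
           finished_at = ?, updated_at = ?
         WHERE status = 'active' AND timeout_ms IS NOT NULL AND started_at + timeout_ms < ?
         RETURNING id, parent_job_id`,
      )
      .all(now, now, now) as Array<{ id: number; parent_job_id: number | null }>;

    for (const job of timedOut) {
      if (job.parent_job_id != null) {
        applyChildFailPolicy(db, job.id, job.parent_job_id, now);
      }
    }
    db.exec("COMMIT");
  } catch (err) {
    db.exec("ROLLBACK");
    throw err;
  }
}
```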

Comment thread src/tasks/task-registry.maintenance.ts Outdated
```diff
@@ -305,7 +320,7 @@ export async function runTaskRegistryMaintenance(): Promise<TaskRegistryMaintena
     if (!current) {
       continue;
     }
-    if (shouldMarkLost(current, now)) {
+    if (!minionsOwnLostDetection && shouldMarkLost(current, now)) {
```

P1: Keep task lost-reconciliation enabled for non-minion runs

This guard disables shouldMarkLost entirely once minionsOwnLostDetection is enabled at gateway startup, so the task-registry sweep no longer marks any active tasks as lost. However, shadow-synced registry rows are written as plain active records without lock heartbeats (task-minion-sync), so minion stall/timeout sweeps cannot replace this for ACP/CLI/cron tasks; those tasks can stay running forever after their backing session disappears.

Contributor Author

Fixed in 54dbce3. Removed the minionsOwnLostDetection flag entirely. The existing task-registry sweep now runs as a safety net alongside minions (belt + suspenders). Minion stall detection is faster (sub-second heartbeat vs 5-min grace), but the old sweep catches ACP/CLI/cron tasks that don't have minion lock heartbeats yet.

@mbelinky (Contributor)

Hi @garrytan thanks for this!
Have you looked at the Tasks and TaskFlow implementation already in openclaw? By the description in your PR there is significant functional overlap.
I will look at it and diff shortly, then happy to discuss

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 54dbce3b5f


Comment thread src/minions/queue.ts
Comment on lines +566 to +570
```ts
`UPDATE minion_jobs SET
   status = 'dead', stalled_counter = stalled_counter + 1,
   error_text = 'max stalled count exceeded',
   lock_token = NULL, lock_until = NULL,
   finished_at = ?, updated_at = ?
```

P1: Run parent-failure handling for stalled jobs marked dead

When handleStalled() dead-letters an expired active job, it only updates the child row and returns it; it never invokes applyChildFailPolicy/parent resolution for rows with parent_job_id. In a fan-out tree, a stalled child that hits max stalled count will become dead while the parent remains waiting-children, so fail_parent/remove_dep behavior is skipped and the parent can remain blocked indefinitely after crash recovery.


Comment thread src/minions/queue.ts
Comment on lines +385 to +389
```ts
`UPDATE minion_jobs SET
   status = 'cancelled', lock_token = NULL, lock_until = NULL,
   finished_at = ?, updated_at = ?
 WHERE id IN (${placeholders})
   AND status IN ('waiting','active','delayed','waiting-children','paused','attached')`,
```

P1: Resolve parent state after cancelling child subtrees

cancelJob() marks descendants as cancelled but does not trigger parent resolution afterward. If an operator cancels a non-root child that is the last unfinished dependency, the parent stays in waiting-children forever because no later child completion will call resolveParentIfDone, effectively wedging that workflow.

@mbelinky (Contributor)

Thanks for the work here. I agree with the problem statement: detached work in OpenClaw needs stronger execution semantics than “track it and later mark it lost”. Crash recovery, retry, stall detection, and cascade cancel are all real needs, and this PR clearly puts serious thought into that layer.

That said, I do not think merging this into core in its current shape would be the right next step.

The main concern here is architectural, rather than whether the queue itself works. OpenClaw already has core product-level abstractions for detached work:

  • Tasks = durable run ledger / operator surface
  • TaskFlow = durable orchestration state above Tasks

This PR introduces a second orchestration/execution plane in core (src/minions/**) and then partially mirrors/shadow-syncs Tasks into it (src/tasks/task-registry.ts, src/tasks/task-minion-sync.ts), while also wiring directly into core startup and spawn paths (src/gateway/server-startup-early.ts, src/agents/tools/sessions-spawn-tool.ts). From a long-term maintenance standpoint, that leaves us with two overlapping durability models in core, which feels like the wrong direction.

A more robust direction would be to:

  1. Keep Tasks / TaskFlow as the canonical core abstractions.
  2. Extract the minimal core seams needed for detached execution management.
  3. Move the queue/worker substrate behind those seams as an optional implementation, ideally an extension/plugin rather than a new core subsystem.

Concretely, it seems like the next step should be a seam-extraction PR, rather than landing minions as-is. That seam PR would focus on:

  • defining the minimal detached executor contract core actually needs
  • making built-in detached runtimes use that contract instead of hardcoding their own execution lifecycle
  • keeping ownership / visibility / cancellation semantics centered on Tasks and TaskFlow
  • avoiding shadow-write duplication between TaskRegistry and a second DB

The kinds of capabilities that may belong in core are things like:

  • start / progress / complete / fail hooks for detached work
  • cancel-tree semantics
  • “mark lost” vs “can resume / reattach” semantics
  • a clean bridge from executor state into Tasks / TaskFlow

In recent internal discussions we have agreed to keep the core as lean as possible, and as such there are things that currently would sit better outside of it, such as:

  • a BullMQ-style queue model as the default architecture
  • a second operator-facing control plane (openclaw minions, minion-specific gateway protocol, minion-specific job model)
  • direct writes from core runtimes into an alternate substrate DB beside Tasks

So this is not a rejection of the problem being solved. It is more a request to re-scope the implementation so the solution fits the architecture we want to preserve.

What would be most helpful next:

  • PR 1: extract detached-work seams in core, centered on Tasks / TaskFlow
  • PR 2: adapt this queue/worker implementation to that seam as an optional extension/plugin
  • PR 3, if needed: selectively opt built-in runtimes into that executor once the seam is proven

If there is a subset of this that truly needs to remain in core, it would help to isolate the smallest necessary subset and make the case for that subset specifically. But I do not think the full minions subsystem should land in core as a parallel durability/orchestration layer.

garrytan added 10 commits April 19, 2026 16:28
BullMQ-inspired queue with 9-state lifecycle, parent-child DAGs, atomic
cascade-cancel via recursive CTE, lock-heartbeat stall detection, idempotency
via partial unique index, token accounting with parent roll-up, inbox messaging.
Backed by node:sqlite (zero new runtime deps). WAL + BEGIN IMMEDIATE for
writer serialization.

Grounded in Wintermute production evidence: 46,713 pages, 98 zombie processes
over 18 days, ~40% subagent timeout rate during peak load.
Four runtime handler stubs (subagent/ACP/CLI/cron) with contract validation.
Real subagent handler calls spawnSubagentDirect with crash recovery.
Gateway startup wiring (maybeStartMinionWorker) with durability config flag.
openclaw minions CLI (list/get/cancel/stats/retry/prune/smoke).
Gateway protocol schemas (submit/list/cancel/stats/inbox).
Plugin SDK seam at openclaw/plugin-sdk/minions-runtime.
TaskRegistry shadow-write: every create/update/delete mirrors to minion_jobs.
Status-map module (bidirectional TaskStatus <-> MinionJobStatus).
Lost-detection gated by minionsOwnLostDetection flag.
Doctor migration contribution for legacy tasks/*.sqlite -> minion_jobs.
sessions-spawn-tool.ts: tracks every subagent spawn in minion_jobs when
durability mode is enabled, with runId as idempotency key.
…on data

Unit: schema, queue CRUD, parent-child, idempotency, cascade-kill, stall,
timeout, worker lifecycle, CAS guard, fan-out, handler registry, status-map.
E2E: concurrent claim, timeout dead-letter, crash rescue, deep fan-in,
cascade kill, durability, throughput, memory, cron pile-up, idempotent
respawn, autopilot multi-task. All benchmarks annotated with real data from
a 46,713-page Wintermute deployment.
User-facing: docs/concepts/minions.md, docs/cli/minions.md, CHANGELOG entry.
Unrelated bugfix: install security scan now allows pnpm workspace symlinks
(node_modules/@openclaw/* -> sibling extension dirs) by verifying the target
is a recognized workspace member path.
1. Fix double-spawn bug: insert tracking row as status='active' with lock
   (not 'waiting') so worker claim loop skips it. Only stall detection
   re-queues after crash.
2. Fix removeJob SQL: use ? placeholders instead of string interpolation.
3. Fix worker listener leak: use process.once + re-entry guard.
4. Fix resolveParentIfDone: include 'failed' in terminal child statuses
   so on_child_fail='ignore'/'continue' doesn't leave parent stuck.
…x shadow-sync dedup)

1. Wire maybeStartMinionWorker() into gateway early startup so the worker
   actually runs (Codex P1: worker defined but never called).
2. Fix shadow-sync dedup: use idempotency_key='task:{taskId}' instead of
   process-local Map so restarts don't create duplicate rows (Codex P2).
…r context, accepted-only tracking, migration handler names)
…switch defaults, doc format, floating promises)
…parent policy, remove lost-detection flag)

1. Change max_stalled default from 1 to 2 so first stall re-queues instead
   of dead-lettering (crash recovery actually works on first attempt).
2. handleTimeouts now applies on_child_fail parent policy after dead-lettering,
   preventing parents from getting stuck in waiting-children.
3. Remove minionsOwnLostDetection flag — keep the existing task-registry sweep
   running as a safety net alongside minions. Belt + suspenders.
@garrytan (Contributor Author)

Thanks I've never done this before. I will try to do that!

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b45d69976d


Comment on lines +363 to +366
```ts
`INSERT INTO minion_jobs (
   name, queue, status, data, lock_token, lock_until,
   idempotency_key, timeout_ms, created_at, updated_at, started_at, attempts_started
 ) VALUES (?, ?, 'active', ?, ?, ?, ?, ?, ?, ?, ?, 1)`,
```

P1: Stop tracking accepted spawns as expiring active jobs

In minions durability mode, this insert creates an active job with a 30s lock, but there is no corresponding heartbeat or completion path for the original live subagent run. Once lock_until expires, handleStalled() will treat it as orphaned and requeue it, so normal long-running (or even already-finished) spawns get retried as crash recovery and a duplicate subagent.spawn is launched without any gateway crash. This tracking row needs either lock renewal/terminal updates from the live run lifecycle or a non-claimable status until true orphan reconciliation.
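
One possible shape of the lock renewal this finding asks for (illustrative only, not code in this PR): the live spawn keeps its tracking row's lock fresh and marks it terminal when the run finishes, so stall detection only fires when the owning gateway actually died.

```ts
import { DatabaseSync } from "node:sqlite";

// Hypothetical heartbeat tied to the live run lifecycle. Interval and lock
// horizon values are assumptions.
function startTrackingHeartbeat(
  db: DatabaseSync,
  jobId: number,
  lockToken: string,
  intervalMs = 10_000,
): () => void {
  const timer = setInterval(() => {
    db.prepare(
      `UPDATE minion_jobs SET lock_until = ?, updated_at = ?
       WHERE id = ? AND lock_token = ? AND status = 'active'`,
    ).run(Date.now() + 3 * intervalMs, Date.now(), jobId, lockToken);
  }, intervalMs);
  timer.unref();

  // Call the returned function when the live subagent run completes or fails.
  return () => {
    clearInterval(timer);
    db.prepare(
      `UPDATE minion_jobs SET status = 'completed', lock_token = NULL, lock_until = NULL,
         finished_at = ?, updated_at = ?
       WHERE id = ? AND lock_token = ?`,
    ).run(Date.now(), Date.now(), jobId, lockToken);
  };
}
```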

@mbelinky (Contributor)

> Thanks I've never done this before. I will try to do that!

DM me on X or discord - happy to talk you through it. I can probably do the seam bit as it's a good addition to core anyway.

Contributor

We landed the first extraction step in #68886 (merge commit 0787266637fdee32afaf1ccaf1da383f68ca54c2).

That PR keeps the current Tasks behavior intact, but adds a narrow detached-task lifecycle seam in core and tests that callers route through it. That is the direction we want to build on.

Given that, I don’t think it makes sense to take this PR as-is. It still introduces a second core execution substrate in parallel with Tasks/TaskFlow, which is the main architectural concern we discussed.

The more robust path from here is:

  • keep Tasks/TaskFlow as the canonical core state model
  • build detached executor ownership on top of the new seam
  • move durability / recovery / queue-worker behavior into a follow-up implementation that can sit behind that seam rather than beside it

I’m going to close this PR in favor of that direction. If you want to keep pursuing the durability/recovery work, the next step should be a smaller follow-up rebased on current main and the newly landed seam, not a second competing substrate in core.

If there is a narrow slice in here that you think still needs to be salvaged immediately, feel free to open a follow-up against main and link back here.

@mbelinky mbelinky closed this Apr 19, 2026
@mbelinky mbelinky reopened this Apr 19, 2026
Comment thread src/minions/queue.ts
Comment on lines +827 to +848
```ts
function applyChildFailPolicy(
  db: DatabaseSync,
  child: MinionJob,
  errorText: string,
  now: number,
): void {
  const parentId = child.parentJobId!;
  const policy: ChildFailPolicy = child.onChildFail;

  if (policy === "fail_parent") {
    db.prepare(
      `UPDATE minion_jobs SET status = 'failed',
         error_text = ?, finished_at = ?, updated_at = ?
       WHERE id = ? AND status = 'waiting-children'`,
    ).run(`child job ${child.id} failed: ${errorText}`, now, now, parentId);
  } else if (policy === "remove_dep") {
    db.prepare(
      "UPDATE minion_jobs SET parent_job_id = NULL, updated_at = ? WHERE id = ?",
    ).run(now, child.id);
    resolveParentIfDone(db, parentId, now);
  }
}
```
Contributor

P1 ignore/continue child-fail policies never resolve the parent

applyChildFailPolicy calls resolveParentIfDone only for the remove_dep branch. For ignore and continue policies the function returns without doing anything, so the parent stays in 'waiting-children' permanently if all of its children end in 'failed' or 'dead'.

The previous thread's fix (adding 'failed' to the NOT IN clause of resolveParentIfDone) is necessary but not sufficient: it only matters when resolveParentIfDone is actually called. When the last child fails with ignore or continue policy, resolveParentIfDone is never reached, leaving the parent stuck forever.

```ts
// fix: add the missing branch
  } else if (policy === "ignore" || policy === "continue") {
    resolveParentIfDone(db, parentId, now);
  }
```

Contributor

The core contract work this needed is now landed in #68915 (bd3ad3436efde3ed2834c4d20ed80765f4b2cd9e).

What landed:

  • the detached runtime registration contract in core
  • cancel routed through the same detached-runtime seam
  • loader ownership/rollback handling around that registration

What did not land:

  • no durable queue
  • no retry/recovery/stall engine
  • no minions executor implementation in core

So this PR is not closed as superseded, but it should now be reworked on top of the landed core seam instead of carrying its own overlapping core substrate. The expected follow-up shape is: keep Tasks / TaskFlow as the canonical ledger in core, and move the durable executor mechanics outside core against the new contract.

@clawsweeper clawsweeper Bot commented Apr 30, 2026

Codex review: found issues before merge.

Summary
The PR adds a SQLite-backed src/minions job queue with gateway startup, CLI/protocol/docs, TaskRegistry shadow-sync, subagent spawn tracking, and a plugin SDK runtime export.

Reproducibility: Yes for the patch-level defects: the diff shows deterministic state-machine paths for expiring tracked spawns and stuck parent jobs. The broader durable-executor request is a design change rather than a single current-main bug reproduction.

Next step before merge
The remaining action is architectural/product/security review and likely a contributor rework or replacement plan, not a narrow automated repair on the current branch.

Security
Needs attention: The PR broadens plugin install symlink trust in an unrelated scanner change, which is a concrete supply-chain review concern.

Review findings

  • [P1] Keep tracked spawns non-expiring until lifecycle updates them — src/agents/tools/sessions-spawn-tool.ts:357
  • [P1] Resolve parents for ignored child failures — src/minions/queue.ts:827-848
  • [P2] Tighten workspace symlink trust in the scanner — src/plugins/install-security-scan.runtime.ts:160-193
Review details

Best possible solution:

Keep Tasks and Task Flow as the canonical ledger/orchestration layer, keep the landed detached-runtime registration seam, and rework durable queue mechanics as a smaller executor behind that contract while splitting scanner changes into their own PR.

Do we have a high-confidence way to reproduce the issue?

Yes for the patch-level defects: the diff shows deterministic state-machine paths for expiring tracked spawns and stuck parent jobs. The broader durable-executor request is a design change rather than a single current-main bug reproduction.

Is this the best way to solve the issue?

No. The branch introduces a second core substrate beside Tasks/Task Flow; the narrower maintainable path is an executor implementation behind the existing detached-runtime seam.

Full review comments:

  • [P1] Keep tracked spawns non-expiring until lifecycle updates them — src/agents/tools/sessions-spawn-tool.ts:357
    The tracking row is inserted as an active job with a 30s lock, but the original live subagent run does not renew that lock or complete the minion row. Once lock_until expires, normal long-running or already-finished spawns can be requeued as stalled and launched again without a gateway crash.
    Confidence: 0.87
  • [P1] Resolve parents for ignored child failures — src/minions/queue.ts:827-848
    applyChildFailPolicy only resolves parents for remove_dep. For ignore or continue, a parent whose remaining children all fail or die can stay in waiting-children forever because no later completion calls resolveParentIfDone.
    Confidence: 0.9
  • [P2] Tighten workspace symlink trust in the scanner — src/plugins/install-security-scan.runtime.ts:160-193
    The install-security scan change allows outside-install-root symlink targets when they sit under sibling workspace paths. That weakens the blocked-dependency scan for plugin installs and should be split or narrowed to exact trusted OpenClaw peer links with focused tests.
    Confidence: 0.78

Overall correctness: patch is incorrect
Overall confidence: 0.89

Security concerns:

  • [medium] Broad node_modules symlink trust — src/plugins/install-security-scan.runtime.ts:160
    The branch allows node_modules symlinks to resolve outside the install root when the target is a sibling workspace member. That can exempt outside-root code from the existing dependency-scan failure path and is unrelated to the minions feature.
    Confidence: 0.78

What I checked:

Likely related people:

  • mbelinky: Authored the landed detached lifecycle seam and plugin registration contract used as the expected direction for this work, and provided the architectural review guidance in this PR discussion. (role: recent maintainer and seam owner; confidence: high; commits: 0787266637fd, bd3ad3436efd; files: src/tasks/detached-task-runtime.ts, src/tasks/detached-task-runtime-contract.ts, src/plugins/types.ts)
  • vincentkoc: The PR discussion explicitly routed this to @vincentkoc as someone who worked on Tasks and Task Flow; local sampled history did not expose a stronger commit trail in this checkout. (role: adjacent Tasks/Task Flow reviewer; confidence: low; files: src/tasks/task-registry.ts, src/tasks/task-flow-registry.ts)
  • Peter Steinberger: Recent main history/blame shows broad maintenance/refactor touches across the current task/runtime files, making him a plausible follow-up router though not the feature introducer. (role: recent maintainer; confidence: low; commits: 07f523be4a2f; files: src/tasks/detached-task-runtime.ts, src/plugins/types.ts)

Remaining risk / open question:

  • The durable executor idea may still be valuable, but the accepted core/plugin split needs maintainer architecture judgment before more code is added.
  • The plugin install security-scan change is unrelated to durable task execution and should not ride along with this PR.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 89f73a5ef2b0.
