Engine: Postgres (self-hosted Supabase)
Version: 0.42.38.0
Summary
gbrain autopilot spawns its drain-worker as a bare gbrain jobs work (no --concurrency), which defaults to concurrency 1 (src/commands/jobs.ts:82 → ?? '1'; src/core/minions/worker.ts:200 → ?? 1). The patterns and synthesize cycle phases submit a child subagent job to the same queue and then block on it via waitForCompletion (src/core/cycle/patterns.ts:88,94 — 35-min timeout; src/core/cycle/synthesize.ts:484).
At concurrency 1 the parent autopilot-cycle handler occupies the only in-flight slot — worker.ts:514 claims a new job only while inFlight.size < concurrency, and the parent stays in inFlight for the whole await waitForCompletion(...). So the child subagent can never be claimed by the same (only) worker → self-deadlock. The cycle then trips the 10-min per-job timeout; because the timed-out handler isn't actually cancelled (cf. #212), it becomes a zombie that keeps running. Repeated every interval, zombies accumulate and the queue wedges (hundreds of waiting jobs, 0 completions, dead-holder lock churn).
cycle.ts actually documents the assumption ("synthesize/patterns renew the cycle-lock TTL during long waits") — i.e. the design expects a separate worker slot to process the subagent while the parent waits. Autopilot's bare concurrency-1 worker doesn't provide one.
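The slot arithmetic described above can be reproduced in a toy model. This is a minimal sketch, not gbrain's worker code: simulate, Job, SimResult and the tick loop are hypothetical names, and the only behavior taken from the report is the claim gate (a new job is claimed only while inFlight.size < concurrency) and the parent blocking on a child it submitted to the same queue.

```typescript
// Toy model of one worker draining one queue. Illustrative only.
type Job = { id: number; name: string; waitsOn?: number };

interface SimResult {
  completed: string[];
  stuckWaiting: string[];
  stuckInFlight: string[];
}

function simulate(concurrency: number, maxTicks = 100): SimResult {
  let nextId = 1;
  const waiting: Job[] = [{ id: nextId++, name: "autopilot-cycle" }];
  const inFlight = new Map<number, Job>();
  const done = new Set<number>();
  const completed: string[] = [];

  for (let tick = 0; tick < maxTicks; tick++) {
    // Claim gate: only claim while a slot is free.
    while (inFlight.size < concurrency && waiting.length > 0) {
      const job = waiting.shift()!;
      inFlight.set(job.id, job);
      if (job.name === "autopilot-cycle") {
        // The parent submits a child to the SAME queue and then blocks on it.
        const child: Job = { id: nextId++, name: "subagent" };
        waiting.push(child);
        job.waitsOn = child.id;
      }
    }
    // A job finishes once it waits on nothing, or its child has finished.
    for (const job of Array.from(inFlight.values())) {
      if (job.waitsOn === undefined || done.has(job.waitsOn)) {
        inFlight.delete(job.id);
        done.add(job.id);
        completed.push(job.name);
      }
    }
    if (inFlight.size === 0 && waiting.length === 0) break;
  }

  return {
    completed,
    stuckWaiting: waiting.map((j) => j.name),
    stuckInFlight: Array.from(inFlight.values()).map((j) => j.name),
  };
}
```

In this model simulate(1) never completes anything: the parent holds the only slot and the subagent stays in waiting. simulate(2) completes the subagent first and then the parent.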
Why this is under-reported
- The documented worker is gbrain jobs supervisor --concurrency 4 (docs/guides/minions-deployment.md), which has spare slots → never deadlocks. Autopilot's built-in concurrency-1 worker is the outlier.
- patterns only fires with ≥ dream.patterns.min_evidence (default 3) reflection pages; synthesize only with new transcripts. Many brains skip both, so the subagent is never submitted and the deadlock never triggers.
- The population that hits it = {runs the autopilot daemon, not the supervisor} ∩ {has reflections / pending transcripts}.
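The intersection above can be written as a predicate. This is only a sketch of the reasoning: BrainState and its fields are hypothetical names, not gbrain config or API, and it assumes the concurrency-1 worker is the only consumer of the queue.

```typescript
// Hypothetical snapshot of a deployment; field names are illustrative only.
interface BrainState {
  workerConcurrency: number;  // 1 for the bare autopilot-spawned worker
  reflectionPages: number;
  minEvidence: number;        // dream.patterns.min_evidence, default 3 per the report
  pendingTranscripts: number;
}

// True when at least one phase would submit a child subagent job.
function submitsSubagent(s: BrainState): boolean {
  const patternsFires = s.reflectionPages >= s.minEvidence;
  const synthesizeFires = s.pendingTranscripts > 0;
  return patternsFires || synthesizeFires;
}

// Exposed = single-slot worker AND a phase that blocks on a queued child.
function isExposed(s: BrainState): boolean {
  return s.workerConcurrency === 1 && submitsSubagent(s);
}
```

A brain with no reflections and no pending transcripts is not exposed even at concurrency 1, which matches the explanation for why many installs never see the hang.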
Reproduction
- Postgres engine, gbrain autopilot running (default path → bare concurrency-1 worker).
- Ensure a subagent-spawning phase actually runs (≥3 reflection pages for patterns, or pending transcripts for synthesize).
- A cycle reaches [cycle.patterns] start and never logs done; the submitted subagent job stays waiting; the cycle force-evicts at the 10-min timeout; zombie handlers accumulate; the queue wedges.
Isolation that confirms the cause:
- gbrain dream (runs the cycle inline, no separate worker) hangs at [cycle.patterns] until killed: nothing is available to run the queued subagent.
- With gbrain jobs work --concurrency 4 running alongside, the same cycle reaches [cycle.patterns] done and the full cycle completes.
Workaround
export GBRAIN_WORKER_CONCURRENCY=4 in the environment that autopilot-run.sh sources — the autopilot-spawned worker inherits it (autopilot.ts:398 passes ...process.env). Verified: a patterns cycle completes instead of hanging. (Or disable the phases: gbrain config set dream.patterns.enabled false, likewise dream.synthesize.enabled false.)
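Why the exported variable helps can be sketched as a fallback chain. The report confirms only two things: the flag defaults to '1', and a worker that inherits GBRAIN_WORKER_CONCURRENCY=4 no longer hangs. The precedence below (explicit flag, then environment variable, then default) and the name resolveConcurrency are assumptions for illustration, not a copy of src/commands/jobs.ts.

```typescript
// Assumed precedence: --concurrency flag, then env var, then the '1' default.
function resolveConcurrency(
  flag: string | undefined,
  env: Record<string, string | undefined>,
): number {
  const raw = flag ?? env.GBRAIN_WORKER_CONCURRENCY ?? "1";
  const n = Number.parseInt(raw, 10);
  // Fall back to 1 on garbage or non-positive input.
  return Number.isInteger(n) && n >= 1 ? n : 1;
}
```

Under that assumption a bare gbrain jobs work with no flag resolves to 1 in a clean environment and to 4 once the variable is exported in the parent, since the child receives the parent's environment.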
Suggested fix (any one)
- gbrain autopilot launches its worker at --concurrency ≥2 by default: it submits subagent-spawning cycles, so a concurrency-1 worker can never satisfy them.
- Run the patterns/synthesize subagent inline within the phase (this also fixes the PGLite case in #1306, where a 2nd worker process can't exist at all).
- [...] waitForCompletion).
Independently: the per-job timeout should actually cancel the handler (#212) so a deadlocked cycle dies cleanly rather than becoming a zombie.
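One generic way to make a timeout actually stop a handler is cooperative cancellation with AbortController. The sketch below is not the change proposed in #212 and does not reflect gbrain's real handler signature; runWithTimeout, waitForChild and the Handler type are hypothetical names, and waitForChild is only a stand-in for a polling wait on a child job.

```typescript
// Hypothetical handler shape: receives a signal it is expected to honor.
type Handler<T> = (signal: AbortSignal) => Promise<T>;

function runWithTimeout<T>(handler: Handler<T>, timeoutMs: number): Promise<T> {
  const controller = new AbortController();
  const timeoutError = new Error(`job timed out after ${timeoutMs} ms`);

  // Rejects as soon as the signal fires, so the caller is released on time.
  const timedOut = new Promise<never>((_resolve, reject) => {
    controller.signal.addEventListener("abort", () => reject(timeoutError), { once: true });
  });

  const timer = setTimeout(() => controller.abort(), timeoutMs);
  return Promise.race([handler(controller.signal), timedOut]).finally(() =>
    clearTimeout(timer),
  );
}

// Stand-in for a polling wait on a child job; stops polling when aborted.
function waitForChild(
  signal: AbortSignal,
  pollMs: number,
  isDone: () => boolean,
): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    const poll = setInterval(() => {
      if (isDone()) {
        clearInterval(poll);
        resolve();
      }
    }, pollMs);
    signal.addEventListener(
      "abort",
      () => {
        clearInterval(poll); // without this the wait would live on as a zombie
        reject(new Error("aborted"));
      },
      { once: true },
    );
  });
}
```

Cancellation here is cooperative: a handler that ignores the signal is released from the caller's point of view but keeps running, which is the zombie behavior the report describes.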
Related
- #1306 (PGLite): dream --phase synthesize hangs indefinitely (no worker daemon to process the queued subagent). Same shape; file-lock variant.
- [...] jobs supervisor production deployment shape.
- [...] AI_MissingToolResultsError on gateway-loop replay. Surfaces after the deadlock is removed: with a free slot the subagent runs but currently fails this way, so patterns/synthesize complete with 0 pages until that lands.