autopilot: drain-worker runs at concurrency=1, self-deadlocking any cycle phase that spawns a subagent (patterns, synthesize) #2050

@devty

Engine: Postgres (self-hosted Supabase)
Version: 0.42.38.0

Summary

gbrain autopilot spawns its drain-worker as a bare gbrain jobs work (no --concurrency), which defaults to concurrency 1 (src/commands/jobs.ts:82, ?? '1'; src/core/minions/worker.ts:200, ?? 1). The patterns and synthesize cycle phases each submit a child subagent job to the same queue and then block on it via waitForCompletion (src/core/cycle/patterns.ts:88,94, with a 35-min timeout; src/core/cycle/synthesize.ts:484).

At concurrency 1 the parent autopilot-cycle handler occupies the only in-flight slot — worker.ts:514 claims a new job only while inFlight.size < concurrency, and the parent stays in inFlight for the whole await waitForCompletion(...). So the child subagent can never be claimed by the same (only) worker → self-deadlock. The cycle then trips the 10-min per-job timeout; because the timed-out handler isn't actually cancelled (cf. #212), it becomes a zombie that keeps running. Repeated every interval, zombies accumulate and the queue wedges (hundreds of waiting jobs, 0 completions, dead-holder lock churn).
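The slot arithmetic above can be reproduced in a toy model. This is a minimal sketch, not gbrain's code: the `Job` type, the `simulate` function, and the tick-based loop are all made up for illustration. The only thing taken from the report is the claim rule (a new job is claimed only while `inFlight.size < concurrency`) and the fact that the parent stays in flight while it waits on its child.

```typescript
// Toy model of a single worker draining one queue. All names are illustrative,
// not gbrain's actual types or functions.
type Job = { id: string; spawns?: string; waitsFor?: string };

function simulate(
  concurrency: number,
  maxTicks = 20,
): { completed: string[]; waiting: string[] } {
  const waiting: Job[] = [{ id: "autopilot-cycle", spawns: "subagent" }];
  const inFlight = new Map<string, Job>();
  const completed = new Set<string>();

  for (let tick = 0; tick < maxTicks; tick++) {
    // Claim step: take a job only while a slot is free.
    while (inFlight.size < concurrency && waiting.length > 0) {
      const job = waiting.shift()!;
      inFlight.set(job.id, job);
      if (job.spawns) {
        // The parent submits its child to the SAME queue, then blocks on it.
        waiting.push({ id: job.spawns });
        job.waitsFor = job.spawns;
      }
    }
    // Progress step: a job finishes unless it is blocked on an unfinished child.
    for (const job of [...inFlight.values()]) {
      if (job.waitsFor && !completed.has(job.waitsFor)) continue;
      inFlight.delete(job.id);
      completed.add(job.id);
    }
    if (inFlight.size === 0 && waiting.length === 0) break;
  }
  return { completed: [...completed], waiting: waiting.map((j) => j.id) };
}

// With one slot the parent holds it forever and the child is never claimed.
console.log(simulate(1)); // completed: [], waiting: ["subagent"]
// With a spare slot the child runs, which unblocks the parent.
console.log(simulate(2)); // completed: ["subagent", "autopilot-cycle"], waiting: []
```

In the model, any concurrency of 2 or more is enough for a single parent/child pair; the report's workaround uses 4, matching the documented supervisor setting.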

cycle.ts actually documents the assumption ("synthesize/patterns renew the cycle-lock TTL during long waits") — i.e. the design expects a separate worker slot to process the subagent while the parent waits. Autopilot's bare concurrency-1 worker doesn't provide one.

Why this is under-reported

  • The documented worker is gbrain jobs supervisor --concurrency 4 (docs/guides/minions-deployment.md), which has spare slots → never deadlocks. Autopilot's built-in concurrency-1 worker is the outlier.
  • patterns only fires with ≥ dream.patterns.min_evidence (default 3) reflection pages; synthesize only with new transcripts. Many brains skip both, so the subagent is never submitted and the deadlock never triggers.
  • The population that hits it = {runs the autopilot daemon, not the supervisor} ∩ {has reflections / pending transcripts}.

Reproduction

  1. Postgres engine, gbrain autopilot running (default path → bare concurrency-1 worker).
  2. Ensure a subagent-spawning phase actually runs (≥3 reflection pages for patterns, or pending transcripts for synthesize).
  3. A cycle reaches [cycle.patterns] start and never logs done; the submitted subagent job stays waiting; the cycle force-evicts at the 10-min timeout; zombie handlers accumulate; the queue wedges.

Isolation that confirms the cause:

  • gbrain dream (runs the cycle inline, no separate worker) hangs at [cycle.patterns] until killed — the queued subagent has nothing to run it.
  • Running gbrain jobs work --concurrency 4 alongside, the same cycle reaches [cycle.patterns] done and the full cycle completes.

Workaround

export GBRAIN_WORKER_CONCURRENCY=4 in the environment that autopilot-run.sh sources — the autopilot-spawned worker inherits it (autopilot.ts:398 passes ...process.env). Verified: a patterns cycle completes instead of hanging. (Or disable the phases: gbrain config set dream.patterns.enabled false, likewise dream.synthesize.enabled false.)
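One way to picture why the environment variable takes effect is the sketch below. It is an assumption, not the code at src/commands/jobs.ts:82: the helper name `resolveConcurrency`, the flag-over-environment precedence, and the fallback for invalid values are all hypothetical. Only the variable name and the final `?? '1'` default come from the report.

```typescript
// Hypothetical helper: the real resolution logic in gbrain may differ.
// `env` is passed in explicitly so the sketch does not depend on process.env.
function resolveConcurrency(
  flag: string | undefined,
  env: Record<string, string | undefined>,
): number {
  // Assumed precedence: explicit flag, then environment, then the default of 1.
  const raw = flag ?? env["GBRAIN_WORKER_CONCURRENCY"] ?? "1";
  const n = Number.parseInt(raw, 10);
  // Treat anything that is not a positive integer as the default.
  return Number.isInteger(n) && n >= 1 ? n : 1;
}

// Bare worker with nothing set: a single slot.
console.log(resolveConcurrency(undefined, {})); // 1
// Workaround: the child worker inherits the exported variable.
console.log(resolveConcurrency(undefined, { GBRAIN_WORKER_CONCURRENCY: "4" })); // 4
```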

Suggested fix (any one)

Independently: the per-job timeout should actually cancel the handler (#212) so a deadlocked cycle dies cleanly rather than becoming a zombie.
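A timeout that actually cancels the handler could look roughly like the following. This is a sketch under assumptions, not a patch for #212: `runWithTimeout`, `Handler`, and `waitForChild` are hypothetical names, and it assumes handlers can be changed to accept and honor an `AbortSignal`. A plain `Promise.race` against a timer only stops waiting; the signal is what lets the handler stop working.

```typescript
// Hypothetical shapes: gbrain's real handler signature may differ.
type Handler = (signal: AbortSignal) => Promise<string>;

async function runWithTimeout(handler: Handler, timeoutMs: number): Promise<string> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => {
      // Reject first so the caller sees the timeout error, then tell the handler to stop.
      reject(new Error("job timed out"));
      controller.abort();
    }, timeoutMs);
  });
  try {
    return await Promise.race([handler(controller.signal), timeout]);
  } finally {
    clearTimeout(timer); // no stray timer if the handler finished in time
  }
}

// A cooperative stand-in for a blocked wait: it polls forever (the child never
// completes in this sketch) but releases its resources as soon as it is aborted.
const waitForChild: Handler = (signal) =>
  new Promise<string>((_resolve, reject) => {
    const poll = setInterval(() => { /* would check the child job's status here */ }, 5);
    signal.addEventListener(
      "abort",
      () => {
        clearInterval(poll); // without this the handler lingers as a zombie
        reject(new Error("aborted"));
      },
      { once: true },
    );
  });

runWithTimeout(waitForChild, 20).catch((err: Error) => {
  console.log(err.message); // "job timed out"
});
```

Because the abort listener clears the poll interval, nothing keeps running after the timeout. Dropping the `controller.abort()` call reproduces the zombie behavior in miniature: the race still rejects, but the interval keeps firing.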

Related
