
PGLite engine: gbrain dream --phase synthesize hangs indefinitely (no worker daemon to process queued subagent jobs) #1306

@buildingmvp

Summary

gbrain dream (full cycle or --phase synthesize alone) deterministically hangs at the start of [cycle.synthesize] on PGLite with zero apparent cause: idle main thread, no TCP sockets, no Anthropic API calls in flight, no child processes. Reproduced on v0.37.3.0 and v0.39.0.0 against both an established and a freshly-rebuilt brain.

This appears to be a regression in the v0.38.1 "provider-agnostic subagent loop" rebuild on PGLite engines specifically. The PGLite engine does not run a worker daemon (per gbrain jobs work error: "Worker daemon requires Postgres"), but the v0.38+ synthesize phase submits subagent jobs to the queue and waitForCompletion-polls them indefinitely. Without a worker, the jobs stay in waiting state and the cycle never advances.

Environment

  • OS: macOS 26.5 (Darwin 25.5.0), arm64
  • gbrain: 0.39.0.0
  • Bun: 1.3.14
  • Engine: PGLite (zero-config default from gbrain init --pglite)
  • Brain size: 7,178 pages / 20,089 chunks
  • Corpus: dream.synthesize.session_corpus_dir = ~/brain/raw-transcripts/ (5 .txt files for the minimal repro; same hang at 397 files)
  • Embedding: openai:text-embedding-3-large / 1536 dims
  • Models: models.dream.synthesize = anthropic:claude-sonnet-4-6, models.dream.patterns = anthropic:claude-haiku-4-5

Reproduction

# Fresh brain init (NOT a corrupted-WAL case — verified by full rebuild from #223 workaround)
gbrain init --pglite --path "$HOME/.gbrain/brain.pglite" --json
gbrain import ~/brain --no-embed       # 7178 pages, 20089 chunks, 0 errors
gbrain embed --stale                    # all chunks embedded

# Configure synthesize
gbrain config set dream.synthesize.session_corpus_dir ~/brain/raw-transcripts
gbrain config set models.dream.synthesize anthropic:claude-sonnet-4-6

# Trigger
gbrain dream --phase synthesize --dry-run --json
# Hangs indefinitely. Killed after 3 min via SIGKILL.

Observed process state during hang

$ ps -p 58579 -o pid,etime,%cpu,rss
  PID ELAPSED  %CPU    RSS
58579   03:06   0.5 460064

$ lsof -p 58579 -i        # zero entries — no network
(empty)

$ pgrep -P 58579           # zero entries — no children
(empty)

$ sample 58579 2 -mayDie
Call graph:
    1723 Thread_5344885   DispatchQueue_1: com.apple.main-thread  (serial)
    + 1723 start  (in dyld) + 6992  [0x18270be00]
    +   1723 ???  (in bun)  load address 0x100be0000 + 0x9bdd0c
# Main thread parked in kevent64 — classic Bun event-loop wait with nothing scheduled

Last log line before the hang, every time:

[cycle.synthesize] start
[dream] model "anthropic:claude-sonnet-4-6" is not in MODEL_CONTEXT_TOKENS; using 180000-token fallback budget. Set dream.synthesize.max_prompt_tokens to override.

No further output. Process consumes ~2GB RSS over time but does no work.

Root cause analysis

After tracing through src/core/cycle/synthesize.ts and src/core/minions/queue.ts:

  1. Synthesize fans out one subagent job per worth-processing transcript via MinionQueue.add() with allowProtectedSubmit: true.
  2. After submission, it calls waitForCompletion(queue, jobId, { timeoutMs: 35 * 60 * 1000, pollMs: 5_000 }) for each child.
  3. waitForCompletion polls gbrain_jobs.status for that id, expecting a worker to pick it up and transition it through running → completed / failed.
  4. On PGLite there is no worker. gbrain jobs work refuses to start with: "Error: Worker daemon requires Postgres. PGLite uses an exclusive file lock that blocks other processes."
  5. The submitted jobs sit at status waiting forever. The orchestrator polls each one every 5s until the 35-minute timeout, at which point the minion's TimeoutError fires and the status becomes timeout. With 5 transcripts that's nearly 3 hours of wall time (5 × 35 min = 175 min).
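
To make the timing in steps 2 to 5 concrete, here is a simplified synchronous model of the polling loop using a virtual clock. It is a sketch, not gbrain's actual waitForCompletion: simulateWait, statusAt, and the 12-second completion time are made up for illustration, and it assumes the five children are awaited one after another.

```typescript
// Simplified model of the polling described above. All names are
// illustrative; this is NOT gbrain's real waitForCompletion.
type JobStatus = "waiting" | "running" | "completed" | "failed" | "timeout";

interface WaitResult {
  status: JobStatus;
  elapsedMs: number;
  polls: number;
}

// statusAt(elapsedMs) stands in for reading gbrain_jobs.status at that moment.
function simulateWait(
  statusAt: (elapsedMs: number) => JobStatus,
  timeoutMs: number,
  pollMs: number,
): WaitResult {
  let polls = 0;
  for (let elapsed = 0; elapsed < timeoutMs; elapsed += pollMs) {
    polls++;
    const status = statusAt(elapsed);
    if (status === "completed" || status === "failed") {
      return { status, elapsedMs: elapsed, polls };
    }
  }
  // Nothing ever picked the job up, so the wait ends only at the timeout.
  return { status: "timeout", elapsedMs: timeoutMs, polls };
}

const TIMEOUT_MS = 35 * 60 * 1000; // values quoted in step 2
const POLL_MS = 5_000;

// With a worker (Postgres): the job completes after a hypothetical 12 s.
const withWorker = simulateWait(
  (t) => (t >= 12_000 ? "completed" : t >= 5_000 ? "running" : "waiting"),
  TIMEOUT_MS,
  POLL_MS,
);

// Without a worker (PGLite): the status never leaves "waiting".
const noWorker = simulateWait(() => "waiting", TIMEOUT_MS, POLL_MS);

// Worst-case wall time if five children are awaited sequentially.
const worstCaseMinutes = (5 * noWorker.elapsedMs) / 60_000;
```

In this model the no-worker case burns 420 polls per job before timing out, and five sequential jobs add up to 175 minutes, which matches the "nearly 3 hours" estimate.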

Confirmed by inspecting queue state during/after a hung run:

$ gbrain jobs list
  ID     Name           Status               Queue      Time     Created
  324    subagent       waiting              default    —        2026-05-22T15:57:06
  323    subagent       waiting              default    —        2026-05-22T15:57:06
  ... (324 stuck jobs from previous attempts) ...

$ gbrain jobs stats
  Queue health: 324 waiting, 0 active, 0 stalled

(I cancelled all 324 via gbrain jobs cancel; the queue stays clean until the next synthesize run repopulates it with new waiters.)

Why this wasn't caught earlier

  • Postgres users have a worker daemon (gbrain jobs work) running alongside the cycle, so the same code path works for them.
  • The PGLite engine's documentation rightly says it uses an exclusive file lock that prevents a separate worker — but the synthesize phase wasn't gated on engine type when v0.38+ moved the work into the minion queue.
  • gbrain doctor --fast doesn't flag this (passes with 90/100 on the broken setup).
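
As a rough idea of what a doctor check for this condition could look like, the sketch below flags a PGLite brain whose queue has waiting jobs and nothing active. The QueueStats and Finding shapes and the checkOrphanedJobs name are hypothetical and not part of gbrain's doctor; only the 324/0/0 figures come from the gbrain jobs stats output above.

```typescript
// Hypothetical shapes; gbrain's real doctor API is not shown in this issue.
interface QueueStats {
  waiting: number;
  active: number;
  stalled: number;
}

interface Finding {
  level: "ok" | "warn";
  message: string;
}

// On PGLite nothing can drain the queue, so waiting jobs with no active
// job are treated as orphaned.
function checkOrphanedJobs(
  engineKind: "pglite" | "postgres",
  stats: QueueStats,
): Finding {
  if (engineKind === "pglite" && stats.waiting > 0 && stats.active === 0) {
    return {
      level: "warn",
      message: `${stats.waiting} job(s) waiting on pglite, which has no worker daemon to run them`,
    };
  }
  return { level: "ok", message: "no orphaned jobs detected" };
}

// Queue state reported during the hang: 324 waiting, 0 active, 0 stalled.
const finding = checkOrphanedJobs("pglite", { waiting: 324, active: 0, stalled: 0 });
```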

Expected behavior

Either:

  1. Run subagent jobs inline on PGLite. Synthesize should detect engine.kind === 'pglite' and execute children synchronously in the orchestrator process instead of submitting to the queue. (This is what v0.37 did, and what gbrain jobs submit <name> --follow does today.)
  2. Skip the phase with a clear error. If the phase architecturally requires a worker, loadSynthConfig or the phase entrypoint should return failed('synthesize requires worker daemon (Postgres engine); current engine: pglite') so users see what's wrong instead of an indefinite hang.
  3. Provide a PGLite-compatible worker. Bracket the worker around the cycle: orchestrator releases the writer lock, worker takes it, processes one job, releases, orchestrator resumes. (Heavier surgery.)
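
Options 1 and 2 both come down to a gate on the engine type before anything is submitted to the queue. A minimal sketch of that decision follows, assuming hypothetical names (EngineKind, Strategy, chooseSynthesizeStrategy, inlineSupported); how gbrain actually exposes the engine kind or phase results may differ.

```typescript
// Hypothetical types: gbrain's real engine and phase-result shapes may differ.
type EngineKind = "pglite" | "postgres";

type Strategy =
  | { mode: "queue" }                  // submit to the minion queue, worker runs it
  | { mode: "inline" }                 // option 1: run children in the orchestrator
  | { mode: "fail"; reason: string };  // option 2: fail fast with a clear message

function chooseSynthesizeStrategy(
  engineKind: EngineKind,
  inlineSupported: boolean,
): Strategy {
  // Postgres has a worker daemon, so the queue path is safe.
  if (engineKind === "postgres") return { mode: "queue" };
  // PGLite: never enqueue, because nothing will ever dequeue.
  if (inlineSupported) return { mode: "inline" };
  return {
    mode: "fail",
    reason: `synthesize requires worker daemon (Postgres engine); current engine: ${engineKind}`,
  };
}

const inlinePlan = chooseSynthesizeStrategy("pglite", true);
const failPlan = chooseSynthesizeStrategy("pglite", false);
```

Either branch turns the indefinite hang into an immediate, visible outcome; the point of the gate is that a PGLite run never reaches the polling wait at all.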

Workarounds for users hitting this today

# Disable synthesize entirely (other cycle phases run cleanly)
gbrain config set dream.synthesize.enabled false

This bypasses the broken code path. Sync, embed, lint, backlinks, doctor all still run on the nightly cycle.

To force-release a stale gbrain-cycle lock left by a killed synthesize hang:

// Connect to ~/.gbrain/brain.pglite directly, run:
//   DELETE FROM gbrain_cycle_locks WHERE id = 'gbrain-cycle';
// Otherwise the 30-min TTL has to expire.

A gbrain doctor --release-stale-locks (or similar admin command) would help here.

Related

  • PGLite WASM crash on macOS 26.3 with Bun 1.3.11 #223. Not the same bug, but easy to confuse: rebuilding per the #223 workaround does NOT resolve the synthesize hang. I rebuilt with gbrain init --pglite and re-imported all 7178 pages cleanly, and the hang reproduces immediately on the fresh brain.
  • v0.38.1.0 commit message ("provider-agnostic subagent loop + remote MCP dispatch + budget meter") — likely where the regression entered.

Happy to provide additional diagnostics (full sample trace, jobs table dump, etc.) or test a patch.
