Environment
- gbrain 0.12.0
- Engine: Supabase Postgres + pgvector (Supavisor transaction pool on :6543)
- Brain size: ~31,295 pages, ~87K chunks, ~17K links, ~22 MB of timeline markdown
- launchd user agent (com.gbrain.autopilot), KeepAlive=true
The problem
MinionWorker fires abort.abort() when a job exceeds timeout_ms, but the autopilot-cycle handler does not observe the AbortSignal. In-flight async work (backlinks page loops, embed iteration, DB queries) continues to completion, which cascades:
- Job is marked dead with error_text: "timeout exceeded", but the worker keeps burning CPU and holding pool connections on the zombie handler.
- Queue concurrency is 1, so waiting jobs never get claimed while the zombie runs.
- Subsequent cycles queue up at the autopilot interval.
- Lock renewal emits "Lock lost for job N, aborting execution", but the "aborting execution" claim is misleading: the handler doesn't stop.
- After a worker restart, fresh workers claim the stacked waiting jobs and hit max stalled count exceeded (max_stalled=1 default), marking them dead on first re-claim.
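The core of the cascade can be reproduced in isolation. The sketch below is not gbrain code: `JobContext`, `ignoringHandler`, and `demo` are hypothetical names, and the 50-iteration loop stands in for the backlinks or embed pass. It only illustrates that calling `abort()` on an `AbortController` does nothing to a handler that never reads the signal.

```typescript
// Hypothetical handler context; the real MinionWorker context shape is not shown in this report.
type JobContext = { signal: AbortSignal };

// Yields to the event loop, standing in for a DB query or an embed call.
const tick = (): Promise<void> => new Promise((resolve) => setTimeout(resolve, 0));

// A handler that never looks at the signal it was given.
async function ignoringHandler(_ctx: JobContext, pages: number): Promise<number> {
  let processed = 0;
  for (let i = 0; i < pages; i++) {
    await tick();
    processed++;
  }
  return processed;
}

async function demo(): Promise<{ aborted: boolean; processed: number }> {
  const abort = new AbortController();
  const running = ignoringHandler({ signal: abort.signal }, 50);
  abort.abort(); // what the worker does when timeout_ms is exceeded
  const processed = await running; // resolves only after all 50 iterations have run
  return { aborted: abort.signal.aborted, processed };
}
```

The signal reports `aborted === true` while the loop still runs to the end, which is the "dead job, live handler" state described in the list above.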
Per-job timeout_ms = Math.max(baseInterval * 2 * 1000, 300_000) in src/commands/autopilot.ts:214. On a 5-minute --interval, that's 10 minutes. A full sync → extract → embed → backlinks pass on my brain takes 30–60+ minutes, so every cycle hits the budget and produces the above cascade.
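For reference, the budget arithmetic worked through. The formula is the one quoted above; the function name `perJobTimeoutMs` is only illustrative, and `baseInterval` is assumed to be in seconds, as the `* 1000` suggests.

```typescript
// Mirrors the quoted formula: twice the interval, with a 5-minute floor.
function perJobTimeoutMs(baseIntervalSec: number): number {
  return Math.max(baseIntervalSec * 2 * 1000, 300_000);
}

const fiveMinuteInterval = perJobTimeoutMs(300);    // 600_000 ms = 10 minutes
const thirtyMinuteInterval = perJobTimeoutMs(1800); // 3_600_000 ms = 60 minutes
const oneMinuteInterval = perJobTimeoutMs(60);      // 300_000 ms, the floor applies
```

The 600000 ms figure matches the per-job timeout reported in autopilot.err, and a 30-minute pass is already three times that budget.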
Repro
- Point autopilot at a brain with >20K pages and >10K links.
- Run gbrain autopilot --repo ... --interval 300.
- Watch ~/.gbrain/autopilot.err:
  Job N (autopilot-cycle) hit per-job timeout (600000ms), aborting
  Lock lost for job N, aborting execution
gbrain jobs stats accumulates dead jobs (timeout exceeded, then max stalled count exceeded) while ps shows the worker still in R state on a long-running backlinks query for 30+ minutes after the timeout.
What I tried
--interval 1800 → 60-min budget. Did not fix it, same cascade, just slower.
ALTER ROLE postgres SET statement_timeout = '120s' on Supabase. Irrelevant to this bug (addresses a different pooler issue).
Suggested direction
Plumb AbortSignal from MinionWorker.executeJob through to the cycle handler and its inner steps. At a minimum:
- Accept ctx.signal in the autopilot-cycle handler and its sub-functions (sync, extract, embed, backlinks).
- In per-page iterators (backlinks loop over all pages, embed stale-walk), check signal.aborted between iterations and throw early.
- For postgres.js queries, bind signal to the query so sql.cancel() fires.
- For OpenAI calls in embed, pass signal to fetch.
- For git pull / child_process, track the PID and send SIGTERM on abort.
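A minimal sketch of the per-page check, assuming the handler receives an `AbortSignal`. `forEachPage`, `AbortError`, and the `step` callback are hypothetical names, not existing gbrain functions; the real backlinks and embed loops would need the same check wherever they iterate.

```typescript
// Hypothetical error type so callers can tell a cancel apart from a real failure.
class AbortError extends Error {
  constructor(message = "job aborted") {
    super(message);
    this.name = "AbortError";
  }
}

// Runs `step` for each page, checking the signal before every iteration.
// Returns the number of pages processed if the loop finishes.
async function forEachPage<T>(
  pages: readonly T[],
  signal: AbortSignal,
  step: (page: T, signal: AbortSignal) => Promise<void>,
): Promise<number> {
  let done = 0;
  for (const page of pages) {
    if (signal.aborted) throw new AbortError(); // stop before starting more work
    await step(page, signal); // pass the signal down so inner I/O can observe it too
    done++;
  }
  return done;
}
```

The check only bounds how much extra work runs after the abort: one in-flight `step` at most. That is why the signal is also handed to `step`, so a long query or an HTTP call inside it (for example `fetch(url, { signal })`) can be cancelled rather than awaited to completion.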
Handler ignoring the cancel is the root cause; everything downstream (stalls, lock-loss cascades, orphan pool connections) is a symptom.
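For the child-process item in the list above, one possible shape is sketched here. `runWithAbort` and `ExitInfo` are hypothetical names; how gbrain actually spawns git is not shown in this report, so this assumes a plain `spawn` call.

```typescript
import { spawn } from "node:child_process";

interface ExitInfo {
  code: number | null;
  signal: NodeJS.Signals | null;
}

// Spawns a command and sends SIGTERM to it if the AbortSignal fires first.
function runWithAbort(command: string, args: string[], signal: AbortSignal): Promise<ExitInfo> {
  return new Promise((resolve, reject) => {
    if (signal.aborted) {
      reject(new Error("aborted before spawn"));
      return;
    }
    const child = spawn(command, args, { stdio: "ignore" });
    const onAbort = (): void => {
      child.kill("SIGTERM");
    };
    signal.addEventListener("abort", onAbort, { once: true });
    child.once("error", (err) => {
      signal.removeEventListener("abort", onAbort);
      reject(err);
    });
    child.once("exit", (code, exitSignal) => {
      signal.removeEventListener("abort", onAbort);
      resolve({ code, signal: exitSignal });
    });
  });
}
```

A caller would use it as `await runWithAbort("git", ["pull"], ctx.signal)` and treat a non-null `signal` in the result as a cancelled step. A process that ignores SIGTERM would need a follow-up SIGKILL after a grace period, which this sketch leaves out.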
Related
- … setTimeout on SIGTERM. Same class, different scope.
Happy to help
I can test a patched binary against my brain (31K pages) to verify the cascade goes away. My Supabase project has the v0.12 migrations applied.