
autopilot: cycle handler ignores AbortSignal, leaving zombie work after per-job timeout #212

@notjbg

Environment

  • gbrain 0.12.0
  • Engine: Supabase Postgres + pgvector (Supavisor transaction pool on :6543)
  • Brain size: ~31,295 pages, ~87K chunks, ~17K links, ~22MB of timeline markdown
  • launchd user agent (com.gbrain.autopilot), KeepAlive=true

The problem

MinionWorker fires abort.abort() when a job exceeds timeout_ms, but the autopilot-cycle handler does not observe the AbortSignal. In-flight async work (backlinks page loops, embed iteration, DB queries) continues to completion, which cascades:

  1. Job is marked dead with error_text: "timeout exceeded", but the worker keeps burning CPU and holding pool connections on the zombie handler.
  2. Queue concurrency is 1, so waiting jobs never get claimed while the zombie runs.
  3. Subsequent cycles queue up at the autopilot interval.
  4. Lock renewal emits "Lock lost for job N, aborting execution" — but the "aborting execution" claim is misleading; the handler doesn't stop.
  5. After a worker restart, fresh workers claim the stacked waiting jobs and hit max stalled count exceeded (max_stalled=1 default), marking them dead on first re-claim.

Per-job timeout_ms = Math.max(baseInterval * 2 * 1000, 300_000) in src/commands/autopilot.ts:214. On a 5-minute --interval, that's 10 minutes. A full sync → extract → embed → backlinks pass on my brain takes 30–60+ minutes, so every cycle hits the budget and produces the above cascade.
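To make the budget arithmetic concrete, here is a small sketch that mirrors the expression quoted above. The function name perJobTimeoutMs and the minutes helper are hypothetical and exist only for this illustration; the real code in autopilot.ts may be structured differently.

```typescript
// Mirrors the expression quoted from src/commands/autopilot.ts:214.
// `perJobTimeoutMs` is a hypothetical name used only for this illustration.
function perJobTimeoutMs(baseIntervalSeconds: number): number {
  return Math.max(baseIntervalSeconds * 2 * 1000, 300_000);
}

// Convert milliseconds to minutes for readability.
const minutes = (ms: number): number => ms / 60_000;

console.log(minutes(perJobTimeoutMs(300)));  // 10
console.log(minutes(perJobTimeoutMs(1800))); // 60
console.log(minutes(perJobTimeoutMs(60)));   // 5 (the 300_000 ms floor applies)
```

Both the 10-minute and the 60-minute budgets sit at or below the 30–60+ minute full pass reported here, which is consistent with every cycle timing out.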

Repro

  1. Point autopilot at a brain with >20K pages and >10K links.
  2. Run gbrain autopilot --repo ... --interval 300.
  3. Watch ~/.gbrain/autopilot.err:
    Job N (autopilot-cycle) hit per-job timeout (600000ms), aborting
    Lock lost for job N, aborting execution
    
  4. gbrain jobs stats accumulates dead jobs (timeout exceeded, then max stalled count exceeded) while ps shows the worker still in R state on a long-running backlinks query for 30+ minutes after the timeout.

What I tried

  • --interval 1800 → 60-min budget. Did not fix it: same cascade, just slower.
  • ALTER ROLE postgres SET statement_timeout = '120s' on Supabase. Irrelevant to this bug (addresses a different pooler issue).

Suggested direction

Plumb AbortSignal from MinionWorker.executeJob through to the cycle handler and its inner steps. At a minimum:

  • Accept ctx.signal in the autopilot-cycle handler and its sub-functions (sync, extract, embed, backlinks).
  • In per-page iterators (backlinks loop over all pages, embed stale-walk), check signal.aborted between iterations and throw early.
  • For postgres.js queries, bind signal to the query so that the query's cancel() fires on abort.
  • For OpenAI calls in embed, pass signal to fetch.
  • For git pull / child_process, track PID and SIGTERM on abort.
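A minimal sketch of the per-page check from the list above, assuming the handler receives a standard AbortSignal. forEachPage, processPage, and demo are hypothetical names; the real backlinks and embed loops in gbrain are not shown in this issue and will look different.

```typescript
// Hypothetical per-page iterator. `processPage` stands in for the real
// backlinks or embed work, which is not shown in this issue.
async function forEachPage<T>(
  pages: readonly T[],
  signal: AbortSignal,
  processPage: (page: T, signal: AbortSignal) => Promise<void>,
): Promise<number> {
  let done = 0;
  for (const page of pages) {
    // Stop before starting more work once the worker has aborted the job.
    if (signal.aborted) throw new Error("job aborted");
    await processPage(page, signal);
    done++;
  }
  return done;
}

// Simulates the worker's timeout firing while page 2 is being processed.
async function demo(): Promise<{ processed: number[]; error: string }> {
  const controller = new AbortController();
  const processed: number[] = [];
  let error = "";
  try {
    await forEachPage([1, 2, 3, 4, 5], controller.signal, async (page) => {
      processed.push(page);
      if (page === 2) controller.abort();
    });
  } catch (e) {
    error = (e as Error).message;
  }
  return { processed, error };
}

demo().then((r) => console.log(r.processed, r.error)); // [ 1, 2 ] job aborted
```

The check only bounds the overrun to one iteration. A single long query or HTTP call inside processPage still needs the signal passed down to it (fetch accepts a signal option), which is why the same signal is forwarded to the callback.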

The handler ignoring the abort is the root cause; everything downstream (stalls, lock-loss cascades, orphaned pool connections) is a symptom.

Related

Happy to help

I can test a patched binary against my brain (31K pages) to verify the cascade goes away. My Supabase project has the v0.12 migrations applied.
