autopilot: cycle handler ignores AbortSignal, leaving zombie work after per-job timeout

## Environment

- `gbrain 0.12.0`
- Engine: Supabase Postgres + pgvector (Supavisor transaction pool on `:6543`)
- Brain size: ~31,295 pages, ~87K chunks, ~17K links, ~22MB of timeline markdown
- launchd user agent (`com.gbrain.autopilot`), `KeepAlive=true`

## The problem

`MinionWorker` fires `abort.abort()` when a job exceeds `timeout_ms`, but the `autopilot-cycle` handler does not observe the `AbortSignal`. In-flight async work (backlinks page loops, embed iteration, DB queries) continues to completion, which cascades:

1. The job is marked `dead` with `error_text: "timeout exceeded"`, but the worker keeps burning CPU and holding pool connections on the zombie handler.
2. Queue concurrency is 1, so waiting jobs never get claimed while the zombie runs.
3. Subsequent cycles queue up at the autopilot interval.
4. Lock renewal emits `"Lock lost for job N, aborting execution"`, but the "aborting execution" claim is misleading: the handler doesn't stop.
5. After a worker restart, fresh workers claim the stacked waiting jobs and hit `max stalled count exceeded` (`max_stalled=1` default), marking them `dead` on first re-claim.
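
To illustrate the core of the cascade, here is a minimal sketch (not gbrain code; `deafHandler`, `sleep`, and `demo` are hypothetical stand-ins) showing that calling `abort()` on an `AbortController` does nothing to an async function that never inspects the signal:

```typescript
// Hypothetical stand-ins; not gbrain's real handler or worker.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// A handler that receives a signal but never looks at it.
async function deafHandler(_signal: AbortSignal, pages: number, log: number[]): Promise<void> {
  for (let i = 0; i < pages; i++) {
    await sleep(5); // stands in for a per-page DB query
    log.push(i);
  }
}

async function demo(): Promise<{ processedAtAbort: number; processedAtEnd: number }> {
  const controller = new AbortController();
  const log: number[] = [];
  const work = deafHandler(controller.signal, 10, log);

  await sleep(12);
  controller.abort(); // the "timeout" fires here
  const processedAtAbort = log.length;

  await work; // the handler still runs every remaining iteration
  return { processedAtAbort, processedAtEnd: log.length };
}

demo().then((result) => console.log(result));
```

`processedAtEnd` is always 10 here: the abort only flips a flag, and nothing stops the loop unless the handler checks it.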

The per-job timeout is set in `src/commands/autopilot.ts:214` as `timeout_ms = Math.max(baseInterval * 2 * 1000, 300_000)`. With a 5-minute `--interval` (300 seconds), that's 600,000 ms, or 10 minutes. A full `sync → extract → embed → backlinks` pass on my brain takes 30–60+ minutes, so every cycle exceeds the budget and produces the cascade above.
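
As a quick check of that arithmetic, the sketch below evaluates the same expression for a few intervals. The function name `perJobTimeoutMs` is made up for illustration, and it assumes `baseInterval` is in seconds, which matches the 600000 ms figure in the log output for `--interval 300`:

```typescript
// Mirrors the expression quoted from autopilot.ts; the function name is hypothetical.
// Assumes the interval is given in seconds.
function perJobTimeoutMs(baseIntervalSec: number): number {
  return Math.max(baseIntervalSec * 2 * 1000, 300_000);
}

console.log(perJobTimeoutMs(300));  // 600000  (10 minutes)
console.log(perJobTimeoutMs(1800)); // 3600000 (60 minutes)
console.log(perJobTimeoutMs(60));   // 300000  (the 5-minute floor applies)
```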

## Repro

1. Point autopilot at a brain with >20K pages and >10K links.
2. Run `gbrain autopilot --repo ... --interval 300`.
3. Watch `~/.gbrain/autopilot.err`:
   ```
   Job N (autopilot-cycle) hit per-job timeout (600000ms), aborting
   Lock lost for job N, aborting execution
   ```
4. `gbrain jobs stats` accumulates `dead` jobs (`timeout exceeded`, then `max stalled count exceeded`) while `ps` shows the worker still in `R` state on a long-running backlinks query for 30+ minutes after the timeout.

## What I tried

- `--interval 1800`, which gives a 60-minute budget. It did not fix the problem: same cascade, just slower.
- `ALTER ROLE postgres SET statement_timeout = '120s'` on Supabase. Irrelevant to this bug (addresses a different pooler issue).

## Suggested direction

Plumb `AbortSignal` from `MinionWorker.executeJob` through to the cycle handler and its inner steps. At a minimum:

- Accept `ctx.signal` in the `autopilot-cycle` handler and its sub-functions (`sync`, `extract`, `embed`, `backlinks`).
- In per-page iterators (backlinks loop over all pages, embed stale-walk), check `signal.aborted` between iterations and throw early.
- For `postgres.js` queries, tie `signal` to the pending query so that the query's `.cancel()` fires on abort.
- For OpenAI calls in embed, pass `signal` to `fetch`.
- For `git pull` / child_process, track the child PID and send SIGTERM on abort.
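
The per-page check from the list above could look roughly like this. It is a sketch only: `throwIfAborted`, `forEachPage`, and `demoLoop` are hypothetical names, not functions from the gbrain codebase, and the real iterators may be shaped differently.

```typescript
// Hypothetical helper: none of these names come from the gbrain codebase.
function throwIfAborted(signal: AbortSignal): void {
  if (signal.aborted) {
    const err = new Error("job aborted");
    err.name = "AbortError";
    throw err;
  }
}

// Runs `step` for each page, checking the signal before every iteration.
async function forEachPage<T>(
  pages: readonly T[],
  signal: AbortSignal,
  step: (page: T, signal: AbortSignal) => Promise<void>,
): Promise<void> {
  for (const page of pages) {
    throwIfAborted(signal);
    await step(page, signal);
  }
}

// Usage: abort while handling the third page; pages four and five are never touched.
async function demoLoop(): Promise<{ seen: number[]; error: string }> {
  const controller = new AbortController();
  const seen: number[] = [];
  let error = "";
  try {
    await forEachPage([1, 2, 3, 4, 5], controller.signal, async (page) => {
      seen.push(page);
      if (page === 3) controller.abort();
    });
  } catch (e) {
    error = (e as Error).name;
  }
  return { seen, error };
}

demoLoop().then((r) => console.log(r)); // { seen: [ 1, 2, 3 ], error: 'AbortError' }
```

Checking between iterations bounds the zombie work to at most one in-flight step, which is why the remaining bullets (queries, `fetch`, child processes) matter for steps that are themselves long-running.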

The handler ignoring cancellation is the root cause; everything downstream (stalls, lock-loss cascades, orphan pool connections) is a symptom.
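
For the query bullet in the list above, one possible shape is a small adapter that forwards an abort to anything exposing `cancel()`, which pending `postgres.js` queries do. `Cancellable` and `cancelOnAbort` are hypothetical names, and the fake query below only stands in for a real one. Whether a cancel request reaches the right backend through a transaction-mode pooler is not something this sketch addresses.

```typescript
// Hypothetical adapter; only the `cancel()` method is assumed of the query object.
interface Cancellable {
  cancel(): void;
}

// Forwards an abort to the query's cancel(), then hands the query back for awaiting.
function cancelOnAbort<Q extends Cancellable>(signal: AbortSignal, query: Q): Q {
  if (signal.aborted) {
    query.cancel(); // already aborted: cancel immediately
  } else {
    signal.addEventListener("abort", () => query.cancel(), { once: true });
  }
  return query;
}

// Usage with a fake query that just counts cancel() calls.
const fakeQuery = {
  cancelled: 0,
  cancel() {
    this.cancelled++;
  },
};

const queryController = new AbortController();
cancelOnAbort(queryController.signal, fakeQuery);
queryController.abort();
console.log(fakeQuery.cancelled); // 1
```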

## Related

- #204 — outer sleep loop also has non-cancellable `setTimeout` on SIGTERM. Same class, different scope.

## Happy to help

I can test a patched binary against my brain (31K pages) to verify the cascade goes away. My Supabase project has the v0.12 migrations applied.
