Minions: long-running jobs (subagent/embed-backfill/autopilot-cycle) thrash on wall-clock timeout with broken attempt accounting, starving the queue

## Summary
The shell-job lane is rock solid (38/38 done, ~11s avg). The **long-running job lanes** (`subagent`, `embed-backfill`, `autopilot-cycle`) are not: jobs thrash and go DEAD without ever consuming an attempt, and other jobs pile up in `waiting` behind them.

## Evidence (observed on a live worker: `gbrain jobs work --concurrency 3 --queue default --max-rss 16384`)

`jobs get` on dead jobs:
```
Job #11385: autopilot-cycle (DEAD after 0 attempts)
  Attempts: 0/2 (started: 3)
  Error: wall-clock timeout exceeded

Job #11383: embed-backfill (DEAD after 0 attempts)
  Attempts: 0/3 (started: 1)
  Error: wall-clock timeout exceeded
  Data: {"reason":"sync_all","sourceId":"straylight-brain","batchSize":500}
```

The smoking gun is **`Attempts: 0/2 (started: 3)`**: the job was *started* 3 times but the attempt counter never incremented. The apparent pattern: a worker claims the job (`started++`), the job runs past the wall-clock budget, the lease is killed mid-flight, and the **attempt is never recorded as consumed**. The job is then reclaimed and re-run until it exhausts the `started` ceiling and goes DEAD. Meanwhile a freshly submitted `subagent` job (#11407) sat at `Attempts: 0/3 (started: 0)`, never claimed, because the thrashing long jobs monopolize the 3 concurrency slots.
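To make the suspected pattern concrete, here is a toy model in Python. It is not gbrain code: `Job`, `run_until_dead`, and the `started_ceiling` value of 3 are assumptions made up for illustration, chosen only so the buggy branch reproduces the `0/2 (started: 3)` reading above.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    max_attempts: int
    attempts: int = 0      # attempts recorded as consumed
    started: int = 0       # number of times a worker claimed the job
    state: str = "waiting"


def run_until_dead(job: Job, started_ceiling: int, count_timeout_as_attempt: bool) -> Job:
    """Claim the job repeatedly; every run is assumed to hit the wall-clock timeout."""
    while job.state != "dead":
        job.started += 1                   # worker claims the job
        if count_timeout_as_attempt:
            job.attempts += 1              # expected: the timeout consumes an attempt
        exhausted = (
            job.attempts >= job.max_attempts
            or job.started >= started_ceiling   # hypothetical ceiling on claims
        )
        # Not exhausted means the job returns to "waiting" and is reclaimed.
        job.state = "dead" if exhausted else "waiting"
    return job


buggy = run_until_dead(Job("autopilot-cycle", max_attempts=2), 3, False)
fixed = run_until_dead(Job("autopilot-cycle", max_attempts=2), 3, True)
print(f"buggy: Attempts: {buggy.attempts}/{buggy.max_attempts} (started: {buggy.started})")
print(f"fixed: Attempts: {fixed.attempts}/{fixed.max_attempts} (started: {fixed.started})")
```

In this model the buggy branch ends DEAD with zero recorded attempts, while the other branch ends DEAD with `attempts == started`, which is the invariant requested under Expected.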

## Impact
- The subagent (LLM) lane is effectively unusable for batch fan-out: jobs either never get claimed or die on wall-clock without making progress.
- Queue starvation: 5+ jobs stuck `waiting`, oldest 5 days old (`protected` #10506 from 2026-05-27).
- The doctor hint exists (`no subagent jobs completed — cap may be too tight; export GBRAIN_ANTHROPIC_MAX_INFLIGHT=64`) but the root issue is lease/attempt accounting, not just the inflight cap.

## Expected
- A job killed by wall-clock timeout should **increment attempts** (it consumed an attempt), not leave `attempts:0 / started:N`.
- `started >> attempts` should be impossible, or surfaced as a distinct "lease-lost / reclaim thrash" state instead of silently looping to DEAD.
- Long-running handlers should get a per-handler wall-clock budget distinct from short jobs, OR a heartbeat/lease-renew so an actively-progressing LLM job isn't reaped.
- Concurrency accounting should not let stalled long jobs permanently starve newly-submitted ones (fair scheduling / separate queue or slot reservation for long lanes).
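A rough sketch of how the last two points could fit together, in Python for brevity. Everything here is hypothetical: the budget table, `LEASE_TTL_S`, `PER_LANE_CAP`, and both function names are illustrative assumptions, not existing gbrain settings or APIs.

```python
from dataclasses import dataclass

# Hypothetical per-handler wall-clock budgets in seconds (illustrative values).
WALL_CLOCK_BUDGET_S = {
    "subagent": 1800,
    "embed-backfill": 3600,
    "autopilot-cycle": 1800,
}
DEFAULT_BUDGET_S = 60    # assumed budget for short jobs such as shell
LEASE_TTL_S = 120        # assumed: a lease expires without a heartbeat in this window
PER_LANE_CAP = 1         # assumed: max concurrent jobs per handler type


@dataclass
class Lease:
    handler: str
    claimed_at: float       # seconds, any monotonic clock
    last_heartbeat: float   # updated by the handler while it makes progress


def should_reap(lease: Lease, now: float) -> bool:
    """Reap only if the handler's own budget is spent or heartbeats stopped."""
    budget = WALL_CLOCK_BUDGET_S.get(lease.handler, DEFAULT_BUDGET_S)
    over_budget = now - lease.claimed_at > budget
    stalled = now - lease.last_heartbeat > LEASE_TTL_S
    return over_budget or stalled


def can_claim(handler: str, running: list[str], concurrency: int = 3) -> bool:
    """Refuse a claim if all slots are busy or this lane already holds its share."""
    if len(running) >= concurrency:
        return False
    return running.count(handler) < PER_LANE_CAP
```

With a per-lane cap of 1 and `--concurrency 3`, one `embed-backfill` and one `autopilot-cycle` job would still leave a slot for a new `subagent` job. A reaped lease would additionally need to increment `attempts`, per the first bullet.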

## Repro
1. Submit a `subagent` or `embed-backfill --reason sync_all` job that runs longer than the wall-clock budget.
2. `gbrain jobs get <id>` -> observe `Attempts: 0/N (started: >1)` then eventual DEAD on `wall-clock timeout exceeded`.
3. Submit a fresh `subagent` job alongside -> observe it never leaves `waiting` (`started: 0`).
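To check step 2 across many jobs, a small Python helper can scan saved `gbrain jobs get` output for the `started > attempts` signature. This is a sketch that assumes the text format matches the two samples under Evidence; `find_thrashing` is a made-up name, and the format of other job states has not been verified.

```python
import re

# Patterns inferred from the two sample outputs in this report (an assumption).
HEADER = re.compile(r"^Job #(?P<id>\d+): (?P<name>\S+)")
ATTEMPTS = re.compile(
    r"^\s*Attempts: (?P<attempts>\d+)/(?P<max>\d+) \(started: (?P<started>\d+)\)"
)


def find_thrashing(text: str) -> list[dict]:
    """Return jobs whose started count exceeds their recorded attempts."""
    jobs, current = [], None
    for line in text.splitlines():
        if m := HEADER.match(line):
            current = {"id": int(m["id"]), "name": m["name"]}
        elif current is not None and (m := ATTEMPTS.match(line)):
            current.update(
                attempts=int(m["attempts"]),
                max=int(m["max"]),
                started=int(m["started"]),
            )
            jobs.append(current)
            current = None
    return [j for j in jobs if j["started"] > j["attempts"]]


sample = """\
Job #11385: autopilot-cycle (DEAD after 0 attempts)
  Attempts: 0/2 (started: 3)
  Error: wall-clock timeout exceeded

Job #11383: embed-backfill (DEAD after 0 attempts)
  Attempts: 0/3 (started: 1)
  Error: wall-clock timeout exceeded
"""

for job in find_thrashing(sample):
    print(job["id"], job["name"], f'{job["attempts"]}/{job["max"]}', job["started"])
```

A job that is currently running may also show `started > attempts`, depending on when the counter is meant to increment, so treat matches on non-DEAD jobs as candidates rather than confirmed thrash.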

## Env
- worker: `jobs work --concurrency 3 --queue default --max-rss 16384`
- PgBouncer transaction-mode (port 6543), prepared statements disabled
- `GBRAIN_ANTHROPIC_MAX_INFLIGHT` unset
