Summary
The shell-job lane is rock solid (38/38 done, ~11s avg). But the long-running job lanes (subagent, embed-backfill, autopilot-cycle) thrash and die without ever consuming an attempt, and newly submitted jobs pile up in waiting behind them.
Evidence (observed on a live worker: gbrain jobs work --concurrency 3 --queue default --max-rss 16384)
jobs get on dead jobs:
Job #11385: autopilot-cycle (DEAD after 0 attempts)
Attempts: 0/2 (started: 3)
Error: wall-clock timeout exceeded
Job #11383: embed-backfill (DEAD after 0 attempts)
Attempts: 0/3 (started: 1)
Error: wall-clock timeout exceeded
Data: {"reason":"sync_all","sourceId":"straylight-brain","batchSize":500}
The smoking gun is Attempts: 0/2 (started: 3) — the job was started 3 times but the attempt counter never incremented. Pattern: worker claims the job (started++), the job runs long and exceeds the wall-clock, the lease is killed mid-flight, and the attempt is never recorded as consumed, so it gets reclaimed and re-run until it exhausts the started ceiling -> DEAD. Meanwhile a freshly-submitted subagent job (#11407) sat at Attempts: 0/3 (started: 0) — never claimed, because the thrashing long jobs monopolize the 3 concurrency slots.
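The pattern above can be modeled in a few lines. This is only a sketch of the suspected accounting, not gbrain's actual code: `Job`, `run_until_dead`, `started_ceiling`, and `charge_timeout` are hypothetical names, and the ceiling of 3 is an assumption chosen to match job #11385 (it does not explain #11383, which died at started: 1).

```python
from dataclasses import dataclass


@dataclass
class Job:
    max_attempts: int
    attempts: int = 0   # consumed attempts, as shown in "Attempts: A/N"
    started: int = 0    # number of times a worker claimed the job
    state: str = "waiting"


def run_until_dead(job, started_ceiling, charge_timeout):
    """Claim the job repeatedly, assuming every run hits the wall-clock timeout.

    charge_timeout=False models the observed behavior (timeout does not
    consume an attempt); charge_timeout=True models the expected behavior.
    """
    while job.state != "dead":
        job.started += 1                  # worker claims the job
        if charge_timeout:
            job.attempts += 1             # the timed-out run counts as an attempt
        # The job dies when either counter reaches its (assumed) limit.
        if job.attempts >= job.max_attempts or job.started >= started_ceiling:
            job.state = "dead"
    return job


observed = run_until_dead(Job(max_attempts=2), started_ceiling=3, charge_timeout=False)
expected = run_until_dead(Job(max_attempts=2), started_ceiling=3, charge_timeout=True)
print(f"observed: Attempts: {observed.attempts}/{observed.max_attempts} (started: {observed.started})")
print(f"expected: Attempts: {expected.attempts}/{expected.max_attempts} (started: {expected.started})")
```

With the timeout uncharged, the model ends at Attempts: 0/2 (started: 3), the same shape as #11385. With the timeout charged, it ends at Attempts: 2/2 (started: 2).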
Impact
- The subagent (LLM) lane is effectively unusable for batch fan-out: jobs either never get claimed or die on wall-clock without making progress.
- Queue starvation: 5+ jobs stuck waiting, oldest 5 days old (protected #10506 from 2026-05-27).
- The doctor hint exists ("no subagent jobs completed — cap may be too tight; export GBRAIN_ANTHROPIC_MAX_INFLIGHT=64") but the root issue is lease/attempt accounting, not just the inflight cap.
Expected
- A job killed by wall-clock timeout should increment attempts (it consumed an attempt), not leave attempts:0 / started:N.
- started >> attempts should be impossible, or surfaced as a distinct "lease-lost / reclaim thrash" state instead of silently looping to DEAD.
- Long-running handlers should get a per-handler wall-clock budget distinct from short jobs, OR a heartbeat/lease-renew so an actively-progressing LLM job isn't reaped.
- Concurrency accounting should not let stalled long jobs permanently starve newly-submitted ones (fair scheduling / separate queue or slot reservation for long lanes).
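One way to read the per-handler budget and heartbeat items above is sketched below. Everything in it is an assumption for illustration: the `Lease` record, the `should_reap` function, the budget values, and the 30-second lease TTL are hypothetical and not taken from gbrain.

```python
from dataclasses import dataclass

# Hypothetical per-handler wall-clock budgets in seconds (illustrative values).
BUDGETS = {"shell": 60, "subagent": 1800, "embed-backfill": 3600}
DEFAULT_BUDGET = 300
LEASE_TTL = 30  # assumed: seconds a lease survives without a heartbeat


@dataclass
class Lease:
    handler: str
    claimed_at: float      # time the worker claimed the job
    last_heartbeat: float  # time of the most recent lease renewal


def should_reap(lease, now):
    """Return a reason string if the job should be reaped, else None."""
    # A job that stopped renewing its lease is reported as lease-lost,
    # a state distinct from running over its budget.
    if now - lease.last_heartbeat > LEASE_TTL:
        return "lease-lost"
    # A job that is still heartbeating is only reaped once it exceeds
    # the budget for its own handler, not a single global limit.
    if now - lease.claimed_at > BUDGETS.get(lease.handler, DEFAULT_BUDGET):
        return "wall-clock timeout exceeded"
    return None
```

Under these assumed values, a subagent job claimed 600 seconds ago that sent a heartbeat 10 seconds ago is left alone, while a shell job with the same timestamps is reaped for exceeding its budget.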
Repro
- Submit a subagent or embed-backfill --reason sync_all job that runs longer than the wall-clock budget.
- gbrain jobs get <id> -> observe Attempts: 0/N (started: >1) then eventual DEAD on wall-clock timeout exceeded.
- Submit a fresh subagent job alongside -> observe it never leaves waiting (started: 0).
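To check the counters in the repro steps without reading each job by hand, the Attempts line can be parsed from saved output. This is a minimal sketch that assumes the line keeps the exact shape shown in the evidence above (Attempts: A/N (started: S)); `parse_attempts` is a hypothetical helper, not part of gbrain.

```python
import re

# Matches lines such as "Attempts: 0/2 (started: 3)".
ATTEMPTS_RE = re.compile(r"Attempts:\s*(\d+)/(\d+)\s*\(started:\s*(\d+)\)")


def parse_attempts(text):
    """Extract the attempt counters from `jobs get` output, or None if absent."""
    match = ATTEMPTS_RE.search(text)
    if match is None:
        return None
    attempts, max_attempts, started = map(int, match.groups())
    return {
        "attempts": attempts,
        "max_attempts": max_attempts,
        "started": started,
        # Starts that were never recorded as a consumed attempt.
        "unaccounted_starts": started - attempts,
    }


sample = "Job #11385: autopilot-cycle (DEAD after 0 attempts)\nAttempts: 0/2 (started: 3)"
print(parse_attempts(sample))
```

A job that is currently running may legitimately show one unaccounted start. A dead job with several, like #11385 with 3, is the signature described in this report.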
Env
- worker: jobs work --concurrency 3 --queue default --max-rss 16384
- PgBouncer transaction-mode (port 6543), prepared statements disabled
- GBRAIN_ANTHROPIC_MAX_INFLIGHT unset