Minions: long-running jobs (subagent/embed-backfill/autopilot-cycle) thrash on wall-clock timeout with broken attempt accounting, starving the queue #1737

@garrytan-agents

Summary

The shell-job lane is rock solid (38/38 done, ~11s avg). But the long-running job lanes (subagent, embed-backfill, autopilot-cycle) thrash and die without ever consuming an attempt, and other jobs pile up waiting behind them.

Evidence (observed on a live worker: gbrain jobs work --concurrency 3 --queue default --max-rss 16384)

gbrain jobs get on dead jobs:

Job #11385: autopilot-cycle (DEAD after 0 attempts)
  Attempts: 0/2 (started: 3)
  Error: wall-clock timeout exceeded

Job #11383: embed-backfill (DEAD after 0 attempts)
  Attempts: 0/3 (started: 1)
  Error: wall-clock timeout exceeded
  Data: {"reason":"sync_all","sourceId":"straylight-brain","batchSize":500}

The smoking gun is Attempts: 0/2 (started: 3): the job was started 3 times but the attempt counter never incremented. Pattern: the worker claims the job (started++), the job runs long and exceeds the wall-clock budget, the lease is killed mid-flight, and the attempt is never recorded as consumed. The job is then reclaimed and re-run until it exhausts the started ceiling -> DEAD.

Meanwhile a freshly-submitted subagent job (#11407) sat at Attempts: 0/3 (started: 0), never claimed, because the thrashing long jobs monopolize the 3 concurrency slots.
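To make the suspected pattern concrete, here is a toy model of the accounting. This is not gbrain's code: the Job fields, the run_until_dead function, and the started_ceiling parameter are all hypothetical, and the ceiling of 3 is picked only to match the autopilot-cycle output above. It shows how a timeout that never consumes an attempt yields attempts 0 with started 3, and what the numbers would look like if the timeout did count.

```python
from dataclasses import dataclass


@dataclass
class Job:
    max_attempts: int
    started: int = 0
    attempts: int = 0
    state: str = "waiting"


def run_until_dead(job: Job, started_ceiling: int, count_timeout_as_attempt: bool) -> Job:
    """Repeatedly claim a job whose handler always exceeds the wall-clock budget.

    started_ceiling is a hypothetical limit on claims; the real rule is unknown.
    """
    while job.state != "dead":
        job.started += 1  # worker claims the job (started++)
        # The handler runs past the budget and the lease is killed mid-flight.
        if count_timeout_as_attempt:
            job.attempts += 1  # expected behavior: the timeout consumes an attempt
        if job.attempts >= job.max_attempts or job.started >= started_ceiling:
            job.state = "dead"
    return job


buggy = run_until_dead(Job(max_attempts=2), started_ceiling=3, count_timeout_as_attempt=False)
print(f"Attempts: {buggy.attempts}/{buggy.max_attempts} (started: {buggy.started})")
# prints: Attempts: 0/2 (started: 3)

fixed = run_until_dead(Job(max_attempts=2), started_ceiling=3, count_timeout_as_attempt=True)
print(f"Attempts: {fixed.attempts}/{fixed.max_attempts} (started: {fixed.started})")
# prints: Attempts: 2/2 (started: 2)
```

Note the model does not explain the embed-backfill case (DEAD at started: 1), so the real ceiling logic is probably more involved than a single counter.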

Impact

  • The subagent (LLM) lane is effectively unusable for batch fan-out: jobs either never get claimed or die on wall-clock without making progress.
  • Queue starvation: 5+ jobs stuck waiting, oldest 5 days old (protected #10506 from 2026-05-27).
  • The doctor hint exists (no subagent jobs completed — cap may be too tight; export GBRAIN_ANTHROPIC_MAX_INFLIGHT=64) but the root issue is lease/attempt accounting, not just the inflight cap.
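The starvation can be illustrated with a small sketch of one claim pass. Everything here is an assumption for illustration: claim_pass, the max_reclaimed knob, and the job names are hypothetical, and the model assumes reclaimed jobs sort ahead of fresh ones (for example by age). With 3 slots and 3 thrashing jobs, the fresh job is never reached unless some capacity is held back from reclaimed work.

```python
from typing import List, Optional, Tuple


def claim_pass(
    queue: List[Tuple[str, int]],
    concurrency: int,
    max_reclaimed: Optional[int] = None,
) -> List[str]:
    """Return the job ids claimed in one pass over the queue.

    queue holds (job_id, started) pairs in claim order.
    max_reclaimed is a hypothetical cap on slots that previously started
    (reclaimed) jobs may occupy; None means no reservation.
    """
    if max_reclaimed is None:
        max_reclaimed = concurrency
    claimed: List[str] = []
    reclaimed_used = 0
    for job_id, started in queue:
        if len(claimed) == concurrency:
            break
        if started > 0:
            if reclaimed_used >= max_reclaimed:
                continue  # leave this slot for a never-started job
            reclaimed_used += 1
        claimed.append(job_id)
    return claimed


# Three thrashing long jobs ahead of one fresh subagent job (names are made up).
queue = [("long-a", 2), ("long-b", 1), ("long-c", 1), ("fresh-subagent", 0)]

print(claim_pass(queue, concurrency=3))
# prints: ['long-a', 'long-b', 'long-c']

print(claim_pass(queue, concurrency=3, max_reclaimed=2))
# prints: ['long-a', 'long-b', 'fresh-subagent']
```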

Expected

  • A job killed by wall-clock timeout should increment attempts (it consumed an attempt), not leave attempts:0 / started:N.
  • started >> attempts should be impossible, or surfaced as a distinct "lease-lost / reclaim thrash" state instead of silently looping to DEAD.
  • Long-running handlers should get a per-handler wall-clock budget distinct from short jobs, OR a heartbeat/lease-renew so an actively-progressing LLM job isn't reaped.
  • Concurrency accounting should not let stalled long jobs permanently starve newly-submitted ones (fair scheduling / separate queue or slot reservation for long lanes).
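A rough sketch of how the per-handler budget and heartbeat ideas could fit together, under stated assumptions: the BUDGETS values, LEASE_TTL, the Lease class, and should_reap are all hypothetical and do not reflect gbrain's schema or defaults. The point is only that "ran too long" and "stopped heartbeating" become two separate, distinguishable reasons.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical per-handler wall-clock budgets in seconds (illustrative values only).
BUDGETS = {"shell": 60, "subagent": 1800, "embed-backfill": 3600}
DEFAULT_BUDGET = 300
LEASE_TTL = 30  # hypothetical: seconds a lease survives without a heartbeat


@dataclass
class Lease:
    handler: str
    started_at: float
    renewed_at: float

    def heartbeat(self, now: float) -> None:
        """Called by a handler that is still making progress."""
        self.renewed_at = now


def should_reap(lease: Lease, now: float) -> Optional[str]:
    """Return a reason to reap the job, or None if it may keep running."""
    budget = BUDGETS.get(lease.handler, DEFAULT_BUDGET)
    if now - lease.started_at > budget:
        return "wall-clock timeout exceeded"
    if now - lease.renewed_at > LEASE_TTL:
        return "lease lost"
    return None


lease = Lease(handler="subagent", started_at=0.0, renewed_at=0.0)
print(should_reap(lease, now=60.0))    # prints: lease lost
lease.heartbeat(now=50.0)
print(should_reap(lease, now=60.0))    # prints: None
lease.heartbeat(now=1800.0)
print(should_reap(lease, now=1801.0))  # prints: wall-clock timeout exceeded
```

Either reason would still need to increment attempts when the job is reaped; the sketch only covers the reap decision.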

Repro

  1. Submit a subagent or embed-backfill --reason sync_all job that runs longer than the wall-clock budget.
  2. gbrain jobs get <id> -> observe Attempts: 0/N (started: >1) then eventual DEAD on wall-clock timeout exceeded.
  3. Submit a fresh subagent job alongside -> observe it never leaves waiting (started: 0).
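For checking step 2 across many jobs, a small helper can flag the mismatch from the Attempts line itself. This is a sketch: the regex is based only on the two output samples above, and the classify function, its labels, and the "more than one unaccounted start" threshold are assumptions, not an existing gbrain state.

```python
import re

# Matches lines such as "Attempts: 0/2 (started: 3)" from the samples in this issue.
ATTEMPTS_RE = re.compile(r"Attempts:\s*(\d+)/(\d+)\s*\(started:\s*(\d+)\)")


def classify(line: str) -> str:
    """Label a job from its Attempts line (labels are hypothetical)."""
    match = ATTEMPTS_RE.search(line)
    if match is None:
        raise ValueError(f"no Attempts field in: {line!r}")
    attempts, _max_attempts, started = (int(g) for g in match.groups())
    unaccounted = started - attempts
    if unaccounted > 1:
        return "reclaim-thrash"
    if unaccounted == 1:
        return "one-unaccounted-start"
    return "consistent"


print(classify("  Attempts: 0/2 (started: 3)"))  # prints: reclaim-thrash
print(classify("  Attempts: 0/3 (started: 1)"))  # prints: one-unaccounted-start
print(classify("Attempts: 0/3 (started: 0)"))    # prints: consistent
```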

Env

  • worker: jobs work --concurrency 3 --queue default --max-rss 16384
  • PgBouncer transaction-mode (port 6543), prepared statements disabled
  • GBRAIN_ANTHROPIC_MAX_INFLIGHT unset
