
feat: queue resilience — wall-clock timeouts, backpressure, --no-worker, env concurrency#379

Merged
garrytan merged 8 commits into master from feat/queue-resilience on Apr 24, 2026

Conversation

@garrytan
Owner

Problem

In production (OpenClaw + gbrain), the Minions job queue experienced full blockage:

  1. A single autopilot-cycle job stalled (git index.lock held the process)
  2. The stall detector uses FOR UPDATE SKIP LOCKED, so the stalled job was un-evictable
  3. With worker concurrency=1, this blocked the entire queue for 90+ minutes
  4. Autopilot cron kept submitting — 18 duplicate slots piled up
  5. Shell jobs (X ingestion pipeline) starved completely

Changes

1. Wall-clock timeout sweep (queue.ts + worker.ts)

Dead-letters active jobs that exceed 2× timeout_ms, regardless of lock state. This catches jobs stuck while holding DB connections, the rows that FOR UPDATE SKIP LOCKED stall detection skips.
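The eligibility test can be sketched as a pure predicate (names like `isWallClockExpired` are illustrative, not the repo's actual API):

```typescript
// Illustrative sketch of the wall-clock eligibility check described above.
// Unlike stall detection, it ignores lock state entirely: a job is evicted
// once it has been active for more than 2x its own timeout_ms.
interface ActiveJob {
  startedAt: number; // epoch ms when the worker picked the job up
  timeoutMs: number; // per-job timeout as submitted
}

function isWallClockExpired(job: ActiveJob, nowMs: number): boolean {
  return nowMs - job.startedAt > 2 * job.timeoutMs;
}

// A job 90 minutes into a 1-minute timeout is evicted even while its row
// lock is still held (the case FOR UPDATE SKIP LOCKED cannot reach).
isWallClockExpired({ startedAt: 0, timeoutMs: 60_000 }, 90 * 60_000); // true
```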

2. Submission backpressure — maxWaiting (queue.ts + types.ts)

New maxWaiting option on job submission. Caps waiting jobs per name — prevents autopilot-cycle flood when queue is blocked.

3. --no-worker flag (autopilot.ts)

Skips spawning the built-in worker child. For environments where the worker lifecycle is managed externally (systemd, Docker, OpenClaw service-manager).

4. GBRAIN_WORKER_CONCURRENCY env var (jobs.ts)

Fallback for --concurrency when the worker is spawned by autopilot (which cannot pass CLI flags to the child).

5. Shell job env guard with clear logging (shell.ts + jobs.ts)

Shell handler is always registered but throws UnrecoverableError with a clear message when GBRAIN_ALLOW_SHELL_JOBS=1 is not set. Previously, the handler was simply not registered — jobs would sit in waiting with no explanation.
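A minimal sketch of the guard, assuming UnrecoverableError is the project's marker for "dead-letter, do not retry" (here a plain subclass for illustration):

```typescript
// Always-registered guard: the handler exists, but refuses to run unless
// the env var opts in. The thrown error carries an actionable message
// instead of leaving the job silently waiting.
class UnrecoverableError extends Error {}

function assertShellJobsAllowed(env: Record<string, string | undefined>): void {
  if (env.GBRAIN_ALLOW_SHELL_JOBS !== "1") {
    throw new UnrecoverableError(
      "shell jobs are disabled: set GBRAIN_ALLOW_SHELL_JOBS=1 to enable"
    );
  }
}
```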

Testing

  • bun run typecheck passes clean
  • All changes are backward-compatible (no schema changes, no new defaults that break existing behavior)

root and others added 4 commits April 24, 2026 03:08
…er, env concurrency, shell guard

Prevents stall-induced queue blockage discovered in production (OpenClaw):

1. Wall-clock timeout sweep: dead-letters active jobs exceeding 2× timeout_ms
   (or 2 × lockDuration × max_stalled). Catches jobs stuck while holding DB
   connections where FOR UPDATE SKIP LOCKED stall detection skips them.

2. Submission backpressure (maxWaiting): caps waiting jobs per name at
   submission time. Prevents autopilot-cycle flood when the queue is blocked.

3. --no-worker flag for autopilot: skips spawning the built-in worker child.
   For environments where the worker lifecycle is managed externally (systemd,
   Docker, OpenClaw service-manager).

4. GBRAIN_WORKER_CONCURRENCY env var: fallback for --concurrency when the
   worker is spawned by autopilot (which can't pass CLI flags to the child).

5. Shell job env guard with clear logging: shell handler is always registered
   but throws UnrecoverableError with a clear message when
   GBRAIN_ALLOW_SHELL_JOBS=1 is not set, instead of silently not registering.
…max-waiting CLI

Addresses three production-hardening findings from the CEO + Eng + Codex
adversarial review of PR #379:

D2/H2: maxWaiting was TOCTOU-racy — two concurrent submitters could both
see waitingCount < max and both insert. Wrap the count+select+insert in
pg_advisory_xact_lock keyed on (name, queue). Serializes concurrent
decisions for the SAME key while leaving different keys fully parallel.
Lock auto-releases on txn commit/rollback — no cleanup path to leak.
Also fix the missing queue-scope bug: count and select now filter on
(name, queue) not name alone, so cross-queue same-name jobs don't
suppress each other.
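pg_advisory_xact_lock takes a signed 64-bit integer, so the (name, queue) pair has to be reduced to a stable bigint key. One common approach (an assumption here, not necessarily the repo's actual code) is a 64-bit FNV-1a hash with a separator so ("ab", "c") and ("a", "bc") cannot collide:

```typescript
// Derive a stable signed-64-bit advisory lock key from (name, queue).
// Same pair always contends on the same lock; different pairs stay parallel.
function advisoryLockKey(name: string, queue: string): bigint {
  const FNV_OFFSET = 0xcbf29ce484222325n;
  const FNV_PRIME = 0x100000001b3n;
  let h = FNV_OFFSET;
  // NUL separator keeps ("ab","c") and ("a","bc") from hashing alike.
  for (const ch of `${name}\u0000${queue}`) {
    h ^= BigInt(ch.codePointAt(0)!);
    h = (h * FNV_PRIME) & 0xffffffffffffffffn;
  }
  return BigInt.asIntN(64, h); // wrap into Postgres's signed bigint range
}

// Inside the submit transaction (SQL shown as comments):
//   SELECT pg_advisory_xact_lock($1);  -- $1 = advisoryLockKey(name, queue)
//   SELECT count(*) FROM jobs WHERE name = $2 AND queue = $3 AND status = 'waiting';
//   -- insert only if count < maxWaiting; lock releases on commit/rollback
```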

D3/H3: resolveWorkerConcurrency silently accepted NaN / 0 / negative from
parseInt. `inFlight.size < NaN` is always false → worker claims nothing →
silent wedge from a single-typo env var. Clamp to ≥1 with a loud stderr
warning naming the bad value.
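The clamp can be sketched as follows (the name resolveWorkerConcurrency is from the commit text; the exact signature is assumed):

```typescript
// parseInt on a mistyped env value yields NaN, and `inFlight.size < NaN`
// is always false, so the worker would claim nothing. Clamp to >= 1 and
// warn loudly, naming the bad value.
function resolveWorkerConcurrency(raw: string | undefined, fallback = 1): number {
  const parsed = Number.parseInt(raw ?? "", 10);
  if (!Number.isFinite(parsed) || parsed < 1) {
    if (raw !== undefined) {
      console.error(
        `GBRAIN_WORKER_CONCURRENCY=${JSON.stringify(raw)} is not a positive integer; using ${fallback}`
      );
    }
    return fallback;
  }
  return parsed;
}
```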

D5/H5: `gbrain jobs submit` never parsed `--max-waiting N` despite the
MinionJobInput field. Wire the flag with clamp [1, 100], mirror
`--max-stalled`. Extract `parseMaxWaitingFlag` for unit testing.
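An assumed shape for the extracted helper, clamping to the [1, 100] range mentioned above (undefined means the flag was not given):

```typescript
// Parse --max-waiting N; clamp into [1, 100], reject non-integers.
function parseMaxWaitingFlag(raw: string | undefined): number | undefined {
  if (raw === undefined) return undefined;
  const n = Number.parseInt(raw, 10);
  if (!Number.isFinite(n)) {
    throw new Error(`--max-waiting expects an integer, got ${JSON.stringify(raw)}`);
  }
  return Math.min(100, Math.max(1, n));
}
```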

Q1: Silent coalesce was invisible by design. New
src/core/minions/backpressure-audit.ts mirrors shell-audit.ts's ISO-week
JSONL pattern: `~/.gbrain/audit/backpressure-YYYY-Www.jsonl`. Coalesce
events write one JSONL line with (queue, name, waiting_count, max_waiting,
returned_job_id, ts). Best-effort — disk-full never blocks submission.
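The filename embeds an ISO-8601 week stamp (YYYY-Www). A sketch of that computation (the filename convention is quoted from the commit; this helper is illustrative):

```typescript
// ISO week stamp: shift the date to the Thursday of its ISO week, whose
// calendar year is the ISO year, then count weeks from Jan 1 of that year.
function isoWeekStamp(d: Date): string {
  const t = new Date(Date.UTC(d.getUTCFullYear(), d.getUTCMonth(), d.getUTCDate()));
  const day = t.getUTCDay() || 7;          // Mon=1 ... Sun=7
  t.setUTCDate(t.getUTCDate() + 4 - day);  // move to Thursday
  const yearStart = new Date(Date.UTC(t.getUTCFullYear(), 0, 1));
  const week = Math.ceil(((t.getTime() - yearStart.getTime()) / 86_400_000 + 1) / 7);
  return `${t.getUTCFullYear()}-W${String(week).padStart(2, "0")}`;
}

// e.g. backpressure-${isoWeekStamp(new Date())}.jsonl
```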

A2: `gbrain jobs smoke --wedge-rescue` new opt-in regression case.
Forges a wedged-worker row state, invokes handleStalled + handleTimeouts
+ handleWallClockTimeouts in order, asserts only wall-clock evicts.
Mirrors the v0.14.3 `--sigkill-rescue` shape.

Tests: 23 new unit cases in test/minions.test.ts covering wall-clock
timeout (3 cases + non-interference with handleTimeouts), maxWaiting
(coalesce, clamp 0, floor, concurrent-submitter race via Promise.all,
cross-queue isolation, unset fallthrough), concurrency clamp (7 cases
incl. NaN/0/negative), parseMaxWaitingFlag (5 cases), backpressure
audit file write.

Part of v0.19.1 plan at ~/.claude/plans/ok-wintermute-wrote-this-polished-matsumoto.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…book

A5 / D4: New `queue_health` check in `gbrain doctor`. Postgres-only (PGLite
has no multi-process worker surface). Two subchecks, both cheap (single
SELECT each, status-index-covered):

- stalled-forever: any active job with started_at > 1h. Surfaces the
  worst offenders (top 5 by started_at ASC) with `gbrain jobs get/cancel`
  fix hints. The incident that motivated v0.19.1 ran 90+ min before the
  operator noticed.
- waiting-depth: per-name waiting count exceeds threshold. Default 10,
  overridable via GBRAIN_QUEUE_WAITING_THRESHOLD env (D9). Signals a
  submitter probably needs maxWaiting set.

Worker-heartbeat subcheck from the original plan dropped (D4/H4): no
minion_workers table exists, and lock_until-on-active-jobs is a lossy
proxy that can't distinguish idle-worker from dead-worker. Tracked as
follow-up B7.

A4: --no-worker peer-liveness probe in autopilot. When --no-worker is
set, every cycle runs a cheap SELECT checking for any active job whose
lock_until was refreshed in the last 2 minutes. After 3 consecutive
idle ticks, logs a loud WARNING naming the silent-wedge vector and
referencing B7 as the ground-truth follow-up. Re-arms on next live
signal so the warning doesn't spam every cycle.
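The tick/re-arm behaviour can be captured in a small state machine (class name and threshold are assumptions drawn from the commit text):

```typescript
// Fires the warning exactly once when the idle-tick threshold is reached,
// then stays quiet; any live signal resets the counter and re-arms it.
class PeerLivenessProbe {
  private idleTicks = 0;
  constructor(private readonly threshold = 3) {}

  // Returns true when the WARNING should be emitted this cycle.
  tick(sawLiveWorker: boolean): boolean {
    if (sawLiveWorker) {
      this.idleTicks = 0; // re-arm on any live signal
      return false;
    }
    this.idleTicks += 1;
    return this.idleTicks === this.threshold; // fire once, no spam
  }
}
```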

A6: New docs/guides/queue-operations-runbook.md (one viewport, ~60
lines). "My queue looks wedged — what do I run?" in order of
escalation. What each doctor subcheck means. Self-check for the
--no-worker / no-worker-running footgun.

CLAUDE.md: key-files updates for handleWallClockTimeouts (v0.19.0 Layer
3 kill shot), maxWaiting advisory-lock rewrite (v0.19.1 D2), queue_health
doctor check (v0.19.1 D4), and backpressure-audit.ts.

Tests: all 143 minions + 13 doctor unit tests pass. No new test cases
required in Lane B; the doctor queue_health exercise is in the E2E
verification step (needs real PG to produce meaningful stalled-forever
rows). The --no-worker probe is exercised by the smoke case's wedge
setup in Lane A.

README: unchanged. Existing `gbrain jobs submit` examples don't show
--max-stalled, so no --max-waiting precedent to extend per A6 conditional.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
VERSION: 0.19.0 → 0.19.1 (patch; bug-fix-dominant, no schema change,
no new user-facing vocabulary).

CHANGELOG: new v0.19.1 entry at the top with the full release-summary
template per CLAUDE.md — bold two-line headline, lead paragraph, "numbers
that matter" before/after table measured against the real incident,
"what this means for OpenClaw users" closer, required "To take
advantage of v0.19.1" block naming the worker-restart requirement,
itemized changes by area, and "For contributors" section closing the
loop on the stale autopilot-idempotency narrative the CEO review was
based on.

Mechanism reframing per D1/H1: the 18-job pile-up was NOT caused by
missing idempotency (autopilot already passes
`idempotency_key: autopilot-cycle:${slot}` at autopilot.ts:241). The
18 jobs were 18 DIFFERENT slots stacking up behind the wedged one.
`maxWaiting` still caps the pile; the incident just wasn't about
idempotency. Adversarial review caught this before ship.

SPEC.md: deleted from repo root. It was Wintermute's planning artifact
for the original PR, not a shipped spec. Design docs belong under
docs/designs/ per repo convention; leaving one at repo root set a
precedent this repo doesn't want (A7/D11). CHANGELOG + the plan file
at ~/.claude/plans/ok-wintermute-wrote-this-polished-matsumoto.md are
the durable artifacts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
garrytan force-pushed the feat/queue-resilience branch from 444499c to b042157 on April 24, 2026 07:04
garrytan and others added 4 commits April 24, 2026 00:08
Smoke case was setting lock_until in the past, so handleStalled's
requeue path fired before handleWallClockTimeouts had a chance to
evict. Production scenario is "lock_until still live (worker
renewing) + timeout_at disqualified" — only wall-clock matches.

Single-connection smoke can't simulate a row lock held by another
txn, so we force the equivalent outcome:
- lock_until = now() + 30s → handleStalled skips (not a stall)
- timeout_at = NULL → handleTimeouts skips (needs NOT NULL)
- started_at = now() - 10s, timeout_ms=1000 → wall-clock matches
  (2 × timeout_ms = 2000ms threshold exceeded)
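The three rules above can be written out as predicates over the forged row (field names are illustrative, not the actual schema):

```typescript
// Eligibility checks for the three sweeps, applied to the forged row.
interface Row {
  lockUntil: number | null; // epoch ms; still in the future => not a stall
  timeoutAt: number | null; // epoch ms; NULL disqualifies handleTimeouts
  startedAt: number;
  timeoutMs: number;
}

const stalledMatches   = (r: Row, now: number) => r.lockUntil !== null && r.lockUntil < now;
const timeoutMatches   = (r: Row, now: number) => r.timeoutAt !== null && r.timeoutAt < now;
const wallClockMatches = (r: Row, now: number) => now - r.startedAt > 2 * r.timeoutMs;

const now = Date.now();
const forged: Row = {
  lockUntil: now + 30_000, // still live: handleStalled skips
  timeoutAt: null,         // NULL: handleTimeouts skips
  startedAt: now - 10_000, // 10s active, 1s timeout: wall-clock matches
  timeoutMs: 1_000,
};
// Only the wall-clock sweep evicts: false, false, true.
```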

Verified: SMOKE PASS — Minions healthy + wedge rescue in 0.14s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Master shipped its own v0.19.1 (smoke-test skillpack, PR #369) and then
v0.20.0 (BrainBench extraction to sibling repo, PR #195) while this
branch was in review. Bumping our release from v0.19.1 to v0.20.1
follows the CLAUDE.md rule: "VERSION must be higher than master's."

Resolved conflicts:

- VERSION: 0.19.1 (ours) + 0.20.0 (master) → 0.20.1
- package.json: same bump applied to the version field
- CHANGELOG.md: our queue-resilience entry renamed from v0.19.1 to
  v0.20.1 (6 inline refs updated across the body: numbers-that-matter
  table, "To take advantage" block, pre-v0.20.1 code reference,
  adversarial-review mention, and the v0.19.2 → v0.20.2 deferral
  reference for composite indexes). Entry stays at the top of the
  file, followed by master's v0.20.0 (BrainBench) and v0.19.1
  (smoke-test skillpack). Sequence is now 0.20.1 → 0.20.0 → 0.19.1 →
  0.19.0 → 0.18.2 → ...

Rebuilt binary reports gbrain 0.20.1.

Pre-merge verification carried forward:
- 143 minions + 13 doctor unit tests pass
- typecheck clean
- 189 E2E tests pass against real Postgres 16 + pgvector
- All 3 smoke cases pass (basic, --sigkill-rescue, --wedge-rescue)
- queue_health doctor check fires correctly on a forged stalled-forever job

No source changes — conflict resolution was version-label surgery only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Master shipped v0.20.2 (jobs supervisor, PR #364) while this branch was
in review. Bumping our release from v0.20.1 to v0.20.3 follows the
CLAUDE.md "VERSION must be higher than master's" rule.

Resolved conflicts:

- VERSION: 0.20.1 (ours) + 0.20.2 (master) → 0.20.3
- package.json: version bumped to match
- CHANGELOG.md: our queue-resilience entry renamed from v0.20.1 to
  v0.20.3 throughout (5 inline refs updated: numbers-that-matter
  table, "To take advantage" block, pre-v0.20.3 code reference,
  adversarial-review mention). The "composite indexes deferred to
  v0.20.2" follow-up reference updated to v0.20.4 because master
  already took v0.20.2 for the supervisor feature.

Auto-merged cleanly:

- src/commands/doctor.ts: master added supervisor health check
  (filesystem-only, PID liveness + audit tail) alongside our new
  queue_health check. No conflict — different sections.
- src/commands/jobs.ts: master added `jobs supervisor` subcommand with
  start / status / stop variants. No conflict with our --max-waiting
  CLI wiring or --wedge-rescue smoke case.

New files pulled in from master (supervisor feature):
- src/core/minions/supervisor.ts (MinionSupervisor class)
- src/core/minions/handlers/supervisor-audit.ts (JSONL audit)
- test/supervisor.test.ts (13 cases)
- test/fixtures/supervisor-runner.ts (integration test helper)

Post-merge verification:
- `bun run typecheck` — clean
- `bun test test/minions.test.ts test/doctor.test.ts` — 156 pass
- `bun run build` → gbrain 0.20.3
- CHANGELOG version sequence: 0.20.3 → 0.20.2 → 0.20.0 → 0.19.1 → ...

No source changes — conflict resolution was version-label surgery only.
The queue-resilience code, tests, and docs landed from master cleanly
alongside the new supervisor feature.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two CI failure clusters, both pre-existing but surfaced by the v0.20.3
merge:

1) test/minions-shell.test.ts — 12 failing cases. The shell handler
   throws UnrecoverableError when GBRAIN_ALLOW_SHELL_JOBS !== '1' (the
   production RCE guard at shell.ts:210). The unit tests exercise
   handler mechanics, not the guard, but never set the env var — so
   every invocation exits through the guard path instead of the code
   being tested. Fix: set GBRAIN_ALLOW_SHELL_JOBS=1 in beforeAll,
   restore in afterAll. The env-guard IS still tested separately via
   the test/minions.test.ts case added in v0.20.3 Lane A which toggles
   the var itself.

2) llms-full.txt — stale against CLAUDE.md. Key-files entries for
   queue.ts, doctor.ts, and the new backpressure-audit.ts updated in
   v0.20.3 Lane B triggered the build-llms drift guard. Regenerated
   via `bun run build:llms`; no behavior change, just the inlined-docs
   bundle catching up to source.

Full test run: 2367 pass, 0 fail across 137 files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@garrytan garrytan merged commit d838d47 into master Apr 24, 2026
4 checks passed
garrytan added a commit that referenced this pull request Apr 24, 2026
Pulls upstream v0.20.3 (#379): queue resilience — wall-clock timeouts,
backpressure, --no-worker, env concurrency.

Conflicts resolved:
- VERSION — kept 0.21.0; upstream is 0.20.3
- package.json — v0.21.0 wins
- CHANGELOG.md — v0.21.0 preserved above upstream's v0.20.3

Build clean: 0.21.0 binary runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ChenyqThu pushed a commit to ChenyqThu/jarvis-knowledge-os-v2 that referenced this pull request Apr 27, 2026
Merge upstream/master (commit 11abb24, gbrain v0.20.4) into KOS v2 fork.

Six upstream commits land:
- v0.19.0 check-resolvable OpenClaw fallback (garrytan#326)
- v0.19.1 smoke-test skillpack (garrytan#369)
- v0.20.0 BrainBench extracted to sibling repo (garrytan#195)
- v0.20.2 jobs supervisor (garrytan#364) — Postgres-only, PGLite skips
- v0.20.3 queue resilience + queue_health doctor (garrytan#379) — Postgres-only
- v0.20.4 minion-orchestrator skill consolidation (garrytan#381)

Conflicts resolved (2 real, 5 auto):
- .gitignore: union both fork (.omc/, kos-jarvis log globs) and upstream
  (eval/data/world-v1/world.html, amara-life-v1 cache) entries.
- skills/manifest.json: append upstream's smoke-test skill plus retain
  the 9 kos-jarvis fork skills (39 total).
- CLAUDE.md / README.md / package.json (0.20.4) / skills/RESOLVER.md /
  src/cli.ts (mode 0755) auto-merged cleanly.

Fork-local patches preserved (verified post-merge):
- src/core/pglite-schema.ts:65 — idx_pages_source_id commented out
  (upstream garrytan#370 still open, fix retained).
- src/core/pglite-engine.ts:87 — pg_switch_wal() before close()
  (WAL durability patch, no upstream issue filed yet).
- src/cli.ts mode 100755 — bun shim executable bit.

Issue garrytan#332 (v0_13_0 process.execPath) fixed upstream in v0.19.0 ...
running gbrain apply-migrations --yes will clear the partial-ledger
remainder that has been stuck in doctor since the v0.13 sync.

v0.20's headline features (jobs supervisor, queue_health, wedge-rescue,
backpressure-audit) are Postgres-only and skip on our PGLite engine.
Sync is preventive ... keeps the fork mergeable rather than buying new
runtime capability.

Pre-merge baseline (HEAD 170876f):
- pages 1988, chunks 3750 (100% embedded), links 8522, timeline 10881
- doctor health 60/100 (failed: minions_migration partial 0.13.0)
- brain_score 86/100

Rollback: git tag pre-sync-v0.20-1777105378
PGLite snapshot: ~/.gbrain/brain.pglite.pre-sync-v0.20-1777105391 (416M)