
feat: queue resilience — wall-clock timeouts, backpressure, --no-worker, env concurrency#379

Merged
garrytan merged 8 commits into master from feat/queue-resilience on Apr 24, 2026

Conversation

@garrytan
Owner

Problem

In production (OpenClaw + gbrain), the Minions job queue experienced full blockage:

  1. A single autopilot-cycle job stalled (git index.lock held the process)
  2. The stall detector uses FOR UPDATE SKIP LOCKED, so the stalled job was un-evictable
  3. With worker concurrency=1, this blocked the entire queue for 90+ minutes
  4. Autopilot cron kept submitting — 18 duplicate slots piled up
  5. Shell jobs (X ingestion pipeline) starved completely

Changes

1. Wall-clock timeout sweep (queue.ts + worker.ts)

Dead-letters active jobs that exceed 2× timeout_ms, regardless of lock state. This catches jobs stuck while holding DB connections, the rows that FOR UPDATE SKIP LOCKED stall detection skips.
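The eligibility test can be sketched as a pure predicate (names like `isWallClockExpired` are illustrative, not the repo's actual API):

```typescript
// Illustrative sketch of the wall-clock eligibility check described above.
// Unlike stall detection, it ignores lock state entirely: a job is evicted
// once it has been active for more than 2x its own timeout_ms.
interface ActiveJob {
  startedAt: number; // epoch ms when the worker picked the job up
  timeoutMs: number; // per-job timeout as submitted
}

function isWallClockExpired(job: ActiveJob, nowMs: number): boolean {
  return nowMs - job.startedAt > 2 * job.timeoutMs;
}

// A job 90 minutes into a 1-minute timeout is evicted even while its row
// lock is still held (the case FOR UPDATE SKIP LOCKED cannot reach).
isWallClockExpired({ startedAt: 0, timeoutMs: 60_000 }, 90 * 60_000); // true
```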

2. Submission backpressure — maxWaiting (queue.ts + types.ts)

New maxWaiting option on job submission. Caps waiting jobs per name — prevents autopilot-cycle flood when queue is blocked.

3. --no-worker flag (autopilot.ts)

Skips spawning the built-in worker child. For environments where the worker lifecycle is managed externally (systemd, Docker, OpenClaw service-manager).

4. GBRAIN_WORKER_CONCURRENCY env var (jobs.ts)

Fallback for --concurrency when the worker is spawned by autopilot (which cannot pass CLI flags to the child).

5. Shell job env guard with clear logging (shell.ts + jobs.ts)

Shell handler is always registered but throws UnrecoverableError with a clear message when GBRAIN_ALLOW_SHELL_JOBS=1 is not set. Previously, the handler was simply not registered — jobs would sit in waiting with no explanation.
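A minimal sketch of the guard, assuming UnrecoverableError is the project's marker for "dead-letter, do not retry" (here a plain subclass for illustration):

```typescript
// Always-registered guard: the handler exists, but refuses to run unless
// the env var opts in. The thrown error carries an actionable message
// instead of leaving the job silently waiting.
class UnrecoverableError extends Error {}

function assertShellJobsAllowed(env: Record<string, string | undefined>): void {
  if (env.GBRAIN_ALLOW_SHELL_JOBS !== "1") {
    throw new UnrecoverableError(
      "shell jobs are disabled: set GBRAIN_ALLOW_SHELL_JOBS=1 to enable"
    );
  }
}
```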

Testing

  • bun run typecheck passes clean
  • All changes are backward-compatible (no schema changes, no new defaults that break existing behavior)

root and others added 4 commits April 24, 2026 03:08
…er, env concurrency, shell guard

Prevents stall-induced queue blockage discovered in production (OpenClaw):

1. Wall-clock timeout sweep: dead-letters active jobs exceeding 2× timeout_ms
   (or 2 × lockDuration × max_stalled). Catches jobs stuck while holding DB
   connections where FOR UPDATE SKIP LOCKED stall detection skips them.

2. Submission backpressure (maxWaiting): caps waiting jobs per name at
   submission time. Prevents autopilot-cycle flood when the queue is blocked.

3. --no-worker flag for autopilot: skips spawning the built-in worker child.
   For environments where the worker lifecycle is managed externally (systemd,
   Docker, OpenClaw service-manager).

4. GBRAIN_WORKER_CONCURRENCY env var: fallback for --concurrency when the
   worker is spawned by autopilot (which can't pass CLI flags to the child).

5. Shell job env guard with clear logging: shell handler is always registered
   but throws UnrecoverableError with a clear message when
   GBRAIN_ALLOW_SHELL_JOBS=1 is not set, instead of silently not registering.
…max-waiting CLI

Addresses three production-hardening findings from the CEO + Eng + Codex
adversarial review of PR #379:

D2/H2: maxWaiting was TOCTOU-racy — two concurrent submitters could both
see waitingCount < max and both insert. Wrap the count+select+insert in
pg_advisory_xact_lock keyed on (name, queue). Serializes concurrent
decisions for the SAME key while leaving different keys fully parallel.
Lock auto-releases on txn commit/rollback — no cleanup path to leak.
Also fix the missing queue-scope bug: count and select now filter on
(name, queue) not name alone, so cross-queue same-name jobs don't
suppress each other.
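pg_advisory_xact_lock takes a signed 64-bit integer, so the (name, queue) pair has to be reduced to a stable bigint key. One common approach (an assumption here, not necessarily the repo's actual code) is a 64-bit FNV-1a hash with a separator so ("ab", "c") and ("a", "bc") cannot collide:

```typescript
// Derive a stable signed-64-bit advisory lock key from (name, queue).
// Same pair always contends on the same lock; different pairs stay parallel.
function advisoryLockKey(name: string, queue: string): bigint {
  const FNV_OFFSET = 0xcbf29ce484222325n;
  const FNV_PRIME = 0x100000001b3n;
  let h = FNV_OFFSET;
  // NUL separator keeps ("ab","c") and ("a","bc") from hashing alike.
  for (const ch of `${name}\u0000${queue}`) {
    h ^= BigInt(ch.codePointAt(0)!);
    h = (h * FNV_PRIME) & 0xffffffffffffffffn;
  }
  return BigInt.asIntN(64, h); // wrap into Postgres's signed bigint range
}

// Inside the submit transaction (SQL shown as comments):
//   SELECT pg_advisory_xact_lock($1);  -- $1 = advisoryLockKey(name, queue)
//   SELECT count(*) FROM jobs WHERE name = $2 AND queue = $3 AND status = 'waiting';
//   -- insert only if count < maxWaiting; lock releases on commit/rollback
```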

D3/H3: resolveWorkerConcurrency silently accepted NaN / 0 / negative from
parseInt. `inFlight.size < NaN` is always false → worker claims nothing →
silent wedge from a single-typo env var. Clamp to ≥1 with a loud stderr
warning naming the bad value.
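The clamp can be sketched as follows (the name resolveWorkerConcurrency is from the commit text; the exact signature is assumed):

```typescript
// parseInt on a mistyped env value yields NaN, and `inFlight.size < NaN`
// is always false, so the worker would claim nothing. Clamp to >= 1 and
// warn loudly, naming the bad value.
function resolveWorkerConcurrency(raw: string | undefined, fallback = 1): number {
  const parsed = Number.parseInt(raw ?? "", 10);
  if (!Number.isFinite(parsed) || parsed < 1) {
    if (raw !== undefined) {
      console.error(
        `GBRAIN_WORKER_CONCURRENCY=${JSON.stringify(raw)} is not a positive integer; using ${fallback}`
      );
    }
    return fallback;
  }
  return parsed;
}
```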

D5/H5: `gbrain jobs submit` never parsed `--max-waiting N` despite the
MinionJobInput field. Wire the flag with clamp [1, 100], mirror
`--max-stalled`. Extract `parseMaxWaitingFlag` for unit testing.
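An assumed shape for the extracted helper, clamping to the [1, 100] range mentioned above (undefined means the flag was not given):

```typescript
// Parse --max-waiting N; clamp into [1, 100], reject non-integers.
function parseMaxWaitingFlag(raw: string | undefined): number | undefined {
  if (raw === undefined) return undefined;
  const n = Number.parseInt(raw, 10);
  if (!Number.isFinite(n)) {
    throw new Error(`--max-waiting expects an integer, got ${JSON.stringify(raw)}`);
  }
  return Math.min(100, Math.max(1, n));
}
```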

Q1: Silent coalesce was invisible by design. New
src/core/minions/backpressure-audit.ts mirrors shell-audit.ts's ISO-week
JSONL pattern: `~/.gbrain/audit/backpressure-YYYY-Www.jsonl`. Coalesce
events write one JSONL line with (queue, name, waiting_count, max_waiting,
returned_job_id, ts). Best-effort — disk-full never blocks submission.
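The filename embeds an ISO-8601 week stamp (YYYY-Www). A sketch of that computation (the filename convention is quoted from the commit; this helper is illustrative):

```typescript
// ISO week stamp: shift the date to the Thursday of its ISO week, whose
// calendar year is the ISO year, then count weeks from Jan 1 of that year.
function isoWeekStamp(d: Date): string {
  const t = new Date(Date.UTC(d.getUTCFullYear(), d.getUTCMonth(), d.getUTCDate()));
  const day = t.getUTCDay() || 7;          // Mon=1 ... Sun=7
  t.setUTCDate(t.getUTCDate() + 4 - day);  // move to Thursday
  const yearStart = new Date(Date.UTC(t.getUTCFullYear(), 0, 1));
  const week = Math.ceil(((t.getTime() - yearStart.getTime()) / 86_400_000 + 1) / 7);
  return `${t.getUTCFullYear()}-W${String(week).padStart(2, "0")}`;
}

// e.g. backpressure-${isoWeekStamp(new Date())}.jsonl
```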

A2: `gbrain jobs smoke --wedge-rescue` new opt-in regression case.
Forges a wedged-worker row state, invokes handleStalled + handleTimeouts
+ handleWallClockTimeouts in order, asserts only wall-clock evicts.
Mirrors the v0.14.3 `--sigkill-rescue` shape.

Tests: 23 new unit cases in test/minions.test.ts covering wall-clock
timeout (3 cases + non-interference with handleTimeouts), maxWaiting
(coalesce, clamp 0, floor, concurrent-submitter race via Promise.all,
cross-queue isolation, unset fallthrough), concurrency clamp (7 cases
incl. NaN/0/negative), parseMaxWaitingFlag (5 cases), backpressure
audit file write.

Part of v0.19.1 plan at ~/.claude/plans/ok-wintermute-wrote-this-polished-matsumoto.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…book

A5 / D4: New `queue_health` check in `gbrain doctor`. Postgres-only (PGLite
has no multi-process worker surface). Two subchecks, both cheap (single
SELECT each, status-index-covered):

- stalled-forever: any active job with started_at > 1h. Surfaces the
  worst offenders (top 5 by started_at ASC) with `gbrain jobs get/cancel`
  fix hints. The incident that motivated v0.19.1 ran 90+ min before the
  operator noticed.
- waiting-depth: per-name waiting count exceeds threshold. Default 10,
  overridable via GBRAIN_QUEUE_WAITING_THRESHOLD env (D9). Signals a
  submitter probably needs maxWaiting set.

Worker-heartbeat subcheck from the original plan dropped (D4/H4): no
minion_workers table exists, and lock_until-on-active-jobs is a lossy
proxy that can't distinguish idle-worker from dead-worker. Tracked as
follow-up B7.

A4: --no-worker peer-liveness probe in autopilot. When --no-worker is
set, every cycle runs a cheap SELECT checking for any active job whose
lock_until was refreshed in the last 2 minutes. After 3 consecutive
idle ticks, logs a loud WARNING naming the silent-wedge vector and
referencing B7 as the ground-truth follow-up. Re-arms on next live
signal so the warning doesn't spam every cycle.
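The tick/re-arm behaviour can be captured in a small state machine (class name and threshold are assumptions drawn from the commit text):

```typescript
// Fires the warning exactly once when the idle-tick threshold is reached,
// then stays quiet; any live signal resets the counter and re-arms it.
class PeerLivenessProbe {
  private idleTicks = 0;
  constructor(private readonly threshold = 3) {}

  // Returns true when the WARNING should be emitted this cycle.
  tick(sawLiveWorker: boolean): boolean {
    if (sawLiveWorker) {
      this.idleTicks = 0; // re-arm on any live signal
      return false;
    }
    this.idleTicks += 1;
    return this.idleTicks === this.threshold; // fire once, no spam
  }
}
```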

A6: New docs/guides/queue-operations-runbook.md (one viewport, ~60
lines). "My queue looks wedged — what do I run?" in order of
escalation. What each doctor subcheck means. Self-check for the
--no-worker / no-worker-running footgun.

CLAUDE.md: key-files updates for handleWallClockTimeouts (v0.19.0 Layer
3 kill shot), maxWaiting advisory-lock rewrite (v0.19.1 D2), queue_health
doctor check (v0.19.1 D4), and backpressure-audit.ts.

Tests: all 143 minions + 13 doctor unit tests pass. No new test cases
required in Lane B; the doctor queue_health exercise is in the E2E
verification step (needs real PG to produce meaningful stalled-forever
rows). The --no-worker probe is exercised by the smoke case's wedge
setup in Lane A.

README: unchanged. Existing `gbrain jobs submit` examples don't show
--max-stalled, so no --max-waiting precedent to extend per A6 conditional.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
VERSION: 0.19.0 → 0.19.1 (patch; bug-fix-dominant, no schema change,
no new user-facing vocabulary).

CHANGELOG: new v0.19.1 entry at the top with the full release-summary
template per CLAUDE.md — bold two-line headline, lead paragraph, "numbers
that matter" before/after table measured against the real incident,
"what this means for OpenClaw users" closer, required "To take
advantage of v0.19.1" block naming the worker-restart requirement,
itemized changes by area, and "For contributors" section closing the
loop on the stale autopilot-idempotency narrative the CEO review was
based on.

Mechanism reframing per D1/H1: the 18-job pile-up was NOT caused by
missing idempotency (autopilot already passes
`idempotency_key: autopilot-cycle:${slot}` at autopilot.ts:241). The
18 jobs were 18 DIFFERENT slots stacking up behind the wedged one.
`maxWaiting` still caps the pile; the incident just wasn't about
idempotency. Adversarial review caught this before ship.

SPEC.md: deleted from repo root. It was Wintermute's planning artifact
for the original PR, not a shipped spec. Design docs belong under
docs/designs/ per repo convention; leaving one at repo root set a
precedent this repo doesn't want (A7/D11). CHANGELOG + the plan file
at ~/.claude/plans/ok-wintermute-wrote-this-polished-matsumoto.md are
the durable artifacts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
garrytan force-pushed the feat/queue-resilience branch from 444499c to b042157 on April 24, 2026 07:04
garrytan and others added 4 commits April 24, 2026 00:08
Smoke case was setting lock_until in the past, so handleStalled's
requeue path fired before handleWallClockTimeouts had a chance to
evict. Production scenario is "lock_until still live (worker
renewing) + timeout_at disqualified" — only wall-clock matches.

Single-connection smoke can't simulate a row lock held by another
txn, so we force the equivalent outcome:
- lock_until = now() + 30s → handleStalled skips (not a stall)
- timeout_at = NULL → handleTimeouts skips (needs NOT NULL)
- started_at = now() - 10s, timeout_ms=1000 → wall-clock matches
  (2 × timeout_ms = 2000ms threshold exceeded)
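The three rules above can be written out as predicates over the forged row (field names are illustrative, not the actual schema):

```typescript
// Eligibility checks for the three sweeps, applied to the forged row.
interface Row {
  lockUntil: number | null; // epoch ms; still in the future => not a stall
  timeoutAt: number | null; // epoch ms; NULL disqualifies handleTimeouts
  startedAt: number;
  timeoutMs: number;
}

const stalledMatches   = (r: Row, now: number) => r.lockUntil !== null && r.lockUntil < now;
const timeoutMatches   = (r: Row, now: number) => r.timeoutAt !== null && r.timeoutAt < now;
const wallClockMatches = (r: Row, now: number) => now - r.startedAt > 2 * r.timeoutMs;

const now = Date.now();
const forged: Row = {
  lockUntil: now + 30_000, // still live: handleStalled skips
  timeoutAt: null,         // NULL: handleTimeouts skips
  startedAt: now - 10_000, // 10s active, 1s timeout: wall-clock matches
  timeoutMs: 1_000,
};
// Only the wall-clock sweep evicts: false, false, true.
```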

Verified: SMOKE PASS — Minions healthy + wedge rescue in 0.14s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Master shipped its own v0.19.1 (smoke-test skillpack, PR #369) and then
v0.20.0 (BrainBench extraction to sibling repo, PR #195) while this
branch was in review. Bumping our release from v0.19.1 to v0.20.1
follows the CLAUDE.md rule: "VERSION must be higher than master's."

Resolved conflicts:

- VERSION: 0.19.1 (ours) + 0.20.0 (master) → 0.20.1
- package.json: same bump applied to the version field
- CHANGELOG.md: our queue-resilience entry renamed from v0.19.1 to
  v0.20.1 (6 inline refs updated across the body: numbers-that-matter
  table, "To take advantage" block, pre-v0.20.1 code reference,
  adversarial-review mention, and the v0.19.2 → v0.20.2 deferral
  reference for composite indexes). Entry stays at the top of the
  file, followed by master's v0.20.0 (BrainBench) and v0.19.1
  (smoke-test skillpack). Sequence is now 0.20.1 → 0.20.0 → 0.19.1 →
  0.19.0 → 0.18.2 → ...

Rebuilt binary reports gbrain 0.20.1.

Pre-merge verification carried forward:
- 143 minions + 13 doctor unit tests pass
- typecheck clean
- 189 E2E tests pass against real Postgres 16 + pgvector
- All 3 smoke cases pass (basic, --sigkill-rescue, --wedge-rescue)
- queue_health doctor check fires correctly on a forged stalled-forever job

No source changes — conflict resolution was version-label surgery only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Master shipped v0.20.2 (jobs supervisor, PR #364) while this branch was
in review. Bumping our release from v0.20.1 to v0.20.3 follows the
CLAUDE.md "VERSION must be higher than master's" rule.

Resolved conflicts:

- VERSION: 0.20.1 (ours) + 0.20.2 (master) → 0.20.3
- package.json: version bumped to match
- CHANGELOG.md: our queue-resilience entry renamed from v0.20.1 to
  v0.20.3 throughout (5 inline refs updated: numbers-that-matter
  table, "To take advantage" block, pre-v0.20.3 code reference,
  adversarial-review mention). The "composite indexes deferred to
  v0.20.2" follow-up reference updated to v0.20.4 because master
  already took v0.20.2 for the supervisor feature.

Auto-merged cleanly:

- src/commands/doctor.ts: master added supervisor health check
  (filesystem-only, PID liveness + audit tail) alongside our new
  queue_health check. No conflict — different sections.
- src/commands/jobs.ts: master added `jobs supervisor` subcommand with
  start / status / stop variants. No conflict with our --max-waiting
  CLI wiring or --wedge-rescue smoke case.

New files pulled in from master (supervisor feature):
- src/core/minions/supervisor.ts (MinionSupervisor class)
- src/core/minions/handlers/supervisor-audit.ts (JSONL audit)
- test/supervisor.test.ts (13 cases)
- test/fixtures/supervisor-runner.ts (integration test helper)

Post-merge verification:
- `bun run typecheck` — clean
- `bun test test/minions.test.ts test/doctor.test.ts` — 156 pass
- `bun run build` → gbrain 0.20.3
- CHANGELOG version sequence: 0.20.3 → 0.20.2 → 0.20.0 → 0.19.1 → ...

No source changes — conflict resolution was version-label surgery only.
The queue-resilience code, tests, and docs landed from master cleanly
alongside the new supervisor feature.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two CI failure clusters, both pre-existing but surfaced by the v0.20.3
merge:

1) test/minions-shell.test.ts — 12 failing cases. The shell handler
   throws UnrecoverableError when GBRAIN_ALLOW_SHELL_JOBS !== '1' (the
   production RCE guard at shell.ts:210). The unit tests exercise
   handler mechanics, not the guard, but never set the env var — so
   every invocation exits through the guard path instead of the code
   being tested. Fix: set GBRAIN_ALLOW_SHELL_JOBS=1 in beforeAll,
   restore in afterAll. The env-guard IS still tested separately via
   the test/minions.test.ts case added in v0.20.3 Lane A which toggles
   the var itself.

2) llms-full.txt — stale against CLAUDE.md. Key-files entries for
   queue.ts, doctor.ts, and the new backpressure-audit.ts updated in
   v0.20.3 Lane B triggered the build-llms drift guard. Regenerated
   via `bun run build:llms`; no behavior change, just the inlined-docs
   bundle catching up to source.

Full test run: 2367 pass, 0 fail across 137 files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@garrytan garrytan merged commit d838d47 into master Apr 24, 2026
4 checks passed
garrytan added a commit that referenced this pull request Apr 24, 2026
Pulls upstream v0.20.3 (#379): queue resilience — wall-clock timeouts,
backpressure, --no-worker, env concurrency.

Conflicts resolved:
- VERSION — kept 0.21.0; upstream is 0.20.3
- package.json — v0.21.0 wins
- CHANGELOG.md — v0.21.0 preserved above upstream's v0.20.3

Build clean: 0.21.0 binary runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ChenyqThu pushed a commit to ChenyqThu/jarvis-knowledge-os-v2 that referenced this pull request Apr 27, 2026
Merge upstream/master (commit 11abb24, gbrain v0.20.4) into KOS v2 fork.

Six upstream commits land:
- v0.19.0 check-resolvable OpenClaw fallback (garrytan#326)
- v0.19.1 smoke-test skillpack (garrytan#369)
- v0.20.0 BrainBench extracted to sibling repo (garrytan#195)
- v0.20.2 jobs supervisor (garrytan#364) — Postgres-only, PGLite skips
- v0.20.3 queue resilience + queue_health doctor (garrytan#379) — Postgres-only
- v0.20.4 minion-orchestrator skill consolidation (garrytan#381)

Conflicts resolved (2 real, 5 auto):
- .gitignore: union both fork (.omc/, kos-jarvis log globs) and upstream
  (eval/data/world-v1/world.html, amara-life-v1 cache) entries.
- skills/manifest.json: append upstream's smoke-test skill plus retain
  the 9 kos-jarvis fork skills (39 total).
- CLAUDE.md / README.md / package.json (0.20.4) / skills/RESOLVER.md /
  src/cli.ts (mode 0755) auto-merged cleanly.

Fork-local patches preserved (verified post-merge):
- src/core/pglite-schema.ts:65 — idx_pages_source_id commented out
  (upstream garrytan#370 still open, fix retained).
- src/core/pglite-engine.ts:87 — pg_switch_wal() before close()
  (WAL durability patch, no upstream issue filed yet).
- src/cli.ts mode 100755 — bun shim executable bit.

Issue garrytan#332 (v0_13_0 process.execPath) fixed upstream in v0.19.0 ...
running gbrain apply-migrations --yes will clear the partial-ledger
remainder that has been stuck in doctor since the v0.13 sync.

v0.20's headline features (jobs supervisor, queue_health, wedge-rescue,
backpressure-audit) are Postgres-only and skip on our PGLite engine.
Sync is preventive ... keeps the fork mergeable rather than buying new
runtime capability.

Pre-merge baseline (HEAD 170876f):
- pages 1988, chunks 3750 (100% embedded), links 8522, timeline 10881
- doctor health 60/100 (failed: minions_migration partial 0.13.0)
- brain_score 86/100

Rollback: git tag pre-sync-v0.20-1777105378
PGLite snapshot: ~/.gbrain/brain.pglite.pre-sync-v0.20-1777105391 (416M)