feat: queue resilience — wall-clock timeouts, backpressure, --no-worker, env concurrency (#379)
Merged
Conversation
feat: queue resilience — wall-clock timeouts, backpressure, --no-worker, env concurrency, shell guard

Prevents stall-induced queue blockage discovered in production (OpenClaw):

1. Wall-clock timeout sweep: dead-letters active jobs exceeding 2× timeout_ms (or 2 × lockDuration × max_stalled). Catches jobs stuck while holding DB connections where FOR UPDATE SKIP LOCKED stall detection skips them.
2. Submission backpressure (maxWaiting): caps waiting jobs per name at submission time. Prevents autopilot-cycle flood when the queue is blocked.
3. --no-worker flag for autopilot: skips spawning the built-in worker child. For environments where the worker lifecycle is managed externally (systemd, Docker, OpenClaw service-manager).
4. GBRAIN_WORKER_CONCURRENCY env var: fallback for --concurrency when the worker is spawned by autopilot (which can't pass CLI flags to the child).
5. Shell job env guard with clear logging: the shell handler is always registered but throws UnrecoverableError with a clear message when GBRAIN_ALLOW_SHELL_JOBS=1 is not set, instead of silently not registering.
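The wall-clock predicate in item 1 can be sketched as a pure function. This is a minimal illustration, assuming names (`wallClockExpired`, `JobRow`) that are not the actual `queue.ts` implementation:

```typescript
// Illustrative sketch of the wall-clock eviction predicate, not the
// shipped queue.ts code. Field names are assumptions.
interface JobRow {
  startedAt: number;          // epoch ms when the job went active
  timeoutMs: number | null;   // per-job timeout_ms, if set
  lockDurationMs: number;     // lockDuration
  maxStalled: number;         // max_stalled
}

// A job is wall-clock-expired when it has been active longer than
// 2 × timeout_ms, falling back to 2 × lockDuration × max_stalled when
// no explicit timeout is set. Lock state is deliberately ignored —
// that is the whole point of this sweep.
function wallClockExpired(job: JobRow, nowMs: number): boolean {
  const budget = job.timeoutMs != null
    ? 2 * job.timeoutMs
    : 2 * job.lockDurationMs * job.maxStalled;
  return nowMs - job.startedAt > budget;
}
```

Because the predicate never consults `lock_until`, a job that is still renewing its lock (or holding its DB connection) is evicted once the budget is exceeded.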
…max-waiting CLI

Addresses three production-hardening findings from the CEO + Eng + Codex adversarial review of PR #379:

D2/H2: maxWaiting was TOCTOU-racy — two concurrent submitters could both see waitingCount < max and both insert. Wrap the count+select+insert in pg_advisory_xact_lock keyed on (name, queue). This serializes concurrent decisions for the SAME key while leaving different keys fully parallel; the lock auto-releases on txn commit/rollback, so there is no cleanup path to leak. Also fixes the missing queue-scope bug: count and select now filter on (name, queue), not name alone, so cross-queue same-name jobs don't suppress each other.

D3/H3: resolveWorkerConcurrency silently accepted NaN / 0 / negative from parseInt. `inFlight.size < NaN` is always false → the worker claims nothing → silent wedge from a single-typo env var. Clamp to ≥ 1 with a loud stderr warning naming the bad value.

D5/H5: `gbrain jobs submit` never parsed `--max-waiting N` despite the MinionJobInput field. Wire the flag with clamp [1, 100], mirroring `--max-stalled`. Extract `parseMaxWaitingFlag` for unit testing.

Q1: Silent coalesce was invisible by design. New src/core/minions/backpressure-audit.ts mirrors shell-audit.ts's ISO-week JSONL pattern: `~/.gbrain/audit/backpressure-YYYY-Www.jsonl`. Coalesce events write one JSONL line with (queue, name, waiting_count, max_waiting, returned_job_id, ts). Best-effort — disk-full never blocks submission.

A2: `gbrain jobs smoke --wedge-rescue` is a new opt-in regression case. It forges a wedged-worker row state, invokes handleStalled + handleTimeouts + handleWallClockTimeouts in order, and asserts only wall-clock evicts. Mirrors the v0.14.3 `--sigkill-rescue` shape.

Tests: 23 new unit cases in test/minions.test.ts covering wall-clock timeout (3 cases + non-interference with handleTimeouts), maxWaiting (coalesce, clamp 0, floor, concurrent-submitter race via Promise.all, cross-queue isolation, unset fallthrough), concurrency clamp (7 cases incl. NaN/0/negative), parseMaxWaitingFlag (5 cases), and the backpressure audit file write.

Part of v0.19.1 plan at ~/.claude/plans/ok-wintermute-wrote-this-polished-matsumoto.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
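The D2/H2 serialization can be sketched as follows. This is an illustrative TypeScript sketch, not the shipped `queue.ts` code: `pg_advisory_xact_lock` takes a bigint key, so the key-derivation helpers here (`fnv1a64`, `advisoryKeyFor`) are assumptions — any stable string→int8 hash shared by all submitters would do.

```typescript
// Derive a deterministic signed-64-bit advisory-lock key from
// (name, queue). FNV-1a is an illustrative choice, not necessarily
// what the repo uses.
function fnv1a64(s: string): bigint {
  let h = 0xcbf29ce484222325n;        // FNV-1a 64-bit offset basis
  const prime = 0x100000001b3n;       // FNV-1a 64-bit prime
  for (let i = 0; i < s.length; i++) {
    h ^= BigInt(s.charCodeAt(i));
    h = (h * prime) & 0xffffffffffffffffn;
  }
  return BigInt.asIntN(64, h);        // wrap into Postgres int8 range
}

function advisoryKeyFor(name: string, queue: string): bigint {
  // NUL separator avoids collisions like ("a","bc") vs ("ab","c")
  return fnv1a64(`${queue}\u0000${name}`);
}

// Inside ONE transaction (commented SQL flow, not the shipped queries):
//   SELECT pg_advisory_xact_lock($key);   -- key = advisoryKeyFor(name, queue)
//   SELECT count(*) FROM jobs
//     WHERE name = $name AND queue = $queue AND status = 'waiting';
//   -- count >= max_waiting ? return the existing job id (coalesce)
//   --                      : INSERT the new job
//   COMMIT;   -- xact-scoped lock auto-releases; no cleanup path to leak
```

Because the lock is keyed per (name, queue), two submitters racing on the same key serialize at the `pg_advisory_xact_lock` call, while submitters for different keys proceed fully in parallel.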
…book

A5 / D4: New `queue_health` check in `gbrain doctor`. Postgres-only (PGLite has no multi-process worker surface). Two subchecks, both cheap (a single SELECT each, status-index-covered):

- stalled-forever: any active job with started_at > 1h ago. Surfaces the worst offenders (top 5 by started_at ASC) with `gbrain jobs get/cancel` fix hints. The incident that motivated v0.19.1 ran 90+ minutes before the operator noticed.
- waiting-depth: per-name waiting count exceeds a threshold. Default 10, overridable via the GBRAIN_QUEUE_WAITING_THRESHOLD env var (D9). Signals that a submitter probably needs maxWaiting set.

The worker-heartbeat subcheck from the original plan was dropped (D4/H4): no minion_workers table exists, and lock_until-on-active-jobs is a lossy proxy that can't distinguish idle-worker from dead-worker. Tracked as follow-up B7.

A4: --no-worker peer-liveness probe in autopilot. When --no-worker is set, every cycle runs a cheap SELECT checking for any active job whose lock_until was refreshed in the last 2 minutes. After 3 consecutive idle ticks, it logs a loud WARNING naming the silent-wedge vector and referencing B7 as the ground-truth follow-up. It re-arms on the next live signal so the warning doesn't spam every cycle.

A6: New docs/guides/queue-operations-runbook.md (one viewport, ~60 lines). "My queue looks wedged — what do I run?" in order of escalation; what each doctor subcheck means; a self-check for the --no-worker / no-worker-running footgun.

CLAUDE.md: key-files updates for handleWallClockTimeouts (v0.19.0 Layer 3 kill shot), the maxWaiting advisory-lock rewrite (v0.19.1 D2), the queue_health doctor check (v0.19.1 D4), and backpressure-audit.ts.

Tests: all 143 minions + 13 doctor unit tests pass. No new test cases required in Lane B; the doctor queue_health exercise is in the E2E verification step (needs real PG to produce meaningful stalled-forever rows). The --no-worker probe is exercised by the smoke case's wedge setup in Lane A.

README: unchanged. Existing `gbrain jobs submit` examples don't show --max-stalled, so there is no --max-waiting precedent to extend per the A6 conditional.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
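The A4 probe's warn-once / re-arm behavior is a small state machine. The class and method names below are illustrative, not the `autopilot.ts` implementation; only the behavior (warn after 3 consecutive idle ticks, re-arm on a live signal) comes from the commit message:

```typescript
// Illustrative sketch of the --no-worker peer-liveness warning logic.
class PeerLivenessMonitor {
  private idleTicks = 0;
  private warned = false;

  // Called once per autopilot cycle with the result of the cheap
  // "any active job refreshed lock_until in the last 2 minutes?" SELECT.
  // Returns true when a WARNING should be emitted this cycle.
  tick(sawLiveWorker: boolean): boolean {
    if (sawLiveWorker) {
      this.idleTicks = 0;
      this.warned = false;   // re-arm so a future wedge warns again
      return false;
    }
    this.idleTicks++;
    if (this.idleTicks >= 3 && !this.warned) {
      this.warned = true;    // warn once; don't spam every cycle
      return true;
    }
    return false;
  }
}
```

A live signal both resets the idle counter and clears the warned flag, so the next wedge produces a fresh warning rather than silence.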
VERSION: 0.19.0 → 0.19.1 (patch; bug-fix-dominant, no schema change,
no new user-facing vocabulary).
CHANGELOG: new v0.19.1 entry at the top with the full release-summary
template per CLAUDE.md — bold two-line headline, lead paragraph, "numbers
that matter" before/after table measured against the real incident,
"what this means for OpenClaw users" closer, required "To take
advantage of v0.19.1" block naming the worker-restart requirement,
itemized changes by area, and "For contributors" section closing the
loop on the stale autopilot-idempotency narrative the CEO review was
based on.
Mechanism reframing per D1/H1: the 18-job pile-up was NOT caused by
missing idempotency (autopilot already passes
`idempotency_key: autopilot-cycle:${slot}` at autopilot.ts:241). The
18 jobs were 18 DIFFERENT slots stacking up behind the wedged one.
`maxWaiting` still caps the pile; the incident just wasn't about
idempotency. Adversarial review caught this before ship.
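The slot mechanics behind this reframing can be shown with a toy in-memory model. `submitIdempotent` and the `waiting` Map are hypothetical stand-ins for the queue's submit path, kept only to illustrate why per-slot idempotency cannot cap a pile of distinct slots:

```typescript
// Toy model: `autopilot-cycle:${slot}` dedupes WITHIN a slot, but each
// cycle uses a new slot, so a wedged queue accumulates one waiting job
// per slot. This is an illustration, not the gbrain queue code.
const waiting = new Map<string, string>(); // idempotency_key -> job id

function submitIdempotent(slot: number): string {
  const key = `autopilot-cycle:${slot}`;
  const existing = waiting.get(key);
  if (existing) return existing;           // same slot: coalesced
  const id = `job-${waiting.size + 1}`;
  waiting.set(key, id);                    // new slot: new waiting job
  return id;
}
```

Resubmitting slot 1 returns the same job id, but submitting slots 1 through 18 while the queue is wedged yields 18 distinct waiting jobs — which is why `maxWaiting`, not idempotency, caps the pile.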
SPEC.md: deleted from repo root. It was Wintermute's planning artifact
for the original PR, not a shipped spec. Design docs belong under
docs/designs/ per repo convention; leaving one at repo root set a
precedent this repo doesn't want (A7/D11). CHANGELOG + the plan file
at ~/.claude/plans/ok-wintermute-wrote-this-polished-matsumoto.md are
the durable artifacts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed from 444499c to b042157.
The smoke case was setting lock_until in the past, so handleStalled's requeue path fired before handleWallClockTimeouts had a chance to evict. The production scenario is "lock_until still live (worker renewing) + timeout_at disqualified" — only wall-clock matches. A single-connection smoke can't simulate a row lock held by another txn, so we force the equivalent outcome:

- lock_until = now() + 30s → handleStalled skips (not a stall)
- timeout_at = NULL → handleTimeouts skips (needs NOT NULL)
- started_at = now() - 10s, timeout_ms = 1000 → wall-clock matches (2 × timeout_ms = 2000ms threshold exceeded)

Verified: SMOKE PASS — Minions healthy + wedge rescue in 0.14s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
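The forged row and the three sweep predicates can be sketched like this. Field and predicate names are illustrative stand-ins for the row shape and the matching logic of handleStalled / handleTimeouts / handleWallClockTimeouts; the thresholds follow the commit message above:

```typescript
// Illustrative model of the --wedge-rescue forged row state.
interface Row {
  lockUntil: number;        // epoch ms
  timeoutAt: number | null; // epoch ms or NULL
  startedAt: number;        // epoch ms
  timeoutMs: number;
}

const now = Date.now();
const forged: Row = {
  lockUntil: now + 30_000, // still live → handleStalled skips
  timeoutAt: null,         // NULL → handleTimeouts skips
  startedAt: now - 10_000, // active 10s with timeout_ms = 1000
  timeoutMs: 1000,         // 2 × timeout_ms = 2000ms threshold exceeded
};

// Simplified matching conditions for each sweep:
const stalledMatches   = (r: Row, t: number) => r.lockUntil < t;
const timeoutMatches   = (r: Row, t: number) => r.timeoutAt !== null && r.timeoutAt < t;
const wallClockMatches = (r: Row, t: number) => t - r.startedAt > 2 * r.timeoutMs;
```

For the forged row, only `wallClockMatches` is true — the property the smoke case asserts.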
Master shipped its own v0.19.1 (smoke-test skillpack, PR #369) and then v0.20.0 (BrainBench extraction to sibling repo, PR #195) while this branch was in review. Bumping our release from v0.19.1 to v0.20.1 follows the CLAUDE.md rule: "VERSION must be higher than master's."

Resolved conflicts:
- VERSION: 0.19.1 (ours) + 0.20.0 (master) → 0.20.1
- package.json: same bump applied to the version field
- CHANGELOG.md: our queue-resilience entry renamed from v0.19.1 to v0.20.1 (6 inline refs updated across the body: numbers-that-matter table, "To take advantage" block, pre-v0.20.1 code reference, adversarial-review mention, and the v0.19.2 → v0.20.2 deferral reference for composite indexes). The entry stays at the top of the file, followed by master's v0.20.0 (BrainBench) and v0.19.1 (smoke-test skillpack). The sequence is now 0.20.1 → 0.20.0 → 0.19.1 → 0.19.0 → 0.18.2 → ...

Rebuilt binary reports gbrain 0.20.1.

Pre-merge verification carried forward:
- 143 minions + 13 doctor unit tests pass
- typecheck clean
- 189 E2E tests pass against real Postgres 16 + pgvector
- all 3 smoke cases pass (basic, --sigkill-rescue, --wedge-rescue)
- queue_health doctor check fires correctly on a forged stalled-forever job

No source changes — conflict resolution was version-label surgery only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Master shipped v0.20.2 (jobs supervisor, PR #364) while this branch was in review. Bumping our release from v0.20.1 to v0.20.3 follows the CLAUDE.md "VERSION must be higher than master's" rule.

Resolved conflicts:
- VERSION: 0.20.1 (ours) + 0.20.2 (master) → 0.20.3
- package.json: version bumped to match
- CHANGELOG.md: our queue-resilience entry renamed from v0.20.1 to v0.20.3 throughout (5 inline refs updated: numbers-that-matter table, "To take advantage" block, pre-v0.20.3 code reference, adversarial-review mention). The "composite indexes deferred to v0.20.2" follow-up reference updated to v0.20.4 because master already took v0.20.2 for the supervisor feature.

Auto-merged cleanly:
- src/commands/doctor.ts: master added a supervisor health check (filesystem-only, PID liveness + audit tail) alongside our new queue_health check. No conflict — different sections.
- src/commands/jobs.ts: master added a `jobs supervisor` subcommand with start / status / stop variants. No conflict with our --max-waiting CLI wiring or --wedge-rescue smoke case.

New files pulled in from master (supervisor feature):
- src/core/minions/supervisor.ts (MinionSupervisor class)
- src/core/minions/handlers/supervisor-audit.ts (JSONL audit)
- test/supervisor.test.ts (13 cases)
- test/fixtures/supervisor-runner.ts (integration test helper)

Post-merge verification:
- `bun run typecheck` — clean
- `bun test test/minions.test.ts test/doctor.test.ts` — 156 pass
- `bun run build` → gbrain 0.20.3
- CHANGELOG version sequence: 0.20.3 → 0.20.2 → 0.20.0 → 0.19.1 → ...

No source changes — conflict resolution was version-label surgery only. The queue-resilience code, tests, and docs landed from master cleanly alongside the new supervisor feature.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two CI failure clusters, both pre-existing but surfaced by the v0.20.3 merge:

1. test/minions-shell.test.ts — 12 failing cases. The shell handler throws UnrecoverableError when GBRAIN_ALLOW_SHELL_JOBS !== '1' (the production RCE guard at shell.ts:210). The unit tests exercise handler mechanics, not the guard, but never set the env var — so every invocation exits through the guard path instead of the code being tested. Fix: set GBRAIN_ALLOW_SHELL_JOBS=1 in beforeAll and restore it in afterAll. The env guard IS still tested separately via the test/minions.test.ts case added in v0.20.3 Lane A, which toggles the var itself.

2. llms-full.txt — stale against CLAUDE.md. Key-files entries for queue.ts, doctor.ts, and the new backpressure-audit.ts updated in v0.20.3 Lane B triggered the build-llms drift guard. Regenerated via `bun run build:llms`; no behavior change, just the inlined-docs bundle catching up to source.

Full test run: 2367 pass, 0 fail across 137 files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
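The fix for cluster 1 is the standard save-and-restore pattern; in the real suite the two helpers below would run inside bun:test's beforeAll/afterAll. The helper names (`setupShellEnv`, `teardownShellEnv`) are illustrative:

```typescript
// Illustrative env save/restore for the shell-handler unit tests.
// Setting the var lets the handler-mechanics tests get past the
// GBRAIN_ALLOW_SHELL_JOBS guard; restoring it afterwards keeps the
// guard-toggling test (and the rest of the suite) unaffected.
let saved: string | undefined;

function setupShellEnv(): void {
  saved = process.env.GBRAIN_ALLOW_SHELL_JOBS;
  process.env.GBRAIN_ALLOW_SHELL_JOBS = "1"; // bypass the RCE guard path
}

function teardownShellEnv(): void {
  if (saved === undefined) delete process.env.GBRAIN_ALLOW_SHELL_JOBS;
  else process.env.GBRAIN_ALLOW_SHELL_JOBS = saved;
}
```

Restoring (rather than unconditionally deleting) matters because other tests in the run may depend on whatever value the variable had before this file's hooks ran.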
garrytan added a commit that referenced this pull request on Apr 24, 2026:
Pulls upstream v0.20.3 (#379): queue resilience — wall-clock timeouts, backpressure, --no-worker, env concurrency.

Conflicts resolved:
- VERSION — kept 0.21.0; upstream is 0.20.3
- package.json — v0.21.0 wins
- CHANGELOG.md — v0.21.0 preserved above upstream's v0.20.3

Build clean: the 0.21.0 binary runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ChenyqThu pushed a commit to ChenyqThu/jarvis-knowledge-os-v2 that referenced this pull request on Apr 27, 2026:
Merge upstream/master (commit 11abb24, gbrain v0.20.4) into KOS v2 fork. Six upstream commits land:

- v0.19.0 check-resolvable OpenClaw fallback (garrytan#326)
- v0.19.1 smoke-test skillpack (garrytan#369)
- v0.20.0 BrainBench extracted to sibling repo (garrytan#195)
- v0.20.2 jobs supervisor (garrytan#364) — Postgres-only, PGLite skips
- v0.20.3 queue resilience + queue_health doctor (garrytan#379) — Postgres-only
- v0.20.4 minion-orchestrator skill consolidation (garrytan#381)

Conflicts resolved (2 real, 5 auto):
- .gitignore: union of both fork (.omc/, kos-jarvis log globs) and upstream (eval/data/world-v1/world.html, amara-life-v1 cache) entries.
- skills/manifest.json: append upstream's smoke-test skill plus retain the 9 kos-jarvis fork skills (39 total).
- CLAUDE.md / README.md / package.json (0.20.4) / skills/RESOLVER.md / src/cli.ts (mode 0755) auto-merged cleanly.

Fork-local patches preserved (verified post-merge):
- src/core/pglite-schema.ts:65 — idx_pages_source_id commented out (upstream garrytan#370 still open, fix retained).
- src/core/pglite-engine.ts:87 — pg_switch_wal() before close() (WAL durability patch, no upstream issue filed yet).
- src/cli.ts mode 100755 — bun shim executable bit.

Issue garrytan#332 (v0_13_0 process.execPath) fixed upstream in v0.19.0 ... running gbrain apply-migrations --yes will clear the partial-ledger remainder that has been stuck in doctor since the v0.13 sync.

v0.20's headline features (jobs supervisor, queue_health, wedge-rescue, backpressure-audit) are Postgres-only and skip on our PGLite engine. The sync is preventive ... it keeps the fork mergeable rather than buying new runtime capability.

Pre-merge baseline (HEAD 170876f):
- pages 1988, chunks 3750 (100% embedded), links 8522, timeline 10881
- doctor health 60/100 (failed: minions_migration partial 0.13.0)
- brain_score 86/100

Rollback: git tag pre-sync-v0.20-1777105378
PGLite snapshot: ~/.gbrain/brain.pglite.pre-sync-v0.20-1777105391 (416M)
Problem

In production (OpenClaw + gbrain), the Minions job queue experienced full blockage:

- An `autopilot-cycle` job stalled (git `index.lock` held the process).
- The stalled job still held its DB connection, so the `FOR UPDATE SKIP LOCKED` stall detection skipped its row — the stalled job was un-evictable.

Changes

1. Wall-clock timeout sweep (`queue.ts` + `worker.ts`)
   Dead-letters active jobs exceeding 2× `timeout_ms` regardless of lock state. Catches jobs stuck while holding DB connections that `FOR UPDATE SKIP LOCKED` stall detection skips.

2. Submission backpressure — `maxWaiting` (`queue.ts` + `types.ts`)
   New `maxWaiting` option on job submission. Caps waiting jobs per name — prevents autopilot-cycle flood when the queue is blocked.

3. `--no-worker` flag (`autopilot.ts`)
   Skips spawning the built-in worker child. For environments where the worker lifecycle is managed externally (systemd, Docker, OpenClaw service-manager).

4. `GBRAIN_WORKER_CONCURRENCY` env var (`jobs.ts`)
   Fallback for `--concurrency` when the worker is spawned by autopilot (which cannot pass CLI flags to the child).

5. Shell job env guard with clear logging (`shell.ts` + `jobs.ts`)
   The shell handler is always registered but throws `UnrecoverableError` with a clear message when `GBRAIN_ALLOW_SHELL_JOBS=1` is not set. Previously, the handler was simply not registered — jobs would sit in `waiting` with no explanation.

Testing

- `bun run typecheck` passes clean
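The flag/env fallback in item 4, combined with the clamp added in the follow-up review (D3/H3), could look roughly like this. `resolveWorkerConcurrency` is the name used in the review notes, but this body is an illustrative sketch, not the shipped `jobs.ts` code:

```typescript
// Resolve worker concurrency from --concurrency (flag) or the
// GBRAIN_WORKER_CONCURRENCY env var, clamping NaN / 0 / negative to 1
// with a loud warning. Without the clamp, `inFlight.size < NaN` is
// always false, so a single-typo env var silently wedges the worker.
function resolveWorkerConcurrency(
  flag?: string,
  env: Record<string, string | undefined> = process.env,
): number {
  const raw = flag ?? env.GBRAIN_WORKER_CONCURRENCY ?? "1";
  const n = parseInt(raw, 10);
  if (!Number.isFinite(n) || n < 1) {
    // Name the bad value so the operator can find the typo.
    console.error(`[worker] invalid concurrency ${JSON.stringify(raw)}; clamping to 1`);
    return 1;
  }
  return n;
}
```

The env var only matters when autopilot spawns the worker child (no way to pass CLI flags); a flag value, when present, always wins.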