v0.20.2 feat: gbrain jobs supervisor — self-healing worker process manager by garrytan · Pull Request #364 · garrytan/gbrain

garrytan · 2026-04-23T15:49:23Z

Summary

v0.20.2 ships gbrain jobs supervisor as a first-class CLI with the daemon-manager surface an agent can actually drive (start/status/stop + --json + audit log), the safety fixes the four-review pass (CEO + DX + Eng + Codex adversarial) surfaced, and the documentation flip that makes supervisor the canonical answer in docs/guides/minions-deployment.md.

Built by OpenClaw (original feature, b1bfabd). Reviewed + hardened + shipped via /autoplan followed by a 20-item multi-lane implementation plan (all items user-approved at both high-level and code-level).

Lanes shipped

| Lane | Commit | What |
| --- | --- | --- |
| A | `dc637e8` | Atomic PID lock (`openSync 'wx'`), queue-scoped health query, env-var safety, unified `shutdown(reason, exitCode)`, crashCount reset, Postgres-only docstring, inBackoff flag, listener ref tracking, FILTER query consolidation, child.on('error'), healthInFlight guard (14 safety + correctness fixes) |
| B | `b377e2c` | Deployment guide rewritten around supervisor, legacy watchdog deleted, README Operations paragraph, systemd/Procfile/fly.toml snippets now call `gbrain jobs supervisor`, docstring + exit-code table in `--help` |
| C | `e62ae46` | `gbrain jobs supervisor {start --detach / status / stop} [--json]` subcommands + `src/core/minions/handlers/supervisor-audit.ts` (JSONL audit with ISO-week rotation, `readSupervisorEvents()` helper) |
| D | `cafde77` | `gbrain doctor` supervisor check: reads PID file + audit log, reports `supervisor_running` / `last_start` / `crashes_24h` / `max_crashes_exceeded` with ok/warn/fail thresholds |
| E | `861b968` | 4 critical integration tests via real `spawn()` + shell-script fakes: crash-restart, max-crashes-via-shutdown, SIGTERM-during-backoff clean exit, env-inheritance regression (the security fix guardrail) |
| F | `0a54dda` | VERSION 0.19.0 → 0.20.2, CHANGELOG entry in GStack voice with release summary + "numbers that matter" table + "To take advantage of v0.20.2" block |

Test Coverage

Before: 7 tests (calculateBackoffMs + PID helpers only, ~15% behavioral coverage)
After: 13 tests — 6 unit tests + 4 critical integration tests + 2 audit-format tests + 1 crash-count regression

All 146 tests in supervisor + minions + doctor suites pass in ~8s.

Integration tests use test/fixtures/supervisor-runner.ts to spawn real MinionSupervisor instances in subprocesses (because start() calls process.exit on lifecycle end — can't run in the test runner's own process). Worker is a shell-script fake that emits precise exit codes / env output.
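The env-inheritance regression among those integration tests guards a rule that can be sketched roughly as below. This is a minimal illustration, not the supervisor's actual code: the `buildWorkerEnv` name, the `Env` type, and the `"1"` value written on opt-in are assumptions; only the `GBRAIN_ALLOW_SHELL_JOBS` variable and the delete-when-false behavior come from this PR.

```typescript
// Sketch only: buildWorkerEnv is a hypothetical helper, not the real
// MinionSupervisor implementation.
type Env = Record<string, string | undefined>;

function buildWorkerEnv(parentEnv: Env, allowShellJobs: boolean): Env {
  // Copy first so the parent's environment is never mutated.
  const env: Env = { ...parentEnv };
  if (allowShellJobs) {
    // Explicit opt-in (the "1" value is an assumption).
    env["GBRAIN_ALLOW_SHELL_JOBS"] = "1";
  } else {
    // Strip any value inherited from the parent shell.
    delete env["GBRAIN_ALLOW_SHELL_JOBS"];
  }
  return env;
}

const parent: Env = { PATH: "/usr/bin", GBRAIN_ALLOW_SHELL_JOBS: "1" };
console.log(buildWorkerEnv(parent, false)["GBRAIN_ALLOW_SHELL_JOBS"] ?? "UNSET"); // UNSET
console.log(buildWorkerEnv(parent, true)["GBRAIN_ALLOW_SHELL_JOBS"] ?? "UNSET"); // 1
```

Copying the environment before editing keeps the supervisor's own process env intact, so the decision is made per spawn rather than once globally.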

Pre-Landing Review

Four formal reviews completed via /autoplan before any code changed:

CEO Review (/plan-ceo-review) — strategy, DRY, MECE
DX Review (/plan-devex-review — agent persona) — TTHW analysis for OpenClaw
Eng Review (/plan-eng-review) — 10 issues, 22 test gaps, 3 silent failure modes
Codex adversarial (/codex consult) — caught 3 P0 bugs the other three missed (racy PID lock, dead stalled query, unscoped queue filter) and refuted 3 overstated review claims

Post-merge cleanup: autopilot.ts migration to MinionSupervisor is deferred to a follow-up PR (codex identified that the current start() API blocks and can't drop into autopilot's interval loop — needs a non-blocking API redesign, not a DRY cleanup).

Design Review

No frontend files changed — design review skipped.

Eval Results

No prompt-related files changed — evals skipped.

Greptile Review

Will run on push; any findings addressed in follow-up commits.

Plan Completion

All 20 planned items shipped. Full plan at ~/.claude/plans/claw-made-this-so-temporal-flamingo.md (local only).

10 primary blockers (all Option A "complete"): atomic PID lock, stalled query fix, queue scoping, env-inheritance + regression, unified shutdown, PID path relocation, docs, daemon-manager CLI, tests, release hygiene.
10 follow-ups (all Option A/B per question): crashCount reset, Postgres-only docstring, doctor integration, inBackoff flag, listener ref tracking, FILTER query consolidation, .ts path guard + --cli-path flag, child.on('error'), healthInFlight guard, minion-watchdog.sh deletion.

Explicitly deferred: autopilot.ts migration (requires non-blocking start() API redesign — tracked for follow-up PR).

TODOS

No TODOS.md in repo root — skipped.

Documentation

README.md: new paragraph under Minions section pointing to gbrain jobs supervisor as canonical.
docs/guides/minions-deployment.md: rewritten around the supervisor as primary; which-supervisor-when table; agent usage 3-command pattern; upgrading-from-watchdog migration block.
docs/guides/minions-deployment-snippets/{systemd.service,Procfile,fly.toml.partial}: now invoke gbrain jobs supervisor.
docs/guides/minions-deployment-snippets/minion-watchdog.sh: deleted.

Test plan

bun test test/supervisor.test.ts test/minions.test.ts test/doctor.test.ts (146 pass / 0 fail)
Post-merge: run gbrain jobs supervisor start --detach --json against a live Postgres brain, verify audit file at ~/.gbrain/audit/supervisor-YYYY-Www.jsonl
Post-merge: run gbrain doctor and verify supervisor check appears
Post-merge: run gbrain jobs supervisor status --json on a running supervisor and parse the JSON
Post-merge: SIGTERM a running supervisor and verify clean exit via gbrain jobs supervisor stop
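For the audit-file verification steps, a small script along these lines could summarize a week's JSONL. It is a hedged sketch: the `ts` timestamp field, the choice to count `worker_exited` events as crashes, and both function names are assumptions made for illustration; the event names and the `crashes_24h` / `last_start` / `max_crashes_exceeded` fields are taken from this PR.

```typescript
// Sketch only: field name `ts` and the crash-counting rule are assumptions.
interface SupervisorEvent { event: string; ts: string; }
interface Summary {
  crashes_24h: number;
  last_start: string | null;
  max_crashes_exceeded: boolean;
}

function parseAuditLines(jsonl: string, sinceMs: number, nowMs: number = Date.now()): SupervisorEvent[] {
  const cutoff = nowMs - sinceMs;
  const events: SupervisorEvent[] = [];
  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue;
    try {
      const rec = JSON.parse(line) as Record<string, unknown>;
      const { event, ts } = rec;
      if (typeof event !== "string" || typeof ts !== "string") continue;
      const t = Date.parse(ts);
      if (!Number.isNaN(t) && t >= cutoff) events.push({ event, ts });
    } catch {
      // Malformed or truncated line: skip it silently.
    }
  }
  return events;
}

function summarize(events: SupervisorEvent[]): Summary {
  const starts = events.filter((e) => e.event === "started");
  return {
    crashes_24h: events.filter((e) => e.event === "worker_exited").length,
    last_start: starts.length > 0 ? starts[starts.length - 1].ts : null,
    max_crashes_exceeded: events.some((e) => e.event === "max_crashes_exceeded"),
  };
}

const sampleNow = Date.parse("2026-04-24T12:00:00Z");
const sampleLog = [
  '{"event":"started","ts":"2026-04-24T10:00:00Z"}',
  "not json",
  '{"event":"worker_exited","ts":"2026-04-24T11:00:00Z"}',
  '{"event":"worker_exited","ts":"2026-04-20T11:00:00Z"}',
].join("\n");
console.log(summarize(parseAuditLines(sampleLog, 24 * 60 * 60 * 1000, sampleNow)));
```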

🤖 Generated with Claude Code

Adds a first-class supervisor command that:
- Spawns `gbrain jobs work` as a child process
- Restarts on crash with exponential backoff (1s→60s cap)
- Resets crash counter after 5min of stable operation
- PID file locking prevents duplicate supervisors
- Periodic health checks (stalled jobs, completion gaps)
- Graceful shutdown (SIGTERM→35s→SIGKILL)

Usage: gbrain jobs supervisor --concurrency 4

Replaces ad-hoc nohup patterns in bootstrap scripts. The autopilot command's internal supervisor can be migrated to use this in a follow-up.

Tests: 7 pass (backoff calc, PID management, crash tracking)
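The 1s→60s exponential backoff can be illustrated with a short sketch. The doubling factor and the zero-based attempt index are assumptions (they are consistent with the `calculateBackoffMs(-1) = 500ms` remark elsewhere in this PR, but the real function in src/core/minions/supervisor.ts may differ).

```typescript
// Sketch only: an assumed reconstruction of the backoff curve, not the shipped code.
const BASE_MS = 1_000; // first retry waits 1s
const CAP_MS = 60_000; // never wait longer than 60s

function calculateBackoffMs(attempt: number): number {
  // Double the delay for each consecutive crash, then clamp to the cap.
  return Math.min(BASE_MS * 2 ** attempt, CAP_MS);
}

const schedule = [0, 1, 2, 3, 4, 5, 6, 7].map(calculateBackoffMs);
console.log(schedule); // [1000, 2000, 4000, 8000, 16000, 32000, 60000, 60000]
```

Under these assumptions the cap is reached on the seventh consecutive crash, after which every retry waits the full 60s.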

… exit

Lane A of PR #364 review fixes (20-item multi-lane plan). Addresses the codex-tier + CEO + Eng findings on src/core/minions/supervisor.ts:

Safety + correctness:
- Atomic O_CREAT|O_EXCL PID lock via openSync('wx') with stale-file liveness check. Prevents two supervisors racing on the same PID file. (codex #1)
- Health check now queries status='active' AND lock_until < now(), matching queue.ts:848's authoritative stalled definition. The prior `status = 'stalled'` predicate returned zero rows forever because 'stalled' is not a persisted value in the schema. (codex #2)
- All health queries scoped to WHERE queue = $1 via opts.queue binding. Multi-queue installs no longer see cross-queue false positives. (codex #3)
- Class default allowShellJobs flipped true→false AND explicit `delete env.GBRAIN_ALLOW_SHELL_JOBS` when false, so child workers don't silently inherit the var from the parent shell. (eng #8, codex #9)
- Unified shutdown(reason, exitCode): max-crashes now routes through the same drain path as SIGTERM. Single source of truth for lifecycle cleanup; prerequisite for trustworthy audit events (Lane C). (eng #1)
- Default PID path moves from /tmp to ~/.gbrain/supervisor.pid with mkdirSync recursive + GBRAIN_SUPERVISOR_PID_FILE env override. Matches the rest of the product's ~/.gbrain/ convention; fresh installs no longer hit ENOENT. (CEO #2 + codex #6)

Refinements:
- crashCount = 1 after 5-min stable-run reset (was 0, produced calculateBackoffMs(-1) = 500ms by accident). Now reads as 'first crash of a new cycle' with a clean 1s backoff. (Nit 1)
- Top-of-file POSTGRES-ONLY docstring documenting why the supervisor can't run against PGLite. (Nit 2)
- inBackoff flag suppresses 'worker not alive' warn during the expected null-child window (crash → sleep → next spawn). (eng #2)
- Tracked listener refs for SIGTERM/SIGINT removed in shutdown() so integration tests spinning up/tearing down multiple supervisors on one process don't leak handlers. (eng #3)
- Single FILTER query replaces two SELECT counts: one round-trip instead of two, three metrics in one pass. (eng #10)
- child.on('error') listener emits worker_spawn_failed event for ENOENT/EACCES; exit handler still increments crashCount as usual so max-crashes bounds permanent misconfigurations. (codex #7)
- healthInFlight boolean guard with try/finally prevents overlapping health checks from stacking on a hung DB. (codex #8)

Documented exit codes (ExitCodes const): 0 CLEAN, 1 MAX_CRASHES, 2 LOCK_HELD, 3 PID_UNWRITABLE. Agent can branch on exit=2 ('another supervisor, I'm fine') vs exit=1 ('escalate to human').

Event emitter surface:
- started / worker_spawned / worker_exited / worker_spawn_failed
- backoff / health_warn / health_error / max_crashes_exceeded
- shutting_down / stopped

Plumbed through emit() with an onEvent callback hook for Lane C's audit writer. json:false is the default; Lane C's --json mode flips it and writes JSONL to stderr.

CLI changes (src/commands/jobs.ts):
- `gbrain jobs supervisor` gains --allow-shell-jobs (explicit opt-in mirroring the env-var gate), --cli-path (override auto-resolution for exotic setups), and --json (JSONL lifecycle events on stderr).
- Expanded --help body with description, 3 examples, and exit-code table. (DX Fix A per review)
- Three-tier PID path resolution: --pid-file > GBRAIN_SUPERVISOR_PID_FILE > ~/.gbrain/supervisor.pid (via exported DEFAULT_PID_FILE).
- Removed the catch-fallback to process.argv[1]. resolveGbrainCliPath() throws its own actionable install-hint error, which is what dev users need instead of a cryptic spawn failure on a .ts path. (codex #5)

Tests: existing 7 supervisor.test.ts cases continue to pass. Integration tests (crash-restart, max-crashes, SIGTERM-during-backoff, env-inheritance regression) land in Lane E.

Out of scope for this lane (tracked in follow-up lanes):
- Audit file writer at ~/.gbrain/audit/supervisor-YYYY-Www.jsonl (Lane C)
- Documentation pass (Lane B)
- supervisor start/status/stop subcommands (Lane C)
- gbrain doctor supervisor check (Lane D)
- /ship release hygiene (Lane F)
- autopilot.ts migration to MinionSupervisor (deferred to follow-up PR per codex; requires non-blocking start() API redesign, not ~30 lines)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
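The atomic PID lock from this lane can be sketched as follows. This is an illustration under stated assumptions, not the code in supervisor.ts: `acquirePidLock` and `isAlive` are hypothetical names, and the stale-file handling shown here (unlink, then retry once) is one plausible reading of "stale-file liveness check". The `openSync(path, 'wx')` call is the standard Node way to get O_CREAT|O_EXCL semantics.

```typescript
import { closeSync, mkdirSync, mkdtempSync, openSync, readFileSync, unlinkSync, writeSync } from "node:fs";
import { tmpdir } from "node:os";
import { dirname, join } from "node:path";

// Signal 0 performs an existence check without delivering a signal.
function isAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM: the process exists but belongs to another user.
    return (err as { code?: string }).code === "EPERM";
  }
}

// Hypothetical helper: returns true if this process now holds the lock.
function acquirePidLock(pidFile: string): boolean {
  mkdirSync(dirname(pidFile), { recursive: true });
  for (let attempt = 0; attempt < 2; attempt++) {
    try {
      // 'wx' fails with EEXIST if the file already exists, so creation is atomic.
      const fd = openSync(pidFile, "wx");
      try {
        writeSync(fd, String(process.pid));
      } finally {
        closeSync(fd);
      }
      return true;
    } catch (err) {
      if ((err as { code?: string }).code !== "EEXIST") throw err;
      const holder = Number.parseInt(readFileSync(pidFile, "utf8").trim(), 10);
      if (Number.isInteger(holder) && holder > 0 && isAlive(holder)) return false;
      unlinkSync(pidFile); // stale file from a dead process: remove and retry once
    }
  }
  return false;
}

const demoPidFile = join(mkdtempSync(join(tmpdir(), "pidlock-")), "supervisor.pid");
const firstAttempt = acquirePidLock(demoPidFile);
const secondAttempt = acquirePidLock(demoPidFile);
console.log(firstAttempt, secondAttempt); // true false
```

The second call fails because the file names a live process (this one). Note that removing a stale file is itself not atomic, so a production version would need to think about two supervisors cleaning up the same stale file at once.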

Lane B of PR #364 review fixes. Reframes docs/guides/minions-deployment.md around `gbrain jobs supervisor` as the default answer (blocker 7), deletes the 68-line legacy bash watchdog (F10), and updates README + deployment snippets to match.

docs/guides/minions-deployment.md:
- New 'Worker supervision' section at the top with the canonical 3-command agent pattern (start --detach / status --json / stop) and a documented exit-code table (0 clean, 1 max-crashes, 2 lock-held, 3 PID-unwritable).
- 'Which supervisor when?' decision table: container = supervisor as PID 1, Linux VM = systemd-over-supervisor, dev laptop = bare terminal.
- New 'Agent usage' section for OpenClaw / Hermes / Cursor / Codex: the 3-turn discover-start-maintain workflow that replaces shell archaeology with machine-parseable JSON events + an audit file at ~/.gbrain/audit/supervisor-YYYY-Www.jsonl.
- Demoted the 'Option 1: watchdog cron' path entirely; replaced with a straightforward upgrade migration block (stop script, remove cron line, start supervisor, verify via doctor).
- Preconditions now check Postgres connectivity directly (supervisor is Postgres-only; the CLI rejects PGLite with a clear error).

Snippets:
- systemd.service: ExecStart now invokes `gbrain jobs supervisor` instead of raw `gbrain jobs work`. Two-layer supervision (systemd → supervisor → worker) buys automatic restart on reboot plus fast crash recovery. ReadWritePaths expanded to cover $HOME/.gbrain (supervisor PID + audit).
- Procfile + fly.toml.partial: same change. The platform restarts the container on host events, supervisor restarts the worker on crashes.
- minion-watchdog.sh: deleted (git history retains it for anyone in an exotic deployment). Supervisor subsumes every capability it had plus atomic PID locking, structured audit events, queue-scoped health checks, and graceful drain on SIGTERM.

README.md:
- Added a paragraph under the Minions section pointing to `gbrain jobs supervisor` as canonical, noting the --detach / status / stop surface and the audit file path, with a link to the full deployment guide. Kept `gbrain jobs work` documented for direct raw invocation but flagged 'prefer supervisor' for any long-running use.

The supervisor `--help` body itself (3 examples + exit-code table in src/commands/jobs.ts) landed with Lane A; this lane finishes the discoverability story by making the supervisor findable via doc grep, README landing, and deployment-guide landing paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
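An agent consuming the documented exit-code table might branch along these lines. The numeric codes and their names come from this PR; the `AgentAction` labels and the `actionForExit` helper are hypothetical, chosen only to illustrate the "another supervisor, I'm fine" vs "escalate to human" distinction.

```typescript
// Codes as documented in this PR; the object here is a local copy for illustration.
const ExitCodes = { CLEAN: 0, MAX_CRASHES: 1, LOCK_HELD: 2, PID_UNWRITABLE: 3 } as const;

// Hypothetical action labels for an agent driving the CLI.
type AgentAction = "done" | "escalate_to_human" | "already_running" | "fix_pid_path" | "unknown";

function actionForExit(code: number): AgentAction {
  switch (code) {
    case ExitCodes.CLEAN:
      return "done";
    case ExitCodes.MAX_CRASHES:
      return "escalate_to_human"; // supervisor gave up on the worker
    case ExitCodes.LOCK_HELD:
      return "already_running"; // another supervisor holds the PID lock
    case ExitCodes.PID_UNWRITABLE:
      return "fix_pid_path"; // e.g. retry with a writable PID file location
    default:
      return "unknown";
  }
}

console.log(actionForExit(2)); // already_running
```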

Lane C of PR #364 review fixes. Adds the daemon-manager CLI surface so agents can drive `gbrain jobs supervisor` in 3 turns instead of 10, and the audit writer that makes lifecycle events inspectable across process restarts. (Blocker 8, closes DX Fix A/B/C.)

New: src/core/minions/handlers/supervisor-audit.ts
- writeSupervisorEvent(emission, supervisorPid) appends JSONL to `${GBRAIN_AUDIT_DIR:-~/.gbrain/audit}/supervisor-YYYY-Www.jsonl`. ISO-week rotation via a `computeSupervisorAuditFilename()` helper that mirrors `shell-audit.ts` exactly (year-boundary ISO week math, Thursday anchor, etc).
- readSupervisorEvents({sinceMs}) returns parsed events from the current week's file, oldest-first, for Lane D's doctor check. Malformed lines are skipped silently (disk-full truncation is already best-effort at write time).
- Reuses `resolveAuditDir()` from shell-audit.ts so the `GBRAIN_AUDIT_DIR` env var override works identically across all gbrain audit trails.

src/commands/jobs.ts: supervisor subcommand dispatcher
- `gbrain jobs supervisor [start] [--detach] [--json] ...`: default subcommand. Without --detach, runs foreground as before. With --detach, forks a background child (inheriting stderr so the caller can still tail JSONL events), writes a stdout payload: {"event":"started","supervisor_pid":N,"pid_file":"...","detached":true} and exits 0. Stdin/stdout on the detached child are /dev/null so the parent shell isn't held open.
- `gbrain jobs supervisor status [--json]`: reads the PID file, checks liveness via `kill -0`, then reads the last 24h from the supervisor audit file to compute crashes_24h / last_start / max_crashes_exceeded. Exits 0 if running, 1 if not. JSON output is machine-parseable; human output is a 5-line ASCII report.
- `gbrain jobs supervisor stop [--json]`: reads PID, sends SIGTERM, polls `kill -0` every 250ms for up to 40s (supervisor's own 35s worker-drain + 5s slack). Reports outcome: drained / timeout_40s / pid_file_missing / pid_file_corrupt / process_gone. Exit 0 on clean stop.
- `--json` flag is already plumbed through to the supervisor opts from Lane A. This lane adds the onEvent audit-writer callback so every supervisor emission (started, worker_spawned, worker_exited, worker_spawn_failed, backoff, health_warn, health_error, max_crashes_exceeded, shutting_down, stopped) lands in the JSONL file with the supervisor's PID attached.

--help body updated:
- Three separate usage lines (start / status / stop).
- SUBCOMMANDS block with one-line summaries each.
- EXIT CODES block (unchanged from Lane A, moved under SUBCOMMANDS).
- EXAMPLES block updated with status --json + stop + --detach forms.

Tests: existing 127 supervisor + minions tests continue to pass. Integration tests for the new subcommands + audit writer land with Lane E.

Follow-up (Lane D): `gbrain doctor` will read readSupervisorEvents() from this module to surface a `supervisor` health check alongside its existing checks (DB connectivity, schema version, queue health).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
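The ISO-week rotation with a Thursday anchor can be sketched as below. The function name matches the helper named in this lane, but the body is an assumed reconstruction; in particular, computing the week in UTC is an assumption, and the real helper in supervisor-audit.ts may treat time zones differently.

```typescript
// Sketch only: standard ISO-8601 week math, computed in UTC (an assumption).
function computeSupervisorAuditFilename(now: Date = new Date()): string {
  // Normalize to midnight UTC of the same calendar day.
  const d = new Date(Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate()));
  // ISO weekday: Monday = 1 ... Sunday = 7.
  const isoDay = d.getUTCDay() || 7;
  // Move to the Thursday of this ISO week; its year is the ISO week-year.
  d.setUTCDate(d.getUTCDate() + 4 - isoDay);
  const isoYear = d.getUTCFullYear();
  const yearStart = Date.UTC(isoYear, 0, 1);
  // Day-of-year of that Thursday, divided into 7-day buckets.
  const week = Math.ceil(((d.getTime() - yearStart) / 86_400_000 + 1) / 7);
  return `supervisor-${isoYear}-W${String(week).padStart(2, "0")}.jsonl`;
}

console.log(computeSupervisorAuditFilename(new Date("2027-01-01T00:00:00Z"))); // supervisor-2026-W53.jsonl
console.log(computeSupervisorAuditFilename(new Date("2026-04-23T15:49:23Z"))); // supervisor-2026-W17.jsonl
```

The first call shows the year-boundary case: 1 January 2027 is a Friday, so its week's Thursday falls in 2026 and the file is named for 2026-W53.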

Lane D of PR #364 review fixes. Closes the observability loop: now that Lane C writes supervisor lifecycle events to `${GBRAIN_AUDIT_DIR:-~/.gbrain/audit}/supervisor-YYYY-Www.jsonl`, `gbrain doctor` surfaces a `supervisor` check alongside its existing health indicators.

Implementation (src/commands/doctor.ts, filesystem-only block 3b-bis):
- Resolves DEFAULT_PID_FILE via the same three-tier logic as the start path (--pid-file > GBRAIN_SUPERVISOR_PID_FILE > ~/.gbrain/supervisor.pid).
- Reads the PID file + `kill -0 <pid>` for liveness.
- Calls readSupervisorEvents({sinceMs: 24h}) from the audit module to derive last_start / crashes_24h / max_crashes_exceeded.
- Suppresses the check entirely when the user has never invoked the supervisor (no PID file AND no audit events), which avoids noise on installs that don't use the feature.

Status thresholds:
- fail: max_crashes_exceeded event seen in last 24h (supervisor gave up; operator needs to restart or triage)
- warn: supervisor not running but audit shows prior use (unexpected stop, likely crash or manual kill)
- warn: running but > 3 crashes in last 24h (supervisor recovering but worker is unstable)
- ok: running + ≤ 3 crashes + no max_crashes event

All failure paths emit a paste-ready recovery command. Read/import errors are swallowed (best-effort like the other doctor checks).

Tests: all 127 supervisor + minions tests still green; 13 existing doctor tests unaffected.

F3 done. All four lanes A/B/C/D are now committed; Lane E (integration tests) and Lane F (/ship v0.20.2) remain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
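The status thresholds above reduce to a small decision function. The sketch below is illustrative: the `SupervisorSnapshot` shape and `classifySupervisor` name are hypothetical, and evaluating the rules in the listed order (suppression first, then fail, then the two warn cases) is an assumption about precedence.

```typescript
type SupervisorStatus = "ok" | "warn" | "fail";

// Hypothetical input shape, derived from the PID file and the last 24h of audit events.
interface SupervisorSnapshot {
  hasPidFile: boolean;
  hasAuditEvents: boolean;
  running: boolean;
  crashes24h: number;
  maxCrashesExceeded: boolean;
}

// Returns null when the check is suppressed (supervisor never used).
function classifySupervisor(s: SupervisorSnapshot): SupervisorStatus | null {
  if (!s.hasPidFile && !s.hasAuditEvents) return null;
  if (s.maxCrashesExceeded) return "fail"; // supervisor gave up
  if (!s.running) return "warn"; // prior use but not running now
  if (s.crashes24h > 3) return "warn"; // running, but the worker is unstable
  return "ok";
}

const healthy: SupervisorSnapshot = {
  hasPidFile: true,
  hasAuditEvents: true,
  running: true,
  crashes24h: 3,
  maxCrashesExceeded: false,
};
console.log(classifySupervisor(healthy)); // ok
console.log(classifySupervisor({ ...healthy, crashes24h: 4 })); // warn
```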

Lane E of PR #364 review fixes (blocker 10). Fills the ~15% coverage gap flagged in the eng review by actually exercising the code paths that will break in production (crash-restart loop, max-crashes exit, SIGTERM-during-backoff, env-var inheritance) via real spawn() calls against fake shell-script workers. No mocks: real fork, real signals, real env propagation, real audit file writes.

test/fixtures/supervisor-runner.ts (new, 55 lines): A standalone bun script that constructs a MinionSupervisor from env vars (SUP_PID_FILE / SUP_CLI_PATH / SUP_MAX_CRASHES / SUP_BACKOFF_FLOOR_MS / SUP_HEALTH_INTERVAL_MS / SUP_ALLOW_SHELL_JOBS / SUP_AUDIT_DIR) and calls start(). Mock engine returns empty rows for executeRaw (health check path still exercised without Postgres). Tests spawn this as a subprocess because MinionSupervisor.start() calls process.exit() on shutdown, so it can't run in the test runner's own process.

test/supervisor.test.ts (existing; 91 → 300 lines):
- Added IntegrationHarness helper: creates a unique tmpdir per test, a fake worker shell script, a PID-file path, and an audit-dir path; cleanup runs in finally.
- spawnSupervisor() forks bun on the runner with env vars set.
- readAudit() reads the supervisor-YYYY-Www.jsonl file via the existing readSupervisorEvents() helper (Lane C), threading GBRAIN_AUDIT_DIR through so tests don't collide on ~/.gbrain.
- waitFor(pred, timeoutMs) polls helper for event-driven tests.

Four integration tests (with _backoffFloorMs=5 for <1s suite runs):

1. "respawns the worker after a crash and eventually exits with max-crashes code=1"
   Worker always `exit 1`. maxCrashes=3. Asserts: exit code 1, PID file cleaned up, audit contains started + 3x worker_spawned + 3x worker_exited + max_crashes_exceeded + shutting_down + stopped, and the stopped event carries {reason:'max_crashes', exit_code:1}. Locks in blockers 1 (PID lock), 2+3+6 (health SQL doesn't 500), 5 (unified shutdown emits right events), F8 (spawn errors counted).

2. "receives SIGTERM while sleeping between crashes and exits 0 cleanly"
   Worker always `exit 1`, backoff floor 800ms to catch the sleep. Asserts: SIGTERM during backoff → exit code 0 (not 1) in <5s, no signal kill (process.exit via shutdown), audit contains shutting_down {reason:'SIGTERM'} + stopped, PID file cleaned up. Locks in eng Issue 1 (unified exit path), eng Issue 3 (signal handlers don't accumulate across shutdowns).

3. "strips inherited GBRAIN_ALLOW_SHELL_JOBS when allowShellJobs=false, even if parent has it set" ⚠ CRITICAL regression test
   Parent env has GBRAIN_ALLOW_SHELL_JOBS=1. SUP_ALLOW_SHELL_JOBS=0. Worker writes $GBRAIN_ALLOW_SHELL_JOBS (or 'UNSET' if absent) to an OUT_FILE. Asserts child sees 'UNSET'. Locks in codex #9 + eng #8: the `else delete env.GBRAIN_ALLOW_SHELL_JOBS` branch from Lane A is load-bearing for the supervisor's security posture; this test prevents a future refactor silently re-opening the inheritance hole.

4. "DOES pass GBRAIN_ALLOW_SHELL_JOBS to child when allowShellJobs=true"
   Positive-path companion to #3. SUP_ALLOW_SHELL_JOBS=1 → worker sees '1'. Confirms the else-branch doesn't over-strip and that operators who explicitly opt in still get shell-exec enabled.

Plus two audit-format unit tests:
- computeSupervisorAuditFilename format (regex match)
- Year-boundary ISO week: 2027-01-01 → supervisor-2026-W53.jsonl (matches the shell-audit.ts pattern exactly)

Before: 7 tests covering backoff math + PID helpers (~15% behavioral coverage per eng review).
After: 13 tests across all critical lifecycle paths (crash-restart, max-crashes, SIGTERM, env-inheritance, audit rotation).

All 146 tests in supervisor + minions + doctor suites green in ~8s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Lane F of PR #364 review fixes. Closes the multi-lane plan with release hygiene: VERSION bump 0.19.0 → 0.20.2, package.json sync, CHANGELOG entry in GStack voice with release summary + "numbers that matter" table + "To take advantage of v0.20.2" migration block + itemized changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

# Conflicts: # CHANGELOG.md # VERSION # package.json

The --help body in src/commands/jobs.ts is one big backtick template literal. The supervisor subcommand description I added in Lane B used both `${GBRAIN_AUDIT_DIR:-~/.gbrain/audit}` (parsed as a template interpolation into an undefined variable) and inline `code` backticks (parsed as nested template literals). CI caught it with ~200 tsc parse errors across the file.

Fix:
- Escape `${...}` → `\${...}` so the audit-file path renders literally.
- Replace prose inline-code backticks with plain single-quote fences (`gbrain jobs work` → 'gbrain jobs work', `~/.gbrain/supervisor.pid` → ~/.gbrain/supervisor.pid).

`--help` output is human prose; the single-quote form reads cleanly in a terminal without needing to smuggle nested backticks through a template literal.

`bunx tsc --noEmit` is clean. 146 tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
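The escaping issue is easy to reproduce in isolation. In the sketch below, the `auditDir` variable and the help strings are made up for illustration; only the `\${...}` escape and the single-quote substitution mirror the fix described in this commit.

```typescript
// Hypothetical in-scope variable, present only to contrast the two forms.
const auditDir = "/var/log/gbrain";

// Unescaped: `${...}` is evaluated as a TypeScript expression.
const interpolated = `Audit file: ${auditDir}/supervisor-YYYY-Www.jsonl`;

// Escaped: `\${...}` is emitted literally, so shell-style syntax survives.
const literal = `Audit file: \${GBRAIN_AUDIT_DIR:-~/.gbrain/audit}/supervisor-YYYY-Www.jsonl`;

// Single quotes inside a template literal need no escaping, unlike backticks.
const helpLine = `Prefer 'gbrain jobs supervisor' over 'gbrain jobs work' for long-running use.`;

console.log(interpolated); // Audit file: /var/log/gbrain/supervisor-YYYY-Www.jsonl
console.log(literal); // Audit file: ${GBRAIN_AUDIT_DIR:-~/.gbrain/audit}/supervisor-YYYY-Www.jsonl
console.log(helpLine);
```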

CI drift guard caught that `llms-full.txt` didn't match the current generator output. Root cause: the Lane B rewrite of `docs/guides/minions-deployment.md` (supervisor as canonical, watchdog deleted) changed content that gets inlined into `llms-full.txt`, but I didn't run `bun run build:llms` to regenerate. `bun test test/build-llms.test.ts` now clean (7/7 pass). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

# Conflicts: # CHANGELOG.md # VERSION # package.json

Pulls upstream v0.20.2 (#364): gbrain jobs supervisor, self-healing worker process manager.

Conflicts resolved:
- VERSION: kept 0.21.0; upstream is 0.20.2
- package.json: v0.21.0 wins
- CHANGELOG.md: v0.21.0 preserved above upstream's v0.20.2

Build clean: 0.21.0 binary runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Master shipped v0.20.2 (jobs supervisor, PR #364) while this branch was in review. Bumping our release from v0.20.1 to v0.20.3 follows the CLAUDE.md "VERSION must be higher than master's" rule.

Resolved conflicts:
- VERSION: 0.20.1 (ours) + 0.20.2 (master) → 0.20.3
- package.json: version bumped to match
- CHANGELOG.md: our queue-resilience entry renamed from v0.20.1 to v0.20.3 throughout (5 inline refs updated: numbers-that-matter table, "To take advantage" block, pre-v0.20.3 code reference, adversarial-review mention). The "composite indexes deferred to v0.20.2" follow-up reference updated to v0.20.4 because master already took v0.20.2 for the supervisor feature.

Auto-merged cleanly:
- src/commands/doctor.ts: master added supervisor health check (filesystem-only, PID liveness + audit tail) alongside our new queue_health check. No conflict: different sections.
- src/commands/jobs.ts: master added `jobs supervisor` subcommand with start / status / stop variants. No conflict with our --max-waiting CLI wiring or --wedge-rescue smoke case.

New files pulled in from master (supervisor feature):
- src/core/minions/supervisor.ts (MinionSupervisor class)
- src/core/minions/handlers/supervisor-audit.ts (JSONL audit)
- test/supervisor.test.ts (13 cases)
- test/fixtures/supervisor-runner.ts (integration test helper)

Post-merge verification:
- `bun run typecheck`: clean
- `bun test test/minions.test.ts test/doctor.test.ts`: 156 pass
- `bun run build` → gbrain 0.20.3
- CHANGELOG version sequence: 0.20.3 → 0.20.2 → 0.20.0 → 0.19.1 → ...

No source changes; conflict resolution was version-label surgery only. The queue-resilience code, tests, and docs landed from master cleanly alongside the new supervisor feature.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Merge upstream/master (commit 11abb24, gbrain v0.20.4) into KOS v2 fork.

Six upstream commits land:
- v0.19.0 check-resolvable OpenClaw fallback (garrytan#326)
- v0.19.1 smoke-test skillpack (garrytan#369)
- v0.20.0 BrainBench extracted to sibling repo (garrytan#195)
- v0.20.2 jobs supervisor (garrytan#364): Postgres-only, PGLite skips
- v0.20.3 queue resilience + queue_health doctor (garrytan#379): Postgres-only
- v0.20.4 minion-orchestrator skill consolidation (garrytan#381)

Conflicts resolved (2 real, 5 auto):
- .gitignore: union both fork (.omc/, kos-jarvis log globs) and upstream (eval/data/world-v1/world.html, amara-life-v1 cache) entries.
- skills/manifest.json: append upstream's smoke-test skill plus retain the 9 kos-jarvis fork skills (39 total).
- CLAUDE.md / README.md / package.json (0.20.4) / skills/RESOLVER.md / src/cli.ts (mode 0755) auto-merged cleanly.

Fork-local patches preserved (verified post-merge):
- src/core/pglite-schema.ts:65: idx_pages_source_id commented out (upstream garrytan#370 still open, fix retained).
- src/core/pglite-engine.ts:87: pg_switch_wal() before close() (WAL durability patch, no upstream issue filed yet).
- src/cli.ts mode 100755: bun shim executable bit.

Issue garrytan#332 (v0_13_0 process.execPath) fixed upstream in v0.19.0 ... running gbrain apply-migrations --yes will clear the partial-ledger remainder that has been stuck in doctor since the v0.13 sync.

v0.20's headline features (jobs supervisor, queue_health, wedge-rescue, backpressure-audit) are Postgres-only and skip on our PGLite engine. Sync is preventive ... keeps the fork mergeable rather than buying new runtime capability.

Pre-merge baseline (HEAD 170876f):
- pages 1988, chunks 3750 (100% embedded), links 8522, timeline 10881
- doctor health 60/100 (failed: minions_migration partial 0.13.0)
- brain_score 86/100

Rollback: git tag pre-sync-v0.20-1777105378
PGLite snapshot: ~/.gbrain/brain.pglite.pre-sync-v0.20-1777105391 (416M)

garrytan force-pushed the feat/worker-supervisor branch from 4b72cd3 to b1bfabd Compare April 24, 2026 02:59

garrytan and others added 6 commits April 23, 2026 23:42

garrytan changed the title ~~feat: gbrain jobs supervisor — self-healing worker process manager~~ v0.20.2 feat: gbrain jobs supervisor — self-healing worker process manager Apr 24, 2026

garrytan and others added 4 commits April 24, 2026 00:01

Merge remote-tracking branch 'origin/master' into feat/worker-supervisor

d216bde

# Conflicts: # CHANGELOG.md # VERSION # package.json

Merge remote-tracking branch 'origin/master' into feat/worker-supervisor

2bcc1c7

# Conflicts: # CHANGELOG.md # VERSION # package.json

garrytan merged commit e3f7042 into master Apr 24, 2026
4 checks passed

garrytan merged 11 commits into master from feat/worker-supervisor
