v0.41.19.0 feat: Supavisor Retry Cathedral (engine-level retry primitive)#1537

Merged
garrytan merged 1 commit into master from garrytan/fredericton
May 27, 2026
@garrytan garrytan commented May 27, 2026

Summary

Engine-level retry primitive that closes the v0.41.17 production incident: ~3,000 wiki links + timeline entries silently lost per dream cycle on a 16K-page brain because Supavisor's 5-10s circuit-breaker recovery couldn't fit inside the prior single-500ms-retry window.

Architecture (the cathedral move): retry becomes a data-primitive contract, not a caller responsibility. postgres-engine.ts + pglite-engine.ts self-retry inside addLinksBatch, addTimelineEntriesBatch, and upsertChunks. Every caller — current AND future — inherits retry for free. CI lint guard scripts/check-no-double-retry.sh fails the build if anyone re-wraps an engine batch method (prevents 3×3=9 retry amplification).

Codex-hardened defaults: BULK_RETRY_OPTS = {maxRetries:3, delayMs:1000, delayMaxMs:10000, jitter:'decorrelated'}. ~12s worst-case wait covers the full Supavisor recovery window. AWS-style decorrelated jitter replaces 'full' jitter which allowed near-zero retries that re-hit the still-recovering breaker.

Observability: ~/.gbrain/audit/batch-retry-YYYY-Www.jsonl records every retry. gbrain doctor's new batch_retry_health check reads last 24h (codex H-9: avoid permanent noise from one historical blip) with per-site thresholds. 30-day audit pruning hooked into the cycle's purge phase.

AbortSignal threading: SIGTERM aborts sleeping retries cleanly instead of blocking deploys for up to delayMaxMs.

Operator tuning: GBRAIN_BULK_MAX_RETRIES / GBRAIN_BULK_RETRY_BASE_MS / GBRAIN_BULK_RETRY_MAX_MS. Setting GBRAIN_BULK_MAX_RETRIES=0 disables retries for debugging; bad values surface at gbrain doctor startup with paste-ready fix hints.

Test Coverage

64 new test cases across 4 new test files:

  • test/core/retry.test.ts — 37 cases (jitter math, abort semantics, env-override boundaries, typed audit-site validation)
  • test/audit/batch-retry-audit.test.ts — 12 cases (privacy posture, corruption tolerance, 30-day pruning)
  • test/doctor-batch-retry.test.ts — 10 cases (warn/fail thresholds, corrupt-JSONL tolerance, M-10 env validation)
  • test/core/retry-stress.slow.test.ts — 5 cases (100 batches × 30% blip rate, asserts zero row loss)

Plus PR #1523's original 5 cases preserved via co-author trailer (assertions adjusted for new BULK_RETRY_OPTS defaults).

Tests: 11,432 → 11,496 (+64 new)

Pre-Landing Review

CEO review (SELECTIVE EXPANSION): 4 cherry-picks proposed, 4 accepted (env override, doctor check, sync/reindex coverage, stress regression).

Eng review (2 passes): 10 findings, 0 critical gaps. The load-bearing call was the architectural pivot from CEO plan's per-site wrapping to engine-level wrap — closes the bug class structurally for future callers too.

Codex independent review: 23 findings; 10 critical/high absorbed (decorrelated jitter, 12s backoff window, AbortSignal, commit-ambiguity proof per primitive, backfill unification, typed audit-site enum, doctor 24h window + thresholds, audit pruning, env validation at doctor startup, per-instance test seam via WeakMap).

Plan Completion

11/11 tasks done (T1-T11). Full record in ~/.claude/plans/system-instruction-you-are-working-smooth-gosling.md.

Verification Results

  • bun run verify: 28/28 checks green (includes 2 new lint guards: check-no-double-retry, check-batch-audit-site)
  • bun run test: 11,453 pass / 1 pre-existing flake (schema-cli.test.ts > schema active reports default resolution — confirmed by running on clean master, NOT introduced by this wave)
  • bun run test:slow: 40/40 including new retry-stress.slow.test.ts
  • bunx tsc --noEmit: 0 errors

Closes / Absorbs

Closes PR #1523 (@garrytan-agents). Original extract.ts fix preserved verbatim in withRetry's call shape; 5 test cases moved to test/core/retry.test.ts with assertions adjusted for the v0.41.19.0 BULK_RETRY_OPTS defaults. Co-Authored-By trailer on the commit credits the original work.

Test plan

  • bun run verify (28 checks)
  • bun run test (parallel unit suite)
  • bun run test:slow (includes stress regression)
  • bunx tsc --noEmit clean
  • CI green on push
  • Smoke test against the user's actual Supavisor-pooled brain: run gbrain extract all + grep ~/.gbrain/audit/batch-retry-*.jsonl

🤖 Generated with Claude Code

Engine-level retry primitive that closes the v0.41.17 production incident
where ~3,000 wiki links + timeline entries were silently lost per dream
cycle on a 16K-page brain. Supavisor's circuit-breaker takes 5-10s to
recover; the prior single-500ms-retry shape couldn't survive it.

ARCHITECTURE
============

Retry becomes a data-primitive contract, not a caller responsibility.
postgres-engine.ts + pglite-engine.ts now self-retry inside addLinksBatch,
addTimelineEntriesBatch, and upsertChunks. Every caller — current AND
future — inherits retry-for-free. CI lint guard `scripts/check-no-double-retry.sh`
fails the build if anyone re-wraps an engine batch method (preventing
3×3=9 retry amplification on incomplete reverts).
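
To illustrate what "retry as a data-primitive contract" means, here is a minimal sketch, not the code from this PR. `makeSketchEngine`, the `RetryOpts` shape, and this `withRetry` signature are hypothetical stand-ins, and the delay is a plain capped exponential rather than the jittered schedule the PR ships. The point is only that the batch method owns the retry loop, so callers never wrap it.

```typescript
// Hypothetical option shape, modeled on the BULK_RETRY_OPTS field names.
interface RetryOpts {
  maxRetries: number; // retries allowed after the first attempt
  delayMs: number;    // base delay between attempts
  delayMaxMs: number; // upper bound for any single delay
}

type Sleep = (ms: number) => Promise<void>;

const realSleep: Sleep = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Illustrative retry loop (assumed signature). It retries every error;
// real code would need to decide which errors are safe to retry.
async function withRetry<T>(fn: () => Promise<T>, opts: RetryOpts, sleep: Sleep = realSleep): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= opts.maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === opts.maxRetries) break; // budget exhausted
      // Capped exponential delay; jitter is deliberately left out of this sketch.
      await sleep(Math.min(opts.delayMaxMs, opts.delayMs * 2 ** attempt));
    }
  }
  throw lastError;
}

// Hypothetical engine: the batch method retries internally, so every caller
// gets the same behavior without wrapping the call site.
function makeSketchEngine(
  writeRows: (rows: string[]) => Promise<number>,
  opts: RetryOpts,
  sleep: Sleep = realSleep,
) {
  return {
    addLinksBatch(rows: string[]): Promise<number> {
      return withRetry(() => writeRows(rows), opts, sleep);
    },
  };
}
```

A caller then writes `await engine.addLinksBatch(rows)` and nothing else. Wrapping that call in a second `withRetry` with the same options would multiply the attempts, which is the amplification the lint guard is there to prevent.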

CODEX-HARDENED DEFAULTS
=======================

BULK_RETRY_OPTS = {maxRetries:3, delayMs:1000, delayMaxMs:10000,
jitter:'decorrelated'}. Total worst-case wait ≈12s covers full Supavisor
recovery window. Decorrelated jitter (AWS-style uniform(base, prevDelay*3)
capped at maxDelay) replaces 'full' which allowed near-zero retries that
re-hit the still-recovering breaker.
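
The jitter rule quoted above, uniform(base, prevDelay*3) capped at maxDelay, can be sketched as a single pure function. This is an illustration under assumptions: the function name, the parameter order, and the injectable `random` source are hypothetical and not taken from the PR's retry module.

```typescript
// Decorrelated jitter: sample uniformly between the base delay and three
// times the previous delay, then cap the result at maxDelayMs.
function nextDecorrelatedDelay(
  baseMs: number,
  prevDelayMs: number,
  maxDelayMs: number,
  random: () => number = Math.random, // injectable for deterministic tests
): number {
  const upper = Math.max(baseMs, prevDelayMs * 3);
  const sampled = baseMs + random() * (upper - baseMs);
  return Math.min(maxDelayMs, sampled);
}

// Walk a three-retry schedule using the defaults named in this PR.
// Seeding prev with the base delay is an assumption of this sketch.
function sampleSchedule(random: () => number = Math.random): number[] {
  const base = 1000;
  const cap = 10000;
  const schedule: number[] = [];
  let prev = base;
  for (let retry = 0; retry < 3; retry++) {
    prev = nextDecorrelatedDelay(base, prev, cap, random);
    schedule.push(prev);
  }
  return schedule;
}
```

The property that matters for the breaker is the lower bound: every sampled delay is at least `baseMs`, whereas full jitter samples from zero upward and can produce a retry that fires almost immediately.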

AbortSignal threading from MinionWorker.shutdownAbort.signal through
engine method opts → withRetry → abortableSleep. SIGTERM aborts sleeping
retries instead of blocking deploys for up to delayMaxMs.
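
A sleep that honors an AbortSignal is the piece that lets a shutdown interrupt a pending retry. The sketch below is an assumption about the shape, not the PR's `abortableSleep`; it uses only the standard `AbortSignal` and timer APIs, and the rejection error message is made up for illustration.

```typescript
// Resolves after `ms`, or rejects as soon as the signal aborts.
function abortableSleep(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    if (signal?.aborted) {
      // Already aborted: never start the timer.
      reject(new Error('sleep aborted'));
      return;
    }
    const onAbort = () => {
      clearTimeout(timer); // drop the pending timer so nothing keeps the process alive
      reject(new Error('sleep aborted'));
    };
    const timer = setTimeout(() => {
      signal?.removeEventListener('abort', onAbort); // avoid leaking the listener
      resolve();
    }, ms);
    signal?.addEventListener('abort', onAbort, { once: true });
  });
}
```

In a worker, a SIGTERM handler would call `abort()` on a shared `AbortController`, and the same signal would be passed down through the engine method options so that a retry waiting out a multi-second delay rejects immediately.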

OBSERVABILITY
=============

`~/.gbrain/audit/batch-retry-YYYY-Www.jsonl` records every retry event
(success-after-blip AND exhausted-retries). Built on the v0.40.4.0
audit-writer cathedral. Privacy posture: never logs slugs / page IDs /
content (mirrors shell-audit.ts).
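
The `YYYY-Www` part of the audit file name looks like an ISO 8601 week stamp. Assuming that reading, and assuming UTC (neither is confirmed by this PR), the name for a given date could be derived as in the sketch below. Both function names are hypothetical.

```typescript
// ISO 8601 week stamp such as "2026-W22", computed in UTC.
function isoWeekStamp(date: Date): string {
  const d = new Date(Date.UTC(date.getUTCFullYear(), date.getUTCMonth(), date.getUTCDate()));
  const dayOfWeek = d.getUTCDay() || 7; // Monday=1 ... Sunday=7
  // Move to the Thursday of this week; its calendar year is the ISO week-year.
  d.setUTCDate(d.getUTCDate() + 4 - dayOfWeek);
  const isoYear = d.getUTCFullYear();
  const yearStart = Date.UTC(isoYear, 0, 1);
  const week = Math.ceil(((d.getTime() - yearStart) / 86_400_000 + 1) / 7);
  return `${isoYear}-W${String(week).padStart(2, '0')}`;
}

// Hypothetical helper producing the weekly audit file name.
function batchRetryAuditFileName(date: Date): string {
  return `batch-retry-${isoWeekStamp(date)}.jsonl`;
}

// The merge date of this PR, 2026-05-27, falls in ISO week 22.
const example = batchRetryAuditFileName(new Date(Date.UTC(2026, 4, 27)));
// example === "batch-retry-2026-W22.jsonl"
```

Weekly files keep each JSONL small and make age-based pruning a matter of comparing file names or modification times rather than parsing contents.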

`gbrain doctor` learns `batch_retry_health` check. Reads last 24h
(not 7d — codex H-9: avoid permanent noise from one historical blip).
Thresholds: ok (zero or <3 same-site), warn (>=3 same-site OR >=5
cross-site), fail (>=20 sustained breaker). Surfaces bad GBRAIN_BULK_*
env at startup (codex M-10). Corrupt-JSONL tolerant.
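
The thresholds above can be read as a small classification function. The sketch below is one possible interpretation, not the doctor check itself: the `RetryEvent` fields are hypothetical, and treating "fail (>=20)" as the total event count across all sites within the 24h window is an assumption.

```typescript
type HealthStatus = 'ok' | 'warn' | 'fail';

// Hypothetical audit record: which call site retried, and when (epoch ms).
interface RetryEvent {
  site: string;
  ts: number;
}

// Corrupt-tolerant JSONL parsing: skip lines that are not valid records.
function parseAuditLines(text: string): RetryEvent[] {
  const events: RetryEvent[] = [];
  for (const line of text.split('\n')) {
    if (line.trim() === '') continue;
    try {
      const parsed = JSON.parse(line);
      if (parsed && typeof parsed.site === 'string' && typeof parsed.ts === 'number') {
        events.push({ site: parsed.site, ts: parsed.ts });
      }
    } catch {
      // Truncated or garbled line: ignore it and keep reading.
    }
  }
  return events;
}

function classifyBatchRetryHealth(events: RetryEvent[], nowMs: number): HealthStatus {
  const windowStart = nowMs - 24 * 60 * 60 * 1000; // only the last 24h counts
  const perSite = new Map<string, number>();
  let total = 0;
  let maxSameSite = 0;
  for (const e of events) {
    if (e.ts < windowStart || e.ts > nowMs) continue;
    total++;
    const count = (perSite.get(e.site) ?? 0) + 1;
    perSite.set(e.site, count);
    if (count > maxSameSite) maxSameSite = count;
  }
  if (total >= 20) return 'fail';                          // sustained breaker trouble
  if (maxSameSite >= 3 || total >= 5) return 'warn';       // same-site or cross-site cluster
  return 'ok';
}
```

Because events older than 24 hours are ignored, a single historical blip ages out of the report on its own instead of warning indefinitely.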

30-day audit pruning hooked into the dream cycle's purge phase (codex H-8
— implements the 'pruning convention' for real).

OPERATOR TUNING
===============

GBRAIN_BULK_MAX_RETRIES (int >= 0; 0 disables retries for debugging)
GBRAIN_BULK_RETRY_BASE_MS (int > 0)
GBRAIN_BULK_RETRY_MAX_MS (int >= base)

Bad values throw GBrainError with paste-ready fix hints at doctor startup,
not at first-retry mid-cycle.
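
The three constraints listed above are straightforward to validate up front. The following is a hedged sketch of such a validator: `parseBulkRetryEnv` is a hypothetical name, it throws a plain `Error` where the PR throws `GBrainError`, and the wording of the fix hints is invented for illustration.

```typescript
interface BulkRetryConfig {
  maxRetries: number;
  delayMs: number;
  delayMaxMs: number;
}

// Validate the GBRAIN_BULK_* overrides eagerly so a typo fails fast.
function parseBulkRetryEnv(env: Record<string, string | undefined>): BulkRetryConfig {
  const readInt = (name: string, fallback: number): number => {
    const raw = env[name];
    if (raw === undefined || raw.trim() === '') return fallback; // unset: keep the default
    if (!/^\d+$/.test(raw.trim())) {
      throw new Error(`${name}="${raw}" is not a non-negative integer. Try: export ${name}=${fallback}`);
    }
    return Number(raw.trim());
  };

  // Fallbacks mirror the BULK_RETRY_OPTS defaults stated in this PR.
  const maxRetries = readInt('GBRAIN_BULK_MAX_RETRIES', 3);
  const delayMs = readInt('GBRAIN_BULK_RETRY_BASE_MS', 1000);
  const delayMaxMs = readInt('GBRAIN_BULK_RETRY_MAX_MS', 10000);

  if (delayMs <= 0) {
    throw new Error('GBRAIN_BULK_RETRY_BASE_MS must be > 0. Try: export GBRAIN_BULK_RETRY_BASE_MS=1000');
  }
  if (delayMaxMs < delayMs) {
    throw new Error(
      `GBRAIN_BULK_RETRY_MAX_MS (${delayMaxMs}) must be >= GBRAIN_BULK_RETRY_BASE_MS (${delayMs}). ` +
        `Try: export GBRAIN_BULK_RETRY_MAX_MS=${delayMs * 10}`,
    );
  }
  return { maxRetries, delayMs, delayMaxMs };
}
```

Passing the environment in as an argument, rather than reading a global inside the function, keeps the validator easy to exercise at the boundaries (0 retries, base equal to max, non-numeric input).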

VERIFICATION
============

- bun run verify: 28/28 checks green (includes 2 new lint guards:
  check-no-double-retry, check-batch-audit-site)
- bun run test: 11453 pass / 1 pre-existing flake (schema-cli.test.ts —
  confirmed by running on clean master, NOT introduced by this wave)
- bun run test:slow: 40/40 including new test/core/retry-stress.slow.test.ts
  (100 batches × 30% blip rate × decorrelated jitter, zero row loss)
- bunx tsc --noEmit: 0 errors

REVIEWS
=======

- CEO review (SELECTIVE EXPANSION): 4 cherry-picks proposed, 4 accepted
- Eng review (2 passes): 10 findings, 0 critical gaps, architectural
  pivot from per-site to engine-level wrap
- Codex independent review: 23 findings; 10 critical/high absorbed
  (decorrelated jitter, 12s backoff window, AbortSignal, idempotency
  proof, backfill unification, typed audit-site enum, doctor expiry
  thresholds, audit pruning, env validation at doctor startup)

PR #1523 closed and absorbed (@garrytan-agents original extract.ts fix
preserved via co-author trailer; 5 test cases moved to test/core/retry.test.ts
with assertions adjusted for the v0.41.19.0 BULK_RETRY_OPTS defaults).

Co-Authored-By: garrytan-agents <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@garrytan garrytan force-pushed the garrytan/fredericton branch from 3a22022 to aacc81a Compare May 27, 2026 06:11
@garrytan garrytan changed the title from "v0.42.0.0 feat: Supavisor Retry Cathedral (engine-level retry primitive)" to "v0.41.19.0 feat: Supavisor Retry Cathedral (engine-level retry primitive)" May 27, 2026
@garrytan garrytan merged commit a7b79b6 into master May 27, 2026
21 checks passed
garrytan added a commit that referenced this pull request May 27, 2026
…ems #6 + #7) (#1544)

* feat(doctor): doctor-categories foundation — BRAIN/SKILL/OPS/META sets + drift guard

Categorizes every doctor check name into exactly one of four categories. Exported
constants + categorizeCheck(name) helper are the single source of truth for the
v0.41.20.0 brain_checks_score + category_scores + --scope=brain wave. Drift guard
test parses doctor.ts source for both inline {name: 'foo'} and helper
const name = 'foo' patterns; CI fires if any check name lacks a category.
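
A drift guard of this kind can be sketched in a few lines. Everything here is illustrative: the category sets contain only check names mentioned on this page, the lowercase category strings and the regular expression are assumptions, and the real guard parses doctor.ts rather than an inline string.

```typescript
type Category = 'brain' | 'skill' | 'ops' | 'meta';

// Partial, illustrative sets; the real module categorizes every check.
const SKILL_CHECKS = new Set(['resolver_health', 'skill_conformance', 'skill_brain_first', 'whoknows_health']);
const OPS_CHECKS = new Set(['batch_retry_health']);

// Returns undefined for unknown names so the guard can flag them.
function categorizeCheck(name: string): Category | undefined {
  if (SKILL_CHECKS.has(name)) return 'skill';
  if (OPS_CHECKS.has(name)) return 'ops';
  return undefined;
}

// Finds both `{name: 'foo'}` and `const name = 'foo'` patterns in source text.
function extractCheckNames(source: string): string[] {
  const pattern = /\bname\s*[:=]\s*'([a-z0-9_]+)'/g;
  const names: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    if (!names.includes(match[1])) names.push(match[1]);
  }
  return names;
}

// The guard: any check name in the source without a category is drift.
function findUncategorized(source: string): string[] {
  return extractCheckNames(source).filter((name) => categorizeCheck(name) === undefined);
}
```

A test would assert that `findUncategorized(doctorSource)` is empty, so adding a check without a category entry fails CI instead of quietly falling into a default bucket.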

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(doctor): brain_checks_score + category_scores + --scope=brain skip-computation

Extends Check with optional category and DoctorReport with brain_checks_score +
category_scores (additive — schema_version stays at 2; back-compat health_score
math byte-identical). buildChecks gains --scope=brain with explicit early-skip
gates around the SKILL check group (resolver_health + skill_conformance +
skill_brain_first + whoknows_health). Sub-second doctor on a brain with thousands
of skills. computeDoctorReport tags every check via categorizeCheck() at compute
time. Human output leads with the brain figure and renders the weighted
BrainHealth.brain_score alongside.

Test seam fix in test/doctor-home-dir-in-worktree.test.ts: the pre-existing
fragile JSON parser walked back from "checks" to find the envelope's outer
brace; v0.41.20.0's new nested category_scores object broke that heuristic.
Anchored on the canonical {"schema_version" envelope prefix instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(status): gbrain status — single-screen brain health dashboard + get_status_snapshot MCP op

NEW gbrain status command (src/commands/status.ts) composes 6 sections:
sync (per-source last_sync_at + staleness via buildSyncStatusReport),
cycle (TWO rows: last autopilot-cycle + last autopilot-* of any kind —
reflects v0.36.4.0 health-aware autopilot's targeted handler routing;
totals read from result.report.totals per the canonical handler shape),
locks (gbrain_cycle_locks active rows), workers (readSupervisorEvents +
summarizeCrashes), queue (LIVE counts NO time-window — old stuck jobs
are exactly what status surfaces), autopilot (PID liveness via kill -0).
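
For the PID liveness probe, the Node-compatible equivalent of `kill -0` is `process.kill(pid, 0)`, which performs the existence and permission check without delivering a signal. The helper below is a sketch with a hypothetical name; how the status command actually reads the autopilot PID is not shown on this page.

```typescript
// Returns true if a process with this PID exists.
function isPidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0); // signal 0: check only, nothing is delivered
    return true;
  } catch (err) {
    // EPERM means the process exists but belongs to another user;
    // ESRCH (or anything else) is treated as "not running".
    return (err as { code?: string }).code === 'EPERM';
  }
}
```

A PID file can outlive its process, and the operating system may reuse the number, so a probe like this reports that some process holds the PID, not that it is the expected one.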

Stable --json envelope (schema_version: 1). Exit codes 0=ok / 1=snapshot
failed / 2=usage. --section filter.

Thin-client mode routes Sync + Cycle through NEW get_status_snapshot MCP
op (admin scope, NOT localOnly; payload deliberately omits Locks /
Workers / Queue / Autopilot so feature creep can't quietly widen the
admin-scoped data exposure). Local-only sections render "local-only —
N/A on remote brain" honestly instead of pretending the local install's
empty state is the remote brain's.

CLI dispatch: pre-engine-bind branch for thin-client (no PGLite needed)
+ engine-connected dispatch case for local mode. CLI-only architecture
per codex MAJOR-4 (status owns its own thin-client branch inside
runStatus, not routed through op dispatch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version to v0.41.20.0 + CHANGELOG + TODOS + llms regen

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(doctor-categories): categorize batch_retry_health from v0.41.19.0 Supavisor wave

The drift guard correctly caught a new check introduced by master's v0.41.19.0
Supavisor Retry Cathedral (PR #1537). batch_retry_health surfaces batch-write
retry events from the new src/core/audit/batch-retry-audit.ts module — OPS
category (infrastructure liveness).

This is exactly why the drift guard exists: any future check added to doctor.ts
without a category entry fails CI immediately instead of silently degrading
to 'meta'.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
garrytan added a commit that referenced this pull request May 27, 2026
Resolve VERSION, package.json, CHANGELOG conflicts. v0.41.22.0 stays
on top (higher than master's v0.41.20.0); master's v0.41.20.0 entry
preserved below in CHANGELOG order. Brought in master's gbrain status
+ doctor --scope=brain (PR #1544) and the v0.41.19.0 Supavisor Retry
Cathedral (PR #1537) cleanly via auto-merge.

Typecheck clean. bun run verify: 28/28 checks pass.
mgunnin added a commit to mgunnin/gbrain that referenced this pull request May 28, 2026
* upstream/master:
  v0.41.26.1 fix: lock-renewal cathedral — closes ~39 worker crashes/day (supersedes garrytan#1567) (garrytan#1572)
  v0.41.26.0 fix: dream --source + ingest junk titles + emoji-crash (supersedes garrytan#1559, garrytan#1561) (garrytan#1571)
  v0.41.25.0 perf(sync): batched deletes + global page-generation clock (supersedes garrytan#1538) (garrytan#1566)
  v0.41.24.0 fix(conversation-parser): threshold gates + bold-paren-time pattern — 20,167 Circleback messages unblocked (closes garrytan#1533) (garrytan#1543)
  v0.41.23.0 feat: extract operator surfaces + pack-driven extractables (garrytan#1541)
  v0.41.22.1 feat: brainstorm/lsd judge fixes (closes garrytan#1540 end-to-end) (garrytan#1562)
  v0.41.22.0 feat: type-unification cathedral — 94 types → 15 canonical (closes garrytan#1479) (garrytan#1542)
  v0.41.21.0 feat(ops): 5 daily-driver pains fixed in one wave (garrytan#1545)
  v0.41.20.0 feat: gbrain status + doctor --scope=brain (fix wave 2: items garrytan#6 + garrytan#7) (garrytan#1544)
  feat: v0.41.19.0 Supavisor Retry Cathedral (garrytan#1537)
  v0.41.18.0: gbrain onboard — the activation surface gbrain didn't have before (garrytan#1521)
  v0.41.17.0 feat: --workers N on every bulk command + facts dim doctor parity (garrytan#1519)
  v0.41.16.0 feat: conversation parser cathedral + progressive-batch primitive (closes garrytan#1461) (garrytan#1510)
  v0.41.15.0 feat(sync): --timeout + --max-age + partial status (closes garrytan#1472 RFC) (garrytan#1506)
garrytan added a commit that referenced this pull request Jun 2, 2026
… — gbrain extract --stale + doctor lag check (#1696) (#1755)

* feat(extract): link/timeline extraction freshness watermark (#1696)

Closes the "imported != curated" gap: plain `gbrain sync` only extracts
CHANGED pages, so a brain with autopilot off accumulated a links table that
was ~99.7% untyped `mentions` with nothing surfacing it. Adds a per-page
freshness watermark (pages.links_extracted_at, migration v112) and three
things built on it:

- `gbrain extract --stale [--source-id] [--catch-up] [--dry-run] [--json]`:
  incremental DB-source link+timeline sweep over pages whose extraction is
  stale (never extracted, edited since, or extractor version bumped). Small
  byte-bounded batches, non-swallowing flush, stamp-after-flush so a crash
  re-extracts idempotently. Stamps with the row's READ updated_at (not now())
  so a concurrent edit during the sweep stays stale instead of being lost.
- `links_extraction_lag` doctor check (local + remote): warn-only by default
  (>20%), hard-fail only via GBRAIN_EXTRACTION_LAG_FAIL_PCT. Vacuous-skip
  <100 pages; pre-v112 brains graceful-skip.
- `gbrain sync --no-extract` flag + end-of-sync nudge (fires on
  synced|first_sync|up_to_date so the initial import surfaces its backlog).

Three new BrainEngine methods (countStalePagesForExtraction /
listStalePagesForExtraction / markPagesExtractedBatch) with Postgres<->PGLite
parity + bootstrap probes. Schema parity: schema.sql + regenerated
pglite-schema.ts + schema-embedded.ts + bootstrap-coverage test. Migration
v112 (composite (source_id, links_extracted_at) index, no backfill so the
real backlog surfaces on first doctor run).

* test(audit): hermetic GBRAIN_AUDIT_DIR override for prune ENOENT case

The "no-op when audit dir does not exist (ENOENT)" case called
pruneOldBatchRetryAuditFiles without a GBRAIN_AUDIT_DIR override, so it read
the developer's real ~/.gbrain/audit and flaked (kept>0) on any machine with
prior gbrain audit history. Point it at a guaranteed-nonexistent temp path so
it tests the real missing-dir branch hermetically — matching the file
header's "never touches ~/.gbrain/audit" contract. Pre-existing flake
(introduced by v0.41.19.0 #1537), unrelated to #1696.

* chore: bump version and changelog (v0.42.2.0)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs: CLAUDE.md key-files entry for the #1696 extract-stale wave + regen llms-full

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>