v0.41.19.0 feat: Supavisor Retry Cathedral (engine-level retry primitive) #1537
Merged
Engine-level retry primitive that closes the v0.41.17 production incident
where ~3,000 wiki links + timeline entries were silently lost per dream
cycle on a 16K-page brain. Supavisor's circuit-breaker takes 5-10s to
recover; the prior single-500ms-retry shape couldn't survive it.
ARCHITECTURE
============
Retry becomes a data-primitive contract, not a caller responsibility.
postgres-engine.ts + pglite-engine.ts now self-retry inside addLinksBatch,
addTimelineEntriesBatch, and upsertChunks. Every caller — current AND
future — inherits retry-for-free. CI lint guard `scripts/check-no-double-retry.sh`
fails the build if anyone re-wraps an engine batch method (preventing
3×3=9 retry amplification on incomplete reverts).
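The amplification the lint guard prevents can be shown with a small counting sketch. Everything here is hypothetical: `withRetry` is a simplified stand-in that takes a total attempt count and does not sleep, and `rawInsert` simulates a write that always fails. It is not the project's actual helper.

```typescript
// Hypothetical stand-in for the retry helper: runs fn up to `attempts` times
// (no sleeping, to keep the arithmetic visible) and rethrows the last error.
async function withRetry<T>(fn: () => Promise<T>, attempts: number): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

// Counts how often the underlying write is attempted while the pooler stays down.
async function countAttempts(callerAlsoWraps: boolean): Promise<number> {
  let calls = 0;
  const rawInsert = async (): Promise<void> => {
    calls++;
    throw new Error("simulated pooler failure");
  };
  // The engine method owns the retry, as in the architecture described above.
  const addLinksBatch = (): Promise<void> => withRetry(rawInsert, 3);
  // A caller that re-wraps the engine method multiplies the attempts.
  const caller = callerAlsoWraps
    ? (): Promise<void> => withRetry(addLinksBatch, 3)
    : addLinksBatch;
  await caller().catch(() => undefined); // swallow the final failure; only the count matters
  return calls;
}
```

With the engine as the only retry layer, `countAttempts(false)` resolves to 3. With a caller-side wrapper on top, `countAttempts(true)` resolves to 9.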
CODEX-HARDENED DEFAULTS
=======================
BULK_RETRY_OPTS = {maxRetries:3, delayMs:1000, delayMaxMs:10000,
jitter:'decorrelated'}. Total worst-case wait ≈12s covers full Supavisor
recovery window. Decorrelated jitter (AWS-style uniform(base, prevDelay*3)
capped at maxDelay) replaces 'full' jitter, which allowed near-zero-delay
retries that re-hit the still-recovering breaker.
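A minimal sketch of the decorrelated-jitter rule as stated above: sample uniformly between the base delay and three times the previous delay, then cap at the maximum. The function names and the injectable `rand` parameter are assumptions for illustration. How the real `withRetry` seeds the first delay, and how it arrives at its ≈12s total, is not shown in this description.

```typescript
// Field names mirror BULK_RETRY_OPTS from the text; the type itself is hypothetical.
interface RetryOpts {
  maxRetries: number;
  delayMs: number; // base delay, also the lower bound of every sample
  delayMaxMs: number; // hard cap on any single delay
}

const BULK_RETRY_OPTS: RetryOpts = { maxRetries: 3, delayMs: 1000, delayMaxMs: 10000 };

// Decorrelated jitter: next = min(cap, uniform(base, prev * 3)).
// `rand` is injectable so the schedule can be checked deterministically.
function nextDelay(prevDelayMs: number, opts: RetryOpts, rand: () => number = Math.random): number {
  const lo = opts.delayMs;
  const hi = Math.max(lo, prevDelayMs * 3);
  const sampled = lo + rand() * (hi - lo);
  return Math.min(opts.delayMaxMs, sampled);
}

// Delays for each retry, assuming (hypothetically) that the first sample uses prev = base.
function delaySchedule(opts: RetryOpts, rand: () => number = Math.random): number[] {
  const delays: number[] = [];
  let prev = opts.delayMs;
  for (let i = 0; i < opts.maxRetries; i++) {
    prev = nextDelay(prev, opts, rand);
    delays.push(prev);
  }
  return delays;
}
```

The property that matters for the breaker is the lower bound: even with the unluckiest draw (`rand` returning 0), every delay is the full base of 1000 ms, whereas full jitter samples from zero upward.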
AbortSignal threading from MinionWorker.shutdownAbort.signal through
engine method opts → withRetry → abortableSleep. SIGTERM aborts sleeping
retries instead of blocking deploys for up to delayMaxMs.
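The abort behaviour described above can be sketched with the standard `AbortSignal` API. This is an illustrative version of a sleep that rejects as soon as the signal fires; the project's `abortableSleep` is named in the text but its actual signature and error type are not, so both are assumptions here.

```typescript
// Sleeps for `ms`, but rejects immediately if `signal` aborts first.
// Hypothetical signature; the real abortableSleep may differ.
function abortableSleep(ms: number, signal?: AbortSignal): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    if (signal?.aborted) {
      reject(signal.reason ?? new Error("aborted"));
      return;
    }
    const onAbort = (): void => {
      clearTimeout(timer); // stop waiting so shutdown is not held up
      reject(signal?.reason ?? new Error("aborted"));
    };
    const timer = setTimeout(() => {
      signal?.removeEventListener("abort", onAbort); // avoid leaking the listener
      resolve();
    }, ms);
    signal?.addEventListener("abort", onAbort, { once: true });
  });
}
```

A worker would create one `AbortController`, call `abort()` from its SIGTERM handler, and pass `controller.signal` down through the engine method options so a retry that is mid-sleep rejects at once.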
OBSERVABILITY
=============
`~/.gbrain/audit/batch-retry-YYYY-Www.jsonl` records every retry event
(success-after-blip AND exhausted-retries). Built on the v0.40.4.0
audit-writer cathedral. Privacy posture: never logs slugs / page IDs /
content (mirrors shell-audit.ts).
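The `YYYY-Www` pattern in the file name looks like an ISO 8601 week stamp. Assuming that reading (the text does not confirm it), the weekly file name could be derived as follows. `isoWeekStamp` and `auditFileName` are hypothetical names.

```typescript
// Returns the ISO 8601 week stamp, e.g. "2026-W22", computed in UTC.
function isoWeekStamp(date: Date): string {
  // Copy the calendar day so the caller's Date is not mutated.
  const d = new Date(Date.UTC(date.getUTCFullYear(), date.getUTCMonth(), date.getUTCDate()));
  // ISO weekday: Monday = 1 ... Sunday = 7.
  const dayNum = d.getUTCDay() || 7;
  // Move to the Thursday of this week; its year is the ISO week-year.
  d.setUTCDate(d.getUTCDate() + 4 - dayNum);
  const yearStart = Date.UTC(d.getUTCFullYear(), 0, 1);
  const week = Math.ceil(((d.getTime() - yearStart) / 86_400_000 + 1) / 7);
  return `${d.getUTCFullYear()}-W${String(week).padStart(2, "0")}`;
}

// Hypothetical helper: file name only, no directory handling.
function auditFileName(date: Date): string {
  return `batch-retry-${isoWeekStamp(date)}.jsonl`;
}
```

Under this assumption, dates near a year boundary follow the ISO week-year, so 1 January 2027 lands in the `2026-W53` file.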
`gbrain doctor` gains a `batch_retry_health` check. It reads the last 24h
(not 7d; codex H-9: avoid permanent noise from one historical blip).
Thresholds: ok (zero or <3 same-site), warn (>=3 same-site OR >=5
cross-site), fail (>=20 sustained breaker). Surfaces bad GBRAIN_BULK_*
env at startup (codex M-10). Corrupt-JSONL tolerant.
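One way to read the thresholds above is as a classifier over the retry events in the 24h window. The sketch assumes that "same-site" means the highest count for any single call site, that "cross-site" means the total across all sites, and that the fail threshold is also a total. The event shape and site names are hypothetical.

```typescript
type Status = "ok" | "warn" | "fail";

// Hypothetical event shape: which engine call site retried, and when (epoch ms).
interface RetryEvent {
  site: string;
  ts: number;
}

const WINDOW_MS = 24 * 60 * 60 * 1000;

function classifyBatchRetryHealth(events: RetryEvent[], now: number): Status {
  // Only the last 24h counts, so one historical blip ages out.
  const recent = events.filter((e) => e.ts > now - WINDOW_MS && e.ts <= now);

  const perSite = new Map<string, number>();
  for (const e of recent) {
    perSite.set(e.site, (perSite.get(e.site) ?? 0) + 1);
  }
  const maxSameSite = Math.max(0, ...Array.from(perSite.values()));

  if (recent.length >= 20) return "fail"; // sustained breaker trouble
  if (maxSameSite >= 3 || recent.length >= 5) return "warn";
  return "ok";
}
```

Checking the fail condition first keeps the result monotonic: adding events can only move the status from ok toward fail, never back.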
30-day audit pruning hooked into the dream cycle's purge phase (codex H-8
— implements the 'pruning convention' for real).
OPERATOR TUNING
===============
GBRAIN_BULK_MAX_RETRIES (int >= 0; 0 disables retries for debugging)
GBRAIN_BULK_RETRY_BASE_MS (int > 0)
GBRAIN_BULK_RETRY_MAX_MS (int >= base)
Bad values throw GBrainError with paste-ready fix hints at doctor startup,
not at first-retry mid-cycle.
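The three constraints above can be validated up front in a few lines. This is a sketch under stated assumptions: `ConfigError` stands in for the project's `GBrainError` (whose constructor is not shown), `resolveBulkRetryOpts` is a hypothetical name, and the fallback values are the `BULK_RETRY_OPTS` defaults quoted earlier.

```typescript
// Stand-in for GBrainError: carries a paste-ready hint next to the message.
class ConfigError extends Error {
  readonly hint: string;
  constructor(message: string, hint: string) {
    super(`${message} (fix: ${hint})`);
    this.name = "ConfigError";
    this.hint = hint;
  }
}

type Env = Record<string, string | undefined>;

// Parses a non-negative integer env var with a lower bound; unset means "use fallback".
function readInt(env: Env, key: string, fallback: number, min: number): number {
  const raw = env[key];
  if (raw === undefined) return fallback;
  if (!/^\d+$/.test(raw) || Number(raw) < min) {
    throw new ConfigError(
      `${key}=${JSON.stringify(raw)} must be an integer >= ${min}`,
      `export ${key}=${fallback}`,
    );
  }
  return Number(raw);
}

function resolveBulkRetryOpts(env: Env): { maxRetries: number; delayMs: number; delayMaxMs: number } {
  const maxRetries = readInt(env, "GBRAIN_BULK_MAX_RETRIES", 3, 0); // 0 disables retries
  const delayMs = readInt(env, "GBRAIN_BULK_RETRY_BASE_MS", 1000, 1);
  const delayMaxMs = readInt(env, "GBRAIN_BULK_RETRY_MAX_MS", 10000, 1);
  if (delayMaxMs < delayMs) {
    throw new ConfigError(
      `GBRAIN_BULK_RETRY_MAX_MS (${delayMaxMs}) is below GBRAIN_BULK_RETRY_BASE_MS (${delayMs})`,
      `export GBRAIN_BULK_RETRY_MAX_MS=${delayMs}`,
    );
  }
  return { maxRetries, delayMs, delayMaxMs };
}
```

Calling `resolveBulkRetryOpts(process.env)` once at startup surfaces a bad value before any batch runs, matching the intent of failing at doctor startup instead of mid-cycle.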
VERIFICATION
============
- bun run verify: 28/28 checks green (includes 2 new lint guards:
check-no-double-retry, check-batch-audit-site)
- bun run test: 11453 pass / 1 pre-existing flake (schema-cli.test.ts —
confirmed by running on clean master, NOT introduced by this wave)
- bun run test:slow: 40/40 including new test/core/retry-stress.slow.test.ts
(100 batches × 30% blip rate × decorrelated jitter, zero row loss)
- bunx tsc --noEmit: 0 errors
REVIEWS
=======
- CEO review (SELECTIVE EXPANSION): 4 cherry-picks proposed, 4 accepted
- Eng review (2 passes): 10 findings, 0 critical gaps, architectural
pivot from per-site to engine-level wrap
- Codex independent review: 23 findings; 10 critical/high absorbed
(decorrelated jitter, 12s backoff window, AbortSignal, idempotency
proof, backfill unification, typed audit-site enum, doctor expiry
thresholds, audit pruning, env validation at doctor startup)
PR #1523 closed and absorbed (@garrytan-agents original extract.ts fix
preserved via co-author trailer; 5 test cases moved to test/core/retry.test.ts
with assertions adjusted for the v0.41.19.0 BULK_RETRY_OPTS defaults).
Co-Authored-By: garrytan-agents <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
garrytan added a commit that referenced this pull request on May 27, 2026:

…ems #6 + #7) (#1544)

* feat(doctor): doctor-categories foundation - BRAIN/SKILL/OPS/META sets + drift guard

Categorizes every doctor check name into exactly one of four categories. Exported constants + categorizeCheck(name) helper are the single source of truth for the v0.41.20.0 brain_checks_score + category_scores + --scope=brain wave. Drift guard test parses doctor.ts source for both inline {name: 'foo'} and helper const name = 'foo' patterns; CI fires if any check name lacks a category.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(doctor): brain_checks_score + category_scores + --scope=brain skip-computation

Extends Check with optional category and DoctorReport with brain_checks_score + category_scores (additive: schema_version stays at 2; back-compat health_score math byte-identical). buildChecks gains --scope=brain with explicit early-skip gates around the SKILL check group (resolver_health + skill_conformance + skill_brain_first + whoknows_health). Sub-second doctor on a brain with thousands of skills. computeDoctorReport tags every check via categorizeCheck() at compute time. Human output leads with the brain figure and renders the weighted BrainHealth.brain_score alongside.

Test seam fix in test/doctor-home-dir-in-worktree.test.ts: the pre-existing fragile JSON parser walked back from "checks" to find the envelope's outer brace; v0.41.20.0's new nested category_scores object broke that heuristic. Anchored on the canonical {"schema_version" envelope prefix instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(status): gbrain status - single-screen brain health dashboard + get_status_snapshot MCP op

NEW gbrain status command (src/commands/status.ts) composes 6 sections:
- sync (per-source last_sync_at + staleness via buildSyncStatusReport)
- cycle (TWO rows: last autopilot-cycle + last autopilot-* of any kind; reflects v0.36.4.0 health-aware autopilot's targeted handler routing; totals read from result.report.totals per the canonical handler shape)
- locks (gbrain_cycle_locks active rows)
- workers (readSupervisorEvents + summarizeCrashes)
- queue (LIVE counts, NO time-window: old stuck jobs are exactly what status surfaces)
- autopilot (PID liveness via kill -0)

Stable --json envelope (schema_version: 1). Exit codes 0=ok / 1=snapshot failed / 2=usage. --section filter.

Thin-client mode routes Sync + Cycle through NEW get_status_snapshot MCP op (admin scope, NOT localOnly; payload deliberately omits Locks / Workers / Queue / Autopilot so feature creep can't quietly widen the admin-scoped data exposure). Local-only sections render "local-only — N/A on remote brain" honestly instead of pretending the local install's empty state is the remote brain's.

CLI dispatch: pre-engine-bind branch for thin-client (no PGLite needed) + engine-connected dispatch case for local mode. CLI-only architecture per codex MAJOR-4 (status owns its own thin-client branch inside runStatus, not routed through op dispatch).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version to v0.41.20.0 + CHANGELOG + TODOS + llms regen

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(doctor-categories): categorize batch_retry_health from v0.41.19.0 Supavisor wave

The drift guard correctly caught a new check introduced by master's v0.41.19.0 Supavisor Retry Cathedral (PR #1537). batch_retry_health surfaces batch-write retry events from the new src/core/audit/batch-retry-audit.ts module: OPS category (infrastructure liveness). This is exactly why the drift guard exists: any future check added to doctor.ts without a category entry fails CI immediately instead of silently degrading to 'meta'.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
garrytan added a commit that referenced this pull request on May 27, 2026:
Resolve VERSION, package.json, CHANGELOG conflicts. v0.41.22.0 stays on top (higher than master's v0.41.20.0); master's v0.41.20.0 entry preserved below in CHANGELOG order. Brought in master's gbrain status + doctor --scope=brain (PR #1544) and the v0.41.19.0 Supavisor Retry Cathedral (PR #1537) cleanly via auto-merge. Typecheck clean. bun run verify: 28/28 checks pass.
mgunnin added a commit to mgunnin/gbrain that referenced this pull request on May 28, 2026:

* upstream/master:
  v0.41.26.1 fix: lock-renewal cathedral - closes ~39 worker crashes/day (supersedes garrytan#1567) (garrytan#1572)
  v0.41.26.0 fix: dream --source + ingest junk titles + emoji-crash (supersedes garrytan#1559, garrytan#1561) (garrytan#1571)
  v0.41.25.0 perf(sync): batched deletes + global page-generation clock (supersedes garrytan#1538) (garrytan#1566)
  v0.41.24.0 fix(conversation-parser): threshold gates + bold-paren-time pattern - 20,167 Circleback messages unblocked (closes garrytan#1533) (garrytan#1543)
  v0.41.23.0 feat: extract operator surfaces + pack-driven extractables (garrytan#1541)
  v0.41.22.1 feat: brainstorm/lsd judge fixes (closes garrytan#1540 end-to-end) (garrytan#1562)
  v0.41.22.0 feat: type-unification cathedral - 94 types → 15 canonical (closes garrytan#1479) (garrytan#1542)
  v0.41.21.0 feat(ops): 5 daily-driver pains fixed in one wave (garrytan#1545)
  v0.41.20.0 feat: gbrain status + doctor --scope=brain (fix wave 2: items garrytan#6 + garrytan#7) (garrytan#1544)
  feat: v0.41.19.0 Supavisor Retry Cathedral (garrytan#1537)
  v0.41.18.0: gbrain onboard - the activation surface gbrain didn't have before (garrytan#1521)
  v0.41.17.0 feat: --workers N on every bulk command + facts dim doctor parity (garrytan#1519)
  v0.41.16.0 feat: conversation parser cathedral + progressive-batch primitive (closes garrytan#1461) (garrytan#1510)
  v0.41.15.0 feat(sync): --timeout + --max-age + partial status (closes garrytan#1472 RFC) (garrytan#1506)
garrytan added a commit that referenced this pull request on Jun 2, 2026:

… - gbrain extract --stale + doctor lag check (#1696) (#1755)

* feat(extract): link/timeline extraction freshness watermark (#1696)

Closes the "imported != curated" gap: plain `gbrain sync` only extracts CHANGED pages, so a brain with autopilot off accumulated a links table that was ~99.7% untyped `mentions` with nothing surfacing it. Adds a per-page freshness watermark (pages.links_extracted_at, migration v112) and three things built on it:

- `gbrain extract --stale [--source-id] [--catch-up] [--dry-run] [--json]`: incremental DB-source link+timeline sweep over pages whose extraction is stale (never extracted, edited since, or extractor version bumped). Small byte-bounded batches, non-swallowing flush, stamp-after-flush so a crash re-extracts idempotently. Stamps with the row's READ updated_at (not now()) so a concurrent edit during the sweep stays stale instead of being lost.
- `links_extraction_lag` doctor check (local + remote): warn-only by default (>20%), hard-fail only via GBRAIN_EXTRACTION_LAG_FAIL_PCT. Vacuous-skip <100 pages; pre-v112 brains graceful-skip.
- `gbrain sync --no-extract` flag + end-of-sync nudge (fires on synced|first_sync|up_to_date so the initial import surfaces its backlog).

Three new BrainEngine methods (countStalePagesForExtraction / listStalePagesForExtraction / markPagesExtractedBatch) with Postgres<->PGLite parity + bootstrap probes. Schema parity: schema.sql + regenerated pglite-schema.ts + schema-embedded.ts + bootstrap-coverage test. Migration v112 (composite (source_id, links_extracted_at) index, no backfill so the real backlog surfaces on first doctor run).

* test(audit): hermetic GBRAIN_AUDIT_DIR override for prune ENOENT case

The "no-op when audit dir does not exist (ENOENT)" case called pruneOldBatchRetryAuditFiles without a GBRAIN_AUDIT_DIR override, so it read the developer's real ~/.gbrain/audit and flaked (kept>0) on any machine with prior gbrain audit history. Point it at a guaranteed-nonexistent temp path so it tests the real missing-dir branch hermetically, matching the file header's "never touches ~/.gbrain/audit" contract. Pre-existing flake (introduced by v0.41.19.0 #1537), unrelated to #1696.

* chore: bump version and changelog (v0.42.2.0)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs: CLAUDE.md key-files entry for the #1696 extract-stale wave + regen llms-full

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Engine-level retry primitive that closes the v0.41.17 production incident: ~3,000 wiki links + timeline entries silently lost per dream cycle on a 16K-page brain because Supavisor's 5-10s circuit-breaker recovery couldn't fit inside the prior single-500ms-retry window.

Architecture (the cathedral move): retry becomes a data-primitive contract, not a caller responsibility. `postgres-engine.ts` + `pglite-engine.ts` self-retry inside `addLinksBatch`, `addTimelineEntriesBatch`, and `upsertChunks`. Every caller, current AND future, inherits retry for free. CI lint guard `scripts/check-no-double-retry.sh` fails the build if anyone re-wraps an engine batch method (prevents 3×3=9 retry amplification).

Codex-hardened defaults: `BULK_RETRY_OPTS = {maxRetries:3, delayMs:1000, delayMaxMs:10000, jitter:'decorrelated'}`. ~12s worst-case wait covers the full Supavisor recovery window. AWS-style decorrelated jitter replaces `'full'` jitter, which allowed near-zero retries that re-hit the still-recovering breaker.

Observability: `~/.gbrain/audit/batch-retry-YYYY-Www.jsonl` records every retry. `gbrain doctor`'s new `batch_retry_health` check reads last 24h (codex H-9: avoid permanent noise from one historical blip) with per-site thresholds. 30-day audit pruning hooked into the cycle's purge phase.

AbortSignal threading: SIGTERM aborts sleeping retries cleanly instead of blocking deploys for up to `delayMaxMs`.

Operator tuning: `GBRAIN_BULK_MAX_RETRIES` / `GBRAIN_BULK_RETRY_BASE_MS` / `GBRAIN_BULK_RETRY_MAX_MS`. `0` disables retries for debugging; bad values surface at `gbrain doctor` startup with paste-ready fix hints.

Test Coverage
64 new test cases across 4 new test files:
- `test/core/retry.test.ts`: 37 cases (jitter math, abort semantics, env-override boundaries, typed audit-site validation)
- `test/audit/batch-retry-audit.test.ts`: 12 cases (privacy posture, corruption tolerance, 30-day pruning)
- `test/doctor-batch-retry.test.ts`: 10 cases (warn/fail thresholds, corrupt-JSONL tolerance, M-10 env validation)
- `test/core/retry-stress.slow.test.ts`: 5 cases (100 batches × 30% blip rate, asserts zero row loss)

Plus PR #1523's original 5 cases preserved via co-author trailer (assertions adjusted for new BULK_RETRY_OPTS defaults).

Tests: 11,432 → 11,496 (+64 new)

Pre-Landing Review
- CEO review (SELECTIVE EXPANSION): 4 cherry-picks proposed, 4 accepted (env override, doctor check, sync/reindex coverage, stress regression).
- Eng review (2 passes): 10 findings, 0 critical gaps. The load-bearing call was the architectural pivot from CEO plan's per-site wrapping to engine-level wrap, which closes the bug class structurally for future callers too.
- Codex independent review: 23 findings; 10 critical/high absorbed (decorrelated jitter, 12s backoff window, AbortSignal, commit-ambiguity proof per primitive, backfill unification, typed audit-site enum, doctor 24h window + thresholds, audit pruning, env validation at doctor startup, per-instance test seam via WeakMap).

Plan Completion
11/11 tasks done (T1-T11). Full record in `~/.claude/plans/system-instruction-you-are-working-smooth-gosling.md`.

Verification Results
- `bun run verify`: 28/28 checks green (includes 2 new lint guards: `check-no-double-retry`, `check-batch-audit-site`)
- `bun run test`: 11,453 pass / 1 pre-existing flake (`schema-cli.test.ts > schema active reports default resolution`, confirmed by running on clean master, NOT introduced by this wave)
- `bun run test:slow`: 40/40 including new `retry-stress.slow.test.ts`
- `bunx tsc --noEmit`: 0 errors

Closes / Absorbs
Closes PR #1523 (@garrytan-agents). Original extract.ts fix preserved verbatim in `withRetry`'s call shape; 5 test cases moved to `test/core/retry.test.ts` with assertions adjusted for the v0.41.19.0 `BULK_RETRY_OPTS` defaults. Co-Authored-By trailer on the commit credits the original work.

Test plan
- `bun run verify` (28 checks)
- `bun run test` (parallel unit suite)
- `bun run test:slow` (includes stress regression)
- `bunx tsc --noEmit` clean
- `gbrain extract all` + grep `~/.gbrain/audit/batch-retry-*.jsonl`

🤖 Generated with Claude Code