fix(extract): exponential backoff for batch-flush retries on pooler circuit-breaker#1523
Closed
garrytan-agents wants to merge 1 commit into
Closed
Conversation
…ircuit-breaker On a 16K-page brain, the extract phase runs 10+ minutes of sustained batch inserts. Supabase Supavisor (session-mode pooler, port 5432) can recycle the backend connection mid-run; rapid reconnect attempts trigger the pooler's circuit breaker (ECIRCUITBREAKER) which needs 5-10s to recover. The existing single 500ms retry (v0.41.2.1) was insufficient for this recovery window, resulting in ~30% batch loss on large brains. Changes: - withRetry: add maxRetries, delayMaxMs opts + exponential backoff (base * 2^attempt, capped at delayMaxMs). Default maxRetries=1 preserves back-compat for all existing callers. - All 6 extract batch-flush sites: set maxRetries=3 with 500ms base and 8s cap (covers the 5-10s circuit-breaker recovery window). - New tests: multi-retry, exponential backoff timing, delay cap. - 20/20 tests pass (extract-batch-retry suite).
Owner
|
Superseded by v0.41.18.0 Supavisor Retry Cathedral (incoming PR on branch The cathedral wave absorbs your fix verbatim and extends it structurally:
Your commit subject + co-authorship preserved via Plan + 11-task breakdown: see Thank you for the fix and the writeup — the codebase is better for it. |
6 tasks
garrytan
added a commit
that referenced
this pull request
May 27, 2026
Engine-level retry primitive that closes the v0.41.17 production incident
where ~3,000 wiki links + timeline entries were silently lost per dream
cycle on a 16K-page brain. Supavisor's circuit-breaker takes 5-10s to
recover; the prior single-500ms-retry shape couldn't survive it.
ARCHITECTURE
============
Retry becomes a data-primitive contract, not a caller responsibility.
postgres-engine.ts + pglite-engine.ts now self-retry inside addLinksBatch,
addTimelineEntriesBatch, and upsertChunks. Every caller — current AND
future — inherits retry-for-free. CI lint guard `scripts/check-no-double-retry.sh`
fails the build if anyone re-wraps an engine batch method (preventing
3×3=9 retry amplification on incomplete reverts).
CODEX-HARDENED DEFAULTS
=======================
BULK_RETRY_OPTS = {maxRetries:3, delayMs:1000, delayMaxMs:10000,
jitter:'decorrelated'}. Total worst-case wait ≈12s covers full Supavisor
recovery window. Decorrelated jitter (AWS-style uniform(base, prevDelay*3)
capped at maxDelay) replaces 'full' which allowed near-zero retries that
re-hit the still-recovering breaker.
AbortSignal threading from MinionWorker.shutdownAbort.signal through
engine method opts → withRetry → abortableSleep. SIGTERM aborts sleeping
retries instead of blocking deploys for up to delayMaxMs.
OBSERVABILITY
=============
`~/.gbrain/audit/batch-retry-YYYY-Www.jsonl` records every retry event
(success-after-blip AND exhausted-retries). Built on the v0.40.4.0
audit-writer cathedral. Privacy posture: never logs slugs / page IDs /
content (mirrors shell-audit.ts).
`gbrain doctor` learns `batch_retry_health` check. Reads last 24h
(not 7d — codex H-9: avoid permanent noise from one historical blip).
Thresholds: ok (zero or <3 same-site), warn (>=3 same-site OR >=5
cross-site), fail (>=20 sustained breaker). Surfaces bad GBRAIN_BULK_*
env at startup (codex M-10). Corrupt-JSONL tolerant.
30-day audit pruning hooked into the dream cycle's purge phase (codex H-8
— implements the 'pruning convention' for real).
OPERATOR TUNING
===============
GBRAIN_BULK_MAX_RETRIES (int >= 0; 0 disables retries for debugging)
GBRAIN_BULK_RETRY_BASE_MS (int > 0)
GBRAIN_BULK_RETRY_MAX_MS (int >= base)
Bad values throw GBrainError with paste-ready fix hints at doctor startup,
not at first-retry mid-cycle.
VERIFICATION
============
- bun run verify: 28/28 checks green (includes 2 new lint guards:
check-no-double-retry, check-batch-audit-site)
- bun run test: 11453 pass / 1 pre-existing flake (schema-cli.test.ts —
confirmed by running on clean master, NOT introduced by this wave)
- bun run test:slow: 40/40 including new test/core/retry-stress.slow.test.ts
(100 batches × 30% blip rate × decorrelated jitter, zero row loss)
- bunx tsc --noEmit: 0 errors
REVIEWS
=======
- CEO review (SELECTIVE EXPANSION): 4 cherry-picks proposed, 4 accepted
- Eng review (2 passes): 10 findings, 0 critical gaps, architectural
pivot from per-site to engine-level wrap
- Codex independent review: 23 findings; 10 critical/high absorbed
(decorrelated jitter, 12s backoff window, AbortSignal, idempotency
proof, backfill unification, typed audit-site enum, doctor expiry
thresholds, audit pruning, env validation at doctor startup)
PR #1523 closed and absorbed (@garrytan-agents original extract.ts fix
preserved via co-author trailer; 5 test cases moved to test/core/retry.test.ts
with assertions adjusted for the v0.41.19.0 BULK_RETRY_OPTS defaults).
Co-Authored-By: garrytan-agents <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
garrytan
added a commit
that referenced
this pull request
May 27, 2026
Engine-level retry primitive that closes the v0.41.17 production incident
where ~3,000 wiki links + timeline entries were silently lost per dream
cycle on a 16K-page brain. Supavisor's circuit-breaker takes 5-10s to
recover; the prior single-500ms-retry shape couldn't survive it.
ARCHITECTURE
============
Retry becomes a data-primitive contract, not a caller responsibility.
postgres-engine.ts + pglite-engine.ts now self-retry inside addLinksBatch,
addTimelineEntriesBatch, and upsertChunks. Every caller — current AND
future — inherits retry-for-free. CI lint guard `scripts/check-no-double-retry.sh`
fails the build if anyone re-wraps an engine batch method (preventing
3×3=9 retry amplification on incomplete reverts).
CODEX-HARDENED DEFAULTS
=======================
BULK_RETRY_OPTS = {maxRetries:3, delayMs:1000, delayMaxMs:10000,
jitter:'decorrelated'}. Total worst-case wait ≈12s covers full Supavisor
recovery window. Decorrelated jitter (AWS-style uniform(base, prevDelay*3)
capped at maxDelay) replaces 'full' which allowed near-zero retries that
re-hit the still-recovering breaker.
AbortSignal threading from MinionWorker.shutdownAbort.signal through
engine method opts → withRetry → abortableSleep. SIGTERM aborts sleeping
retries instead of blocking deploys for up to delayMaxMs.
OBSERVABILITY
=============
`~/.gbrain/audit/batch-retry-YYYY-Www.jsonl` records every retry event
(success-after-blip AND exhausted-retries). Built on the v0.40.4.0
audit-writer cathedral. Privacy posture: never logs slugs / page IDs /
content (mirrors shell-audit.ts).
`gbrain doctor` learns `batch_retry_health` check. Reads last 24h
(not 7d — codex H-9: avoid permanent noise from one historical blip).
Thresholds: ok (zero or <3 same-site), warn (>=3 same-site OR >=5
cross-site), fail (>=20 sustained breaker). Surfaces bad GBRAIN_BULK_*
env at startup (codex M-10). Corrupt-JSONL tolerant.
30-day audit pruning hooked into the dream cycle's purge phase (codex H-8
— implements the 'pruning convention' for real).
OPERATOR TUNING
===============
GBRAIN_BULK_MAX_RETRIES (int >= 0; 0 disables retries for debugging)
GBRAIN_BULK_RETRY_BASE_MS (int > 0)
GBRAIN_BULK_RETRY_MAX_MS (int >= base)
Bad values throw GBrainError with paste-ready fix hints at doctor startup,
not at first-retry mid-cycle.
VERIFICATION
============
- bun run verify: 28/28 checks green (includes 2 new lint guards:
check-no-double-retry, check-batch-audit-site)
- bun run test: 11453 pass / 1 pre-existing flake (schema-cli.test.ts —
confirmed by running on clean master, NOT introduced by this wave)
- bun run test:slow: 40/40 including new test/core/retry-stress.slow.test.ts
(100 batches × 30% blip rate × decorrelated jitter, zero row loss)
- bunx tsc --noEmit: 0 errors
REVIEWS
=======
- CEO review (SELECTIVE EXPANSION): 4 cherry-picks proposed, 4 accepted
- Eng review (2 passes): 10 findings, 0 critical gaps, architectural
pivot from per-site to engine-level wrap
- Codex independent review: 23 findings; 10 critical/high absorbed
(decorrelated jitter, 12s backoff window, AbortSignal, idempotency
proof, backfill unification, typed audit-site enum, doctor expiry
thresholds, audit pruning, env validation at doctor startup)
PR #1523 closed and absorbed (@garrytan-agents original extract.ts fix
preserved via co-author trailer; 5 test cases moved to test/core/retry.test.ts
with assertions adjusted for the v0.41.19.0 BULK_RETRY_OPTS defaults).
Co-authored-by: garrytan-agents <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
On a production brain with 16K+ pages, the
extractdream-cycle phase runs 10+ minutes of sustained batch inserts (links + timeline entries, 100-row batches viaaddLinksBatch/addTimelineEntriesBatch). During these long runs, Supabase Supavisor (session-mode pooler, port 5432) periodically recycles backend connections.The existing retry mechanism (v0.41.2.1,
withRetry) does a single 500ms retry per batch flush. This was designed for PgBouncer's transaction-mode connection recycling where a single retry catches the stale handle. However, when Supavisor's circuit breaker activates (ECIRCUITBREAKER), recovery takes 5-10 seconds — far longer than 500ms. The result:What We Tried
db.<ref>.supabase.co:5432direct) — improved stability but the pooler path is still the default and needs to be robustGBRAIN_DISABLE_DIRECT_POOL— the kill-switch forces all queries through the pooler read pool, which is the default path for most operatorsSolution
Upgrade
withRetryto support exponential backoff with configurable retry count:withRetrychanges (src/commands/extract.ts)Extract batch-flush config
All 6 batch-flush sites now use:
This covers the typical Supavisor circuit-breaker recovery window (5-10s) with total backoff time of ~3.5s before exhausting retries.
Back-compatibility
Default
maxRetries=1preserves the exact v0.41.2.1 behavior for all existing callers (single 500ms retry). Only the extract batch-flush sites opt in to the higher retry count.Testing
20/20 tests pass in
test/extract-batch-retry.test.ts:maxRetries=3exhaustion (4 total calls)maxRetries=3recovery on third retrydelayMaxMscaps delay growthFiles Changed
src/commands/extract.tswithRetryexponential backoff + bulk retry constants + 6 callsite updatestest/extract-batch-retry.test.ts