Skip to content

fix(extract): exponential backoff for batch-flush retries on pooler circuit-breaker#1523

Closed
garrytan-agents wants to merge 1 commit into
garrytan:masterfrom
garrytan-agents:fix/extract-connection-resilience
Closed

fix(extract): exponential backoff for batch-flush retries on pooler circuit-breaker#1523
garrytan-agents wants to merge 1 commit into
garrytan:masterfrom
garrytan-agents:fix/extract-connection-resilience

Conversation

@garrytan-agents

Copy link
Copy Markdown
Contributor

Problem

On a production brain with 16K+ pages, the extract dream-cycle phase runs 10+ minutes of sustained batch inserts (links + timeline entries, 100-row batches via addLinksBatch / addTimelineEntriesBatch). During these long runs, Supabase Supavisor (session-mode pooler, port 5432) periodically recycles backend connections.

The existing retry mechanism (v0.41.2.1, withRetry) does a single 500ms retry per batch flush. This was designed for PgBouncer's transaction-mode connection recycling where a single retry catches the stale handle. However, when Supavisor's circuit breaker activates (ECIRCUITBREAKER), recovery takes 5-10 seconds — far longer than 500ms. The result:

  • First retry fires at 500ms → circuit breaker still active → batch lost
  • Subsequent batches hit the same circuit breaker → cascade of losses
  • In production, ~3,000+ rows lost per dream cycle (links + timeline entries)
[extract.links_inc] connection blip, retrying 100 rows in 500ms (Connection terminated unexpectedly)
  link batch error (100 rows lost): Connection terminated unexpectedly
  link batch error (100 rows lost): Connection terminated unexpectedly
  timeline batch error (100 rows lost): Connection terminated unexpectedly
  ...repeats for ~30 batches until connection fully dies...

What We Tried

  1. Running phases sequentially with lock clearing — helped with lock contention but didn't fix the batch-loss cascade within a single phase
  2. Switching to Supabase direct connection (port 5432 session mode → db.<ref>.supabase.co:5432 direct) — improved stability but the pooler path is still the default and needs to be robust
  3. Investigated GBRAIN_DISABLE_DIRECT_POOL — the kill-switch forces all queries through the pooler read pool, which is the default path for most operators

Solution

Upgrade withRetry to support exponential backoff with configurable retry count:

withRetry changes (src/commands/extract.ts)

// New opts
interface WithRetryOpts {
  onRetry?: (attempt: number, err: unknown) => void;
  delayMs?: number;      // base delay, default 500
  delayMaxMs?: number;   // cap, default 8000
  maxRetries?: number;   // default 1 (back-compat)
}

// Retry loop: base * 2^attempt, capped at delayMaxMs
for (let attempt = 0; attempt <= maxRetries; attempt++) {
  try { return await fn(); }
  catch (err) {
    if (!isRetryableConnError(err)) throw err;
    if (attempt >= maxRetries) break;
    const delay = Math.min(baseDelay * 2**attempt, maxDelay);
    await sleep(delay);
  }
}

Extract batch-flush config

All 6 batch-flush sites now use:

const BULK_MAX_RETRIES = 3;    // 500ms → 1s → 2s → give up
const BULK_RETRY_BASE_MS = 500;
const BULK_RETRY_MAX_MS = 8_000;

This covers the typical Supavisor circuit-breaker recovery window (5-10s) with total backoff time of ~3.5s before exhausting retries.

Back-compatibility

Default maxRetries=1 preserves the exact v0.41.2.1 behavior for all existing callers (single 500ms retry). Only the extract batch-flush sites opt in to the higher retry count.

Testing

20/20 tests pass in test/extract-batch-retry.test.ts:

  • Existing v0.41.2.1 tests unchanged (default maxRetries=1 behavior)
  • New: maxRetries=3 exhaustion (4 total calls)
  • New: maxRetries=3 recovery on third retry
  • New: exponential backoff timing verification
  • New: delayMaxMs caps delay growth
bun test test/extract-batch-retry.test.ts
20 pass, 0 fail, 47 expect() calls

Files Changed

File Change
src/commands/extract.ts withRetry exponential backoff + bulk retry constants + 6 callsite updates
test/extract-batch-retry.test.ts 5 new test cases for multi-retry + backoff behavior

…ircuit-breaker

On a 16K-page brain, the extract phase runs 10+ minutes of sustained
batch inserts. Supabase Supavisor (session-mode pooler, port 5432) can
recycle the backend connection mid-run; rapid reconnect attempts trigger
the pooler's circuit breaker (ECIRCUITBREAKER) which needs 5-10s to
recover. The existing single 500ms retry (v0.41.2.1) was insufficient
for this recovery window, resulting in ~30% batch loss on large brains.

Changes:
- withRetry: add maxRetries, delayMaxMs opts + exponential backoff
  (base * 2^attempt, capped at delayMaxMs). Default maxRetries=1
  preserves back-compat for all existing callers.
- All 6 extract batch-flush sites: set maxRetries=3 with 500ms base
  and 8s cap (covers the 5-10s circuit-breaker recovery window).
- New tests: multi-retry, exponential backoff timing, delay cap.
- 20/20 tests pass (extract-batch-retry suite).
@garrytan

Copy link
Copy Markdown
Owner

Superseded by v0.41.18.0 Supavisor Retry Cathedral (incoming PR on branch garrytan/fredericton).

The cathedral wave absorbs your fix verbatim and extends it structurally:

  • Architectural pivot: wraps retry INSIDE postgres-engine.ts + pglite-engine.ts batch primitives (addLinksBatch / addTimelineEntriesBatch / upsertChunks) so every caller — current AND future — inherits retry for free without per-site wrapping.
  • Codex-hardened defaults: BULK_RETRY_OPTS becomes {maxRetries:3, delayMs:1000, delayMaxMs:10000, jitter:'decorrelated'}. Total worst-case wait ≈ 12s — covers the full 5-10s Supavisor circuit-breaker recovery window your PR identified, where the original 500/1000/2000 shape could still exhaust at ~3.5s.
  • Cathedral scope: also covers operations.ts:810 (MCP put_page auto-link, hit on every agent write — unprotected today), extract.ts:1298 (--by-mention path), backfill-base.ts (unified retry orchestrator), and sync.ts/reindex.ts/reindex-multimodal.ts.
  • Observability: audit JSONL via the shared audit-writer primitive + batch_retry_health doctor check + stress regression test (100 batches × 30% blip × both engines).
  • AbortSignal: retry sleeps abort cleanly on SIGTERM so deploys aren't blocked.

Your commit subject + co-authorship preserved via Co-Authored-By: @garrytan-agents trailer on the merge commit. Your 5 new test cases move to test/core/retry.test.ts with assertions adjusted for the new defaults.

Plan + 11-task breakdown: see ~/.claude/plans/system-instruction-you-are-working-smooth-gosling.md in the implementer's worktree.

Thank you for the fix and the writeup — the codebase is better for it.

@garrytan garrytan closed this May 27, 2026
garrytan added a commit that referenced this pull request May 27, 2026
Engine-level retry primitive that closes the v0.41.17 production incident
where ~3,000 wiki links + timeline entries were silently lost per dream
cycle on a 16K-page brain. Supavisor's circuit-breaker takes 5-10s to
recover; the prior single-500ms-retry shape couldn't survive it.

ARCHITECTURE
============

Retry becomes a data-primitive contract, not a caller responsibility.
postgres-engine.ts + pglite-engine.ts now self-retry inside addLinksBatch,
addTimelineEntriesBatch, and upsertChunks. Every caller — current AND
future — inherits retry-for-free. CI lint guard `scripts/check-no-double-retry.sh`
fails the build if anyone re-wraps an engine batch method (preventing
3×3=9 retry amplification on incomplete reverts).

CODEX-HARDENED DEFAULTS
=======================

BULK_RETRY_OPTS = {maxRetries:3, delayMs:1000, delayMaxMs:10000,
jitter:'decorrelated'}. Total worst-case wait ≈12s covers full Supavisor
recovery window. Decorrelated jitter (AWS-style uniform(base, prevDelay*3)
capped at maxDelay) replaces 'full' which allowed near-zero retries that
re-hit the still-recovering breaker.

AbortSignal threading from MinionWorker.shutdownAbort.signal through
engine method opts → withRetry → abortableSleep. SIGTERM aborts sleeping
retries instead of blocking deploys for up to delayMaxMs.

OBSERVABILITY
=============

`~/.gbrain/audit/batch-retry-YYYY-Www.jsonl` records every retry event
(success-after-blip AND exhausted-retries). Built on the v0.40.4.0
audit-writer cathedral. Privacy posture: never logs slugs / page IDs /
content (mirrors shell-audit.ts).

`gbrain doctor` learns `batch_retry_health` check. Reads last 24h
(not 7d — codex H-9: avoid permanent noise from one historical blip).
Thresholds: ok (zero or <3 same-site), warn (>=3 same-site OR >=5
cross-site), fail (>=20 sustained breaker). Surfaces bad GBRAIN_BULK_*
env at startup (codex M-10). Corrupt-JSONL tolerant.

30-day audit pruning hooked into the dream cycle's purge phase (codex H-8
— implements the 'pruning convention' for real).

OPERATOR TUNING
===============

GBRAIN_BULK_MAX_RETRIES (int >= 0; 0 disables retries for debugging)
GBRAIN_BULK_RETRY_BASE_MS (int > 0)
GBRAIN_BULK_RETRY_MAX_MS (int >= base)

Bad values throw GBrainError with paste-ready fix hints at doctor startup,
not at first-retry mid-cycle.

VERIFICATION
============

- bun run verify: 28/28 checks green (includes 2 new lint guards:
  check-no-double-retry, check-batch-audit-site)
- bun run test: 11453 pass / 1 pre-existing flake (schema-cli.test.ts —
  confirmed by running on clean master, NOT introduced by this wave)
- bun run test:slow: 40/40 including new test/core/retry-stress.slow.test.ts
  (100 batches × 30% blip rate × decorrelated jitter, zero row loss)
- bunx tsc --noEmit: 0 errors

REVIEWS
=======

- CEO review (SELECTIVE EXPANSION): 4 cherry-picks proposed, 4 accepted
- Eng review (2 passes): 10 findings, 0 critical gaps, architectural
  pivot from per-site to engine-level wrap
- Codex independent review: 23 findings; 10 critical/high absorbed
  (decorrelated jitter, 12s backoff window, AbortSignal, idempotency
  proof, backfill unification, typed audit-site enum, doctor expiry
  thresholds, audit pruning, env validation at doctor startup)

PR #1523 closed and absorbed (@garrytan-agents original extract.ts fix
preserved via co-author trailer; 5 test cases moved to test/core/retry.test.ts
with assertions adjusted for the v0.41.19.0 BULK_RETRY_OPTS defaults).

Co-Authored-By: garrytan-agents <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
garrytan added a commit that referenced this pull request May 27, 2026
Engine-level retry primitive that closes the v0.41.17 production incident
where ~3,000 wiki links + timeline entries were silently lost per dream
cycle on a 16K-page brain. Supavisor's circuit-breaker takes 5-10s to
recover; the prior single-500ms-retry shape couldn't survive it.

ARCHITECTURE
============

Retry becomes a data-primitive contract, not a caller responsibility.
postgres-engine.ts + pglite-engine.ts now self-retry inside addLinksBatch,
addTimelineEntriesBatch, and upsertChunks. Every caller — current AND
future — inherits retry-for-free. CI lint guard `scripts/check-no-double-retry.sh`
fails the build if anyone re-wraps an engine batch method (preventing
3×3=9 retry amplification on incomplete reverts).

CODEX-HARDENED DEFAULTS
=======================

BULK_RETRY_OPTS = {maxRetries:3, delayMs:1000, delayMaxMs:10000,
jitter:'decorrelated'}. Total worst-case wait ≈12s covers full Supavisor
recovery window. Decorrelated jitter (AWS-style uniform(base, prevDelay*3)
capped at maxDelay) replaces 'full' which allowed near-zero retries that
re-hit the still-recovering breaker.

AbortSignal threading from MinionWorker.shutdownAbort.signal through
engine method opts → withRetry → abortableSleep. SIGTERM aborts sleeping
retries instead of blocking deploys for up to delayMaxMs.

OBSERVABILITY
=============

`~/.gbrain/audit/batch-retry-YYYY-Www.jsonl` records every retry event
(success-after-blip AND exhausted-retries). Built on the v0.40.4.0
audit-writer cathedral. Privacy posture: never logs slugs / page IDs /
content (mirrors shell-audit.ts).

`gbrain doctor` learns `batch_retry_health` check. Reads last 24h
(not 7d — codex H-9: avoid permanent noise from one historical blip).
Thresholds: ok (zero or <3 same-site), warn (>=3 same-site OR >=5
cross-site), fail (>=20 sustained breaker). Surfaces bad GBRAIN_BULK_*
env at startup (codex M-10). Corrupt-JSONL tolerant.

30-day audit pruning hooked into the dream cycle's purge phase (codex H-8
— implements the 'pruning convention' for real).

OPERATOR TUNING
===============

GBRAIN_BULK_MAX_RETRIES (int >= 0; 0 disables retries for debugging)
GBRAIN_BULK_RETRY_BASE_MS (int > 0)
GBRAIN_BULK_RETRY_MAX_MS (int >= base)

Bad values throw GBrainError with paste-ready fix hints at doctor startup,
not at first-retry mid-cycle.

VERIFICATION
============

- bun run verify: 28/28 checks green (includes 2 new lint guards:
  check-no-double-retry, check-batch-audit-site)
- bun run test: 11453 pass / 1 pre-existing flake (schema-cli.test.ts —
  confirmed by running on clean master, NOT introduced by this wave)
- bun run test:slow: 40/40 including new test/core/retry-stress.slow.test.ts
  (100 batches × 30% blip rate × decorrelated jitter, zero row loss)
- bunx tsc --noEmit: 0 errors

REVIEWS
=======

- CEO review (SELECTIVE EXPANSION): 4 cherry-picks proposed, 4 accepted
- Eng review (2 passes): 10 findings, 0 critical gaps, architectural
  pivot from per-site to engine-level wrap
- Codex independent review: 23 findings; 10 critical/high absorbed
  (decorrelated jitter, 12s backoff window, AbortSignal, idempotency
  proof, backfill unification, typed audit-site enum, doctor expiry
  thresholds, audit pruning, env validation at doctor startup)

PR #1523 closed and absorbed (@garrytan-agents original extract.ts fix
preserved via co-author trailer; 5 test cases moved to test/core/retry.test.ts
with assertions adjusted for the v0.41.19.0 BULK_RETRY_OPTS defaults).

Co-authored-by: garrytan-agents <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants