fix: batch retry for PgBouncer connection drops + dream.* DB config merge by garrytan-agents · Pull Request #1416 · garrytan/gbrain

garrytan-agents · 2026-05-25T07:25:46Z

Problem

Two reliability issues surfaced during v0.41.2.0 upgrade testing on a production brain (~96K pages, Supabase PgBouncer on port 6543):

1. ~30% batch data loss during extract cycles

The extract command's 6 flush() functions call engine.addLinksBatch() / engine.addTimelineEntriesBatch() inside a try/catch. When PgBouncer recycles a backend connection between queries, the batch throws "No database connection: connect() has not been called" and the catch block drops the entire batch (100 rows) without retrying. On a 15K-page brain, this lost ~4,500 link rows and ~3,000 timeline rows per cycle.

2. `dream.*` config invisible to cycle phases

loadConfigWithEngine() merges DB-stored config on top of file-plane config for embedding_* and content_sanity.*, but NOT for dream.synthesize.* or dream.patterns.*. So `gbrain config set dream.synthesize.session_corpus_dir /path` writes successfully to the DB and `gbrain config get` reads it back, but cycle phases calling loadConfig() (or even loadConfigWithEngine()) never see it. As a result, the extract_atoms phase silently skips with "no transcripts to process."

Error Log

[extract.links_fs] 2686/15767 (17%)
  batch error (100 link rows lost): No database connection: connect() has not been called.
[extract.links_fs] 2844/15767 (18%)
  batch error (100 link rows lost): No database connection: connect() has not been called.
...
[extract.timeline_fs] 15642/15767 (99%)
  batch error (100 timeline rows lost): No database connection: connect() has not been called.

Pattern: intermittent, affects 20-40% of batches during heavy cycles. More frequent at higher page counts (connection pool pressure).

Solution

Fix 1: `withRetry()` helper for batch inserts

New function catches connection-class errors (messages containing "No database connection", "connect()", or "Connection terminated") and retries once after 500ms. Non-connection errors (constraint violations, etc.) are NOT retried; they propagate immediately.
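The classifier behind that rule is not shown in this description. A minimal sketch, assuming a plain substring match on the three patterns named above (the constant name and the matching strategy are assumptions, not the PR's actual code):

```typescript
// Substrings taken from the PR description; matching by `includes` is an assumption.
const CONNECTION_ERROR_PATTERNS: readonly string[] = [
  "No database connection",
  "connect()",
  "Connection terminated",
];

// Returns true when the error message looks like a dropped or missing connection.
function isConnectionError(err: unknown): boolean {
  const message = err instanceof Error ? err.message : String(err);
  return CONNECTION_ERROR_PATTERNS.some((pattern) => message.includes(pattern));
}
```

A substring match is easy to reason about but brittle: any driver message that happens to contain "connect()" would be treated as retryable.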

Applied to all 6 flush() functions in extract.ts. The batch array is snapshot-copied before clearing, so the retry operates on the same data.

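// Runs fn once; on a connection-class error, waits 500 ms and runs it one more time.
// label and jsonMode are unused in this excerpt.
async function withRetry<T>(fn: () => Promise<T>, label: string, jsonMode: boolean): Promise<T> {
  try { return await fn(); }
  catch (firstErr) {
    // Non-connection errors (e.g. constraint violations) propagate immediately.
    if (!isConnectionError(firstErr)) throw firstErr;
    await new Promise(r => setTimeout(r, 500));
    return await fn(); // one retry; a second failure propagates to the caller
  }
}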
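The snapshot-before-clear behaviour at each flush site could look roughly like the sketch below. LinkRow, InsertBatch, makeFlusher, and retryOnce are hypothetical names for illustration; the real flush() functions and engine API in extract.ts are not shown in this PR description.

```typescript
// Hypothetical row type and insert function standing in for engine.addLinksBatch().
type LinkRow = { from: string; to: string };
type InsertBatch = (rows: LinkRow[]) => Promise<void>;

// Retry-once helper with the same shape as withRetry() above, with the delay made configurable.
async function retryOnce<T>(
  fn: () => Promise<T>,
  isRetryable: (err: unknown) => boolean,
  delayMs: number,
): Promise<T> {
  try {
    return await fn();
  } catch (err) {
    if (!isRetryable(err)) throw err;
    await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    return await fn();
  }
}

function makeFlusher(insert: InsertBatch, delayMs = 500) {
  const pending: LinkRow[] = [];

  async function flush(): Promise<number> {
    if (pending.length === 0) return 0;
    // Copy the batch first, then clear the shared array. Both attempts see the
    // same rows, and rows added during the retry delay wait for the next flush.
    const snapshot = pending.slice();
    pending.length = 0;
    await retryOnce(
      () => insert(snapshot),
      (err) => err instanceof Error && err.message.includes("No database connection"),
      delayMs,
    );
    return snapshot.length;
  }

  return {
    add: (row: LinkRow): void => {
      pending.push(row);
    },
    flush,
  };
}
```

If the closure captured the live array instead of the copy, a retry after clearing would insert either nothing or a different set of rows, which is the failure mode the snapshot is meant to rule out.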

Fix 2: `dream.*` DB config merge in `loadConfigWithEngine()`

Added 7 dream.* keys to the sparse-merge block:

dream.synthesize.session_corpus_dir
dream.synthesize.meeting_transcripts_dir
dream.synthesize.verdict_model
dream.synthesize.max_prompt_tokens
dream.synthesize.max_chunks_per_transcript
dream.patterns.lookback_days
dream.patterns.min_evidence

Same precedence as all other merged keys: file/env wins per key; DB fills gaps.
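The per-key precedence can be illustrated with a small sketch. It assumes a flat, dotted-key config shape and a hypothetical sparseMerge() helper; the actual merge block inside loadConfigWithEngine() may be structured differently.

```typescript
type ConfigValue = string | number | boolean;
// Flat dotted-key map; this shape is an assumption made for the sketch.
type FlatConfig = Record<string, ConfigValue | undefined>;

// The seven keys listed in this PR.
const DREAM_KEYS: readonly string[] = [
  "dream.synthesize.session_corpus_dir",
  "dream.synthesize.meeting_transcripts_dir",
  "dream.synthesize.verdict_model",
  "dream.synthesize.max_prompt_tokens",
  "dream.synthesize.max_chunks_per_transcript",
  "dream.patterns.lookback_days",
  "dream.patterns.min_evidence",
];

// For each listed key, keep the file-plane value when present; otherwise take the DB value.
function sparseMerge(
  fileConfig: FlatConfig,
  dbConfig: FlatConfig,
  keys: readonly string[],
): FlatConfig {
  const merged: FlatConfig = { ...fileConfig };
  for (const key of keys) {
    if (merged[key] === undefined && dbConfig[key] !== undefined) {
      merged[key] = dbConfig[key];
    }
  }
  return merged;
}

const merged = sparseMerge(
  { "dream.patterns.lookback_days": 30 },
  { "dream.patterns.lookback_days": 7, "dream.synthesize.session_corpus_dir": "/path" },
  DREAM_KEYS,
);
// merged["dream.patterns.lookback_days"] is 30 (file wins);
// merged["dream.synthesize.session_corpus_dir"] is "/path" (DB fills the gap).
```

Keys outside the list are never copied from the DB, which is why the dream.* values were invisible before they were added to the merge block.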

Testing

5 new tests for withRetry: pass-through, retry on connection errors, retry on "Connection terminated", no retry on constraint violations, propagation on double failure. All pass.

Companion PR

This is the companion to #1414 (generalize extract_atoms — page-based extraction). That PR adds resolveConfigStr() as a per-call-site fallback; this PR fixes the systemic gap in loadConfigWithEngine() so all callers benefit.

…erge

Two production reliability fixes from upgrade testing on a 96K-page brain:

1. Batch insert retry (extract.ts): PgBouncer transaction-mode poolers recycle backend connections between queries. During heavy extract cycles (15K+ pages), this surfaces as 'No database connection: connect() has not been called' on ~30% of batch inserts, silently losing link and timeline rows. Added withRetry() helper that catches connection-class errors and retries once after 500ms. Non-connection errors (constraint violations, etc.) are NOT retried. Applied to all 6 flush functions in extract.ts. The batch snapshot is taken before clearing the array, so the retry operates on the same data regardless of whether new items accumulated during the delay.

2. dream.* DB config merge (config.ts): loadConfigWithEngine() merges DB-stored config on top of file-plane config for embedding_*, content_sanity.*, etc. But dream.synthesize.* and dream.patterns.* keys were missing from the merge, so cycle phases (extract_atoms, synthesize) that read corpus dirs from config would silently skip when those keys were set via `gbrain config set` (DB plane) but not in ~/.gbrain/config.json (file plane). Added dream.synthesize.{session_corpus_dir, meeting_transcripts_dir, verdict_model, max_prompt_tokens, max_chunks_per_transcript} and dream.patterns.{lookback_days, min_evidence} to the merge block. Same precedence: file/env wins per key; DB fills gaps.

5 new tests for withRetry (pass-through, retry on connection errors, no retry on constraint errors, propagation on double failure).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

garrytan · 2026-05-25T18:33:45Z

Closing in favor of fix-wave incorporation on garrytan/pr-1414-1416-1421 (shipping as v0.41.2.1). Rebuilt with /plan-eng-review structural improvements:

withRetry is a pure exported primitive with an onRetry callback (testable in isolation) plus a shared logBatchRetry() helper for the 6 flush sites.
Classifier is the existing isRetryableConnError from src/core/retry-matcher.ts (8-pattern production-grade, extended with the typed GBrainError{problem:'No database connection'} shape) rather than the inline 3-substring match.
Snapshot-before-clear contract preserved per your PR description and pinned by a forced-mutation regression test.
dream.* DB-config merge landed in loadConfigWithEngine() as you proposed; precedence is file > DB > defaults (no env layer, since there are no GBRAIN_DREAM_* env vars; the claim is updated in CHANGELOG to match reality).
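To make the first point concrete, a retry primitive with an onRetry callback might be shaped like the sketch below. The RetryOptions interface and parameter names are assumptions for illustration; the signature that actually shipped in src/commands/extract.ts is not shown in this thread.

```typescript
// Hypothetical options shape; the merged withRetry signature may differ.
interface RetryOptions {
  isRetryable: (err: unknown) => boolean; // classifier, e.g. a matcher like isRetryableConnError
  delayMs?: number; // wait before the single retry (defaults to 500)
  onRetry?: (err: unknown) => void; // observer hook, e.g. for logging
}

// Pure primitive: no logging or engine access of its own, so it can be tested in isolation.
async function withRetry<T>(fn: () => Promise<T>, opts: RetryOptions): Promise<T> {
  try {
    return await fn();
  } catch (firstErr) {
    if (!opts.isRetryable(firstErr)) throw firstErr;
    opts.onRetry?.(firstErr);
    await new Promise<void>((resolve) => setTimeout(resolve, opts.delayMs ?? 500));
    return await fn(); // a second failure propagates to the caller
  }
}
```

Passing the classifier and the logging hook in as arguments keeps the helper free of side effects, so a test can count retries or inject a fake classifier without a database.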

Credit on the merged commit.

@garrytan-agents

…mpotency + ze-switch env-gate (#1445)

* fix-wave: dream.* DB merge + batch retry + extract_atoms idempotency + ze-switch env-gate + doctor check

Closes PRs #1414, #1416, #1421 (rebuilt from designs by @garrytan-agents with structural improvements from /plan-eng-review + codex outside-voice).

Three production reliability fixes in one wave:

1. dream.* DB-config merge (closes PR #1416 silent-config gap)
- loadConfigWithEngine() sparse-merge extends with 7 dream.* keys
- File > DB > defaults precedence (no GBRAIN_DREAM_* env vars)
- extract-atoms switches to loadConfigWithEngine() so DB-plane keys reach it

2. Batch retry on transient connection drops (closes PR #1416 ~30%-loss bug)
- withRetry() pure primitive exported from src/commands/extract.ts
- 6 flush() sites snapshot-before-clear with onRetry callback
- Reuses isRetryableConnError from src/core/retry-matcher.ts
- retry-matcher extended with GBrainError{problem:'No database connection'}

3. extract_atoms source-hash idempotency + page-based discovery (closes #1414)
- One raw SQL with NOT EXISTS subquery replaces 6 listPages + N atom checks
- sourceId threaded through every putPage call (codex caught real bug)
- NULL content_hash filter + dream_generated exclusion + transcript-side idempotency
- cycle.ts passes union of syncPagesAffected + synthesizeWrittenSlugs

4. ze-switch pre-apply + pre-resume env-override gate (closes PR #1421)
- Gate fires FIRST in apply AND resume; zero setConfig calls on refusal
- ASCII warning box (no Unicode per repo D10)
- --ignore-env-override escape hatch for power users
- ApplyResult extended with refused variant

5. doctor embedding_env_override check (defense-in-depth for #1421)
- Cross-surface parity: buildChecks() + doctorReportRemote()
- Uses Check.details (not Check.issues per codex schema review)

Co-Authored-By: garrytan-agents <garrytan-agents@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.41.10.0)

Adds 61 new tests across 5 new files pinning the fix-wave contracts:
- test/extract-batch-retry.test.ts (16 cases): withRetry primitive + snapshot contract
- test/extract-atoms-page-discovery.test.ts (17 cases): discovery SQL + dual-source idempotency
- test/ze-switch-env-override.test.ts (17 cases): env-gate apply + resume + ZERO-setConfig assertion
- test/doctor-embedding-env-override.test.ts (7 cases): cross-surface parity
- test/e2e/extract-atoms-discovery-sql.test.ts (4 cases): real-Postgres parity for raw SQL

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(test): pin gateway to 1536-dim in 2 PGLite tests that hardcode 1536-vector inserts

CI shards 1 + 4 failed persistently (not flake; confirmed via retry) after the v0.41.6.0 merge with this error:

error: expected 1280 dimensions, not 1536 file: "vector.c", routine: "CheckExpectedDim"

Two test files insert 1536-dim Float32Array vectors into `content_chunks.embedding` / `facts.embedding`, but v0.41.5.0 flipped `DEFAULT_EMBEDDING_DIMENSIONS` from 1536 to 1280 (ZE Matryoshka default). On a fresh CI bun process where no prior test pre-configured the gateway, `initSchema()` sizes the vector column at vector(1280) and the inserts throw. Locally this is hidden when an earlier test file in the shard happens to have called `configureGateway({embedding_dimensions: 1536})`; that state leaks forward through bun's shared process. The v0.41.6.0 LPT shard re-balancing reordered files so these two ran cold, surfacing the latent bug.

Fix follows the canonical hermetic pattern from test/consolidate-valid-until.test.ts:23-34: pin the gateway to 1536d in beforeAll, reset in afterAll. Test is now isolated from shard ordering.

test/search-types-filter.test.ts: shard 1 fail
test/operations-find-trajectory.test.ts: shard 4 (6 fails)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: empty commit to trigger CI

* chore: trigger CI again

* chore: renumber v0.41.10.0 -> v0.41.10.1

Per request: version slot moved to .1 micro tier to leave .0 available for unrelated wave landing on master.

---------

Co-authored-by: garrytan-agents <garrytan-agents@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

garrytan mentioned this pull request May 25, 2026

feat(cycle): generalize extract_atoms — page-based extraction + config-plane fallback #1414

Closed

garrytan closed this May 25, 2026

garrytan mentioned this pull request May 25, 2026

v0.41.10.1 fix-wave: dream.* config + batch retry + extract_atoms idempotency + ze-switch env-gate #1445

Merged

5 tasks

jalagrange mentioned this pull request May 28, 2026

fix(retry): reconnect engine pool in batchRetry onRetry — singleton-null repair #1593

Closed

garrytan-agents wants to merge 1 commit into garrytan:master from garrytan-agents:fix/upgrade-reliability-bundle
