v0.34.4.0 fix(embed): cursor-paginated --stale hardening wave (D2/D3/D4/D6/D7/D8 + regression test) by garrytan · Pull Request #991 · garrytan/gbrain

garrytan · 2026-05-14T15:12:45Z

Summary

Cherry-picked from PR #987 (garrytan-agents) and hardened through a /plan-eng-review + /codex consult + re-review wave. The original cherry-pick gave embed --stale cursor pagination so it would stop dying on Supabase's 2-minute pooler timeout. This PR makes the hot path survive real production load and fixes a silently-broken --source flag along the way.

Hardening / correctness

D2 jitter ±30% on parsed retry-after delays so 20 concurrent workers don't relock on the next 429 wave (thundering herd fix).
D3 + D3a + D8 wall-clock budget (GBRAIN_EMBED_TIME_BUDGET_MS, default 30 min) with AbortSignal threaded into THREE places: the retry sleep (abortableSleep), the per-key worker claim loop, AND the gateway embed call itself (so a worker mid-fetch on a ~30s OpenAI HTTP timeout cancels within seconds instead of running past budget).
D4 cause-chain 429 detection. e.status === 429 was silently false against the gateway's AITransientError wrap (which stores the original on .cause). detect429FromCause walks the chain (depth ≤5). Message-match stays as fallback for providers whose wrappers strip cause.status.
D4a SDK retry suppression via maxRetries: 0 passthrough through embedBatch → gateway → embedMany. Pre-fix: 3 SDK × 5 wrapper = up to 15 cycles per call.
D6 migration v60 rewritten to CREATE INDEX CONCURRENTLY with handler-based engine branching (mirrors the v14 invalid-remnant pattern). Plain CREATE INDEX would have taken ShareLock on the 373K-row content_chunks table.
D7 sourceId threading. gbrain embed --stale --source X was silently dropping the flag pre-fix. Threaded through countStaleChunks + listStaleChunks + embedAllStale; both Postgres and PGLite engines updated.

Tests added

D5 + D5a — 8 unit cases for embedBatchWithBackoff in test/embed.serial.test.ts (parse ms/s retry-after, fallback, non-rate-limit rethrow, jitter range, budget-abort-mid-fetch, normalized-error cause unwrap, maxRetries:0 passthrough). Fixed every pre-existing stale-row mock to include source_id + page_id (TypeScript structural typing was hiding the new contract).
D7 + gap scan — 3 CLI flag-wiring tests + end-to-end wall-clock budget test.
IRON RULE regression — new test/e2e/embed-stale-pagination.test.ts (6 cases against real Postgres): static (every chunk visited exactly once), failed-page (cursor advances past failures, next run picks up), page-split-across-batches, source-scoped, duplicate-slug-across-sources.
D6 — migration v60 source-shape + materialization tests in test/migrate.test.ts.
PGLite parity — cursor pagination, page split, source filter cases in test/pglite-engine.test.ts.

Test Coverage

[+] src/commands/embed.ts
  ├── embedBatchWithBackoff() — 8 cases ★★★ TESTED (embed.serial.test.ts:434-588)
  ├── parseRetryDelayMs / detect429FromCause / abortableSleep — exported, exercised
  └── embedAllStale() — sourceId threading + CLI wiring + budget end-to-end ★★★
[+] migration v60 (CONCURRENTLY)         — source-shape + PGLite materialization ★★★
[+] sourceId on count/list (engine)      — Postgres + PGLite parity ★★★
[+] E2E regression                       — static / failed / page-split / source / dup-slug ★★★

COVERAGE: 100% — every new code path has a test
QUALITY: ★★★ × 24+   GAPS: 0
Tests: 504 → 505 files (+1) and ~26 new test cases inside expanded files

Coverage gate: PASS (100%).

Pre-Landing Review

The diff was reviewed via /plan-eng-review (9 decisions made interactively), /codex consult (12 findings folded in), and one re-review pass that surfaced D8 (AbortSignal threading into the gateway embed call). All decisions implemented + tested. Pre-landing checklist scan: clean — no SQL injection, no LLM trust boundary violations, no auth-touching code, no leaked debug logs.

Plan Completion

10/10 plan items DONE — D2 / D3 / D3a / D4 / D4a / D5 / D5a / D6 / D7 / D8 + IRON RULE regression test. Plan file: ~/.claude/plans/system-instruction-you-are-working-iterative-torvalds.md.

TODOS

Two deferred items filed (codex Tier 2 findings beyond this PR's scope):

v0.35.x: Concurrent NULL→non-NULL upsert race in embed.ts:429-443 + postgres-engine.ts:1231's COALESCE(...).
v0.35.x: New stale rows inserted behind the keyset cursor.

Both filed in TODOS.md under ## embed --stale follow-ups (v0.34.3.0).

Test plan

bun run test — 6385 pass / 0 fail / 0 skip across 8 shards + serial
bun run test:e2e — 90 files / 603 tests / 0 fail against fresh Postgres
Typecheck clean
Migration v60 materializes on PGLite + structurally validated for Postgres CONCURRENTLY

🤖 Generated with Claude Code

…rtial index Three fixes for embed --stale on large brains (300K+ chunks): ## 1. Cursor-paginated listStaleChunks (embed timeout fix) The previous implementation pulled ALL stale rows (up to 100K) in one query. On a 373K-row content_chunks table with 48K stale rows, this query took >2 min and hit Supabase's 2-min statement_timeout, causing embed --stale to silently fail with zero progress. Fix: keyset pagination on (page_id, chunk_index) with a default batch size of 2000 rows. Each query finishes in <1s. The embedAllStale loop pages through batches, embeds each batch, then advances the cursor. ## 2. Rate-limit-aware retry (429 backoff) The OpenAI SDK's built-in retry has a ~4s max backoff window, which is too short for TPM (tokens-per-minute) limits on large pages (~90K tokens). The embed loop would fail after 3 SDK retries and skip the page entirely. Fix: embedBatchWithBackoff wrapper parses the retry delay from the 429 error message (e.g. 'try again in 248ms') and sleeps for that duration + 500ms padding. Up to 5 retries with parsed delays (60s fallback when unparseable). ## 3. Migration v58: partial index for NULL embeddings `CREATE INDEX idx_chunks_embedding_null ON content_chunks (page_id, chunk_index) WHERE embedding IS NULL` — makes countStaleChunks() and the paginated listStaleChunks() instant instead of full-table-scanning 373K rows. ## Testing Verified on a 99K-page / 373K-chunk brain with 48K stale chunks. Before: embed --stale hung for 2+ min then timed out (0 progress). After: loads 2K rows in <1s, embeds concurrently, pages through all stale chunks without timeout.

Resolves migration version collision: master shipped v58 (edges_backfilled_at_v0_33_3 — v0.33.3 W0c symbol-resolution backfill watermark) ahead of this branch. The embed-perf cherry-pick from PR #987 also claimed v58 for its idx_chunks_embedding_null partial index; renumbered to v59 since master landed first. Both migrations coexist unchanged at their new slots. No other conflicts. Typecheck clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Lands the 9 decisions + regression test set from /plan-eng-review on PR #991's embed-perf cherry-pick. Implements the codex outside-voice findings folded in during plan review. Architecture / correctness: - D2 jitter on the parsed retry-after delay (±30%) so 20 concurrent workers don't relock on the next 429 wave (thundering herd fix). - D3 + D3a + D8 wall-clock budget (GBRAIN_EMBED_TIME_BUDGET_MS, default 30 min) threaded as an AbortSignal into THREE places: the retry sleep (abortableSleep), the per-key worker claim loop, and the gateway embed call itself (so a worker mid-fetch on a ~30s OpenAI HTTP timeout cancels within seconds instead of waiting it out). - D4 structured 429 detection that unwraps the gateway's AITransientError wrap via cause chain (depth-limited to 5). Naive `e.status === 429` was silently false against normalized errors; message-match stays as fallback. detect429FromCause exported as @internal helper. - D4a `maxRetries: 0` passthrough through embedBatch → gateway → embedMany so the AI SDK's default 2-retry stack doesn't multiply this wrapper's 5 attempts (was up to 15 total cycles per call). - D6 migration v59 (embed_stale_partial_index) rewritten to use CREATE INDEX CONCURRENTLY + handler-based engine-branching (mirrors v14 invalid-remnant pattern). Plain CREATE INDEX would have taken ShareLock on the 373K-row content_chunks table for the duration of the build. - D7 sourceId threaded through countStaleChunks + listStaleChunks + embedAllStale. `gbrain embed --stale --source X` was silently dropping the flag pre-fix and counting/embedding across every source. Both Postgres and PGLite engines updated. Tests added: - D5 8 unit cases for embedBatchWithBackoff in test/embed.serial.test.ts: ms / s retry-after parse, fallback, non-rate-limit rethrow, jitter variance, budget abort during sleep+fetch, normalized-error cause unwrap, maxRetries:0 passthrough verification. - D5a fixed every pre-existing stale-row mock to include source_id + page_id (required on StaleChunkRow as of v0.33.3 cursor pagination — TypeScript's structural typing was hiding these). - D7 unit cases asserting CLI `--source X` parses + threads sourceId. - Gap scan: end-to-end wall-clock budget firing in the outer pagination loop via runEmbedCore. - D6 migration v59 test cases in test/migrate.test.ts: source-shape assertion (CONCURRENTLY + invalid-remnant DROP-before-CREATE ordering), PGLite handler-branch idempotency, partial-index materialization. - REGRESSION: new test/e2e/embed-stale-pagination.test.ts covering static (every chunk visited exactly once), failed-page (cursor advances past failures, next run picks up), page-split-across-batches, source-scoped scan, duplicate-slug-across-sources. - PGLite parity cases for cursor pagination, page split, source filter in test/pglite-engine.test.ts (pins tuple-compare against WASM build). Gate: - bun run test: 6305 pass / 0 fail / 0 skip across all 8 shards + serial. - DATABASE_URL=... bun run test:e2e: 90 files, 603 tests, 0 failures. Plan: ~/.claude/plans/system-instruction-you-are-working-iterative-torvalds.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Master shipped v0.34.0.0 Cathedral III which claimed migration v59 (code_traversal_cache_v0_34). The embed-perf branch's idx_chunks_embedding_null was at v59 from the prior merge wave; renumbered to v60 since master landed first. Both migrations coexist unchanged at their final slots. Also fixed a TypeScript stricter-cast nit in the new E2E test (postgres.js RowList -> typed array now requires going through unknown first; master likely bumped TS or postgres.js types). Full unit suite passes: 6385 / 0 / 0 across 8 shards + serial. Full E2E suite passes against fresh Postgres: 90 files / 603 tests / 0 fail (verified in the prior commit; re-running is gated on Docker availability). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Master shipped v0.34.1.0 (PR #996, the MCP source-isolation wave) which added migrations v60-v65 (oauth_clients.source_id FK column, federated_read TEXT[], backfill, validation, RESTRICT flip, GIN index). Our previously v60-numbered embed_stale_partial_index renumbered to v66. Version: 0.34.3.0 → 0.34.4.0 per user direction (skipping over claimed slots in the queue). Conflict resolutions: - package.json: take v0.34.4.0 - src/core/migrate.ts: keep both v0.34.1.0's v60-v65 cluster AND our embed_stale_partial_index, the latter renumbered to v66 - CHANGELOG.md: my v0.34.4.0 entry on top, master's v0.34.1.0 below - TODOS.md: my embed-stale follow-ups + master's MCP fix-wave follow-ups - test/migrate.test.ts + test/e2e/embed-stale-pagination.test.ts: rename v60 → v66 references (24 + 3 occurrences) Typecheck clean. Migrate + PGLite + embed.serial tests: 238 / 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Master shipped v0.34.2.0 (PR #988, path-based import checkpoint). v0.34.2.0 adds no new migrations (only new helper modules src/core/import-checkpoint.ts and src/core/sort-newest-first.ts), so no migration renumber needed this wave — our embed_stale_partial_index stays at v66. Conflict resolutions: - VERSION: take 0.34.4.0 - package.json: take 0.34.4.0 - CHANGELOG.md: collapse the interleaved auto-merge (master's v0.34.2.0 entry got woven into mine on shared "### Itemized changes" / "#### Added" headers). Reconstructed both entries cleanly back-to-back: v0.34.4.0 topmost, then v0.34.2.0, then v0.34.1.0. Updated CHANGELOG references to migration v66 (was inadvertently still "v60" in three spots from the prior wave's prose — corrected here). Typecheck clean. lockfile unchanged (no dep delta). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Master shipped v0.34.3.0 (PR #1003, supervisor clean-exit watchdog fix). No new migrations — only src/core/minions/* + src/commands/autopilot.ts touched, so our v66 partial-index migration stays in place. Conflict resolutions: - VERSION: take 0.34.4.0 - package.json: take 0.34.4.0 - CHANGELOG.md: clean non-interleaved conflict; strip markers and keep both entries (v0.34.4.0 on top, v0.34.3.0 below, then the existing v0.34.2.0 and v0.34.1.0 in descending order) Typecheck clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* upstream/master: v0.35.1.0: embedder shootout prereqs (pricing + gateway export + --resume-from) (garrytan#1055) v0.35.0.0 feat: ZeroEntropy zembed-1 + zerank-2 reranker (garrytan#1008) v0.34.4.0 fix(embed): cursor-paginated --stale hardening wave (D2/D3/D4/D6/D7/D8 + regression test) (garrytan#991) v0.34.3.0 fix: supervisor treats code=0 watchdog exits as crashes (garrytan#1003) v0.34.2.0 fix(import): path-based checkpoint resume — kills parallel-drop + failed-file-skip + sort-flip bugs (garrytan#988) v0.34.1.0 fix(mcp): MCP fix wave — source-isolation P0 + PKCE DCR + federated_read + 3 more (garrytan#996) v0.34.0.0 feat: Cathedral III — recursive code intelligence + Leiden clusters + eval gate (garrytan#994) v0.33.3.0 feat(v0.33.3): code intelligence MCP foundation (v0.34 W0a-c + W3) (garrytan#934) v0.33.2.1 docs: fork-PR workflow for garrytan-agents (garrytan#992) fix(sync): raise maxBuffer to 100 MiB to prevent silent ENOBUFS crash (garrytan#982) v0.33.2.0 feat(search-lite): token budget + semantic query cache + intent weighting (garrytan#897) v0.33.1.1 fix: Voyage output_dimension + flexible-dim guard + OOM-cap rethrow (garrytan#962)

garrytan-agents and others added 5 commits May 14, 2026 08:09

chore: bump version and changelog (v0.34.3.0)

bfc9238

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

garrytan changed the title ~~perf(embed): cursor-paginated stale loading + rate-limit backoff + partial index~~ v0.34.3.0 fix(embed): cursor-paginated --stale hardening wave (D2/D3/D4/D6/D7/D8 + regression test) May 15, 2026

garrytan changed the title ~~v0.34.3.0 fix(embed): cursor-paginated --stale hardening wave (D2/D3/D4/D6/D7/D8 + regression test)~~ v0.34.4.0 fix(embed): cursor-paginated --stale hardening wave (D2/D3/D4/D6/D7/D8 + regression test) May 15, 2026

garrytan and others added 2 commits May 14, 2026 21:07

garrytan merged commit 24881f6 into master May 15, 2026
7 checks passed

100yenadmin mentioned this pull request May 17, 2026

Merge upstream GBrain v0.35.1.1 while preserving Eva OpenClaw defaults electricsheephq/eva-brain#101

Closed

rayers mentioned this pull request May 20, 2026

fix(embed): preserve code-chunk metadata across re-embed (#769) #1232

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v0.34.4.0 fix(embed): cursor-paginated --stale hardening wave (D2/D3/D4/D6/D7/D8 + regression test)#991

v0.34.4.0 fix(embed): cursor-paginated --stale hardening wave (D2/D3/D4/D6/D7/D8 + regression test)#991
garrytan merged 8 commits into
masterfrom
garrytan/santiago-v2

garrytan commented May 14, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

garrytan commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test Coverage

Pre-Landing Review

Plan Completion

TODOS

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

garrytan commented May 14, 2026 •

edited

Loading