v0.34.4.0 fix(embed): cursor-paginated --stale hardening wave (D2/D3/D4/D6/D7/D8 + regression test)#991
Merged
Merged
Conversation
…rtial index Three fixes for embed --stale on large brains (300K+ chunks): ## 1. Cursor-paginated listStaleChunks (embed timeout fix) The previous implementation pulled ALL stale rows (up to 100K) in one query. On a 373K-row content_chunks table with 48K stale rows, this query took >2 min and hit Supabase's 2-min statement_timeout, causing embed --stale to silently fail with zero progress. Fix: keyset pagination on (page_id, chunk_index) with a default batch size of 2000 rows. Each query finishes in <1s. The embedAllStale loop pages through batches, embeds each batch, then advances the cursor. ## 2. Rate-limit-aware retry (429 backoff) The OpenAI SDK's built-in retry has a ~4s max backoff window, which is too short for TPM (tokens-per-minute) limits on large pages (~90K tokens). The embed loop would fail after 3 SDK retries and skip the page entirely. Fix: embedBatchWithBackoff wrapper parses the retry delay from the 429 error message (e.g. 'try again in 248ms') and sleeps for that duration + 500ms padding. Up to 5 retries with parsed delays (60s fallback when unparseable). ## 3. Migration v58: partial index for NULL embeddings `CREATE INDEX idx_chunks_embedding_null ON content_chunks (page_id, chunk_index) WHERE embedding IS NULL` — makes countStaleChunks() and the paginated listStaleChunks() instant instead of full-table-scanning 373K rows. ## Testing Verified on a 99K-page / 373K-chunk brain with 48K stale chunks. Before: embed --stale hung for 2+ min then timed out (0 progress). After: loads 2K rows in <1s, embeds concurrently, pages through all stale chunks without timeout.
Resolves migration version collision: master shipped v58 (edges_backfilled_at_v0_33_3 — v0.33.3 W0c symbol-resolution backfill watermark) ahead of this branch. The embed-perf cherry-pick from PR #987 also claimed v58 for its idx_chunks_embedding_null partial index; renumbered to v59 since master landed first. Both migrations coexist unchanged at their new slots. No other conflicts. Typecheck clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lands the 9 decisions + regression test set from /plan-eng-review on PR #991's embed-perf cherry-pick. Implements the codex outside-voice findings folded in during plan review. Architecture / correctness: - D2 jitter on the parsed retry-after delay (±30%) so 20 concurrent workers don't relock on the next 429 wave (thundering herd fix). - D3 + D3a + D8 wall-clock budget (GBRAIN_EMBED_TIME_BUDGET_MS, default 30 min) threaded as an AbortSignal into THREE places: the retry sleep (abortableSleep), the per-key worker claim loop, and the gateway embed call itself (so a worker mid-fetch on a ~30s OpenAI HTTP timeout cancels within seconds instead of waiting it out). - D4 structured 429 detection that unwraps the gateway's AITransientError wrap via cause chain (depth-limited to 5). Naive `e.status === 429` was silently false against normalized errors; message-match stays as fallback. detect429FromCause exported as @internal helper. - D4a `maxRetries: 0` passthrough through embedBatch → gateway → embedMany so the AI SDK's default 2-retry stack doesn't multiply this wrapper's 5 attempts (was up to 15 total cycles per call). - D6 migration v59 (embed_stale_partial_index) rewritten to use CREATE INDEX CONCURRENTLY + handler-based engine-branching (mirrors v14 invalid-remnant pattern). Plain CREATE INDEX would have taken ShareLock on the 373K-row content_chunks table for the duration of the build. - D7 sourceId threaded through countStaleChunks + listStaleChunks + embedAllStale. `gbrain embed --stale --source X` was silently dropping the flag pre-fix and counting/embedding across every source. Both Postgres and PGLite engines updated. Tests added: - D5 8 unit cases for embedBatchWithBackoff in test/embed.serial.test.ts: ms / s retry-after parse, fallback, non-rate-limit rethrow, jitter variance, budget abort during sleep+fetch, normalized-error cause unwrap, maxRetries:0 passthrough verification. - D5a fixed every pre-existing stale-row mock to include source_id + page_id (required on StaleChunkRow as of v0.33.3 cursor pagination — TypeScript's structural typing was hiding these). - D7 unit cases asserting CLI `--source X` parses + threads sourceId. - Gap scan: end-to-end wall-clock budget firing in the outer pagination loop via runEmbedCore. - D6 migration v59 test cases in test/migrate.test.ts: source-shape assertion (CONCURRENTLY + invalid-remnant DROP-before-CREATE ordering), PGLite handler-branch idempotency, partial-index materialization. - REGRESSION: new test/e2e/embed-stale-pagination.test.ts covering static (every chunk visited exactly once), failed-page (cursor advances past failures, next run picks up), page-split-across-batches, source-scoped scan, duplicate-slug-across-sources. - PGLite parity cases for cursor pagination, page split, source filter in test/pglite-engine.test.ts (pins tuple-compare against WASM build). Gate: - bun run test: 6305 pass / 0 fail / 0 skip across all 8 shards + serial. - DATABASE_URL=... bun run test:e2e: 90 files, 603 tests, 0 failures. Plan: ~/.claude/plans/system-instruction-you-are-working-iterative-torvalds.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Master shipped v0.34.0.0 Cathedral III which claimed migration v59 (code_traversal_cache_v0_34). The embed-perf branch's idx_chunks_embedding_null was at v59 from the prior merge wave; renumbered to v60 since master landed first. Both migrations coexist unchanged at their final slots. Also fixed a TypeScript stricter-cast nit in the new E2E test (postgres.js RowList -> typed array now requires going through unknown first; master likely bumped TS or postgres.js types). Full unit suite passes: 6385 / 0 / 0 across 8 shards + serial. Full E2E suite passes against fresh Postgres: 90 files / 603 tests / 0 fail (verified in the prior commit; re-running is gated on Docker availability). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Master shipped v0.34.1.0 (PR #996, the MCP source-isolation wave) which added migrations v60-v65 (oauth_clients.source_id FK column, federated_read TEXT[], backfill, validation, RESTRICT flip, GIN index). Our previously v60-numbered embed_stale_partial_index renumbered to v66. Version: 0.34.3.0 → 0.34.4.0 per user direction (skipping over claimed slots in the queue). Conflict resolutions: - package.json: take v0.34.4.0 - src/core/migrate.ts: keep both v0.34.1.0's v60-v65 cluster AND our embed_stale_partial_index, the latter renumbered to v66 - CHANGELOG.md: my v0.34.4.0 entry on top, master's v0.34.1.0 below - TODOS.md: my embed-stale follow-ups + master's MCP fix-wave follow-ups - test/migrate.test.ts + test/e2e/embed-stale-pagination.test.ts: rename v60 → v66 references (24 + 3 occurrences) Typecheck clean. Migrate + PGLite + embed.serial tests: 238 / 0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Master shipped v0.34.2.0 (PR #988, path-based import checkpoint). v0.34.2.0 adds no new migrations (only new helper modules src/core/import-checkpoint.ts and src/core/sort-newest-first.ts), so no migration renumber needed this wave — our embed_stale_partial_index stays at v66. Conflict resolutions: - VERSION: take 0.34.4.0 - package.json: take 0.34.4.0 - CHANGELOG.md: collapse the interleaved auto-merge (master's v0.34.2.0 entry got woven into mine on shared "### Itemized changes" / "#### Added" headers). Reconstructed both entries cleanly back-to-back: v0.34.4.0 topmost, then v0.34.2.0, then v0.34.1.0. Updated CHANGELOG references to migration v66 (was inadvertently still "v60" in three spots from the prior wave's prose — corrected here). Typecheck clean. lockfile unchanged (no dep delta). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Master shipped v0.34.3.0 (PR #1003, supervisor clean-exit watchdog fix). No new migrations — only src/core/minions/* + src/commands/autopilot.ts touched, so our v66 partial-index migration stays in place. Conflict resolutions: - VERSION: take 0.34.4.0 - package.json: take 0.34.4.0 - CHANGELOG.md: clean non-interleaved conflict; strip markers and keep both entries (v0.34.4.0 on top, v0.34.3.0 below, then the existing v0.34.2.0 and v0.34.1.0 in descending order) Typecheck clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
5 tasks
brandonlipman
added a commit
to brandonlipman/gbrain
that referenced
this pull request
May 29, 2026
* upstream/master: v0.35.1.0: embedder shootout prereqs (pricing + gateway export + --resume-from) (garrytan#1055) v0.35.0.0 feat: ZeroEntropy zembed-1 + zerank-2 reranker (garrytan#1008) v0.34.4.0 fix(embed): cursor-paginated --stale hardening wave (D2/D3/D4/D6/D7/D8 + regression test) (garrytan#991) v0.34.3.0 fix: supervisor treats code=0 watchdog exits as crashes (garrytan#1003) v0.34.2.0 fix(import): path-based checkpoint resume — kills parallel-drop + failed-file-skip + sort-flip bugs (garrytan#988) v0.34.1.0 fix(mcp): MCP fix wave — source-isolation P0 + PKCE DCR + federated_read + 3 more (garrytan#996) v0.34.0.0 feat: Cathedral III — recursive code intelligence + Leiden clusters + eval gate (garrytan#994) v0.33.3.0 feat(v0.33.3): code intelligence MCP foundation (v0.34 W0a-c + W3) (garrytan#934) v0.33.2.1 docs: fork-PR workflow for garrytan-agents (garrytan#992) fix(sync): raise maxBuffer to 100 MiB to prevent silent ENOBUFS crash (garrytan#982) v0.33.2.0 feat(search-lite): token budget + semantic query cache + intent weighting (garrytan#897) v0.33.1.1 fix: Voyage output_dimension + flexible-dim guard + OOM-cap rethrow (garrytan#962)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Cherry-picked from PR #987 (garrytan-agents) and hardened through a
/plan-eng-review+/codexconsult + re-review wave. The original cherry-pick gaveembed --stalecursor pagination so it would stop dying on Supabase's 2-minute pooler timeout. This PR makes the hot path survive real production load and fixes a silently-broken--sourceflag along the way.Hardening / correctness
GBRAIN_EMBED_TIME_BUDGET_MS, default 30 min) withAbortSignalthreaded into THREE places: the retry sleep (abortableSleep), the per-key worker claim loop, AND the gateway embed call itself (so a worker mid-fetch on a ~30s OpenAI HTTP timeout cancels within seconds instead of running past budget).e.status === 429was silently false against the gateway'sAITransientErrorwrap (which stores the original on.cause).detect429FromCausewalks the chain (depth ≤5). Message-match stays as fallback for providers whose wrappers stripcause.status.maxRetries: 0passthrough throughembedBatch→ gateway →embedMany. Pre-fix: 3 SDK × 5 wrapper = up to 15 cycles per call.CREATE INDEX CONCURRENTLYwith handler-based engine branching (mirrors the v14 invalid-remnant pattern). PlainCREATE INDEXwould have taken ShareLock on the 373K-rowcontent_chunkstable.gbrain embed --stale --source Xwas silently dropping the flag pre-fix. Threaded throughcountStaleChunks+listStaleChunks+embedAllStale; both Postgres and PGLite engines updated.Tests added
embedBatchWithBackoffintest/embed.serial.test.ts(parse ms/s retry-after, fallback, non-rate-limit rethrow, jitter range, budget-abort-mid-fetch, normalized-error cause unwrap, maxRetries:0 passthrough). Fixed every pre-existing stale-row mock to includesource_id+page_id(TypeScript structural typing was hiding the new contract).test/e2e/embed-stale-pagination.test.ts(6 cases against real Postgres): static (every chunk visited exactly once), failed-page (cursor advances past failures, next run picks up), page-split-across-batches, source-scoped, duplicate-slug-across-sources.test/migrate.test.ts.test/pglite-engine.test.ts.Test Coverage
Coverage gate: PASS (100%).
Pre-Landing Review
The diff was reviewed via
/plan-eng-review(9 decisions made interactively),/codexconsult (12 findings folded in), and one re-review pass that surfaced D8 (AbortSignalthreading into the gateway embed call). All decisions implemented + tested. Pre-landing checklist scan: clean — no SQL injection, no LLM trust boundary violations, no auth-touching code, no leaked debug logs.Plan Completion
10/10 plan items DONE — D2 / D3 / D3a / D4 / D4a / D5 / D5a / D6 / D7 / D8 + IRON RULE regression test. Plan file:
~/.claude/plans/system-instruction-you-are-working-iterative-torvalds.md.TODOS
Two deferred items filed (codex Tier 2 findings beyond this PR's scope):
embed.ts:429-443+postgres-engine.ts:1231'sCOALESCE(...).Both filed in
TODOS.mdunder## embed --stale follow-ups (v0.34.3.0).Test plan
bun run test— 6385 pass / 0 fail / 0 skip across 8 shards + serialbun run test:e2e— 90 files / 603 tests / 0 fail against fresh Postgres🤖 Generated with Claude Code