Skip to content

v0.34.4.0 fix(embed): cursor-paginated --stale hardening wave (D2/D3/D4/D6/D7/D8 + regression test)#991

Merged
garrytan merged 8 commits into
masterfrom
garrytan/santiago-v2
May 15, 2026
Merged

v0.34.4.0 fix(embed): cursor-paginated --stale hardening wave (D2/D3/D4/D6/D7/D8 + regression test)#991
garrytan merged 8 commits into
masterfrom
garrytan/santiago-v2

Conversation

@garrytan

@garrytan garrytan commented May 14, 2026

Copy link
Copy Markdown
Owner

Summary

Cherry-picked from PR #987 (garrytan-agents) and hardened through a /plan-eng-review + /codex consult + re-review wave. The original cherry-pick gave embed --stale cursor pagination so it would stop dying on Supabase's 2-minute pooler timeout. This PR makes the hot path survive real production load and fixes a silently-broken --source flag along the way.

Hardening / correctness

  • D2 jitter ±30% on parsed retry-after delays so 20 concurrent workers don't relock on the next 429 wave (thundering herd fix).
  • D3 + D3a + D8 wall-clock budget (GBRAIN_EMBED_TIME_BUDGET_MS, default 30 min) with AbortSignal threaded into THREE places: the retry sleep (abortableSleep), the per-key worker claim loop, AND the gateway embed call itself (so a worker mid-fetch on a ~30s OpenAI HTTP timeout cancels within seconds instead of running past budget).
  • D4 cause-chain 429 detection. e.status === 429 was silently false against the gateway's AITransientError wrap (which stores the original on .cause). detect429FromCause walks the chain (depth ≤5). Message-match stays as fallback for providers whose wrappers strip cause.status.
  • D4a SDK retry suppression via maxRetries: 0 passthrough through embedBatch → gateway → embedMany. Pre-fix: 3 SDK × 5 wrapper = up to 15 cycles per call.
  • D6 migration v60 rewritten to CREATE INDEX CONCURRENTLY with handler-based engine branching (mirrors the v14 invalid-remnant pattern). Plain CREATE INDEX would have taken ShareLock on the 373K-row content_chunks table.
  • D7 sourceId threading. gbrain embed --stale --source X was silently dropping the flag pre-fix. Threaded through countStaleChunks + listStaleChunks + embedAllStale; both Postgres and PGLite engines updated.

Tests added

  • D5 + D5a — 8 unit cases for embedBatchWithBackoff in test/embed.serial.test.ts (parse ms/s retry-after, fallback, non-rate-limit rethrow, jitter range, budget-abort-mid-fetch, normalized-error cause unwrap, maxRetries:0 passthrough). Fixed every pre-existing stale-row mock to include source_id + page_id (TypeScript structural typing was hiding the new contract).
  • D7 + gap scan — 3 CLI flag-wiring tests + end-to-end wall-clock budget test.
  • IRON RULE regression — new test/e2e/embed-stale-pagination.test.ts (6 cases against real Postgres): static (every chunk visited exactly once), failed-page (cursor advances past failures, next run picks up), page-split-across-batches, source-scoped, duplicate-slug-across-sources.
  • D6 — migration v60 source-shape + materialization tests in test/migrate.test.ts.
  • PGLite parity — cursor pagination, page split, source filter cases in test/pglite-engine.test.ts.

Test Coverage

[+] src/commands/embed.ts
  ├── embedBatchWithBackoff() — 8 cases ★★★ TESTED (embed.serial.test.ts:434-588)
  ├── parseRetryDelayMs / detect429FromCause / abortableSleep — exported, exercised
  └── embedAllStale() — sourceId threading + CLI wiring + budget end-to-end ★★★
[+] migration v60 (CONCURRENTLY)         — source-shape + PGLite materialization ★★★
[+] sourceId on count/list (engine)      — Postgres + PGLite parity ★★★
[+] E2E regression                       — static / failed / page-split / source / dup-slug ★★★

COVERAGE: 100% — every new code path has a test
QUALITY: ★★★ × 24+   GAPS: 0
Tests: 504 → 505 files (+1) and ~26 new test cases inside expanded files

Coverage gate: PASS (100%).

Pre-Landing Review

The diff was reviewed via /plan-eng-review (9 decisions made interactively), /codex consult (12 findings folded in), and one re-review pass that surfaced D8 (AbortSignal threading into the gateway embed call). All decisions implemented + tested. Pre-landing checklist scan: clean — no SQL injection, no LLM trust boundary violations, no auth-touching code, no leaked debug logs.

Plan Completion

10/10 plan items DONE — D2 / D3 / D3a / D4 / D4a / D5 / D5a / D6 / D7 / D8 + IRON RULE regression test. Plan file: ~/.claude/plans/system-instruction-you-are-working-iterative-torvalds.md.

TODOS

Two deferred items filed (codex Tier 2 findings beyond this PR's scope):

  • v0.35.x: Concurrent NULL→non-NULL upsert race in embed.ts:429-443 + postgres-engine.ts:1231's COALESCE(...).
  • v0.35.x: New stale rows inserted behind the keyset cursor.

Both filed in TODOS.md under ## embed --stale follow-ups (v0.34.3.0).

Test plan

  • bun run test — 6385 pass / 0 fail / 0 skip across 8 shards + serial
  • bun run test:e2e — 90 files / 603 tests / 0 fail against fresh Postgres
  • Typecheck clean
  • Migration v60 materializes on PGLite + structurally validated for Postgres CONCURRENTLY

🤖 Generated with Claude Code

garrytan-agents and others added 5 commits May 14, 2026 08:09
…rtial index

Three fixes for embed --stale on large brains (300K+ chunks):

## 1. Cursor-paginated listStaleChunks (embed timeout fix)

The previous implementation pulled ALL stale rows (up to 100K) in one
query. On a 373K-row content_chunks table with 48K stale rows, this
query took >2 min and hit Supabase's 2-min statement_timeout, causing
embed --stale to silently fail with zero progress.

Fix: keyset pagination on (page_id, chunk_index) with a default batch
size of 2000 rows. Each query finishes in <1s. The embedAllStale loop
pages through batches, embeds each batch, then advances the cursor.

## 2. Rate-limit-aware retry (429 backoff)

The OpenAI SDK's built-in retry has a ~4s max backoff window, which is
too short for TPM (tokens-per-minute) limits on large pages (~90K
tokens). The embed loop would fail after 3 SDK retries and skip the
page entirely.

Fix: embedBatchWithBackoff wrapper parses the retry delay from the
429 error message (e.g. 'try again in 248ms') and sleeps for that
duration + 500ms padding. Up to 5 retries with parsed delays (60s
fallback when unparseable).

## 3. Migration v58: partial index for NULL embeddings

`CREATE INDEX idx_chunks_embedding_null ON content_chunks (page_id,
chunk_index) WHERE embedding IS NULL` — makes countStaleChunks() and
the paginated listStaleChunks() instant instead of full-table-scanning
373K rows.

## Testing

Verified on a 99K-page / 373K-chunk brain with 48K stale chunks.
Before: embed --stale hung for 2+ min then timed out (0 progress).
After: loads 2K rows in <1s, embeds concurrently, pages through all
stale chunks without timeout.
Resolves migration version collision: master shipped v58
(edges_backfilled_at_v0_33_3 — v0.33.3 W0c symbol-resolution backfill
watermark) ahead of this branch. The embed-perf cherry-pick from PR #987
also claimed v58 for its idx_chunks_embedding_null partial index;
renumbered to v59 since master landed first. Both migrations coexist
unchanged at their new slots.

No other conflicts. Typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lands the 9 decisions + regression test set from /plan-eng-review on PR #991's
embed-perf cherry-pick. Implements the codex outside-voice findings folded in
during plan review.

Architecture / correctness:
- D2 jitter on the parsed retry-after delay (±30%) so 20 concurrent workers
  don't relock on the next 429 wave (thundering herd fix).
- D3 + D3a + D8 wall-clock budget (GBRAIN_EMBED_TIME_BUDGET_MS, default 30
  min) threaded as an AbortSignal into THREE places: the retry sleep
  (abortableSleep), the per-key worker claim loop, and the gateway embed
  call itself (so a worker mid-fetch on a ~30s OpenAI HTTP timeout cancels
  within seconds instead of waiting it out).
- D4 structured 429 detection that unwraps the gateway's AITransientError
  wrap via cause chain (depth-limited to 5). Naive `e.status === 429` was
  silently false against normalized errors; message-match stays as
  fallback. detect429FromCause exported as @internal helper.
- D4a `maxRetries: 0` passthrough through embedBatch → gateway →
  embedMany so the AI SDK's default 2-retry stack doesn't multiply this
  wrapper's 5 attempts (was up to 15 total cycles per call).
- D6 migration v59 (embed_stale_partial_index) rewritten to use
  CREATE INDEX CONCURRENTLY + handler-based engine-branching (mirrors v14
  invalid-remnant pattern). Plain CREATE INDEX would have taken ShareLock
  on the 373K-row content_chunks table for the duration of the build.
- D7 sourceId threaded through countStaleChunks + listStaleChunks +
  embedAllStale. `gbrain embed --stale --source X` was silently dropping
  the flag pre-fix and counting/embedding across every source. Both
  Postgres and PGLite engines updated.

Tests added:
- D5 8 unit cases for embedBatchWithBackoff in test/embed.serial.test.ts:
  ms / s retry-after parse, fallback, non-rate-limit rethrow, jitter
  variance, budget abort during sleep+fetch, normalized-error cause
  unwrap, maxRetries:0 passthrough verification.
- D5a fixed every pre-existing stale-row mock to include source_id +
  page_id (required on StaleChunkRow as of v0.33.3 cursor pagination —
  TypeScript's structural typing was hiding these).
- D7 unit cases asserting CLI `--source X` parses + threads sourceId.
- Gap scan: end-to-end wall-clock budget firing in the outer pagination
  loop via runEmbedCore.
- D6 migration v59 test cases in test/migrate.test.ts: source-shape
  assertion (CONCURRENTLY + invalid-remnant DROP-before-CREATE ordering),
  PGLite handler-branch idempotency, partial-index materialization.
- REGRESSION: new test/e2e/embed-stale-pagination.test.ts covering
  static (every chunk visited exactly once), failed-page (cursor advances
  past failures, next run picks up), page-split-across-batches,
  source-scoped scan, duplicate-slug-across-sources.
- PGLite parity cases for cursor pagination, page split, source filter
  in test/pglite-engine.test.ts (pins tuple-compare against WASM build).

Gate:
- bun run test: 6305 pass / 0 fail / 0 skip across all 8 shards + serial.
- DATABASE_URL=... bun run test:e2e: 90 files, 603 tests, 0 failures.

Plan: ~/.claude/plans/system-instruction-you-are-working-iterative-torvalds.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Master shipped v0.34.0.0 Cathedral III which claimed migration v59
(code_traversal_cache_v0_34). The embed-perf branch's idx_chunks_embedding_null
was at v59 from the prior merge wave; renumbered to v60 since master
landed first. Both migrations coexist unchanged at their final slots.

Also fixed a TypeScript stricter-cast nit in the new E2E test (postgres.js
RowList -> typed array now requires going through unknown first; master
likely bumped TS or postgres.js types).

Full unit suite passes: 6385 / 0 / 0 across 8 shards + serial.
Full E2E suite passes against fresh Postgres: 90 files / 603 tests / 0 fail
(verified in the prior commit; re-running is gated on Docker availability).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@garrytan garrytan changed the title perf(embed): cursor-paginated stale loading + rate-limit backoff + partial index v0.34.3.0 fix(embed): cursor-paginated --stale hardening wave (D2/D3/D4/D6/D7/D8 + regression test) May 15, 2026
Master shipped v0.34.1.0 (PR #996, the MCP source-isolation wave) which
added migrations v60-v65 (oauth_clients.source_id FK column, federated_read
TEXT[], backfill, validation, RESTRICT flip, GIN index). Our previously
v60-numbered embed_stale_partial_index renumbered to v66.

Version: 0.34.3.0 → 0.34.4.0 per user direction (skipping over claimed
slots in the queue).

Conflict resolutions:
- package.json: take v0.34.4.0
- src/core/migrate.ts: keep both v0.34.1.0's v60-v65 cluster AND our
  embed_stale_partial_index, the latter renumbered to v66
- CHANGELOG.md: my v0.34.4.0 entry on top, master's v0.34.1.0 below
- TODOS.md: my embed-stale follow-ups + master's MCP fix-wave follow-ups
- test/migrate.test.ts + test/e2e/embed-stale-pagination.test.ts:
  rename v60 → v66 references (24 + 3 occurrences)

Typecheck clean. Migrate + PGLite + embed.serial tests: 238 / 0.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@garrytan garrytan changed the title v0.34.3.0 fix(embed): cursor-paginated --stale hardening wave (D2/D3/D4/D6/D7/D8 + regression test) v0.34.4.0 fix(embed): cursor-paginated --stale hardening wave (D2/D3/D4/D6/D7/D8 + regression test) May 15, 2026
garrytan and others added 2 commits May 14, 2026 21:07
Master shipped v0.34.2.0 (PR #988, path-based import checkpoint). v0.34.2.0
adds no new migrations (only new helper modules src/core/import-checkpoint.ts
and src/core/sort-newest-first.ts), so no migration renumber needed this
wave — our embed_stale_partial_index stays at v66.

Conflict resolutions:
- VERSION: take 0.34.4.0
- package.json: take 0.34.4.0
- CHANGELOG.md: collapse the interleaved auto-merge (master's v0.34.2.0
  entry got woven into mine on shared "### Itemized changes" / "#### Added"
  headers). Reconstructed both entries cleanly back-to-back: v0.34.4.0
  topmost, then v0.34.2.0, then v0.34.1.0.

Updated CHANGELOG references to migration v66 (was inadvertently still
"v60" in three spots from the prior wave's prose — corrected here).

Typecheck clean. lockfile unchanged (no dep delta).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Master shipped v0.34.3.0 (PR #1003, supervisor clean-exit watchdog fix).
No new migrations — only src/core/minions/* + src/commands/autopilot.ts
touched, so our v66 partial-index migration stays in place.

Conflict resolutions:
- VERSION: take 0.34.4.0
- package.json: take 0.34.4.0
- CHANGELOG.md: clean non-interleaved conflict; strip markers and keep
  both entries (v0.34.4.0 on top, v0.34.3.0 below, then the existing
  v0.34.2.0 and v0.34.1.0 in descending order)

Typecheck clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@garrytan garrytan merged commit 24881f6 into master May 15, 2026
7 checks passed
brandonlipman added a commit to brandonlipman/gbrain that referenced this pull request May 29, 2026
* upstream/master:
  v0.35.1.0: embedder shootout prereqs (pricing + gateway export + --resume-from) (garrytan#1055)
  v0.35.0.0 feat: ZeroEntropy zembed-1 + zerank-2 reranker (garrytan#1008)
  v0.34.4.0 fix(embed): cursor-paginated --stale hardening wave (D2/D3/D4/D6/D7/D8 + regression test) (garrytan#991)
  v0.34.3.0 fix: supervisor treats code=0 watchdog exits as crashes (garrytan#1003)
  v0.34.2.0 fix(import): path-based checkpoint resume — kills parallel-drop + failed-file-skip + sort-flip bugs (garrytan#988)
  v0.34.1.0 fix(mcp): MCP fix wave — source-isolation P0 + PKCE DCR + federated_read + 3 more (garrytan#996)
  v0.34.0.0 feat: Cathedral III — recursive code intelligence + Leiden clusters + eval gate (garrytan#994)
  v0.33.3.0 feat(v0.33.3): code intelligence MCP foundation (v0.34 W0a-c + W3) (garrytan#934)
  v0.33.2.1 docs: fork-PR workflow for garrytan-agents (garrytan#992)
  fix(sync): raise maxBuffer to 100 MiB to prevent silent ENOBUFS crash (garrytan#982)
  v0.33.2.0 feat(search-lite): token budget + semantic query cache + intent weighting (garrytan#897)
  v0.33.1.1 fix: Voyage output_dimension + flexible-dim guard + OOM-cap rethrow (garrytan#962)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants