v0.41.25.0 perf: batch sync deletes (supersedes #1538)#1568
Closed
garrytan wants to merge 2 commits into
Closed
Conversation
Bulk deletes no longer jam sync for two days. A single commit deleting 73K files used to take ~5 hours (146K individual DB round-trips) and time out the sync cron every run. This release batches both halves: ~2 minutes on the same scenario, with the per-batch shape inheriting the v0.41.19 Supavisor retry cathedral so transient connection blips don't lose work. Also closes a latent silent-no-op bug class: on brains that didn't pass `--source`, the old per-file resolver could pull a slug from the wrong source and then issue a delete against `source_id='default'`, which would no-op. Files stayed in the brain after the file was gone from disk. The new path scopes lookup AND delete to the same source. Productionized from community PR #1538 (@garrytan-agents) per the CLAUDE.md community-PR-wave workflow. 12 review-driven improvements layered on top via /plan-eng-review + codex outside voice: D1 drop duplicate progress.finish() from PR #1538 D2 batch BOTH sourceId-set and no-sourceId Phase-1 lookups D3 BATCH_SIZE=100 matches v0.12.1 addLinksBatch precedent D4 full test coverage: 12 unit + 3 E2E + D1/D9 regressions D5 codex outside-voice review (12 findings absorbed) D6 scope no-sourceId path to source_id='default' (kills latent bug) D7 ORDER BY slug ASC for deterministic duplicate-source_path collapse D8 wire deletePages through withRetry(BULK_RETRY_OPTS) + audit D9 query-count regression test (the actual perf regression test) D10 deletePages returns Promise<string[]> not Promise<number> D11 implement on both engines, non-optional interface (no fallback) D12 CHANGELOG scoped honestly to delete hotspot Engine surface: BrainEngine.deletePages(slugs, opts?): Promise<string[]> PostgresEngine: withRetry wrap, audit 'deletePages', BATCH_SIZE=100 PGLiteEngine: raw DELETE ... WHERE slug = ANY($1::text[]) BATCH_AUDIT_SITES enum + retry.test.ts regression pin updated Sync rewrite: src/commands/sync.ts delete loop replaced with two-phase batched path. New sync.deletes.resolve progress event (Phase 1 lookup), existing sync.deletes covers Phase 2. Per-batch --timeout abort checks preserve the v0.41.13.0 D-V4-2 contract. Tests: test/sync-batch-deletes.test.ts (NEW, 12 PGLite cases): engine-level contract + sync-level integration (D6/D7/D9 regressions) + D1 structural source-text regression test/e2e/sync.test.ts (extended, 3 real-Postgres cases): FK cascade through content_chunks, 250-file batch-shape verification, abort mid-sync returns partial(timeout) cleanly test/core/retry.test.ts BATCH_AUDIT_SITES regression-pin extended Docs: docs/ENGINES.md documents deletePages alongside deletePage docs/progress-events.md documents sync.deletes.resolve Co-authored-by: garrytan-agents <garrytan-agents@users.noreply.github.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Catches up to v0.41.23.0 (#1541). VERSION + package.json kept at 0.41.25.0 (ours, higher semver); CHANGELOG carries both entries with v0.41.25.0 on top. llms-full.txt regenerated for the merged CHANGELOG. No code conflicts in src/ — master's extract operator surfaces wave (#1541) touches different files than this branch's batch-deletes work in sync.ts.
Owner
Author
|
Superseded by #1566 (batched deletes + global page-generation clock), which merged as v0.41.25.0. Closing this duplicate. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Bulk deletes no longer jam sync for two days. A single commit deleting 73K files used to take ~5 hours (146K individual DB round-trips) and time out the sync cron every run. This release batches both halves — Phase 1 batched slug resolution + Phase 2 batched DELETE — so the same scenario lands in ~2 minutes. The per-batch shape inherits the v0.41.19 Supavisor
withRetrycathedral so transient connection blips don't lose work.Also closes a latent silent-no-op bug class on
gbrain syncwithout--source: the pre-fix per-file resolver could pull a slug from the wrong source and then issue a delete againstsource_id='default', which would silently no-op. Files stayed in the brain after the file was gone from disk. The new path scopes lookup AND delete to the same source.Scope honesty: this fixes the delete-specific hotspot only. The full RFC #1472 cathedral (per-source
--timeout,--break-lock-if-stale,--independentmode) already landed in v0.41.15.0 (#1506). Renames, imports, extraction, embedding, and lock contention can still cause sync slowness in their own ways.Provenance
Productionized from community PR #1538 (
@garrytan-agents) per CLAUDE.md's community-PR-wave workflow. The original PR couldn't run CI with secrets because GitHub doesn't pass them to fork PRs from non-collaborators. Moving to a base-repo branch + 12 review-driven improvements layered on top via/plan-eng-review+ codex outside voice:progress.finish()from PR #1538BATCH_SIZE=100matches v0.12.1addLinksBatchprecedentsource_id='default'(kills latent bug)ORDER BY slug ASCfor deterministic duplicate-source_path collapsedeletePagesthroughwithRetry(BULK_RETRY_OPTS)+ auditdeletePagesreturnsPromise<string[]>notPromise<number>Engine surface
PostgresEngine wraps each batch in
withRetry(BULK_RETRY_OPTS)withauditSite: 'deletePages'. PGLite runs the bare DELETE (single-writer, no Supavisor concern).BATCH_AUDIT_SITESenum extended;test/core/retry.test.tsregression pin updated.Sync rewrite
src/commands/sync.tsdelete loop replaced with the two-phase batched path:SELECT slug, source_path FROM pages WHERE source_path = ANY($1::text[]) AND source_id = $2 ORDER BY slug ASC(BATCH_SIZE=100).ORDER BYpins deterministic Map collapse on duplicate source_paths.engine.deletePages(batch, deleteOpts)(BATCH_SIZE=100). Returns actually-deleted slugs; caller'spagesAffectedappends INPUT batch for parity with pre-fix (downstream extract/embed sweeps no-op on missing pages).New
sync.deletes.resolveprogress event fires for Phase 1; existingsync.deletescovers Phase 2. Per-batch--timeoutabort checks preserve the v0.41.13.0 D-V4-2 contract.Tests
test/sync-batch-deletes.test.ts(NEW, 12 PGLite cases): engine-level contract (D10), missing slugs, abort prefix, source scoping (D6 + default), 250-slug internal batching, D7 deterministicORDER BY, source_path-NULL fallback, D9 query-count regression, D1 structural source-text regression.test/e2e/sync.test.ts(extended, 3 real-Postgres cases): FK CASCADE throughcontent_chunks(100-row batch), end-to-end 250-file delete batches into ≤3 SELECT + ≤3 deletePages calls, abort mid-sync returnspartial('timeout')with emptypagesAffected.test/core/retry.test.tsregression-pin extended with'deletePages'.Docs
docs/ENGINES.mddocumentsdeletePagesalongsidedeletePage.docs/progress-events.mddocumentssync.deletes.resolve.Closes / supersedes
Supersedes #1538 (same fix, base-repo branch so CI has secret access, plus the 12 improvements above).
Test plan
bun run typecheck— passbun run verify— 28/28 checks passbun test test/sync-batch-deletes.test.ts— 12/12 (PGLite)bun test test/core/retry.test.ts— 37/37 (regression-pin updated)bun test test/build-llms.test.ts— 7/7 (llms.txt regenerated for CHANGELOG)bun run test:serial— 50/50 serial filesbun run test— 3105 pass / 1 fail (BATCH_AUDIT_SITES regression-pin, fixed in same commit)bun test test/e2e/sync.test.tsagainst pgvector container — 17/17Notes
llms-full.txtregenerated for the CHANGELOG entry per the CLAUDE.md mandatory-regen-after-CHANGELOG-edit rule.TODOS.md: composite(source_id, source_path)partial index for the optimal Phase-1 hot path on 4-source federated brains (P3, latent).🤖 Generated with Claude Code