perf: batch sync deletes — 73K deletes in ~2 min instead of ~5 hours#1538
perf: batch sync deletes — 73K deletes in ~2 min instead of ~5 hours#1538garrytan-agents wants to merge 1 commit into
Conversation
The sync delete path processed each file deletion as two sequential DB round-trips: one to resolve the slug (SELECT), one to delete (DELETE). For a commit that removes 73K files, this means 146K individual queries taking ~5 hours and jamming the entire sync pipeline for 2 days. This change batches both phases: Phase 1 — Slug resolution: Before: 73K individual SELECT ... WHERE source_path = $1 After: 146 batched SELECT ... WHERE source_path = ANY($1) (500 per batch) Phase 2 — Deletion: Before: 73K individual DELETE ... WHERE slug = $1 After: 146 batched DELETE ... WHERE slug = ANY($1) (500 per batch) Total: ~292 DB round-trips instead of ~146K. Real-world improvement: 5 hours → ~2 minutes. The batch path requires engine.deletePages() (new optional method on BrainEngine interface). Falls back to sequential deletePage() for PGLite or engines that don't implement it.
|
Closing as superseded — the supersede PR is in active design. Plan locked in via Why supersede instead of merging this + iterating: this PR fixes the headline bottleneck on Postgres (73K-delete: ~5h → ~2min, real win), but ships with PGLite users getting zero benefit (deletePages falls back to sequential The replacement work expands scope:
Credit: the SQL shape ( The PR source is preserved locally in Plan file: |
|
Superseded by #1568 per the CLAUDE.md community-PR-wave workflow — same fix, base-repo branch so CI has secret access, plus 12 review-driven improvements (
Plus a full test suite: 12 PGLite unit cases + 3 real-Postgres E2E cases. Your contribution preserved via |
… (supersedes #1538) (#1566) * feat(engine): add deletePages + resolveSlugsByPaths to BrainEngine (v0.41.21.0 T1) Two new REQUIRED methods on the BrainEngine interface, implemented on both Postgres and PGLite engines. Closes the per-file N+1 query pattern that PR #1538 batched on Postgres only. deletePages(slugs: string[], opts: { sourceId: string }): Promise<string[]> — Single SQL round-trip: DELETE FROM pages WHERE slug = ANY($1::text[]) AND source_id = $2 RETURNING slug — Returns slugs ACTUALLY DELETED (D6, codex CDX-8) so callers can filter pagesAffected to exclude phantom slugs (paths in the deletion list but with no DB row). — Single-batch primitive: caller chunks input to DELETE_BATCH_SIZE. Throws if input exceeds the cap. — sourceId is REQUIRED at the type level (D5, codex CDX-10). Asymmetric with single-row deletePage which keeps the optional 'default' fallback for back-compat. v0.42+ TODO to tighten. resolveSlugsByPaths(paths, opts): Promise<Map<path, slug>> — Batch path → slug lookup. Single SQL round-trip: SELECT slug, source_path FROM pages WHERE source_path = ANY($1::text[]) AND source_id = $2 — Missing paths absent from the Map (caller falls back to path-derived slug, same contract as resolveSlugByPathOrSourcePath). — Empty input short-circuits to empty Map (no SQL). src/core/engine-constants.ts (NEW) — Single source of truth for DELETE_BATCH_SIZE = 500. — Both engines import; no engine-from-engine coupling. — Lives outside engine.ts (the interface module) to avoid circular imports. Also updates the deletePage JSDoc (CDX-11): drops the misleading "hard delete is admin-only" framing. `gbrain sync` hard-deletes on every run that sees a deleted file; not admin-only. Co-Authored-By: garrytan-agents <noreply@anthropic.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * perf(sync): batched delete + rename + DRY refactor (v0.41.21.0 T2/T3/T4) Replaces the per-file delete loop (sync.ts:1241-1257) and per-file rename slug-resolve (sync.ts:1263-1295) with interleaved per-batch flows using engine.resolveSlugsByPaths + engine.deletePages. Also refactors resolveSlugByPathOrSourcePath (sync.ts:267) to delegate to the new batch helper when sourceId is set — one owner of the SQL + fallback semantics (D8). ROUND-TRIP COUNTS (73K-delete commit): pre-fix: 73,000 SELECTs + 73,000 DELETEs = 146,000 (~5 hours) post-fix: 146 SELECTs + 146 DELETEs = 292 (~2 minutes) Headline win: a single commit deleting 73K files no longer jams the sync pipeline for hours, no longer cascades staleness across every other source on the brain. Shape (T2 delete loop, per the plan's ASCII diagram): filtered.deleted (73K paths) │ ▼ slice into batches of DELETE_BATCH_SIZE (500) │ ▼ for each batch: abort-check ──► partial('timeout') │ ▼ engine.resolveSlugsByPaths(batch, {sourceId}) ◀── 1 SQL round-trip │ ▼ slugs = batch.map(path => map.get(path) ?? resolveSlugForPath(path)) ◀── pure-JS fallback │ for frontmatter- ▼ fallback slugs try { deleted = engine.deletePages(slugs, opts) ◀── 1 SQL round-trip pagesAffected.push(...deleted) ◀── D6 confirmed only } catch { // D7 decompose: per-slug deletePage, // unrecoverable failures → failedFiles } Per-batch try-catch (D7) decomposes batch DELETE failures to per-slug deletePage so a transient blip on batch 73 doesn't lose 500 deletes — it self-heals to one-at-a-time for that batch only. Unrecoverable per-slug failures land in failedFiles (matching the existing import-loop pattern at sync.ts:~1350). failedFiles declaration hoisted above the delete loop so both delete decompose and import loops feed the same sync-bookmark gate. T4 rename loop: pre-resolves all `from` slugs in batches via resolveSlugsByPaths BEFORE iterating. Per-file updateSlug + importFile calls stay (those are inherently per-file). The try/catch around updateSlug for slug-doesn't-exist preserves verbatim. T3 DRY refactor: resolveSlugByPathOrSourcePath delegates to resolveSlugsByPaths via a single-element array when sourceId is set. When sourceId is undefined (legacy unscoped callers), falls back to the original executeRaw shape — the batch engine surface requires sourceId per D5 (multi-source-bug-class defense). Atomicity coarsening (D3): each batch is one transaction. A mid-batch abort or connection failure rolls back up to DELETE_BATCH_SIZE - 1 successful deletes from the in-flight batch. Sync is idempotent so the next run picks them up via git diff regenerating the deletion list. Documented at the call site + in the deletePages JSDoc. Co-Authored-By: garrytan-agents <noreply@anthropic.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(schema): global page-generation clock + statement-level trigger (v0.41.21.0 T5) Migration v104: page_generation_clock_and_statement_trigger. The pre-v0.41.21.0 query-cache Layer 1 bookmark read MAX(generation) FROM pages to detect "writes happened since cache-store". Two bugs in that contract — independent of any sync work, surfaced by codex outside-voice on the /plan-eng-review pass: 1. The row-level bump_page_generation_trg (migration v91) sets NEW.generation = OLD.generation + 1 on UPDATE. Updating a NON-MAX page didn't advance MAX(generation). Cache silently served stale for any UPDATE-to-non-max page. (CDX-2) 2. The trigger is BEFORE INSERT OR UPDATE — DELETE doesn't fire it at all. Even an AFTER DELETE wouldn't move MAX (surviving rows are untouched). (CDX-1) Fix: single-row page_generation_clock counter, bumped per-statement (FOR EACH STATEMENT — per-row would turn a 73K-row batch DELETE into 73K UPDATEs on the same counter, recreating the bottleneck this PR fixes elsewhere — codex CDX-4). Layer 1 reads the clock value directly (T6, separate commit). Per-row pages.generation stays for Layer 2 (per-page snapshot via jsonb_each + LEFT JOIN pages) which doesn't care about MAX, only per-page advancement. Seeded with COALESCE(MAX(pages.generation), 0) so existing query_cache rows stored under the old MAX semantics aren't all instantly invalidated on upgrade. Their max_generation_at_store stamp compares cleanly against the seeded clock; future writes bump the clock and the bookmark fires correctly. CREATE TABLE page_generation_clock ( id INTEGER PRIMARY KEY CHECK (id = 1), value BIGINT NOT NULL DEFAULT 0 ); CREATE TRIGGER bump_page_generation_clock_trg AFTER INSERT OR UPDATE OR DELETE ON pages FOR EACH STATEMENT EXECUTE FUNCTION bump_page_generation_clock_fn(); Mirror in src/core/pglite-schema.ts so fresh PGLite installs get the table + trigger via SCHEMA_SQL replay. The forward-reference bootstrap probe doesn't need an entry: page_generation_clock is created directly by SCHEMA_SQL (no separate index or FK references it), so the schema-bootstrap-coverage gate is satisfied as-is. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(cache): move Layer 1 to global clock + invalidate empty snapshots (v0.41.21.0 T6) Closes the silent stale-cache bug class that's been live in master since the bookmark feature shipped. Pre-fix, gbrain search would silently serve stale cached results in three independent scenarios: 1. UPDATE to a non-max-generation page (CDX-2) — the row-level trigger advanced per-page generation but didn't move MAX(generation), so the bookmark passed. 2. DELETE of any page (CDX-1) — the trigger didn't fire at all, and even an AFTER DELETE wouldn't move MAX. 3. Empty-result cache row + subsequent matching INSERT (CDX-6 / D20) — page_generations = '{}'::jsonb was "vacuously valid" via Layer 2, surviving any clock bump. Fix: buildPageGenerationsSnapshot (store path) — Replaces the SELECT MAX(generation) FROM pages reads at cache-write time with SELECT value FROM page_generation_clock WHERE id = 1. — Empty pageIds path: only need the clock value (D20 contract). — Combined non-empty path: per-page generation (Layer 2 substrate) + clock value, both folded in one round trip via UNION ALL. CACHE_GATE_WHERE_CLAUSE (lookup path) — Layer 1 reads page_generation_clock.value (single-row O(1) lookup, faster than the pre-fix MAX(generation) backward index scan). — Layer 2 stricter: requires page_generations <> '{}'::jsonb AND the per-page check (not OR with the vacuously-valid `= '{}'` shortcut). Empty snapshots can no longer survive a Layer 1 miss. validateCacheRowAgainstPages (pure validator) — Layer 2 returns false for empty snapshots when Layer 1 fails. — Documented contract change. Backward compat: pre-v0.40.3.0 cache rows have max_generation_at_store = 0 AND page_generations = '{}'::jsonb. On a populated brain, Layer 1 fails (clock > 0). Layer 2 is now stricter so legacy rows invalidate once on first post-upgrade lookup, then the cache fills back correctly. Acceptable one-time miss spike; post-upgrade cache is structurally sound. The clock seed (COALESCE(MAX(pages.generation), 0)) from migration v104 keeps NON-empty legacy rows passing Layer 1 until the next write — they don't all invalidate at once. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: cover v0.41.21.0 delete-batch + global clock + cache contract (T7+T8+T9) Tests for every behavior the v0.41.21.0 wave introduces or changes. New test files: test/sync-delete-batch.test.ts (PGLite hermetic) — engine.deletePages: empty input short-circuit, returns confirmed slugs (D6), multi-source isolation, cascade integrity (chunks + links cleared via FK), rejects oversized input. — engine.resolveSlugsByPaths: empty input, present + missing rows, D10 exotic-filename substrate (🌟.md / ทดสอบ.md / عربي.md), source isolation. — D13 pagesAffected filter: 100 deletable + 10 ghost paths → deletePages returns 100 (regression-pin: pre-fix would return all 110 via D6's pre-RETURNING shape). test/sync-delete-batch.slow.test.ts (.slow suffix keeps it out of the fast loop) — 10K-page batched delete completes in <5s on PGLite. Measured 277ms on dev hardware (18x under the gate); pins the headline perf promise. test/sync-rename-batch.test.ts (PGLite hermetic) — 500-rename batch slug-resolve in 1 round-trip (exactly at DELETE_BATCH_SIZE boundary). — Frontmatter-fallback rename: exotic source_paths resolve via the batch SELECT. — Mixed present + missing: partial Map (missing → caller falls back to path-derived). test/page-generation-counter.test.ts (PGLite hermetic) — Statement-level trigger fires once per INSERT statement (raw SQL — NOT putPage, which uses ON CONFLICT DO UPDATE and bumps by 2 in PG semantics). — Statement-level trigger fires once per UPDATE statement. — Headline contract: batch DELETE bumps clock by 1, NOT by row count (25-row batch → +1). — CDX-1 regression: DELETE of non-max page bumps clock. — CDX-2 regression: UPDATE of non-max page bumps clock (raw SQL). — D14 end-to-end: clock advances after batch DELETE → cache rows stamped at the prior clock value are now stale by Layer 1. — CDX-6/D20: empty-result cache + INSERT matching page → clock advances (Layer 1 fires). — Documents the PG quirk: putPage's INSERT...ON CONFLICT DO UPDATE bumps clock by 2 (both INSERT and UPDATE triggers fire). Test-helper update: test/helpers/reset-pglite.ts — Added page_generation_clock to PRESERVE_TABLES so the seeded single-row counter survives resetPgliteState between tests (same treatment as schema_version). Production never truncates. Existing test contract inversions (CDX-6 / D20 fix): test/query-cache-gate.test.ts — Pre-v0.41.21.0 "vacuously valid for legacy empty snapshot" assertion inverted: empty snapshot now invalidates when Layer 1 fires. Add positive CDX-6 regression test (empty-result + INSERT matching page). — SQL shape regression: page_generation_clock in Layer 1 (negative regression guard: MAX(generation) FROM pages MUST be gone). — Empty-snapshot reject guard: `qc.page_generations <> '{}'::jsonb` present; the old `qc.page_generations = '{}'::jsonb OR` shortcut MUST be gone. test/e2e/cache-gate-pglite.test.ts — Pre-v0.41.21.0 "legacy row serves vacuously" test inverted: legacy rows now invalidate on first clock advance post-upgrade. — CDX-1 regression: DELETE bumps clock → cached query for surviving pages invalidates. — CDX-2 regression: UPDATE-to-non-max-page bumps clock → cache invalidates. — CDX-11 comment fix: drop misleading "hard delete is admin-only" framing; gbrain sync hard-deletes on every run. Engine parity extension: test/e2e/engine-parity.test.ts — deletePages parity: same input set, both engines return same string[] of confirmed-deleted slugs (D6). — resolveSlugsByPaths parity: same Map on both engines. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(release): v0.41.21.0 — batched sync deletes + global page-generation clock (T10) VERSION bump (0.41.18.0 → 0.41.21.0; master is at 0.41.20.0 so next free slot per the queue allocator). CHANGELOG entry with the ELI10 lead per CLAUDE.md voice rules. CLAUDE.md annotations on engine.ts, postgres-engine.ts, pglite-engine.ts, sync.ts, and query-cache-gate.ts plus a new entry for engine-constants.ts. llms-full.txt regenerated to match CLAUDE.md (per CLAUDE.md mandatory rule). Co-Authored-By: garrytan-agents <noreply@anthropic.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * dx(test-runner): heartbeat shows real progress instead of 0p 0f Bun's default test reporter doesn't print per-test markers — only a single shard-end summary block when you pass it a file list. The existing heartbeat tried to count `^[[:space:]]+✓` lines as a live pass-count proxy, but bun never emits them in the multi-file mode this runner uses, so every mid-run heartbeat showed `0p 0f` for the entire 12-20 minute wallclock. Users (and agents polling the runner) couldn't distinguish "still bootstrapping" from "wedged" from "almost done." Fix: parse three complementary real-time signals instead. 1. Total files this shard was assigned — parsed from the `[unit-shard N/M] running X files` banner that run-unit-shard.sh echoes before invoking bun test. Available from second 1. 2. PGLite initSchema() count — proxy for "test files started so far." Each PGLite-using test file's beforeAll triggers one initSchema(), which logs `Schema version 1 → 106 (101 migration(s) pending)`. Undercounts because not every test file opens a PGLite engine (covers ~30-60% of files in practice), but it's the only real-time progress signal bun's default reporter leaves in the log. The output uses a `~` prefix to convey "approximate count." 3. Log size in KB — strictly monotonic liveness signal that works even when the PGLite count is still 0 (early-shard startup before the first initSchema fires). 4. Per-shard elapsed time — formatted as MmSSs. New mid-run heartbeat line: [heartbeat] [s1: ~62/190f 476KB 12m31s] [s2: ~63/190f 513KB 12m31s] ... When a shard finishes, the heartbeat upgrades to its final summary including pass/fail counts from bun's end-of-shard summary block: [heartbeat] [s1: done ✓ 2807p 0f] [s2: done ✓ 2784p 0f] ... Portability: BSD awk on macOS doesn't support `match($0, /re/, arr)` with the array sink — that's a gawk extension. The total-files parser uses sed instead so the runner stays portable to the default Mac toolchain. Helpers are pure functions and unit-testable in isolation: pass a log file path, get the parsed number. No mocking. No bun runtime required. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(release): rebump v0.41.23.0 → v0.41.25.0 Per user request — skip v0.41.23.0 / v0.41.24.0 slots to land at v0.41.25.0. Master is at v0.41.22.1, no version-trio collision. Touches VERSION, package.json, CHANGELOG header, CLAUDE.md annotations, src/core/engine-constants.ts header, src/core/migrate.ts migration v106 comment, regenerated llms-full.txt + llms.txt. Migration version (v106) and CDX1-6 trigger semantics unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(schema): reorder migration_impact_log AFTER minion_jobs (CI green) Pre-existing bug in master's SCHEMA_SQL ordering, surfaced by CI on this PR but lived silently on master since v0.41.18.0. migration_impact_log declares `job_id BIGINT REFERENCES minion_jobs(id)`, but its CREATE TABLE was at line 658 while minion_jobs's CREATE TABLE was at line 778. On any fresh-install initSchema() the FK target didn't exist yet: psql:/tmp/schema.sql:672: ERROR: relation "minion_jobs" does not exist postgres-js's `unsafe()` aborts the multi-statement batch on the first error response, so every CREATE TABLE after migration_impact_log (including minion_jobs itself) never ran. Every subsequent CLI subprocess that opened a connection then crashed with `relation "minion_jobs" does not exist` on its first query. Why master CI sometimes passed: the per-shard advisory lock + the test setup's `engine.initSchema()` second pass (which runs the migrations array) would eventually create minion_jobs via the v5 `minion_jobs_table` migration. From there migration_impact_log would land via migration v103 with its FK resolving correctly. But CLI subprocesses spawned by mechanical.test.ts's Parallel Import block open their OWN connections and run a fresh `engine.connect() → initSchema()` — that path runs SCHEMA_SQL FIRST and aborted at the same forward-reference error before the migrations array could repair. Fix: relocate the migration_impact_log CREATE TABLE + its two indexes to AFTER the minion_jobs CREATE TABLE block (lines ~865), keeping the rest of the schema layout intact. PGLite schema (pglite-schema.ts) already had the correct ordering — only Postgres SCHEMA_SQL needed the move. Verified: fresh-DB local repro that previously failed 31/34 tests with `relation minion_jobs does not exist` now passes 78/78 in test/e2e/mechanical.test.ts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: garrytan-agents <noreply@anthropic.com>
* upstream/master: v0.41.26.1 fix: lock-renewal cathedral — closes ~39 worker crashes/day (supersedes garrytan#1567) (garrytan#1572) v0.41.26.0 fix: dream --source + ingest junk titles + emoji-crash (supersedes garrytan#1559, garrytan#1561) (garrytan#1571) v0.41.25.0 perf(sync): batched deletes + global page-generation clock (supersedes garrytan#1538) (garrytan#1566) v0.41.24.0 fix(conversation-parser): threshold gates + bold-paren-time pattern — 20,167 Circleback messages unblocked (closes garrytan#1533) (garrytan#1543) v0.41.23.0 feat: extract operator surfaces + pack-driven extractables (garrytan#1541) v0.41.22.1 feat: brainstorm/lsd judge fixes (closes garrytan#1540 end-to-end) (garrytan#1562) v0.41.22.0 feat: type-unification cathedral — 94 types → 15 canonical (closes garrytan#1479) (garrytan#1542) v0.41.21.0 feat(ops): 5 daily-driver pains fixed in one wave (garrytan#1545) v0.41.20.0 feat: gbrain status + doctor --scope=brain (fix wave 2: items garrytan#6 + garrytan#7) (garrytan#1544) feat: v0.41.19.0 Supavisor Retry Cathedral (garrytan#1537) v0.41.18.0: gbrain onboard — the activation surface gbrain didn't have before (garrytan#1521) v0.41.17.0 feat: --workers N on every bulk command + facts dim doctor parity (garrytan#1519) v0.41.16.0 feat: conversation parser cathedral + progressive-batch primitive (closes garrytan#1461) (garrytan#1510) v0.41.15.0 feat(sync): --timeout + --max-age + partial status (closes garrytan#1472 RFC) (garrytan#1506)
… drain + disconnect audit (closes #1570) (#1608) * merge master: rebump v0.41.25.0 → v0.41.27.0 (queue collision) Master shipped v0.41.25.0 (#1538 batched sync deletes) and v0.41.26.0 (#1571 dream --source fix) while this branch was in flight. Conflict resolution rebumps to the next available slot. - VERSION: 0.41.25.0 → 0.41.27.0 - package.json: synced - CHANGELOG.md: my v0.41.27.0 entry placed above master's v0.41.26.0 and v0.41.25.0; in-entry version references updated 0.41.25.0 → 0.41.27.0 and forward-references bumped to v0.41.28+. - TODOS.md: kept master's v0.41.20.x section + my v0.41.27.0+ follow-ups No source-file conflicts during the merge. * feat(diagnostics): db-disconnect audit + doctor surface (v0.41.27.0) Instruments every db.disconnect() and PostgresEngine.disconnect() call with a JSONL audit record so the next user-reported #1570 cycle gives us the offender's caller stack instead of the symptomatic "No database connection" error. Audit shape (~/.gbrain/audit/db-disconnect-YYYY-Www.jsonl): {ts, engine_kind, connection_style, caller_stack[], command, pid} - src/core/audit/db-disconnect-audit.ts (NEW): the audit writer, built on the v0.40.4.0 createAuditWriter cathedral. Captures a 6-frame stack via new Error().stack so the offender is readable without spending stderr noise. - src/core/db.ts: logDbDisconnect call at the top of disconnect() (best-effort; never blocks the real teardown). - src/core/postgres-engine.ts: same instrumentation in PostgresEngine.disconnect() — distinguishes 'module' vs 'instance' connection_style so we can tell legitimate worker-pool teardowns apart from the load-bearing module-singleton class. - src/commands/doctor.ts: extends batch_retry_health to surface 24h disconnect count + most-recent caller stack. Warns when the caller frame isn't a known CLI-exit frame (e.g. cli.ts's finally block at the end of an op-dispatch). This is the diagnostic that tells v0.41.28+ where to apply the real ownership fix. - test/db-disconnect-audit.test.ts: unit coverage for the audit writer + caller-stack capture + JSONL shape. - test/e2e/db-singleton-shared-recovery.test.ts: real-Postgres regression that exercises the singleton-null path end-to-end. Refs #1570 * feat(retry): self-heal on null singleton — closes #1570 symptom (v0.41.27.0) withRetry gains an opt-in reconnect callback that fires between the isRetryableConnError classification and the inter-attempt sleep. PostgresEngine.batchRetry injects this.reconnect() — race-safe via the existing _reconnecting guard, handles module and instance pools. Closes the production loss reported in #1570: dream cycles on Supabase no longer drop ~150 link rows per cycle when the singleton goes null mid-batch. The retry now rebuilds the connection between attempts so the second try has somewhere to write to. - src/core/retry.ts: WithRetryOpts gains `reconnect?: () => Promise<void>`. Awaited in the catch branch. onRetry is also now awaited (back-compat- safe: every existing in-tree caller is a sync arrow). Reconnect failures propagate as the real cause — replaces the symptomatic "No database connection" error with whatever the connect() throw was, so operators see the truth. - src/core/postgres-engine.ts:batchRetry — injects `reconnect: () => this.reconnect()`. Covers all 9 batch-retry call sites (addLinksBatch, addTimelineEntriesBatch, upsertChunks, plus the 6 caller-supplied auditSite labels in extract / sync / reindex). - test/core/retry-reconnect.test.ts: 8 hermetic cases pinning the contract — reconnect fires before sleep, only on retryable errors, back-compat when omitted, signal-aborted bypasses reconnect, onRetry is awaited, full success path end-to-end. The deeper bug (who's calling disconnect mid-cycle) is left unaddressed in this commit by design — the diagnostic instrumentation in the prior commit will tell us in the next production run. Refs #1570 * feat(facts): drainPending() + CLI await before disconnect (v0.41.27.0) Closes the silent 'No database connection' tail-end errors after gbrain capture / put_page: the facts:absorb fire-and-forget queue sometimes outlived the CLI process's connection lifetime, so absorb attempts after engine.disconnect() landed in stderr as the GBrainError shape. - src/core/facts/queue.ts: new drainPending({timeout: 1000}) method distinct from shutdown(). Stops accepting new enqueues, awaits in-flight settle, bounded by timeout, returns count of unfinished. Semantically different from shutdown() (which aborts in-flight) so the symptom — drop work that hasn't started yet but let in-flight work finish — matches what CLI exit actually needs. - src/cli.ts: op-dispatch finally block awaits the drain BEFORE engine.disconnect(). Bounded 1s. Opt-out env GBRAIN_NO_FACTS_DRAIN for callers that don't enqueue (keeps fast-exit paths fast). Mirrors the v0.41.8.0 awaitPendingLastRetrievedWrites pattern. - test/facts-queue-drain-pending.test.ts: 6 hermetic cases — empty drain returns immediately, single in-flight settles, timeout bounds wait, shutdown-after-drain is idempotent, post-drain enqueues are dropped, signal-aborted skips waiting. Refs #1570 * docs: update project documentation for v0.41.27.0 README.md: added troubleshooting entry for the v0.41.27.0 retry-reconnect + facts:absorb drain fix (closes #1570), pointing operators at `gbrain doctor --json` to find the offending disconnect caller. CLAUDE.md: extended `src/core/retry.ts` entry with the new optional `reconnect` callback (v0.41.27.0); added two new Key Files entries for `src/core/audit/db-disconnect-audit.ts` (the diagnostic half of the "instrument first, fix later" pivot) and `FactsQueue.drainPending`; extended `doctor.ts:checkBatchRetryHealth` entry with the in-place extension that surfaces 24h disconnect-call count. llms-full.txt: regenerated to absorb CLAUDE.md edits. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * chore: rebump v0.41.27.0 → v0.41.28.0 (queue collision with #1573) Master shipped v0.41.27.0 (#1573 git-aware sync_freshness) claiming the same slot. Rebump to the next available version. - VERSION + package.json → 0.41.28.0 - CHANGELOG.md: my entry header + in-entry refs 0.41.27.0 → 0.41.28.0 - TODOS.md: my #1570 follow-up section header + body refs bumped * test: pin gateway in put-page-provenance + embedding-dim-check (CI shard fix) Both files failed on CI shards 1 and 8 under the cross-file gateway-state leak class (CLAUDE.md "Test-isolation lint and helpers"). The v0.41.28.0 merge reshuffled the weight-based shard bin-packing, landing a gateway-mutating sibling ahead of these two victims in the same `bun test` process. Mechanism: - put-page-provenance: put_page embeds via the gateway. A sibling left the gateway configured with OpenAI + the CI placeholder `sk-test` (captured at configureGateway time, survives the withEnv restore as cached gateway state). put_page's embed then fired against live OpenAI and 401'd. The bunfig legacy-embedding preload's beforeEach only re-applies legacy when the gateway was RESET — it does NOT correct a sibling that configured a different LIVE config. - embedding-dim-check: initSchema builds the content_chunks vector column at the gateway's configured dim. A sibling leaking ZE/1280 made the column 1280-d, so `expect(dims).toBe(1536)` failed. Fix (victim-side pinning, the escape hatch the preload documents): - Both: configure the gateway explicitly in beforeAll BEFORE initSchema (OpenAI/1536), resetGateway() in afterAll so neither leaks onward. - put-page-provenance also stubs the embed transport via __setEmbedTransportForTests so embed is deterministic and offline; a dummy OPENAI_API_KEY is supplied in the gateway env because instantiateEmbedding builds the OpenAI client (key check) BEFORE the stubbed transport is reached — the stub then intercepts the actual call so the key never leaves the process. Verified: CI shards 1 (1337 pass) + 8 (905 pass) green with OPENAI_API_KEY unset, plus adversarial sibling orderings (gateway.test / doctor-ze-checks preceding). Typecheck + check-test-isolation clean. --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Problem
A single commit that deletes 73K files (e.g., atom backfill reorganization) jams the sync pipeline for 2 days. Each file deletion requires 2 sequential DB round-trips:
SELECT slug FROM pages WHERE source_path = $1(slug resolution)DELETE FROM pages WHERE slug = $1 AND source_id = $2At 73K files × 2 queries = 146K individual DB round-trips, this takes ~5 hours. During that time:
Fix
Batch both phases:
Phase 1 — Slug resolution:
Phase 2 — Deletion:
Total: ~292 DB round-trips instead of ~146K.
Changes
src/core/engine.ts— Added optionaldeletePages(slugs[], opts?)toBrainEngineinterfacesrc/core/postgres-engine.ts— ImplementeddeletePages()withDELETE ... WHERE slug = ANY($1), batched at 500src/commands/sync.ts— Rewrote delete loop:SELECT ... WHERE source_path = ANY($1)engine.deletePages()deletePage()for PGLite or engines without batch support--timeoutbehavior)Compatibility
deletePagesis optional on the interface — PGLite and other engines fall back to sequential--timeoutabort checks preserved between batches