Skip to content

perf: batch sync deletes — 73K deletes in ~2 min instead of ~5 hours#1538

Closed
garrytan-agents wants to merge 1 commit into
garrytan:masterfrom
garrytan-agents:fix/batch-sync-deletes
Closed

perf: batch sync deletes — 73K deletes in ~2 min instead of ~5 hours#1538
garrytan-agents wants to merge 1 commit into
garrytan:masterfrom
garrytan-agents:fix/batch-sync-deletes

Conversation

@garrytan-agents

Copy link
Copy Markdown
Contributor

Problem

A single commit that deletes 73K files (e.g., atom backfill reorganization) jams the sync pipeline for 2 days. Each file deletion requires 2 sequential DB round-trips:

  1. SELECT slug FROM pages WHERE source_path = $1 (slug resolution)
  2. DELETE FROM pages WHERE slug = $1 AND source_id = $2

At 73K files × 2 queries = 146K individual DB round-trips, this takes ~5 hours. During that time:

  • The sync cron times out every run
  • All other sources stop syncing (cascade staleness)
  • Doctor health score drops to 0
  • Timeline orphans pile up

Fix

Batch both phases:

Phase 1 — Slug resolution:

-- Before: 73K individual queries
SELECT slug FROM pages WHERE source_path = $1 AND source_id = $2

-- After: 146 batch queries (500 paths per batch)
SELECT slug, source_path FROM pages WHERE source_path = ANY($1) AND source_id = $2

Phase 2 — Deletion:

-- Before: 73K individual queries
DELETE FROM pages WHERE slug = $1 AND source_id = $2

-- After: 146 batch queries (500 slugs per batch)
DELETE FROM pages WHERE slug = ANY($1) AND source_id = $2

Total: ~292 DB round-trips instead of ~146K.

Changes

  1. src/core/engine.ts — Added optional deletePages(slugs[], opts?) to BrainEngine interface
  2. src/core/postgres-engine.ts — Implemented deletePages() with DELETE ... WHERE slug = ANY($1), batched at 500
  3. src/commands/sync.ts — Rewrote delete loop:
    • Batch slug resolution via SELECT ... WHERE source_path = ANY($1)
    • Batch deletion via engine.deletePages()
    • Falls back to sequential deletePage() for PGLite or engines without batch support
    • Abort signal checked between batches (preserves --timeout behavior)

Compatibility

  • deletePages is optional on the interface — PGLite and other engines fall back to sequential
  • Progress reporting preserved (shows batch progress instead of per-file)
  • --timeout abort checks preserved between batches
  • No schema changes
  • FK cascades work identically with batch DELETE

The sync delete path processed each file deletion as two sequential DB
round-trips: one to resolve the slug (SELECT), one to delete (DELETE).
For a commit that removes 73K files, this means 146K individual queries
taking ~5 hours and jamming the entire sync pipeline for 2 days.

This change batches both phases:

Phase 1 — Slug resolution:
  Before: 73K individual SELECT ... WHERE source_path = $1
  After:  146 batched SELECT ... WHERE source_path = ANY($1) (500 per batch)

Phase 2 — Deletion:
  Before: 73K individual DELETE ... WHERE slug = $1
  After:  146 batched DELETE ... WHERE slug = ANY($1) (500 per batch)

Total: ~292 DB round-trips instead of ~146K. Real-world improvement:
5 hours → ~2 minutes.

The batch path requires engine.deletePages() (new optional method on
BrainEngine interface). Falls back to sequential deletePage() for PGLite
or engines that don't implement it.
@garrytan

Copy link
Copy Markdown
Owner

Closing as superseded — the supersede PR is in active design. Plan locked in via /plan-eng-review + codex outside-voice round; 20 architecture/quality/test/perf decisions decided, 5 follow-ups filed as v0.42+ TODOs.

Why supersede instead of merging this + iterating: this PR fixes the headline bottleneck on Postgres (73K-delete: ~5h → ~2min, real win), but ships with PGLite users getting zero benefit (deletePages falls back to sequential deletePage per slug). PGLite is the default engine for new installs, so the perf bug stays in place for most users.

The replacement work expands scope:

  1. PGLite parity. deletePages is required (not optional) on BrainEngine; both engines implement.
  2. Engine API factoring. SQL leaves sync.ts; new engine.resolveSlugsByPaths + engine.deletePages match the established addLinksBatch / setEmotionalWeightBatch convention.
  3. Test coverage. New test/sync-delete-batch.test.ts (multi-source isolation, cascade integrity, exotic-filename fallback, abort mid-batch, decompose-on-batch-error), test/sync-delete-batch.slow.test.ts (10K-delete <5s perf gate), test/sync-rename-batch.test.ts, test/page-generation-counter.test.ts, engine-parity case extension. This PR shipped 89 LOC with zero tests on a perf-critical hot path.
  4. Required sourceId on the new deletePages (D5) — type-level requirement closes a latent multi-source-bug-class foot-gun for the new surface.
  5. deletePages returns string[] of confirmed-deleted slugs — sync's pagesAffected then filters to only those, so downstream extract/embed phases stop wasting round-trips on phantom slugs.
  6. Per-batch try-catch + per-slug decompose on error, matching the existing import-loop's failedFiles pattern. A transient blip on batch 73 doesn't lose 500 deletes; it self-heals to one-at-a-time for that batch only.
  7. Query-cache bookmark correctness. Codex's outside-voice round on the plan uncovered that gbrain's query-cache MAX(generation) bookmark is structurally broken in master TODAY — independent of this PR's bug. UPDATE-to-a-non-max-generation page silently serves stale cached results. DELETE silently serves stale cached results. Empty-result cache rows silently serve stale results on subsequent matching INSERT. The supersede PR replaces the per-row pages.generation Layer-1 contract with a global page-generation clock + statement-level trigger, closing the entire silent-stale-read bug class along with the delete-perf win.

Credit: the SQL shape (DELETE FROM pages WHERE slug = ANY($1::text[]) AND source_id = $2) and the 500-slug batching idea both come from this PR and port verbatim into the replacement. @garrytan-agents Co-Authored-By trailers will appear on the relevant commits in the replacement PR.

The PR source is preserved locally in garrytan-agents/pr-1538-source for diff reference.

Plan file: ~/.claude/plans/system-instruction-you-are-working-ethereal-narwhal.md

@garrytan

Copy link
Copy Markdown
Owner

Superseded by #1568 per the CLAUDE.md community-PR-wave workflow — same fix, base-repo branch so CI has secret access, plus 12 review-driven improvements (/plan-eng-review + codex outside voice):

  • D1 drop the duplicate progress.finish() call
  • D2 batch BOTH sourceId-set AND no-sourceId Phase-1 lookups
  • D3 BATCH_SIZE=100 matches addLinksBatch precedent
  • D6 scope no-sourceId path to source_id='default' — kills a latent silent-no-op bug class on legacy gbrain sync callers
  • D7 ORDER BY slug ASC for deterministic duplicate-source_path collapse
  • D8 wire deletePages through withRetry(BULK_RETRY_OPTS) + audit (the v0.41.19 Supavisor cathedral)
  • D9 query-count regression test (the actual perf regression test)
  • D10 Promise<string[]> return shape (vs original Promise<number> — caller needs the actually-deleted slugs)
  • D11 implement on both PostgresEngine AND PGLiteEngine; method NON-optional on the interface (drops the fallback branch)

Plus a full test suite: 12 PGLite unit cases + 3 real-Postgres E2E cases. Your contribution preserved via Co-authored-by: garrytan-agents <garrytan-agents@users.noreply.github.com> on the new commit. Thanks @garrytan-agents.

garrytan added a commit that referenced this pull request May 27, 2026
… (supersedes #1538) (#1566)

* feat(engine): add deletePages + resolveSlugsByPaths to BrainEngine (v0.41.21.0 T1)

Two new REQUIRED methods on the BrainEngine interface, implemented on
both Postgres and PGLite engines. Closes the per-file N+1 query pattern
that PR #1538 batched on Postgres only.

  deletePages(slugs: string[], opts: { sourceId: string }): Promise<string[]>
    — Single SQL round-trip:
        DELETE FROM pages WHERE slug = ANY($1::text[]) AND source_id = $2
        RETURNING slug
    — Returns slugs ACTUALLY DELETED (D6, codex CDX-8) so callers can
      filter pagesAffected to exclude phantom slugs (paths in the
      deletion list but with no DB row).
    — Single-batch primitive: caller chunks input to DELETE_BATCH_SIZE.
      Throws if input exceeds the cap.
    — sourceId is REQUIRED at the type level (D5, codex CDX-10).
      Asymmetric with single-row deletePage which keeps the optional
      'default' fallback for back-compat. v0.42+ TODO to tighten.

  resolveSlugsByPaths(paths, opts): Promise<Map<path, slug>>
    — Batch path → slug lookup. Single SQL round-trip:
        SELECT slug, source_path FROM pages
         WHERE source_path = ANY($1::text[]) AND source_id = $2
    — Missing paths absent from the Map (caller falls back to
      path-derived slug, same contract as resolveSlugByPathOrSourcePath).
    — Empty input short-circuits to empty Map (no SQL).

  src/core/engine-constants.ts (NEW)
    — Single source of truth for DELETE_BATCH_SIZE = 500.
    — Both engines import; no engine-from-engine coupling.
    — Lives outside engine.ts (the interface module) to avoid circular
      imports.

Also updates the deletePage JSDoc (CDX-11): drops the misleading
"hard delete is admin-only" framing. `gbrain sync` hard-deletes on
every run that sees a deleted file; not admin-only.

Co-Authored-By: garrytan-agents <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf(sync): batched delete + rename + DRY refactor (v0.41.21.0 T2/T3/T4)

Replaces the per-file delete loop (sync.ts:1241-1257) and per-file
rename slug-resolve (sync.ts:1263-1295) with interleaved per-batch
flows using engine.resolveSlugsByPaths + engine.deletePages. Also
refactors resolveSlugByPathOrSourcePath (sync.ts:267) to delegate to
the new batch helper when sourceId is set — one owner of the SQL +
fallback semantics (D8).

ROUND-TRIP COUNTS (73K-delete commit):
  pre-fix:   73,000 SELECTs + 73,000 DELETEs = 146,000 (~5 hours)
  post-fix:     146 SELECTs +     146 DELETEs =     292 (~2 minutes)

Headline win: a single commit deleting 73K files no longer jams the
sync pipeline for hours, no longer cascades staleness across every
other source on the brain.

Shape (T2 delete loop, per the plan's ASCII diagram):

  filtered.deleted (73K paths)
        │
        ▼
   slice into batches of DELETE_BATCH_SIZE (500)
        │
        ▼  for each batch:
   abort-check ──► partial('timeout')
        │
        ▼
   engine.resolveSlugsByPaths(batch, {sourceId})  ◀── 1 SQL round-trip
        │
        ▼
   slugs = batch.map(path => map.get(path)
                  ?? resolveSlugForPath(path))    ◀── pure-JS fallback
        │                                              for frontmatter-
        ▼                                              fallback slugs
   try {
     deleted = engine.deletePages(slugs, opts)    ◀── 1 SQL round-trip
     pagesAffected.push(...deleted)               ◀── D6 confirmed only
   } catch {
     // D7 decompose: per-slug deletePage,
     // unrecoverable failures → failedFiles
   }

Per-batch try-catch (D7) decomposes batch DELETE failures to per-slug
deletePage so a transient blip on batch 73 doesn't lose 500 deletes —
it self-heals to one-at-a-time for that batch only. Unrecoverable
per-slug failures land in failedFiles (matching the existing
import-loop pattern at sync.ts:~1350). failedFiles declaration
hoisted above the delete loop so both delete decompose and import
loops feed the same sync-bookmark gate.

T4 rename loop: pre-resolves all `from` slugs in batches via
resolveSlugsByPaths BEFORE iterating. Per-file updateSlug + importFile
calls stay (those are inherently per-file). The try/catch around
updateSlug for slug-doesn't-exist preserves verbatim.

T3 DRY refactor: resolveSlugByPathOrSourcePath delegates to
resolveSlugsByPaths via a single-element array when sourceId is set.
When sourceId is undefined (legacy unscoped callers), falls back to
the original executeRaw shape — the batch engine surface requires
sourceId per D5 (multi-source-bug-class defense).

Atomicity coarsening (D3): each batch is one transaction. A mid-batch
abort or connection failure rolls back up to DELETE_BATCH_SIZE - 1
successful deletes from the in-flight batch. Sync is idempotent so
the next run picks them up via git diff regenerating the deletion
list. Documented at the call site + in the deletePages JSDoc.

Co-Authored-By: garrytan-agents <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(schema): global page-generation clock + statement-level trigger (v0.41.21.0 T5)

Migration v104: page_generation_clock_and_statement_trigger.

The pre-v0.41.21.0 query-cache Layer 1 bookmark read MAX(generation) FROM
pages to detect "writes happened since cache-store". Two bugs in that
contract — independent of any sync work, surfaced by codex
outside-voice on the /plan-eng-review pass:

  1. The row-level bump_page_generation_trg (migration v91) sets
     NEW.generation = OLD.generation + 1 on UPDATE. Updating a NON-MAX
     page didn't advance MAX(generation). Cache silently served stale
     for any UPDATE-to-non-max page. (CDX-2)
  2. The trigger is BEFORE INSERT OR UPDATE — DELETE doesn't fire it
     at all. Even an AFTER DELETE wouldn't move MAX (surviving rows
     are untouched). (CDX-1)

Fix: single-row page_generation_clock counter, bumped per-statement
(FOR EACH STATEMENT — per-row would turn a 73K-row batch DELETE into
73K UPDATEs on the same counter, recreating the bottleneck this PR
fixes elsewhere — codex CDX-4). Layer 1 reads the clock value
directly (T6, separate commit). Per-row pages.generation stays for
Layer 2 (per-page snapshot via jsonb_each + LEFT JOIN pages) which
doesn't care about MAX, only per-page advancement.

Seeded with COALESCE(MAX(pages.generation), 0) so existing query_cache
rows stored under the old MAX semantics aren't all instantly
invalidated on upgrade. Their max_generation_at_store stamp compares
cleanly against the seeded clock; future writes bump the clock and
the bookmark fires correctly.

  CREATE TABLE page_generation_clock (
    id    INTEGER PRIMARY KEY CHECK (id = 1),
    value BIGINT  NOT NULL DEFAULT 0
  );
  CREATE TRIGGER bump_page_generation_clock_trg
    AFTER INSERT OR UPDATE OR DELETE ON pages
    FOR EACH STATEMENT
    EXECUTE FUNCTION bump_page_generation_clock_fn();

Mirror in src/core/pglite-schema.ts so fresh PGLite installs get the
table + trigger via SCHEMA_SQL replay. The forward-reference bootstrap
probe doesn't need an entry: page_generation_clock is created directly
by SCHEMA_SQL (no separate index or FK references it), so the
schema-bootstrap-coverage gate is satisfied as-is.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cache): move Layer 1 to global clock + invalidate empty snapshots (v0.41.21.0 T6)

Closes the silent stale-cache bug class that's been live in master
since the bookmark feature shipped. Pre-fix, gbrain search would
silently serve stale cached results in three independent scenarios:

  1. UPDATE to a non-max-generation page (CDX-2) — the row-level
     trigger advanced per-page generation but didn't move
     MAX(generation), so the bookmark passed.
  2. DELETE of any page (CDX-1) — the trigger didn't fire at all,
     and even an AFTER DELETE wouldn't move MAX.
  3. Empty-result cache row + subsequent matching INSERT (CDX-6 /
     D20) — page_generations = '{}'::jsonb was "vacuously valid" via
     Layer 2, surviving any clock bump.

Fix:

  buildPageGenerationsSnapshot (store path)
    — Replaces the SELECT MAX(generation) FROM pages reads at
      cache-write time with SELECT value FROM page_generation_clock
      WHERE id = 1.
    — Empty pageIds path: only need the clock value (D20 contract).
    — Combined non-empty path: per-page generation (Layer 2 substrate)
      + clock value, both folded in one round trip via UNION ALL.

  CACHE_GATE_WHERE_CLAUSE (lookup path)
    — Layer 1 reads page_generation_clock.value (single-row O(1)
      lookup, faster than the pre-fix MAX(generation) backward index
      scan).
    — Layer 2 stricter: requires page_generations <> '{}'::jsonb AND
      the per-page check (not OR with the vacuously-valid `= '{}'`
      shortcut). Empty snapshots can no longer survive a Layer 1 miss.

  validateCacheRowAgainstPages (pure validator)
    — Layer 2 returns false for empty snapshots when Layer 1 fails.
    — Documented contract change.

Backward compat: pre-v0.40.3.0 cache rows have
max_generation_at_store = 0 AND page_generations = '{}'::jsonb. On a
populated brain, Layer 1 fails (clock > 0). Layer 2 is now stricter
so legacy rows invalidate once on first post-upgrade lookup, then the
cache fills back correctly. Acceptable one-time miss spike;
post-upgrade cache is structurally sound.

The clock seed (COALESCE(MAX(pages.generation), 0)) from migration
v104 keeps NON-empty legacy rows passing Layer 1 until the next
write — they don't all invalidate at once.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: cover v0.41.21.0 delete-batch + global clock + cache contract (T7+T8+T9)

Tests for every behavior the v0.41.21.0 wave introduces or changes.

New test files:

  test/sync-delete-batch.test.ts (PGLite hermetic)
    — engine.deletePages: empty input short-circuit, returns confirmed
      slugs (D6), multi-source isolation, cascade integrity (chunks +
      links cleared via FK), rejects oversized input.
    — engine.resolveSlugsByPaths: empty input, present + missing rows,
      D10 exotic-filename substrate (🌟.md / ทดสอบ.md / عربي.md),
      source isolation.
    — D13 pagesAffected filter: 100 deletable + 10 ghost paths →
      deletePages returns 100 (regression-pin: pre-fix would return
      all 110 via D6's pre-RETURNING shape).

  test/sync-delete-batch.slow.test.ts (.slow suffix keeps it out of
  the fast loop)
    — 10K-page batched delete completes in <5s on PGLite. Measured
      277ms on dev hardware (18x under the gate); pins the headline
      perf promise.

  test/sync-rename-batch.test.ts (PGLite hermetic)
    — 500-rename batch slug-resolve in 1 round-trip (exactly at
      DELETE_BATCH_SIZE boundary).
    — Frontmatter-fallback rename: exotic source_paths resolve via
      the batch SELECT.
    — Mixed present + missing: partial Map (missing → caller falls
      back to path-derived).

  test/page-generation-counter.test.ts (PGLite hermetic)
    — Statement-level trigger fires once per INSERT statement (raw
      SQL — NOT putPage, which uses ON CONFLICT DO UPDATE and bumps
      by 2 in PG semantics).
    — Statement-level trigger fires once per UPDATE statement.
    — Headline contract: batch DELETE bumps clock by 1, NOT by row
      count (25-row batch → +1).
    — CDX-1 regression: DELETE of non-max page bumps clock.
    — CDX-2 regression: UPDATE of non-max page bumps clock (raw SQL).
    — D14 end-to-end: clock advances after batch DELETE → cache rows
      stamped at the prior clock value are now stale by Layer 1.
    — CDX-6/D20: empty-result cache + INSERT matching page → clock
      advances (Layer 1 fires).
    — Documents the PG quirk: putPage's INSERT...ON CONFLICT DO UPDATE
      bumps clock by 2 (both INSERT and UPDATE triggers fire).

Test-helper update:

  test/helpers/reset-pglite.ts
    — Added page_generation_clock to PRESERVE_TABLES so the seeded
      single-row counter survives resetPgliteState between tests
      (same treatment as schema_version). Production never truncates.

Existing test contract inversions (CDX-6 / D20 fix):

  test/query-cache-gate.test.ts
    — Pre-v0.41.21.0 "vacuously valid for legacy empty snapshot"
      assertion inverted: empty snapshot now invalidates when Layer 1
      fires. Add positive CDX-6 regression test (empty-result + INSERT
      matching page).
    — SQL shape regression: page_generation_clock in Layer 1 (negative
      regression guard: MAX(generation) FROM pages MUST be gone).
    — Empty-snapshot reject guard:
      `qc.page_generations <> '{}'::jsonb` present; the old
      `qc.page_generations = '{}'::jsonb OR` shortcut MUST be gone.

  test/e2e/cache-gate-pglite.test.ts
    — Pre-v0.41.21.0 "legacy row serves vacuously" test inverted:
      legacy rows now invalidate on first clock advance post-upgrade.
    — CDX-1 regression: DELETE bumps clock → cached query for
      surviving pages invalidates.
    — CDX-2 regression: UPDATE-to-non-max-page bumps clock → cache
      invalidates.
    — CDX-11 comment fix: drop misleading "hard delete is admin-only"
      framing; gbrain sync hard-deletes on every run.

Engine parity extension:

  test/e2e/engine-parity.test.ts
    — deletePages parity: same input set, both engines return same
      string[] of confirmed-deleted slugs (D6).
    — resolveSlugsByPaths parity: same Map on both engines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(release): v0.41.21.0 — batched sync deletes + global page-generation clock (T10)

VERSION bump (0.41.18.0 → 0.41.21.0; master is at 0.41.20.0 so
next free slot per the queue allocator). CHANGELOG entry with the
ELI10 lead per CLAUDE.md voice rules. CLAUDE.md annotations on
engine.ts, postgres-engine.ts, pglite-engine.ts, sync.ts, and
query-cache-gate.ts plus a new entry for engine-constants.ts.
llms-full.txt regenerated to match CLAUDE.md (per CLAUDE.md
mandatory rule).

Co-Authored-By: garrytan-agents <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* dx(test-runner): heartbeat shows real progress instead of 0p 0f

Bun's default test reporter doesn't print per-test markers — only a
single shard-end summary block when you pass it a file list. The
existing heartbeat tried to count `^[[:space:]]+✓` lines as a live
pass-count proxy, but bun never emits them in the multi-file mode this
runner uses, so every mid-run heartbeat showed `0p 0f` for the entire
12-20 minute wallclock. Users (and agents polling the runner) couldn't
distinguish "still bootstrapping" from "wedged" from "almost done."

Fix: parse three complementary real-time signals instead.

  1. Total files this shard was assigned — parsed from the
     `[unit-shard N/M] running X files` banner that run-unit-shard.sh
     echoes before invoking bun test. Available from second 1.

  2. PGLite initSchema() count — proxy for "test files started so far."
     Each PGLite-using test file's beforeAll triggers one initSchema(),
     which logs `Schema version 1 → 106 (101 migration(s) pending)`.
     Undercounts because not every test file opens a PGLite engine
     (covers ~30-60% of files in practice), but it's the only real-time
     progress signal bun's default reporter leaves in the log. The
     output uses a `~` prefix to convey "approximate count."

  3. Log size in KB — strictly monotonic liveness signal that works
     even when the PGLite count is still 0 (early-shard startup before
     the first initSchema fires).

  4. Per-shard elapsed time — formatted as MmSSs.

New mid-run heartbeat line:

  [heartbeat]  [s1: ~62/190f 476KB 12m31s] [s2: ~63/190f 513KB 12m31s] ...

When a shard finishes, the heartbeat upgrades to its final summary
including pass/fail counts from bun's end-of-shard summary block:

  [heartbeat]  [s1: done ✓ 2807p 0f] [s2: done ✓ 2784p 0f] ...

Portability: BSD awk on macOS doesn't support `match($0, /re/, arr)`
with the array sink — that's a gawk extension. The total-files parser
uses sed instead so the runner stays portable to the default Mac toolchain.

Helpers are pure functions and unit-testable in isolation: pass a log
file path, get the parsed number. No mocking. No bun runtime required.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(release): rebump v0.41.23.0 → v0.41.25.0

Per user request — skip v0.41.23.0 / v0.41.24.0 slots to land at
v0.41.25.0. Master is at v0.41.22.1, no version-trio collision.

Touches VERSION, package.json, CHANGELOG header, CLAUDE.md
annotations, src/core/engine-constants.ts header, src/core/migrate.ts
migration v106 comment, regenerated llms-full.txt + llms.txt.

Migration version (v106) and CDX1-6 trigger semantics unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(schema): reorder migration_impact_log AFTER minion_jobs (CI green)

Pre-existing bug in master's SCHEMA_SQL ordering, surfaced by CI on
this PR but lived silently on master since v0.41.18.0.

migration_impact_log declares `job_id BIGINT REFERENCES minion_jobs(id)`,
but its CREATE TABLE was at line 658 while minion_jobs's CREATE TABLE
was at line 778. On any fresh-install initSchema() the FK target
didn't exist yet:

  psql:/tmp/schema.sql:672: ERROR: relation "minion_jobs" does not exist

postgres-js's `unsafe()` aborts the multi-statement batch on the first
error response, so every CREATE TABLE after migration_impact_log
(including minion_jobs itself) never ran. Every subsequent CLI
subprocess that opened a connection then crashed with
`relation "minion_jobs" does not exist` on its first query.

Why master CI sometimes passed: the per-shard advisory lock + the
test setup's `engine.initSchema()` second pass (which runs the
migrations array) would eventually create minion_jobs via the v5
`minion_jobs_table` migration. From there migration_impact_log would
land via migration v103 with its FK resolving correctly. But CLI
subprocesses spawned by mechanical.test.ts's Parallel Import block
open their OWN connections and run a fresh `engine.connect() →
initSchema()` — that path runs SCHEMA_SQL FIRST and aborted at the
same forward-reference error before the migrations array could repair.

Fix: relocate the migration_impact_log CREATE TABLE + its two indexes
to AFTER the minion_jobs CREATE TABLE block (lines ~865), keeping
the rest of the schema layout intact. PGLite schema (pglite-schema.ts)
already had the correct ordering — only Postgres SCHEMA_SQL needed
the move.

Verified: fresh-DB local repro that previously failed 31/34 tests
with `relation minion_jobs does not exist` now passes 78/78 in
test/e2e/mechanical.test.ts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: garrytan-agents <noreply@anthropic.com>
mgunnin added a commit to mgunnin/gbrain that referenced this pull request May 28, 2026
* upstream/master:
  v0.41.26.1 fix: lock-renewal cathedral — closes ~39 worker crashes/day (supersedes garrytan#1567) (garrytan#1572)
  v0.41.26.0 fix: dream --source + ingest junk titles + emoji-crash (supersedes garrytan#1559, garrytan#1561) (garrytan#1571)
  v0.41.25.0 perf(sync): batched deletes + global page-generation clock (supersedes garrytan#1538) (garrytan#1566)
  v0.41.24.0 fix(conversation-parser): threshold gates + bold-paren-time pattern — 20,167 Circleback messages unblocked (closes garrytan#1533) (garrytan#1543)
  v0.41.23.0 feat: extract operator surfaces + pack-driven extractables (garrytan#1541)
  v0.41.22.1 feat: brainstorm/lsd judge fixes (closes garrytan#1540 end-to-end) (garrytan#1562)
  v0.41.22.0 feat: type-unification cathedral — 94 types → 15 canonical (closes garrytan#1479) (garrytan#1542)
  v0.41.21.0 feat(ops): 5 daily-driver pains fixed in one wave (garrytan#1545)
  v0.41.20.0 feat: gbrain status + doctor --scope=brain (fix wave 2: items garrytan#6 + garrytan#7) (garrytan#1544)
  feat: v0.41.19.0 Supavisor Retry Cathedral (garrytan#1537)
  v0.41.18.0: gbrain onboard — the activation surface gbrain didn't have before (garrytan#1521)
  v0.41.17.0 feat: --workers N on every bulk command + facts dim doctor parity (garrytan#1519)
  v0.41.16.0 feat: conversation parser cathedral + progressive-batch primitive (closes garrytan#1461) (garrytan#1510)
  v0.41.15.0 feat(sync): --timeout + --max-age + partial status (closes garrytan#1472 RFC) (garrytan#1506)
garrytan added a commit that referenced this pull request May 29, 2026
… drain + disconnect audit (closes #1570) (#1608)

* merge master: rebump v0.41.25.0 → v0.41.27.0 (queue collision)

Master shipped v0.41.25.0 (#1538 batched sync deletes) and v0.41.26.0
(#1571 dream --source fix) while this branch was in flight. Conflict
resolution rebumps to the next available slot.

- VERSION: 0.41.25.0 → 0.41.27.0
- package.json: synced
- CHANGELOG.md: my v0.41.27.0 entry placed above master's v0.41.26.0
  and v0.41.25.0; in-entry version references updated 0.41.25.0 →
  0.41.27.0 and forward-references bumped to v0.41.28+.
- TODOS.md: kept master's v0.41.20.x section + my v0.41.27.0+ follow-ups

No source-file conflicts during the merge.

* feat(diagnostics): db-disconnect audit + doctor surface (v0.41.27.0)

Instruments every db.disconnect() and PostgresEngine.disconnect() call
with a JSONL audit record so the next user-reported #1570 cycle gives
us the offender's caller stack instead of the symptomatic
"No database connection" error.

Audit shape (~/.gbrain/audit/db-disconnect-YYYY-Www.jsonl):
  {ts, engine_kind, connection_style, caller_stack[], command, pid}

- src/core/audit/db-disconnect-audit.ts (NEW): the audit writer,
  built on the v0.40.4.0 createAuditWriter cathedral. Captures a
  6-frame stack via new Error().stack so the offender is readable
  without spending stderr noise.
- src/core/db.ts: logDbDisconnect call at the top of disconnect()
  (best-effort; never blocks the real teardown).
- src/core/postgres-engine.ts: same instrumentation in
  PostgresEngine.disconnect() — distinguishes 'module' vs 'instance'
  connection_style so we can tell legitimate worker-pool teardowns
  apart from the load-bearing module-singleton class.
- src/commands/doctor.ts: extends batch_retry_health to surface
  24h disconnect count + most-recent caller stack. Warns when the
  caller frame isn't a known CLI-exit frame (e.g. cli.ts's finally
  block at the end of an op-dispatch). This is the diagnostic that
  tells v0.41.28+ where to apply the real ownership fix.
- test/db-disconnect-audit.test.ts: unit coverage for the audit
  writer + caller-stack capture + JSONL shape.
- test/e2e/db-singleton-shared-recovery.test.ts: real-Postgres
  regression that exercises the singleton-null path end-to-end.

Refs #1570

* feat(retry): self-heal on null singleton — closes #1570 symptom (v0.41.27.0)

withRetry gains an opt-in reconnect callback that fires between the
isRetryableConnError classification and the inter-attempt sleep.
PostgresEngine.batchRetry injects this.reconnect() — race-safe via
the existing _reconnecting guard, handles module and instance pools.

Closes the production loss reported in #1570: dream cycles on Supabase
no longer drop ~150 link rows per cycle when the singleton goes null
mid-batch. The retry now rebuilds the connection between attempts so
the second try has somewhere to write to.

- src/core/retry.ts: WithRetryOpts gains `reconnect?: () => Promise<void>`.
  Awaited in the catch branch. onRetry is also now awaited (back-compat-
  safe: every existing in-tree caller is a sync arrow). Reconnect
  failures propagate as the real cause — replaces the symptomatic
  "No database connection" error with whatever the connect() throw
  was, so operators see the truth.
- src/core/postgres-engine.ts:batchRetry — injects
  `reconnect: () => this.reconnect()`. Covers all 9 batch-retry call
  sites (addLinksBatch, addTimelineEntriesBatch, upsertChunks, plus
  the 6 caller-supplied auditSite labels in extract / sync / reindex).
- test/core/retry-reconnect.test.ts: 8 hermetic cases pinning the
  contract — reconnect fires before sleep, only on retryable errors,
  back-compat when omitted, signal-aborted bypasses reconnect,
  onRetry is awaited, full success path end-to-end.

The deeper bug (who's calling disconnect mid-cycle) is left
unaddressed in this commit by design — the diagnostic instrumentation
in the prior commit will tell us in the next production run.

Refs #1570

* feat(facts): drainPending() + CLI await before disconnect (v0.41.27.0)

Closes the silent 'No database connection' tail-end errors after
gbrain capture / put_page: the facts:absorb fire-and-forget queue
sometimes outlived the CLI process's connection lifetime, so absorb
attempts after engine.disconnect() landed in stderr as the
GBrainError shape.

- src/core/facts/queue.ts: new drainPending({timeout: 1000}) method
  distinct from shutdown(). Stops accepting new enqueues, awaits
  in-flight settle, bounded by timeout, returns count of unfinished.
  Semantically different from shutdown() (which aborts in-flight)
  so the symptom — drop work that hasn't started yet but let
  in-flight work finish — matches what CLI exit actually needs.
- src/cli.ts: op-dispatch finally block awaits the drain BEFORE
  engine.disconnect(). Bounded 1s. Opt-out env GBRAIN_NO_FACTS_DRAIN
  for callers that don't enqueue (keeps fast-exit paths fast).
  Mirrors the v0.41.8.0 awaitPendingLastRetrievedWrites pattern.
- test/facts-queue-drain-pending.test.ts: 6 hermetic cases — empty
  drain returns immediately, single in-flight settles, timeout
  bounds wait, shutdown-after-drain is idempotent, post-drain
  enqueues are dropped, signal-aborted skips waiting.

Refs #1570

* docs: update project documentation for v0.41.27.0

README.md: added troubleshooting entry for the v0.41.27.0 retry-reconnect
+ facts:absorb drain fix (closes #1570), pointing operators at
`gbrain doctor --json` to find the offending disconnect caller.

CLAUDE.md: extended `src/core/retry.ts` entry with the new optional
`reconnect` callback (v0.41.27.0); added two new Key Files entries for
`src/core/audit/db-disconnect-audit.ts` (the diagnostic half of the
"instrument first, fix later" pivot) and `FactsQueue.drainPending`;
extended `doctor.ts:checkBatchRetryHealth` entry with the in-place
extension that surfaces 24h disconnect-call count.

llms-full.txt: regenerated to absorb CLAUDE.md edits.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore: rebump v0.41.27.0 → v0.41.28.0 (queue collision with #1573)

Master shipped v0.41.27.0 (#1573 git-aware sync_freshness) claiming the
same slot. Rebump to the next available version.

- VERSION + package.json → 0.41.28.0
- CHANGELOG.md: my entry header + in-entry refs 0.41.27.0 → 0.41.28.0
- TODOS.md: my #1570 follow-up section header + body refs bumped

* test: pin gateway in put-page-provenance + embedding-dim-check (CI shard fix)

Both files failed on CI shards 1 and 8 under the cross-file gateway-state
leak class (CLAUDE.md "Test-isolation lint and helpers"). The v0.41.28.0
merge reshuffled the weight-based shard bin-packing, landing a
gateway-mutating sibling ahead of these two victims in the same `bun test`
process.

Mechanism:
- put-page-provenance: put_page embeds via the gateway. A sibling left
  the gateway configured with OpenAI + the CI placeholder `sk-test`
  (captured at configureGateway time, survives the withEnv restore as
  cached gateway state). put_page's embed then fired against live OpenAI
  and 401'd. The bunfig legacy-embedding preload's beforeEach only
  re-applies legacy when the gateway was RESET — it does NOT correct a
  sibling that configured a different LIVE config.
- embedding-dim-check: initSchema builds the content_chunks vector column
  at the gateway's configured dim. A sibling leaking ZE/1280 made the
  column 1280-d, so `expect(dims).toBe(1536)` failed.

Fix (victim-side pinning, the escape hatch the preload documents):
- Both: configure the gateway explicitly in beforeAll BEFORE initSchema
  (OpenAI/1536), resetGateway() in afterAll so neither leaks onward.
- put-page-provenance also stubs the embed transport via
  __setEmbedTransportForTests so embed is deterministic and offline; a
  dummy OPENAI_API_KEY is supplied in the gateway env because
  instantiateEmbedding builds the OpenAI client (key check) BEFORE the
  stubbed transport is reached — the stub then intercepts the actual
  call so the key never leaves the process.

Verified: CI shards 1 (1337 pass) + 8 (905 pass) green with
OPENAI_API_KEY unset, plus adversarial sibling orderings (gateway.test /
doctor-ze-checks preceding). Typecheck + check-test-isolation clean.

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants