
v0.18.2: migration hardening — integrity fix + reserved-connection primitive#356

Merged
garrytan merged 9 commits into master from feat/migration-hardening
Apr 23, 2026

Conversation


@garrytan garrytan commented Apr 23, 2026

Summary

Closes the v0.18.0 production upgrade field report. Ships 8 original hardening
fixes plus 3 codex-caught issues that both the initial review and the eng review
missed. New public primitive (BrainEngine.withReservedConnection) and new CLI
flag (gbrain doctor --locks).

Original 8 fixes (from commit eab8cc2):

  • LATEST_VERSION via Math.max() (the old array-last lookup silently
    reported v16 to every Postgres user while 7 migrations were pending)
  • Pre-flight pg_stat_activity blocker check before DDL
  • SET LOCAL statement_timeout='600000' inside transactional migrations
  • Error 57014 actionable diagnostics
  • Migration 21 files_page_slug_fkey handling
  • idle_in_transaction_session_timeout=5min on all connections
  • apply-migrations warns when schema version is behind
  • Per-migration progress output
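
The Math.max() fix is small enough to sketch. The registry shape and entries below are illustrative, not the project's actual migration list:

```typescript
// Hypothetical Migration shape; only the version field matters here.
interface Migration {
  version: number;
  name: string;
}

// Deliberately out of order, mirroring the field-report scenario.
const MIGRATIONS: Migration[] = [
  { version: 23, name: "files-page-id" },
  { version: 22, name: "ledger" },
  { version: 21, name: "unique-swap" },
  { version: 16, name: "chunks" },
  { version: 15, name: "sources" },
];

// Fragile: reports v15 because the array's last element is not the newest.
const lastElementVersion = MIGRATIONS[MIGRATIONS.length - 1].version;

// Robust: correct regardless of array order.
const LATEST_VERSION = Math.max(...MIGRATIONS.map((m) => m.version));
```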

3 codex-caught additions:

  1. Migration 21 integrity window. Original v21 dropped
    files_page_slug_fkey and persisted config.version=21; v23 added the
    replacement files.page_id. Process death between v21 and v23 left
    files with no referential integrity while file_upload kept writing.
    Fix: v21 is engine-split (Postgres additive-only, PGLite full swap); v23's
    handler wraps FK drop + UNIQUE swap + files.source_id/page_id + backfill +
    ledger in one engine.transaction(). Atomic rollback on any failure.

  2. Non-transactional DDL timeout gap. runMigrationSQL's else-branch
    (transaction:false, i.e. CREATE INDEX CONCURRENTLY) ran on the shared
    pool with no timeout override, exposed to Supabase's 2-min server ceiling.
    Fix: use new engine.withReservedConnection + session-level
    SET statement_timeout='600000'. Isolated connection means no leak to the
    shared pool.

  3. gbrain doctor --locks actually exists now. The v0.18.0 57014
    diagnostic referenced this flag but it wasn't implemented. Users who hit a
    timeout would run the suggested command and get "unknown option".
    Implemented: queries pg_stat_activity, prints PIDs + kill commands, exits
    1 on blockers. Works on Postgres; no-op on PGLite.
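
The exit-code and kill-command behavior of --locks reduces to a pure function. The IdleBlocker shape and reportLocks name here are hypothetical stand-ins for the real CLI code:

```typescript
// Hypothetical row shape; the real getIdleBlockers() output may carry more fields.
interface IdleBlocker {
  pid: number;
  state: string;
  query: string;
}

// Mirrors the documented contract: exit 1 when blockers exist, 0 when clean,
// printing the exact pg_terminate_backend command per PID.
function reportLocks(blockers: IdleBlocker[]): { exitCode: number; lines: string[] } {
  if (blockers.length === 0) {
    return { exitCode: 0, lines: ["no idle-in-transaction blockers"] };
  }
  const lines = blockers.map(
    (b) => `pid=${b.pid} state=${b.state} :: SELECT pg_terminate_backend(${b.pid});`
  );
  return { exitCode: 1, lines };
}
```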

New primitive: BrainEngine.withReservedConnection(fn) on both engines.
Postgres uses sql.reserve() (postgres-js 3.4+). PGLite pass-through. Used
right away for the non-transactional DDL path; foundation for the future
feat/migration-exclusive-mode write-quiesce PR.
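
A minimal sketch of the contract, assuming a simplified Sql type: the Postgres side would wrap postgres-js sql.reserve()/release, while the PGLite side is the pass-through shown here. Class and type names are illustrative:

```typescript
// Stand-in for the real query handle type.
type Sql = { unsafe(query: string): Promise<unknown> };

interface BrainEngine {
  // Runs fn against a connection pinned to one backend, so session-level
  // SET statements cannot leak into the shared pool.
  withReservedConnection<T>(fn: (sql: Sql) => Promise<T>): Promise<T>;
}

// PGLite-style pass-through: a single backing connection is already "reserved".
class SingleConnectionEngine implements BrainEngine {
  constructor(private sql: Sql) {}
  async withReservedConnection<T>(fn: (sql: Sql) => Promise<T>): Promise<T> {
    return fn(this.sql);
  }
}
```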

DRY: setSessionDefaults(sql) helper absorbs the duplicated
idle_in_transaction_session_timeout block from db.ts and
postgres-engine.ts. getIdleBlockers(engine) helper shared between
DDL pre-flight, doctor --locks, and future drain-wait logic.
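
The failure contract of getIdleBlockers described in the tests (PGLite returns [], query failure returns [] rather than throwing) can be sketched like this. The engine API and query text are simplified assumptions:

```typescript
// Stand-in engine surface; the real BrainEngine API differs.
interface QueryEngine {
  kind: "postgres" | "pglite";
  executeRaw(query: string): Promise<Array<Record<string, unknown>>>;
}

// Returns idle-in-transaction blockers older than 5 minutes; empty array on
// PGLite (no pool) and on query failure, so callers never need try/catch.
async function getIdleBlockers(engine: QueryEngine): Promise<Array<Record<string, unknown>>> {
  if (engine.kind === "pglite") return [];
  try {
    return await engine.executeRaw(
      `SELECT pid, state, query_start, query
         FROM pg_stat_activity
        WHERE state = 'idle in transaction'
          AND now() - query_start > interval '5 minutes'`
    );
  } catch {
    return []; // diagnostics must never break the migration path
  }
}
```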

Test Coverage

  • test/migrate.test.ts: 10 new regression guards (Math.max robustness,
    getIdleBlockers shape, 57014 catch path, pre-flight warning structural
    check, setSessionDefaults DRY check, runMigrationSQL reserved-connection
    usage, plus updated v21/v23 assertions for the restructure).
  • test/e2e/migrate-chain.test.ts (new): 11 E2E tests against real Postgres
    (post-chain schema invariants, doctor --locks real-connection detection,
    runMigrationsUpTo advancement, withReservedConnection round-trip).
  • test/e2e/helpers.ts: new runMigrationsUpTo(engine, targetVersion) and
    setConfigVersion(version) helpers enable mid-chain migration tests —
    codex caught that the proposed v15→v23 E2E wasn't implementable without
    new harness work.

Unit: 49 pass in test/migrate.test.ts.
E2E: 186 pass across 18 files (via bun run test:e2e with real Postgres).

Review history

  • CEO review: scope + strategy. Identified 3 blocking gaps.
  • Eng review: architecture + tests. 4 architectural + 3 code-quality + 4 test
    findings. User decided advisory-lock + chokepoint + full-chain E2E via
    AskUserQuestion.
  • Codex plan review: 7 findings — 3 revised prior decisions (put_page not
    the real chokepoint; session-advisory-lock conflicts with PgBouncer; v15
    test harness doesn't exist), 3 new bugs both reviews missed (migration 21
    integrity hole, ledger double-write, 22K resync unsolved), 1 pool-size=1
    gap. Changed this PR from 5 work items to 7.

Cross-model agreement: all 8 original fixes verified correct. Disagreement
on 3 architectural decisions. Codex's evidence was file:line grounded and
accepted after review.

Deferred to TODOs (codex findings, not blocking)

  • Orchestrator ledger double-write (v0_18_0.ts:200-208 + apply-migrations.ts:374-386
    both append). Distorts 3-consecutive-partials wedge threshold. Fold into
    follow-up DevEx PR.
  • 22K-page resync is 30+ minutes. Sync model unchanged by any PR in the
    current chain. Needs its own design doc (parallel import / bulk COPY /
    incremental).

Both logged in TODOS.md under ## P2.

Follow-up PRs (design approved, separate branches)

  1. feat/agent-migration-devex — consolidated gbrain migrate command
    (chains schema → orchestrator → verify), doctor --locks --kill,
    doctor --migration, --progress-json event schema, 4-part error
    standard across migrate.ts.

  2. feat/migration-exclusive-mode — write-quiesce during DDL. Advisory lock
    on reserved connection (this PR's primitive) + transaction-scoped shared
    locks from the real write chokepoints (BrainEngine write methods +
    minions/queue.ts INSERT). Works with PgBouncer transaction pooling —
    the codex-caught concern that invalidated the initial design.

Test plan

  • bunx tsc --noEmit clean
  • bun test test/migrate.test.ts — 49 pass
  • bun run test:e2e (with DATABASE_URL) — 186 pass across 18 files
  • test/e2e/migrate-chain.test.ts runs against real Postgres + pgvector

🤖 Generated with Claude Code

root and others added 7 commits April 23, 2026 13:36
Addresses all 8 issues from the v0.18.0 production upgrade field report:

1. LATEST_VERSION now uses Math.max() instead of array-last (was wrong
   when MIGRATIONS array is out of order: [.., 23, 22, 21, 20, 15, 16])

2. Pre-flight lock check: runMigrations() queries pg_stat_activity for
   idle-in-transaction connections >5min before attempting DDL, prints
   PIDs and kill advice

3. SET LOCAL statement_timeout = 600s inside migration transactions for
   Supabase compatibility (server-enforced timeout overrides session SET)

4. Catches Postgres error 57014 (statement_timeout) with actionable
   diagnostics instead of raw stack trace

5. Better progress output: prints schema version range, migration names
   before/after, checkmarks on success

6. Migration 21 fix: drops files.page_slug_fkey before swapping the
   pages unique constraint (guarded for PGLite which has no files table)

7. idle_in_transaction_session_timeout = 5min on all Postgres connections
   (both instance-level and module-level) to prevent 24h stale locks

8. apply-migrations CLI warns when schema migrations are pending, since
   it only runs orchestrator migrations (System B) not schema DDL (System A)

All 34 migrate tests pass. Typecheck clean.
…ssion defaults

Adds a ReservedConnection interface and withReservedConnection(fn) method to
BrainEngine. Postgres uses postgres-js sql.reserve() to pin a single backend for
the callback; PGLite passes through its single backing connection. Used
immediately for non-transactional DDL timeout handling (next commit) and
foundation for the future write-quiesce design.

Extracts setSessionDefaults(sql) helper in db.ts, absorbing the duplicated
idle_in_transaction_session_timeout block that was copy-pasted between db.ts and
postgres-engine.ts (Gap 5 / ER-C1). Single write site, both connect paths call
the helper now.

Codex plan-review flagged that advisory-lock designs on postgres.js pools
require a reserved-connection primitive; this is that primitive.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…timeout

Two codex-caught issues that both the initial review and the engineering review
missed:

1. Migration 21 integrity window. Original v21 dropped files_page_slug_fkey and
   persisted config.version=21, leaving files WITHOUT any FK to pages until v23
   ran and added the replacement files.page_id. Process death between v21 and
   v23 left files unconstrained while file_upload / `gbrain files` kept
   accepting writes. Fix: v21 uses sqlFor to split engines (Postgres gets
   additive-only, PGLite gets the full UNIQUE swap since it has no concurrent
   writers). v23's handler now wraps the FK drop + UNIQUE swap + page_id
   addition + backfill + ledger creation in one engine.transaction(). Atomic.

2. Non-transactional DDL timeout gap. runMigrationSQL's else-branch (for
   migrations with transaction:false, like CREATE INDEX CONCURRENTLY) ran the
   DDL on the shared pool with no timeout override. Supabase's 2-min server
   statement_timeout would abort a CONCURRENTLY index on any large table.
   Fix: use engine.withReservedConnection + SET statement_timeout='600000'
   inside the isolated connection.

Also: extracted getIdleBlockers(engine) helper — single source of truth for the
pg_stat_activity query. Shared by the DDL pre-flight warning and the new
`gbrain doctor --locks` CLI (next commit).

57014 diagnostic rewritten to the 4-part "what / why / fix / verify" pattern.
No longer references a non-existent CLI flag.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The v0.18.0 57014 diagnostic referenced `gbrain doctor --locks` but the flag
didn't exist. Users hitting statement_timeout would run the suggested command
and get "unknown option". Implemented now.

On Postgres: queries pg_stat_activity via the new getIdleBlockers() helper,
prints each blocker's PID, state, query_start, truncated query, and the exact
`SELECT pg_terminate_backend(<pid>);` command. Exits 1 on blockers, 0 on clean.

On PGLite: prints "not applicable" (no pool, no idle-in-tx concept) and exits
0. The flag is a safe no-op there.

--json emits structured output: {status, blockers: [...]}.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
test/migrate.test.ts — 10 new regression guards:
- LATEST_VERSION equals max(versions) under any array order. Guards against
  regression to array[-1] (the field report's "told I'm at v16 while 7
  migrations behind" bug).
- getIdleBlockers shape: pglite returns [], postgres returns rows, query
  failure returns [] (not throw).
- 57014 catch path: mocked engine throws err.code='57014', assert the 4-part
  diagnostic hits stderr with what/why/fix/verify markers.
- apply-migrations pre-flight warning structural check.
- setSessionDefaults DRY check: helper defined once in db.ts, postgres-engine
  calls it, neither path inlines the SET.
- runMigrationSQL reserved-connection usage structural check.
- Migration 21 test updates for engine-split sqlFor (codex restructure).
- Migration 23 atomic-transaction assertion.

test/e2e/migrate-chain.test.ts (new): 11 E2E tests against real Postgres:
- Post-chain schema invariants (composite UNIQUE exists, old pages_slug_key
  gone, files_page_slug_fkey gone, files.page_id column present,
  file_migration_ledger table populated).
- doctor --locks real-PG integration (second connection + BEGIN + idle,
  assert the PID appears in pg_stat_activity).
- runMigrationsUpTo advances config.version to target, not past.
- withReservedConnection round-trip (executes queries, session GUC visible
  inside callback).

test/e2e/helpers.ts: new runMigrationsUpTo(engine, targetVersion) and
setConfigVersion(version) helpers. The v15→v23 chain E2E needed a way to stop
at intermediate schema versions; neither `gbrain init --migrate-only` nor the
existing setupDB() supported this. Codex caught that the proposed E2E wasn't
implementable without new harness work.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@garrytan garrytan changed the title from "fix: migration hardening — timeout handling, lock detection, diagnostics" to "v0.18.2: migration hardening — integrity fix + reserved-connection primitive" on Apr 23, 2026
garrytan and others added 2 commits April 23, 2026 09:27
Applied the gstack CHANGELOG style rules from ~/git/gstack/CLAUDE.md:

- Two-line bold headline lands a verdict, not a feature list.
- Single coherent lead story instead of "Second headline... Third headline..."
- "The numbers that matter" table with BEFORE / AFTER / Δ columns, counted
  against the v0.18.0 field report (the concrete source).
- "What this means for your workflow" closing paragraph with the 4-command
  recovery path.
- TODOS.md references removed from user-facing body (explicit rule: never
  mention TODOS, internal tracking, or contributor-facing details in the
  user-read portion).
- Contributor-only detail (helper extraction, test file paths, interface
  specifics) moved to a "For contributors" subsection.
- Itemized changes reorganized as Added / Changed / Fixed / For contributors.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Audit against ~/git/gstack/CLAUDE.md voice rules:

- Headline tightened from 32 words to 19 (rule says 10-14; repo convention
  on v0.18.1 was 22, this is closer).
- Em dashes removed from 7 lines. Replaced with commas, colons, or periods
  per the "no em dashes" rule.
- AI vocabulary audit: clean.
- Banned phrases audit: clean.

Content unchanged. Only voice/punctuation.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@garrytan garrytan merged commit 08b3698 into master Apr 23, 2026
4 checks passed
garrytan added a commit that referenced this pull request Apr 23, 2026
Pulls upstream v0.18.2 (#356): migration hardening + integrity fix +
reserved-connection primitive. New withReservedConnection() method on
BrainEngine interface auto-merged cleanly into pglite-engine.ts and
postgres-engine.ts.

Conflicts resolved:
- VERSION — kept 0.19.0; upstream is 0.18.2
- package.json — v0.19.0 wins
- CHANGELOG.md — v0.19.0 preserved above upstream's v0.18.2

Build clean: 948 modules, ~165ms compile, 0.19.0 binary runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ChenyqThu pushed a commit to ChenyqThu/jarvis-knowledge-os-v2 that referenced this pull request Apr 27, 2026
…k on v0.17

Preflight prep session (zero code / DB / plist writes). Built upstream
`feat/migration-hardening` (= v0.18.2 PR garrytan#356 open) with our pglite 0.4.4
override and ran `apply-migrations --yes` against a copy of the pre-slug-
normalize backup in isolated $HOME. Production DB not touched.

Result: schema stuck at v16 (target v24), `sources` table absent, direct
`init --migrate-only` throws `column "source_id" does not exist`. Root
cause: pglite-engine.ts v0.18.2 SELECTs on pages.source_id during v0.13.0
orchestrator's `extract links --source db` phase, before v21 adds the
column. Fresh installs skip this path; v16→v24 upgrade is untested upstream.
v0.18.2 hardening fixes a different issue set (v21→v23 FK integrity,
57014 timeouts, `doctor --locks`) and does not cover this blocker.

Decision: revert to Path B — Step 2.2 executes on v0.17.0 baseline per
the original `docs/STEP-2-BRAIN-DIR-DESIGN.md §4 Step 2.2` runbook
(unchanged). v0.18 sync deferred with zero cost: `gbrain dream` on
v0.17 is all Step 2.3 needs.

Artifacts preserved for upstream issue repro:
  /tmp/gbrain-upstream-peek/   (v0.18.2 build + pglite 0.4.4)
  /tmp/gbrain-smoke-v018-*/    (285 MB backup copy + .gbrain/)
  /tmp/smoke-env               (smoke-dir path reminder)

Net diff: +145 lines across two docs. Zero source / DB / plist touches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
garrytan added a commit that referenced this pull request May 8, 2026
…force flags

Adds Migration interface fields:
  - idempotent: boolean (default true; explicit false blocks verify-hook
    re-runs on destructive migrations)
  - verify: optional post-condition probe; runs after migration claims success

Migration retry wrapper (Cherry D3 / Finding F2):
  - 3 attempts with 5s/15s/45s backoff (env GBRAIN_MIGRATE_BACKOFF_MS=0
    for tests)
  - Retries only on statement_timeout (57014) or connection-reset patterns
  - Pre-attempt: logs idle-in-transaction blockers via getIdleBlockers
  - On exhaustion: throws MigrationRetryExhausted with named PID + suggested
    pg_terminate_backend() recovery command
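
A sketch of the retry shape, assuming simplified error objects; exactly how the three backoff values map onto attempts is a guess from the text above, and the exhaustion error is simplified:

```typescript
const BACKOFF_MS = [5_000, 15_000, 45_000]; // schedule from the commit text

function isRetryable(err: { code?: string; message?: string }): boolean {
  if (err.code === "57014") return true; // Postgres statement_timeout
  return /connection.*(reset|closed|terminated)/i.test(err.message ?? "");
}

async function withMigrationRetry<T>(
  attempt: () => Promise<T>,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms))
): Promise<T> {
  for (let i = 0; ; i++) {
    try {
      return await attempt();
    } catch (err) {
      const e = err as { code?: string; message?: string };
      // Non-retryable errors propagate immediately; on exhaustion the real
      // code throws MigrationRetryExhausted with PID + kill advice.
      if (!isRetryable(e) || i >= BACKOFF_MS.length) throw err;
      await sleep(BACKOFF_MS[i]);
    }
  }
}
```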

Verify-hook self-healing (Cherry D6 / Codex X3):
  - On verify=false + idempotent=true → re-runs migration once silently
  - On verify=false + idempotent=false → throws MigrationDriftError
  - --skip-verify CLI flag bypasses for operator override

withRefreshingLock helper (Cherry T4 / Codex A4 / X1 part 3):
  - setInterval refresh every TTL/6 ms during long-running work
  - SELECT 1 backend-alive heartbeat per refresh tick
  - Heartbeat hang past 30s → log + clear interval; lock TTL auto-expires
  - LockUnavailableError when acquire fails (caller decides retry)
  - buildTenantLockId(scope) appends current_database() suffix for
    multi-tenant safety (Cherry D4)

Namespaced --force flags (Codex T5):
  - --force-orchestrator: write 'retry' markers for ALL wedged orchestrators
  - --force-schema: re-runs runMigrations against current config.version
  - --force / --force-all: both
  - --force-retry vX.Y.Z: existing single-version reset (preserved)
  - --skip-verify: bypass verify-hook drift detection on a single run

Test additions:
  - test/migrate-extensions.test.ts: 14 cases (idempotent default,
    error envelopes, MIGRATIONS contract)
  - test/db-lock-refresh.test.ts: 10 cases (LockUnavailableError,
    buildTenantLockId multi-tenant, opts shape)
  - test/migrate.test.ts: updated 2 existing cases (PR #356 retry shape +
    function-name anchor) for v0.30.1 retry-wrapper semantics

156 unit tests passing across the v0.30.1 surface so far.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
garrytan added a commit that referenced this pull request May 8, 2026
…base (#750)

* v0.30.1 Lane A: connection-manager foundation + X1 initSchema routing

Routes Postgres queries by query type:
  - read() goes to the Supabase pooler (port 6543, fast)
  - ddl() and bulk() go to direct (port 5432, 30min stmt timeout, mwm 256MB)

Auto-detects Supabase via hostname pooler.supabase.com or port 6543.
Override with GBRAIN_DIRECT_DATABASE_URL. Kill-switch via
GBRAIN_DISABLE_DIRECT_POOL=1 falls back to single-pool legacy path.
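
The Supabase auto-detection rule reduces to a small URL predicate. The function name is illustrative, and the real connection-manager may check more than this:

```typescript
// Flags a URL as the Supabase transaction pooler when either the documented
// hostname suffix or the pooler port matches.
function isSupabasePoolerUrl(databaseUrl: string): boolean {
  const url = new URL(databaseUrl);
  return url.hostname.endsWith("pooler.supabase.com") || url.port === "6543";
}
```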

Foundation modules (Lane A scope):
- src/core/connection-manager.ts: read/ddl/bulk/healthCheck, parent-CM
  inheritance (T5/X1), cached Promise<Sql> lazy init (A1), kill-switch
  inheritance (A2), Supabase URL auto-derivation
- src/core/url-redact.ts: redactPgUrl + redactDeep (F3)
- src/core/retry-matcher.ts: typed predicates for stmt-timeout / lock /
  conn errors (C4)
- src/core/connection-audit.ts: ~/.gbrain/audit/connection-events JSONL
  with ISO-week rotation; doctor tail-reads last 5 errors (F8)
- scripts/check-pg-url-redaction.sh: CI grep guard against unredacted
  postgresql:// URL leaks (F3)

Engine integration:
- PostgresEngine.connect: instantiates instance-owned ConnectionManager,
  inherits from parentConnectionManager when set (worker engines, sync,
  cycle), shares pool with module-singleton path
- PostgresEngine.disconnect: tears down direct pool first
- PostgresEngine.initSchema: routes DDL through connectionManager.ddl()
  when dual-pool active (X1 part 1; lock semantics replacement is Lane B)
- cli.ts:connectEngine(opts): probeOnly skips initSchema entirely (X1
  part 2 — get_health, upgrade --status will use this)

Tests added (51 new cases):
- test/url-redact.test.ts: 11 cases
- test/retry-matcher.test.ts: 13 cases
- test/connection-manager.test.ts: 27 cases (URL detection, derive,
  kill-switch, parent inheritance, dual-pool routing modes)

Foundation for Lanes B-E. Sequential lane work continues.

Plan: ~/.claude/plans/system-instruction-you-are-working-stateless-wadler.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.30.1 Lane B: migration runner retry + verify hooks + namespaced --force flags

Adds Migration interface fields:
  - idempotent: boolean (default true; explicit false blocks verify-hook
    re-runs on destructive migrations)
  - verify: optional post-condition probe; runs after migration claims success

Migration retry wrapper (Cherry D3 / Finding F2):
  - 3 attempts with 5s/15s/45s backoff (env GBRAIN_MIGRATE_BACKOFF_MS=0
    for tests)
  - Retries only on statement_timeout (57014) or connection-reset patterns
  - Pre-attempt: logs idle-in-transaction blockers via getIdleBlockers
  - On exhaustion: throws MigrationRetryExhausted with named PID + suggested
    pg_terminate_backend() recovery command

Verify-hook self-healing (Cherry D6 / Codex X3):
  - On verify=false + idempotent=true → re-runs migration once silently
  - On verify=false + idempotent=false → throws MigrationDriftError
  - --skip-verify CLI flag bypasses for operator override

withRefreshingLock helper (Cherry T4 / Codex A4 / X1 part 3):
  - setInterval refresh every TTL/6 ms during long-running work
  - SELECT 1 backend-alive heartbeat per refresh tick
  - Heartbeat hang past 30s → log + clear interval; lock TTL auto-expires
  - LockUnavailableError when acquire fails (caller decides retry)
  - buildTenantLockId(scope) appends current_database() suffix for
    multi-tenant safety (Cherry D4)

Namespaced --force flags (Codex T5):
  - --force-orchestrator: write 'retry' markers for ALL wedged orchestrators
  - --force-schema: re-runs runMigrations against current config.version
  - --force / --force-all: both
  - --force-retry vX.Y.Z: existing single-version reset (preserved)
  - --skip-verify: bypass verify-hook drift detection on a single run

Test additions:
  - test/migrate-extensions.test.ts: 14 cases (idempotent default,
    error envelopes, MIGRATIONS contract)
  - test/db-lock-refresh.test.ts: 10 cases (LockUnavailableError,
    buildTenantLockId multi-tenant, opts shape)
  - test/migrate.test.ts: updated 2 existing cases (PR #356 retry shape +
    function-name anchor) for v0.30.1 retry-wrapper semantics

156 unit tests passing across the v0.30.1 surface so far.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.30.1 Lane C: backfill primitive + registry + X4 + X5

First-class generic backfill runner (Fix 3). Generalizes the
keyset+checkpoint+adaptive-batch pattern from
src/core/backfill-effective-date.ts so future backfills (embedding_voyage
in v0.30.2, etc.) reuse one tested runner.

NEW src/core/backfill-base.ts:
  - runBackfill() with keyset pagination, config-table checkpoint, adaptive
    batch halving on stmt timeout, conn-drop reconnect, max-errors bail
  - ensureBackfillIndex() verifies/creates partial index CONCURRENTLY (P2/X4)
  - clearBackfillCheckpoint() for --fresh path
  - T3 fix: writes go through engine.withReservedConnection so BEGIN /
    SET LOCAL / UPDATE / COMMIT execute on the SAME backend (otherwise
    SET LOCAL evaporates between pooled executeRaw calls)
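
The keyset + checkpoint skeleton behind runBackfill can be sketched as follows; the row shape, batch callbacks, and checkpoint store are all stand-ins (adaptive halving and reconnects are omitted):

```typescript
interface Row { id: number }

// Keyset pagination: each batch resumes strictly after the last seen id,
// and the checkpoint is persisted after every batch so a killed process
// can resume instead of rescanning.
async function keysetBackfill(
  fetchBatch: (afterId: number, limit: number) => Promise<Row[]>,
  processBatch: (rows: Row[]) => Promise<void>,
  saveCheckpoint: (lastId: number) => Promise<void>,
  batchSize = 500,
  startAfter = 0
): Promise<number> {
  let cursor = startAfter;
  let examined = 0;
  for (;;) {
    const rows = await fetchBatch(cursor, batchSize);
    if (rows.length === 0) return examined;
    await processBatch(rows);
    cursor = rows[rows.length - 1].id;
    examined += rows.length;
    await saveCheckpoint(cursor);
  }
}
```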

NEW src/core/backfill-registry.ts:
  - effective_date: implemented (wraps existing computeEffectiveDate)
  - emotional_weight: implemented (wraps computeEmotionalWeight + stamps
    new emotional_weight_recomputed_at column)
  - embedding_voyage: declared-only in v0.30.1 (multi-column embedding
    schema lands in v0.30.2)

NEW src/commands/backfill.ts:
  - gbrain backfill <kind> [--batch-size N] [--concurrency N] [--resume]
                          [--fresh] [--dry-run] [--keep-index] [--max-errors N]
  - gbrain backfill list — shows registered backfills + status
  - X5 admission control: clampConcurrency() forces --concurrency to
    GBRAIN_DIRECT_POOL_SIZE - 1 ceiling (always reserves 1 conn for HNSW
    + heartbeat + doctor probes). Loud-warns when user requests above.
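
The X5 admission-control clamp is worth spelling out; the signature below is assumed, not the real clampConcurrency:

```typescript
// Forces requested concurrency under the direct-pool budget, always
// reserving one connection (HNSW build, heartbeat, doctor probes).
function clampConcurrency(
  requested: number,
  directPoolSize: number,
  warn: (msg: string) => void = console.warn
): number {
  const ceiling = Math.max(1, directPoolSize - 1);
  if (requested > ceiling) {
    warn(`--concurrency ${requested} exceeds pool budget; clamped to ${ceiling}`);
    return ceiling;
  }
  return requested;
}
```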

Schema migration v44 (X4 / Codex C8 fix):
  - pages.emotional_weight_recomputed_at TIMESTAMPTZ
  - emotional_weight = 0 is a VALID steady-state value per migration v40,
    so the original P2 predicate ("WHERE emotional_weight = 0") would have
    been a permanent large index over normal data. The corrected backlog
    predicate is "emotional_weight_recomputed_at IS NULL"; the partial
    index drops naturally as the cycle phase + this backfill stamp the
    column over time.
  - idempotent: true (ADD COLUMN ... NULL is metadata-only)

CLI integration:
  - src/cli.ts: registers `backfill` subcommand
  - reindex-frontmatter stays as thin alias for v0.30.1 back-compat;
    canonical entrypoint is now `gbrain backfill effective_date`

Test additions:
  - test/backfill-base.test.ts: 11 cases (keyset, checkpoint, dry-run,
    resume/fresh, maxRows cap, withReservedConnection routing, error
    paths, clearCheckpoint, ensureBackfillIndex)
  - test/backfill-concurrency-clamp.test.ts: 6 cases (X5 admission control)

173 unit tests passing across Lanes A+B+C of v0.30.1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.30.1 Lane D: HNSW lifecycle manager + A3 atomic-swap

Extends src/core/vector-index.ts with the v0.30.1 lifecycle layer.
The original chunkEmbeddingIndexSql / applyChunkEmbeddingIndexPolicy
contract is preserved unchanged.

New surfaces:
  - checkActiveBuild(engine, indexName): probes pg_stat_activity for an
    active CREATE INDEX or REINDEX on the named index. Used as pre-op
    guard so dropAndRebuild doesn't compete with a build already in
    flight (Supabase auto-maintenance, parallel gbrain procs).

  - dropZombieIndexes(engine, tableNames): startup sweep of
    indisvalid=false rows on gbrain tables. Drops them with
    DROP INDEX IF EXISTS, BUT skips any zombie that has an active build
    still in pg_stat_activity (codex Fix-5 in-progress-build guard).
    Wired into PostgresEngine.initSchema() — runs after migrations +
    verifySchema, best-effort, never blocks engine.connect().

  - dropAndRebuild(engine, spec, opts): A3 atomic-swap pattern:
      1. checkActiveBuild → bail if another build is active (--force overrides)
      2. CREATE INDEX CONCURRENTLY <name>_rebuild_<unix-ms> via
         engine.withReservedConnection (CONCURRENTLY can't run in a txn)
      3. Atomic swap inside engine.transaction:
           DROP INDEX <old-name>
           ALTER INDEX <temp-name> RENAME TO <old-name>
      4. If step 2 fails (OOM, timeout, conn drop), the OLD index stays
         intact and search keeps serving queries. This is the headline
         A3 win — no production-degraded silent failure mode.

  - monitorBuild(engine, indexName, onProgress, opts): poll
    pg_stat_activity every 30s; emit elapsed_ms + size_bytes (via
    pg_relation_size) + pid. Used by gbrain backfill embedding_voyage
    when batch > 1000 triggers a rebuild.

  - isSupabaseAutoMaintenance(active): predicate on application_name
    (matches "supabase" / "postgres-meta"). Used by dropAndRebuild to
    log + back off when Supabase auto-maintenance is doing the rebuild.
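
The A3 swap ordering can be sketched with a stand-in engine: step 2 runs outside any transaction, step 3 inside one, and the temp-name format follows the description above. The engine API here is illustrative:

```typescript
// Stand-in for the engine surface used by the rebuild path.
interface SwapEngine {
  run(sql: string): Promise<void>; // reserved-connection DDL (no txn)
  transaction(fn: (run: (sql: string) => Promise<void>) => Promise<void>): Promise<void>;
}

function rebuildTempName(indexName: string, nowMs: number): string {
  return `${indexName}_rebuild_${nowMs}`;
}

async function dropAndRebuild(
  engine: SwapEngine,
  indexName: string,
  createSql: (name: string) => string
): Promise<void> {
  const temp = rebuildTempName(indexName, Date.now());
  // CONCURRENTLY build outside any transaction. If this fails (OOM, timeout,
  // conn drop), the old index is untouched and search keeps serving.
  await engine.run(createSql(temp));
  // Atomic swap: drop old + rename temp inside one transaction.
  await engine.transaction(async (run) => {
    await run(`DROP INDEX ${indexName}`);
    await run(`ALTER INDEX ${temp} RENAME TO ${indexName}`);
  });
}
```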

Engine integration:
  - PostgresEngine.initSchema() calls dropZombieIndexes after verifySchema.
    Surfaces zombie counts via console.log.
  - Best-effort wrapped in try/catch: pg_stat_activity / pg_index access
    can be restricted on managed Postgres tiers; gbrain shouldn't fail
    engine.connect() over diagnostic queries.

Test additions (18 cases):
  - test/vector-index-lifecycle.test.ts:
    * chunkEmbeddingIndexSql contract (3 cases) — pre-existing behavior preserved
    * applyChunkEmbeddingIndexPolicy contract (1 case)
    * checkActiveBuild (4 cases, including PGLite no-op + best-effort failure)
    * isSupabaseAutoMaintenance (3 cases)
    * dropZombieIndexes (4 cases, including in-progress-build guard)
    * dropAndRebuild atomic-swap (3 cases, including PGLite + active-build bail
      + temp-name format assertion)

191 unit tests passing across Lanes A+B+C+D of v0.30.1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.30.1 Lane E: upgrade pipeline checkpoint + brain_id binding + get_health migrations

NEW src/core/upgrade-checkpoint.ts:
  - Cherry D5: persists step-by-step progress through gbrain post-upgrade
    so partial failures can be resumed via gbrain upgrade --resume.
    Steps: pull → install → schema → features → backfills → verify.
  - Codex X2: checkpoint binds to brain identity via sha256(database_url)
    (userinfo stripped before hashing so cred rotations don't invalidate).
    PGLite uses sha256(database_path). Cross-brain checkpoint application
    is now refused with reason='brain_mismatch'.
  - F4 fall-through: validateCheckpoint returns reason='no_checkpoint'
    when none exists, enabling silent fall-through to a full upgrade.
  - All-complete detection: stale checkpoints (every step done) return
    reason='all_complete' so the next run clears + re-runs from scratch.
  - markStepComplete + markStepFailed maintain the partial-state shape.
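
The X2 identity binding reduces to hashing the URL with userinfo stripped; computeBrainId's real signature and normalization may differ from this sketch:

```typescript
import { createHash } from "node:crypto";

// brain_id = sha256 of the database URL with credentials removed, so
// rotating a password does not invalidate an in-flight checkpoint.
function computeBrainId(databaseUrl: string): string {
  const url = new URL(databaseUrl);
  url.username = "";
  url.password = "";
  return createHash("sha256").update(url.toString()).digest("hex");
}
```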

T2 preserved: upgrade.ts still re-execs `gbrain post-upgrade` so the NEW
binary's migration registry runs (the existing re-exec pattern is correct
per codex round 1's plan-breaking finding). The checkpoint module is the
substrate that Lane E's --resume / --status surfaces will plumb through
in v0.30.2.

D7 + C3 contract committed:
  - BrainHealth.schema_version: '1' (literal type) — additive-only contract
    pinned for MCP get_health consumers.
  - BrainHealth.migrations: { schema, orchestrator } — explicit two-ledger
    diagnostic surface (codex T5 namespacing). Both fields are OPTIONAL
    in v0.30.1 — engines can populate them in v0.30.2 without a contract
    bump. Backwards/forwards compat: clients default-handle missing fields.

VERSION: 0.30.0 → 0.30.1
package.json: synced

Test additions (18 cases):
  - test/upgrade-checkpoint.test.ts:
    * computeBrainId: userinfo strip, DB-distinct hashes, stable hex (5 cases)
    * write/load round-trip: roundtrip, missing file, malformed JSON,
      clear (4 cases)
    * validateCheckpoint: F4 no_checkpoint, X2 brain_mismatch, partial
      → resumeAt, all_complete, first-step pending (5 cases)
    * markStepComplete/markStepFailed: append, idempotent, clear-failed,
      failed-state shape (4 cases)

209 unit tests passing across all 5 lanes of v0.30.1 (Lanes A-E core
foundations). Plumbing into upgrade.ts CLI + doctor checks +
get_health() implementation is layered in via follow-up commits within
this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.30.1 e2e + test isolation: integration smoke + serial quarantine

NEW test/e2e/v030_1-integration-pglite.test.ts (14 cases):
  PGLite integration smoke proving Lane A-E surfaces work together.
    Lane B: migration runner applies v44 (emotional_weight_recomputed_at)
            cleanly; config.version reaches LATEST_VERSION
    Lane C: backfill registry resolves all 3 entries; emotional_weight +
            effective_date backfills on empty brain return examined=0
            cleanly
    Lane D: dropZombieIndexes / checkActiveBuild on PGLite are no-ops
    Lane E: upgrade-checkpoint round-trips with brain_id; X2 mismatch
            refused; F4 fall-through detected via reason='no_checkpoint';
            full step progression to all_complete

Test isolation hygiene (scripts/check-test-isolation.sh):
  - test/connection-manager.test.ts → connection-manager.serial.test.ts
  - test/backfill-concurrency-clamp.test.ts → .serial.test.ts
  - test/upgrade-checkpoint.test.ts → .serial.test.ts
  All three files mutate process.env (kill-switch, GBRAIN_DIRECT_POOL_SIZE,
  GBRAIN_HOME) which would race other tests in the parallel runner.
  *.serial.test.ts quarantine ensures they run at --max-concurrency=1.
  Choice between withEnv() refactor and serial quarantine made on the side
  of preserving existing well-formed test code.

E2E coverage status:
  - v030_1-integration-pglite.test.ts (this commit): 14 cases, all green
  - backfill-perf-pglite.test.ts: 1 case, green (no regression)
  - cycle-recompute-emotional-weight-pglite.test.ts: green (no regression)
  - multi-source-emotional-weight-pglite.test.ts: green (no regression)
  - dream-synthesize-pglite.test.ts: 14 cases, green (no regression)
  - anomalies-pglite.test.ts + salience-pglite.test.ts: 6 cases, green

Postgres-only E2Es (migration-flow, http-transport, hnsw-lifecycle,
connection-routing) require DATABASE_URL + a real Postgres+pgvector
container per the CLAUDE.md E2E lifecycle. They land as separate
DATABASE_URL-gated work — not regressed by v0.30.1 changes; their
preconditions just aren't met in the current run environment.

`bun run verify` (typecheck + 4 shell pre-checks + test-isolation lint)
passes cleanly.

Final v0.30.1 unit + integration test count: 4547 pass, 0 regressions.
Two pre-existing flaky failures (BrainRegistry serial test + warm-create
perf gate under shard contention) confirmed unrelated to this branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.30.1)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>