
v0.18.2: migration hardening — integrity fix + reserved-connection primitive#356

Merged
garrytan merged 9 commits into master from feat/migration-hardening
Apr 23, 2026

Conversation


@garrytan garrytan commented Apr 23, 2026

Summary

Closes the v0.18.0 production upgrade field report. Ships 8 original hardening
fixes plus 3 codex-caught issues that both the initial review and the eng review
missed. New public primitive (BrainEngine.withReservedConnection) and new CLI
flag (gbrain doctor --locks).

Original 8 fixes (from commit eab8cc2):

  • LATEST_VERSION via Math.max() (the old array-last lookup silently
    reported v16 to every Postgres user while 7 migrations were pending)
  • Pre-flight pg_stat_activity blocker check before DDL
  • SET LOCAL statement_timeout='600000' inside transactional migrations
  • Error 57014 actionable diagnostics
  • Migration 21 files_page_slug_fkey handling
  • idle_in_transaction_session_timeout=5min on all connections
  • apply-migrations warns when schema version is behind
  • Per-migration progress output
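
The Math.max() fix is small enough to sketch. The registry shape and entries below are illustrative, not the project's actual migration list:

```typescript
// Hypothetical Migration shape; only the version field matters here.
interface Migration {
  version: number;
  name: string;
}

// Deliberately out of order, mirroring the field-report scenario.
const MIGRATIONS: Migration[] = [
  { version: 23, name: "files-page-id" },
  { version: 22, name: "ledger" },
  { version: 21, name: "unique-swap" },
  { version: 16, name: "chunks" },
  { version: 15, name: "sources" },
];

// Fragile: reports v15 because the array's last element is not the newest.
const lastElementVersion = MIGRATIONS[MIGRATIONS.length - 1].version;

// Robust: correct regardless of array order.
const LATEST_VERSION = Math.max(...MIGRATIONS.map((m) => m.version));
```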

3 codex-caught additions:

  1. Migration 21 integrity window. Original v21 dropped
    files_page_slug_fkey and persisted config.version=21; v23 added the
    replacement files.page_id. Process death between v21 and v23 left
    files with no referential integrity while file_upload kept writing.
    Fix: v21 is engine-split (Postgres additive-only, PGLite full swap); v23's
    handler wraps FK drop + UNIQUE swap + files.source_id/page_id + backfill +
    ledger in one engine.transaction(). Atomic rollback on any failure.

  2. Non-transactional DDL timeout gap. runMigrationSQL's else-branch
    (transaction:false, i.e. CREATE INDEX CONCURRENTLY) ran on the shared
    pool with no timeout override, exposed to Supabase's 2-min server ceiling.
    Fix: use new engine.withReservedConnection + session-level
    SET statement_timeout='600000'. Isolated connection means no leak to the
    shared pool.

  3. gbrain doctor --locks actually exists now. The v0.18.0 57014
    diagnostic referenced this flag but it wasn't implemented. Users who hit a
    timeout would run the suggested command and get "unknown option".
    Implemented: queries pg_stat_activity, prints PIDs + kill commands, exits
    1 on blockers. Works on Postgres; no-op on PGLite.
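
The exit-code and kill-command behavior of --locks reduces to a pure function. The IdleBlocker shape and reportLocks name here are hypothetical stand-ins for the real CLI code:

```typescript
// Hypothetical row shape; the real getIdleBlockers() output may carry more fields.
interface IdleBlocker {
  pid: number;
  state: string;
  query: string;
}

// Mirrors the documented contract: exit 1 when blockers exist, 0 when clean,
// printing the exact pg_terminate_backend command per PID.
function reportLocks(blockers: IdleBlocker[]): { exitCode: number; lines: string[] } {
  if (blockers.length === 0) {
    return { exitCode: 0, lines: ["no idle-in-transaction blockers"] };
  }
  const lines = blockers.map(
    (b) => `pid=${b.pid} state=${b.state} :: SELECT pg_terminate_backend(${b.pid});`
  );
  return { exitCode: 1, lines };
}
```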

New primitive: BrainEngine.withReservedConnection(fn) on both engines.
Postgres uses sql.reserve() (postgres-js 3.4+). PGLite pass-through. Used
right away for the non-transactional DDL path; foundation for the future
feat/migration-exclusive-mode write-quiesce PR.
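
A minimal sketch of the contract, assuming a simplified Sql type: the Postgres side would wrap postgres-js sql.reserve()/release, while the PGLite side is the pass-through shown here. Class and type names are illustrative:

```typescript
// Stand-in for the real query handle type.
type Sql = { unsafe(query: string): Promise<unknown> };

interface BrainEngine {
  // Runs fn against a connection pinned to one backend, so session-level
  // SET statements cannot leak into the shared pool.
  withReservedConnection<T>(fn: (sql: Sql) => Promise<T>): Promise<T>;
}

// PGLite-style pass-through: a single backing connection is already "reserved".
class SingleConnectionEngine implements BrainEngine {
  constructor(private sql: Sql) {}
  async withReservedConnection<T>(fn: (sql: Sql) => Promise<T>): Promise<T> {
    return fn(this.sql);
  }
}
```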

DRY: setSessionDefaults(sql) helper absorbs the duplicated
idle_in_transaction_session_timeout block from db.ts and
postgres-engine.ts. getIdleBlockers(engine) helper shared between
DDL pre-flight, doctor --locks, and future drain-wait logic.
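
The failure contract of getIdleBlockers described in the tests (PGLite returns [], query failure returns [] rather than throwing) can be sketched like this. The engine API and query text are simplified assumptions:

```typescript
// Stand-in engine surface; the real BrainEngine API differs.
interface QueryEngine {
  kind: "postgres" | "pglite";
  executeRaw(query: string): Promise<Array<Record<string, unknown>>>;
}

// Returns idle-in-transaction blockers older than 5 minutes; empty array on
// PGLite (no pool) and on query failure, so callers never need try/catch.
async function getIdleBlockers(engine: QueryEngine): Promise<Array<Record<string, unknown>>> {
  if (engine.kind === "pglite") return [];
  try {
    return await engine.executeRaw(
      `SELECT pid, state, query_start, query
         FROM pg_stat_activity
        WHERE state = 'idle in transaction'
          AND now() - query_start > interval '5 minutes'`
    );
  } catch {
    return []; // diagnostics must never break the migration path
  }
}
```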

Test Coverage

  • test/migrate.test.ts: 10 new regression guards (Math.max robustness,
    getIdleBlockers shape, 57014 catch path, pre-flight warning structural
    check, setSessionDefaults DRY check, runMigrationSQL reserved-connection
    usage, plus updated v21/v23 assertions for the restructure).
  • test/e2e/migrate-chain.test.ts (new): 11 E2E tests against real Postgres
    (post-chain schema invariants, doctor --locks real-connection detection,
    runMigrationsUpTo advancement, withReservedConnection round-trip).
  • test/e2e/helpers.ts: new runMigrationsUpTo(engine, targetVersion) and
    setConfigVersion(version) helpers enable mid-chain migration tests —
    codex caught that the proposed v15→v23 E2E wasn't implementable without
    new harness work.

Unit: 49 pass in test/migrate.test.ts.
E2E: 186 pass across 18 files (via bun run test:e2e with real Postgres).

Review history

  • CEO review: scope + strategy. Identified 3 blocking gaps.
  • Eng review: architecture + tests. 4 architectural + 3 code-quality + 4 test
    findings. User decided advisory-lock + chokepoint + full-chain E2E via
    AskUserQuestion.
  • Codex plan review: 7 findings — 3 revised prior decisions (put_page not
    the real chokepoint; session-advisory-lock conflicts with PgBouncer; v15
    test harness doesn't exist), 3 new bugs both reviews missed (migration 21
    integrity hole, ledger double-write, 22K resync unsolved), 1 pool-size=1
    gap. Changed this PR from 5 work items to 7.

Cross-model agreement: all 8 original fixes verified correct. Disagreement
on 3 architectural decisions. Codex's evidence was file:line grounded and
accepted after review.

Deferred to TODOs (codex findings, not blocking)

  • Orchestrator ledger double-write (v0_18_0.ts:200-208 + apply-migrations.ts:374-386
    both append). Distorts 3-consecutive-partials wedge threshold. Fold into
    follow-up DevEx PR.
  • 22K-page resync is 30+ minutes. Sync model unchanged by any PR in the
    current chain. Needs its own design doc (parallel import / bulk COPY /
    incremental).

Both logged in TODOS.md under ## P2.

Follow-up PRs (design approved, separate branches)

  1. feat/agent-migration-devex — consolidated gbrain migrate command
    (chains schema → orchestrator → verify), doctor --locks --kill,
    doctor --migration, --progress-json event schema, 4-part error
    standard across migrate.ts.

  2. feat/migration-exclusive-mode — write-quiesce during DDL. Advisory lock
    on reserved connection (this PR's primitive) + transaction-scoped shared
    locks from the real write chokepoints (BrainEngine write methods +
    minions/queue.ts INSERT). Works with PgBouncer transaction pooling —
    the codex-caught concern that invalidated the initial design.

Test plan

  • bunx tsc --noEmit clean
  • bun test test/migrate.test.ts — 49 pass
  • bun run test:e2e (with DATABASE_URL) — 186 pass across 18 files
  • test/e2e/migrate-chain.test.ts runs against real Postgres + pgvector

🤖 Generated with Claude Code

root and others added 7 commits April 23, 2026 13:36
Addresses all 8 issues from the v0.18.0 production upgrade field report:

1. LATEST_VERSION now uses Math.max() instead of array-last (was wrong
   when MIGRATIONS array is out of order: [.., 23, 22, 21, 20, 15, 16])

2. Pre-flight lock check: runMigrations() queries pg_stat_activity for
   idle-in-transaction connections >5min before attempting DDL, prints
   PIDs and kill advice

3. SET LOCAL statement_timeout = 600s inside migration transactions for
   Supabase compatibility (server-enforced timeout overrides session SET)

4. Catches Postgres error 57014 (statement_timeout) with actionable
   diagnostics instead of raw stack trace

5. Better progress output: prints schema version range, migration names
   before/after, checkmarks on success

6. Migration 21 fix: drops files.page_slug_fkey before swapping the
   pages unique constraint (guarded for PGLite which has no files table)

7. idle_in_transaction_session_timeout = 5min on all Postgres connections
   (both instance-level and module-level) to prevent 24h stale locks

8. apply-migrations CLI warns when schema migrations are pending, since
   it only runs orchestrator migrations (System B) not schema DDL (System A)

All 34 migrate tests pass. Typecheck clean.
…ssion defaults

Adds a ReservedConnection interface and withReservedConnection(fn) method to
BrainEngine. Postgres uses postgres-js sql.reserve() to pin a single backend for
the callback; PGLite passes through its single backing connection. Used
immediately for non-transactional DDL timeout handling (next commit) and
foundation for the future write-quiesce design.

Extracts setSessionDefaults(sql) helper in db.ts, absorbing the duplicated
idle_in_transaction_session_timeout block that was copy-pasted between db.ts and
postgres-engine.ts (Gap 5 / ER-C1). Single write site, both connect paths call
the helper now.

Codex plan-review flagged that advisory-lock designs on postgres.js pools
require a reserved-connection primitive; this is that primitive.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…timeout

Two codex-caught issues that both the initial review and the engineering review
missed:

1. Migration 21 integrity window. Original v21 dropped files_page_slug_fkey and
   persisted config.version=21, leaving files WITHOUT any FK to pages until v23
   ran and added the replacement files.page_id. Process death between v21 and
   v23 left files unconstrained while file_upload / `gbrain files` kept
   accepting writes. Fix: v21 uses sqlFor to split engines (Postgres gets
   additive-only, PGLite gets the full UNIQUE swap since it has no concurrent
   writers). v23's handler now wraps the FK drop + UNIQUE swap + page_id
   addition + backfill + ledger creation in one engine.transaction(). Atomic.

2. Non-transactional DDL timeout gap. runMigrationSQL's else-branch (for
   migrations with transaction:false, like CREATE INDEX CONCURRENTLY) ran the
   DDL on the shared pool with no timeout override. Supabase's 2-min server
   statement_timeout would abort a CONCURRENTLY index on any large table.
   Fix: use engine.withReservedConnection + SET statement_timeout='600000'
   inside the isolated connection.

Also: extracted getIdleBlockers(engine) helper — single source of truth for the
pg_stat_activity query. Shared by the DDL pre-flight warning and the new
`gbrain doctor --locks` CLI (next commit).

57014 diagnostic rewritten to the 4-part "what / why / fix / verify" pattern.
No longer references a non-existent CLI flag.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The v0.18.0 57014 diagnostic referenced `gbrain doctor --locks` but the flag
didn't exist. Users hitting statement_timeout would run the suggested command
and get "unknown option". Implemented now.

On Postgres: queries pg_stat_activity via the new getIdleBlockers() helper,
prints each blocker's PID, state, query_start, truncated query, and the exact
`SELECT pg_terminate_backend(<pid>);` command. Exits 1 on blockers, 0 on clean.

On PGLite: prints "not applicable" (no pool, no idle-in-tx concept) and exits
0. The flag is a safe no-op there.

--json emits structured output: {status, blockers: [...]}.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
test/migrate.test.ts — 10 new regression guards:
- LATEST_VERSION equals max(versions) under any array order. Guards against
  regression to array[-1] (the field report's "told I'm at v16 while 7
  migrations behind" bug).
- getIdleBlockers shape: pglite returns [], postgres returns rows, query
  failure returns [] (not throw).
- 57014 catch path: mocked engine throws err.code='57014', assert the 4-part
  diagnostic hits stderr with what/why/fix/verify markers.
- apply-migrations pre-flight warning structural check.
- setSessionDefaults DRY check: helper defined once in db.ts, postgres-engine
  calls it, neither path inlines the SET.
- runMigrationSQL reserved-connection usage structural check.
- Migration 21 test updates for engine-split sqlFor (codex restructure).
- Migration 23 atomic-transaction assertion.

test/e2e/migrate-chain.test.ts (new): 11 E2E tests against real Postgres:
- Post-chain schema invariants (composite UNIQUE exists, old pages_slug_key
  gone, files_page_slug_fkey gone, files.page_id column present,
  file_migration_ledger table populated).
- doctor --locks real-PG integration (second connection + BEGIN + idle,
  assert the PID appears in pg_stat_activity).
- runMigrationsUpTo advances config.version to target, not past.
- withReservedConnection round-trip (executes queries, session GUC visible
  inside callback).

test/e2e/helpers.ts: new runMigrationsUpTo(engine, targetVersion) and
setConfigVersion(version) helpers. The v15→v23 chain E2E needed a way to stop
at intermediate schema versions; neither `gbrain init --migrate-only` nor the
existing setupDB() supported this. Codex caught that the proposed E2E wasn't
implementable without new harness work.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@garrytan garrytan changed the title from "fix: migration hardening — timeout handling, lock detection, diagnostics" to "v0.18.2: migration hardening — integrity fix + reserved-connection primitive" on Apr 23, 2026
garrytan and others added 2 commits April 23, 2026 09:27
Applied the gstack CHANGELOG style rules from ~/git/gstack/CLAUDE.md:

- Two-line bold headline lands a verdict, not a feature list.
- Single coherent lead story instead of "Second headline... Third headline..."
- "The numbers that matter" table with BEFORE / AFTER / Δ columns, counted
  against the v0.18.0 field report (the concrete source).
- "What this means for your workflow" closing paragraph with the 4-command
  recovery path.
- TODOS.md references removed from user-facing body (explicit rule: never
  mention TODOS, internal tracking, or contributor-facing details in the
  user-read portion).
- Contributor-only detail (helper extraction, test file paths, interface
  specifics) moved to a "For contributors" subsection.
- Itemized changes reorganized as Added / Changed / Fixed / For contributors.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Audit against ~/git/gstack/CLAUDE.md voice rules:

- Headline tightened from 32 words to 19 (rule says 10-14; repo convention
  on v0.18.1 was 22, this is closer).
- Em dashes removed from 7 lines. Replaced with commas, colons, or periods
  per the "no em dashes" rule.
- AI vocabulary audit: clean.
- Banned phrases audit: clean.

Content unchanged. Only voice/punctuation.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@garrytan garrytan merged commit 08b3698 into master Apr 23, 2026
4 checks passed
garrytan added a commit that referenced this pull request Apr 23, 2026
Pulls upstream v0.18.2 (#356): migration hardening + integrity fix +
reserved-connection primitive. New withReservedConnection() method on
BrainEngine interface auto-merged cleanly into pglite-engine.ts and
postgres-engine.ts.

Conflicts resolved:
- VERSION — kept 0.19.0; upstream is 0.18.2
- package.json — v0.19.0 wins
- CHANGELOG.md — v0.19.0 preserved above upstream's v0.18.2

Build clean: 948 modules, ~165ms compile, 0.19.0 binary runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ChenyqThu pushed a commit to ChenyqThu/jarvis-knowledge-os-v2 that referenced this pull request Apr 27, 2026
…k on v0.17

Preflight prep session (zero code / DB / plist writes). Built upstream
`feat/migration-hardening` (= v0.18.2 PR garrytan#356 open) with our pglite 0.4.4
override and ran `apply-migrations --yes` against a copy of the pre-slug-
normalize backup in isolated $HOME. Production DB not touched.

Result: schema stuck at v16 (target v24), `sources` table absent, direct
`init --migrate-only` throws `column "source_id" does not exist`. Root
cause: pglite-engine.ts v0.18.2 SELECTs on pages.source_id during v0.13.0
orchestrator's `extract links --source db` phase, before v21 adds the
column. Fresh installs skip this path; v16→v24 upgrade is untested upstream.
v0.18.2 hardening fixes a different issue set (v21→v23 FK integrity,
57014 timeouts, `doctor --locks`) and does not cover this blocker.

Decision: revert to Path B — Step 2.2 executes on v0.17.0 baseline per
the original `docs/STEP-2-BRAIN-DIR-DESIGN.md §4 Step 2.2` runbook
(unchanged). v0.18 sync deferred with zero cost: `gbrain dream` on
v0.17 is all Step 2.3 needs.

Artifacts preserved for upstream issue repro:
  /tmp/gbrain-upstream-peek/   (v0.18.2 build + pglite 0.4.4)
  /tmp/gbrain-smoke-v018-*/    (285 MB backup copy + .gbrain/)
  /tmp/smoke-env               (smoke-dir path reminder)

Net diff: +145 lines across two docs. Zero source / DB / plist touches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
garrytan added a commit that referenced this pull request May 8, 2026
…force flags

Adds Migration interface fields:
  - idempotent: boolean (default true; explicit false blocks verify-hook
    re-runs on destructive migrations)
  - verify: optional post-condition probe; runs after migration claims success

Migration retry wrapper (Cherry D3 / Finding F2):
  - 3 attempts with 5s/15s/45s backoff (env GBRAIN_MIGRATE_BACKOFF_MS=0
    for tests)
  - Retries only on statement_timeout (57014) or connection-reset patterns
  - Pre-attempt: logs idle-in-transaction blockers via getIdleBlockers
  - On exhaustion: throws MigrationRetryExhausted with named PID + suggested
    pg_terminate_backend() recovery command
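
A sketch of the retry shape, assuming simplified error objects; exactly how the three backoff values map onto attempts is a guess from the text above, and the exhaustion error is simplified:

```typescript
const BACKOFF_MS = [5_000, 15_000, 45_000]; // schedule from the commit text

function isRetryable(err: { code?: string; message?: string }): boolean {
  if (err.code === "57014") return true; // Postgres statement_timeout
  return /connection.*(reset|closed|terminated)/i.test(err.message ?? "");
}

async function withMigrationRetry<T>(
  attempt: () => Promise<T>,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise((r) => setTimeout(r, ms))
): Promise<T> {
  for (let i = 0; ; i++) {
    try {
      return await attempt();
    } catch (err) {
      const e = err as { code?: string; message?: string };
      // Non-retryable errors propagate immediately; on exhaustion the real
      // code throws MigrationRetryExhausted with PID + kill advice.
      if (!isRetryable(e) || i >= BACKOFF_MS.length) throw err;
      await sleep(BACKOFF_MS[i]);
    }
  }
}
```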

Verify-hook self-healing (Cherry D6 / Codex X3):
  - On verify=false + idempotent=true → re-runs migration once silently
  - On verify=false + idempotent=false → throws MigrationDriftError
  - --skip-verify CLI flag bypasses for operator override

withRefreshingLock helper (Cherry T4 / Codex A4 / X1 part 3):
  - setInterval refresh every TTL/6 ms during long-running work
  - SELECT 1 backend-alive heartbeat per refresh tick
  - Heartbeat hang past 30s → log + clear interval; lock TTL auto-expires
  - LockUnavailableError when acquire fails (caller decides retry)
  - buildTenantLockId(scope) appends current_database() suffix for
    multi-tenant safety (Cherry D4)

Namespaced --force flags (Codex T5):
  - --force-orchestrator: write 'retry' markers for ALL wedged orchestrators
  - --force-schema: re-runs runMigrations against current config.version
  - --force / --force-all: both
  - --force-retry vX.Y.Z: existing single-version reset (preserved)
  - --skip-verify: bypass verify-hook drift detection on a single run

Test additions:
  - test/migrate-extensions.test.ts: 14 cases (idempotent default,
    error envelopes, MIGRATIONS contract)
  - test/db-lock-refresh.test.ts: 10 cases (LockUnavailableError,
    buildTenantLockId multi-tenant, opts shape)
  - test/migrate.test.ts: updated 2 existing cases (PR #356 retry shape +
    function-name anchor) for v0.30.1 retry-wrapper semantics

156 unit tests passing across the v0.30.1 surface so far.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
garrytan added a commit that referenced this pull request May 8, 2026
…base (#750)

* v0.30.1 Lane A: connection-manager foundation + X1 initSchema routing

Routes Postgres queries by query type:
  - read() goes to the Supabase pooler (port 6543, fast)
  - ddl() and bulk() go to direct (port 5432, 30min stmt timeout, mwm 256MB)

Auto-detects Supabase via hostname pooler.supabase.com or port 6543.
Override with GBRAIN_DIRECT_DATABASE_URL. Kill-switch via
GBRAIN_DISABLE_DIRECT_POOL=1 falls back to single-pool legacy path.
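
The Supabase auto-detection rule reduces to a small URL predicate. The function name is illustrative, and the real connection-manager may check more than this:

```typescript
// Flags a URL as the Supabase transaction pooler when either the documented
// hostname suffix or the pooler port matches.
function isSupabasePoolerUrl(databaseUrl: string): boolean {
  const url = new URL(databaseUrl);
  return url.hostname.endsWith("pooler.supabase.com") || url.port === "6543";
}
```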

Foundation modules (Lane A scope):
- src/core/connection-manager.ts: read/ddl/bulk/healthCheck, parent-CM
  inheritance (T5/X1), cached Promise<Sql> lazy init (A1), kill-switch
  inheritance (A2), Supabase URL auto-derivation
- src/core/url-redact.ts: redactPgUrl + redactDeep (F3)
- src/core/retry-matcher.ts: typed predicates for stmt-timeout / lock /
  conn errors (C4)
- src/core/connection-audit.ts: ~/.gbrain/audit/connection-events JSONL
  with ISO-week rotation; doctor tail-reads last 5 errors (F8)
- scripts/check-pg-url-redaction.sh: CI grep guard against unredacted
  postgresql:// URL leaks (F3)

Engine integration:
- PostgresEngine.connect: instantiates instance-owned ConnectionManager,
  inherits from parentConnectionManager when set (worker engines, sync,
  cycle), shares pool with module-singleton path
- PostgresEngine.disconnect: tears down direct pool first
- PostgresEngine.initSchema: routes DDL through connectionManager.ddl()
  when dual-pool active (X1 part 1; lock semantics replacement is Lane B)
- cli.ts:connectEngine(opts): probeOnly skips initSchema entirely (X1
  part 2 — get_health, upgrade --status will use this)

Tests added (51 new cases):
- test/url-redact.test.ts: 11 cases
- test/retry-matcher.test.ts: 13 cases
- test/connection-manager.test.ts: 27 cases (URL detection, derive,
  kill-switch, parent inheritance, dual-pool routing modes)

Foundation for Lanes B-E. Sequential lane work continues.

Plan: ~/.claude/plans/system-instruction-you-are-working-stateless-wadler.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.30.1 Lane B: migration runner retry + verify hooks + namespaced --force flags

Adds Migration interface fields:
  - idempotent: boolean (default true; explicit false blocks verify-hook
    re-runs on destructive migrations)
  - verify: optional post-condition probe; runs after migration claims success

Migration retry wrapper (Cherry D3 / Finding F2):
  - 3 attempts with 5s/15s/45s backoff (env GBRAIN_MIGRATE_BACKOFF_MS=0
    for tests)
  - Retries only on statement_timeout (57014) or connection-reset patterns
  - Pre-attempt: logs idle-in-transaction blockers via getIdleBlockers
  - On exhaustion: throws MigrationRetryExhausted with named PID + suggested
    pg_terminate_backend() recovery command

Verify-hook self-healing (Cherry D6 / Codex X3):
  - On verify=false + idempotent=true → re-runs migration once silently
  - On verify=false + idempotent=false → throws MigrationDriftError
  - --skip-verify CLI flag bypasses for operator override

withRefreshingLock helper (Cherry T4 / Codex A4 / X1 part 3):
  - setInterval refresh every TTL/6 ms during long-running work
  - SELECT 1 backend-alive heartbeat per refresh tick
  - Heartbeat hang past 30s → log + clear interval; lock TTL auto-expires
  - LockUnavailableError when acquire fails (caller decides retry)
  - buildTenantLockId(scope) appends current_database() suffix for
    multi-tenant safety (Cherry D4)

Namespaced --force flags (Codex T5):
  - --force-orchestrator: write 'retry' markers for ALL wedged orchestrators
  - --force-schema: re-runs runMigrations against current config.version
  - --force / --force-all: both
  - --force-retry vX.Y.Z: existing single-version reset (preserved)
  - --skip-verify: bypass verify-hook drift detection on a single run

Test additions:
  - test/migrate-extensions.test.ts: 14 cases (idempotent default,
    error envelopes, MIGRATIONS contract)
  - test/db-lock-refresh.test.ts: 10 cases (LockUnavailableError,
    buildTenantLockId multi-tenant, opts shape)
  - test/migrate.test.ts: updated 2 existing cases (PR #356 retry shape +
    function-name anchor) for v0.30.1 retry-wrapper semantics

156 unit tests passing across the v0.30.1 surface so far.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.30.1 Lane C: backfill primitive + registry + X4 + X5

First-class generic backfill runner (Fix 3). Generalizes the
keyset+checkpoint+adaptive-batch pattern from
src/core/backfill-effective-date.ts so future backfills (embedding_voyage
in v0.30.2, etc.) reuse one tested runner.

NEW src/core/backfill-base.ts:
  - runBackfill() with keyset pagination, config-table checkpoint, adaptive
    batch halving on stmt timeout, conn-drop reconnect, max-errors bail
  - ensureBackfillIndex() verifies/creates partial index CONCURRENTLY (P2/X4)
  - clearBackfillCheckpoint() for --fresh path
  - T3 fix: writes go through engine.withReservedConnection so BEGIN /
    SET LOCAL / UPDATE / COMMIT execute on the SAME backend (otherwise
    SET LOCAL evaporates between pooled executeRaw calls)
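
The keyset + checkpoint skeleton behind runBackfill can be sketched as follows; the row shape, batch callbacks, and checkpoint store are all stand-ins (adaptive halving and reconnects are omitted):

```typescript
interface Row { id: number }

// Keyset pagination: each batch resumes strictly after the last seen id,
// and the checkpoint is persisted after every batch so a killed process
// can resume instead of rescanning.
async function keysetBackfill(
  fetchBatch: (afterId: number, limit: number) => Promise<Row[]>,
  processBatch: (rows: Row[]) => Promise<void>,
  saveCheckpoint: (lastId: number) => Promise<void>,
  batchSize = 500,
  startAfter = 0
): Promise<number> {
  let cursor = startAfter;
  let examined = 0;
  for (;;) {
    const rows = await fetchBatch(cursor, batchSize);
    if (rows.length === 0) return examined;
    await processBatch(rows);
    cursor = rows[rows.length - 1].id;
    examined += rows.length;
    await saveCheckpoint(cursor);
  }
}
```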

NEW src/core/backfill-registry.ts:
  - effective_date: implemented (wraps existing computeEffectiveDate)
  - emotional_weight: implemented (wraps computeEmotionalWeight + stamps
    new emotional_weight_recomputed_at column)
  - embedding_voyage: declared-only in v0.30.1 (multi-column embedding
    schema lands in v0.30.2)

NEW src/commands/backfill.ts:
  - gbrain backfill <kind> [--batch-size N] [--concurrency N] [--resume]
                          [--fresh] [--dry-run] [--keep-index] [--max-errors N]
  - gbrain backfill list — shows registered backfills + status
  - X5 admission control: clampConcurrency() forces --concurrency to
    GBRAIN_DIRECT_POOL_SIZE - 1 ceiling (always reserves 1 conn for HNSW
    + heartbeat + doctor probes). Loud-warns when user requests above.
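
The X5 admission-control clamp is worth spelling out; the signature below is assumed, not the real clampConcurrency:

```typescript
// Forces requested concurrency under the direct-pool budget, always
// reserving one connection (HNSW build, heartbeat, doctor probes).
function clampConcurrency(
  requested: number,
  directPoolSize: number,
  warn: (msg: string) => void = console.warn
): number {
  const ceiling = Math.max(1, directPoolSize - 1);
  if (requested > ceiling) {
    warn(`--concurrency ${requested} exceeds pool budget; clamped to ${ceiling}`);
    return ceiling;
  }
  return requested;
}
```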

Schema migration v44 (X4 / Codex C8 fix):
  - pages.emotional_weight_recomputed_at TIMESTAMPTZ
  - emotional_weight = 0 is a VALID steady-state value per migration v40,
    so the original P2 predicate ("WHERE emotional_weight = 0") would have
    been a permanent large index over normal data. The corrected backlog
    predicate is "emotional_weight_recomputed_at IS NULL"; the partial
    index drops naturally as the cycle phase + this backfill stamp the
    column over time.
  - idempotent: true (ADD COLUMN ... NULL is metadata-only)

CLI integration:
  - src/cli.ts: registers `backfill` subcommand
  - reindex-frontmatter stays as thin alias for v0.30.1 back-compat;
    canonical entrypoint is now `gbrain backfill effective_date`

Test additions:
  - test/backfill-base.test.ts: 11 cases (keyset, checkpoint, dry-run,
    resume/fresh, maxRows cap, withReservedConnection routing, error
    paths, clearCheckpoint, ensureBackfillIndex)
  - test/backfill-concurrency-clamp.test.ts: 6 cases (X5 admission control)

173 unit tests passing across Lanes A+B+C of v0.30.1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.30.1 Lane D: HNSW lifecycle manager + A3 atomic-swap

Extends src/core/vector-index.ts with the v0.30.1 lifecycle layer.
The original chunkEmbeddingIndexSql / applyChunkEmbeddingIndexPolicy
contract is preserved unchanged.

New surfaces:
  - checkActiveBuild(engine, indexName): probes pg_stat_activity for an
    active CREATE INDEX or REINDEX on the named index. Used as pre-op
    guard so dropAndRebuild doesn't compete with a build already in
    flight (Supabase auto-maintenance, parallel gbrain procs).

  - dropZombieIndexes(engine, tableNames): startup sweep of
    indisvalid=false rows on gbrain tables. Drops them with
    DROP INDEX IF EXISTS, BUT skips any zombie that has an active build
    still in pg_stat_activity (codex Fix-5 in-progress-build guard).
    Wired into PostgresEngine.initSchema() — runs after migrations +
    verifySchema, best-effort, never blocks engine.connect().

  - dropAndRebuild(engine, spec, opts): A3 atomic-swap pattern:
      1. checkActiveBuild → bail if another build is active (--force overrides)
      2. CREATE INDEX CONCURRENTLY <name>_rebuild_<unix-ms> via
         engine.withReservedConnection (CONCURRENTLY can't run in a txn)
      3. Atomic swap inside engine.transaction:
           DROP INDEX <old-name>
           ALTER INDEX <temp-name> RENAME TO <old-name>
      4. If step 2 fails (OOM, timeout, conn drop), the OLD index stays
         intact and search keeps serving queries. This is the headline
         A3 win — no production-degraded silent failure mode.

  - monitorBuild(engine, indexName, onProgress, opts): poll
    pg_stat_activity every 30s; emit elapsed_ms + size_bytes (via
    pg_relation_size) + pid. Used by gbrain backfill embedding_voyage
    when batch > 1000 triggers a rebuild.

  - isSupabaseAutoMaintenance(active): predicate on application_name
    (matches "supabase" / "postgres-meta"). Used by dropAndRebuild to
    log + back off when Supabase auto-maintenance is doing the rebuild.
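
The A3 swap ordering can be sketched with a stand-in engine: step 2 runs outside any transaction, step 3 inside one, and the temp-name format follows the description above. The engine API here is illustrative:

```typescript
// Stand-in for the engine surface used by the rebuild path.
interface SwapEngine {
  run(sql: string): Promise<void>; // reserved-connection DDL (no txn)
  transaction(fn: (run: (sql: string) => Promise<void>) => Promise<void>): Promise<void>;
}

function rebuildTempName(indexName: string, nowMs: number): string {
  return `${indexName}_rebuild_${nowMs}`;
}

async function dropAndRebuild(
  engine: SwapEngine,
  indexName: string,
  createSql: (name: string) => string
): Promise<void> {
  const temp = rebuildTempName(indexName, Date.now());
  // CONCURRENTLY build outside any transaction. If this fails (OOM, timeout,
  // conn drop), the old index is untouched and search keeps serving.
  await engine.run(createSql(temp));
  // Atomic swap: drop old + rename temp inside one transaction.
  await engine.transaction(async (run) => {
    await run(`DROP INDEX ${indexName}`);
    await run(`ALTER INDEX ${temp} RENAME TO ${indexName}`);
  });
}
```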

Engine integration:
  - PostgresEngine.initSchema() calls dropZombieIndexes after verifySchema.
    Surfaces zombie counts via console.log.
  - Best-effort wrapped in try/catch: pg_stat_activity / pg_index access
    can be restricted on managed Postgres tiers; gbrain shouldn't fail
    engine.connect() over diagnostic queries.

Test additions (18 cases):
  - test/vector-index-lifecycle.test.ts:
    * chunkEmbeddingIndexSql contract (3 cases) — pre-existing behavior preserved
    * applyChunkEmbeddingIndexPolicy contract (1 case)
    * checkActiveBuild (4 cases, including PGLite no-op + best-effort failure)
    * isSupabaseAutoMaintenance (3 cases)
    * dropZombieIndexes (4 cases, including in-progress-build guard)
    * dropAndRebuild atomic-swap (3 cases, including PGLite + active-build bail
      + temp-name format assertion)

191 unit tests passing across Lanes A+B+C+D of v0.30.1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.30.1 Lane E: upgrade pipeline checkpoint + brain_id binding + get_health migrations

NEW src/core/upgrade-checkpoint.ts:
  - Cherry D5: persists step-by-step progress through gbrain post-upgrade
    so partial failures can be resumed via gbrain upgrade --resume.
    Steps: pull → install → schema → features → backfills → verify.
  - Codex X2: checkpoint binds to brain identity via sha256(database_url)
    (userinfo stripped before hashing so cred rotations don't invalidate).
    PGLite uses sha256(database_path). Cross-brain checkpoint application
    is now refused with reason='brain_mismatch'.
  - F4 fall-through: validateCheckpoint returns reason='no_checkpoint'
    when none exists, enabling silent fall-through to a full upgrade.
  - All-complete detection: stale checkpoints (every step done) return
    reason='all_complete' so the next run clears + re-runs from scratch.
  - markStepComplete + markStepFailed maintain the partial-state shape.
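
The X2 identity binding reduces to hashing the URL with userinfo stripped; computeBrainId's real signature and normalization may differ from this sketch:

```typescript
import { createHash } from "node:crypto";

// brain_id = sha256 of the database URL with credentials removed, so
// rotating a password does not invalidate an in-flight checkpoint.
function computeBrainId(databaseUrl: string): string {
  const url = new URL(databaseUrl);
  url.username = "";
  url.password = "";
  return createHash("sha256").update(url.toString()).digest("hex");
}
```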

T2 preserved: upgrade.ts still re-execs `gbrain post-upgrade` so the NEW
binary's migration registry runs (the existing re-exec pattern is correct
per codex round 1's plan-breaking finding). The checkpoint module is the
substrate that Lane E's --resume / --status surfaces will plumb through
in v0.30.2.

D7 + C3 contract committed:
  - BrainHealth.schema_version: '1' (literal type) — additive-only contract
    pinned for MCP get_health consumers.
  - BrainHealth.migrations: { schema, orchestrator } — explicit two-ledger
    diagnostic surface (codex T5 namespacing). Both fields are OPTIONAL
    in v0.30.1 — engines can populate them in v0.30.2 without a contract
    bump. Backwards/forwards compat: clients default-handle missing fields.

VERSION: 0.30.0 → 0.30.1
package.json: synced

Test additions (18 cases):
  - test/upgrade-checkpoint.test.ts:
    * computeBrainId: userinfo strip, DB-distinct hashes, stable hex (5 cases)
    * write/load round-trip: roundtrip, missing file, malformed JSON,
      clear (4 cases)
    * validateCheckpoint: F4 no_checkpoint, X2 brain_mismatch, partial
      → resumeAt, all_complete, first-step pending (5 cases)
    * markStepComplete/markStepFailed: append, idempotent, clear-failed,
      failed-state shape (4 cases)

209 unit tests passing across all 5 lanes of v0.30.1 (Lanes A-E core
foundations). Plumbing into upgrade.ts CLI + doctor checks +
get_health() implementation is layered in via follow-up commits within
this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.30.1 e2e + test isolation: integration smoke + serial quarantine

NEW test/e2e/v030_1-integration-pglite.test.ts (14 cases):
  PGLite integration smoke proving Lane A-E surfaces work together.
    Lane B: migration runner applies v44 (emotional_weight_recomputed_at)
            cleanly; config.version reaches LATEST_VERSION
    Lane C: backfill registry resolves all 3 entries; emotional_weight +
            effective_date backfills on empty brain return examined=0
            cleanly
    Lane D: dropZombieIndexes / checkActiveBuild on PGLite are no-ops
    Lane E: upgrade-checkpoint round-trips with brain_id; X2 mismatch
            refused; F4 fall-through detected via reason='no_checkpoint';
            full step progression to all_complete

Test isolation hygiene (scripts/check-test-isolation.sh):
  - test/connection-manager.test.ts → connection-manager.serial.test.ts
  - test/backfill-concurrency-clamp.test.ts → .serial.test.ts
  - test/upgrade-checkpoint.test.ts → .serial.test.ts
  All three files mutate process.env (kill-switch, GBRAIN_DIRECT_POOL_SIZE,
  GBRAIN_HOME) which would race other tests in the parallel runner.
  *.serial.test.ts quarantine ensures they run at --max-concurrency=1.
  Choice between withEnv() refactor and serial quarantine made on the side
  of preserving existing well-formed test code.

E2E coverage status:
  - v030_1-integration-pglite.test.ts (this commit): 14 cases, all green
  - backfill-perf-pglite.test.ts: 1 case, green (no regression)
  - cycle-recompute-emotional-weight-pglite.test.ts: green (no regression)
  - multi-source-emotional-weight-pglite.test.ts: green (no regression)
  - dream-synthesize-pglite.test.ts: 14 cases, green (no regression)
  - anomalies-pglite.test.ts + salience-pglite.test.ts: 6 cases, green

Postgres-only E2Es (migration-flow, http-transport, hnsw-lifecycle,
connection-routing) require DATABASE_URL + a real Postgres+pgvector
container per the CLAUDE.md E2E lifecycle. They land as separate
DATABASE_URL-gated work — not regressed by v0.30.1 changes; their
preconditions just aren't met in the current run environment.

`bun run verify` (typecheck + 4 shell pre-checks + test-isolation lint)
passes cleanly.

Final v0.30.1 unit + integration test count: 4547 pass, 0 regressions.
Two pre-existing flaky failures (BrainRegistry serial test + warm-create
perf gate under shard contention) confirmed unrelated to this branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.30.1)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>