v0.21.1 fix: initSchema bootstrap order — multi-version upgrades succeed by lloydarmbrust · Pull Request #432 · garrytan/gbrain

lloydarmbrust · 2026-04-26T00:48:23Z

Summary

PostgresEngine.initSchema() ran SCHEMA_SQL (latest target schema) before runMigrations(). For any existing brain at an intermediate schema version, SCHEMA_SQL contains CREATE INDEX statements that reference columns added by intermediate migrations (e.g. idx_pages_source_id ON pages(source_id), where pages.source_id is added by migration v21). The index creation aborted with column "source_id" does not exist and the upgrade wedged. gbrain doctor reported MINIONS HALF-INSTALLED (partial migration: 0.11.0) and re-running gbrain apply-migrations --yes looped on the same error.

This PR detects existing brains via config.version and runs migrations first on a best-effort basis, so column-dependent indexes succeed. Then SCHEMA_SQL runs, then a final runMigrations() pass mops up post-SCHEMA_SQL migrations like the v24 RLS-backfill that needs subagent_messages (which SCHEMA_SQL creates).

Repro case: a real production brain pinned at v4 (running gbrain 0.10.x for a while), tried gbrain apply-migrations --yes after upgrading to 0.20.4. Hit the wedge. Patched initSchema() and the same brain advanced to v24 cleanly.

Pre-Landing Review

Diff is 30 lines of bootstrap logic + 113 lines of E2E test. Reviewed locally:

Behavior preservation: Fresh install path is unchanged. existingVersion is only non-null when config table exists and has a version row. New installs hit the else branch with the same SCHEMA_SQL → migrations order as today.
Idempotency: Calling initSchema() on an already-up-to-date brain runs migrations (no-op), then SCHEMA_SQL (no-op via IF NOT EXISTS), then migrations again (no-op). Verified by the 4th test case.
Failure mode: If the first migrations pass throws (e.g. v24 RLS-backfill failing because SCHEMA_SQL hasn't created subagent_messages yet), the swallow logs and continues. SCHEMA_SQL runs, then the second migrations pass retries the same migration with the now-existing tables and succeeds.
No new SQL: All SQL is reused from existing migrations. The fix is pure ordering.

Test plan

New test/e2e/initschema-bootstrap.test.ts: 4 tests, all pass against pgvector/pgvector:pg16. Rolls back to v4-era schema (drops pages.source_id, files.source_id, files.page_id, sources table, file_migration_ledger table, restores pre-v21 pages_slug_key), then asserts initSchema() reaches LATEST_VERSION and that pages.source_id + sources('default') exist post-recovery. Also asserts idempotency on re-run.
E2E suite: 210 of 211 pass on this branch (the 1 fail is minions-shell.test.ts, unrelated to this change). Master before this patch: same 210/211.
Unit tests: 38 pre-existing failures on master (v0.21.0 Cathedral II + PGLite churn), no NEW failures introduced by this patch. Verified by stashing the patch and re-running master.
bun build --compile succeeds.

Manual repro before/after

# Before this patch, against a real v4-era brain:
$ gbrain init --migrate-only
...
column "source_id" does not exist

# After this patch, same brain:
$ gbrain init --migrate-only
  Schema version 4 → 24 (17 migration(s) pending)
  [5] minion_jobs_table... ✓
  ... (15 more migrations succeed) ...
  [24] rls_backfill_missing_tables... ✓
Schema up to date (engine: postgres).

CHANGELOG

v0.21.1 entry added per the gbrain CHANGELOG voice rules: bold headline, lead paragraph, "what this means" closer, "To take advantage of v0.21.1" self-repair block (existing brains running gbrain doctor need it to surface this fix). No Privacy or Security hardening framing — this is a straight bug fix.

🤖 Generated with Claude Code

PostgresEngine.initSchema() ran SCHEMA_SQL (the embedded latest-state schema) before runMigrations(). For an existing brain at an intermediate version, SCHEMA_SQL contains CREATE INDEX statements that reference columns added by intermediate migrations (e.g. idx_pages_source_id on pages.source_id, added in migration v21). The index creation aborts with `column "source_id" does not exist` and the upgrade wedges. When config.version > 0 is detected, run migrations first on a best-effort basis so column-dependent indexes succeed. Then SCHEMA_SQL runs as latest-state enforcer. Then a final runMigrations() pass mops up post-SCHEMA_SQL migrations like the v24 RLS-backfill that reference tables created by SCHEMA_SQL (subagent_messages, etc.). Fresh installs unchanged: SCHEMA_SQL first, migrations second, no config.version row to trigger the existing-brain branch. Verified against a real Postgres rolled back to a v4-era schema: recovery now reaches LATEST_VERSION and pages.source_id + sources('default') exist post-recovery. Test added. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Master is at v0.21.0. Open PRs claim v0.21.1 (#432) and v0.24.0 (#387). v0.25 is the first uncontested slot, so this branch claims it. Pure rename across VERSION, package.json, CHANGELOG header, and every "v0.22.0" reference in CLAUDE.md / README.md / TODOS.md / docs/eval-capture.md / src/ / test/ files. CHANGELOG date bumped to 2026-04-26. llms.txt + llms-full.txt regenerated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Qodo-Free-For-OSS · 2026-04-27T07:20:06Z

Hi, PostgresEngine.initSchema() prints pre-schema migration errors to stdout, which can corrupt commands that must emit clean JSON (e.g. gbrain init --migrate-only --json). This can break automation/CI that parses stdout as JSON when upgrades hit the pre-schema failure path.

Severity: action required | Category: reliability

How to fix: Log to stderr only

Agent prompt to fix - you can give this to your LLM of choice:

Issue description

PostgresEngine.initSchema() currently uses console.log() to report pre-schema migration failures. This writes to stdout and can corrupt commands that must output clean JSON on stdout (e.g. gbrain init --migrate-only --json).

Issue Context

src/commands/init.ts prints JSON to stdout when --json is passed.

Repo convention requires progress/logging to stderr so stdout remains clean.

Fix Focus Areas

src/core/postgres-engine.ts[73-115]

src/commands/init.ts[78-103]

CLAUDE.md[364-374]

Expected change

Replace console.log(...) with console.error(...) (or route through the shared progress/logger if available) for the pre-schema migration catch path.

Ensure no non-JSON stdout is emitted during --json flows.

We noticed a couple of other issues in this PR as well - happy to share if helpful.

Found by Qodo code review

…ct test (#437) * feat(v0.22.0): eval_candidates + eval_capture_failures schema (Lane 1A) R1 substrate for BrainBench-Real, replayed onto master after Cathedral II landed. Migration v30 (slotted after master's v25-v29 Cathedral II wave) creates two tables: eval_candidates: per-call capture of MCP/CLI/subagent query+search traffic. Column set lets gbrain-evals replay with full fidelity — source_ids from v0.18 multi-source, vector_enabled/detail_resolved/ expansion_applied so replay knows what hybridSearch actually did, remote + job_id + subagent_id so rows are traceable to their origin. query is CHECK-capped at 50KB; PII scrubber (Lane 1B) runs before insert. eval_capture_failures: cross-process audit trail. In-process counters don't work because `gbrain doctor` runs in a separate process from the MCP server. Persistent rows let doctor query capture health via COUNT(*) GROUP BY reason over the last 24h. Both tables get RLS on Postgres gated on BYPASSRLS (matches v24/v29 posture). PGLite ignores RLS; sqlFor split carries only DDL. 5 new BrainEngine methods (breaking-interface addition, drives v0.22.0 minor bump): logEvalCandidate, listEvalCandidates, deleteEvalCandidatesBefore, logEvalCaptureFailure, listEvalCaptureFailures. listEvalCandidates uses ORDER BY created_at DESC, id DESC so `gbrain eval export` is deterministic across same-millisecond inserts. Also adds HybridSearchMeta type for the side-channel callback used by Lane 1C's op-layer capture (no change to hybridSearch return shape — that respects Cathedral II's existing SearchResult[] contract). Tests: 14 PGLite round-trip cases + 8 v30 structural assertions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(v0.22.0): PII scrubber + op-layer capture module (Lane 1B) Replayed onto master post-Cathedral II. Same semantics as the original v0.21.0 work — only adjusted to import HybridSearchMeta from types.ts (canonical home) instead of redeclaring it locally. src/core/eval-capture-scrub.ts — pure-function regex scrubber with 6 pattern families: emails, phones (US + E.164), SSN (year-aware), Luhn-verified credit cards, JWT-shaped tokens, bearer tokens. Zero deps. Adversarial-input safe. src/core/eval-capture.ts — op-layer hook helper: - buildEvalCandidateInput(ctx, {scrub_pii}) — pure row builder - classifyCaptureFailure(err) — Postgres SQLSTATE → reason tag - captureEvalCandidate(engine, ctx, opts) — best-effort, never throws - isEvalCaptureEnabled / isEvalScrubEnabled — file-plane config checks GBrainConfig gains `eval?: {capture?, scrub_pii?}`. Both default ON. File-plane only — `gbrain config set` writes the DB plane, doesn't control capture. Tests: 17 scrubber + 21 capture-module cases. Zero regressions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(v0.22.0): hybridSearch onMeta callback + op-layer capture (Lane 1C) Replayed onto master. Adapted from the original v0.21.0 work to keep Cathedral II's contract intact: hybridSearch's return stays `Promise<SearchResult[]>` (unchanged), and meta surfaces via an optional `onMeta?: (meta: HybridSearchMeta) => void` callback in HybridSearchOpts. Cathedral II callers leave onMeta undefined and pay no cost. The op-layer capture wrapper passes a closure that threads meta into the captured row so gbrain-evals can distinguish: - "with OPENAI_API_KEY" vs "keyword-only fallback" (vector_enabled) - "expansion fired" vs "expansion requested + silently fell back" (expansion_applied) - what hybridSearch actually used after auto-detect (detail_resolved) Op-layer capture wired into both `query` and `search` op handlers in src/core/operations.ts. Single hook site catches MCP dispatch + CLI + subagent tool-bridge from the same place. Fire-and-forget, never throws, respects ctx.config.eval.capture off-switch. Tests: - test/hybrid-meta.test.ts (8 cases) — onMeta accuracy across the 4 return paths in hybridSearch + verification that omitting onMeta leaves Cathedral II callers unchanged. - test/mcp-eval-capture.test.ts (10 cases) — query/search ops capture correctly with MCP/CLI/subagent contexts, scrub on/off, capture=false off-switch, non-captured ops (list_pages, get_page), F1 failure isolation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(v0.22.0): gbrain eval export/prune + doctor eval_capture check (Lane 1D) Replayed onto master. Same semantics as the original v0.21.0 work. CLI: gbrain eval export [--since DUR] [--limit N] [--tool query|search] NDJSON to stdout, every row prefixed with "schema_version":1 per docs/eval-capture.md contract. EPIPE-safe streaming, stderr heartbeats, deterministic ordering (created_at DESC, id DESC). gbrain eval prune --older-than DUR [--dry-run] Explicit retention cleanup. Requires --older-than (never deletes without a window). Duration strings: 30d, 7d, 1h, 90m, 3600s. Legacy bare `gbrain eval --qrels …` still works via sub-subcommand fall-through. gbrain doctor gains an eval_capture check between markdown_body_completeness and queue_health: reads eval_capture_failures for the last 24h, groups by reason, warns when non-zero. Pre-v30 brains get "Skipped (table unavailable)" — non-fatal. docs/eval-capture.md ships the stable NDJSON schema reference for gbrain-evals consumers. Tests: 9 export cases + 5 prune cases. Doctor check covered by existing doctor tests on master. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(v0.22.0): public-exports contract test + CI count guard (Lane 2 / R2) Master locks 17 public subpath exports as gbrain's stable third-party contract. Zero enforcement existed. This PR locks the surface in two layers: 1. test/public-exports.test.ts — runtime contract test. Reads package.json "exports" at startup. For each subpath, imports via the package name ("gbrain/engine"), NOT the relative filesystem path — that's the difference between exercising the actual resolver and bypassing it. Every subpath gets a canary symbol pinned (e.g. gbrain/search/hybrid must export hybridSearch + rrfFusion) so a refactor that renames or removes one fails CI before downstream consumers (gbrain-evals) silently break. 2. scripts/check-exports-count.sh — CI structural guard. Wired into `bun test` after check-jsonb-pattern.sh + check-progress-to-stdout.sh + check-wasm-embedded.sh per master's precedent. EXPECTED_COUNT=17 baseline — shrinks fail loudly, growth also fails so the new canary must be pinned in the runtime test deliberately. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs+e2e(v0.22.0): VERSION/CHANGELOG/CLAUDE/README + Postgres E2E (Lane 3) Bump VERSION + package.json to 0.22.0 (next free slot after master's v0.21.0 Code Cathedral II minor). CHANGELOG.md v0.22.0 entry follows the Garry voice template: - Bold 2-line headline - Lead paragraph contextualizing v0.20 + v0.21 + v0.22 progression - Numbers-that-matter table comparing v0.21.0 → v0.22.0 - "What this means for you" sectioned by audience - "## To take advantage of v0.22.0" operator runbook - Itemized changes CLAUDE.md updates: - Key files: 8 new module entries (eval-capture*, eval-export, eval-prune, docs/eval-capture.md, public-exports test). hybrid.ts entry rewritten to reflect the additive `onMeta` callback (return shape unchanged). - Key commands: new v0.22.0 section for `gbrain eval export`, `gbrain eval prune`, and the doctor `eval_capture` check, with the file-plane vs DB-plane config gotcha called out. README.md: one-paragraph pointer after the BrainBench blurb so anyone reading the landing page sees the new session-capture feature. llms.txt + llms-full.txt regenerated to pick up the doc additions. test/e2e/eval-capture.test.ts (Postgres-only E1 spec): - CHECK violation surfaces as Postgres SQLSTATE 23514 on oversize input - RLS is actually enabled on both eval_candidates + eval_capture_failures - 50 concurrent logEvalCandidate calls — no deadlock, all distinct IDs Skips gracefully when DATABASE_URL is unset. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(todos): P0 — PGLite test-runner concurrency flake Pre-existing on master, surfaces ~27 false failures when bun test runs all 174 files together. Each failing file passes in isolation. Tracked for a dedicated investigation branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(v0.22.0): adversarial review post-fixes (doctor RLS, onMeta safety) Two surgical fixes from /ship adversarial review, plus 6 follow-ups TODO'd into v0.22.1: - doctor.ts: distinguish pre-v30 missing-table (42P01, ok skip) from RLS-denied SELECT (42501, warn) and other DB errors (warn). The check exists specifically to surface capture-failure misconfigs cross-process, so silently reporting "ok / skipped" on the most diagnostic class defeated the purpose. - hybrid.ts: wrap onMeta invocation in try/catch via small emitMeta helper. The callback is part of the public gbrain/search/hybrid contract; a throwing user-supplied closure must never break the search hot path. - TODOS.md: 6 P1 follow-ups (eval prune real COUNT, scrubber CC false positives, dead 'scrubber_exception' enum value, id-cursor for cross-window dedup, public-export canary pinning, EXPECTED_COUNT dedup). - TODOS.md: P0 entry for the pre-existing PGLite test-runner concurrency flake (~27 false failures in full bun test on master). - CHANGELOG.md: 2 bullets noting the doctor + onMeta hardening. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(version): bump v0.22.0 → v0.25.0 (queue-aware version pick) Master is at v0.21.0. Open PRs claim v0.21.1 (#432) and v0.24.0 (#387). v0.25 is the first uncontested slot, so this branch claims it. Pure rename across VERSION, package.json, CHANGELOG header, and every "v0.22.0" reference in CLAUDE.md / README.md / TODOS.md / docs/eval-capture.md / src/ / test/ files. CHANGELOG date bumped to 2026-04-26. llms.txt + llms-full.txt regenerated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(v0.25.0): gbrain eval replay + contributor doc + CONTRIBUTING link Closes the gap between "session capture works" (this PR's core) and "contributors actually use it before merging." Three artifacts: - src/commands/eval-replay.ts (~340 LOC) — reads NDJSON from `gbrain eval export`, re-runs each captured query/search against the current brain, computes set-Jaccard@k, top-1 stability, and latency delta. Stable JSON shape (schema_version:1) for CI gating; human mode prints a regression table sorted worst-first. Pure Bun, zero new deps. Stub-engine tests cover Jaccard math, NDJSON parser (including v2 forward-compat rejection + line-numbered errors), --limit, --verbose, --json, and graceful per-row error handling. 16/16 passing. - docs/eval-bench.md (~80 lines) — contributor guide. The 4-command loop (export → change → replay → diff), metric definitions with healthy ranges (Jaccard ≥0.85, top-1 ≥85%, latency Δ within ±50ms), trigger paths, CI integration snippet, hand-crafted NDJSON corpus path for fresh installs, and the off-switch. Pairs with the existing docs/eval-capture.md which is the consumer-facing wire format. - CONTRIBUTING.md gains a "Running real-world eval benchmarks (touching retrieval code)" section with the trigger paths and a link to docs/eval-bench.md. Reviewers now have a one-line ask: "did you run replay?" CLAUDE.md key files updated. CHANGELOG bullets added. llms.txt regenerated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(v0.25.0): CONTRIBUTOR_MODE flag — capture off by default for users Eval capture was on for everyone in the v0.25.0 draft. Privacy footgun: end users had retrieval traffic accumulate in their brain DB without asking, even with PII scrubbing. Flips to off by default + explicit opt-in for contributors who actually use the replay loop. Resolution order in isEvalCaptureEnabled(): 1. config.eval.capture === true → on 2. config.eval.capture === false → off 3. process.env.GBRAIN_CONTRIBUTOR_MODE === '1' → on 4. otherwise → off The env var is the contributor-facing toggle (one line in .zshrc, no JSON edit). Explicit config wins both directions for users who want to override per-brain. PII scrubbing gate stays independent — default true regardless of CONTRIBUTOR_MODE — so any brain that does capture still scrubs. Tests rewritten: env var hygiene per-test (origMode preserved + restored in finally). 9/9 pass; total v0.25.0 suite is 198/198. Docs: - README.md gains a Contributing-section pointer to the env var. - CONTRIBUTING.md gains a "CONTRIBUTOR_MODE — turn on the dev loop" section with verification commands and resolution-order table. - docs/eval-bench.md leads with the prerequisite (must set the env var for the rest of the doc to be useful). - docs/eval-capture.md "Config" section split into Path A (env var) + Path B (config) with explicit resolution-order rules. - CHANGELOG v0.25.0 entry corrected ("on by default" was wrong) plus a new top itemized bullet calling out the gate change. - CLAUDE.md eval-capture entry annotated with the new gate logic. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: post-ship documentation pass for v0.25.0 Cross-references every doc against the final state of the branch (CONTRIBUTOR_MODE flag, eval replay tool, off-by-default capture): - README.md: top callout rewritten — was implying capture-on-by-default contradicting the gate landed in 7a80ce2. Now leads with "contributor opt-in" and links docs/eval-bench.md alongside docs/eval-capture.md. - AGENTS.md: new "Eval retrieval changes" task entry with the CONTRIBUTOR_MODE+replay one-liner so non-Claude agents (Codex, Cursor, Aider) have the same path. - CLAUDE.md: "Key commands added in v0.25.0" gains the replay command and a CONTRIBUTOR_MODE bullet covering the resolution order. - CHANGELOG.md: headline rewritten to match the actual feature ("benchmark retrieval changes against real captured queries before merging" — was "every real query is captured"). Stale "v0.22 ships the substrate" → v0.25. Test count corrected 82 → 144 (added 16 replay + 9 CONTRIBUTOR_MODE + 8 v31-shape tests since the original count). Two metric rows added to the numbers table: default-off posture, in-tree replay tooling. "To take advantage" block split into user vs contributor branches with shell-rc instructions. - TODOS.md: v0.22.1 follow-up reference corrected to v0.25.1. llms.txt + llms-full.txt regenerated. Typecheck clean. 198/198 v0.25.0 tests still green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

garrytan · 2026-05-10T03:33:44Z

Closing — initSchema bootstrap order fix shipped in v0.22.6.1.

Thanks for the report. If anything still reproduces on the latest release, please reopen with the version + repro.

garrytan closed this May 10, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v0.21.1 fix: initSchema bootstrap order — multi-version upgrades succeed#432

v0.21.1 fix: initSchema bootstrap order — multi-version upgrades succeed#432
lloydarmbrust wants to merge 1 commit into
garrytan:masterfrom
lloydarmbrust:fix/initschema-bootstrap-order

lloydarmbrust commented Apr 26, 2026

Uh oh!

Qodo-Free-For-OSS commented Apr 27, 2026

Issue description

Issue Context

Fix Focus Areas

Expected change

Uh oh!

garrytan commented May 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

lloydarmbrust commented Apr 26, 2026

Summary

Pre-Landing Review

Test plan

Manual repro before/after

CHANGELOG

Uh oh!

Qodo-Free-For-OSS commented Apr 27, 2026

Issue description

Issue Context

Fix Focus Areas

Expected change

Uh oh!

garrytan commented May 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants