bug: HNSW 2000-dim cap still hit on upgrade path despite #640 fix (zembed-1 2560d, schema v66 → v67) #1141

@metaWuming

Summary

Follow-up to closed #640. The schema-side fix (applyChunkEmbeddingIndexPolicy + getPostgresSchema(dims, model)) works for fresh installs, but upgrading an existing brain still trips the same "column cannot have more than 2000 dimensions for hnsw index" error, even when ~/.gbrain/config.json correctly reports embedding_dimensions: 2560.

The fix didn't reach the upgrade path. Migration v67 (and presumably any future migration shipped on master) is gated behind a bootstrap stage that still emits the HNSW index.

Repro

Pre-conditions:

  • Brain on Postgres (Supabase Session pooler, port 5432).
  • content_chunks.embedding was ALTER COLUMN ... TYPE vector(2560) for ZeroEntropy zembed-1.
  • ~/.gbrain/config.json:
    {
      "engine": "postgres",
      "embedding_model": "zeroentropyai:zembed-1",
      "embedding_dimensions": 2560
    }
  • idx_chunks_embedding HNSW dropped at switch-time (pgvector refused to keep HNSW on 2560-dim column — that part is well-understood).
  • idx_takes_embedding_hnsw likewise dropped.
  • Schema version: 66 (latest on v0.35.1.1).

Steps:

cd ~/gbrain
git pull origin master   # → v0.35.7.0 (1dadd9ed)
bun install              # postinstall runs `gbrain apply-migrations` (prints `All migrations up to date.` — misleading, see below)
gbrain init --migrate-only

Output (last 30 lines, after ~25 expected 42P07 NOTICEs):

{ severity: "NOTICE", code: "42P07",
  message: 'relation "idx_chunks_page" already exists, skipping', ... }
column cannot have more than 2000 dimensions for hnsw index

Exit code: 1. Migration v67 never runs.

Expected

getPostgresSchema(dims=2560, model='zembed-1') should be invoked on the upgrade path (same way it's invoked on fresh-install). applyChunkEmbeddingIndexPolicy should replace the HNSW CREATE INDEX with the skip-comment, and conn.unsafe(sqlText) should pass cleanly.
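
To make the expected behavior concrete, here is a minimal sketch of a policy function of this kind. It is not gbrain's actual applyChunkEmbeddingIndexPolicy: the name stripHnswIndexes, the regex, and the wording of the skip-comment are all assumptions for illustration, and the regex does not try to handle semicolons inside SQL comments or string literals.

```typescript
// Hypothetical stand-in for a policy like applyChunkEmbeddingIndexPolicy.
// pgvector rejects HNSW indexes on `vector` columns above 2000 dimensions.
const HNSW_DIM_CAP = 2000;

// Matches a single `CREATE INDEX [IF NOT EXISTS] name ... USING hnsw ...;`
// statement. `[^;]` keeps the match from spilling into a neighbouring statement.
const HNSW_INDEX_STATEMENT =
  /CREATE\s+INDEX\s+(?:IF\s+NOT\s+EXISTS\s+)?(\w+)[^;]*?USING\s+hnsw[^;]*;/gi;

function stripHnswIndexes(sqlText: string, dims: number): string {
  // At or below the cap, HNSW is allowed: return the schema untouched.
  if (dims <= HNSW_DIM_CAP) return sqlText;
  // Above the cap, replace each HNSW statement with a one-line SQL comment.
  return sqlText.replace(
    HNSW_INDEX_STATEMENT,
    (_statement: string, indexName: string) =>
      `-- skipped ${indexName}: ${dims} dims exceeds the HNSW cap of ${HNSW_DIM_CAP}`,
  );
}
```

With dims = 2560, a schema containing idx_chunks_embedding would come back with that statement commented out while ordinary indexes such as idx_chunks_page stay intact.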

Hypothesis (for the maintainer to confirm / falsify)

Looking at src/core/postgres-engine.ts on master:

let dims = 1536;
let model = 'text-embedding-3-large';
try {
  const gw = await import('./ai/gateway.ts');
  dims = gw.getEmbeddingDimensions();
  model = gw.getEmbeddingModel().split(':').slice(1).join(':') || model;
} catch { /* gateway not yet configured — use defaults */ }

const sqlText = getPostgresSchema(dims, model);
// ...
await this.applyForwardReferenceBootstrap(conn);
await conn.unsafe(sqlText);

Two candidate roots:

(a) gw.getEmbeddingDimensions() returns the default 1536 instead of reading embedding_dimensions: 2560 from ~/.gbrain/config.json. The gateway likely sources from a separate config (env var? a different JSON file? a DB-resident setting?) that wasn't synced when the user switched embedders via direct SQL ALTER COLUMN.

(b) applyForwardReferenceBootstrap (which runs before conn.unsafe(sqlText)) emits an HNSW CREATE INDEX somewhere that the policy doesn't touch. A source-text scan of master doesn't show one in the bootstrap method, but because bootstrap runs first, an error thrown there would surface before sqlText ever reaches the connection. That ordering is consistent with bootstrap being the culprit rather than the policy-applied SCHEMA_SQL, so it can't be ruled out without a trace.
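
If hypothesis (a) holds, one possible shape for the fix is to let the on-disk config override whatever the gateway reports. The sketch below is an assumption about how that could look, not code from gbrain: resolveEmbeddingSettings and EmbeddingSettings are hypothetical names, and the caller is assumed to read ~/.gbrain/config.json itself and pass the text in (or null when the file is missing). The provider-prefix stripping mirrors the snippet above.

```typescript
interface EmbeddingSettings {
  dims: number;
  model: string;
}

// Hypothetical helper: prefer values from config.json, fall back to the
// gateway's values for anything missing or malformed.
function resolveEmbeddingSettings(
  configJson: string | null,
  gateway: EmbeddingSettings,
): EmbeddingSettings {
  if (configJson === null) return gateway; // no config file on disk

  let cfg: Record<string, unknown>;
  try {
    const parsed: unknown = JSON.parse(configJson);
    if (typeof parsed !== "object" || parsed === null) return gateway;
    cfg = parsed as Record<string, unknown>;
  } catch {
    return gateway; // unparseable config: keep the gateway values
  }

  const rawDims = cfg["embedding_dimensions"];
  const rawModel = cfg["embedding_model"];

  // Only accept a positive integer dimension count.
  const dims =
    typeof rawDims === "number" && Number.isInteger(rawDims) && rawDims > 0
      ? rawDims
      : gateway.dims;

  // "zeroentropyai:zembed-1" -> "zembed-1"; a value with no provider prefix
  // falls back to the gateway model, as in the original snippet.
  const model =
    typeof rawModel === "string"
      ? rawModel.split(":").slice(1).join(":") || gateway.model
      : gateway.model;

  return { dims, model };
}
```

Keeping the function free of file I/O makes it easy to unit-test against the exact config shown in the repro.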

The misleading All migrations up to date. line from the postinstall path is a separate UX bug worth fixing too — the apply-migrations --yes --non-interactive shell command appears to swallow the bootstrap failure silently, but gbrain init --migrate-only reproduces it deterministically.
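
The reporting problem can be illustrated with a small sketch: the success line should be reachable only when bootstrap did not fail. BootstrapOutcome, CliResult, and reportMigrationStatus are hypothetical names, and the way the real apply-migrations command collects its result is an assumption here.

```typescript
interface BootstrapOutcome {
  // Error captured from the schema bootstrap stage, or null if it succeeded.
  error: Error | null;
}

interface CliResult {
  message: string;
  exitCode: number;
}

// Hypothetical reporter: never print the success line over a failure.
function reportMigrationStatus(outcome: BootstrapOutcome): CliResult {
  if (outcome.error !== null) {
    return {
      message: `Schema bootstrap failed: ${outcome.error.message}`,
      exitCode: 1,
    };
  }
  return { message: "All migrations up to date.", exitCode: 0 };
}
```

A caller would print result.message and set process.exitCode = result.exitCode, so the postinstall step fails visibly instead of reporting success.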

Suggested fixes

  1. Trace and patch the gap. If hypothesis (a) is correct, make the gateway read embedding_dimensions from ~/.gbrain/config.json as the authoritative source on the init/migrate path (or add an explicit gbrain config set embedding_dimensions setter that updates both gateway state and the on-disk JSON). If (b), apply the policy inside applyForwardReferenceBootstrap too.

  2. Defensive: make HNSW emission dimension-aware. Even with (1) fixed, future changes to the bootstrap could easily reintroduce this class of bug. Add a guard: any time SCHEMA_SQL is about to emit CREATE INDEX ... USING hnsw on a vector(N) column, introspect pg_attribute.atttypmod first and skip when the declared dimension exceeds 2000. For pgvector the typmod is the dimension itself (there is no +4 header offset as with varchar), so the check is atttypmod > 2000. That makes the bootstrap dim-aware at the SQL layer instead of relying on the gateway value being threaded correctly all the way down.

  3. Postinstall reporting. Make gbrain apply-migrations --yes --non-interactive exit non-zero when bootstrap fails, instead of printing All migrations up to date. over a hidden failure.
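
The guard in fix (2) could be sketched roughly as follows. This assumes pgvector stores the declared dimension directly in atttypmod (with -1 meaning no dimension was declared); if your pgvector version reports something different, adjust the comparison. QueryFn is a hypothetical injected query function, not a gbrain or driver API, and canBuildHnsw and hnswAllowed are illustrative names.

```typescript
// Looks up the typmod of one column; pg_attribute and regclass are standard
// Postgres catalog features.
const VECTOR_TYPMOD_SQL = `
  SELECT a.atttypmod AS dims
  FROM pg_attribute a
  WHERE a.attrelid = $1::regclass
    AND a.attname = $2
    AND NOT a.attisdropped
`;

// Hypothetical signature for whatever query helper the engine exposes.
type QueryFn = (
  sql: string,
  params: string[],
) => Promise<Array<{ dims: number }>>;

// Pure check: HNSW needs a declared dimension no larger than 2000.
function hnswAllowed(atttypmod: number): boolean {
  return atttypmod > 0 && atttypmod <= 2000;
}

async function canBuildHnsw(
  query: QueryFn,
  table: string,
  column: string,
): Promise<boolean> {
  const [row] = await query(VECTOR_TYPMOD_SQL, [table, column]);
  if (!row) return false; // column not found: do not emit the index
  return hnswAllowed(row.dims);
}
```

The bootstrap would call canBuildHnsw(query, "content_chunks", "embedding") before emitting the index and skip the statement when it resolves to false, independent of what the gateway reports.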

I'm happy to draft a PR with fix (2) + a regression test that:

  1. Creates a fresh test brain with embedding vector(2560),
  2. Drops idx_chunks_embedding,
  3. Runs initSchema(),
  4. Asserts no HNSW is created and runMigrations() reaches the latest version.

Why this matters now

zembed-1 (2560-dim) and zembed-large (3072-dim) from ZeroEntropy beat OpenAI's text-embedding-3-large on long-document retrieval benchmarks; Voyage voyage-3-large supports output sizes up to 2048-dim. The era where "1536 is the default and everything is HNSW-friendly" is ending. The #640 fix was the right shape but didn't cover the upgrade path: any user who switched embedders before v0.35.2 (when #640 landed) and now tries to upgrade past v0.35.1.1 will hit this.

Workaround for affected users

Stay on v0.35.1.1 (DB schema v66 — latest reachable without tripping bootstrap):

cd ~/gbrain && git reset --hard f004a274 && bun install && bun link

v67 (typed-claim columns on facts) is the only post-66 migration today, and it's purely additive, so rollback loses no data. But all future migrations are blocked on this getting fixed.

Environment

  • gbrain master HEAD = 1dadd9ed (v0.35.7.0)
  • bun 1.3.13, macOS Darwin 25.3.0 (Apple Silicon)
  • Postgres: Supabase Session pooler (aws-1-ap-northeast-1.pooler.supabase.com:5432), pgvector 0.7+
  • GBRAIN_DISABLE_DIRECT_POOL=1 (IPv6 workaround; immaterial to this bug)
  • Install method: git clone + bun link
  • Brain stats: 250 pages / 4887 chunks, all embedded at 2560-dim
