Stateless host deployments fall through to ZeroEntropy default after v0.36.2.0, breaking writes against existing vector(1536) brains #1196

@xaviroblessarries

Bug

When gbrain serve --http runs on a stateless container (Railway, Fly, Docker, etc.) against a brain whose content_chunks.embedding column is vector(1536) from a pre-v0.36.2.0 OpenAI install, it silently falls back to DEFAULT_EMBEDDING_MODEL = 'zeroentropyai:zembed-1' (1280d) after the upgrade. Every put_page then fails with:

{"error":"internal_error","message":"expected 1536 dimensions, not 1280"}

Reads continue to work because searchVector doesn't embed at query time when the cache is warm, so the issue surfaces only when writes from signal-detector, ingestion skills, and MCP put_page callers start failing silently.

Root cause

src/core/config.ts:236-238:

"fields (embedding_model, etc.) keep their file/env-only loading because they..."

embedding_model and embedding_dimensions are NOT read from the DB config table — only from ~/.gbrain/config.json (file plane) or GBRAIN_EMBEDDING_MODEL / GBRAIN_EMBEDDING_DIMENSIONS env vars.

In a Railway / Fly / Docker deployment:

  • ~/.gbrain/config.json doesn't exist (stateless container, no persistent home dir).
  • No env var is set by default.
  • configureGateway resolves both fields to undefined and applies DEFAULT_EMBEDDING_MODEL (which v0.36.2.0 flipped to zeroentropyai:zembed-1 / 1280d).
  • The vector emitted at write time is 1280d; the column is vector(1536); pgvector rejects the insert.
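
The fall-through above can be illustrated with a minimal sketch. This is not gbrain's actual code: resolveEmbedding, EmbeddingSettings, and the env-over-file ordering are assumptions made for illustration. Only the env var names, the default model, and the dimensions come from this issue.

```typescript
// Hypothetical sketch of the resolution order described above.
// Names and the env-over-file ordering are illustrative, not gbrain internals.
interface EmbeddingSettings {
  model: string;
  dimensions: number;
}

// Built-in default after v0.36.2.0, per this issue.
const DEFAULT_EMBEDDING: EmbeddingSettings = {
  model: "zeroentropyai:zembed-1",
  dimensions: 1280,
};

type Env = Record<string, string | undefined>;

function resolveEmbedding(
  env: Env,
  fileConfig: Partial<EmbeddingSettings> | null, // null: no ~/.gbrain/config.json
): EmbeddingSettings {
  const rawDims = env["GBRAIN_EMBEDDING_DIMENSIONS"];
  const envDims = rawDims ? Number.parseInt(rawDims, 10) : undefined;
  return {
    // The DB config table is never consulted here, which is the gap reported.
    model: env["GBRAIN_EMBEDDING_MODEL"] ?? fileConfig?.model ?? DEFAULT_EMBEDDING.model,
    dimensions: envDims ?? fileConfig?.dimensions ?? DEFAULT_EMBEDDING.dimensions,
  };
}

// Stateless container: no env vars, no config file.
const stateless = resolveEmbedding({}, null);

// Same container with both env vars pinned to the existing model.
const pinned = resolveEmbedding(
  {
    GBRAIN_EMBEDDING_MODEL: "openai:text-embedding-3-large",
    GBRAIN_EMBEDDING_DIMENSIONS: "1536",
  },
  null,
);

const columnWidth = 1536; // width of the existing vector(1536) column
console.log(stateless.dimensions === columnWidth); // false
console.log(pinned.dimensions === columnWidth); // true
```

With nothing set, both fields resolve to the new default, so the emitted vector no longer fits the existing column.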

The v0.36.2.0 release notes describe the TTY-only ze-switch prompt and gbrain ze-switch --resume for recovery, but neither they nor the migration skill (skills/migrations/v0.36.2.0.md) mention the failure mode for stateless server deployments. Skipping the prompt in non-TTY environments is correct by design, but there is no fallback path that pins the existing model when the prompt is skipped.

Reproducer

  1. Provision a Supabase brain on v0.35 with default OpenAI text-embedding-3-large (1536d).
  2. Deploy gbrain serve --http on Railway / Fly / Docker without ~/.gbrain/config.json and without GBRAIN_EMBEDDING_MODEL.
  3. Upgrade the deployed binary to v0.36.2.0+.
  4. Call put_page via the MCP. Observe "expected 1536 dimensions, not 1280" on every write.

Suggested fixes

Three options, not mutually exclusive:

  1. In loadConfigWithEngine(), also read embedding_model and embedding_dimensions from the DB config plane (with file/env still winning by precedence). The existing comment justifying the file/env-only path for these keys is from a different era and isn't obviously load-bearing now. Stateless hosts can then set the values once via gbrain config set against the remote DB.

  2. In gbrain serve --http startup, refuse to start when the brain's content_chunks.embedding column width doesn't match the resolved embedding_dimensions (the existing embedding_width_consistency doctor check has the logic; run it at startup instead of waiting for gbrain doctor to be invoked). Fail loudly with a paste-ready fix hint.

  3. In the v0.36.2.0 migration skill, add a section for stateless host deployments explaining that they need to set GBRAIN_EMBEDDING_MODEL + GBRAIN_EMBEDDING_DIMENSIONS env vars OR run gbrain ze-switch against the host before the upgrade, or writes will silently break.

The first option is the structural fix; the second is defense in depth; the third is the documentation patch.
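
A rough sketch of how options 1 and 2 could fit together follows. It is illustrative only: mergePlanes, parseVectorWidth, and assertWidthMatches are hypothetical names and not part of gbrain, and the sketch assumes the column type string (e.g. "vector(1536)") has already been read from the database.

```typescript
// Hypothetical sketch of options 1 and 2; none of these names are gbrain's real API.
interface PartialEmbedding {
  model?: string;
  dimensions?: number;
}
type ResolvedEmbedding = Required<PartialEmbedding>;

// Option 1: env and file still win, but the DB config plane fills the gaps
// before any built-in default applies.
function mergePlanes(
  env: PartialEmbedding,
  file: PartialEmbedding,
  db: PartialEmbedding,
  fallback: ResolvedEmbedding,
): ResolvedEmbedding {
  return {
    model: env.model ?? file.model ?? db.model ?? fallback.model,
    dimensions: env.dimensions ?? file.dimensions ?? db.dimensions ?? fallback.dimensions,
  };
}

// Parse the width out of a pgvector column type string such as "vector(1536)".
function parseVectorWidth(columnType: string): number | null {
  const digits = /^vector\((\d+)\)$/.exec(columnType.trim())?.[1];
  return digits === undefined ? null : Number.parseInt(digits, 10);
}

// Option 2: fail at startup instead of on the first write.
function assertWidthMatches(columnType: string, resolved: ResolvedEmbedding): void {
  const width = parseVectorWidth(columnType);
  if (width !== null && width !== resolved.dimensions) {
    throw new Error(
      `content_chunks.embedding is ${columnType} but ${resolved.model} emits ` +
        `${resolved.dimensions}d vectors; set GBRAIN_EMBEDDING_MODEL and ` +
        `GBRAIN_EMBEDDING_DIMENSIONS to match the existing column`,
    );
  }
}

const fallback = { model: "zeroentropyai:zembed-1", dimensions: 1280 };

// Values stored once in the DB plane are picked up by a stateless host.
const fromDb = mergePlanes(
  {},
  {},
  { model: "openai:text-embedding-3-large", dimensions: 1536 },
  fallback,
);
assertWidthMatches("vector(1536)", fromDb); // does not throw

// Nothing configured anywhere: the guard stops startup with a fix hint.
let startupError = "";
try {
  assertWidthMatches("vector(1536)", mergePlanes({}, {}, {}, fallback));
} catch (e) {
  startupError = (e as Error).message;
}
console.log(startupError);
```

Keeping the DB plane last in precedence means existing file or env setups behave as before, while the startup guard catches any host where none of the three planes pins the model.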

Workaround (for anyone hitting this now)

Set both env vars on the host service and redeploy:

# Railway example
railway variables --set GBRAIN_EMBEDDING_MODEL=openai:text-embedding-3-large \
                   --set GBRAIN_EMBEDDING_DIMENSIONS=1536 \
                   --service gbrain-http
# Railway auto-redeploys when env vars change

After redeploy, put_page works again. No re-embed, no data loss, no schema change.

Environment

  • gbrain v0.36.3.0 (also reproducible on v0.36.2.0)
  • Topology 2 (cross-machine thin-client + remote gbrain serve --http)
  • Host: Railway with Supabase backend (pre-v0.36.2.0 brain, vector(1536) column)
  • Client: macOS thin-client, no local engine

Happy to test a candidate fix or open a PR for any of the three suggestions if useful.
