feat: pluggable embedding adapter with dynamic schema dimensions #49

Closed
irresi wants to merge 19 commits into garrytan:master from irresi:feat/pluggable-embedding

irresi (Contributor) commented Apr 11, 2026

Summary

Pluggable embedding system with OpenAI, Gemini, and Voyage providers — switchable via env var or gbrain config set.

What changed

  • Pluggable providers: EmbeddingProvider interface with OpenAI, Gemini, and Voyage implementations. Default: OpenAI text-embedding-3-large. Voyage uses voyage-4-large (MoE, RTEB #1).
  • BaseProvider abstract class: shared retry/batch/truncate logic extracted via the Template Method pattern. The 3 providers went from ~230 lines to ~170 lines total.
  • Config pattern compliance: embedding settings follow env var → DB config table → defaults, matching the project's existing config architecture. gbrain config set embedding_provider gemini works.
  • Dynamic schema dimensions: vector(N) column auto-sized to the provider config at init, in both the PGLite and Postgres engines.
  • Migration v5: detects provider/model/dimension changes. Runs ALTER COLUMN with the correct ordering (DROP INDEX → NULL data → ALTER → CREATE INDEX). Normalizes the legacy text-embedding-3-large value to the openai:text-embedding-3-large format.
  • Post-migration dimension sync: reads the actual DB column dimension via pg_attribute.atttypmod (not the config table) to catch manual config edits or drift.
  • Doctor check: new embedding_config check detects provider/model/dimension mismatch between runtime config and DB state.
  • Shared helpers: escapeSql(), qualifiedModel(), embeddingAlterSQL() in utils.ts eliminate duplication across engines and migration.
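
The Template Method split described in the list above could look roughly like the sketch below. The property names (name, model, dimensions, embed, embedBatch) come from the PR text; the callApi hook, the tunable fields, and their default values are assumptions made for illustration and are not taken from the actual base.ts.

```typescript
// Hypothetical shape: member names mirror the PR description,
// signatures and defaults are assumptions.
interface EmbeddingProvider {
  readonly name: string;
  readonly model: string;
  readonly dimensions: number;
  embed(text: string): Promise<number[]>;
  embedBatch(texts: string[]): Promise<number[][]>;
}

abstract class BaseProvider implements EmbeddingProvider {
  abstract readonly name: string;
  abstract readonly model: string;
  abstract readonly dimensions: number;

  // Tunables (assumed values, not taken from the PR).
  protected maxBatchSize = 100;
  protected maxChars = 8000;
  protected maxRetries = 3; // retries after the first attempt
  protected baseDelayMs = 500;
  protected maxDelayMs = 8000;

  // The only provider-specific step: one API call for one batch.
  protected abstract callApi(batch: string[]): Promise<number[][]>;

  async embed(text: string): Promise<number[]> {
    const [vector] = await this.embedBatch([text]);
    return vector;
  }

  // Template method: truncate, split into batches, call with retry.
  async embedBatch(texts: string[]): Promise<number[][]> {
    const truncated = texts.map((t) => t.slice(0, this.maxChars));
    const out: number[][] = [];
    for (let i = 0; i < truncated.length; i += this.maxBatchSize) {
      const batch = truncated.slice(i, i + this.maxBatchSize);
      out.push(...(await this.withRetry(() => this.callApi(batch))));
    }
    return out;
  }

  // Exponential backoff capped at maxDelayMs; rethrows once retries run out.
  protected async withRetry<T>(fn: () => Promise<T>): Promise<T> {
    for (let attempt = 0; ; attempt++) {
      try {
        return await fn();
      } catch (err) {
        if (attempt >= this.maxRetries) throw err;
        const delay = Math.min(this.baseDelayMs * 2 ** attempt, this.maxDelayMs);
        await new Promise<void>((resolve) => setTimeout(resolve, delay));
      }
    }
  }
}
```

Under this layout a concrete provider would only supply name, model, dimensions, and callApi, which is one plausible way the three providers could shrink as the summary reports.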

Configuration

# Environment variables (highest priority)
export GBRAIN_EMBEDDING_PROVIDER=gemini   # openai | gemini | voyage
export GBRAIN_EMBEDDING_MODEL=gemini-embedding-2-preview
export GBRAIN_EMBEDDING_DIMENSIONS=1536

# Or persistent DB config
gbrain config set embedding_provider gemini

Files

| Area | Files |
| --- | --- |
| Provider implementations | src/core/embedding/{base,openai,gemini,voyage,types,registry,index}.ts |
| Schema parameterization | src/core/pglite-schema.ts, src/schema.sql, scripts/build-schema.sh |
| Engine initSchema | src/core/pglite-engine.ts, src/core/postgres-engine.ts |
| Migration | src/core/migrate.ts (v5) |
| Doctor | src/commands/doctor.ts |
| Shared helpers | src/core/utils.ts |

Test plan

  • 393 unit tests pass, 0 fail
  • BaseProvider: retry, batch splitting, backoff cap, retry exhaustion
  • Registry: env var override, DB config fallback, provider change invalidation, legacy upgrade path, validation (unknown provider, bad dims, empty model)
  • Integration (PGLite, 15 scenarios): fresh init (3 providers), restart persistence, env var override, dimension change (forward + reverse + stale marking + empty brain), model change (same dims), legacy normalization, missing keys, sequential provider changes
  • Migration v5: dimension mismatch detection, model change detection, legacy format normalization
  • E2E Tier 1: 74 pass (mechanical, sync, schema idempotency)
  • E2E Tier 2: 3 pass (LLM skill tests via OpenClaw — ingest, query, health)
  • Smoke test: gbrain init + config get/set + doctor on PGLite

Provider benchmark comparison

| | OpenAI text-embedding-3-large | Gemini gemini-embedding-2-preview | Voyage voyage-4-large |
| --- | --- | --- | --- |
| MTEB English | 64.6 | 68.3 (#1 overall) | 67.2 |
| Context window | 8,192 tokens | 8,192 tokens | 32,000 tokens |
| Max dimensions | 3,072 | 3,072 | 2,048 |
| GBrain default | 1,536 | 1,536 | 1,024 |
| Price / 1M tokens | $0.13 | $0.20 | $0.12 |
| Batch discount | 50% ($0.065) | 50% ($0.10) | 33% ($0.08) |
| Architecture | Dense | Dense | MoE (40% lower cost) |
| Best for | Ecosystem stability | Overall quality (MTEB #1), multimodal | Retrieval/RAG, long context |

Retrieval 68.3 RTEB #1

Sources: OpenAI Models · OpenAI Embedding Announcement · Gemini Embedding 2 Blog · Gemini API Pricing · Voyage 4-large MoE · Voyage Pricing

I kept OpenAI as the default model, in keeping with the maintainer's initial decision.

irresi and others added 19 commits April 11, 2026 21:28
Replace hardcoded OpenAI embedding with a strategy pattern supporting
multiple providers. Users can switch embedding models via env vars:

  GBRAIN_EMBEDDING_PROVIDER=gemini|openai|voyage
  GBRAIN_EMBEDDING_MODEL=gemini-embedding-2-preview
  GBRAIN_EMBEDDING_DIMENSIONS=3072

Architecture:
- EmbeddingProvider interface (embed, embedBatch, name, model, dimensions)
- OpenAIProvider (text-embedding-3-large, existing logic preserved)
- GeminiProvider (@google/genai SDK, gemini-embedding-2-preview)
- VoyageProvider (Voyage AI REST API, voyage-3)
- Registry with config resolution: env vars > fallback (openai)
- Backward compatible embed()/embedBatch() API preserved
- chunks.model field records provider:model for future per-column support

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Existing brains store 'text-embedding-3-large' without provider prefix.
New format is 'openai:text-embedding-3-large'. Without normalization,
migration would incorrectly mark all embeddings stale on upgrade.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
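
The normalization this commit describes can be illustrated with a small sketch. The function names below are hypothetical, and the rule "no provider prefix means a legacy OpenAI value" is an assumption based on OpenAI having been the only provider before this PR.

```typescript
// Assumption: a stored model name without a "provider:" prefix is a
// legacy OpenAI value. Function names here are illustrative only.
function normalizeLegacyModel(stored: string): string {
  return stored.indexOf(":") === -1 ? `openai:${stored}` : stored;
}

// Embeddings count as stale only when the normalized stored name
// differs from the currently configured provider:model pair.
function isModelChanged(stored: string, provider: string, model: string): boolean {
  return normalizeLegacyModel(stored) !== `${provider}:${model}`;
}
```

With this comparison, an existing brain that stores text-embedding-3-large and keeps the OpenAI default would not be flagged as changed on upgrade.
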
…ndling

- Escape single quotes in model name before SQL interpolation to prevent
  SQL syntax errors from malformed env vars
- Normalize legacy embedding_model format in doctor check (same logic
  as migration v5) to avoid false mismatch warnings

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Registry now follows the project's existing config pattern:
  env var → DB config table → defaults

- loadEmbeddingConfig() reads embedding_provider/model/dimensions from
  DB during initSchema, cached for process lifetime
- gbrain config set embedding_provider gemini now works
- embedding_provider added to DB config table defaults
- Both PGLite and Postgres engines load DB config after schema creation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… key

- Registry: when env var overrides provider, ignore DB model/dimensions
  (they belong to the previous provider)
- Add 5 tests for DB config fallback, env var override, reset behavior
- Add embedding_provider to migrate-engine config key list

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
voyage-4-large is the current SOTA (MoE architecture, RTEB #1,
40% lower serving cost). Same default dimensions (1024).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ange

Bugs found by new integration tests:
1. resetProvider() was called after loadEmbeddingConfig(), wiping the
   DB config cache immediately. Fixed: reset before load.
2. Migration v5 runs once (v4→v5). Later provider changes with different
   dimensions were never ALTER'd. Added post-migration dimension check.

Also adds test coverage for:
- embed()/embedBatch() empty string handling (index.ts wrapper logic)
- initSchema 2-pass DB config persistence (restart simulation)
- Migration v5 actual ALTER TABLE on PGLite (1536→1024)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Integration tests now cover 14 scenarios across 5 categories:
A. Fresh init (defaults, voyage, gemini)
B. Restart persistence (config set, model change, env var override)
C. Dimension change (forward, reverse, stale marking, empty brain)
D. Model change same dims (provider swap, model swap)
E. Legacy/edge (format normalization, missing key, sequential changes)

Bug fix: initSchema now reads actual DB column dimension via pg_attribute
instead of trusting config table values, which can be stale after manual
config set. Also adds model-change stale marking for same-dimension
provider switches (e.g., openai → gemini both at 1536).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
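
A minimal sketch of reading the live column dimension from pg_attribute, as this commit describes. It assumes pgvector stores the declared dimension directly in atttypmod (with a non-positive value when no dimension is declared). The QueryFn type, the function names, and the table and column names passed in are hypothetical stand-ins for whatever the engines actually expose.

```typescript
// Hypothetical query-runner shape standing in for either engine.
type QueryFn = (sql: string, params: unknown[]) => Promise<Array<{ atttypmod: number }>>;

const COLUMN_DIM_SQL = `
  SELECT a.atttypmod
  FROM pg_attribute a
  WHERE a.attrelid = $1::regclass
    AND a.attname = $2
    AND NOT a.attisdropped`;

// Returns the declared vector(N) dimension, or null when the column is
// missing or has no declared dimension.
async function getColumnDimension(
  query: QueryFn,
  table: string,
  column: string,
): Promise<number | null> {
  const rows = await query(COLUMN_DIM_SQL, [table, column]);
  if (rows.length === 0) return null;
  const typmod = Number(rows[0].atttypmod);
  return typmod > 0 ? typmod : null;
}

// A resize is needed only when a declared dimension exists and differs
// from what the active provider produces.
function needsResize(actual: number | null, wanted: number): boolean {
  return actual !== null && actual !== wanted;
}
```

Comparing against the catalog rather than the config table means a manual config edit alone cannot hide a mismatch, which is the drift case the commit message calls out.
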
…er key

When upgrading from a pre-v5 brain with no embedding_provider in config,
setting GBRAIN_EMBEDDING_PROVIDER env var would incorrectly use the old
provider's model/dimensions. Now treats missing dbConfig.provider as
"provider changed" so defaults are used instead of stale DB values.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
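
The resolution rules spread across these registry commits (env var → DB config table → defaults, with DB model/dimensions ignored when the env var switches provider or the DB has no provider key) can be sketched as one pure function. The function and type names are hypothetical; the default models and dimensions are the ones listed in the PR description.

```typescript
interface EmbeddingConfig {
  provider: string;
  model: string;
  dimensions: number;
}

// Defaults as listed in the PR description.
const DEFAULTS: Record<string, { model: string; dimensions: number }> = {
  openai: { model: "text-embedding-3-large", dimensions: 1536 },
  gemini: { model: "gemini-embedding-2-preview", dimensions: 1536 },
  voyage: { model: "voyage-4-large", dimensions: 1024 },
};

// Hypothetical resolver: env vars win, then DB config, then defaults.
function resolveEmbeddingConfig(
  env: Record<string, string | undefined>,
  db: Partial<EmbeddingConfig>,
): EmbeddingConfig {
  const envProvider = env.GBRAIN_EMBEDDING_PROVIDER;
  const provider = envProvider ?? db.provider ?? "openai";
  const defaults = DEFAULTS[provider];
  if (!defaults) throw new Error(`Unknown embedding provider: ${provider}`);

  // When the env var picks a provider that differs from the DB's (or the
  // DB has none, as in a pre-v5 brain), DB model/dimensions are stale.
  const providerChanged = envProvider !== undefined && envProvider !== db.provider;
  const dbModel = providerChanged ? undefined : db.model;
  const dbDims = providerChanged ? undefined : db.dimensions;

  const model = env.GBRAIN_EMBEDDING_MODEL ?? dbModel ?? defaults.model;
  const envDims = env.GBRAIN_EMBEDDING_DIMENSIONS;
  const dimensions = envDims !== undefined ? Number(envDims) : dbDims ?? defaults.dimensions;

  if (model.trim() === "") throw new Error("Embedding model must not be empty");
  if (!Number.isInteger(dimensions) || dimensions <= 0) {
    throw new Error(`Invalid embedding dimensions: ${envDims ?? dimensions}`);
  }
  return { provider, model, dimensions };
}
```

For example, calling it with GBRAIN_EMBEDDING_PROVIDER=voyage against a DB that still holds the OpenAI model and 1536 dimensions would yield voyage-4-large at 1024 rather than mixing the two.
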
- escapeSql(), qualifiedModel(), embeddingAlterSQL() in utils.ts
- Replace 8 inline `${provider.name}:${provider.model}` with qualifiedModel()
- Replace 3 duplicate ALTER DDL blocks with embeddingAlterSQL()
- Replace 2 inline model.replace(/'/g, "''") with escapeSql()
- Add WHERE embedded_at IS NOT NULL to stale-marking UPDATEs
- Promise.all for 3 independent getConfig() calls in migration v5
- Add default: throw in registry switch (remove unsafe cached! assertion)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
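
The three helpers named in this commit might look something like the sketch below. escapeSql and qualifiedModel follow the inline expressions quoted in the commit message. The body of embeddingAlterSQL is a guess that only preserves the documented ordering (DROP INDEX → NULL data → ALTER → CREATE INDEX); the embedding column name, the index name, the index type, and the string-array return type are all assumptions.

```typescript
// Doubles single quotes so a value can sit inside a SQL string literal.
function escapeSql(value: string): string {
  return value.replace(/'/g, "''");
}

// Builds the provider:model identifier recorded alongside each chunk.
function qualifiedModel(provider: { name: string; model: string }): string {
  return `${provider.name}:${provider.model}`;
}

// Assumed names: table "chunks", column "embedding", index
// "idx_chunks_embedding", HNSW with cosine ops. Only the step order
// comes from the PR.
function embeddingAlterSQL(dimensions: number): string[] {
  if (!Number.isInteger(dimensions) || dimensions <= 0) {
    throw new Error(`Invalid dimensions: ${dimensions}`);
  }
  return [
    // 1. The index is tied to the old column type, so drop it first.
    "DROP INDEX IF EXISTS idx_chunks_embedding",
    // 2. Old vectors cannot be cast to a new dimension; clear them.
    "UPDATE chunks SET embedding = NULL WHERE embedding IS NOT NULL",
    // 3. Resize the column now that it holds no data.
    `ALTER TABLE chunks ALTER COLUMN embedding TYPE vector(${dimensions})`,
    // 4. Recreate the index for the new type.
    "CREATE INDEX idx_chunks_embedding ON chunks USING hnsw (embedding vector_cosine_ops)",
  ];
}
```

Validating dimensions as a positive integer before interpolating it keeps the one dynamic value in the DDL from carrying arbitrary text.
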
garrytan (Owner) commented Jun 8, 2026

Thanks for this contribution — and apologies for the slow triage. We did a full pass over the entire PR backlog. gbrain has moved fast, and the maintainer's larger "cathedral" rewrites have superseded a big share of community PRs: the AI gateway + recipes + user_provided_models system replaced almost all individual provider PRs; #1805 fixed the whole Postgres module-singleton class; #1542 unified the type taxonomy; #1657 the retrieval path; #1802 the doctor; and so on.

We're closing this one in that cleanup — either the fix already landed on master, it duplicates another PR or merged change, or it's outside the current merge bar. Where a closed PR carried a genuinely valuable idea, we've recorded it in docs/designs/COMMUNITY_IDEAS.md so nothing good is lost (a few may graduate into TODOs).

Please don't read the close as a judgment of the work — thank you for contributing. If you believe the underlying issue is still live on the latest master, reopen with a quick note and we'll take another look. 🙏

garrytan closed this Jun 8, 2026