feat: pluggable embedding adapter with dynamic schema dimensions #49
irresi wants to merge 19 commits into
Replace hardcoded OpenAI embedding with a strategy pattern supporting multiple providers. Users can switch embedding models via env vars:

GBRAIN_EMBEDDING_PROVIDER=gemini|openai|voyage
GBRAIN_EMBEDDING_MODEL=gemini-embedding-2-preview
GBRAIN_EMBEDDING_DIMENSIONS=3072

Architecture:
- EmbeddingProvider interface (embed, embedBatch, name, model, dimensions)
- OpenAIProvider (text-embedding-3-large, existing logic preserved)
- GeminiProvider (@google/genai SDK, gemini-embedding-2-preview)
- VoyageProvider (Voyage AI REST API, voyage-3)
- Registry with config resolution: env vars > fallback (openai)
- Backward compatible embed()/embedBatch() API preserved
- chunks.model field records provider:model for future per-column support

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
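The strategy pattern described in this commit could look roughly like the sketch below. Only the interface member names, the env var names, and the model names come from the commit message. The `stubProvider` factory, the `createProvider` function, and the default dimensions (1536 for OpenAI, 1024 for Voyage, inferred from later commits) are assumptions for illustration, and the stub returns zero vectors instead of calling any real API.

```typescript
// Member names follow the commit message; everything else is illustrative.
interface EmbeddingProvider {
  readonly name: string;
  readonly model: string;
  readonly dimensions: number;
  embed(text: string): Promise<number[]>;
  embedBatch(texts: string[]): Promise<number[][]>;
}

// Hypothetical stand-in: returns zero vectors so the sketch runs without API keys.
function stubProvider(name: string, model: string, dimensions: number): EmbeddingProvider {
  const embed = async (_text: string): Promise<number[]> =>
    new Array<number>(dimensions).fill(0);
  return {
    name,
    model,
    dimensions,
    embed,
    embedBatch: (texts: string[]) => Promise.all(texts.map((t) => embed(t))),
  };
}

type Env = Record<string, string | undefined>;

// Assumed per-provider defaults; the dimensions are inferred, not confirmed.
const PROVIDER_DEFAULTS: Record<string, { model: string; dimensions: number }> = {
  openai: { model: "text-embedding-3-large", dimensions: 1536 },
  gemini: { model: "gemini-embedding-2-preview", dimensions: 3072 },
  voyage: { model: "voyage-3", dimensions: 1024 },
};

// Registry: env vars win, otherwise fall back to the OpenAI defaults.
function createProvider(env: Env): EmbeddingProvider {
  const name = env["GBRAIN_EMBEDDING_PROVIDER"] ?? "openai";
  const defaults = PROVIDER_DEFAULTS[name];
  if (defaults === undefined) {
    throw new Error(`Unknown embedding provider: ${name}`);
  }
  const model = env["GBRAIN_EMBEDDING_MODEL"] ?? defaults.model;
  const rawDims = env["GBRAIN_EMBEDDING_DIMENSIONS"];
  const dimensions =
    rawDims === undefined ? defaults.dimensions : Number.parseInt(rawDims, 10);
  if (!Number.isInteger(dimensions) || dimensions <= 0) {
    throw new Error(`Invalid embedding dimensions: ${rawDims}`);
  }
  return stubProvider(name, model, dimensions);
}
```

Calling `createProvider(process.env)` at startup would pick the provider once; callers then depend only on the `EmbeddingProvider` interface, which is what lets the existing `embed()`/`embedBatch()` wrappers stay unchanged.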
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Existing brains store 'text-embedding-3-large' without a provider prefix. The new format is 'openai:text-embedding-3-large'. Without normalization, the migration would incorrectly mark all embeddings stale on upgrade.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
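A minimal sketch of the normalization this commit describes, assuming that a stored value without a `:` is a legacy OpenAI model name. The function names `normalizeLegacyModel` and `isEmbeddingStale` are hypothetical and are not taken from the PR.

```typescript
// Legacy rows store a bare model name; new rows store "provider:model".
// Assumption: any value without a ":" predates the prefix and came from OpenAI.
function normalizeLegacyModel(stored: string, legacyProvider: string = "openai"): string {
  return stored.includes(":") ? stored : `${legacyProvider}:${stored}`;
}

// Compare after normalizing, so an upgrade alone never marks embeddings stale.
function isEmbeddingStale(storedModel: string, currentQualifiedModel: string): boolean {
  return normalizeLegacyModel(storedModel) !== currentQualifiedModel;
}
```

With this comparison, a brain that stored `text-embedding-3-large` is treated as current when the runtime reports `openai:text-embedding-3-large`, and as stale only when the provider or model really differs.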
…ndling

- Escape single quotes in model name before SQL interpolation to prevent SQL syntax errors from malformed env vars
- Normalize legacy embedding_model format in doctor check (same logic as migration v5) to avoid false mismatch warnings

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Registry now follows the project's existing config pattern: env var → DB config table → defaults

- loadEmbeddingConfig() reads embedding_provider/model/dimensions from DB during initSchema, cached for process lifetime
- gbrain config set embedding_provider gemini now works
- embedding_provider added to DB config table defaults
- Both PGLite and Postgres engines load DB config after schema creation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… key

- Registry: when env var overrides provider, ignore DB model/dimensions (they belong to the previous provider)
- Add 5 tests for DB config fallback, env var override, reset behavior
- Add embedding_provider to migrate-engine config key list

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
voyage-4-large is the current SOTA (MoE architecture, RTEB #1, 40% lower serving cost). Same default dimensions (1024).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ange

Bugs found by new integration tests:
1. resetProvider() was called after loadEmbeddingConfig(), wiping the DB config cache immediately. Fixed: reset before load.
2. Migration v5 runs once (v4→v5). Later provider changes with different dimensions were never ALTER'd. Added post-migration dimension check.

Also adds test coverage for:
- embed()/embedBatch() empty string handling (index.ts wrapper logic)
- initSchema 2-pass DB config persistence (restart simulation)
- Migration v5 actual ALTER TABLE on PGLite (1536→1024)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Integration tests now cover 14 scenarios across 5 categories:
A. Fresh init (defaults, voyage, gemini)
B. Restart persistence (config set, model change, env var override)
C. Dimension change (forward, reverse, stale marking, empty brain)
D. Model change same dims (provider swap, model swap)
E. Legacy/edge (format normalization, missing key, sequential changes)

Bug fix: initSchema now reads the actual DB column dimension via pg_attribute instead of trusting config table values, which can be stale after a manual config set.

Also adds model-change stale marking for same-dimension provider switches (e.g., openai → gemini both at 1536).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
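The dimension check and stale-marking decision in this commit could be sketched as follows. The PR reads `pg_attribute.atttypmod` directly; this sketch swaps in PostgreSQL's `format_type()` and parses the rendered type string instead, which is a different technique chosen only to keep the example self-explanatory. The table and column names (`chunks.embedding`) and all function names are assumptions.

```typescript
// Assumed table/column names. format_type() renders a pgvector column as
// "vector(1536)", so the real dimension comes from the catalog, not the config table.
const COLUMN_TYPE_SQL = `
  SELECT format_type(a.atttypid, a.atttypmod) AS column_type
  FROM pg_attribute a
  WHERE a.attrelid = 'chunks'::regclass
    AND a.attname = 'embedding'`;

// Extract N from "vector(N)"; returns null for anything else.
function parseVectorDimension(columnType: string): number | null {
  const match = /^vector\((\d+)\)$/.exec(columnType.trim());
  return match ? Number(match[1]) : null;
}

interface EmbeddingState {
  dimensions: number;
  model: string; // qualified as "provider:model"
}

type SchemaAction = "none" | "mark_stale" | "alter_column";

// A dimension change needs an ALTER; a same-dimension model change only
// needs existing embeddings marked stale; otherwise nothing to do.
function planSchemaChange(stored: EmbeddingState, wanted: EmbeddingState): SchemaAction {
  if (stored.dimensions !== wanted.dimensions) return "alter_column";
  if (stored.model !== wanted.model) return "mark_stale";
  return "none";
}
```

For the openai → gemini example at 1536 dimensions, `planSchemaChange` returns `"mark_stale"`, since the column already has the right size but the vectors come from a different model.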
…er key

When upgrading from a pre-v5 brain with no embedding_provider in config, setting the GBRAIN_EMBEDDING_PROVIDER env var would incorrectly use the old provider's model/dimensions. Now treats a missing dbConfig.provider as "provider changed" so defaults are used instead of stale DB values.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
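The resolution order (env var → DB config table → defaults) and the "provider changed" rule from the last few commits can be sketched like this. The `resolveEmbeddingConfig` function, the `DbEmbeddingConfig` shape, and the default dimensions for OpenAI and Gemini are assumptions; only the env var names, model names, and the precedence rules come from the commit messages.

```typescript
type EnvVars = Record<string, string | undefined>;

interface ResolvedEmbeddingConfig {
  provider: string;
  model: string;
  dimensions: number;
}

// Values read from the DB config table; any key may be missing on old brains.
type DbEmbeddingConfig = Partial<ResolvedEmbeddingConfig>;

// Assumed defaults (dimensions inferred from the commit history, not confirmed).
const RESOLUTION_DEFAULTS: Record<string, { model: string; dimensions: number }> = {
  openai: { model: "text-embedding-3-large", dimensions: 1536 },
  gemini: { model: "gemini-embedding-2-preview", dimensions: 3072 },
  voyage: { model: "voyage-4-large", dimensions: 1024 },
};

function resolveEmbeddingConfig(env: EnvVars, db: DbEmbeddingConfig): ResolvedEmbeddingConfig {
  const envProvider = env["GBRAIN_EMBEDDING_PROVIDER"];
  const provider = envProvider ?? db.provider ?? "openai";
  const defaults = RESOLUTION_DEFAULTS[provider];
  if (defaults === undefined) {
    throw new Error(`Unknown embedding provider: ${provider}`);
  }

  // DB model/dimensions belong to the provider recorded in the DB. If the env
  // var names a different provider, or the DB has no provider key at all
  // (pre-v5 brain), treat the provider as changed and skip the DB values.
  const providerChanged = envProvider !== undefined && envProvider !== db.provider;
  const dbModel = providerChanged ? undefined : db.model;
  const dbDimensions = providerChanged ? undefined : db.dimensions;

  const rawDims = env["GBRAIN_EMBEDDING_DIMENSIONS"];
  const envDimensions = rawDims === undefined ? undefined : Number.parseInt(rawDims, 10);

  return {
    provider,
    model: env["GBRAIN_EMBEDDING_MODEL"] ?? dbModel ?? defaults.model,
    dimensions: envDimensions ?? dbDimensions ?? defaults.dimensions,
  };
}
```

For example, a DB that records `openai` at 1536 dimensions combined with `GBRAIN_EMBEDDING_PROVIDER=voyage` resolves to the Voyage defaults rather than a Voyage provider with OpenAI's model name.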
- escapeSql(), qualifiedModel(), embeddingAlterSQL() in utils.ts
- Replace 8 inline `${provider.name}:${provider.model}` with qualifiedModel()
- Replace 3 duplicate ALTER DDL blocks with embeddingAlterSQL()
- Replace 2 inline model.replace(/'/g, "''") with escapeSql()
- Add WHERE embedded_at IS NOT NULL to stale-marking UPDATEs
- Promise.all for 3 independent getConfig() calls in migration v5
- Add default: throw in registry switch (remove unsafe cached! assertion)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
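The three helpers named in this commit might look something like the sketch below. Their real signatures are not shown in the PR, so the parameter lists, the `chunks`/`embedding` names, the index name `idx_chunks_embedding`, the HNSW index method, and the choice to mark rows stale by clearing `embedded_at` are all assumptions. The statement order follows the PR description (DROP INDEX → NULL data → ALTER → CREATE INDEX).

```typescript
// Double single quotes so a value is safe inside a SQL string literal.
function escapeSql(value: string): string {
  return value.replace(/'/g, "''");
}

// "provider:model", the format recorded in chunks.model.
function qualifiedModel(provider: { name: string; model: string }): string {
  return `${provider.name}:${provider.model}`;
}

// Statements to resize the vector column, in the order the PR describes.
// Index name and method are hypothetical. Note that pgvector's HNSW and
// IVFFlat indexes on vector columns support at most 2,000 dimensions.
function embeddingAlterSQL(dimensions: number): string[] {
  if (!Number.isInteger(dimensions) || dimensions <= 0) {
    throw new Error(`Invalid dimensions: ${dimensions}`);
  }
  return [
    "DROP INDEX IF EXISTS idx_chunks_embedding",
    "UPDATE chunks SET embedding = NULL, embedded_at = NULL WHERE embedded_at IS NOT NULL",
    `ALTER TABLE chunks ALTER COLUMN embedding TYPE vector(${dimensions})`,
    "CREATE INDEX idx_chunks_embedding ON chunks USING hnsw (embedding vector_cosine_ops)",
  ];
}

// Same-dimension model change: only rows embedded by another model are touched.
function staleMarkSQL(currentModel: string): string {
  return (
    "UPDATE chunks SET embedded_at = NULL " +
    `WHERE embedded_at IS NOT NULL AND model <> '${escapeSql(currentModel)}'`
  );
}
```

Clearing the data before the `ALTER` matters because existing vectors of the old length cannot be cast to the new `vector(N)` type, and dropping the index first avoids rebuilding it twice. Parameterized queries would be preferable to string escaping wherever the driver allows them; the escaping helper is shown only because the PR names it.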
Thanks for this contribution, and apologies for the slow triage. We did a full pass over the entire PR backlog. gbrain has moved fast, and the maintainer's larger "cathedral" rewrites have superseded a big share of community PRs: the AI gateway + recipes + user_provided_models system replaced almost all individual provider PRs; #1805 fixed the whole Postgres module-singleton class; #1542 unified the type taxonomy; #1657 the retrieval path; #1802 the doctor; and so on. We're closing this one in that cleanup: either the fix already landed on master, it duplicates another PR or merged change, or it's outside the current merge bar.

Where a closed PR carried a genuinely valuable idea, we've recorded it in docs/designs/COMMUNITY_IDEAS.md so nothing good is lost (a few may graduate into TODOs). Please don't read the close as a judgment of the work. Thank you for contributing. If you believe the underlying issue is still live on the latest master, reopen with a quick note and we'll take another look. 🙏
Summary
Pluggable embedding system with OpenAI, Gemini, and Voyage providers, switchable via env var or gbrain config set.

What changed
- EmbeddingProvider interface with OpenAI, Gemini, Voyage implementations. Default: OpenAI text-embedding-3-large. Voyage uses voyage-4-large (MoE, RTEB #1).
- env var → DB config table → defaults, matching the project's existing config architecture. gbrain config set embedding_provider gemini works.
- vector(N) column auto-sized to provider config at init. Both PGLite and Postgres engines.
- ALTER COLUMN with correct ordering (DROP INDEX → NULL data → ALTER → CREATE INDEX). Normalizes legacy text-embedding-3-large → openai:text-embedding-3-large format.
- Dimension check reads pg_attribute.atttypmod (not config table) to catch manual config edits or drift.
- embedding_config check detects provider/model/dimension mismatch between runtime config and DB state.
- escapeSql(), qualifiedModel(), embeddingAlterSQL() in utils.ts eliminate duplication across engines and migration.

Configuration
Files
- src/core/embedding/{base,openai,gemini,voyage,types,registry,index}.ts
- src/core/pglite-schema.ts, src/schema.sql, scripts/build-schema.sh
- src/core/pglite-engine.ts, src/core/postgres-engine.ts
- src/core/migrate.ts (v5)
- src/commands/doctor.ts
- src/core/utils.ts

Test plan
- gbrain init + config get/set + doctor on PGLite

Provider benchmark comparison
- text-embedding-3-large
- gemini-embedding-2-preview
- voyage-4-large

Sources: OpenAI Models · OpenAI Embedding Announcement · Gemini Embedding 2 Blog · Gemini API Pricing · Voyage 4-large MoE · Voyage Pricing
I kept the OpenAI model as the default, in keeping with the maintainer's initial decision.