feat: pluggable embedding adapter with dynamic schema dimensions by irresi · Pull Request #49 · garrytan/gbrain

irresi · 2026-04-11T13:28:51Z

Summary

Pluggable embedding system with OpenAI, Gemini, and Voyage providers — switchable via env var or gbrain config set.

What changed

Pluggable providers — EmbeddingProvider interface with OpenAI, Gemini, Voyage implementations. Default: OpenAI text-embedding-3-large. Voyage uses voyage-4-large (MoE, RTEB #1).
BaseProvider abstract class — Shared retry/batch/truncate logic extracted via Template Method pattern. 3 providers went from ~230 lines to ~170 lines total.
Config pattern compliance — Embedding settings follow env var → DB config table → defaults, matching the project's existing config architecture. gbrain config set embedding_provider gemini works.
Dynamic schema dimensions — vector(N) column auto-sized to provider config at init. Both PGLite and Postgres engines.
Migration v5 — Detects provider/model/dimension changes. Runs ALTER COLUMN with correct ordering (DROP INDEX → NULL data → ALTER → CREATE INDEX). Normalizes legacy text-embedding-3-large → openai:text-embedding-3-large format.
Post-migration dimension sync — Reads actual DB column dimension via pg_attribute.atttypmod (not config table) to catch manual config edits or drift.
Doctor check — New embedding_config check detects provider/model/dimension mismatch between runtime config and DB state.
Shared helpers — escapeSql(), qualifiedModel(), embeddingAlterSQL() in utils.ts eliminate duplication across engines and migration.
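
The BaseProvider extraction described above could look roughly like the sketch below. The interface members (`embed`, `embedBatch`, `name`, `model`, `dimensions`) come from this PR's commit messages; everything else is an assumption made for illustration: the `callApi` hook, the `RetryBatchOptions` shape and its default values, and truncation by character count.

```typescript
// Sketch only. callApi, RetryBatchOptions and the default values are hypothetical.
interface EmbeddingProvider {
  readonly name: string;
  readonly model: string;
  readonly dimensions: number;
  embed(text: string): Promise<number[]>;
  embedBatch(texts: string[]): Promise<number[][]>;
}

interface RetryBatchOptions {
  maxBatchSize: number; // texts per API call
  maxRetries: number;   // retries after the first failed attempt
  baseDelayMs: number;  // first backoff delay
  maxDelayMs: number;   // backoff cap
  maxChars: number;     // crude truncation limit (assumed: characters, not tokens)
}

const DEFAULT_OPTIONS: RetryBatchOptions = {
  maxBatchSize: 100,
  maxRetries: 3,
  baseDelayMs: 500,
  maxDelayMs: 8000,
  maxChars: 8000,
};

abstract class BaseProvider implements EmbeddingProvider {
  abstract readonly name: string;
  abstract readonly model: string;
  abstract readonly dimensions: number;

  constructor(protected readonly options: RetryBatchOptions = DEFAULT_OPTIONS) {}

  // The single step each concrete provider supplies (Template Method hook).
  protected abstract callApi(texts: string[]): Promise<number[][]>;

  async embed(text: string): Promise<number[]> {
    const [vector] = await this.embedBatch([text]);
    return vector;
  }

  // Shared algorithm: truncate, split into batches, call the hook with retry.
  async embedBatch(texts: string[]): Promise<number[][]> {
    const { maxBatchSize, maxChars } = this.options;
    const vectors: number[][] = [];
    for (let i = 0; i < texts.length; i += maxBatchSize) {
      const batch = texts.slice(i, i + maxBatchSize).map((t) => t.slice(0, maxChars));
      vectors.push(...(await this.withRetry(() => this.callApi(batch))));
    }
    return vectors;
  }

  // Exponential backoff with a cap; rethrows the last error once retries run out.
  private async withRetry<T>(fn: () => Promise<T>): Promise<T> {
    const { maxRetries, baseDelayMs, maxDelayMs } = this.options;
    let lastError: unknown;
    for (let attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        return await fn();
      } catch (err) {
        lastError = err;
        if (attempt === maxRetries) break;
        const delay = Math.min(baseDelayMs * 2 ** attempt, maxDelayMs);
        await new Promise((resolve) => setTimeout(resolve, delay));
      }
    }
    throw lastError;
  }
}
```

Under this shape a concrete provider only declares its three identity fields and implements `callApi`, which is one plausible reason the per-provider line count shrinks.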

Configuration

```bash
# Environment variables (highest priority)
export GBRAIN_EMBEDDING_PROVIDER=gemini   # openai | gemini | voyage
export GBRAIN_EMBEDDING_MODEL=gemini-embedding-2-preview
export GBRAIN_EMBEDDING_DIMENSIONS=1536

# Or persistent DB config
gbrain config set embedding_provider gemini
```
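
As a rough illustration of the env var → DB config table → defaults order, here is a minimal resolver sketch. It is not the registry code from this PR: the function name `resolveEmbeddingConfig`, the `DbConfig` type, and the `embedding_dimensions` key are assumptions. The per-provider defaults are taken from the benchmark table in this description, and the rule that DB model/dimensions are ignored when they belong to a different (or missing) provider follows the commit messages.

```typescript
// Sketch only. Names marked "hypothetical" are not taken from the PR.
const PROVIDERS = ["openai", "gemini", "voyage"] as const;
type ProviderName = (typeof PROVIDERS)[number];

interface EmbeddingConfig {
  provider: ProviderName;
  model: string;
  dimensions: number;
}

// GBrain defaults as listed in the PR's comparison table.
const DEFAULTS: Record<ProviderName, { model: string; dimensions: number }> = {
  openai: { model: "text-embedding-3-large", dimensions: 1536 },
  gemini: { model: "gemini-embedding-2-preview", dimensions: 1536 },
  voyage: { model: "voyage-4-large", dimensions: 1024 },
};

// Hypothetical shape of the rows read from the DB config table.
type DbConfig = Partial<
  Record<"embedding_provider" | "embedding_model" | "embedding_dimensions", string>
>;

function isProviderName(value: string): value is ProviderName {
  return PROVIDERS.some((p) => p === value);
}

function resolveEmbeddingConfig(
  env: Record<string, string | undefined>,
  db: DbConfig,
): EmbeddingConfig {
  const provider = env["GBRAIN_EMBEDDING_PROVIDER"] ?? db.embedding_provider ?? "openai";
  if (!isProviderName(provider)) {
    throw new Error(`Unknown embedding provider: ${provider}`);
  }

  // DB model/dimensions are trusted only when they were stored for this provider.
  // A missing DB provider counts as "provider changed".
  const dbApplies = db.embedding_provider === provider;

  const model =
    env["GBRAIN_EMBEDDING_MODEL"] ??
    (dbApplies ? db.embedding_model : undefined) ??
    DEFAULTS[provider].model;

  const rawDims =
    env["GBRAIN_EMBEDDING_DIMENSIONS"] ?? (dbApplies ? db.embedding_dimensions : undefined);
  const dimensions = rawDims === undefined ? DEFAULTS[provider].dimensions : Number(rawDims);

  if (model.trim() === "") throw new Error("Embedding model must not be empty");
  if (!Number.isInteger(dimensions) || dimensions <= 0) {
    throw new Error(`Invalid embedding dimensions: ${rawDims}`);
  }
  return { provider, model, dimensions };
}
```

For example, with `GBRAIN_EMBEDDING_PROVIDER=voyage` set and a DB that still holds OpenAI settings, this sketch resolves to `voyage-4-large` at 1,024 dimensions rather than inheriting the OpenAI model name.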

Files

| Area | Files |
| --- | --- |
| Provider implementations | `src/core/embedding/{base,openai,gemini,voyage,types,registry,index}.ts` |
| Schema parameterization | `src/core/pglite-schema.ts`, `src/schema.sql`, `scripts/build-schema.sh` |
| Engine initSchema | `src/core/pglite-engine.ts`, `src/core/postgres-engine.ts` |
| Migration | `src/core/migrate.ts` (v5) |
| Doctor | `src/commands/doctor.ts` |
| Shared helpers | `src/core/utils.ts` |

Test plan

393 unit tests pass, 0 fail
BaseProvider: retry, batch splitting, backoff cap, retry exhaustion
Registry: env var override, DB config fallback, provider change invalidation, legacy upgrade path, validation (unknown provider, bad dims, empty model)
Integration (PGLite, 15 scenarios): fresh init (3 providers), restart persistence, env var override, dimension change (forward + reverse + stale marking + empty brain), model change (same dims), legacy normalization, missing keys, sequential provider changes
Migration v5: dimension mismatch detection, model change detection, legacy format normalization
E2E Tier 1: 74 pass (mechanical, sync, schema idempotency)
E2E Tier 2: 3 pass (LLM skill tests via OpenClaw — ingest, query, health)
Smoke test: gbrain init + config get/set + doctor on PGLite

Provider benchmark comparison

| | OpenAI `text-embedding-3-large` | Gemini `gemini-embedding-2-preview` | Voyage `voyage-4-large` |
| --- | --- | --- | --- |
| MTEB English | 64.6 | 68.3 (#1 overall) | 67.2 |
| Retrieval | — | 68.3 | RTEB #1 |
| Context window | 8,192 tokens | 8,192 tokens | 32,000 tokens |
| Max dimensions | 3,072 | 3,072 | 2,048 |
| GBrain default | 1,536 | 1,536 | 1,024 |
| Price / 1M tokens | $0.13 | $0.20 | $0.12 |
| Batch discount | 50% ($0.065) | 50% ($0.10) | 33% ($0.08) |
| Architecture | Dense | Dense | MoE (40% lower cost) |
| Best for | Ecosystem stability | Overall quality (MTEB #1), multimodal | Retrieval/RAG, long context |

Sources: OpenAI Models · OpenAI Embedding Announcement · Gemini Embedding 2 Blog · Gemini API Pricing · Voyage 4-large MoE · Voyage Pricing

I kept OpenAI as the default model, respecting the maintainer's initial decision.

Replace hardcoded OpenAI embedding with a strategy pattern supporting multiple providers. Users can switch embedding models via env vars:

  GBRAIN_EMBEDDING_PROVIDER=gemini|openai|voyage
  GBRAIN_EMBEDDING_MODEL=gemini-embedding-2-preview
  GBRAIN_EMBEDDING_DIMENSIONS=3072

Architecture:
- EmbeddingProvider interface (embed, embedBatch, name, model, dimensions)
- OpenAIProvider (text-embedding-3-large, existing logic preserved)
- GeminiProvider (@google/genai SDK, gemini-embedding-2-preview)
- VoyageProvider (Voyage AI REST API, voyage-3)
- Registry with config resolution: env vars > fallback (openai)
- Backward compatible embed()/embedBatch() API preserved
- chunks.model field records provider:model for future per-column support

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Existing brains store 'text-embedding-3-large' without provider prefix. New format is 'openai:text-embedding-3-large'. Without normalization, migration would incorrectly mark all embeddings stale on upgrade.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
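
A minimal sketch of that normalization step, assuming any stored value without a `provider:` prefix was written by the original OpenAI-only code path. The function names `normalizeStoredModel` and `isModelStale` are hypothetical.

```typescript
// Sketch only: function names are hypothetical.
// Assumes an unprefixed value predates this PR and therefore came from OpenAI.
function normalizeStoredModel(stored: string): string {
  return stored.includes(":") ? stored : `openai:${stored}`;
}

// Embeddings are stale only if the normalized stored model differs from the current one.
function isModelStale(stored: string, current: string): boolean {
  return normalizeStoredModel(stored) !== current;
}
```

Comparing the raw legacy string against `openai:text-embedding-3-large` would report a mismatch even though nothing changed, which is the false stale-marking the commit avoids.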

…ndling

- Escape single quotes in model name before SQL interpolation to prevent SQL syntax errors from malformed env vars
- Normalize legacy embedding_model format in doctor check (same logic as migration v5) to avoid false mismatch warnings

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Registry now follows the project's existing config pattern: env var → DB config table → defaults

- loadEmbeddingConfig() reads embedding_provider/model/dimensions from DB during initSchema, cached for process lifetime
- gbrain config set embedding_provider gemini now works
- embedding_provider added to DB config table defaults
- Both PGLite and Postgres engines load DB config after schema creation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

… key

- Registry: when env var overrides provider, ignore DB model/dimensions (they belong to the previous provider)
- Add 5 tests for DB config fallback, env var override, reset behavior
- Add embedding_provider to migrate-engine config key list

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

voyage-4-large is the current SOTA (MoE architecture, RTEB #1, 40% lower serving cost). Same default dimensions (1024).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ange

Bugs found by new integration tests:
1. resetProvider() was called after loadEmbeddingConfig(), wiping the DB config cache immediately. Fixed: reset before load.
2. Migration v5 runs once (v4→v5). Later provider changes with different dimensions were never ALTER'd. Added post-migration dimension check.

Also adds test coverage for:
- embed()/embedBatch() empty string handling (index.ts wrapper logic)
- initSchema 2-pass DB config persistence (restart simulation)
- Migration v5 actual ALTER TABLE on PGLite (1536→1024)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Integration tests now cover 14 scenarios across 5 categories:
A. Fresh init (defaults, voyage, gemini)
B. Restart persistence (config set, model change, env var override)
C. Dimension change (forward, reverse, stale marking, empty brain)
D. Model change same dims (provider swap, model swap)
E. Legacy/edge (format normalization, missing key, sequential changes)

Bug fix: initSchema now reads actual DB column dimension via pg_attribute instead of trusting config table values, which can be stale after manual config set.

Also adds model-change stale marking for same-dimension provider switches (e.g., openai → gemini both at 1536).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
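
The pg_attribute check could be sketched as below. This is an illustration under assumptions, not the code in `initSchema`: the helper names, the `SyncAction` type, and the example table and column names are hypothetical. It relies on pgvector storing the declared dimension of a `vector(N)` column directly in `atttypmod`, with `-1` for an unsized `vector` column.

```typescript
// Sketch only: helper names and the table/column arguments are hypothetical.
// Builds the catalog query that reads the declared dimension of a vector(N) column.
function columnDimensionSQL(table: string, column: string): string {
  const esc = (s: string) => s.replace(/'/g, "''");
  return [
    "SELECT atttypmod AS dims",
    "FROM pg_attribute",
    `WHERE attrelid = '${esc(table)}'::regclass`,
    `  AND attname = '${esc(column)}'`,
    "  AND NOT attisdropped",
  ].join("\n");
}

type SyncAction =
  | { kind: "none" }
  | { kind: "alter"; from: number; to: number };

// Decide whether the column must be resized, based on what the database actually
// reports rather than on the config table.
function planDimensionSync(actualTypmod: number | null, desired: number): SyncAction {
  // null: column not found; negative: column declared without a fixed size.
  if (actualTypmod === null || actualTypmod < 0) return { kind: "none" };
  if (actualTypmod === desired) return { kind: "none" };
  return { kind: "alter", from: actualTypmod, to: desired };
}
```

Reading the catalog keeps the check correct even when someone edits the config table by hand, since the config row and the real column type can then disagree.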

…er key

When upgrading from a pre-v5 brain with no embedding_provider in config, setting GBRAIN_EMBEDDING_PROVIDER env var would incorrectly use the old provider's model/dimensions. Now treats missing dbConfig.provider as "provider changed" so defaults are used instead of stale DB values.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- escapeSql(), qualifiedModel(), embeddingAlterSQL() in utils.ts
- Replace 8 inline `${provider.name}:${provider.model}` with qualifiedModel()
- Replace 3 duplicate ALTER DDL blocks with embeddingAlterSQL()
- Replace 2 inline model.replace(/'/g, "''") with escapeSql()
- Add WHERE embedded_at IS NOT NULL to stale-marking UPDATEs
- Promise.all for 3 independent getConfig() calls in migration v5
- Add default: throw in registry switch (remove unsafe cached! assertion)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
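
For readers without the diff at hand, the three helpers might look something like this. The bodies of `escapeSql()` and `qualifiedModel()` follow the inline expressions quoted in the commit message; the signature of `embeddingAlterSQL()` and everything inside it (table name, index name, index type, and what "stale marking" sets) are assumptions. Only the statement order DROP INDEX → NULL data → ALTER → CREATE INDEX comes from the PR description.

```typescript
// Sketch only. Table, index and column names below are assumed, not taken from the PR.
function escapeSql(value: string): string {
  // Double single quotes so the value can sit inside a SQL string literal.
  return value.replace(/'/g, "''");
}

function qualifiedModel(provider: { name: string; model: string }): string {
  return `${provider.name}:${provider.model}`;
}

// Returns the statements in the order the PR describes:
// DROP INDEX -> NULL data -> ALTER -> CREATE INDEX.
function embeddingAlterSQL(dimensions: number): string[] {
  if (!Number.isInteger(dimensions) || dimensions <= 0) {
    throw new Error(`Invalid dimensions: ${dimensions}`);
  }
  const table = "content_chunks";       // assumed table name
  const index = "idx_chunks_embedding"; // assumed index name
  return [
    `DROP INDEX IF EXISTS ${index}`,
    // Existing vectors have the old size and cannot be cast, so clear them first.
    `UPDATE ${table} SET embedding = NULL, embedded_at = NULL WHERE embedded_at IS NOT NULL`,
    `ALTER TABLE ${table} ALTER COLUMN embedding TYPE vector(${dimensions})`,
    // Assumed index type and operator class.
    `CREATE INDEX ${index} ON ${table} USING hnsw (embedding vector_cosine_ops)`,
  ];
}
```

The index is dropped first and rebuilt last so the ALTER never runs against an index built for the old vector size, and the `WHERE embedded_at IS NOT NULL` filter keeps the UPDATE from touching rows that were never embedded.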

garrytan · 2026-06-08T02:56:35Z

Thanks for this contribution — and apologies for the slow triage. We did a full pass over the entire PR backlog. gbrain has moved fast, and the maintainer's larger "cathedral" rewrites have superseded a big share of community PRs: the AI gateway + recipes + user_provided_models system replaced almost all individual provider PRs; #1805 fixed the whole Postgres module-singleton class; #1542 unified the type taxonomy; #1657 the retrieval path; #1802 the doctor; and so on.

We're closing this one in that cleanup — either the fix already landed on master, it duplicates another PR or merged change, or it's outside the current merge bar. Where a closed PR carried a genuinely valuable idea, we've recorded it in docs/designs/COMMUNITY_IDEAS.md so nothing good is lost (a few may graduate into TODOs).

Please don't read the close as a judgment of the work — thank you for contributing. If you believe the underlying issue is still live on the latest master, reopen with a quick note and we'll take another look. 🙏

irresi and others added 19 commits April 11, 2026 21:28

feat: add BaseProvider abstract class with shared retry/batch logic

f8f3a87

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix: add missing await on rejects assertion in base provider test

6a4f54a

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

refactor: OpenAIProvider extends BaseProvider

ff1ac33

refactor: GeminiProvider extends BaseProvider

eadefe3

refactor: VoyageProvider extends BaseProvider

d977edd

feat: parameterize PGLite schema dimensions from provider config

974750f

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: parameterize Postgres schema dimensions from provider config

c3467f8

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: migration v5 — dynamic embedding dimensions with ALTER + re-embed

47317e4

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: doctor checks embedding config mismatch

1dd926a

chore: update Voyage default model from voyage-3 to voyage-4-large

2baf606

voyage-4-large is the current SOTA (MoE architecture, RTEB #1, 40% lower serving cost). Same default dimensions (1024).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

irresi marked this pull request as ready for review April 11, 2026 15:12

garrytan mentioned this pull request May 10, 2026

v0.32.0 feat: 5 new embedding recipes + discoverability pass (closes 17-PR cluster) #810

Merged

8 tasks

garrytan closed this Jun 8, 2026

irresi wants to merge 19 commits into garrytan:master from irresi:feat/pluggable-embedding
