[codex] finish voyage embedding slice by trymhaak · Pull Request #197 · garrytan/gbrain

trymhaak · 2026-04-18T16:08:37Z

What changed

This finishes the local Voyage embedding slice and turns it into a deliberate, test-backed change set instead of a half-finished dirty diff.

- preserves existing embedding and model metadata when gbrain embed <slug> or gbrain embed --stale hits pages with a mix of fresh and stale chunks
- adds explicit provider resolution in src/core/embedding.ts via resolveEmbeddingConfig()
- enables Voyage embeddings when VOYAGE_API_KEY is present while keeping OpenAI as the default path
- stores EMBEDDING_MODEL on chunks created during import embedding
- reuses shared embeddingsEnabled() logic in hybrid search instead of hard-coding OPENAI_API_KEY
- adds focused regression coverage for embed preservation, provider selection, and import-time embedding metadata
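
The provider resolution described above could look roughly like the sketch below. It is illustrative only: the real resolveEmbeddingConfig() and embeddingsEnabled() in src/core/embedding.ts may have different signatures. The explicit `env` parameter, the `EmbeddingConfig` shape, the model names, and the precedence (Voyage wins when VOYAGE_API_KEY is set, otherwise OpenAI) are all assumptions made for this example.

```typescript
// Hypothetical sketch of env-based provider resolution; not the PR's actual code.
type EmbeddingProviderName = "openai" | "voyage";

interface EmbeddingConfig {
  provider: EmbeddingProviderName;
  apiKey: string;
  model: string; // the value that would be stamped on each chunk
}

type Env = Record<string, string | undefined>;

// Placeholder model names, assumed for illustration.
const DEFAULT_MODELS: Record<EmbeddingProviderName, string> = {
  openai: "text-embedding-3-small",
  voyage: "voyage-3",
};

function resolveEmbeddingConfig(env: Env): EmbeddingConfig | null {
  // Assumed precedence: Voyage is opt-in via its key, OpenAI is the default path.
  if (env.VOYAGE_API_KEY) {
    return { provider: "voyage", apiKey: env.VOYAGE_API_KEY, model: DEFAULT_MODELS.voyage };
  }
  if (env.OPENAI_API_KEY) {
    return { provider: "openai", apiKey: env.OPENAI_API_KEY, model: DEFAULT_MODELS.openai };
  }
  return null; // no provider configured
}

// Shared check that callers such as hybrid search can use instead of
// testing OPENAI_API_KEY directly.
function embeddingsEnabled(env: Env): boolean {
  return resolveEmbeddingConfig(env) !== null;
}
```

Keeping the decision in one function means search, import, and the embed command agree on whether vector mode is available, and a regression test can pin the choice for each key combination.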

Why it changed

The branch had two real issues:

1. A data-loss regression in the embed command path where mixed fresh/stale pages could lose existing embedding metadata.
2. A broader Voyage-provider diff that was green by accident, but not directly tested at the decision points that now matter.

This closes both gaps.

Impact

- gbrain embed --stale is now safe on partially embedded pages.
- Voyage can be selected intentionally through environment configuration without leaving the decision logic opaque.
- Imported chunks now record the actual embedding model used.
- Search skips vector mode whenever neither OpenAI nor Voyage embeddings are configured.

Root cause

The stale-embed updater rebuilt chunk rows from the newly embedded subset and used undefined for missing entries, which wiped pre-existing metadata on untouched chunks. Separately, the Voyage-provider path had been added as a dirty local slice without direct regression tests on provider selection or import-time model stamping.
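
To make the failure mode concrete, here is a minimal sketch of the two merge strategies. The `ChunkRow` and `FreshEmbedding` shapes and both function names are hypothetical, chosen only to illustrate the bug as described; they are not taken from the gbrain source.

```typescript
// Hypothetical chunk shapes; the real gbrain row type may differ.
interface ChunkRow {
  index: number;
  text: string;
  embedding?: number[];
  model?: string;
}

interface FreshEmbedding {
  embedding: number[];
  model: string;
}

// Lossy variant: chunks that were not re-embedded get undefined written
// over their existing embedding and model.
function mergeLossy(chunks: ChunkRow[], fresh: Map<number, FreshEmbedding>): ChunkRow[] {
  return chunks.map((chunk) => ({
    ...chunk,
    embedding: fresh.get(chunk.index)?.embedding,
    model: fresh.get(chunk.index)?.model,
  }));
}

// Preserving variant: only chunks present in the fresh subset are updated;
// untouched chunks keep whatever metadata they already had.
function mergePreserving(chunks: ChunkRow[], fresh: Map<number, FreshEmbedding>): ChunkRow[] {
  return chunks.map((chunk) => {
    const update = fresh.get(chunk.index);
    return update ? { ...chunk, embedding: update.embedding, model: update.model } : chunk;
  });
}
```

With a page whose first chunk is already embedded and whose second chunk is stale, the lossy merge clears the first chunk's metadata while the preserving merge leaves it intact. That mixed case is the one a regression test for this fix would need to cover.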

Validation

bun test
bun build --compile --outfile /tmp/gbrain-current-build src/cli.ts

Adds a provider-agnostic EmbeddingProvider interface so gbrain can use Gemini (text-embedding-004/gemini-embedding-001) instead of OpenAI, selected via GBRAIN_EMBEDDING_PROVIDER env var. The public embed/embedBatch API in embedding.ts is unchanged; callers see no diff.

Architecture:
- src/core/embedding-provider.ts: EmbeddingProvider interface, factory (getActiveProvider), isEmbeddingAvailable(), resetActiveProvider()
- src/core/providers/openai-embedder.ts: OpenAI impl extracted from embedding.ts
- src/core/providers/gemini-embedder.ts: Gemini impl with Matryoshka dims
- src/core/providers/retry-utils.ts: shared exponentialDelay + sleep

Critical fix: operations.ts put_page had hardcoded !process.env.OPENAI_API_KEY, so Gemini users got silent no-embed on every import. Replaced with isEmbeddingAvailable() which checks whichever provider is active.

New command: gbrain migrate --provider openai|gemini [--dimensions N]
- ALTER TABLE (only when dims change)
- Re-embeds all chunks with the new provider
- Updates config table + config.json
- Remote guard: CLI-only, cannot be called via MCP

Schema: getPGLiteSchema(dims, model) replaces hardcoded vector(1536) in PGLite DDL so new Gemini brains get vector(768) from init.

Config: GBrainConfig gains embedding_provider + embedding_dimensions; loadConfig() propagates them to env on startup (does not override if already set).

Init: gbrain init --provider gemini [--dimensions N] wires provider at brain creation time.

Usage:
  GBRAIN_EMBEDDING_PROVIDER=gemini gbrain init   # Gemini brain, 768 dims
  gbrain migrate --provider gemini               # migrate existing brain
  gbrain migrate --provider openai               # migrate back

Relates to: upstream PR garrytan#197 (voyage embedding). Same territory, but this approach uses an interface/factory pattern that supports N providers without modifying the call sites each time.

Co-authored-by: Al's bot <amartis@celitotech.com>
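
The interface/factory pattern named in the commit message above might be sketched as follows. Only the names EmbeddingProvider, getActiveProvider, isEmbeddingAvailable, resetActiveProvider, GBRAIN_EMBEDDING_PROVIDER, and the 1536/768 dimensions come from the message. The method set on the interface, the explicit `env` parameter, the GEMINI_API_KEY variable, and the zero-vector stub providers are assumptions standing in for the real HTTP-backed implementations.

```typescript
type Env = Record<string, string | undefined>;

// Assumed interface shape; the real one in src/core/embedding-provider.ts may differ.
interface EmbeddingProvider {
  readonly name: string;
  readonly dimensions: number;
  isAvailable(env: Env): boolean;
  embedBatch(texts: string[]): Promise<number[][]>;
}

// Stub provider: returns zero vectors instead of calling a remote API.
function stubProvider(name: string, dimensions: number, keyVar: string): EmbeddingProvider {
  return {
    name,
    dimensions,
    isAvailable: (env) => Boolean(env[keyVar]),
    embedBatch: async (texts) => texts.map(() => new Array<number>(dimensions).fill(0)),
  };
}

const PROVIDERS = new Map<string, EmbeddingProvider>([
  ["openai", stubProvider("openai", 1536, "OPENAI_API_KEY")],
  ["gemini", stubProvider("gemini", 768, "GEMINI_API_KEY")], // key name is an assumption
]);

let activeProvider: EmbeddingProvider | null = null;

// Factory: resolves the provider once and caches it for later calls.
function getActiveProvider(env: Env): EmbeddingProvider {
  if (activeProvider) return activeProvider;
  const name = env.GBRAIN_EMBEDDING_PROVIDER ?? "openai";
  const provider = PROVIDERS.get(name);
  if (!provider) throw new Error(`Unknown embedding provider: ${name}`);
  activeProvider = provider;
  return provider;
}

// Checks the active provider's credentials rather than a hardcoded OPENAI_API_KEY.
function isEmbeddingAvailable(env: Env): boolean {
  return getActiveProvider(env).isAvailable(env);
}

// Clears the cache, e.g. between tests or after a provider migration.
function resetActiveProvider(): void {
  activeProvider = null;
}
```

Under this shape, adding another provider means registering one more entry in the map; call sites that go through getActiveProvider() and isEmbeddingAvailable() stay unchanged.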

trymhaak · 2026-04-18T20:03:57Z

Closing this: continuing the work on the fork instead of upstream.

trymhaak marked this pull request as ready for review April 18, 2026 16:15

fix: finish voyage embedding slice

bef0045

trymhaak force-pushed the codex/finish-voyage-embedding-slice branch from 647530b to bef0045 Compare April 18, 2026 16:17

aloysiusmartis mentioned this pull request Apr 18, 2026

feat: pluggable embedding providers (OpenAI + Gemini) #206

Closed

6 tasks

trymhaak closed this Apr 18, 2026

trymhaak wants to merge 1 commit into garrytan:master from trymhaak:codex/finish-voyage-embedding-slice
