[codex] finish voyage embedding slice #197

Closed
trymhaak wants to merge 1 commit into garrytan:master from trymhaak:codex/finish-voyage-embedding-slice

Conversation

@trymhaak

What changed

This finishes the local Voyage embedding slice and turns it into a deliberate, test-backed change set instead of a half-finished dirty diff.

  • preserves existing embedding and model metadata when gbrain embed <slug> or gbrain embed --stale hits pages with a mix of fresh and stale chunks
  • adds explicit provider resolution in src/core/embedding.ts via resolveEmbeddingConfig()
  • enables Voyage embeddings when VOYAGE_API_KEY is present while keeping OpenAI as the default path
  • stores EMBEDDING_MODEL on chunks created during import embedding
  • reuses shared embeddingsEnabled() logic in hybrid search instead of hard-coding OPENAI_API_KEY
  • adds focused regression coverage for embed preservation, provider selection, and import-time embedding metadata
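
The provider-selection rule described above can be sketched roughly as follows. The function names `resolveEmbeddingConfig()` and `embeddingsEnabled()` come from this PR, but their signatures, the `EmbeddingConfig` shape, the explicit `env` parameter, the placeholder model names, and the precedence when both keys are set (OpenAI wins here) are all assumptions made for illustration, not the actual code in `src/core/embedding.ts`.

```typescript
// Hypothetical shapes: only the two function names appear in the PR text.
type EmbeddingProviderName = "openai" | "voyage";

interface EmbeddingConfig {
  provider: EmbeddingProviderName;
  apiKey: string;
  model: string; // placeholder model names below, not taken from the PR
}

type Env = Record<string, string | undefined>;

function resolveEmbeddingConfig(env: Env): EmbeddingConfig | null {
  const openaiKey = env["OPENAI_API_KEY"];
  const voyageKey = env["VOYAGE_API_KEY"];

  // Assumption: "OpenAI as the default path" means OpenAI wins when both keys exist.
  if (openaiKey) {
    return { provider: "openai", apiKey: openaiKey, model: "text-embedding-3-small" };
  }
  if (voyageKey) {
    return { provider: "voyage", apiKey: voyageKey, model: "voyage-3" };
  }
  // No provider configured: callers should skip embedding work entirely.
  return null;
}

// Single shared check, so search and import do not each hard-code OPENAI_API_KEY.
function embeddingsEnabled(env: Env): boolean {
  return resolveEmbeddingConfig(env) !== null;
}
```

Keeping the decision in one function means hybrid search can ask `embeddingsEnabled(...)` and fall back to keyword-only mode when it returns `false`, rather than duplicating the key checks.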

Why it changed

The branch had two real issues:

  1. A data-loss regression in the embed command path where mixed fresh/stale pages could lose existing embedding metadata.
  2. A broader Voyage-provider diff that passed the test suite by accident, with no tests exercising the decision points that now matter.

This closes both gaps.

Impact

  • gbrain embed --stale is now safe on partially embedded pages.
  • Voyage can be selected intentionally through environment configuration without leaving the decision logic opaque.
  • Imported chunks now record the actual embedding model used.
  • Search skips vector mode whenever neither OpenAI nor Voyage embeddings are configured.

Root cause

The stale-embed updater rebuilt chunk rows from the newly embedded subset and used undefined for missing entries, which wiped pre-existing metadata on untouched chunks. Separately, the Voyage-provider path had been added as a dirty local slice without direct regression tests on provider selection or import-time model stamping.
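
A minimal sketch of the preservation rule, assuming a simplified chunk shape: the `ChunkRow` fields and the `mergeEmbeddedChunks` helper are hypothetical names for illustration and do not come from the repository. The point is that chunks absent from the freshly embedded subset are passed through unchanged rather than rebuilt with `undefined` fields.

```typescript
// Hypothetical row shape; the real chunk schema is not shown in this PR.
interface ChunkRow {
  index: number;
  text: string;
  embedding?: number[];
  model?: string;
}

interface FreshEmbedding {
  embedding: number[];
  model: string;
}

// Merge newly computed embeddings into the full chunk list.
function mergeEmbeddedChunks(
  existing: ChunkRow[],
  fresh: Map<number, FreshEmbedding>,
): ChunkRow[] {
  return existing.map((chunk) => {
    const update = fresh.get(chunk.index);
    // Untouched chunk: return it as-is so its stored embedding and model survive.
    if (!update) return chunk;
    // Stale chunk: overwrite only the embedding fields.
    return { ...chunk, embedding: update.embedding, model: update.model };
  });
}
```

Iterating over the existing rows, instead of over the re-embedded subset, is what keeps metadata on fresh chunks intact.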

Validation

  • bun test
  • bun build --compile --outfile /tmp/gbrain-current-build src/cli.ts

@trymhaak trymhaak marked this pull request as ready for review April 18, 2026 16:15
@trymhaak trymhaak force-pushed the codex/finish-voyage-embedding-slice branch from 647530b to bef0045 Compare April 18, 2026 16:17
aloysiusmartis pushed a commit to aloysiusmartis/gbrain that referenced this pull request Apr 18, 2026
Adds a provider-agnostic EmbeddingProvider interface so gbrain can use
Gemini (text-embedding-004/gemini-embedding-001) instead of OpenAI, selected
via GBRAIN_EMBEDDING_PROVIDER env var. The public embed/embedBatch API in
embedding.ts is unchanged — callers see no diff.

Architecture:
- src/core/embedding-provider.ts — EmbeddingProvider interface, factory
  (getActiveProvider), isEmbeddingAvailable(), resetActiveProvider()
- src/core/providers/openai-embedder.ts — OpenAI impl extracted from embedding.ts
- src/core/providers/gemini-embedder.ts — Gemini impl with Matryoshka dims
- src/core/providers/retry-utils.ts — shared exponentialDelay + sleep

Critical fix: operations.ts put_page had hardcoded !process.env.OPENAI_API_KEY,
so Gemini users got silent no-embed on every import. Replaced with
isEmbeddingAvailable() which checks whichever provider is active.
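
The interface/factory layout described above might look roughly like this. The names `EmbeddingProvider`, `getActiveProvider`, `isEmbeddingAvailable`, and `resetActiveProvider` come from the commit message; the member signatures, the explicit `env` parameter, the stub providers, and the `GEMINI_API_KEY` variable name are assumptions, and the real providers would call their vendor APIs instead of returning zero vectors.

```typescript
type Env = Record<string, string | undefined>;

// Assumed interface members; the commit only names the interface itself.
interface EmbeddingProvider {
  readonly name: string;
  readonly dimensions: number;
  isAvailable(env: Env): boolean;
  embedBatch(texts: string[]): Promise<number[][]>;
}

// Stub provider for illustration: returns zero vectors instead of calling an API.
function makeStubProvider(name: string, dimensions: number, keyVar: string): EmbeddingProvider {
  return {
    name,
    dimensions,
    isAvailable: (env) => Boolean(env[keyVar]),
    embedBatch: async (texts) => texts.map(() => new Array<number>(dimensions).fill(0)),
  };
}

// Adding a provider means adding one registry entry; call sites stay untouched.
const registry: Record<string, (() => EmbeddingProvider) | undefined> = {
  openai: () => makeStubProvider("openai", 1536, "OPENAI_API_KEY"),
  gemini: () => makeStubProvider("gemini", 768, "GEMINI_API_KEY"), // key name assumed
};

let active: EmbeddingProvider | null = null;

function getActiveProvider(env: Env): EmbeddingProvider {
  if (active) return active;
  const name = env["GBRAIN_EMBEDDING_PROVIDER"] ?? "openai";
  const factory = registry[name];
  if (!factory) throw new Error(`Unknown embedding provider: ${name}`);
  active = factory();
  return active;
}

// Replaces a hard-coded OPENAI_API_KEY check with a provider-aware one.
function isEmbeddingAvailable(env: Env): boolean {
  return getActiveProvider(env).isAvailable(env);
}

// Clears the cached provider, for example between tests.
function resetActiveProvider(): void {
  active = null;
}
```

With this shape, a Gemini brain that has only an OpenAI key set reports embeddings as unavailable, which makes the misconfiguration visible instead of silently skipping embedding.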

New command: gbrain migrate --provider openai|gemini [--dimensions N]
- ALTER TABLE (only when dims change)
- Re-embeds all chunks with the new provider
- Updates config table + config.json
- Remote guard: CLI-only, cannot be called via MCP

Schema: getPGLiteSchema(dims, model) replaces hardcoded vector(1536) in
PGLite DDL so new Gemini brains get vector(768) from init.
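
As an illustration of parameterizing the vector width, a schema builder could interpolate the dimension count into the DDL. Only the function name `getPGLiteSchema(dims, model)` is taken from the commit message; the table name, column names, and the use of `model` as a column default are hypothetical.

```typescript
// Sketch only: table and column names are invented for illustration.
function getPGLiteSchema(dims: number, model: string): string {
  // Validate before interpolating into SQL text.
  if (!Number.isInteger(dims) || dims <= 0) {
    throw new Error(`Invalid embedding dimensions: ${dims}`);
  }
  // Escape single quotes so the model name is a safe SQL string literal.
  const safeModel = model.replace(/'/g, "''");

  return [
    "CREATE TABLE IF NOT EXISTS content_chunks (",
    "  id SERIAL PRIMARY KEY,",
    "  chunk_text TEXT NOT NULL,",
    `  embedding vector(${dims}),`,
    `  embedding_model TEXT DEFAULT '${safeModel}'`,
    ");",
  ].join("\n");
}
```

Calling `getPGLiteSchema(768, "gemini-embedding-001")` would then produce a `vector(768)` column, while an OpenAI brain would pass `1536`.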

Config: GBrainConfig gains embedding_provider + embedding_dimensions;
loadConfig() propagates them to env on startup (does not override if already set).
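
The "does not override if already set" rule can be sketched as below. The `GBrainConfig` field names and `GBRAIN_EMBEDDING_PROVIDER` come from the commit message; the helper name `propagateEmbeddingConfig`, the explicit `env` argument, and the `GBRAIN_EMBEDDING_DIMENSIONS` variable are assumptions for illustration.

```typescript
type Env = Record<string, string | undefined>;

// Only the two embedding fields named in the commit message are modeled here.
interface GBrainConfig {
  embedding_provider?: string;
  embedding_dimensions?: number;
}

// Copy config values into the environment, but let explicit env vars win.
function propagateEmbeddingConfig(config: GBrainConfig, env: Env): void {
  if (config.embedding_provider !== undefined && env["GBRAIN_EMBEDDING_PROVIDER"] === undefined) {
    env["GBRAIN_EMBEDDING_PROVIDER"] = config.embedding_provider;
  }
  // GBRAIN_EMBEDDING_DIMENSIONS is a hypothetical variable name.
  if (config.embedding_dimensions !== undefined && env["GBRAIN_EMBEDDING_DIMENSIONS"] === undefined) {
    env["GBRAIN_EMBEDDING_DIMENSIONS"] = String(config.embedding_dimensions);
  }
}
```

This ordering lets a one-off `GBRAIN_EMBEDDING_PROVIDER=gemini gbrain ...` invocation take precedence over whatever is stored in the config file.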

Init: gbrain init --provider gemini [--dimensions N] wires provider at
brain creation time.

Usage:
  GBRAIN_EMBEDDING_PROVIDER=gemini gbrain init   # Gemini brain, 768 dims
  gbrain migrate --provider gemini               # migrate existing brain
  gbrain migrate --provider openai               # migrate back

Relates to: upstream PR garrytan#197 (voyage embedding) — same territory but this
approach uses an interface/factory pattern that supports N providers without
modifying the call sites each time.

Co-authored-by: Al's bot <amartis@celitotech.com>
@trymhaak

Closing this: continuing the work on the fork instead of upstream.

@trymhaak trymhaak closed this Apr 18, 2026