Summary
When the very first gbrain sources sync --strategy code import for a code source runs without a working embedding provider (e.g. OPENAI_API_KEY unset), pages get imported successfully but their chunks have no embeddings. Once the env is fixed, there is no surfaced CLI path to backfill embeddings for those pages — every available command either ignores the code source's ID or refuses to re-run because content hasn't changed.
The only workaround I can find is destructive: gbrain sources remove <id> followed by a fresh sync, which throws away version history, chunk-level metadata, and any cross-source links the page already participates in.
Repro
# Initial import without an embedding key
unset OPENAI_API_KEY
cd /path/to/repo
gbrain sources add gstack-code-myrepo --path "$(pwd)" --strategy code
gbrain sync --source gstack-code-myrepo --strategy code
# -> Pages imported, every embedding errors with "OpenAI embedding requires OPENAI_API_KEY"
# Now fix the env and try to backfill
export OPENAI_API_KEY=sk-...
# Attempt 1: embed --stale
gbrain embed --stale
# -> Embeds pages in source=default fine, but for every code-source slug:
# Error embedding modules-foo-json: Page not found: modules-foo-json (source=default)
# Attempt 2: embed a specific code-source slug
gbrain embed modules-foo-json
# -> Page not found: modules-foo-json (source=default)
# Attempt 3: re-run sync against the code source
gbrain sync --source gstack-code-myrepo --strategy code
# -> ' +0 added, ~0 modified ' — content unchanged, nothing re-imported, nothing re-embedded
# Attempt 4: reindex-code with --yes
gbrain reindex-code --source gstack-code-myrepo --yes
# -> reindex-code: 0 reindexed, 7 skipped, 0 failed
# Content-hash gated; no --force flag visible in --help.
Observed
- gbrain embed --stale and gbrain embed <slug> ignore code-source pages. They look up the slug in source=default and throw Page not found: <slug> (source=default). Confirmed in source: EmbedOpts in src/commands/embed.ts exposes slug, slugs, all, stale, dryRun, and onProgress. There is no source field, and no --source parsing in runEmbed().
- gbrain reindex-code --source <id> --yes is content-hash gated with no force / re-embed flag. Pages whose chunks have null embeddings are still classified as 'skipped' because the source content is unchanged from import time.
- gbrain sync --source <id> --strategy code likewise only acts on git-diff changes, so a no-op sync produces a no-op embed.
Net effect: the only way back from a key-less initial import is to remove the source and re-add it.
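To make the gating problem concrete, here is a minimal sketch of the behaviour described above. It is not gbrain's code: the Page and Chunk shapes, both gate functions, and the hash value are hypothetical and only model the symptom. It contrasts a gate that compares content hashes alone with one that also counts null embeddings as pending work.

```typescript
// Hypothetical model of a content-hash gate. None of these names are gbrain's.
interface Chunk {
  text: string;
  embedding: number[] | null; // null models a chunk that never got a vector
}

interface Page {
  slug: string;
  contentHash: string;
  chunks: Chunk[];
}

type Decision = "reindex" | "skip";

// Hash-only gate: unchanged content is skipped even if vectors are missing.
function hashOnlyGate(stored: Page, currentHash: string): Decision {
  return stored.contentHash === currentHash ? "skip" : "reindex";
}

// Embedding-aware gate: unchanged content is still reprocessed
// when at least one chunk has no embedding.
function embeddingAwareGate(stored: Page, currentHash: string): Decision {
  if (stored.contentHash !== currentHash) return "reindex";
  const missing = stored.chunks.some((c) => c.embedding === null);
  return missing ? "reindex" : "skip";
}

// A page imported while the embedding provider was unavailable.
const page: Page = {
  slug: "modules-foo-json",
  contentHash: "abc123", // placeholder value
  chunks: [{ text: "{}", embedding: null }],
};

console.log(hashOnlyGate(page, "abc123")); // skip
console.log(embeddingAwareGate(page, "abc123")); // reindex
```

The first call reproduces the '0 reindexed, 7 skipped' shape of the report; the second shows the decision a recovery path would need to make.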
Expected
At least one of:
- gbrain embed accepts a --source <id> flag so callers can scope a backfill to the code source whose pages need vectors.
- gbrain reindex-code accepts a --force flag (or a more targeted --embed-only / --missing-embeddings flag) that re-runs the embed step for pages already imported.
- gbrain sync --source <id> --strategy code re-runs the embedding pass for pages whose chunks have null embeddings even when content is unchanged.
Any of the three would unblock recovery without losing the imported pages.
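As an illustration of the first option, here is a rough sketch of how a source-scoped backfill could select its work. Everything in it is an assumption made for the example: EmbedOptsSketch, ChunkRow, parseEmbedArgs, staleSlugs, and the slug modules-bar-py are hypothetical, and the real EmbedOpts and runEmbed() may be structured quite differently.

```typescript
// Hypothetical shapes. The real EmbedOpts in src/commands/embed.ts has no source field.
interface EmbedOptsSketch {
  slug?: string;
  stale: boolean;
  source: string; // assumed to fall back to "default" when --source is absent
}

interface ChunkRow {
  sourceId: string;
  slug: string;
  embedding: number[] | null;
}

// Parse a small subset of argv: [slug] [--stale] [--source <id>].
function parseEmbedArgs(argv: string[]): EmbedOptsSketch {
  const opts: EmbedOptsSketch = { stale: false, source: "default" };
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === undefined) continue;
    if (arg === "--stale") {
      opts.stale = true;
    } else if (arg === "--source") {
      i += 1;
      const id = argv[i];
      if (id === undefined) throw new Error("--source requires an id");
      opts.source = id;
    } else if (!arg.startsWith("--")) {
      opts.slug = arg;
    }
  }
  return opts;
}

// Collect slugs in the requested source that have a chunk without a vector.
function staleSlugs(rows: ChunkRow[], opts: EmbedOptsSketch): string[] {
  const slugs = new Set<string>();
  for (const row of rows) {
    if (row.sourceId !== opts.source) continue;
    if (opts.slug !== undefined && row.slug !== opts.slug) continue;
    if (row.embedding === null) slugs.add(row.slug);
  }
  return Array.from(slugs).sort();
}

const rows: ChunkRow[] = [
  { sourceId: "default", slug: "notes-a", embedding: null },
  { sourceId: "gstack-code-myrepo", slug: "modules-foo-json", embedding: null },
  { sourceId: "gstack-code-myrepo", slug: "modules-bar-py", embedding: [0.1, 0.2] },
];

const opts = parseEmbedArgs(["--stale", "--source", "gstack-code-myrepo"]);
console.log(staleSlugs(rows, opts)); // only modules-foo-json is selected
```

Defaulting source to "default" keeps today's behaviour for callers that never pass the flag, which is why the sketch treats the flag as purely additive.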
Workarounds (for users hitting this today)
- Destructive: gbrain sources remove <id> && gbrain sources add <id> ... && gbrain sync --source <id> --strategy code with the API key set. Loses any source-specific history.
- Live with keyword-only matches on the code source's pages until a fix lands. gbrain search still returns BM25-style hits (score 0.0000), just no vector relevance.
Environment
- gbrain: 0.31.3 (commit 9c60b3a, master)
- Engine: postgres (managed)
- Bun: 1.3.11
- Platform: Linux 6.8.0-111-generic x86_64
Separate observation (probably its own issue)
CODE_EXTENSIONS in src/core/sync.ts:46 does not include .tf, .tfvars, or .hcl. On a Terraform repo with 163 .tf + 51 .tfvars files (plus 20 .md), gbrain sync --strategy code matched 7 files (6 JSON + 1 Python) and indexed nothing else. This mirrors #709 (.astro missing) and is probably worth a follow-up issue that tracks an extensible / pluggable extension list, since Terraform/HCL, Bicep, Pulumi, Kubernetes manifests, and Helm templates all sit in the same gap.
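For the follow-up, a minimal sketch of an extensible extension list might look like the following. Only the name CODE_EXTENSIONS comes from the source. The defaults are placeholders limited to the two types the report saw matched (JSON and Python), and buildExtensionSet and isCodeFile are hypothetical helpers rather than existing gbrain functions.

```typescript
// Placeholder defaults. The real CODE_EXTENSIONS list is longer and lives in src/core/sync.ts.
const DEFAULT_CODE_EXTENSIONS: ReadonlySet<string> = new Set([".json", ".py"]);

// Merge user-supplied extensions into the defaults, normalising case
// and the leading dot so "tf", ".tf", and ".TF" are equivalent.
function buildExtensionSet(extra: string[]): Set<string> {
  const merged = new Set<string>(DEFAULT_CODE_EXTENSIONS);
  for (const raw of extra) {
    const ext = raw.trim().toLowerCase();
    if (ext === "") continue;
    merged.add(ext.startsWith(".") ? ext : `.${ext}`);
  }
  return merged;
}

// Decide whether a path should be treated as a code file.
function isCodeFile(path: string, extensions: Set<string>): boolean {
  const name = path.slice(path.lastIndexOf("/") + 1);
  const dot = name.lastIndexOf(".");
  if (dot <= 0) return false; // no extension, or a dotfile such as ".gitignore"
  return extensions.has(name.slice(dot).toLowerCase());
}

const terraformAware = buildExtensionSet(["tf", "tfvars", ".hcl"]);
console.log(isCodeFile("envs/prod/main.tf", terraformAware)); // true
console.log(isCodeFile("envs/prod/main.tf", buildExtensionSet([]))); // false
```

Where the extra extensions would come from (a per-source setting, a config file, or a CLI flag) is left open here; the sketch only shows the merge and the lookup.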
Related issues
- #710: MCP search tool only searches the default source, so federated source content is missed. Same shape (source scope assumed to be default), different surface (MCP search vs CLI embed). A fix that threads source IDs through embed paths would likely benefit both.
- #711: gbrain sources list reports 0 pages for federated sources after a successful sync. Related federation/source-scope plumbing.
- #712: code-stage sync writes pages with malformed frontmatter; reindex-code reports 'No code pages to reindex'. Different root cause but same user-facing 'can't recover from failed code import' family.
- sync --strategy code dropped on first sync via performFullSync. Adjacent reliability issue on the same code path.