
No recovery path when initial code-source import fails to embed (e.g. missing OPENAI_API_KEY at first sync) #877

@johnybradshaw


Summary

When the very first gbrain sync --source <id> --strategy code import for a code source runs without a working embedding provider (e.g. OPENAI_API_KEY unset), the pages are imported successfully but their chunks have no embeddings. Once the env is fixed, no CLI command backfills embeddings for those pages: every available command either looks the slug up in source=default instead of the code source, or refuses to re-run because the content hasn't changed.

The only workaround I can find is destructive: gbrain sources remove <id> followed by a fresh sync, which throws away version history, chunk-level metadata, and any cross-source links the page already participates in.

Repro

# Initial import without an embedding key
unset OPENAI_API_KEY
cd /path/to/repo
gbrain sources add gstack-code-myrepo --path "$(pwd)" --strategy code
gbrain sync --source gstack-code-myrepo --strategy code
# -> Pages imported, every embedding errors with "OpenAI embedding requires OPENAI_API_KEY"

# Now fix the env and try to backfill
export OPENAI_API_KEY=sk-...

# Attempt 1: embed --stale
gbrain embed --stale
# -> Embeds pages in source=default fine, but for every code-source slug:
#    Error embedding modules-foo-json: Page not found: modules-foo-json (source=default)

# Attempt 2: embed a specific code-source slug
gbrain embed modules-foo-json
# -> Page not found: modules-foo-json (source=default)

# Attempt 3: re-run sync against the code source
gbrain sync --source gstack-code-myrepo --strategy code
# -> ' +0 added, ~0 modified ' — content unchanged, nothing re-imported, nothing re-embedded

# Attempt 4: reindex-code with --yes
gbrain reindex-code --source gstack-code-myrepo --yes
# -> reindex-code: 0 reindexed, 7 skipped, 0 failed
#    Content-hash gated; no --force flag visible in --help.

Observed

  1. gbrain embed --stale / embed <slug> ignore code-source pages. They look up the slug in source=default and throw Page not found: <slug> (source=default). Confirmed in source — src/commands/embed.ts's EmbedOpts exposes slug, slugs, all, stale, dryRun, onProgress. There is no source field, and no --source parsing in runEmbed().
  2. gbrain reindex-code --source <id> --yes is content-hash gated with no force / re-embed flag. Pages whose chunks have null embeddings are still classified as 'skipped' because the source content is unchanged from import time.
  3. gbrain sync --source <id> --strategy code likewise only acts on git-diff changes, so a no-op sync produces a no-op embed.

Net effect: the only way back from a key-less initial import is to remove the source and re-add it.
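
To illustrate why observations 2 and 3 leave these pages stuck, here is a minimal TypeScript sketch of a gate keyed on content hash alone, next to one that also looks for missing embeddings. The Page and Chunk shapes, the Decision values, and both function names are assumptions made up for this example; they are not taken from gbrain's source.

```typescript
// Hypothetical shapes; gbrain's real page and chunk types may differ.
interface Chunk {
  text: string;
  embedding: number[] | null; // null when the embed step failed at import time
}

interface Page {
  slug: string;
  contentHash: string; // hash recorded at import time
  chunks: Chunk[];
}

type Decision = "reindex" | "embed-only" | "skip";

// Gate on content hash alone: unchanged content is always skipped,
// even when every chunk is missing its embedding.
function hashOnlyGate(page: Page, currentHash: string): Decision {
  return page.contentHash === currentHash ? "skip" : "reindex";
}

// Gate that also checks for missing embeddings before skipping.
function embeddingAwareGate(page: Page, currentHash: string): Decision {
  if (page.contentHash !== currentHash) return "reindex";
  const missing = page.chunks.some((c) => c.embedding === null);
  return missing ? "embed-only" : "skip";
}

// A page imported without an API key: content unchanged, embedding absent.
const keylessPage: Page = {
  slug: "modules-foo-json",
  contentHash: "abc123",
  chunks: [{ text: "{}", embedding: null }],
};

console.log(hashOnlyGate(keylessPage, "abc123"));
console.log(embeddingAwareGate(keylessPage, "abc123"));
```

Under these assumptions the first gate never gives the page another chance, while the second routes it to an embed-only pass without re-importing anything.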

Expected

At least one of:

  • gbrain embed accepts a --source <id> flag so callers can scope a backfill to the code source whose pages need vectors.
  • gbrain reindex-code accepts a --force flag (or a more targeted --embed-only / --missing-embeddings flag) that re-runs the embed step for pages already imported.
  • gbrain sync --source <id> --strategy code re-runs the embedding pass for pages whose chunks have null embeddings even when content is unchanged.

Any of the three would unblock recovery without losing the imported pages.
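
As a rough idea of what the first option could look like, here is a hedged sketch of --source parsing for embed. The source field, parseEmbedArgs, and resolveSource are hypothetical names for this example; only slug, slugs, all, stale, and dryRun come from the EmbedOpts fields listed above (onProgress is omitted), and the real runEmbed() may be structured quite differently.

```typescript
// Fields named in the issue, plus a hypothetical `source`.
interface EmbedOpts {
  slug?: string;
  slugs?: string[];
  all?: boolean;
  stale?: boolean;
  dryRun?: boolean;
  source?: string; // hypothetical: not present in EmbedOpts today
}

// The source id that lookups currently fall back to, per the error messages.
const DEFAULT_SOURCE = "default";

// Parses only --stale, --source <id>, and a positional slug.
// Any other flag is silently ignored in this sketch.
function parseEmbedArgs(argv: string[]): EmbedOpts {
  const opts: EmbedOpts = {};
  const rest = [...argv];
  while (rest.length > 0) {
    const arg = rest.shift() as string;
    if (arg === "--stale") {
      opts.stale = true;
    } else if (arg === "--source") {
      const value = rest.shift();
      if (value === undefined || value.startsWith("--")) {
        throw new Error("--source requires a source id");
      }
      opts.source = value;
    } else if (!arg.startsWith("--")) {
      opts.slug = arg;
    }
  }
  return opts;
}

// Page lookups would use this instead of a hard-coded default source.
function resolveSource(opts: EmbedOpts): string {
  return opts.source ?? DEFAULT_SOURCE;
}

const scoped = parseEmbedArgs(["--stale", "--source", "gstack-code-myrepo"]);
console.log(resolveSource(scoped));
```

Falling back to the default source when the flag is absent would keep existing invocations such as gbrain embed --stale behaving as they do today.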

Related issues

Workarounds (for users hitting this today)

  • Destructive: gbrain sources remove <id> && gbrain sources add <id> ... && gbrain sync --source <id> --strategy code with the API key set. Loses any source-specific history.
  • Live with keyword-only matches on the code source's pages until a fix lands. gbrain search still returns BM25-style hits (score 0.0000), just no vector relevance.

Environment

  • gbrain: 0.31.3 (commit 9c60b3a, master)
  • Engine: postgres (managed)
  • Bun: 1.3.11
  • Platform: Linux 6.8.0-111-generic x86_64

Separate observation (probably its own issue)

CODE_EXTENSIONS in src/core/sync.ts:46 does not include .tf, .tfvars, or .hcl. On a Terraform repo with 163 .tf + 51 .tfvars files (plus 20 .md), gbrain sync --strategy code matched 7 files (6 JSON + 1 Python) and indexed nothing else. This mirrors #709 (.astro missing) and is probably worth a follow-up issue that tracks an extensible / pluggable extension list, since Terraform/HCL, Bicep, Pulumi, Kubernetes manifests, and Helm templates all sit in the same gap.
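
For the follow-up, one possible shape for an extensible list is sketched below. DEFAULT_CODE_EXTENSIONS here holds only the two extensions this issue saw matched (.json and .py); it is not a copy of the real CODE_EXTENSIONS, and buildExtensionSet, normalizeExtension, and isCodeFile are hypothetical helpers.

```typescript
// Placeholder defaults: only the two extensions observed matching in this issue.
const DEFAULT_CODE_EXTENSIONS: readonly string[] = [".json", ".py"];

// Accepts "tf", ".tf", or ".TF" and returns ".tf".
function normalizeExtension(ext: string): string {
  const trimmed = ext.trim().toLowerCase();
  return trimmed.startsWith(".") ? trimmed : `.${trimmed}`;
}

// Merges built-in defaults with user-supplied extras (e.g. from config).
function buildExtensionSet(extra: readonly string[] = []): Set<string> {
  return new Set([...DEFAULT_CODE_EXTENSIONS, ...extra].map(normalizeExtension));
}

// True when the file's final extension is in the set.
function isCodeFile(path: string, extensions: Set<string>): boolean {
  const base = path.slice(path.lastIndexOf("/") + 1);
  const dot = base.lastIndexOf(".");
  if (dot <= 0) return false; // no extension, or a dotfile such as ".gitignore"
  return extensions.has(base.slice(dot).toLowerCase());
}

const terraformExts = buildExtensionSet(["tf", "TFVARS", ".hcl"]);
console.log(isCodeFile("modules/vpc/main.tf", terraformExts));
console.log(isCodeFile("modules/vpc/main.tf", buildExtensionSet()));
```

With extras supplied per source or per config file, new ecosystems could be covered without a code change for each extension.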
