fix(embed): preserve code-chunk metadata across re-embed (#769) #1232

Open
rayers wants to merge 1 commit into garrytan:master from rayers:fix/embed-preserve-code-metadata-769

rayers commented May 20, 2026

Summary

Closes #769. Every re-embed pass clobbered code-chunk metadata (language, symbol_name, symbol_type, start_line, end_line, parent_symbol_path, doc_comment, symbol_name_qualified) to NULL, disabling code-def queries across thousands of indexed chunks.

What this PR is

This is the #769 fix split out of PR #768 (the original bundled #767 + #769 + extract polish). v0.31.2 absorbed the #767 fix independently via collectSyncableFiles, so PR #768 was carrying redundant scope. PR #768 is being closed in favor of two narrower PRs (this one + a separate wikilink-resolver PR).

Two complementary fixes

embed.ts — three re-upsert call sites (embedPage, embedAll non-stale, embedAllStale autopilot path) built ChunkInputs from loaded chunks but stripped the 8 metadata fields. New preserveCodeMetadata helper threads those fields through consistently. Integrated cleanly with v0.34.4.0's cursor-paginated --stale hardening — the wrap sits inside the worker function between embedBatchWithBackoff and engine.upsertChunks, keeping all the budget / sourceId / workers machinery from #991.
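A minimal sketch of what a helper like preserveCodeMetadata could look like, assuming the loaded chunk and the ChunkInput both carry the 8 metadata fields as optional, nullable properties. The type shapes, the field types (in particular parent_symbol_path as a string), and the toChunkInput wrapper are illustrative assumptions, not the code in this PR:

```typescript
// Hypothetical shapes; the real gbrain types may differ.
interface CodeMetadata {
  language?: string | null;
  symbol_name?: string | null;
  symbol_type?: string | null;
  start_line?: number | null;
  end_line?: number | null;
  parent_symbol_path?: string | null; // assumed to be a string here
  doc_comment?: string | null;
  symbol_name_qualified?: string | null;
}

interface LoadedChunk extends CodeMetadata {
  chunk_index: number;
  chunk_text: string;
}

interface ChunkInput extends CodeMetadata {
  chunk_index: number;
  chunk_text: string;
  embedding: number[];
}

// Copy the 8 code-metadata fields off a loaded chunk, normalizing
// missing values to null so every key is always present.
function preserveCodeMetadata(chunk: LoadedChunk): CodeMetadata {
  return {
    language: chunk.language ?? null,
    symbol_name: chunk.symbol_name ?? null,
    symbol_type: chunk.symbol_type ?? null,
    start_line: chunk.start_line ?? null,
    end_line: chunk.end_line ?? null,
    parent_symbol_path: chunk.parent_symbol_path ?? null,
    doc_comment: chunk.doc_comment ?? null,
    symbol_name_qualified: chunk.symbol_name_qualified ?? null,
  };
}

// Hypothetical call-site wrapper: build the re-upsert input from a loaded
// chunk plus its fresh embedding, without dropping the metadata.
function toChunkInput(chunk: LoadedChunk, embedding: number[]): ChunkInput {
  return {
    chunk_index: chunk.chunk_index,
    chunk_text: chunk.chunk_text,
    embedding,
    ...preserveCodeMetadata(chunk),
  };
}
```

Routing all three call sites through one helper means a ninth metadata field would only need to be added in one place.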

postgres-engine.ts + pglite-engine.ts — upsertChunks ON CONFLICT clause OVERWROTE metadata columns from EXCLUDED. Asymmetric vs the embedding / embedded_at columns which already used a chunk_text-gated CASE pattern (re-chunk → trust EXCLUDED, re-embed → COALESCE preserve). Applied the same pattern to all 8 metadata columns.
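The chunk_text-gated CASE pattern described above can be sketched as a small generator for the ON CONFLICT ... DO UPDATE SET assignments. This is an illustration under assumptions: the function name, the table name passed in, and the exact SQL text are hypothetical, and the real engines may spell the comparison differently:

```typescript
// The 8 code-chunk metadata columns named in this PR.
const CODE_METADATA_COLUMNS = [
  "language",
  "symbol_name",
  "symbol_type",
  "start_line",
  "end_line",
  "parent_symbol_path",
  "doc_comment",
  "symbol_name_qualified",
] as const;

// Build one SET assignment per metadata column (hypothetical helper).
//   re-chunk (chunk_text changed)   -> trust the incoming EXCLUDED value
//   re-embed (chunk_text unchanged) -> keep the stored value when the
//                                      incoming one is NULL
function metadataSetClause(table: string): string {
  return CODE_METADATA_COLUMNS.map(
    (col) =>
      `${col} = CASE ` +
      `WHEN EXCLUDED.chunk_text IS DISTINCT FROM ${table}.chunk_text ` +
      `THEN EXCLUDED.${col} ` +
      `ELSE COALESCE(EXCLUDED.${col}, ${table}.${col}) END`,
  ).join(",\n");
}
```

Note that COALESCE only preserves a value that already exists: if both the incoming and stored values are NULL, the column stays NULL, which is why a chunker-driven backfill is still needed for rows clobbered before this fix.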

Tests

Three regression tests in test/embed.serial.test.ts cover --stale (autopilot), --all, and --slugs paths. Each loads a chunk with full metadata, runs runEmbed, and asserts engine.upsertChunks receives the metadata round-tripped. Coexists with master's D5 embedBatchWithBackoff test block from #991.

Local: bun run verify clean, bun test test/embed.serial.test.ts → 28 pass / 0 fail.

Backfill required after deploy

gbrain sync --strategy code --force --source <id>

Per code source, to re-populate metadata via the chunker. Without backfill, existing NULL columns stay NULL — re-embed alone never produces metadata, only the chunker does.

Test plan

  • bun run verify clean
  • bun test test/embed.serial.test.ts → 28/0/0
  • bun run typecheck clean
  • bun run test:e2e (gated on DATABASE_URL)
  • Real-corpus backfill verification — code-def queries returning 0 hits → backfill → gbrain code-def <symbol> returns hits

🤖 Generated with Claude Code

rayers added a commit to rayers/gbrain that referenced this pull request May 24, 2026
Brings in 7 upstream releases (v0.40.3.0 through v0.40.8.0):
- contextual retrieval + cache invalidation gate (v0.40.3.0)
- selective graph signals + per-stage attribution + audit-writer
  unification (v0.40.4.0)
- Federated Sync v2 (v0.40.5.0, v0.40.6.0)
- Schema Cathedral v3 (v0.40.7.0)
- e2e + flake fixes (v0.40.8.0)

Conflicts in pglite-engine.ts and postgres-engine.ts upsertChunks
were comment-only (git couldn't pick which of two valid explanations
to keep). Both comments retained — they document complementary
aspects of the same SQL block:

  1. The garrytan#769 chunk_text-gated CASE pattern for the 8 code-chunk
     metadata columns (preserves metadata across re-embed; from
     Ryan's open PR garrytan#1232 upstream).
  2. The v0.40.3.0 D24 NULL→non-NULL race fix for the embedding +
     embedded_at columns (lets the fresher write win when two
     writers race on the same chunk).

The 8 metadata-column CASE assignments survived the auto-merge
intact alongside upstream's new D24 branches in the embedding
and embedded_at CASE expressions.

Run `bun install` to pick up the new `chokidar` dependency
(typecheck currently fails on chokidar resolution + a stale
`collectFilesByStrategy` export reference; both unrelated to
the engine-file resolution above).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rayers force-pushed the fix/embed-preserve-code-metadata-769 branch from 807b06d to 69e7ba0 on May 26, 2026 11:11
Closes garrytan#769. Every re-embed pass clobbered code-chunk metadata
(language, symbol_name, symbol_type, start_line, end_line,
parent_symbol_path, doc_comment, symbol_name_qualified) to NULL,
disabling code-def queries across thousands of indexed chunks.

Two complementary fixes:

embed.ts — three re-upsert call sites (embedPage, embedAll
non-stale, embedAllStale autopilot path) build ChunkInputs from
loaded chunks; they were stripping the 8 metadata fields. New
preserveCodeMetadata helper threads those fields through
consistently. Integrated cleanly with v0.34.4.0's cursor-paginated
--stale hardening — the wrap sits inside the worker function
between embedBatchWithBackoff and engine.upsertChunks.

postgres-engine.ts + pglite-engine.ts — upsertChunks ON CONFLICT
clause OVERWROTE metadata columns from EXCLUDED. Asymmetric vs the
embedding/embedded_at columns which already used a chunk_text-gated
CASE pattern (re-chunk → trust EXCLUDED, re-embed → COALESCE
preserve). Applied the same pattern to all 8 metadata columns.

Three regression tests in test/embed.serial.test.ts cover --stale
(autopilot), --all, and --slugs paths. Each loads a chunk with
full metadata, runs runEmbed, and asserts engine.upsertChunks
receives the metadata round-tripped. Coexists with master's D5
embedBatchWithBackoff test block.

Backfill required after deploy: `gbrain sync --strategy code
--force --source <id>` per code source to re-populate metadata via
the chunker. Without backfill, existing NULL columns stay NULL —
re-embed alone never produces metadata, only the chunker does.

Originally landed as part of PR garrytan#768 (the wave that bundled garrytan#767 +
fix; this PR carries the garrytan#769 fix alone with no scope overlap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rayers force-pushed the fix/embed-preserve-code-metadata-769 branch from 69e7ba0 to 8f3c27e on May 28, 2026 05:59

Development

Successfully merging this pull request may close these issues.

Code chunks land in DB with NULL language / symbol_name / symbol_type across all languages
