fix(embed): preserve code-chunk metadata across re-embed (#769) #1232
Open
rayers wants to merge 1 commit into
rayers added a commit to rayers/gbrain that referenced this pull request on May 24, 2026
Brings in 7 upstream releases (v0.40.3.0 through v0.40.8.0):

- contextual retrieval + cache invalidation gate (v0.40.3.0)
- selective graph signals + per-stage attribution + audit-writer unification (v0.40.4.0)
- Federated Sync v2 (v0.40.5.0, v0.40.6.0)
- Schema Cathedral v3 (v0.40.7.0)
- e2e + flake fixes (v0.40.8.0)

Conflicts in pglite-engine.ts and postgres-engine.ts upsertChunks were comment-only (git couldn't pick which of two valid explanations to keep). Both comments retained; they document complementary aspects of the same SQL block:

1. The garrytan#769 chunk_text-gated CASE pattern for the 8 code-chunk metadata columns (preserves metadata across re-embed; from Ryan's open PR garrytan#1232 upstream).
2. The v0.40.3.0 D24 NULL→non-NULL race fix for the embedding + embedded_at columns (lets the fresher write win when two writers race on the same chunk).

The 8 metadata-column CASE assignments survived the auto-merge intact alongside upstream's new D24 branches in the embedding and embedded_at CASE expressions.

Run `bun install` to pick up the new `chokidar` dependency (typecheck currently fails on chokidar resolution + a stale `collectFilesByStrategy` export reference; both unrelated to the engine-file resolution above).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
807b06d to 69e7ba0
Closes garrytan#769. Every re-embed pass clobbered code-chunk metadata (language, symbol_name, symbol_type, start_line, end_line, parent_symbol_path, doc_comment, symbol_name_qualified) to NULL, disabling code-def queries across thousands of indexed chunks.

Two complementary fixes:

- embed.ts: three re-upsert call sites (embedPage, embedAll non-stale, embedAllStale autopilot path) build ChunkInputs from loaded chunks; they were stripping the 8 metadata fields. New preserveCodeMetadata helper threads those fields through consistently. Integrated cleanly with v0.34.4.0's cursor-paginated --stale hardening: the wrap sits inside the worker function between embedBatchWithBackoff and engine.upsertChunks.
- postgres-engine.ts + pglite-engine.ts: upsertChunks ON CONFLICT clause OVERWROTE metadata columns from EXCLUDED. Asymmetric vs the embedding/embedded_at columns, which already used a chunk_text-gated CASE pattern (re-chunk → trust EXCLUDED, re-embed → COALESCE preserve). Applied the same pattern to all 8 metadata columns.

Three regression tests in test/embed.serial.test.ts cover --stale (autopilot), --all, and --slugs paths. Each loads a chunk with full metadata, runs runEmbed, and asserts engine.upsertChunks receives the metadata round-tripped. Coexists with master's D5 embedBatchWithBackoff test block.

Backfill required after deploy: `gbrain sync --strategy code --force --source <id>` per code source to re-populate metadata via the chunker. Without backfill, existing NULL columns stay NULL: re-embed alone never produces metadata, only the chunker does.

Originally landed as part of PR garrytan#768 (the wave that bundled garrytan#767 + fix); this PR carries the garrytan#769 fix alone with no scope overlap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
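The embed.ts half of the fix described in the commit message above can be pictured with a minimal TypeScript sketch. Only the helper name `preserveCodeMetadata` and the eight field names come from the commit message; the `LoadedChunk` and `ChunkInput` shapes, the field types, and the helper's signature are assumptions made for illustration and may not match the actual gbrain code.

```typescript
// Hypothetical shapes for illustration; the real gbrain types may differ.
interface CodeMetadata {
  language: string | null;
  symbol_name: string | null;
  symbol_type: string | null;
  start_line: number | null;
  end_line: number | null;
  parent_symbol_path: string | null; // assumed to be a plain string here
  doc_comment: string | null;
  symbol_name_qualified: string | null;
}

// A chunk as loaded back from the database before re-embedding.
interface LoadedChunk extends CodeMetadata {
  chunk_index: number;
  chunk_text: string;
}

// What a re-upsert call site hands to engine.upsertChunks.
interface ChunkInput extends Partial<CodeMetadata> {
  chunk_index: number;
  chunk_text: string;
  embedding?: number[];
}

// Copies the 8 metadata fields from the loaded chunk onto the rebuilt input,
// so a re-embed no longer drops them on the way to the upsert.
function preserveCodeMetadata(loaded: LoadedChunk, input: ChunkInput): ChunkInput {
  return {
    ...input,
    language: loaded.language,
    symbol_name: loaded.symbol_name,
    symbol_type: loaded.symbol_type,
    start_line: loaded.start_line,
    end_line: loaded.end_line,
    parent_symbol_path: loaded.parent_symbol_path,
    doc_comment: loaded.doc_comment,
    symbol_name_qualified: loaded.symbol_name_qualified,
  };
}

// Example: rebuilding an input after computing a fresh embedding.
const loadedExample: LoadedChunk = {
  chunk_index: 0,
  chunk_text: "function parseConfig() {}",
  language: "typescript",
  symbol_name: "parseConfig",
  symbol_type: "function",
  start_line: 10,
  end_line: 12,
  parent_symbol_path: null,
  doc_comment: "Parses the config file.",
  symbol_name_qualified: "config.parseConfig",
};

const rebuilt = preserveCodeMetadata(loadedExample, {
  chunk_index: loadedExample.chunk_index,
  chunk_text: loadedExample.chunk_text,
  embedding: [0.1, 0.2, 0.3],
});

console.log(rebuilt.symbol_name, rebuilt.start_line); // prints "parseConfig 10"
```

Routing all three call sites through one helper, rather than spreading fields at each site, keeps the list of preserved columns in a single place.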
69e7ba0 to 8f3c27e
Summary
Closes #769. Every re-embed pass clobbered code-chunk metadata (`language`, `symbol_name`, `symbol_type`, `start_line`, `end_line`, `parent_symbol_path`, `doc_comment`, `symbol_name_qualified`) to NULL, disabling `code-def` queries across thousands of indexed chunks.

What this PR is
This is the #769 fix split out of PR #768 (the original, which bundled #767 + #769 + extract polish). v0.31.2 absorbed the #767 fix independently via `collectSyncableFiles`, so PR #768 was carrying redundant scope. PR #768 is being closed in favor of two narrower PRs (this one + a separate wikilink-resolver PR).

Two complementary fixes
- `embed.ts`: three re-upsert call sites (`embedPage`, `embedAll` non-stale, `embedAllStale` autopilot path) built `ChunkInput`s from loaded chunks but stripped the 8 metadata fields. New `preserveCodeMetadata` helper threads those fields through consistently. Integrated cleanly with v0.34.4.0's cursor-paginated `--stale` hardening: the wrap sits inside the worker function between `embedBatchWithBackoff` and `engine.upsertChunks`, keeping all the budget / sourceId / workers machinery from #991.
- `postgres-engine.ts` + `pglite-engine.ts`: the `upsertChunks` `ON CONFLICT` clause OVERWROTE metadata columns from `EXCLUDED`. Asymmetric vs the `embedding` / `embedded_at` columns, which already used a `chunk_text`-gated CASE pattern (re-chunk → trust EXCLUDED, re-embed → COALESCE preserve). Applied the same pattern to all 8 metadata columns.

Tests
Three regression tests in `test/embed.serial.test.ts` cover `--stale` (autopilot), `--all`, and `--slugs` paths. Each loads a chunk with full metadata, runs `runEmbed`, and asserts `engine.upsertChunks` receives the metadata round-tripped. Coexists with master's D5 `embedBatchWithBackoff` test block from #991.

Local: `bun run verify` clean, `bun test test/embed.serial.test.ts` → 28 pass / 0 fail.

Backfill required after deploy
Run `gbrain sync --strategy code --force --source <id>` per code source to re-populate metadata via the chunker. Without backfill, existing NULL columns stay NULL: re-embed alone never produces metadata, only the chunker does.
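The `chunk_text`-gated rule applied to the 8 metadata columns, and the reason it cannot replace the backfill, can be sketched as follows. This is an illustration under assumptions: the table name `content_chunks`, the use of `IS DISTINCT FROM`, and the exact shape of the CASE expression are guesses based on the description "re-chunk → trust EXCLUDED, re-embed → COALESCE preserve", not the SQL in the engine files. `mergeColumn` is a hypothetical in-memory model of the same rule.

```typescript
// The 8 code-chunk metadata columns named in this PR.
const METADATA_COLUMNS = [
  "language",
  "symbol_name",
  "symbol_type",
  "start_line",
  "end_line",
  "parent_symbol_path",
  "doc_comment",
  "symbol_name_qualified",
] as const;

// Builds the SET fragment of an ON CONFLICT ... DO UPDATE clause.
// "content_chunks" is a hypothetical table name used only for this sketch.
function metadataSetClause(table: string = "content_chunks"): string {
  return METADATA_COLUMNS.map(
    (col) =>
      `${col} = CASE WHEN EXCLUDED.chunk_text IS DISTINCT FROM ${table}.chunk_text ` +
      `THEN EXCLUDED.${col} ` +
      `ELSE COALESCE(EXCLUDED.${col}, ${table}.${col}) END`,
  ).join(",\n");
}

type ColumnValue = string | number | null;

interface ColumnState {
  chunk_text: string;
  value: ColumnValue;
}

// In-memory model of the CASE expression above, for one column.
function mergeColumn(existing: ColumnState, incoming: ColumnState): ColumnValue {
  // Re-chunk: the text changed, so the incoming row is authoritative.
  if (incoming.chunk_text !== existing.chunk_text) return incoming.value;
  // Re-embed: same text, so keep the stored value when the incoming one is NULL.
  return incoming.value ?? existing.value;
}

// Re-embed of an intact row keeps its metadata.
const kept = mergeColumn(
  { chunk_text: "function a() {}", value: "typescript" },
  { chunk_text: "function a() {}", value: null },
);

// Re-embed of a row that was already clobbered has nothing to preserve.
const stillNull = mergeColumn(
  { chunk_text: "function a() {}", value: null },
  { chunk_text: "function a() {}", value: null },
);

console.log(kept, stillNull); // prints "typescript null"
```

The second call shows why the backfill is needed: the preserve branch can only keep what is already stored, so rows nulled before the fix stay NULL until the chunker writes them again.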
Test plan
- `bun run verify` clean
- `bun test test/embed.serial.test.ts` → 28/0/0
- `bun run typecheck` clean
- `bun run test:e2e` (gated on DATABASE_URL)
- `gbrain code-def <symbol>` returns hits

🤖 Generated with Claude Code
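The round-trip assertion described under Tests can be pictured without the real harness. The sketch below is not the code in `test/embed.serial.test.ts`: `RecordingEngine`, `reEmbed`, and `metadataRoundTripped` are hypothetical stand-ins for the engine mock, `runEmbed`, and the test's assertions. It is synchronous and tracks only four of the eight metadata fields for brevity.

```typescript
// Hypothetical, trimmed-down chunk shape (4 of the 8 metadata fields).
interface StoredChunk {
  chunk_text: string;
  language: string | null;
  symbol_name: string | null;
  start_line: number | null;
  end_line: number | null;
}

interface UpsertInput extends StoredChunk {
  embedding: number[];
}

// Test double that serves stored chunks and records every upsert it receives.
class RecordingEngine {
  readonly upserts: UpsertInput[][] = [];
  private readonly stored: StoredChunk[];

  constructor(stored: StoredChunk[]) {
    this.stored = stored;
  }

  getChunks(): StoredChunk[] {
    return this.stored;
  }

  upsertChunks(chunks: UpsertInput[]): void {
    this.upserts.push(chunks);
  }
}

// Stand-in for a re-embed pass: load chunks, embed their text, upsert them
// again with the metadata carried along instead of stripped.
function reEmbed(engine: RecordingEngine, embed: (text: string) => number[]): void {
  const rebuilt = engine.getChunks().map((chunk) => ({
    ...chunk,
    embedding: embed(chunk.chunk_text),
  }));
  engine.upsertChunks(rebuilt);
}

// True when every tracked metadata field reached upsertChunks unchanged.
function metadataRoundTripped(stored: StoredChunk, sent: UpsertInput): boolean {
  return (
    sent.language === stored.language &&
    sent.symbol_name === stored.symbol_name &&
    sent.start_line === stored.start_line &&
    sent.end_line === stored.end_line
  );
}
```

A test built on this shape would seed the engine with one fully populated chunk, run the re-embed, and check `metadataRoundTripped` against the first recorded upsert.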