fix(extract): normalize slugs to lowercase via pathToSlug() (T-OBS-1)#736
Closed
Freddy-Cach wants to merge 2 commits into garrytan:master
… SDK and Voyage's actual contract
The @ai-sdk/openai-compatible package treats Voyage as if it were
OpenAI-shaped, but Voyage's /v1/embeddings endpoint diverges in four places
that combine into a hard-blocking incompatibility:
OUTBOUND request:
- 'encoding_format=float' (SDK default) is rejected; Voyage only accepts 'base64'
- 'dimensions' parameter (OpenAI name) is rejected; Voyage uses 'output_dimension'
INBOUND response:
- With encoding_format=base64, 'embedding' is returned as a base64 string,
but the SDK's Zod schema (openaiTextEmbeddingResponseSchema) expects an
'array of number'. The schema fails with 'Invalid JSON response' even
though the JSON is well-formed.
- 'usage' lacks 'prompt_tokens'; the schema requires it when usage is present.
Without this patch, ALL embedding requests to Voyage fail. Reproducible by
running 'gbrain put <slug> < text' with embedding_model=voyage:voyage-* and
any current voyage model (voyage-3-large, voyage-3, voyage-4-large).
Solution: pass a custom 'fetch' to createOpenAICompatible only when
recipe.id === 'voyage'. The fetch wrapper:
1. Forces encoding_format='base64' on outbound (Voyage's only accepted value)
2. Translates dimensions -> output_dimension on outbound
3. Drops Content-Length so the runtime recomputes from the mutated body
4. Decodes base64 embeddings to Float32 arrays on inbound (so the Zod schema
sees what it expects)
5. Synthesizes prompt_tokens from total_tokens when missing
This is a minimal, targeted fix. It only activates for Voyage and falls
through cleanly for all other providers. No public API changes.
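
The five steps above can be sketched roughly as follows. This is an illustration, not the code from this PR: the helper names (`rewriteVoyageRequest`, `decodeBase64Embedding`, `rewriteVoyageResponse`, `voyageFetch`) are hypothetical, the base64 payload is assumed to be packed little-endian float32 values, and the fallback to 0 when `total_tokens` is also missing is an assumption.

```typescript
// Sketch only: helper names are hypothetical, not taken from the PR diff.

/** Outbound: force base64 encoding and rename `dimensions` to `output_dimension`. */
function rewriteVoyageRequest(body: string): string {
  const payload = JSON.parse(body) as Record<string, unknown>;
  payload.encoding_format = "base64";
  if ("dimensions" in payload) {
    payload.output_dimension = payload.dimensions;
    delete payload.dimensions;
  }
  return JSON.stringify(payload);
}

/** Decode a base64 string of packed float32 values (little-endian is assumed). */
function decodeBase64Embedding(b64: string): number[] {
  const binary = atob(b64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  const view = new DataView(bytes.buffer);
  const values: number[] = [];
  for (let offset = 0; offset + 4 <= bytes.length; offset += 4) {
    values.push(view.getFloat32(offset, true));
  }
  return values;
}

/** Inbound: decode embeddings and synthesize usage.prompt_tokens if absent. */
function rewriteVoyageResponse(text: string): string {
  const payload = JSON.parse(text);
  const items = Array.isArray(payload.data) ? payload.data : [];
  for (const item of items) {
    if (typeof item.embedding === "string") {
      item.embedding = decodeBase64Embedding(item.embedding);
    }
  }
  if (payload.usage && payload.usage.prompt_tokens === undefined) {
    // Assumption: fall back to 0 if total_tokens is missing as well.
    payload.usage.prompt_tokens = payload.usage.total_tokens ?? 0;
  }
  return JSON.stringify(payload);
}

/** Fetch wrapper combining both directions. */
async function voyageFetch(
  input: Parameters<typeof fetch>[0],
  init?: RequestInit,
): Promise<Response> {
  const headers = new Headers(init?.headers);
  headers.delete("content-length"); // body length changes after the rewrite
  const rawBody = init?.body;
  const body = typeof rawBody === "string" ? rewriteVoyageRequest(rawBody) : rawBody;
  const res = await fetch(input, { ...init, headers, body });
  if (!res.ok) return res; // leave error responses untouched
  const resHeaders = new Headers(res.headers);
  resHeaders.delete("content-length");
  resHeaders.delete("content-encoding"); // res.text() already decoded the body
  return new Response(rewriteVoyageResponse(await res.text()), {
    status: res.status,
    statusText: res.statusText,
    headers: resHeaders,
  });
}
```

Per the description above, a wrapper like this would be passed as the `fetch` option to createOpenAICompatible only when recipe.id === 'voyage', so other providers keep the default fetch.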
The extractor was generating from_slug and the allSlugs lookup set from
`relPath.replace('.md', '')` in 5 places, producing CAPS slugs for files
named ETHOS.md, AGENTS.md, ROADMAP.md, etc.
Pages persist in the DB with lowercase slug (core/sync.ts pathToSlug()
applies .toLowerCase()). The CAPS extractor output mismatched the DB rows,
so INSERT ... JOIN pages ON pages.slug = v.from_slug silently dropped
links from CAPS-named source files. The link batch returned 'inserted'
counts that were lower than the wikilinks actually present, with no error.
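
To make the silent drop concrete, here is a small in-memory model of the join. It is an illustration only: `insertLinks` and the sample data are hypothetical, and the real code runs this as SQL against the pages table.

```typescript
interface LinkCandidate {
  from_slug: string;
  to_slug: string;
}

// Models INSERT ... JOIN pages ON pages.slug = v.from_slug: a candidate whose
// from_slug has no matching page row simply produces no output row, and no error.
function insertLinks(candidates: LinkCandidate[], pageSlugs: Set<string>): LinkCandidate[] {
  return candidates.filter((c) => pageSlugs.has(c.from_slug));
}

// Page rows are stored with lowercase slugs.
const pageSlugs = new Set<string>(["ethos", "agents", "roadmap"]);

// Extractor output before the fix: the CAPS file name leaks into from_slug.
const extracted: LinkCandidate[] = [
  { from_slug: "ETHOS", to_slug: "agents" },
  { from_slug: "roadmap", to_slug: "agents" },
];

const inserted = insertLinks(extracted, pageSlugs);
console.log(`extracted=${extracted.length} inserted=${inserted.length}`);
// prints "extracted=2 inserted=1"
```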
Reproduction (in a brain with CAPS-named canonical docs):
1. echo 'See [agents](agents.md).' > ETHOS.md
2. gbrain put ethos < ETHOS.md # page row: slug='ethos'
3. gbrain extract links --source fs
4. gbrain backlinks agents → [] (expected: contains 'ethos')
Fix: import pathToSlug from core/sync.ts and use it in all 5 sites:
- extractLinksFromFile (line 200): from_slug derivation
- runIncrementalExtractInternal (line 456): allSlugs set
- extractLinksFromDir (line 552): allSlugs set
- timeline loop (line 643): from_slug for timeline entries
- extractLinksForSlugs (line 673): allSlugs set used by sync hook
This single-line-per-site change keeps the extractor consistent with the
sync layer's slug normalization and doesn't introduce any new behavior
for already-lowercase paths (idempotent).
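
The difference between the old derivation and the normalized one can be sketched as below. `pathToSlugSketch` is a hypothetical stand-in, not the real pathToSlug() from core/sync.ts: the only behavior taken from the description is the .toLowerCase() call, and the case-insensitive suffix strip is an assumption.

```typescript
// Old derivation, as quoted in the description above.
function naiveSlug(relPath: string): string {
  return relPath.replace(".md", "");
}

// Hypothetical stand-in for pathToSlug(): strip the extension, then lowercase.
function pathToSlugSketch(relPath: string): string {
  return relPath.replace(/\.md$/i, "").toLowerCase();
}

// The four shapes named in the regression suite: CAPS, mixed-case,
// already-lowercase, and nested path.
const samples = ["ETHOS.md", "Docs/ReadMe.md", "notes/intro.md", "docs/guides/SETUP.md"];
for (const relPath of samples) {
  console.log(`${relPath}: naive=${naiveSlug(relPath)} normalized=${pathToSlugSketch(relPath)}`);
}
```

For an already-lowercase path such as notes/intro.md both functions return the same slug, which is the idempotence property the fix relies on.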
Tests: added 'extractLinksFromFile — slug normalization (T-OBS-1
regression)' suite with 4 cases covering CAPS, mixed-case, idempotent
lowercase, and nested path. Full extract suite (54 → 58 tests) passes.
Reported by Claude Code (Opus 4.7) during Obsidian PKM integration on
the gstack-plan Living Repo, where ~111 wikilinks pointing to ETHOS,
AGENTS, ROADMAP, etc. failed to count toward brain_score (54/100 vs
expected 75+/100). Documented as T-OBS-1 in the consumer's blocked.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Owner
Closing: your fix landed in master via the v0.30.3 fix-wave PR #776 (merged at …). Thank you for the contribution; credit is preserved in the v0.30.3 CHANGELOG entry. 🙏