fix(extract): resolve bare-name body wikilinks via resolver by rayers · Pull Request #1233 · garrytan/gbrain

rayers · 2026-05-20T16:22:31Z

Summary

Body wikilinks in wiki/topic/learning content are silently dropped on every gbrain extract pass. Three layered issues:

1. WIKILINK_RE in src/core/link-extraction.ts is gated on DIR_PATTERN (people|companies|meetings|...). Wiki/topic/learning content uses bare-name wikilinks like [[Fast-Weigh]] or [[2026-05-07-cost-plan]], which fall outside that allow-list. The regex never matched, so body refs were invisible to extract.
2. Body wikilinks that DID match were only resolved when --include-frontmatter was set, because extractPageLinks routed ALL refs (body + frontmatter) through activeResolver, which was set to nullResolver when frontmatter was off. Body refs are already free of the cost concern that gated frontmatter (they surface in markdown the user explicitly typed), so they should always resolve.
3. extract.ts called extractPageLinks with one resolver doing both jobs. Splitting via opts.skipFrontmatter lets the body pass keep the real resolver while frontmatter stays opt-in.
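To make the first issue concrete, here is a minimal sketch of how a directory-gated pattern skips bare names. The pattern below is a hypothetical stand-in built from the three directory names quoted above; the real WIKILINK_RE and DIR_PATTERN in src/core/link-extraction.ts may differ in detail.

```typescript
// Hypothetical stand-in for DIR_PATTERN: only the three directories named in
// the PR description, not the full upstream list.
const DIR_PATTERN_SKETCH = "people|companies|meetings";

// Collects wikilink targets that start with an allow-listed directory.
function dirGatedTargets(body: string): string[] {
  // A fresh regex per call avoids shared lastIndex state on the global flag.
  const re = new RegExp(
    `\\[\\[((?:${DIR_PATTERN_SKETCH})/[^\\]|#]+)(?:#[^\\]|]*)?(?:\\|([^\\]]*))?\\]\\]`,
    "g",
  );
  const targets: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(body)) !== null) {
    targets.push(m[1]);
  }
  return targets;
}

const sampleBody =
  "Met [[people/jane-doe]] about [[Fast-Weigh]] and [[2026-05-07-cost-plan]].";
const gatedTargets = dirGatedTargets(sampleBody);
console.log(gatedTargets); // → [ 'people/jane-doe' ]
```

Only the directory-qualified link survives; both bare names are dropped before any resolver sees them, which matches the symptom described in issue 1.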

What this PR is

This is the wikilink-resolver portion of the original PR #768 (which bundled #767 + #769 + extract polish). The other pieces of that bundle have either been absorbed or split off:

The #767 sync-strategy first-sync fix ("sync --strategy code dropped on first sync via performFullSync") has been absorbed independently by v0.31.2's collectSyncableFiles.
The doctor hint fix (Run: gbrain extract all instead of the gone-since-v0.16 gbrain link-extract && gbrain timeline-extract) has been absorbed by upstream master at doctor.ts:2503.
The #769 chunk-metadata fix ("Code chunks land in DB with NULL language / symbol_name / symbol_type across all languages") is in a separate sibling PR.

This PR carries only the wikilink resolver — no scope overlap with upstream master.

Fixes

link-extraction.ts adds BARE_WIKILINK_RE matching [[<name>(#anchor)?(|display)?]] shapes outside DIR_PATTERN, resolved via the new resolveBareWikilink(name, resolver) that walks fuzzy match → bare-name prefix expansion → exact-slug before giving up. Three new exports: BARE_WIKILINK_RE, resolveBareWikilink, isBareName (regex shape guard for the pre-extract candidate check). extractPageLinks gains an opts.skipFrontmatter parameter — when true, the frontmatter pass is skipped but body wikilinks still resolve through the passed resolver.
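The shape guard and the three-step resolution order described above can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: the resolver interface (fuzzy, hasSlug), the directory list, the slug normalization, and every name ending in Sketch are hypothetical.

```typescript
// Assumed resolver shape; the real resolver interface in gbrain may differ.
interface SketchResolver {
  fuzzy(query: string): string | null; // fuzzy lookup, null when nothing matches
  hasSlug(slug: string): boolean;      // exact slug existence check
}

// Assumed subset of the directories in DIR_PATTERN.
const SKETCH_DIRS = ["people", "companies", "meetings"];

// Shape guard: a bare name carries no directory separator and no wikilink punctuation.
function isBareNameSketch(candidate: string): boolean {
  return /^[^\/\[\]|#]+$/.test(candidate.trim());
}

// Matches [[name]], [[name#anchor]], [[name|display]] and [[name#anchor|display]].
function parseBareWikilinks(body: string): Array<{ target: string; display?: string }> {
  const re = /\[\[([^\[\]|#\/]+)(?:#[^\[\]|]*)?(?:\|([^\[\]]*))?\]\]/g;
  const refs: Array<{ target: string; display?: string }> = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(body)) !== null) {
    refs.push({ target: m[1].trim(), display: m[2] });
  }
  return refs;
}

// Walks fuzzy match, then directory-prefix expansion, then exact slug.
function resolveBareSketch(candidate: string, resolver: SketchResolver): string | null {
  if (!isBareNameSketch(candidate)) return null;
  const fuzzyHit = resolver.fuzzy(candidate);
  if (fuzzyHit !== null) return fuzzyHit;
  const slug = candidate.trim().toLowerCase().replace(/\s+/g, "-");
  for (const dir of SKETCH_DIRS) {
    const expanded = `${dir}/${slug}`;
    if (resolver.hasSlug(expanded)) return expanded;
  }
  return resolver.hasSlug(slug) ? slug : null;
}

// In-memory resolver for demonstration; the fuzzy step is stubbed out.
const knownSlugs = new Set(["companies/fast-weigh", "2026-05-07-cost-plan"]);
const memoryResolver: SketchResolver = {
  fuzzy: () => null,
  hasSlug: (slug) => knownSlugs.has(slug),
};

console.log(resolveBareSketch("Fast-Weigh", memoryResolver));           // → companies/fast-weigh
console.log(resolveBareSketch("2026-05-07-cost-plan", memoryResolver)); // → 2026-05-07-cost-plan
```

Directory-qualified targets such as people/jane fail the shape guard here, so a separate directory-gated pass would remain responsible for them.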

extract.ts threads the always-on resolver (not the conditional nullResolver) into extractPageLinks for the body pass, with opts.skipFrontmatter wired off --include-frontmatter.
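A minimal sketch of that wiring, assuming a simplified page shape with pre-parsed refs. The types and function names here (SketchPage, extractPageLinksSketch, runExtractSketch) are hypothetical; only the opts.skipFrontmatter idea and the link to --include-frontmatter come from the description above.

```typescript
type ResolveFn = (ref: string) => string | null;

// Assumed page shape: refs are already split by where they were found.
interface SketchPage {
  frontmatterRefs: string[];
  bodyRefs: string[];
}

interface SketchExtractOpts {
  skipFrontmatter?: boolean;
}

// Body refs always go through the real resolver; frontmatter refs only
// participate when skipFrontmatter is not set.
function extractPageLinksSketch(
  page: SketchPage,
  resolve: ResolveFn,
  opts: SketchExtractOpts = {},
): string[] {
  const refs = opts.skipFrontmatter
    ? page.bodyRefs
    : page.frontmatterRefs.concat(page.bodyRefs);
  const resolved: string[] = [];
  for (const ref of refs) {
    const slug = resolve(ref);
    if (slug !== null) resolved.push(slug);
  }
  return resolved;
}

// Caller side: the flag only toggles the frontmatter pass, never the resolver.
function runExtractSketch(
  page: SketchPage,
  resolve: ResolveFn,
  includeFrontmatter: boolean,
): string[] {
  return extractPageLinksSketch(page, resolve, { skipFrontmatter: !includeFrontmatter });
}

const slugTable: { [ref: string]: string } = {
  "Acme": "companies/acme",
  "Fast-Weigh": "companies/fast-weigh",
};
const tableResolver: ResolveFn = (ref) => (ref in slugTable ? slugTable[ref] : null);
const demoPage: SketchPage = { frontmatterRefs: ["Acme"], bodyRefs: ["Fast-Weigh"] };

console.log(runExtractSketch(demoPage, tableResolver, false)); // → [ 'companies/fast-weigh' ]
console.log(runExtractSketch(demoPage, tableResolver, true));  // → [ 'companies/acme', 'companies/fast-weigh' ]
```

The point of the split is that turning frontmatter off no longer swaps in a null resolver, so body links keep resolving in both modes.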

Tests

test/link-extraction.test.ts: 75 lines covering BARE_WIKILINK_RE shape (anchor + display variants), resolveBareWikilink fuzzy + prefix + exact paths, isBareName negative cases (DIR_PATTERN prefixes still rejected), and extractPageLinks integration with opts.skipFrontmatter under both modes.

Local: bun run verify clean, bun test test/link-extraction.test.ts → 103 pass / 0 fail.

Scope note

The FS-source path (extractLinksFromDir) is NOT updated. It uses a different codepath via extractMarkdownLinks + resolveSlug; bare-name wikilinks in FS mode still won't resolve. Most users are on --source db (autopilot uses it); FS is for offline Obsidian-vault mode. Separate concern.

Test plan

bun run verify clean
bun test test/link-extraction.test.ts → 103/0/0
bun run typecheck clean
bun run test:e2e (gated on DATABASE_URL)
Manual verification on a real wiki corpus — gbrain extract links produces non-zero link counts on pages using [[Bare-Name]] shapes

🤖 Generated with Claude Code


…loses #972) (#1388)

* v0.40.8.2 fix(extract): opt-in global-basename wikilink resolution (#972)

Bare wikilinks like [[struktura]] that point at pages in another folder were silently dropped from the graph. The issue reporter saw 71 wikilinks in Obsidian render to 12 in gbrain (~83% lost). Symptoms downstream: `gbrain graph` returns thin neighborhoods, `gbrain backlinks` undercounts.

This release adds an opt-in mode that resolves bare wikilinks by basename match, covers all three resolver surfaces (FS-source extract, DB-source extract, put_page auto-link), and emits one edge per match, with no silent winner on ambiguity. `gbrain doctor` surfaces a paste-ready enable hint when ≥5 bare wikilinks would resolve under the new mode.

Enable with:

    gbrain config set link_resolution.global_basename true
    gbrain extract links

Default stays off. Existing brains see zero behavior change on upgrade.

Closes #972. Adapts PR #1233 from @rayers (regex shape + slug-tail index) into a multi-match, opt-in form with FS-source coverage that the original PR explicitly skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: document opt-in global-basename wikilink resolution (#972)

The #972 feature shipped with no user-facing docs, only CHANGELOG + CLAUDE.md. Anyone migrating an Obsidian/Notion vault with bare [[name]] wikilinks couldn't discover the link_resolution.global_basename flag unless gbrain doctor happened to surface its hint.

- README "Self-wiring knowledge graph": one sentence on the opt-in mode for Obsidian-style cross-folder bare wikilinks + the doctor pre-check, linking to the install step.
- INSTALL_FOR_AGENTS Step 4.5 (Wire the Knowledge Graph): a dedicated agent-facing subsection covering when bare [[name]] links need it, the enable command, re-running extract, the doctor opportunity hint, and the multi-match behavior.
- Regenerated llms-full.txt.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(#972): resolve aliased wikilinks by target slug, not display text

Codex outside-voice [P1]: `[[struktura|the project]]` resolved the basename "the project" (the alias) instead of `struktura` (the target), because extractPageLinks called resolveBasenameMatches(ref.name) and the doctor check keyed basenameIndex.get(e.name). ref.name is the display alias (match[2]); ref.slug is the wikilink target (match[1]).

- extractPageLinks resolves ref.slug; context excerpt locates ref.slug.
- doctor link_resolution_opportunity keys e.slug so its estimate matches what extraction actually resolves.
- Test: aliased wikilink calls resolveBasenameMatches with the target, never the display text.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(#972): reconcile wikilink-resolved edges in put_page auto-link

Codex outside-voice [P1]: put_page's reconcilableOut filter excluded link_source='wikilink-resolved', so a basename edge written by auto-link survived after the bare wikilink was deleted from the page OR the link_resolution.global_basename flag was turned off (the stale-removal loop only iterates reconcilableOut). Add 'wikilink-resolved' to the reconcilable set; manual edges still untouched.

Test: write page with [[struktura]] (flag on) → edge lands; re-put without the wikilink → edge reconciled away.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(#972): source-scope basename resolution (no cross-source edges)

Codex outside-voice [P1]: makeResolver.resolveBasenameMatches called engine.getAllSlugs() unscoped, so a bare [[name]] could resolve to a same-tail page in a DIFFERENT source and create a cross-source edge. The engine exposes getAllSlugs({sourceId}) precisely to prevent this. #972 is "global basename across folders," not "cross-source federation" (the canonical gbrain multi-source bug class).

- makeResolver gains opts.sourceId; ensureBasenameIndex passes it to getAllSlugs (unscoped only when sourceId omitted, for back-compat).
- runAutoLink (put_page) passes opts.sourceId; extractLinksFromDB passes sourceIdFilter. FS extract is already single-source (walks one dir).
- Tests: scoped index returns only the source's slugs (no cross-source); unscoped call stays brain-wide.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(#972): FS-source basename edges carry link_source='wikilink-resolved'

The FS extract path is the issue's default repro (gbrain extract links with no --source db). ExtractedLink had no link_source field, so FS basename edges landed with the engine default ('markdown') instead of the 'wikilink-resolved' provenance the DB / put_page paths set and the docs promise. The e2e FS test only asserted link_type, so it was blind to this.

- ExtractedLink gains link_source?; extractLinksFromFile sets it to 'wikilink-resolved' on basename edges (undefined for ordinary markdown).
- Carries through the addLinksBatch snapshots automatically (LinkBatchInput already has link_source); single-row addLink fallback now passes it too.
- e2e FS repro asserts link_source === 'wikilink-resolved'.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(#972): one shared basename matcher across resolver/FS/doctor

Codex outside-voice [P2] DRY: three surfaces each hand-rolled a basename matcher with divergent key sets. The doctor omitted the slugified key, so its link_resolution_opportunity estimate undercounted what extraction resolves, and the resolver returned matches in unsorted getAllSlugs bucket order. New shared exports in link-extraction.ts: buildBasenameIndex(slugs) + queryBasenameIndex(index, name) (keys raw/lower/slugified tail; stable sort shorter-first then lexical) + normalizeBasename.

- makeResolver.resolveBasenameMatches → queryBasenameIndex (now stable-sorted).
- extract.ts resolveBasenameMatchesFromSlugs → delegates to the shared pair.
- doctor link_resolution_opportunity → shared builder/query (slugified key added; estimate now matches extraction).
- Test: doctor counts a slugified-only match ([[Fast Weigh]] → companies/fast-weigh).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(#972): P2 cluster: masking, code-fence, self-link, dedup decision

Codex outside-voice P2 findings:

- P2a markdown-label masking: a wikilink inside a markdown-link label ([see [[acme]]](companies/acme.md)) spawned a stray generic basename ref. Pass-1 can't match the nested brackets, so a new MARKDOWN_LABEL_WIKILINK_RE masks those spans out of pass 2c. Inner [[acme]] is now inert.
- P2b FS code-fence: the FS path (extractMarkdownLinks on raw content) didn't strip code blocks like the DB path. extractLinksFromFile now scans stripCodeBlocks(content) so [[name]] inside a fence creates no FS edge.
- P2c self-link guard: a basename [[own-tail]] on its own page resolved back to itself. Dropped in both extractPageLinks and the FS path.
- P2d dedup: documented the decision to KEEP qualified + bare edges to the same target as separate rows (distinct provenance/audit trail).
- P2e: skipFrontmatter unresolved-contract tests added.

Tests: P2a inert-label, P2c self-link drop, P2b code-fence, P2e unresolved.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* perf(#972): bound the doctor link_resolution_opportunity scan

The check did listAllPageRefs() + a getPage() per page under a 60s budget. On a large brain (the eng-review concern) it hit the budget every non-fast doctor run and returned a perpetual partial, adding ~60s. Now batch-loads the 1000 most-recent pages in ONE query (ORDER BY id DESC LIMIT SAMPLE_LIMIT) and scans in memory, with the 60s cap kept as a backstop. Mirrors the v0.40.9 sampling convention. The estimate message names the bound when the brain exceeds the sample ("scanned the 1000 most-recent of N pages").

Test: source-grep pins the bounded query + the absence of the per-page getPage walk.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(#972): reconcile stale version/migration references to v112 / 0.42.6.0

Merge churn left intermediate refs: schema.sql + schema-embedded.ts said "migration v93", CLAUDE.md said "v0.41.32.0 / Migration v109", CHANGELOG said "Migration v93". Reconciled all to migration v112 / shipping 0.42.6.0. The CLAUDE.md annotation is also refreshed to describe the final behavior (shared matcher, source-scoping, alias-by-target, stale-edge reconciliation, bounded doctor scan) and credit @rayers + @ukd1. Regenerated schema-embedded + llms.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(#972): register doctor check category + bump llms budget to 800KB

Two full-suite gate failures from the re-sync:

- doctor-categories drift guard: the new `link_resolution_opportunity` check wasn't in any category set. Added to BRAIN_CHECK_NAMES (alongside graph_coverage / orphan_ratio, since it's a graph-quality signal).
- build-llms size budget: the #972 Key Files annotation (landing with master's #1696/#1699 waves) pushed llms-full.txt past 750KB. Bumped FULL_SIZE_BUDGET 750KB→800KB, the established "budget tracks CLAUDE.md's legitimate per-feature growth" pattern (600→700→750→800 across releases).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Garry Tan <garrytan@gmail.com>
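The shared matcher described in the #1388 commit message (keys on the raw, lowercased, and slugified tail; results sorted shorter-first then lexically; aliased links resolved by target) might look roughly like the sketch below. Only the behavior is taken from the commit text; the function bodies, the slugify rule, and all names ending in Sketch are assumptions, not the upstream implementation.

```typescript
// Assumed slugify rule: lowercase, non-alphanumerics collapsed to hyphens.
function slugifySketch(text: string): string {
  return text.trim().toLowerCase().replace(/[^a-z0-9]+/g, "-").replace(/^-+|-+$/g, "");
}

// The tail is the last path segment of a slug ("companies/acme" -> "acme").
function slugTail(slug: string): string {
  const cut = slug.lastIndexOf("/");
  return cut === -1 ? slug : slug.slice(cut + 1);
}

// Index every slug under its raw, lowercased, and slugified tail.
function buildBasenameIndexSketch(slugs: string[]): Map<string, string[]> {
  const index = new Map<string, string[]>();
  for (const slug of slugs) {
    const tail = slugTail(slug);
    for (const key of [tail, tail.toLowerCase(), slugifySketch(tail)]) {
      const bucket = index.get(key) ?? [];
      if (bucket.indexOf(slug) === -1) bucket.push(slug);
      index.set(key, bucket);
    }
  }
  return index;
}

// Return ALL matches (one edge per match), shorter slugs first, then lexical.
function queryBasenameIndexSketch(index: Map<string, string[]>, basename: string): string[] {
  const found: string[] = [];
  for (const key of [basename, basename.toLowerCase(), slugifySketch(basename)]) {
    for (const slug of index.get(key) ?? []) {
      if (found.indexOf(slug) === -1) found.push(slug);
    }
  }
  return found.sort((a, b) => a.length - b.length || (a < b ? -1 : a > b ? 1 : 0));
}

// "[[target|alias]]" resolves by target, never by the display alias.
function wikilinkTargetSketch(raw: string): string {
  const m = /^\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]$/.exec(raw);
  return m ? m[1].trim() : "";
}

const basenameIndex = buildBasenameIndexSketch([
  "projects/archive/fast-weigh",
  "companies/fast-weigh",
  "projects/struktura",
]);

console.log(queryBasenameIndexSketch(basenameIndex, "Fast Weigh"));
// → [ 'companies/fast-weigh', 'projects/archive/fast-weigh' ]
console.log(queryBasenameIndexSketch(basenameIndex, wikilinkTargetSketch("[[struktura|the project]]")));
// → [ 'projects/struktura' ]
```

Returning every match rather than a single winner mirrors the "no silent winner on ambiguity" behavior; source scoping would amount to building the index from one source's slugs only.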

rayers · 2026-06-03T07:52:35Z

Superseded by #1388 (merged), which is upstream's adoption of this PR's kernel — the generic wikilink regex + slug-tail index pattern — reworked as opt-in via link_resolution.global_basename, with multi-match resolution and FS-source coverage. Thanks for taking it forward; closing this as done-a-different-way.

This was referenced May 20, 2026

fix(sync+embed+extract): code-symbol ingest + graph extraction (#767 + #769 + extract follow-ups) #768

Closed

Make entity link directories configurable #632

Closed

ukd1 mentioned this pull request May 24, 2026

v0.42.10.0 feat(extract): opt-in global-basename wikilink resolution (closes #972) #1388

Merged

rayers force-pushed the fix/extract-bare-name-wikilinks branch from 596705c to 5b1d2e3 on May 26, 2026 11:11

rayers force-pushed the fix/extract-bare-name-wikilinks branch from 5b1d2e3 to 55232a9 on May 28, 2026 05:59

rayers closed this Jun 3, 2026

rayers wants to merge 1 commit into garrytan:master from rayers:fix/extract-bare-name-wikilinks
