fix(extract): resolve bare-name body wikilinks via resolver #1233
Closed
rayers wants to merge 1 commit into
This was referenced May 20, 2026
596705c to 5b1d2e3
Body wikilinks in wiki/topic/learning content are silently dropped on every `gbrain extract` pass. Three layered issues:

1. WIKILINK_RE in src/core/link-extraction.ts is gated on DIR_PATTERN (people|companies|meetings|...). Wiki/topic/learning content uses bare-name wikilinks like `[[Fast-Weigh]]` or `[[2026-05-07-cost-plan]]`, which fall outside that allow-list. The regex never matched, so body refs were invisible to extract.
2. Body wikilinks that DID match were only resolved when `--include-frontmatter` was set, because extractPageLinks routed ALL refs (body + frontmatter) through `activeResolver`, which was set to nullResolver when frontmatter was off. Body refs are already free of the cost concern that gated frontmatter (they surface in markdown the user explicitly typed), so they should always resolve.
3. extract.ts called extractPageLinks with one resolver doing both jobs. Splitting via opts.skipFrontmatter lets the body pass keep the real resolver while frontmatter stays opt-in.

Fixes:
- link-extraction.ts adds BARE_WIKILINK_RE matching `[[<name>(#anchor)?(|display)?]]` shapes outside DIR_PATTERN, resolved via the new `resolveBareWikilink(name, resolver)` that walks fuzzy match + bare-name prefix expansion + exact-slug before giving up. Three new exports: BARE_WIKILINK_RE, resolveBareWikilink, isBareName (regex shape guard for the pre-extract candidate check). extractPageLinks gains an opts.skipFrontmatter parameter: when true, the frontmatter pass is skipped but body wikilinks still resolve through the passed resolver.
- extract.ts threads the always-on `resolver` (not the conditional nullResolver) into extractPageLinks for the body pass, with opts.skipFrontmatter wired off `--include-frontmatter`.
- test/link-extraction.test.ts: 75 lines covering BARE_WIKILINK_RE shape (anchor + display variants), resolveBareWikilink fuzzy + prefix + exact paths, isBareName negative cases (DIR_PATTERN prefixes still rejected), and extractPageLinks integration with opts.skipFrontmatter under both modes.

Scope note: this PR is the wikilink resolver portion of the original PR garrytan#768 wave. The doctor.ts hint fix that was also in that wave has been absorbed by upstream master independently (doctor.ts:2503 now correctly says `Run: gbrain extract all`). This PR carries only the wikilink resolver, with no overlap with upstream.

FS-source path (extractLinksFromDir) NOT updated. It uses a different codepath via extractMarkdownLinks + resolveSlug; bare-name wikilinks in FS mode still won't resolve. Most users are on --source db (autopilot uses it); FS is for offline Obsidian-vault mode. Separate concern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
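To make the `[[<name>(#anchor)?(|display)?]]` shape concrete, here is a minimal sketch of a bare-wikilink matcher with a shape guard. It is not the PR's actual `BARE_WIKILINK_RE` or `isBareName`: the regex, the `*Sketch` names, and the three-entry directory list are assumptions for illustration (the commit message elides the rest of `DIR_PATTERN`).

```typescript
// Hypothetical sketch; not the PR's actual BARE_WIKILINK_RE or isBareName.

// Assumed directory allow-list: the source names people|companies|meetings
// and elides the rest, so only those three appear here.
const DIR_QUALIFIED_SKETCH_RE = /^(?:people|companies|meetings)\//;

// Matches [[name]], [[name#anchor]], [[name|display]], [[name#anchor|display]].
const BARE_WIKILINK_SKETCH_RE =
  /\[\[([^\[\]|#]+)(?:#([^\[\]|]*))?(?:\|([^\[\]]*))?\]\]/g;

interface BareRefSketch {
  name: string; // link target, never the display text
  anchor?: string;
  display?: string;
}

// Shape guard: non-empty and not prefixed by a known directory.
function isBareNameSketch(name: string): boolean {
  const trimmed = name.trim();
  return trimmed.length > 0 && !DIR_QUALIFIED_SKETCH_RE.test(trimmed);
}

function extractBareRefsSketch(body: string): BareRefSketch[] {
  const refs: BareRefSketch[] = [];
  // Fresh regex per call so the global flag's lastIndex state never leaks.
  const re = new RegExp(BARE_WIKILINK_SKETCH_RE.source, "g");
  let m: RegExpExecArray | null;
  while ((m = re.exec(body)) !== null) {
    const name = m[1].trim();
    if (!isBareNameSketch(name)) continue; // directory-qualified refs are left to the other pass
    const ref: BareRefSketch = { name };
    if (m[2] !== undefined && m[2].trim() !== "") ref.anchor = m[2].trim();
    if (m[3] !== undefined && m[3].trim() !== "") ref.display = m[3].trim();
    refs.push(ref);
  }
  return refs;
}

const sampleBody =
  "See [[Fast-Weigh]], [[2026-05-07-cost-plan#Budget|the plan]] and [[people/alice]].";
console.log(extractBareRefsSketch(sampleBody));
```

In this sketch the directory-qualified `[[people/alice]]` is skipped by the guard, so only the two bare names are returned for resolution.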
5b1d2e3 to 55232a9
garrytan added a commit that referenced this pull request Jun 2, 2026
…loses #972) (#1388)

* v0.40.8.2 fix(extract): opt-in global-basename wikilink resolution (#972)

Bare wikilinks like [[struktura]] that point at pages in another folder were silently dropped from the graph. The issue reporter saw 71 wikilinks in Obsidian render to 12 in gbrain (~83% lost). Symptoms downstream: `gbrain graph` returns thin neighborhoods, `gbrain backlinks` undercounts.

This release adds an opt-in mode that resolves bare wikilinks by basename match, covers all three resolver surfaces (FS-source extract, DB-source extract, put_page auto-link), and emits one edge per match, with no silent winner on ambiguity. `gbrain doctor` surfaces a paste-ready enable hint when ≥5 bare wikilinks would resolve under the new mode.

Enable with:

    gbrain config set link_resolution.global_basename true
    gbrain extract links

Default stays off. Existing brains see zero behavior change on upgrade.

Closes #972. Adapts PR #1233 from @rayers (regex shape + slug-tail index) into a multi-match, opt-in form with FS-source coverage that the original PR explicitly skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: document opt-in global-basename wikilink resolution (#972)

The #972 feature shipped with no user-facing docs, only CHANGELOG + CLAUDE.md. Anyone migrating an Obsidian/Notion vault with bare [[name]] wikilinks couldn't discover the link_resolution.global_basename flag unless gbrain doctor happened to surface its hint.

- README "Self-wiring knowledge graph": one sentence on the opt-in mode for Obsidian-style cross-folder bare wikilinks + the doctor pre-check, linking to the install step.
- INSTALL_FOR_AGENTS Step 4.5 (Wire the Knowledge Graph): a dedicated agent-facing subsection covering when bare [[name]] links need it, the enable command, re-running extract, the doctor opportunity hint, and the multi-match behavior.
- Regenerated llms-full.txt.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(#972): resolve aliased wikilinks by target slug, not display text

Codex outside-voice [P1]: `[[struktura|the project]]` resolved the basename "the project" (the alias) instead of `struktura` (the target), because extractPageLinks called resolveBasenameMatches(ref.name) and the doctor check keyed basenameIndex.get(e.name). ref.name is the display alias (match[2]); ref.slug is the wikilink target (match[1]).

- extractPageLinks resolves ref.slug; context excerpt locates ref.slug.
- doctor link_resolution_opportunity keys e.slug so its estimate matches what extraction actually resolves.
- Test: aliased wikilink calls resolveBasenameMatches with the target, never the display text.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(#972): reconcile wikilink-resolved edges in put_page auto-link

Codex outside-voice [P1]: put_page's reconcilableOut filter excluded link_source='wikilink-resolved', so a basename edge written by auto-link survived after the bare wikilink was deleted from the page OR the link_resolution.global_basename flag was turned off (the stale-removal loop only iterates reconcilableOut). Add 'wikilink-resolved' to the reconcilable set; manual edges still untouched.

Test: write page with [[struktura]] (flag on) → edge lands; re-put without the wikilink → edge reconciled away.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(#972): source-scope basename resolution (no cross-source edges)

Codex outside-voice [P1]: makeResolver.resolveBasenameMatches called engine.getAllSlugs() unscoped, so a bare [[name]] could resolve to a same-tail page in a DIFFERENT source and create a cross-source edge. The engine exposes getAllSlugs({sourceId}) precisely to prevent this. #972 is "global basename across folders," not "cross-source federation" (the canonical gbrain multi-source bug class).
- makeResolver gains opts.sourceId; ensureBasenameIndex passes it to getAllSlugs (unscoped only when sourceId omitted, for back-compat).
- runAutoLink (put_page) passes opts.sourceId; extractLinksFromDB passes sourceIdFilter. FS extract is already single-source (walks one dir).
- Tests: scoped index returns only the source's slugs (no cross-source); unscoped call stays brain-wide.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(#972): FS-source basename edges carry link_source='wikilink-resolved'

The FS extract path is the issue's default repro (gbrain extract links with no --source db). ExtractedLink had no link_source field, so FS basename edges landed with the engine default ('markdown') instead of the 'wikilink-resolved' provenance the DB / put_page paths set and the docs promise. The e2e FS test only asserted link_type, so it was blind to this.

- ExtractedLink gains link_source?; extractLinksFromFile sets it to 'wikilink-resolved' on basename edges (undefined for ordinary markdown).
- Carries through the addLinksBatch snapshots automatically (LinkBatchInput already has link_source); single-row addLink fallback now passes it too.
- e2e FS repro asserts link_source === 'wikilink-resolved'.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(#972): one shared basename matcher across resolver/FS/doctor

Codex outside-voice [P2] DRY: three surfaces each hand-rolled a basename matcher with divergent key sets. The doctor omitted the slugified key, so its link_resolution_opportunity estimate undercounted what extraction resolves, and the resolver returned matches in unsorted getAllSlugs bucket order. New shared exports in link-extraction.ts: buildBasenameIndex(slugs) + queryBasenameIndex(index, name) (keys raw/lower/slugified tail; stable sort shorter-first then lexical) + normalizeBasename.

- makeResolver.resolveBasenameMatches → queryBasenameIndex (now stable-sorted).
- extract.ts resolveBasenameMatchesFromSlugs → delegates to the shared pair.
- doctor link_resolution_opportunity → shared builder/query (slugified key added; estimate now matches extraction).
- Test: doctor counts a slugified-only match ([[Fast Weigh]] → companies/fast-weigh).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(#972): P2 cluster (masking, code-fence, self-link, dedup decision)

Codex outside-voice P2 findings:

- P2a markdown-label masking: a wikilink inside a markdown-link label ([see [[acme]]](companies/acme.md)) spawned a stray generic basename ref. Pass-1 can't match the nested brackets, so a new MARKDOWN_LABEL_WIKILINK_RE masks those spans out of pass 2c. Inner [[acme]] is now inert.
- P2b FS code-fence: the FS path (extractMarkdownLinks on raw content) didn't strip code blocks like the DB path. extractLinksFromFile now scans stripCodeBlocks(content) so [[name]] inside a fence creates no FS edge.
- P2c self-link guard: a basename [[own-tail]] on its own page resolved back to itself. Dropped in both extractPageLinks and the FS path.
- P2d dedup: documented the decision to KEEP qualified + bare edges to the same target as separate rows (distinct provenance/audit trail).
- P2e: skipFrontmatter unresolved-contract tests added.

Tests: P2a inert-label, P2c self-link drop, P2b code-fence, P2e unresolved.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* perf(#972): bound the doctor link_resolution_opportunity scan

The check did listAllPageRefs() + a getPage() per page under a 60s budget. On a large brain (the eng-review concern) it hit the budget every non-fast doctor run and returned a perpetual partial, adding ~60s. Now batch-loads the 1000 most-recent pages in ONE query (ORDER BY id DESC LIMIT SAMPLE_LIMIT) and scans in memory, with the 60s cap kept as a backstop. Mirrors the v0.40.9 sampling convention.
The estimate message names the bound when the brain exceeds the sample ("scanned the 1000 most-recent of N pages"). Test: source-grep pins the bounded query + the absence of the per-page getPage walk.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(#972): reconcile stale version/migration references to v112 / 0.42.6.0

Merge churn left intermediate refs: schema.sql + schema-embedded.ts said "migration v93", CLAUDE.md said "v0.41.32.0 / Migration v109", CHANGELOG said "Migration v93". Reconciled all to migration v112 / shipping 0.42.6.0. The CLAUDE.md annotation is also refreshed to describe the final behavior (shared matcher, source-scoping, alias-by-target, stale-edge reconciliation, bounded doctor scan) and credit @rayers + @ukd1. Regenerated schema-embedded + llms.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(#972): register doctor check category + bump llms budget to 800KB

Two full-suite gate failures from the re-sync:

- doctor-categories drift guard: the new `link_resolution_opportunity` check wasn't in any category set. Added to BRAIN_CHECK_NAMES (alongside graph_coverage / orphan_ratio; it's a graph-quality signal).
- build-llms size budget: the #972 Key Files annotation (landing with master's #1696/#1699 waves) pushed llms-full.txt past 750KB. Bumped FULL_SIZE_BUDGET 750KB→800KB, the established "budget tracks CLAUDE.md's legitimate per-feature growth" pattern (600→700→750→800 across releases).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Garry Tan <garrytan@gmail.com>
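The shared basename matcher described in the refactor commit (keys on the raw, lowercased, and slugified tail; results sorted shorter-first, then lexically; one result per match; self-links dropped) can be sketched roughly as below. This is an illustration only: the `*Sketch` function names, the slugify rule, and the optional `selfSlug` parameter are assumptions, not the actual `buildBasenameIndex` / `queryBasenameIndex` / `normalizeBasename` exports.

```typescript
// Hypothetical sketch of a slug-tail (basename) index. Names and the
// slugify rule are assumptions, not the project's actual exports.
type BasenameIndexSketch = { [key: string]: string[] };

// Assumed normalization: lowercase, non-alphanumeric runs become "-".
function slugifySketch(s: string): string {
  return s.trim().toLowerCase().replace(/[^a-z0-9]+/g, "-").replace(/^-+|-+$/g, "");
}

// Last path segment of a slug, e.g. "companies/fast-weigh" gives "fast-weigh".
function tailOfSketch(slug: string): string {
  return slug.slice(slug.lastIndexOf("/") + 1);
}

// The three lookup keys (raw, lower, slugified), de-duplicated.
function keysForSketch(name: string): string[] {
  const keys: string[] = [];
  [name, name.toLowerCase(), slugifySketch(name)].forEach((k) => {
    if (k !== "" && keys.indexOf(k) === -1) keys.push(k);
  });
  return keys;
}

function buildIndexSketch(slugs: string[]): BasenameIndexSketch {
  const index: BasenameIndexSketch = Object.create(null);
  slugs.forEach((slug) => {
    keysForSketch(tailOfSketch(slug)).forEach((key) => {
      const bucket = index[key] || (index[key] = []);
      if (bucket.indexOf(slug) === -1) bucket.push(slug);
    });
  });
  return index;
}

// Returns every matching slug (no silent winner), minus the page's own slug.
function queryIndexSketch(
  index: BasenameIndexSketch,
  name: string,
  selfSlug?: string
): string[] {
  const hits: string[] = [];
  keysForSketch(name).forEach((key) => {
    (index[key] || []).forEach((slug) => {
      if (slug !== selfSlug && hits.indexOf(slug) === -1) hits.push(slug);
    });
  });
  // Shorter slugs first, then lexical order, so output is deterministic.
  return hits.sort((a, b) => a.length - b.length || (a < b ? -1 : a > b ? 1 : 0));
}

const demoIndex = buildIndexSketch([
  "companies/fast-weigh",
  "projects/struktura",
  "archive/2024/struktura",
]);
console.log(queryIndexSketch(demoIndex, "struktura"));
console.log(queryIndexSketch(demoIndex, "Fast Weigh"));
```

Because every surface would query the same index with the same keys, an estimate computed by one caller matches what another caller actually resolves, which is the drift the refactor commit set out to remove.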
Author
Superseded by #1388 (merged), which is upstream's adoption of this PR's kernel (the generic wikilink regex + slug-tail index pattern), reworked as opt-in via …
Summary
Body wikilinks in wiki/topic/learning content are silently dropped on every `gbrain extract` pass. Three layered issues:

1. `WIKILINK_RE` in `src/core/link-extraction.ts` is gated on `DIR_PATTERN` (`people|companies|meetings|...`). Wiki/topic/learning content uses bare-name wikilinks like `[[Fast-Weigh]]` or `[[2026-05-07-cost-plan]]`, which fall outside that allow-list. The regex never matched, so body refs were invisible to extract.
2. Body wikilinks that DID match were only resolved when `--include-frontmatter` was set, because `extractPageLinks` routed ALL refs (body + frontmatter) through `activeResolver`, which was set to `nullResolver` when frontmatter was off. Body refs are already free of the cost concern that gated frontmatter (they surface in markdown the user explicitly typed), so they should always resolve.
3. `extract.ts` called `extractPageLinks` with one resolver doing both jobs. Splitting via `opts.skipFrontmatter` lets the body pass keep the real resolver while frontmatter stays opt-in.

What this PR is
This is the wikilink-resolver portion of the original PR #768 (which bundled #767 + #769 + extract polish). Two pieces of that bundle have either been absorbed or split off:
- […] `collectSyncableFiles` independently.
- The `doctor.ts` hint fix (`Run: gbrain extract all` instead of the gone-since-v0.16 `gbrain link-extract && gbrain timeline-extract`) has been absorbed by upstream master at `doctor.ts:2503`.

This PR carries only the wikilink resolver, with no scope overlap with upstream master.
Fixes
- `link-extraction.ts` adds `BARE_WIKILINK_RE` matching `[[<name>(#anchor)?(|display)?]]` shapes outside `DIR_PATTERN`, resolved via the new `resolveBareWikilink(name, resolver)` that walks fuzzy match → bare-name prefix expansion → exact-slug before giving up. Three new exports: `BARE_WIKILINK_RE`, `resolveBareWikilink`, `isBareName` (regex shape guard for the pre-extract candidate check). `extractPageLinks` gains an `opts.skipFrontmatter` parameter: when true, the frontmatter pass is skipped but body wikilinks still resolve through the passed resolver.
- `extract.ts` threads the always-on `resolver` (not the conditional `nullResolver`) into `extractPageLinks` for the body pass, with `opts.skipFrontmatter` wired off `--include-frontmatter`.

Tests
- `test/link-extraction.test.ts`: 75 lines covering `BARE_WIKILINK_RE` shape (anchor + display variants), `resolveBareWikilink` fuzzy + prefix + exact paths, `isBareName` negative cases (`DIR_PATTERN` prefixes still rejected), and `extractPageLinks` integration with `opts.skipFrontmatter` under both modes.

Local: `bun run verify` clean, `bun test test/link-extraction.test.ts` → 103 pass / 0 fail.

Scope note
The FS-source path (`extractLinksFromDir`) is NOT updated. It uses a different codepath via `extractMarkdownLinks` + `resolveSlug`; bare-name wikilinks in FS mode still won't resolve. Most users are on `--source db` (autopilot uses it); FS is for offline Obsidian-vault mode. Separate concern.

Test plan
- `bun run verify` clean
- `bun test test/link-extraction.test.ts` → 103/0/0
- `bun run typecheck` clean
- `bun run test:e2e` (gated on DATABASE_URL)
- `gbrain extract links` produces non-zero link counts on pages using `[[Bare-Name]]` shapes

🤖 Generated with Claude Code
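The resolver split described under Fixes (body wikilinks always go through the real resolver, the frontmatter pass stays opt-in) might look roughly like the sketch below. Everything here is assumed for illustration: the `PageSketch` shape, the `ResolverSketch` type, and both function names are hypothetical stand-ins, not the real `extractPageLinks` signature.

```typescript
// Hypothetical sketch of the resolver split; the page shape, resolver type,
// and function names are assumptions, not the project's real signatures.
type ResolverSketch = (name: string) => string | null;

interface PageSketch {
  body: string;
  frontmatterRefs: string[]; // assumed: names already parsed out of frontmatter
}

interface ExtractOptsSketch {
  skipFrontmatter?: boolean;
}

function extractPageLinksSketch(
  page: PageSketch,
  resolver: ResolverSketch,
  opts: ExtractOptsSketch = {}
): string[] {
  const links: string[] = [];
  const resolveAndPush = (name: string): void => {
    const slug = resolver(name.trim());
    if (slug !== null && links.indexOf(slug) === -1) links.push(slug);
  };

  // Body pass: always runs through the real resolver.
  const re = /\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(page.body)) !== null) resolveAndPush(m[1]);

  // Frontmatter pass: opt-in, skipped when skipFrontmatter is true.
  if (!opts.skipFrontmatter) {
    page.frontmatterRefs.forEach((name) => resolveAndPush(name));
  }
  return links;
}

// CLI-side wiring: skipFrontmatter is the inverse of --include-frontmatter.
function runExtractSketch(
  page: PageSketch,
  resolver: ResolverSketch,
  includeFrontmatter: boolean
): string[] {
  return extractPageLinksSketch(page, resolver, {
    skipFrontmatter: !includeFrontmatter,
  });
}

const demoSlugs: { [name: string]: string } = {
  "Fast-Weigh": "companies/fast-weigh",
  Acme: "companies/acme",
};
const demoResolver: ResolverSketch = (name) =>
  Object.prototype.hasOwnProperty.call(demoSlugs, name) ? demoSlugs[name] : null;
const demoPage: PageSketch = {
  body: "Notes on [[Fast-Weigh|FW]] and [[Unknown]].",
  frontmatterRefs: ["Acme"],
};
console.log(runExtractSketch(demoPage, demoResolver, false));
console.log(runExtractSketch(demoPage, demoResolver, true));
```

The point of the split is that turning the frontmatter pass off no longer swaps in a null resolver for the body pass, so body links resolve in both modes.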