
fix(extract): resolve bare-name body wikilinks via resolver #1233

Closed
rayers wants to merge 1 commit into garrytan:master from rayers:fix/extract-bare-name-wikilinks

rayers commented May 20, 2026


Summary

Body wikilinks in wiki/topic/learning content are silently dropped on every gbrain extract pass. Three layered issues:

  1. WIKILINK_RE in src/core/link-extraction.ts is gated on DIR_PATTERN (people|companies|meetings|...). Wiki/topic/learning content uses bare-name wikilinks like [[Fast-Weigh]] or [[2026-05-07-cost-plan]] which fall outside that allow-list — the regex never matched, so body refs were invisible to extract.
  2. Body wikilinks that DID match were only resolved when --include-frontmatter was set, because extractPageLinks routed ALL refs (body + frontmatter) through activeResolver which was set to nullResolver when frontmatter was off. Body refs are already free of the cost concern that gated frontmatter — they surface in markdown the user explicitly typed — so they should always resolve.
  3. extract.ts called extractPageLinks with one resolver doing both jobs. Splitting via opts.skipFrontmatter lets the body pass keep the real resolver while frontmatter stays opt-in.
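
The first issue can be illustrated with a minimal sketch of a directory-gated pattern. The `DIR_PATTERN` value below is an assumption (the description lists only `people|companies|meetings|...`), and `gatedTargets` is a hypothetical stand-in rather than the project's actual `WIKILINK_RE` logic.

```typescript
// Hypothetical, shortened allow-list; the real DIR_PATTERN has more entries.
const DIR_PATTERN = "people|companies|meetings";

// Stand-in for a directory-gated extractor: a target only counts when it
// starts with an allow-listed directory followed by a slash.
function gatedTargets(body: string): string[] {
  const re = new RegExp(
    `\\[\\[((?:${DIR_PATTERN})/[^\\]|#]+)(?:#[^\\]|]*)?(?:\\|[^\\]]*)?\\]\\]`,
    "g",
  );
  const targets: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(body)) !== null) {
    targets.push(m[1]);
  }
  return targets;
}

const sampleBody =
  "See [[people/jane-doe]], [[Fast-Weigh]] and [[2026-05-07-cost-plan]].";

// Only the directory-prefixed link is seen; both bare names are skipped.
const found = gatedTargets(sampleBody);
console.log(found);
```

Under these assumptions the two bare-name links never reach the resolver at all, which matches the "invisible to extract" symptom described above.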

What this PR is

This is the wikilink-resolver portion of the original PR #768 (which bundled #767 + #769 + extract polish). Two pieces of that bundle have either been absorbed or split off:

This PR carries only the wikilink resolver — no scope overlap with upstream master.

Fixes

link-extraction.ts adds BARE_WIKILINK_RE matching [[<name>(#anchor)?(|display)?]] shapes outside DIR_PATTERN, resolved via the new resolveBareWikilink(name, resolver) that walks fuzzy match → bare-name prefix expansion → exact-slug before giving up. Three new exports: BARE_WIKILINK_RE, resolveBareWikilink, isBareName (regex shape guard for the pre-extract candidate check). extractPageLinks gains an opts.skipFrontmatter parameter — when true, the frontmatter pass is skipped but body wikilinks still resolve through the passed resolver.
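
As a rough sketch of the shape described above, the following parses `[[name]]`, `[[name#anchor]]` and `[[name|display]]` and then tries a list of resolution strategies in order. The regex, the `Step` type, the slug list and the three stand-in strategies are all assumptions made for illustration; they are not the PR's `BARE_WIKILINK_RE` or `resolveBareWikilink`.

```typescript
interface BareRef {
  name: string;
  anchor?: string;
  display?: string;
}

// Hypothetical pattern. Names containing "/" are excluded so that
// directory-prefixed links stay with the directory-gated regex.
function parseBareWikilinks(body: string): BareRef[] {
  const re = /\[\[([^\]|#\/]+)(?:#([^\]|]*))?(?:\|([^\]]*))?\]\]/g;
  const refs: BareRef[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(body)) !== null) {
    refs.push({ name: m[1].trim(), anchor: m[2], display: m[3] });
  }
  return refs;
}

// Hypothetical strategy shape: each step returns a slug or null.
type Step = (name: string) => string | null;

// Try each strategy in order and stop at the first hit.
function resolveInOrder(name: string, steps: Step[]): string | null {
  for (const step of steps) {
    const slug = step(name);
    if (slug !== null) return slug;
  }
  return null;
}

const slugs = ["companies/fast-weigh", "wiki/2026-05-07-cost-plan"];
const slugify = (s: string) => s.toLowerCase().replace(/\s+/g, "-");

const steps: Step[] = [
  // Stand-in for "fuzzy": compare the lowercased name to the last path segment.
  (n) => slugs.find((s) => s.split("/").pop() === slugify(n)) ?? null,
  // Stand-in for "prefix expansion": try the name under an assumed directory.
  (n) => slugs.find((s) => s === "wiki/" + slugify(n)) ?? null,
  // Exact slug.
  (n) => (slugs.indexOf(n) !== -1 ? n : null),
];

const refs = parseBareWikilinks("[[Fast-Weigh#pricing|FW]] and [[people/jane]]");
const resolved = refs.map((r) => resolveInOrder(r.name, steps));
console.log(refs, resolved);
```

The ordered list keeps each strategy independent, so a later fallback only runs when every earlier one returned null.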

extract.ts threads the always-on resolver (not the conditional nullResolver) into extractPageLinks for the body pass, with opts.skipFrontmatter wired off --include-frontmatter.
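
A simplified sketch of the split, assuming a resolver that maps a name to a slug or null. `extractPageLinksSketch`, its signature, and the sample data are hypothetical; only the `skipFrontmatter` option name and its wiring from `--include-frontmatter` come from the description.

```typescript
type Resolver = (name: string) => string | null;

// Body links always go through the resolver; frontmatter values are
// only resolved when skipFrontmatter is not set.
function extractPageLinksSketch(
  body: string,
  frontmatter: Record<string, string>,
  resolver: Resolver,
  opts: { skipFrontmatter?: boolean } = {},
): string[] {
  const out: string[] = [];
  const re = /\[\[([^\]|#]+)[^\]]*\]\]/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(body)) !== null) {
    const slug = resolver(m[1].trim());
    if (slug !== null) out.push(slug);
  }
  if (!opts.skipFrontmatter) {
    for (const key of Object.keys(frontmatter)) {
      const slug = resolver(frontmatter[key]);
      if (slug !== null) out.push(slug);
    }
  }
  return out;
}

// Hypothetical lookup table standing in for the real resolver.
const known = new Map<string, string>([
  ["Fast-Weigh", "companies/fast-weigh"],
  ["Jane Doe", "people/jane-doe"],
]);
const resolver: Resolver = (n) => known.get(n) ?? null;

// Mirrors a run without --include-frontmatter.
const includeFrontmatter = false;
const bodyOnly = extractPageLinksSketch(
  "Uses [[Fast-Weigh]].",
  { author: "Jane Doe" },
  resolver,
  { skipFrontmatter: !includeFrontmatter },
);
const bodyAndFrontmatter = extractPageLinksSketch(
  "Uses [[Fast-Weigh]].",
  { author: "Jane Doe" },
  resolver,
);
console.log(bodyOnly, bodyAndFrontmatter);
```

The point of the option is that turning frontmatter off no longer requires swapping in a null resolver, so body links keep resolving in both modes.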

Tests

test/link-extraction.test.ts: 75 lines covering BARE_WIKILINK_RE shape (anchor + display variants), resolveBareWikilink fuzzy + prefix + exact paths, isBareName negative cases (DIR_PATTERN prefixes still rejected), and extractPageLinks integration with opts.skipFrontmatter under both modes.

Local: bun run verify clean, bun test test/link-extraction.test.ts → 103 pass / 0 fail.

Scope note

The FS-source path (extractLinksFromDir) is NOT updated. It uses a different codepath via extractMarkdownLinks + resolveSlug; bare-name wikilinks in FS mode still won't resolve. Most users are on --source db (autopilot uses it); FS is for offline Obsidian-vault mode. Separate concern.

Test plan

  • bun run verify clean
  • bun test test/link-extraction.test.ts → 103/0/0
  • bun run typecheck clean
  • bun run test:e2e (gated on DATABASE_URL)
  • Manual verification on a real wiki corpus — gbrain extract links produces non-zero link counts on pages using [[Bare-Name]] shapes

🤖 Generated with Claude Code

Body wikilinks in wiki/topic/learning content are silently dropped
on every `gbrain extract` pass. Three layered issues:

1. WIKILINK_RE in src/core/link-extraction.ts is gated on
   DIR_PATTERN (people|companies|meetings|...). Wiki/topic/learning
   content uses bare-name wikilinks like `[[Fast-Weigh]]` or
   `[[2026-05-07-cost-plan]]` which fall outside that allow-list —
   the regex never matched, so body refs were invisible to extract.

2. Body wikilinks that DID match were only resolved when
   `--include-frontmatter` was set, because extractPageLinks routed
   ALL refs (body + frontmatter) through `activeResolver` which was
   set to nullResolver when frontmatter was off. Body refs are
   already free of the cost concern that gated frontmatter — they
   surface in markdown the user explicitly typed — so they should
   always resolve.

3. extract.ts called extractPageLinks with one resolver doing both
   jobs. Splitting via opts.skipFrontmatter lets the body pass keep
   the real resolver while frontmatter stays opt-in.

Fixes:

- link-extraction.ts adds BARE_WIKILINK_RE matching
  `[[<name>(#anchor)?(|display)?]]` shapes outside DIR_PATTERN,
  resolved via the new `resolveBareWikilink(name, resolver)` that
  walks fuzzy match + bare-name prefix expansion + exact-slug
  before giving up. Three new exports: BARE_WIKILINK_RE,
  resolveBareWikilink, isBareName (regex shape guard for the
  pre-extract candidate check). extractPageLinks gains an
  opts.skipFrontmatter parameter — when true, the frontmatter
  pass is skipped but body wikilinks still resolve through the
  passed resolver.

- extract.ts threads the always-on `resolver` (not the conditional
  nullResolver) into extractPageLinks for the body pass, with
  opts.skipFrontmatter wired off `--include-frontmatter`.

- test/link-extraction.test.ts: 75 lines covering BARE_WIKILINK_RE
  shape (anchor + display variants), resolveBareWikilink fuzzy +
  prefix + exact paths, isBareName negative cases (DIR_PATTERN
  prefixes still rejected), and extractPageLinks integration with
  opts.skipFrontmatter under both modes.

Scope note: this PR is the wikilink resolver portion of the
original PR garrytan#768 wave. The doctor.ts hint fix that was also in
that wave has been absorbed by upstream master independently
(doctor.ts:2503 now correctly says `Run: gbrain extract all`).
This PR carries only the wikilink resolver — no overlap with
upstream.

FS-source path (extractLinksFromDir) NOT updated. It uses a
different codepath via extractMarkdownLinks + resolveSlug; bare-
name wikilinks in FS mode still won't resolve. Most users are on
--source db (autopilot uses it); FS is for offline Obsidian-vault
mode. Separate concern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rayers force-pushed the fix/extract-bare-name-wikilinks branch from 5b1d2e3 to 55232a9 on May 28, 2026 05:59
garrytan added a commit that referenced this pull request Jun 2, 2026
…loses #972) (#1388)

* v0.40.8.2 fix(extract): opt-in global-basename wikilink resolution (#972)

Bare wikilinks like [[struktura]] that point at pages in another folder
were silently dropped from the graph. The issue reporter saw 71 wikilinks
in Obsidian render to 12 in gbrain (~83% lost). Symptoms downstream:
`gbrain graph` returns thin neighborhoods, `gbrain backlinks` undercounts.

This release adds an opt-in mode that resolves bare wikilinks by basename
match, covers all three resolver surfaces (FS-source extract, DB-source
extract, put_page auto-link), and emits one edge per match — no silent
winner on ambiguity. `gbrain doctor` surfaces a paste-ready enable hint
when ≥5 bare wikilinks would resolve under the new mode.

Enable with:
  gbrain config set link_resolution.global_basename true
  gbrain extract links

Default stays off. Existing brains see zero behavior change on upgrade.

Closes #972. Adapts PR #1233 from @rayers (regex shape + slug-tail index)
into a multi-match, opt-in form with FS-source coverage that the original
PR explicitly skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: document opt-in global-basename wikilink resolution (#972)

The #972 feature shipped with no user-facing docs — only CHANGELOG + CLAUDE.md.
Anyone migrating an Obsidian/Notion vault with bare [[name]] wikilinks couldn't
discover the link_resolution.global_basename flag unless gbrain doctor happened
to surface its hint.

- README "Self-wiring knowledge graph": one sentence on the opt-in mode for
  Obsidian-style cross-folder bare wikilinks + the doctor pre-check, linking to
  the install step.
- INSTALL_FOR_AGENTS Step 4.5 (Wire the Knowledge Graph): a dedicated agent-
  facing subsection — when bare [[name]] links need it, the enable command,
  re-running extract, the doctor opportunity hint, and the multi-match behavior.
- Regenerated llms-full.txt.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(#972): resolve aliased wikilinks by target slug, not display text

Codex outside-voice [P1]: `[[struktura|the project]]` resolved the basename
"the project" (the alias) instead of `struktura` (the target), because
extractPageLinks called resolveBasenameMatches(ref.name) and the doctor check
keyed basenameIndex.get(e.name). ref.name is the display alias (match[2]);
ref.slug is the wikilink target (match[1]).

- extractPageLinks resolves ref.slug; context excerpt locates ref.slug.
- doctor link_resolution_opportunity keys e.slug so its estimate matches
  what extraction actually resolves.
- Test: aliased wikilink calls resolveBasenameMatches with the target, never
  the display text.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
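
A small sketch of the alias bug, assuming a two-group wikilink regex where group 1 is the target and group 2 is the display text, as the commit message states. The `basenameIndex` contents and the `projects/struktura` slug are made up for illustration.

```typescript
interface WikilinkRef {
  slug: string; // link target (capture group 1)
  name: string; // display text (capture group 2), falling back to the target
}

function parseWikilinks(body: string): WikilinkRef[] {
  const re = /\[\[([^\]|]+)(?:\|([^\]]+))?\]\]/g;
  const refs: WikilinkRef[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(body)) !== null) {
    const slug = m[1].trim();
    refs.push({ slug, name: m[2] !== undefined ? m[2].trim() : slug });
  }
  return refs;
}

// Hypothetical basename index: slug tail -> full slugs.
const basenameIndex = new Map<string, string[]>([
  ["struktura", ["projects/struktura"]],
]);
const lookup = (key: string): string[] => basenameIndex.get(key) ?? [];

const parsed = parseWikilinks("See [[struktura|the project]].");
const byTarget = lookup(parsed[0].slug); // keyed on the link target
const byAlias = lookup(parsed[0].name); // keyed on the display text
console.log(byTarget, byAlias);
```

Keying the lookup on `slug` finds the page, while keying on `name` looks up "the project" and finds nothing, which is the behavior this fix removes.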

* fix(#972): reconcile wikilink-resolved edges in put_page auto-link

Codex outside-voice [P1]: put_page's reconcilableOut filter excluded
link_source='wikilink-resolved', so a basename edge written by auto-link
survived after the bare wikilink was deleted from the page OR the
link_resolution.global_basename flag was turned off (the stale-removal loop
only iterates reconcilableOut). Add 'wikilink-resolved' to the reconcilable
set; manual edges still untouched.

Test: write page with [[struktura]] (flag on) → edge lands; re-put without
the wikilink → edge reconciled away.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
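
The reconciliation rule can be sketched as a filter over existing outgoing edges. The `Edge` shape, the `staleEdges` helper and the `"manual"` value are assumptions for illustration; only `'wikilink-resolved'` and `'markdown'` appear as `link_source` values in these commit messages.

```typescript
type LinkSource = "markdown" | "wikilink-resolved" | "manual";

interface Edge {
  to: string;
  link_source: LinkSource;
}

// Sources that auto-link owns and may remove once the page no longer
// implies them. Adding "wikilink-resolved" here is the fix being described.
const RECONCILABLE = new Set<LinkSource>(["markdown", "wikilink-resolved"]);

// Edges that auto-link wrote earlier but the current page content no longer implies.
function staleEdges(existingEdges: Edge[], desiredTargets: Set<string>): Edge[] {
  return existingEdges.filter(
    (e) => RECONCILABLE.has(e.link_source) && !desiredTargets.has(e.to),
  );
}

const existingEdges: Edge[] = [
  { to: "projects/struktura", link_source: "wikilink-resolved" },
  { to: "people/jane-doe", link_source: "manual" },
];

// The page was re-put without [[struktura]], so nothing is desired any more.
const toRemove = staleEdges(existingEdges, new Set<string>());
console.log(toRemove);
```

With `"wikilink-resolved"` missing from the set, the first edge would never be considered for removal, which is the stale-edge behavior the commit describes.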

* fix(#972): source-scope basename resolution (no cross-source edges)

Codex outside-voice [P1]: makeResolver.resolveBasenameMatches called
engine.getAllSlugs() unscoped, so a bare [[name]] could resolve to a
same-tail page in a DIFFERENT source and create a cross-source edge. The
engine exposes getAllSlugs({sourceId}) precisely to prevent this. #972 is
"global basename across folders," not "cross-source federation" — the
canonical gbrain multi-source bug class.

- makeResolver gains opts.sourceId; ensureBasenameIndex passes it to
  getAllSlugs (unscoped only when sourceId omitted — back-compat).
- runAutoLink (put_page) passes opts.sourceId; extractLinksFromDB passes
  sourceIdFilter. FS extract is already single-source (walks one dir).
- Tests: scoped index returns only the source's slugs (no cross-source);
  unscoped call stays brain-wide.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
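
To make the scoping concrete, here is an in-memory sketch. `getAllSlugsSketch` is a hypothetical stand-in for the engine call, and the page rows and source ids are invented; only the idea of an optional `sourceId` filter comes from the commit message.

```typescript
interface PageRow {
  slug: string;
  sourceId: string;
}

// Stand-in for a slug listing with an optional source filter.
// Omitting sourceId keeps the brain-wide (back-compat) behavior.
function getAllSlugsSketch(
  pages: PageRow[],
  opts: { sourceId?: string } = {},
): string[] {
  return pages
    .filter((p) => opts.sourceId === undefined || p.sourceId === opts.sourceId)
    .map((p) => p.slug);
}

// Slugs whose last path segment equals the bare name.
function tailMatches(slugList: string[], bareName: string): string[] {
  return slugList.filter((s) => s.split("/").pop() === bareName);
}

const pages: PageRow[] = [
  { slug: "projects/struktura", sourceId: "work" },
  { slug: "notes/struktura", sourceId: "personal" },
];

const scoped = tailMatches(getAllSlugsSketch(pages, { sourceId: "work" }), "struktura");
const unscoped = tailMatches(getAllSlugsSketch(pages), "struktura");
console.log(scoped, unscoped);
```

The scoped call can only ever produce edges inside one source, while the unscoped call returns both same-tail pages.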

* fix(#972): FS-source basename edges carry link_source='wikilink-resolved'

The FS extract path is the issue's default repro (gbrain extract links with no
--source db). ExtractedLink had no link_source field, so FS basename edges
landed with the engine default ('markdown') instead of the 'wikilink-resolved'
provenance the DB / put_page paths set and the docs promise. The e2e FS test
only asserted link_type, so it was blind to this.

- ExtractedLink gains link_source?; extractLinksFromFile sets it to
  'wikilink-resolved' on basename edges (undefined for ordinary markdown).
- Carries through the addLinksBatch snapshots automatically (LinkBatchInput
  already has link_source); single-row addLink fallback now passes it too.
- e2e FS repro asserts link_source === 'wikilink-resolved'.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(#972): one shared basename matcher across resolver/FS/doctor

Codex outside-voice [P2] DRY: three surfaces each hand-rolled a basename
matcher with divergent key sets — the doctor omitted the slugified key, so its
link_resolution_opportunity estimate undercounted what extraction resolves, and
the resolver returned matches in unsorted getAllSlugs bucket order.

New shared exports in link-extraction.ts: buildBasenameIndex(slugs) +
queryBasenameIndex(index, name) (keys raw/lower/slugified tail; stable sort
shorter-first then lexical) + normalizeBasename.

- makeResolver.resolveBasenameMatches → queryBasenameIndex (now stable-sorted).
- extract.ts resolveBasenameMatchesFromSlugs → delegates to the shared pair.
- doctor link_resolution_opportunity → shared builder/query (slugified key
  added; estimate now matches extraction).
- Test: doctor counts a slugified-only match ([[Fast Weigh]] → companies/fast-weigh).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
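
One way such a shared matcher could look is sketched below. The `Sketch` functions are illustrative only, and the normalization rule is an assumption; what comes from the commit message is the key set (raw, lowercased and slugified tail) and the ordering (shorter slugs first, then lexical).

```typescript
type BasenameIndex = Map<string, string[]>;

// Assumed normalization: lowercase, runs of non-alphanumerics become "-".
function normalizeBasenameSketch(value: string): string {
  return value
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Index every slug under the raw, lowercased and slugified form of its tail.
function buildBasenameIndexSketch(slugList: string[]): BasenameIndex {
  const index: BasenameIndex = new Map();
  for (const slug of slugList) {
    const tail = slug.split("/").pop() ?? slug;
    const keys = new Set([tail, tail.toLowerCase(), normalizeBasenameSketch(tail)]);
    keys.forEach((key) => {
      const bucket = index.get(key) ?? [];
      bucket.push(slug);
      index.set(key, bucket);
    });
  }
  return index;
}

// Return every match, deduplicated, shorter slugs first and then lexical.
function queryBasenameIndexSketch(index: BasenameIndex, query: string): string[] {
  const hits = new Set<string>();
  for (const key of [query, query.toLowerCase(), normalizeBasenameSketch(query)]) {
    for (const slug of index.get(key) ?? []) hits.add(slug);
  }
  return Array.from(hits).sort(
    (a, b) => a.length - b.length || (a < b ? -1 : a > b ? 1 : 0),
  );
}

const sharedIndex = buildBasenameIndexSketch([
  "wiki/topic/fast-weigh",
  "companies/fast-weigh",
  "people/jane-doe",
]);
const matches = queryBasenameIndexSketch(sharedIndex, "Fast Weigh");
console.log(matches);
```

Because every surface would call the same pair of functions, an estimate and an extraction run over the same slugs cannot disagree about which names resolve.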

* fix(#972): P2 cluster — masking, code-fence, self-link, dedup decision

Codex outside-voice P2 findings:
- P2a markdown-label masking: a wikilink inside a markdown-link label
  ([see [[acme]]](companies/acme.md)) spawned a stray generic basename ref.
  Pass-1 can't match the nested brackets, so a new MARKDOWN_LABEL_WIKILINK_RE
  masks those spans out of pass 2c. Inner [[acme]] is now inert.
- P2b FS code-fence: the FS path (extractMarkdownLinks on raw content) didn't
  strip code blocks like the DB path. extractLinksFromFile now scans
  stripCodeBlocks(content) so [[name]] inside a fence creates no FS edge.
- P2c self-link guard: a basename [[own-tail]] on its own page resolved back
  to itself. Dropped in both extractPageLinks and the FS path.
- P2d dedup: documented the decision to KEEP qualified + bare edges to the
  same target as separate rows (distinct provenance/audit trail).
- P2e: skipFrontmatter unresolved-contract tests added.

Tests: P2a inert-label, P2c self-link drop, P2b code-fence, P2e unresolved.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
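
Two of these guards (code-fence stripping and the self-link drop) can be sketched together. `stripCodeBlocksSketch` and `bareTargets` are hypothetical helpers, not the project's `stripCodeBlocks`, and they handle fenced blocks only.

```typescript
// Three backticks, spelled with escapes so this example stays fence-safe.
const FENCE = "\x60\x60\x60";

// Blank out fenced blocks while preserving newlines, so offsets stay stable.
function stripCodeBlocksSketch(content: string): string {
  const fenced = new RegExp(FENCE + "[\\s\\S]*?" + FENCE, "g");
  return content.replace(fenced, (block) => block.replace(/[^\n]/g, " "));
}

// Collect bare wikilink names outside code fences, dropping links whose
// name equals the tail of the page's own slug.
function bareTargets(content: string, ownSlug: string): string[] {
  const ownTail = ownSlug.split("/").pop();
  const text = stripCodeBlocksSketch(content);
  const re = /\[\[([^\]|#]+)[^\]]*\]\]/g;
  const out: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    const target = m[1].trim();
    if (target === ownTail) continue; // self-link guard
    out.push(target);
  }
  return out;
}

const page =
  "Intro [[acme]]\n" +
  FENCE + "\n[[inside-fence]]\n" + FENCE + "\n" +
  "See also [[own-page]].";

const targets = bareTargets(page, "wiki/own-page");
console.log(targets);
```

Stripping before matching means the fence check and the link regex stay separate, so the same regex can be reused on both the FS and DB paths.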

* perf(#972): bound the doctor link_resolution_opportunity scan

The check did listAllPageRefs() + a getPage() per page under a 60s budget.
On a large brain (the eng-review concern) it hit the budget every non-fast
doctor run and returned a perpetual partial, adding ~60s.

Now batch-loads the 1000 most-recent pages in ONE query
(ORDER BY id DESC LIMIT SAMPLE_LIMIT) and scans in memory, with the 60s cap
kept as a backstop. Mirrors the v0.40.9 sampling convention. The estimate
message names the bound when the brain exceeds the sample
("scanned the 1000 most-recent of N pages").

Test: source-grep pins the bounded query + the absence of the per-page
getPage walk.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
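
The bounding step can be sketched in memory, assuming that a higher `id` means a more recent page (as `ORDER BY id DESC` implies). The `PageStub` type and `samplePages` helper are hypothetical; only `SAMPLE_LIMIT`, the value 1000, and the message wording come from the commit message.

```typescript
const SAMPLE_LIMIT = 1000;

interface PageStub {
  id: number;
  slug: string;
}

// Keep the most recent pages up to the limit and describe the bound
// only when the brain is larger than the sample.
function samplePages(
  allPages: PageStub[],
  limit: number = SAMPLE_LIMIT,
): { sample: PageStub[]; note: string | null } {
  const sample = allPages
    .slice()
    .sort((a, b) => b.id - a.id)
    .slice(0, limit);
  const note =
    allPages.length > limit
      ? `scanned the ${limit} most-recent of ${allPages.length} pages`
      : null;
  return { sample, note };
}

// 1500 synthetic pages with ids 1..1500.
const allPages: PageStub[] = [];
for (let i = 1; i <= 1500; i++) {
  allPages.push({ id: i, slug: "wiki/page-" + i });
}

const result = samplePages(allPages);
console.log(result.sample.length, result.note);
```

In the real check the sort and limit happen in the single SQL query; the sketch only mirrors the resulting sample and message.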

* docs(#972): reconcile stale version/migration references to v112 / 0.42.6.0

Merge churn left intermediate refs: schema.sql + schema-embedded.ts said
"migration v93", CLAUDE.md said "v0.41.32.0 / Migration v109", CHANGELOG said
"Migration v93". Reconciled all to migration v112 / shipping 0.42.6.0. The
CLAUDE.md annotation is also refreshed to describe the final behavior (shared
matcher, source-scoping, alias-by-target, stale-edge reconciliation, bounded
doctor scan) and credit @rayers + @ukd1. Regenerated schema-embedded + llms.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(#972): register doctor check category + bump llms budget to 800KB

Two full-suite gate failures from the re-sync:
- doctor-categories drift guard: the new `link_resolution_opportunity` check
  wasn't in any category set. Added to BRAIN_CHECK_NAMES (alongside
  graph_coverage / orphan_ratio — it's a graph-quality signal).
- build-llms size budget: the #972 Key Files annotation (landing with master's
  #1696/#1699 waves) pushed llms-full.txt past 750KB. Bumped FULL_SIZE_BUDGET
  750KB→800KB, the established "budget tracks CLAUDE.md's legitimate per-feature
  growth" pattern (600→700→750→800 across releases).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Garry Tan <garrytan@gmail.com>

rayers commented Jun 3, 2026


Superseded by #1388 (merged), which is upstream's adoption of this PR's kernel — the generic wikilink regex + slug-tail index pattern — reworked as opt-in via link_resolution.global_basename, with multi-match resolution and FS-source coverage. Thanks for taking it forward; closing this as done-a-different-way.

rayers closed this Jun 3, 2026