Commit 1cfb156

knee5 and claude committed
feat(extract): support Obsidian wikilinks + wiki-style domain slugs in canonical extractor
extractEntityRefs now recognizes both syntaxes equally:

  [Name](people/slug)     -- upstream original
  [[people/slug|Name]]    -- Obsidian wikilink (new)

Extends DIR_PATTERN to include domain-organized wiki slugs used by Karpathy-style knowledge bases:

  - entities (legacy prefix some brains keep during migration)
  - projects (gbrain canonical, was missing from regex)
  - tech, finance, personal, openclaw (domain-organized wiki roots)

Before this change, a 2,100-page brain with wikilinks throughout extracted zero auto-links on put_page because the regex only matched markdown-style [name](path). After: 1,377 new typed edges on a single extract --source db pass over the same corpus.

Matches the behavior of the extract.ts filesystem walker (which already handled wikilinks as of the wiki-markdown-compat fix wave), so the db and fs sources now produce the same link graph from the same content.

Both patterns share the DIR_PATTERN constant so adding a new entity dir only requires updating one string.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent f50954f commit 1cfb156

1 file changed

Lines changed: 55 additions & 13 deletions

src/core/link-extraction.ts

@@ -27,16 +27,41 @@ export interface EntityRef {
 }
 
 /**
- * Match `[Name](path)` markdown links pointing to `people/` or `companies/`
- * (and other entity directories). Accepts both filesystem-relative format
- * (`[Name](../people/slug.md)`) AND engine-slug format (`[Name](people/slug)`).
+ * Directory prefix whitelist. These are the top-level slug dirs the extractor
+ * recognizes as entity references. Upstream canonical + our extensions:
+ * - Gbrain canonical: people, companies, meetings, concepts, deal, civic, project, source, media, yc, projects
+ * - Our domain extensions: tech, finance, personal, openclaw (domain-organized wikis)
+ * - Our entity prefix: entities (we kept some legacy entities/projects/ pages)
+ */
+const DIR_PATTERN = '(?:people|companies|meetings|concepts|deal|civic|project|projects|source|media|yc|tech|finance|personal|openclaw|entities)';
+
+/**
+ * Match `[Name](path)` markdown links pointing to entity directories.
+ * Accepts both filesystem-relative format (`[Name](../people/slug.md)`)
+ * AND engine-slug format (`[Name](people/slug)`).
  *
- * Captures: name, dir (people/companies/...), slug.
+ * Captures: name, slug (dir/name, possibly deeper).
  *
  * The regex permits an optional `../` prefix (any number) and an optional
  * `.md` suffix so the same function works for both filesystem and DB content.
  */
-const ENTITY_REF_RE = /\[([^\]]+)\]\((?:\.\.\/)*((?:people|companies|meetings|concepts|deal|civic|project|source|media|yc)\/([^)\s]+?))(?:\.md)?\)/g;
+const ENTITY_REF_RE = new RegExp(
+  `\\[([^\\]]+)\\]\\((?:\\.\\.\\/)*(${DIR_PATTERN}\\/[^)\\s]+?)(?:\\.md)?\\)`,
+  'g',
+);
+
+/**
+ * Match Obsidian-style `[[path]]` or `[[path|Display Text]]` wikilinks.
+ * Captures: slug (dir/...), displayName (optional).
+ *
+ * Same dir whitelist as ENTITY_REF_RE. Strips trailing `.md`, strips section
+ * anchors (`#heading`), skips external URLs. Wiki KBs use this format almost
+ * exclusively so missing it leaves the graph empty.
+ */
+const WIKILINK_RE = new RegExp(
+  `\\[\\[(${DIR_PATTERN}\\/[^|\\]#]+?)(?:#[^|\\]]*?)?(?:\\|([^\\]]+?))?\\]\\]`,
+  'g',
+);
 
 /**
  * Strip fenced code blocks (```...```) and inline code (`...`) from markdown,
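
Review note on the hunk above: a minimal sketch of what WIKILINK_RE captures on its own, before the extraction loop post-processes the slug. The whitelist is shortened for illustration, and `captureWikilinks` and `WikiCapture` are hypothetical names that do not appear in the commit. It suggests that the section anchor is dropped by the pattern itself, while a trailing `.md` survives in the capture and has to be removed by the caller.

```typescript
// Shortened whitelist for illustration; the commit's DIR_PATTERN lists more dirs.
const DIR_PATTERN = '(?:people|companies|projects|tech)';

// Same pattern text as WIKILINK_RE in the hunk above.
const WIKILINK_SOURCE =
  `\\[\\[(${DIR_PATTERN}\\/[^|\\]#]+?)(?:#[^|\\]]*?)?(?:\\|([^\\]]+?))?\\]\\]`;

interface WikiCapture {
  slug: string;                 // capture group 1, unprocessed
  display: string | undefined;  // capture group 2, absent without `|Text`
}

// Hypothetical helper: return the raw capture groups for every wikilink found.
function captureWikilinks(text: string): WikiCapture[] {
  const re = new RegExp(WIKILINK_SOURCE, 'g'); // fresh instance, fresh lastIndex
  const out: WikiCapture[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    out.push({ slug: m[1] ?? '', display: m[2] });
  }
  return out;
}

const captures = captureWikilinks(
  'See [[people/jane-doe|Jane]], [[tech/rust#Ownership]], [[tech/go.md]] and [[misc/notes]].',
);
console.log(captures);
```

In this sample, `[[misc/notes]]` is skipped because `misc` is not in the whitelist, `[[tech/rust#Ownership]]` yields the slug `tech/rust`, and `[[tech/go.md]]` yields `tech/go.md` until the caller strips the suffix.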
@@ -84,16 +109,30 @@ function stripCodeBlocks(content: string): string {
 export function extractEntityRefs(content: string): EntityRef[] {
   const stripped = stripCodeBlocks(content);
   const refs: EntityRef[] = [];
-  let m: RegExpExecArray | null;
-  // Fresh regex per call (g-flag state is per-instance).
-  const re = new RegExp(ENTITY_REF_RE.source, ENTITY_REF_RE.flags);
-  while ((m = re.exec(stripped)) !== null) {
-    const name = m[1];
-    const fullPath = m[2];
-    const slug = fullPath; // dir/slug
+  let match: RegExpExecArray | null;
+
+  // 1. Markdown links: [Name](path)
+  const mdPattern = new RegExp(ENTITY_REF_RE.source, ENTITY_REF_RE.flags);
+  while ((match = mdPattern.exec(stripped)) !== null) {
+    const name = match[1];
+    const fullPath = match[2];
+    const slug = fullPath;
     const dir = fullPath.split('/')[0];
     refs.push({ name, slug, dir });
   }
+
+  // 2. Obsidian wikilinks: [[path]] or [[path|Display Text]]
+  const wikiPattern = new RegExp(WIKILINK_RE.source, WIKILINK_RE.flags);
+  while ((match = wikiPattern.exec(stripped)) !== null) {
+    let slug = match[1].trim();
+    if (!slug) continue;
+    if (slug.includes('://')) continue;
+    if (slug.endsWith('.md')) slug = slug.slice(0, -3);
+    const displayName = (match[2] || slug).trim();
+    const dir = slug.split('/')[0];
+    refs.push({ name: displayName, slug, dir });
+  }
+
   return refs;
 }
 
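
Review note on the hunk above: a simplified, self-contained stand-in for the two-pass extraction, to illustrate that both syntaxes resolve to the same slug. It assumes a shortened whitelist and skips code-block stripping; `extractRefsSketch`, `MD_LINK` and `WIKILINK` are hypothetical names, not the commit's.

```typescript
interface EntityRef {
  name: string;
  slug: string;
  dir: string;
}

// Shortened whitelist for illustration only.
const DIR_PATTERN = '(?:people|companies|projects|tech)';

// Pattern text copied from the diff; only the whitelist differs.
const MD_LINK =
  `\\[([^\\]]+)\\]\\((?:\\.\\.\\/)*(${DIR_PATTERN}\\/[^)\\s]+?)(?:\\.md)?\\)`;
const WIKILINK =
  `\\[\\[(${DIR_PATTERN}\\/[^|\\]#]+?)(?:#[^|\\]]*?)?(?:\\|([^\\]]+?))?\\]\\]`;

// Simplified stand-in for extractEntityRefs (assumption: no code-block stripping).
function extractRefsSketch(content: string): EntityRef[] {
  const refs: EntityRef[] = [];
  let match: RegExpExecArray | null;

  // Pass 1: markdown links, [Name](path)
  const mdPattern = new RegExp(MD_LINK, 'g');
  while ((match = mdPattern.exec(content)) !== null) {
    const slug = match[2] ?? '';
    refs.push({ name: match[1] ?? '', slug, dir: slug.split('/')[0] ?? '' });
  }

  // Pass 2: wikilinks, [[path]] or [[path|Display Text]]
  const wikiPattern = new RegExp(WIKILINK, 'g');
  while ((match = wikiPattern.exec(content)) !== null) {
    let slug = (match[1] ?? '').trim();
    if (slug.endsWith('.md')) slug = slug.slice(0, -3);
    const name = (match[2] || slug).trim(); // fall back to the slug as the name
    refs.push({ name, slug, dir: slug.split('/')[0] ?? '' });
  }
  return refs;
}

const refs = extractRefsSketch(
  'Met [Jane Doe](../people/jane-doe.md) and [[people/jane-doe|Jane Doe]]; notes in [[tech/rust.md]].',
);
console.log(refs);
```

Two properties of this structure are worth keeping in mind when reading the real function: results are grouped by syntax (all markdown links first, then all wikilinks) rather than by position in the document, and a page referenced in both syntaxes appears twice, so any deduplication has to happen in the caller.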

@@ -145,7 +184,10 @@ export function extractPageLinks(
   // Limited to the same entity directories ENTITY_REF_RE covers.
   // Code blocks are stripped first — slugs in code samples are not real refs.
   const strippedContent = stripCodeBlocks(content);
-  const bareRe = /\b((?:people|companies|meetings|concepts|deal|civic|project|source|media|yc)\/[a-z0-9][a-z0-9-]*)\b/g;
+  const bareRe = new RegExp(
+    `\\b(${DIR_PATTERN}\\/[a-z0-9][a-z0-9/-]*[a-z0-9])\\b`,
+    'g',
+  );
   let m: RegExpExecArray | null;
   while ((m = bareRe.exec(strippedContent)) !== null) {
     // Skip matches that are part of a markdown link (already handled above).
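
Review note on the hunk above: a small comparison of the old and new bare-slug tails, isolating that change by applying the same shortened whitelist to both (the real old pattern also lacked `projects` and `entities`). `findBareSlugs`, `OLD_BARE` and `NEW_BARE` are hypothetical names for this sketch. It suggests the new tail picks up nested slugs, and that, judging by the pattern alone, a single-character slug such as `people/a` no longer matches because the tail now needs at least two characters.

```typescript
// Shortened whitelist for illustration only.
const DIR_PATTERN = '(?:people|companies|projects|entities)';

// Slug tail before the commit (flat) and after it (nested segments allowed).
const OLD_BARE = `\\b(${DIR_PATTERN}\\/[a-z0-9][a-z0-9-]*)\\b`;
const NEW_BARE = `\\b(${DIR_PATTERN}\\/[a-z0-9][a-z0-9/-]*[a-z0-9])\\b`;

// Hypothetical helper: list every bare slug a pattern finds in the text.
function findBareSlugs(source: string, text: string): string[] {
  const re = new RegExp(source, 'g');
  const found: string[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    found.push(m[1] ?? '');
  }
  return found;
}

const sample = 'Moved entities/projects/gbrain-sync under people/a for review.';
const oldHits = findBareSlugs(OLD_BARE, sample);
const newHits = findBareSlugs(NEW_BARE, sample);
console.log({ oldHits, newHits });
```

With this sample, the old tail stops at the first `/` and reports `entities/projects` plus `people/a`, while the new tail reports only `entities/projects/gbrain-sync`. If single-character slugs exist in a corpus, making the middle and final parts optional together, for example `[a-z0-9](?:[a-z0-9/-]*[a-z0-9])?`, would be one way to keep them.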
