feat: auto-link entity mentions for orphan reduction #1378

Closed
garrytan-agents wants to merge 3 commits into garrytan:master from garrytan-agents:feat/auto-link-entity-mentions

@garrytan-agents

Contributor

Proposal: Auto-Link Entity Mentions (Orphan Reduction)

Problem

In a production brain with 165K+ pages, approximately 88% of pages are orphans — they have zero inbound links. This happens because the current link extraction only recognizes explicit markdown links ([Name](path)). When a page mentions an entity by name in body text (e.g., "we discussed Acme Corp's growth trajectory"), no link is created.

This means the vast majority of a brain's knowledge graph is disconnected, making it impossible to traverse relationships, find related content, or build meaningful entity profiles through link analysis.

Scale of Impact

| Metric | Value |
| --- | --- |
| Total pages | ~165,000 |
| Orphan pages (0 inbound links) | ~146,000 (88%) |
| Entity link coverage | ~32% |

Proposed Solution

Add a link-by-mention pass in the dream/extract cycle that creates `mentions` links from body-text references to known entities.

How It Works

  1. Build a gazetteer from existing entity pages (person, company, etc.): collect each entity's title + aliases from frontmatter
  2. Scan recently-synced pages for text mentions of those entity names using case-insensitive fuzzy matching
  3. Create `mentions` links from the mentioning page to the mentioned entity page, deduplicated so each (source, target) pair is linked at most once
  4. Gate behind a config flag: gbrain config set auto_link_mentions true
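The steps above can be sketched roughly as follows. This is a minimal illustration rather than gbrain's implementation: the `EntityPage` shape and the `buildNameIndex` and `linkMentions` helpers are hypothetical names, and matching here is plain case-insensitive whole-word matching, not the fuzzy matching the proposal calls for.

```typescript
// Hypothetical page shape; the real gbrain page model is not shown in this proposal.
interface EntityPage {
  slug: string;
  title: string;
  aliases?: string[];
}

interface MentionLink {
  from: string;
  to: string;
  linkType: "mentions";
}

// Step 1: collect each entity's title and aliases, keyed by lowercase name.
function buildNameIndex(entities: EntityPage[]): Map<string, string> {
  const index = new Map<string, string>();
  for (const e of entities) {
    for (const name of [e.title, ...(e.aliases ?? [])]) {
      index.set(name.toLowerCase(), e.slug);
    }
  }
  return index;
}

// Escape regex metacharacters so entity names are matched literally.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Steps 2-3: scan body text and emit at most one link per (from, to) pair.
function linkMentions(fromSlug: string, body: string, index: Map<string, string>): MentionLink[] {
  const seen = new Set<string>();
  const links: MentionLink[] = [];
  index.forEach((toSlug, name) => {
    // Skip self-links and targets already linked from this page.
    if (toSlug === fromSlug || seen.has(toSlug)) return;
    const pattern = new RegExp(`\\b${escapeRegExp(name)}\\b`, "i");
    if (pattern.test(body)) {
      seen.add(toSlug);
      links.push({ from: fromSlug, to: toSlug, linkType: "mentions" });
    }
  });
  return links;
}
```

The in-memory `seen` set only deduplicates within a single page scan. Making repeated runs idempotent would additionally need a uniqueness guarantee wherever the links are stored.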

CLI Interface

# One-time backfill for existing pages
gbrain extract links --by-mention

# Enable ongoing auto-linking in dream cycle
gbrain config set auto_link_mentions true

# Dry run to preview what would be linked
gbrain extract links --by-mention --dry-run

Implementation Notes

  • The gazetteer is built from the brain's own pages — no external NER model needed for this pass
  • Fuzzy matching should handle common variations (e.g., "Acme" matching "Acme Corp", "Acme Corporation")
  • Dedup ensures running the command multiple times is safe (idempotent)
  • Performance: process in batches; scanning all 165K pages could be slow, so support --batch-size and --since flags

Agent Onboarding

Doctor Detection

gbrain doctor should detect an orphan ratio above 50% and surface a recommendation:

⚠ High orphan ratio: 88% of pages have no inbound links
  Recommendation: Run `gbrain extract links --by-mention` to create links from text mentions
  Or enable auto-linking: `gbrain config set auto_link_mentions true`

Fresh Install

On fresh install, the setup wizard should ask:

Would you like to enable automatic entity mention linking? 
This creates links when pages mention known entities by name. [y/N]

Migration Prompt (v0.41+)

Add a one-time migration that runs after upgrade:

Your brain has 88% orphan pages. 
Run `gbrain extract links --by-mention` to create links from text mentions? [y/N]

The migration records completion in the kv table so it doesn't prompt again.

Evidence

This proposal is based on production data from a 165K-page brain where orphan pages accumulated over months of operation. The operator discovered the issue only after running gbrain doctor — by then, 146K pages had no connections despite being rich with entity references in their body text.

Risks & Mitigations

| Risk | Mitigation |
| --- | --- |
| False positive matches (e.g., "Apple" the fruit vs "Apple" the company) | Require minimum name length, prefer exact matches, allow an ignore list |
| Performance on large brains | Batch processing, --since flag for incremental runs |
| Link spam on frequently-mentioned entities | Cap mentions per source page, or only link first mention |

root and others added 3 commits May 24, 2026 09:16
The judgeSignificance trimming (slice at 4000 chars) could split a
UTF-16 surrogate pair when an emoji sits exactly at the boundary,
producing a lone high surrogate that Anthropic's JSON parser rejects
with 'no low surrogate in string'.

Add safeSliceEnd() helper that backs up by one char when the cut lands
between a high and low surrogate. Apply to:
- judgeSignificance transcript trimming (the direct cause)
- findBoundary hard-split fallback (defense-in-depth)

Fixes: dream cycle SYNTH_PHASE_FAIL on 2026-05-24 caused by
🤖 emoji at pos 3999 in telegram/2026-05-20-topic-1-topic-1.md

Add proposal for automatic entity mention linking to reduce orphan pages.
In a 165K-page production brain, 88% of pages are orphans because link
extraction only finds explicit markdown links, not text mentions.

Proposes a link-by-mention pass in the dream/extract cycle.
@garrytan-agents

Contributor Author

Closing — this was a docs-only proposal, not an implementation. Consolidating all 5 proposals into one design doc on #1383 (gbrain onboard). The surrogate-pair fix will land as a separate micro-PR. The real implementation work starts with auto-link (orphan reduction) as the first actual code PR.

garrytan added a commit that referenced this pull request May 25, 2026
…-pair fix (#1442)

* fix(synthesize): UTF-16 surrogate-safe hard-split in chunker

Part A of v0.42.0.0 fix wave: lifts surrogate-pair-safe slicing from
src/core/eval-contradictions/judge.ts into a new shared module
src/core/text-safe.ts. The dream-cycle chunker findBoundary tier-3
fallback (synthesize.ts) previously hard-split at maxChars, orphaning
a high surrogate when the boundary landed inside emoji / non-BMP CJK /
mathematical alphanumerics. Resulting chunks were not byte-identical
to the source content, which broke the v0.30.2 D9 stable-chunk-identity
invariant — the per-chunk idempotency key drifted across retries on
transcripts containing 4-byte UTF-8 characters near a hard-split.

Five agent-authored PRs (#1378-#1382) each independently introduced a
narrow safeSliceEnd helper that handled ONE of the three correctness
cases (high+low pair straddle) but missed the AT-low-surrogate case
that fires when a boundary lands inside a complete pair. The shared
text-safe.ts module exports both truncateUtf8 (the verbatim sliced
string, for judge.ts) and safeSplitIndex (the boundary index, for
chunker hot path), each covering all three cases.
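To illustrate the failure mode: JavaScript strings are sequences of UTF-16 code units, so `slice(0, n)` can cut between the high and low halves of a surrogate pair. The sketch below shows only the basic "back up one code unit" idea. `surrogateSafeCut` is a hypothetical helper written for this example; it is not the `safeSplitIndex` or `truncateUtf8` exported by src/core/text-safe.ts, and it does not claim to cover every case those helpers handle.

```typescript
function isHighSurrogate(code: number): boolean {
  return code >= 0xd800 && code <= 0xdbff;
}

function isLowSurrogate(code: number): boolean {
  return code >= 0xdc00 && code <= 0xdfff;
}

// Returns an index <= maxLen at which slicing will not separate a surrogate pair.
function surrogateSafeCut(text: string, maxLen: number): number {
  if (maxLen >= text.length) return text.length;
  if (maxLen <= 0) return 0;
  const before = text.charCodeAt(maxLen - 1);
  const after = text.charCodeAt(maxLen);
  // The cut sits between the two halves of a pair: back up one code unit.
  if (isHighSurrogate(before) && isLowSurrogate(after)) return maxLen - 1;
  return maxLen;
}

// Splitting at the safe index keeps both halves well-formed and lossless.
function splitAt(text: string, maxLen: number): [string, string] {
  const cut = surrogateSafeCut(text, maxLen);
  return [text.slice(0, cut), text.slice(cut)];
}
```

Because the two slices share one cut index, concatenating them always reproduces the source string, which is the property the stable-chunk-identity invariant depends on.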

Co-authored credit: @garrytan-agents for surfacing the fix in PRs
#1378-#1382 (closed in favor of consolidated design doc #1409).

* New: src/core/text-safe.ts (truncateUtf8 + safeSplitIndex helpers).
* New: test/text-safe.test.ts (18 cases, all 3 surrogate cases plus
  boundary-after-pair conservative back-up per codex CK16).
* refactor(judge): import truncateUtf8 from text-safe; re-export for
  back-compat. Existing 32 judge tests pass unchanged.
* fix(synthesize): findBoundary tier-3 routes through safeSplitIndex.
  3 new surrogate-safety cases in test/cycle-synthesize-chunker.test.ts
  (emoji at boundary, non-BMP CJK at boundary, determinism + joined
  chunks reconstruct source byte-identical across 5 fuzzed hashes).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(schema): widen link_source CHECK to include 'mentions' (v95)

Part B of v0.42.0.0: link_source enum widening to admit a fourth
provenance channel for auto-linked body-text mentions from the
upcoming `gbrain extract links --by-mention` command.

Codex outside-voice review on the v0.42.0.0 plan caught that the
existing link_source CHECK is a hard wall (src/schema.sql:356) —
my earlier draft claimed "no schema migration needed; link_source
is free-form TEXT." Wrong. The CHECK admits only NULL OR
('markdown', 'frontmatter', 'manual'); attempting to insert
link_source='mentions' would have raised a constraint violation
on every auto-link write. Migration v95 widens the CHECK to admit
'mentions' alongside the three existing values.

Mentions are intentionally a separate provenance from markdown
(human-authored links) so the backlink-count SQL in postgres-engine
+ pglite-engine can filter `WHERE link_source != 'mentions'` for
search ranking (D12). Mentions still count toward orphan-ratio and
graph traversal — distinct semantics from the three human-authored
sources, modeled cleanly on the dedicated CHECK value.

* src/schema.sql: widened CHECK with provenance comment.
* src/core/pglite-schema.ts: same widening (PGLite engine parity).
* src/core/schema-embedded.ts: regenerated via `bun run build:schema`.
* src/core/migrate.ts: new migration v95
  `links_link_source_check_includes_mentions` with both Postgres
  and PGLite branches. DROP IF EXISTS + ADD CONSTRAINT pattern so
  re-applying the migration is a no-op (idempotent).
* test/schema-migrate-link-source-mentions.test.ts (NEW, 7 cases):
  registration shape, SQL shape (all 4 values present + DROP IF
  EXISTS pattern), PGLite branch present, post-migration insert
  succeeds, CHECK still rejects unknown values (widening did not
  nullify the gate), idempotent re-application via runMigration.
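As a rough sketch of the DROP IF EXISTS + ADD CONSTRAINT pattern described above, the statements could be generated like this. The four allowed values and the `links` table come from the commit messages; the constraint name passed in the usage note and the `buildWidenCheckSql` helper are assumptions for illustration, not the contents of migration v95.

```typescript
// The four provenance values the widened CHECK admits (NULL is also allowed).
const ALLOWED_LINK_SOURCES = ["markdown", "frontmatter", "manual", "mentions"] as const;

// Builds two statements. Dropping with IF EXISTS first means re-running the pair
// leaves the table in the same state, which makes the migration idempotent.
function buildWidenCheckSql(table: string, constraint: string): string[] {
  const values = ALLOWED_LINK_SOURCES.map((v) => `'${v}'`).join(", ");
  return [
    `ALTER TABLE ${table} DROP CONSTRAINT IF EXISTS ${constraint}`,
    `ALTER TABLE ${table} ADD CONSTRAINT ${constraint} ` +
      `CHECK (link_source IS NULL OR link_source IN (${values}))`,
  ];
}
```

For example, `buildWidenCheckSql("links", "links_link_source_check")` yields a drop statement followed by an add statement, where `links_link_source_check` is a placeholder constraint name.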

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* refactor(orphans): expose getOrphansData alias as canonical pure data fn (D1)

D1 from /plan-eng-review for v0.42.0.0: doctor's upcoming orphan_ratio
check needs the SAME exclusion logic as `gbrain orphans` so the two
surfaces cannot disagree on what counts as an orphan. The existing
findOrphans() was already the pure data fn — this commit just makes
that contract explicit via the getOrphansData alias and pins it with
an IRON RULE regression test.

* src/commands/orphans.ts: export const getOrphansData = findOrphans
  (alias, same function reference). Documents the v0.42.0.0 contract
  in findOrphans' docstring.
* test/orphans-pure-fn.test.ts (NEW, 12 cases):
  - getOrphansData === findOrphans (same reference).
  - findOrphans + getOrphansData deep-equal output.
  - includePseudo branch toggles excluded count.
  - CLI --json output deep-equals findOrphans (IRON RULE — catches
    drift if anyone adds CLI-side post-filtering).
  - CLI --count matches total_orphans (with and without --include-pseudo).
  - shouldExclude regression: pseudo-pages, auto-suffix, raw segment,
    deny-prefixes, first-segment exclusions all fire correctly;
    regular slugs are NOT excluded.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(engine): filter mentions out of backlink-count for search ranking (D12)

D12 from /plan-eng-review for v0.42.0.0: codex outside-voice review
caught that engine.getBacklinkCounts had NO link_source filter — so
every link counted equally toward backlink-boost in hybridSearch.
Running `gbrain extract links --by-mention` (migration #1 of #1409)
would silently shift search ranking globally on first run, boosting
popular-mention pages over intentional-backlink pages.

Add `AND l.link_source IS DISTINCT FROM 'mentions'` to the LEFT JOIN
in both engines. `IS DISTINCT FROM` is NULL-safe per the
[sql-neq-misses-null-drift] memory: a naive `!= 'mentions'` would
silently drop legacy pre-v0.13 rows where link_source IS NULL (because
NULL != 'mentions' evaluates to NULL not TRUE in SQL three-valued
logic). The IS DISTINCT FROM form treats NULL as a distinct value so
legacy rows still count toward backlinks — the only rows filtered are
the explicitly mention-derived ones from v0.42.0.0+.
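The NULL behaviour can be modelled outside the database. The snippet below is a small simulation of SQL three-valued logic in TypeScript, with `null` standing in for both SQL NULL and UNKNOWN. The function names are made up for this example, and the real filter lives in the engines' SQL.

```typescript
type SqlBool = boolean | null; // null models SQL's UNKNOWN

// SQL `a != b`: UNKNOWN whenever either side is NULL.
function sqlNotEqual(a: string | null, b: string | null): SqlBool {
  if (a === null || b === null) return null;
  return a !== b;
}

// SQL `a IS DISTINCT FROM b`: always TRUE or FALSE; NULL is treated as a comparable value.
function sqlIsDistinctFrom(a: string | null, b: string | null): boolean {
  return a !== b;
}

// A WHERE/ON clause keeps a row only when the predicate is TRUE (not UNKNOWN).
function countWhere(rows: (string | null)[], pred: (v: string | null) => SqlBool): number {
  return rows.filter((v) => pred(v) === true).length;
}

// One legacy NULL row, two human-authored rows, one mention row.
const linkSources: (string | null)[] = [null, "markdown", "mentions", "manual"];
const naive = countWhere(linkSources, (v) => sqlNotEqual(v, "mentions"));
const nullSafe = countWhere(linkSources, (v) => sqlIsDistinctFrom(v, "mentions"));
```

Here `naive` counts 2 rows because the legacy NULL row is dropped along with the mention row, while `nullSafe` counts 3 because only the mention row is filtered.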

Mentions still count toward:
  - orphan-ratio (the whole point — `findOrphans` runs against `links`
    with no source filter, so an auto-linked page is no longer an orphan)
  - graph traversal (`traverseGraph` walks all link_source values)
  - graph adjacency (`getAdjacencyBoosts` includes mentions in the
    induced subgraph counts)

Mentions are filtered ONLY from:
  - `getBacklinkCounts` (this commit) — the input to hybridSearch's
    backlink_boost stage

* src/core/postgres-engine.ts: AND clause on the LEFT JOIN.
* src/core/pglite-engine.ts: same change for engine parity.
* test/backlink-count-mention-filter.test.ts (NEW, 6 cases):
  - 10 markdown + 0 mention → count = 10
  - 0 markdown + 50 mention → count = 0
  - 10 markdown + 50 mention → count = 10
  - NULL link_source legacy rows still count (IS DISTINCT FROM semantics)
  - mixed (markdown + frontmatter + manual + mentions) → only mentions filtered
  - uninitialized slug returns 0

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(by-mention): pure mention scanner with gazetteer + guards (D2/D6/D12/D13)

Net new module powering migration #1 of #1409 (orphan reduction).
buildGazetteer queries entity-typed pages (hardcoded D2 filter:
person/company/organization/entity, pack-aware deferred to TODO-1) and
produces a token-Map lookup keyed by lowercase first-token. findMentionedEntities
is a pure function that scans body text against the gazetteer, applies
maximal-munch matching (longest entry wins at each offset), self-link
guard (D13), cross-source guard, and per-page first-mention-only cap
(1 link per source→target pair regardless of how many body mentions).

Token-Map + multi-word phrase pass per D6 — no new deps, no regex
alternation (pathological perf at 5K patterns), no Aho-Corasick (dep
tax not justified at this scale). At each token offset, lookup in
Map<lowercase, GazetteerEntry[]> is O(1); multi-word entries validate
subsequent tokens. Bucket pre-sorted longest-first so the first valid
entry IS the maximal-munch winner.
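A compact sketch of the token-Map lookup with maximal-munch matching might look like the following. It assumes a simplified `Entry` type and an ASCII-only tokenizer; the real `GazetteerEntry` type and tokenization in src/core/by-mention.ts may differ, and the guards (self-link, cross-source, ignore list, minimum name length) are left out.

```typescript
// Illustrative entry type, not the module's GazetteerEntry.
interface Entry {
  slug: string;
  tokens: string[]; // lowercase tokens of the entity name
}

// ASCII-only tokenizer, an assumption made to keep the sketch short.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Bucket entries by first token, longest entry first within each bucket.
function buildTokenMap(names: { slug: string; name: string }[]): Map<string, Entry[]> {
  const map = new Map<string, Entry[]>();
  for (const { slug, name } of names) {
    const tokens = tokenize(name);
    if (tokens.length === 0) continue;
    const bucket = map.get(tokens[0]) ?? [];
    bucket.push({ slug, tokens });
    map.set(tokens[0], bucket);
  }
  map.forEach((bucket) => bucket.sort((a, b) => b.tokens.length - a.tokens.length));
  return map;
}

// At each offset, take the first (longest) entry whose tokens all match, then skip past it.
function findMentions(text: string, map: Map<string, Entry[]>): string[] {
  const tokens = tokenize(text);
  const found = new Set<string>(); // first-mention-only: one hit per target
  let i = 0;
  while (i < tokens.length) {
    const start = i;
    const bucket = map.get(tokens[start]) ?? [];
    const hit = bucket.find((e) => e.tokens.every((t, k) => tokens[start + k] === t));
    if (hit) {
      found.add(hit.slug);
      i += hit.tokens.length;
    } else {
      i += 1;
    }
  }
  return Array.from(found);
}
```

Sorting each bucket once at build time keeps the scan loop simple: the first entry that validates is already the longest candidate, so no per-offset comparison of match lengths is needed.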

Ignore-list semantics per CK12: built-in ambiguous tokens (Apple,
Amazon, Square, Stripe, Box, Meta, Target, Oracle) suppressed at
gazetteer-build time ONLY when no corresponding entity page exists.
If the user has explicitly created companies/apple, gazetteer
presence wins — ignore list does NOT override user intent.

Min-name-length filter at 4 chars kills false-positive 1-3-char names
(AI, YC, X, IBM). Codex CK13 noted this trade-off will under-deliver
on 3-char real entities; pack-aware follow-up (TODO-1) can let users
opt 3-char entity types in deliberately.

Code-block stripping via existing stripCodeBlocks() from
link-extraction.ts. CK8 fix: stripCodeBlocks was internal-only; this
commit exports it so by-mention.ts can reuse without rolling its own
fenced/inline code parser.

* src/core/by-mention.ts (NEW, 240 LOC):
  - LINKABLE_ENTITY_TYPES const (hardcoded D2 type filter).
  - GazetteerEntry + Gazetteer + Mention types.
  - buildGazetteer(engine, opts) — engine-backed, hardcoded type filter,
    ignore-list at build time per CK12, sort buckets longest-first.
  - findMentionedEntities(text, gazetteer, opts) — pure, maximal-munch,
    guards (self-link/cross-source/first-mention-cap), code-block strip.
* src/core/link-extraction.ts: export stripCodeBlocks (CK8 fix).
* test/by-mention.test.ts (NEW, 22 cases):
  - All 20 plan-mandated cases.
  - Plus extraIgnore user-override case + LINKABLE_ENTITY_TYPES contract pin.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(extract): --by-mention auto-link entity mentions (migration #1 of #1409)

Wires the v0.42.0.0 mention scanner into 'gbrain extract links'. Mode
dispatch: when --by-mention is set, runs ONLY the new mention pass
(skips default link/frontmatter extract) so the two surfaces don't
conflict mid-run. The default extract path is unchanged.

Flag plumbing:
* --by-mention: opts into the mention pass. Mode dispatch.
* --source fs --by-mention rejected with paste-ready --source db
  fix-hint (D7: gazetteer needs the engine; FS-walk + DB-gazetteer is
  incoherent).
* timeline --by-mention rejected (mentions are a links-pass concern).
* --source-id scopes the page WALK; gazetteer remains brain-wide
  (the cross-source guard in findMentionedEntities prevents pages
  scanned in source A from auto-linking to entities in source B).
* --since DATE filters the walk to recently-modified pages.
* --type filter applies (rarely useful; included for parity).
* --dry-run prints add_link action lines without writing; --json
  emits one JSON line per dry-run action.

extractMentionsFromDb function:
* buildGazetteer once per run via hardcoded type filter (D2).
* Walks pages via engine.listAllPageRefs (DB-source only).
* Reads body as compiled_truth || '\n\n' || COALESCE(timeline, '')
  per D3 — separator-joined so an end-of-compiled token doesn't
  merge with a start-of-timeline token into a false phrase match.
* findMentionedEntities returns Mention[] with self-link guard (D13)
  + cross-source guard + first-mention-only cap baked in.
* addLinksBatch with link_source='mentions' — distinct provenance
  channel that backlink-count filters out for search ranking (D12).
* Empty-gazetteer no-op with informative message (no entity pages =
  nothing to scan).
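The D3 body join can be illustrated with a toy tokenizer. The `words` and `joinBody` helpers below are hypothetical and only mirror the SQL expression quoted above. They show that, without a separator, the last token of compiled_truth and the first token of timeline fuse into a single token.

```typescript
// Word tokens only; punctuation and whitespace separate tokens.
function words(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Mirrors compiled_truth || '\n\n' || COALESCE(timeline, '').
function joinBody(compiledTruth: string, timeline: string | null): string {
  return compiledTruth + "\n\n" + (timeline ?? "");
}

const compiled = "Quarterly notes on Acme";
const timeline = "Corp filings reviewed 2026-05-20";

// No separator: "Acme" and "Corp" fuse into the single token "acmecorp".
const fused = words(compiled + timeline);
// With the separator, "acme" and "corp" stay separate tokens.
const joined = words(joinBody(compiled, timeline));
```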

* src/commands/extract.ts: --by-mention flag + mode dispatch + FS
  rejection + extractMentionsFromDb function (~120 LOC).
* test/extract-by-mention.test.ts (NEW, 12 cases):
  end-to-end happy path, idempotency, --dry-run no writes, --json
  output shape, --source-id scoping, --source fs rejection with
  fix-hint, timeline rejection, mode dispatch (no markdown rows when
  --by-mention), coexistence of markdown + mention link_source on
  same (from,to) pair via ON CONFLICT key, schema migration
  verification (link_source='mentions' insert succeeds), empty-brain
  no-op, cross-source guard (team-b post → default acme = no link).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(doctor): orphan_ratio check on local + thin-client surfaces (D5/D11)

D5/D11 from /plan-eng-review for v0.42.0.0: surface orphan-page count
in 'gbrain doctor' so users discover the new --by-mention fix without
having to know the feature exists. Two surfaces because thin-client
installs (gbrain init --mcp-only) route to runRemoteDoctor entirely —
adding the check to runDoctor only would miss every brain-server
consumer (codex CK5 caught this exactly during outside-voice review).

Local surface (src/commands/doctor.ts):
* Inserts as check '9b' right after graph_coverage.
* Consumes getOrphansData() — the canonical pure data fn from T5 —
  so doctor and 'gbrain orphans --count' cannot disagree on the ratio.
* Vacuous gate at < 100 entity pages (small brains naturally show
  high orphan ratio; not actionable signal).
* warn > 0.5, fail > 0.8; both states recommend
  'gbrain extract links --by-mention' as the fix.
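The gating and threshold logic for the local check can be sketched as below. The 100-page vacuous gate and the 0.5 / 0.8 thresholds come from the commit message; the function name, result shape, message wording, and the choice of entity pages as the ratio's denominator are assumptions made for this example.

```typescript
type CheckStatus = "ok" | "warn" | "fail";

interface OrphanRatioResult {
  status: CheckStatus;
  message: string;
}

const MIN_ENTITY_PAGES = 100; // below this the check is vacuous
const WARN_RATIO = 0.5;
const FAIL_RATIO = 0.8;
const FIX_HINT = "Run `gbrain extract links --by-mention`";

// Assumption: ratio = orphan count / entity page count.
function checkOrphanRatio(entityPages: number, orphans: number): OrphanRatioResult {
  if (entityPages < MIN_ENTITY_PAGES) {
    return { status: "ok", message: `vacuous: only ${entityPages} entity pages` };
  }
  const ratio = orphans / entityPages;
  const pct = Math.round(ratio * 100);
  if (ratio > FAIL_RATIO) return { status: "fail", message: `${pct}% orphans. ${FIX_HINT}` };
  if (ratio > WARN_RATIO) return { status: "warn", message: `${pct}% orphans. ${FIX_HINT}` };
  return { status: "ok", message: `${pct}% orphans` };
}
```

Checking the vacuous gate first also avoids a division by zero on a brain with no entity pages.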

Thin-client surface (src/core/doctor-remote.ts):
* New exported runOrphanRatioCheck function. Mirrors local logic
  but routes through find_orphans MCP op (existing v0.12.3 op,
  scope: read — even minimal-scope thin-clients can call it).
* Operator-pointing hint: 'Ask the brain operator at <url> to run
  gbrain extract links --by-mention'. Thin-client users can't run
  the fix against a brain they don't host (v0.31.1 bug class).
* Network failure fall-back: returns informational ok with
  network_error detail, NOT fail — earlier mcp_smoke catches
  genuine unreachable; orphan_ratio is informational only.
* Skippable via the existing skipScopeProbe flag so hermetic
  fixtures that don't implement find_orphans on /mcp don't hang.

Wiring in --by-mention extract.ts integration test (fix-up):
CliOptions field is `progressInterval` not `progressIntervalMs`,
and `timeoutMs: null` is required. Pre-existing tsc error
surfaced when typechecking the new doctor changes.

* test/doctor-orphan-ratio.test.ts (NEW, 10 cases):
  - <100 entity pages → vacuous ok
  - 100+ entities + low ratio (20%) → ok
  - high ratio (70%) → warn with fix-hint
  - very high ratio (90%) → fail with urgency fix-hint
  - zero entity pages → vacuous ok
  - JSON envelope contains orphan_ratio check
  - Thin-client: network failure → informational ok with detail
  - Cross-surface parity: source greps verify orphan_ratio name and
    fix command appear in BOTH doctor.ts and doctor-remote.ts; local
    hint is self-fix, thin-client hint asks the operator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(e2e): orphan-reduction end-to-end with cross-surface count parity

Pins the v0.42.0.0 design-doc claim shape — "material reduction in
orphan pages via --by-mention" — without committing to a specific %
(per TODO-4=C decision to soften the 88%->_30% promise into a
"material reduction, exact figure TBD via post-merge measurement on
representative brain").

3 e2e cases via hermetic PGLite:
* Seed 20 entities + 5 content pages mentioning 15 → assert orphan
  count drops by >=10 after --by-mention (material delta).
* Cross-check the D1 single-source contract end-to-end:
  gbrain orphans --count, getOrphansData() pure fn, and the doctor
  JSON orphan_ratio message all reflect the same numerator. If a
  future change makes them disagree, this fires.
* Re-run idempotency: second --by-mention invocation produces 0 new
  mention rows AND the first run actually created some (sanity gate
  so a no-op pass doesn't trivially satisfy the idempotency test).

* test/e2e/orphan-reduction.test.ts (NEW, 3 cases, hermetic PGLite,
  no DATABASE_URL needed).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* release: v0.41.10.0 — orphan reduction via --by-mention + surrogate-pair fix

Bumps VERSION + package.json to 0.41.10.0 (next available slot in the
v0.41.x queue after master moved to v0.41.4.0). Minor bump scope: new
CLI flag (`gbrain extract links --by-mention`), new schema migration
v95, new doctor check `orphan_ratio`, new public src/core/text-safe.ts
module, new src/core/by-mention.ts module, new link_source enum value
with ranking-filter semantic.

CHANGELOG entry follows the v0.41.x voice rules: ELI10 lead, To take
advantage block with paste-ready commands, How to turn it on, What
you'd see, Promise calibration (softens design-doc 88%->_30% claim
per codex CK13), What to watch for, Itemized changes split into Part
A (surrogate-pair fix) + Part B (auto-link --by-mention) + Follow-ups
(TODO-1 through TODO-4). Credits @garrytan-agents for the underlying
PR work (#1378-#1382 closed in favor of design doc #1409).

TODOS.md gets four new follow-up entries (pack-aware gazetteer,
cycle integration, MCP op, post-merge measurement).

System-of-record annotation: the addLinksBatch call in
extractMentionsFromDb carries `gbrain-allow-direct-insert` per the
canonical reconcile-layer write pattern.

3-line audit: VERSION + package.json + CHANGELOG top all on 0.41.10.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>