v0.41.10.0 feat: orphan reduction via --by-mention + UTF-16 surrogate-pair fix #1442
Merged
Part A of v0.42.0.0 fix wave: lifts surrogate-pair-safe slicing from src/core/eval-contradictions/judge.ts into a new shared module src/core/text-safe.ts.
The dream-cycle chunker findBoundary tier-3 fallback (synthesize.ts) previously hard-split at maxChars, orphaning a high surrogate when the boundary landed inside emoji / non-BMP CJK / mathematical alphanumerics. Resulting chunks were not byte-identical to the source content, which broke the v0.30.2 D9 stable-chunk-identity invariant: the per-chunk idempotency key drifted across retries on transcripts containing 4-byte UTF-8 characters near a hard-split.
Five agent-authored PRs (#1378-#1382) each independently introduced a narrow safeSliceEnd helper that handled ONE of the three correctness cases (high+low pair straddle) but missed the AT-low-surrogate case that fires when a boundary lands inside a complete pair. The shared text-safe.ts module exports both truncateUtf8 (the verbatim sliced string, for judge.ts) and safeSplitIndex (the boundary index, for chunker hot path), each covering all three cases.
Co-authored credit: @garrytan-agents for surfacing the fix in PRs #1378-#1382 (closed in favor of consolidated design doc #1409).
* New: src/core/text-safe.ts (truncateUtf8 + safeSplitIndex helpers).
* New: test/text-safe.test.ts (18 cases, all 3 surrogate cases plus boundary-after-pair conservative back-up per codex CK16).
* refactor(judge): import truncateUtf8 from text-safe; re-export for back-compat. Existing 32 judge tests pass unchanged.
* fix(synthesize): findBoundary tier-3 routes through safeSplitIndex. 3 new surrogate-safety cases in test/cycle-synthesize-chunker.test.ts (emoji at boundary, non-BMP CJK at boundary, determinism + joined chunks reconstruct source byte-identical across 5 fuzzed hashes).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
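The hard-split problem described in this commit can be illustrated with a minimal sketch. It is not the code in src/core/text-safe.ts: the names `safeSplitIndexSketch` and `chunkSketch` are hypothetical, and the sketch only handles the cut that would separate a well-formed high/low pair, without reproducing the module's three-case breakdown.

```typescript
// Hypothetical sketch; the real helpers live in src/core/text-safe.ts and may differ.
const isHighSurrogate = (unit: number): boolean => unit >= 0xd800 && unit <= 0xdbff;
const isLowSurrogate = (unit: number): boolean => unit >= 0xdc00 && unit <= 0xdfff;

/** Largest index <= maxChars where `text` can be cut without separating a surrogate pair. */
function safeSplitIndexSketch(text: string, maxChars: number): number {
  if (maxChars <= 0) return 0;
  if (maxChars >= text.length) return text.length;
  const before = text.charCodeAt(maxChars - 1);
  const at = text.charCodeAt(maxChars);
  // The cut would fall between the two halves of one pair: back up one code unit.
  return isHighSurrogate(before) && isLowSurrogate(at) ? maxChars - 1 : maxChars;
}

/** Splits `text` into chunks of at most maxChars UTF-16 code units (pairs stay whole). */
function chunkSketch(text: string, maxChars: number): string[] {
  if (maxChars < 1) throw new RangeError("maxChars must be >= 1");
  const chunks: string[] = [];
  let rest = text;
  while (rest.length > 0) {
    let cut = safeSplitIndexSketch(rest, maxChars);
    // A pair wider than the budget would yield an empty chunk; emit the whole pair instead.
    if (cut === 0) cut = Math.min(2, rest.length);
    chunks.push(rest.slice(0, cut));
    rest = rest.slice(cut);
  }
  return chunks;
}

const sample = "ab😀cd"; // the emoji occupies code units 2 and 3
const sampleChunks = chunkSketch(sample, 3);
console.log(sampleChunks, sampleChunks.join("") === sample);
```

Because every chunk boundary sits outside a pair, joining the chunks reproduces the source string exactly, which is the property the stable-chunk-identity invariant depends on.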
Part B of v0.42.0.0: link_source enum widening to admit a fourth
provenance channel for auto-linked body-text mentions from the
upcoming `gbrain extract links --by-mention` command.
Codex outside-voice review on the v0.42.0.0 plan caught that the
existing link_source CHECK is a hard wall (src/schema.sql:356) —
my earlier draft claimed "no schema migration needed; link_source
is free-form TEXT." Wrong. The CHECK admits only NULL OR
('markdown', 'frontmatter', 'manual'); attempting to insert
link_source='mentions' would have raised a constraint violation
on every auto-link write. Migration v95 widens the CHECK to admit
'mentions' alongside the three existing values.
Mentions are intentionally a separate provenance from markdown
(human-authored links) so the backlink-count SQL in postgres-engine
+ pglite-engine can filter `WHERE link_source != 'mentions'` for
search ranking (D12). Mentions still count toward orphan-ratio and
graph traversal — distinct semantics from the three human-authored
sources, modeled cleanly on the dedicated CHECK value.
* src/schema.sql: widened CHECK with provenance comment.
* src/core/pglite-schema.ts: same widening (PGLite engine parity).
* src/core/schema-embedded.ts: regenerated via `bun run build:schema`.
* src/core/migrate.ts: new migration v95
`links_link_source_check_includes_mentions` with both Postgres
and PGLite branches. DROP IF EXISTS + ADD CONSTRAINT pattern so
re-applying the migration is a no-op (idempotent).
* test/schema-migrate-link-source-mentions.test.ts (NEW, 7 cases):
registration shape, SQL shape (all 4 values present + DROP IF
EXISTS pattern), PGLite branch present, post-migration insert
succeeds, CHECK still rejects unknown values (widening did not
nullify the gate), idempotent re-application via runMigration.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
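As an illustration of the DROP IF EXISTS + ADD CONSTRAINT pattern this commit describes, here is a rough sketch of how the widening SQL could be assembled. The constraint name `links_link_source_check` and the helper names are assumptions made for the example; the actual migration v95 in src/core/migrate.ts may be shaped differently.

```typescript
// Assumption: the constraint is named links_link_source_check; the real name may differ.
const CONSTRAINT_NAME = "links_link_source_check";
const ALLOWED_LINK_SOURCES: readonly string[] = ["markdown", "frontmatter", "manual", "mentions"];

/** Builds SQL that can be re-applied safely: drop the old CHECK if present, then add the wide one. */
function buildWidenCheckSql(): string {
  const values = ALLOWED_LINK_SOURCES.map((v) => `'${v}'`).join(", ");
  return [
    `ALTER TABLE links DROP CONSTRAINT IF EXISTS ${CONSTRAINT_NAME};`,
    `ALTER TABLE links ADD CONSTRAINT ${CONSTRAINT_NAME}`,
    `  CHECK (link_source IS NULL OR link_source IN (${values}));`,
  ].join("\n");
}

/** Mirrors the widened CHECK in application code: NULL or one of the four values. */
function passesLinkSourceCheck(value: string | null): boolean {
  return value === null || ALLOWED_LINK_SOURCES.indexOf(value) !== -1;
}

console.log(buildWidenCheckSql());
console.log(passesLinkSourceCheck("mentions"), passesLinkSourceCheck("bogus"));
```

Dropping before adding is what makes a second application a no-op, and the CHECK still rejects unknown values after the widening.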
… fn (D1)
D1 from /plan-eng-review for v0.42.0.0: doctor's upcoming orphan_ratio
check needs the SAME exclusion logic as `gbrain orphans` so the two
surfaces cannot disagree on what counts as an orphan. The existing
findOrphans() was already the pure data fn — this commit just makes
that contract explicit via the getOrphansData alias and pins it with
an IRON RULE regression test.
* src/commands/orphans.ts: export const getOrphansData = findOrphans
(alias, same function reference). Documents the v0.42.0.0 contract
in findOrphans' docstring.
* test/orphans-pure-fn.test.ts (NEW, 12 cases):
- getOrphansData === findOrphans (same reference).
- findOrphans + getOrphansData deep-equal output.
- includePseudo branch toggles excluded count.
- CLI --json output deep-equals findOrphans (IRON RULE — catches
drift if anyone adds CLI-side post-filtering).
- CLI --count matches total_orphans (with and without --include-pseudo).
- shouldExclude regression: pseudo-pages, auto-suffix, raw segment,
deny-prefixes, first-segment exclusions all fire correctly;
regular slugs are NOT excluded.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… (D12)
D12 from /plan-eng-review for v0.42.0.0: codex outside-voice review caught that engine.getBacklinkCounts had NO link_source filter, so every link counted equally toward backlink-boost in hybridSearch. Running `gbrain extract links --by-mention` (migration #1 of #1409) would silently shift search ranking globally on first run, boosting popular-mention pages over intentional-backlink pages.
Add `AND l.link_source IS DISTINCT FROM 'mentions'` to the LEFT JOIN in both engines. `IS DISTINCT FROM` is NULL-safe per the [sql-neq-misses-null-drift] memory: a naive `!= 'mentions'` would silently drop legacy pre-v0.13 rows where link_source IS NULL (because NULL != 'mentions' evaluates to NULL not TRUE in SQL three-valued logic). The IS DISTINCT FROM form treats NULL as a distinct value so legacy rows still count toward backlinks; the only rows filtered are the explicitly mention-derived ones from v0.42.0.0+.
Mentions still count toward:
- orphan-ratio (the whole point: `findOrphans` runs against `links` with no source filter, so an auto-linked page is no longer an orphan)
- graph traversal (`traverseGraph` walks all link_source values)
- graph adjacency (`getAdjacencyBoosts` includes mentions in the induced subgraph counts)
Mentions are filtered ONLY from:
- `getBacklinkCounts` (this commit): the input to hybridSearch's backlink_boost stage
* src/core/postgres-engine.ts: AND clause on the LEFT JOIN.
* src/core/pglite-engine.ts: same change for engine parity.
* test/backlink-count-mention-filter.test.ts (NEW, 6 cases):
- 10 markdown + 0 mention → count = 10
- 0 markdown + 50 mention → count = 0
- 10 markdown + 50 mention → count = 10
- NULL link_source legacy rows still count (IS DISTINCT FROM semantics)
- mixed (markdown + frontmatter + manual + mentions) → only mentions filtered
- uninitialized slug returns 0
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
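The NULL-safety argument in this commit can be checked with a small model of SQL three-valued logic. This is an in-memory illustration only; `sqlNotEqual`, `sqlIsDistinctFrom`, and `countBacklinks` are hypothetical names and none of this is the engine code.

```typescript
type SqlBool = boolean | null; // null models SQL UNKNOWN

/** SQL `a != b`: evaluates to UNKNOWN when either operand is NULL. */
function sqlNotEqual(a: string | null, b: string | null): SqlBool {
  if (a === null || b === null) return null;
  return a !== b;
}

/** SQL `a IS DISTINCT FROM b`: always TRUE or FALSE; NULL compares like an ordinary value. */
function sqlIsDistinctFrom(a: string | null, b: string | null): boolean {
  return a !== b;
}

/** A join or WHERE predicate keeps a row only when it evaluates to TRUE. */
function countBacklinks(
  sources: (string | null)[],
  predicate: (source: string | null) => SqlBool,
): number {
  return sources.filter((source) => predicate(source) === true).length;
}

// Two legacy rows with NULL link_source, two human-authored links, two auto-linked mentions.
const linkSources: (string | null)[] = [null, null, "markdown", "manual", "mentions", "mentions"];
const naiveCount = countBacklinks(linkSources, (s) => sqlNotEqual(s, "mentions"));
const nullSafeCount = countBacklinks(linkSources, (s) => sqlIsDistinctFrom(s, "mentions"));
console.log({ naiveCount, nullSafeCount }); // { naiveCount: 2, nullSafeCount: 4 }
```

The naive predicate loses both legacy rows because UNKNOWN is not TRUE, while the NULL-safe form drops only the two mention rows.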
…/D12/D13)
Net new module powering migration #1 of #1409 (orphan reduction). buildGazetteer queries entity-typed pages (hardcoded D2 filter: person/company/organization/entity, pack-aware deferred to TODO-1) and produces a token-Map lookup keyed by lowercase first-token. findMentionedEntities is a pure function that scans body text against the gazetteer, applies maximal-munch matching (longest entry wins at each offset), self-link guard (D13), cross-source guard, and per-page first-mention-only cap (1 link per source→target pair regardless of how many body mentions).
Token-Map + multi-word phrase pass per D6: no new deps, no regex alternation (pathological perf at 5K patterns), no Aho-Corasick (dep tax not justified at this scale). At each token offset, lookup in Map<lowercase, GazetteerEntry[]> is O(1); multi-word entries validate subsequent tokens. Bucket pre-sorted longest-first so the first valid entry IS the maximal-munch winner.
Ignore-list semantics per CK12: built-in ambiguous tokens (Apple, Amazon, Square, Stripe, Box, Meta, Target, Oracle) suppressed at gazetteer-build time ONLY when no corresponding entity page exists. If the user has explicitly created companies/apple, gazetteer presence wins; the ignore list does NOT override user intent.
Min-name-length filter at 4 chars kills false-positive 2-3-char names (AI, YC, X, IBM). Codex CK13 noted this trade-off will under-deliver on 3-char real entities; pack-aware follow-up (TODO-1) can let users opt 3-char entity types in deliberately.
Code-block stripping via existing stripCodeBlocks() from link-extraction.ts. CK8 fix: stripCodeBlocks was internal-only; this commit exports it so by-mention.ts can reuse without rolling its own fenced/inline code parser.
* src/core/by-mention.ts (NEW, 240 LOC):
- LINKABLE_ENTITY_TYPES const (hardcoded D2 type filter).
- GazetteerEntry + Gazetteer + Mention types.
- buildGazetteer(engine, opts): engine-backed, hardcoded type filter, ignore-list at build time per CK12, sort buckets longest-first.
- findMentionedEntities(text, gazetteer, opts): pure, maximal-munch, guards (self-link/cross-source/first-mention-cap), code-block strip.
* src/core/link-extraction.ts: export stripCodeBlocks (CK8 fix).
* test/by-mention.test.ts (NEW, 22 cases):
- All 20 plan-mandated cases.
- Plus extraIgnore user-override case + LINKABLE_ENTITY_TYPES contract pin.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
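The token-Map lookup with maximal-munch matching described in this commit can be sketched in memory as follows. Everything here is illustrative: `EntrySketch`, `buildGazetteerSketch`, and `findMentionsSketch` are hypothetical names, the tokenizer is ASCII-only for brevity, and the ignore list, cross-source guard, and code-block stripping are left out, so the real src/core/by-mention.ts will differ.

```typescript
// Hypothetical shapes; the real GazetteerEntry / Mention types may differ.
interface EntrySketch { slug: string; tokens: string[] }
type GazetteerSketch = Map<string, EntrySketch[]>;

// Simplification: ASCII-only tokens. A real tokenizer would need to handle Unicode names.
const tokenize = (text: string): string[] => text.toLowerCase().match(/[a-z0-9]+/g) ?? [];

function buildGazetteerSketch(pages: { slug: string; title: string }[], minLen = 4): GazetteerSketch {
  const gazetteer: GazetteerSketch = new Map();
  for (const page of pages) {
    if (page.title.length < minLen) continue; // drops short names such as "AI" or "YC"
    const tokens = tokenize(page.title);
    const first = tokens[0];
    if (first === undefined) continue;
    const bucket = gazetteer.get(first) ?? [];
    bucket.push({ slug: page.slug, tokens });
    gazetteer.set(first, bucket); // keyed by lowercase first token
  }
  // Longest-first, so the first entry that validates is the maximal-munch winner.
  gazetteer.forEach((bucket) => bucket.sort((a, b) => b.tokens.length - a.tokens.length));
  return gazetteer;
}

function findMentionsSketch(text: string, gazetteer: GazetteerSketch, selfSlug: string): string[] {
  const tokens = tokenize(text);
  const found = new Set<string>(); // a Set gives one link per target regardless of repeats
  let i = 0;
  while (i < tokens.length) {
    const start = i;
    const bucket = gazetteer.get(tokens[start] ?? "") ?? [];
    const hit = bucket.find((entry) => entry.tokens.every((t, k) => tokens[start + k] === t));
    if (hit === undefined) { i += 1; continue; }
    if (hit.slug !== selfSlug) found.add(hit.slug); // self-link guard
    i += hit.tokens.length; // consume the whole matched phrase
  }
  return Array.from(found);
}

const gazetteerExample = buildGazetteerSketch([
  { slug: "companies/acme", title: "Acme" },
  { slug: "companies/acme-labs", title: "Acme Labs" },
  { slug: "people/jane-doe", title: "Jane Doe" },
  { slug: "companies/yc", title: "YC" },
]);
console.log(findMentionsSketch("Jane Doe joined Acme Labs after YC; Acme Labs grew.", gazetteerExample, "meetings/kickoff"));
```

In this example "Acme Labs" wins over "Acme" at the same offset, "YC" never enters the gazetteer because of the length filter, and the repeated mention produces a single target.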
…#1409)
Wires the v0.42.0.0 mention scanner into 'gbrain extract links'. Mode dispatch: when --by-mention is set, runs ONLY the new mention pass (skips default link/frontmatter extract) so the two surfaces don't conflict mid-run. The default extract path is unchanged.
Flag plumbing:
* --by-mention: opts into the mention pass. Mode dispatch.
* --source fs --by-mention rejected with paste-ready --source db fix-hint (D7: gazetteer needs the engine; FS-walk + DB-gazetteer is incoherent).
* timeline --by-mention rejected (mentions are a links-pass concern).
* --source-id scopes the page WALK; gazetteer remains brain-wide (the cross-source guard in findMentionedEntities stops pages in source A from auto-linking entities in source B).
* --since DATE filters the walk to recently-modified pages.
* --type filter applies (rarely useful; included for parity).
* --dry-run prints add_link action lines without writing; --json emits one JSON line per dry-run action.
extractMentionsFromDb function:
* buildGazetteer once per run via hardcoded type filter (D2).
* Walks pages via engine.listAllPageRefs (DB-source only).
* Reads body as compiled_truth || '\n\n' || COALESCE(timeline, '') per D3: separator-joined so an end-of-compiled token doesn't merge with a start-of-timeline token into a false phrase match.
* findMentionedEntities returns Mention[] with self-link guard (D13) + cross-source guard + first-mention-only cap baked in.
* addLinksBatch with link_source='mentions': a distinct provenance channel that backlink-count filters out for search ranking (D12).
* Empty-gazetteer no-op with informative message (no entity pages = nothing to scan).
* src/commands/extract.ts: --by-mention flag + mode dispatch + FS rejection + extractMentionsFromDb function (~120 LOC).
* test/extract-by-mention.test.ts (NEW, 12 cases): end-to-end happy path, idempotency, --dry-run no writes, --json output shape, --source-id scoping, --source fs rejection with fix-hint, timeline rejection, mode dispatch (no markdown rows when --by-mention), coexistence of markdown + mention link_source on same (from,to) pair via ON CONFLICT key, schema migration verification (link_source='mentions' insert succeeds), empty-brain no-op, cross-source guard (team-b post → default acme = no link).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
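The mode dispatch and flag rejections listed in this commit might look roughly like the sketch below. The `ExtractArgsSketch` shape, the `resolveModeSketch` name, and the error wording are all hypothetical; only the flags themselves come from the commit message.

```typescript
// Hypothetical argument shape; the real CLI options in src/commands/extract.ts may differ.
interface ExtractArgsSketch {
  target: "links" | "timeline";
  source: "fs" | "db";
  byMention: boolean;
}
type ExtractMode = "default" | "mentions";

/** Picks exactly one pass per run and rejects incoherent flag combinations up front. */
function resolveModeSketch(args: ExtractArgsSketch): ExtractMode {
  if (!args.byMention) return "default"; // default extract path is unchanged
  if (args.target === "timeline") {
    // Illustrative message, not the CLI's actual wording.
    throw new Error("--by-mention applies to 'extract links' only");
  }
  if (args.source === "fs") {
    // The gazetteer needs the engine, so an FS walk cannot be combined with it.
    throw new Error("--by-mention requires the database; re-run with --source db");
  }
  return "mentions"; // the mention pass runs alone, skipping link/frontmatter extract
}

console.log(resolveModeSketch({ target: "links", source: "db", byMention: true }));
```

Rejecting the bad combinations before any page walk starts keeps the two passes from running against each other mid-run.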
…D11)
D5/D11 from /plan-eng-review for v0.42.0.0: surface orphan-page count
in 'gbrain doctor' so users discover the new --by-mention fix without
having to know the feature exists. Two surfaces because thin-client
installs (gbrain init --mcp-only) route to runRemoteDoctor entirely —
adding the check to runDoctor only would miss every brain-server
consumer (codex CK5 caught this exactly during outside-voice review).
Local surface (src/commands/doctor.ts):
* Inserts as check '9b' right after graph_coverage.
* Consumes getOrphansData() — the canonical pure data fn from T5 —
so doctor and 'gbrain orphans --count' cannot disagree on the ratio.
* Vacuous gate at < 100 entity pages (small brains naturally show
high orphan ratio; not actionable signal).
* warn > 0.5, fail > 0.8; both states recommend
'gbrain extract links --by-mention' as the fix.
Thin-client surface (src/core/doctor-remote.ts):
* New exported runOrphanRatioCheck function. Mirrors local logic
but routes through find_orphans MCP op (existing v0.12.3 op,
scope: read — even minimal-scope thin-clients can call it).
* Operator-pointing hint: 'Ask the brain operator at <url> to run
gbrain extract links --by-mention'. Thin-client users can't run
the fix against a brain they don't host (v0.31.1 bug class).
* Network failure fall-back: returns informational ok with
network_error detail, NOT fail — earlier mcp_smoke catches
genuine unreachable; orphan_ratio is informational only.
* Skippable via the existing skipScopeProbe flag so hermetic
fixtures that don't implement find_orphans on /mcp don't hang.
Wiring in --by-mention extract.ts integration test (fix-up):
CliOptions field is `progressInterval` not `progressIntervalMs`,
and `timeoutMs: null` is required. Pre-existing tsc error
surfaced when typechecking the new doctor changes.
* test/doctor-orphan-ratio.test.ts (NEW, 10 cases):
- <100 entity pages → vacuous ok
- 100+ entities + low ratio (20%) → ok
- high ratio (70%) → warn with fix-hint
- very high ratio (90%) → fail with urgency fix-hint
- zero entity pages → vacuous ok
- JSON envelope contains orphan_ratio check
- Thin-client: network failure → informational ok with detail
- Cross-surface parity: source greps verify orphan_ratio name and
fix command appear in BOTH doctor.ts and doctor-remote.ts; local
hint is self-fix, thin-client hint asks the operator.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
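The gating and thresholds in this commit (vacuous below 100 entity pages, warn above 0.5, fail above 0.8) can be sketched as a pure classifier. The input shape, the `classifyOrphanRatioSketch` name, the message text, and the choice of total pages as the ratio's denominator are assumptions for illustration; the actual check in src/commands/doctor.ts may compute the ratio differently.

```typescript
type CheckStatus = "ok" | "warn" | "fail";
interface OrphanRatioInput { entityPages: number; totalPages: number; orphans: number }
interface CheckResultSketch { name: "orphan_ratio"; status: CheckStatus; message: string }

const FIX_COMMAND = "gbrain extract links --by-mention";

/** Assumption: ratio = orphans / totalPages. The real denominator may differ. */
function classifyOrphanRatioSketch(input: OrphanRatioInput): CheckResultSketch {
  // Small brains naturally show a high orphan ratio, so the check is vacuous for them.
  if (input.entityPages < 100 || input.totalPages === 0) {
    return { name: "orphan_ratio", status: "ok", message: "vacuous: fewer than 100 entity pages" };
  }
  const ratio = input.orphans / input.totalPages;
  const percent = Math.round(ratio * 100);
  if (ratio > 0.8) {
    return { name: "orphan_ratio", status: "fail", message: `${percent}% orphans; run: ${FIX_COMMAND}` };
  }
  if (ratio > 0.5) {
    return { name: "orphan_ratio", status: "warn", message: `${percent}% orphans; consider: ${FIX_COMMAND}` };
  }
  return { name: "orphan_ratio", status: "ok", message: `${percent}% orphans` };
}

console.log(classifyOrphanRatioSketch({ entityPages: 200, totalPages: 1000, orphans: 700 }));
```

Keeping the classifier pure means the local and thin-client surfaces could share the same thresholds and differ only in the hint they attach.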
Pins the v0.42.0.0 design-doc claim shape ("material reduction in orphan pages via --by-mention") without committing to a specific % (per TODO-4=C decision to soften the 88%->_30% promise into a "material reduction, exact figure TBD via post-merge measurement on representative brain").
3 e2e cases via hermetic PGLite:
* Seed 20 entities + 5 content pages mentioning 15 → assert orphan count drops by >=10 after --by-mention (material delta).
* Cross-check the D1 single-source contract end-to-end: gbrain orphans --count, getOrphansData() pure fn, and the doctor JSON orphan_ratio message all reflect the same numerator. If a future change makes them disagree, this fires.
* Re-run idempotency: second --by-mention invocation produces 0 new mention rows AND the first run actually created some (sanity gate so a no-op pass doesn't trivially satisfy the idempotency test).
* test/e2e/orphan-reduction.test.ts (NEW, 3 cases, hermetic PGLite, no DATABASE_URL needed).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…air fix
Bumps VERSION + package.json to 0.41.10.0 (next available slot in the v0.41.x queue after master moved to v0.41.4.0).
Minor bump scope: new CLI flag (`gbrain extract links --by-mention`), new schema migration v95, new doctor check `orphan_ratio`, new public src/core/text-safe.ts module, new src/core/by-mention.ts module, new link_source enum value with ranking-filter semantic.
CHANGELOG entry follows the v0.41.x voice rules: ELI10 lead, To take advantage block with paste-ready commands, How to turn it on, What you'd see, Promise calibration (softens design-doc 88%->_30% claim per codex CK13), What to watch for, Itemized changes split into Part A (surrogate-pair fix) + Part B (auto-link --by-mention) + Follow-ups (TODO-1 through TODO-4). Credits @garrytan-agents for the underlying PR work (#1378-#1382 closed in favor of design doc #1409).
TODOS.md gets four new follow-up entries (pack-aware gazetteer, cycle integration, MCP op, post-merge measurement).
System-of-record annotation: the addLinksBatch call in extractMentionsFromDb carries `gbrain-allow-direct-insert` per the canonical reconcile-layer write pattern.
3-line audit: VERSION + package.json + CHANGELOG top all on 0.41.10.0.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rphans-onboard # Conflicts: # CHANGELOG.md # VERSION # package.json
…rphans-onboard # Conflicts: # CHANGELOG.md # VERSION # package.json
…rphans-onboard # Conflicts: # CHANGELOG.md # VERSION # package.json
…rphans-onboard # Conflicts: # CHANGELOG.md # TODOS.md # VERSION # package.json
garrytan added a commit that referenced this pull request May 25, 2026
…bump)
Brings in #1442 (v0.41.10.0: orphan reduction via --by-mention + UTF-16 surrogate-pair fix).
Standard trio conflicts resolved per CLAUDE.md procedure:
- VERSION: ours wins (0.41.11.0).
- package.json: ours wins (version line).
- CHANGELOG.md: both entries kept; ours stays topmost.
Code-file conflict (migrate.ts): master shipped v95 `links_link_source_check_includes_mentions` (the link_source CHECK constraint widening that admits 'mentions' for --by-mention auto-links). This collides with our v95 `facts_extract_conversation_session_index`. Resolution: master's v95 keeps its slot; ours bumps to v96. Handler runMigration() calls updated 95→96. Index shape unchanged.
Slot history accumulated:
- v94 plan → bumped to v95 (master claimed v94 take_domain_assignments)
- v95 → bumped to v96 (master claimed v95 links_link_source_check)
Post-merge verification:
- bun install (no changes)
- typecheck clean
- bun run verify PASS (21 checks, 13s parallel)
- 213/213 tests pass across the 4 most-impacted files (extract-conversation-facts, cycle.serial, phase-scope-coverage, migrate)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mgunnin added a commit to mgunnin/gbrain that referenced this pull request May 28, 2026
* upstream/master:
- v0.41.10.1 fix-wave: dream.* config + batch retry + extract_atoms idempotency + ze-switch env-gate (garrytan#1445)
- v0.41.10.0 feat: orphan reduction via --by-mention + UTF-16 surrogate-pair fix (garrytan#1442)
- v0.41.9.0 - UX/reliability fix wave (5 defects from production report) (garrytan#1440)
- v0.41.8.0 fix(pglite): search/query/get exit cleanly + garrytan#1340 hint + garrytan#1342 breadcrumbs (garrytan#1405)
- v0.41.7.0 feat: compact list-format resolver + 300-skill scaling tutorial (garrytan#1407)
- v0.41.6.0 feat(ci): CI test speedup - 23min → ~9min via matrix 4→6 + weight-aware sharding + auto SHA cache + parallel verify (garrytan#1444)
- v0.41.5.0 fix-wave: warm-narwhal - 6 community PRs + E2E reliability (garrytan#1374)
# Conflicts:
# src/core/ai/recipes/openai.ts
Summary
Two-part v0.41.10.0 wave consolidating the 5 closed agent-authored PRs (#1378–#1382) per design doc #1409.
Part A (UTF-16 surrogate-pair safety in dream-cycle chunker):
Lifts surrogate-pair-safe slicing from `judge.ts` into a new shared `src/core/text-safe.ts` module. The agent-authored fix that rode all 5 closed PRs handled only ONE of THREE correctness cases (high+low pair straddle); the AT-low-surrogate case (3) silently bit when a chunk boundary landed inside a complete pair. `safeSplitIndex` covers all three cases. Preserves the v0.30.2 D9 stable-chunk-identity invariant on retries of transcripts containing emoji / non-BMP CJK / mathematical alphanumerics.
Part B (auto-link entity mentions, `gbrain extract links --by-mention`):
Migration #1 from design doc #1409. Scans every page's body text for mentions of known entity pages (people, companies, organizations, entities), creates `link_source='mentions'` rows. Mentions filtered OUT of backlink-count for search ranking (D12) so first-run doesn't shift rankings globally. New doctor `orphan_ratio` check fires on BOTH local and thin-client (`runRemoteDoctor`) surfaces with context-appropriate fix-hints (D11).
Locked design decisions (14 D-decisions from /plan-eng-review + codex outside-voice round):
* `getOrphansData()` pure data fn
* `compiled_truth || '\n\n' || COALESCE(timeline, '')`
* `text-safe.ts` exports `truncateUtf8` + `safeSplitIndex` (judge re-imports)
* `orphan_ratio` on both surfaces, context-appropriate hints
* `link_source` CHECK to admit `'mentions'`
Test Coverage
35 new test cases across 7 new test files + 1 E2E:
* `test/text-safe.test.ts` (18): all 3 surrogate cases + boundary-after-pair (codex CK16)
* `test/cycle-synthesize-chunker.test.ts` (extends, +3): emoji + CJK at hard-split boundaries
* `test/schema-migrate-link-source-mentions.test.ts` (7): migration v95 idempotency + CHECK validation
* `test/orphans-pure-fn.test.ts` (12): IRON RULE regression pin (CLI ↔ pure fn deep-equal)
* `test/backlink-count-mention-filter.test.ts` (6): D12 filter, NULL-safe `IS DISTINCT FROM`
* `test/by-mention.test.ts` (22): all 20 plan-mandated cases + extras
* `test/extract-by-mention.test.ts` (12): CLI integration via PGLite
* `test/doctor-orphan-ratio.test.ts` (10): local + thin-client + parity contract
* `test/e2e/orphan-reduction.test.ts` (3): end-to-end material delta + cross-surface count parity
Full unit suite: 10,750 pass / 0 fail / 0 skip. Verify gate: clean.
Pre-Landing Review
All decisions made in plan-eng-review + 7 D-questions in the codex outside-voice round. CODEX caught 7 blocking issues the eng-review missed (CHECK constraint, pack-awareness contract, thin-client wiring, ranking pollution, self-link guard, cycle mismatch, test-spec corrections). All resolved in the locked decisions above.
Eval Results
No prompt-related files changed — evals skipped.
Plan Completion
All 14 D-decisions implemented across 10 atomic bisect-friendly commits + 1 version commit. Plan + GSTACK REVIEW REPORT at `/Users/garrytan/.claude/plans/system-instruction-you-are-working-drifting-forest.md`. ENG CLEARED with codex round folded in.
Follow-up TODOs filed (TODO-1 P2 pack-aware gazetteer, TODO-2 P2 cycle integration, TODO-3 P3 MCP op, TODO-4 P1 post-merge orphan-ratio measurement).
Pre-existing master failures flagged (NOT caused by this branch; zero commits on branch touched these files):
* `test/e2e/cycle.test.ts` (5 fails): `CONNECTION_ENDED` postgres pool exhaustion under parallel load
* `test/e2e/dream-cycle-phase-order-pglite.test.ts` (2 fails): EXPECTED_PHASES list missing `extract_atoms` + `synthesize_concepts` added by v0.41.2 lens packs (ca68633f)
* `test/e2e/dream.test.ts` (1 fail): pre-existing dream-cycle flake
* `test/e2e/mechanical.test.ts` (1 fail): doctor fails on `embedding_width_consistency` (host schema is 1536d, gateway resolves to 1280d) + `source_routing_health` warn (source 'delta' has zero pages); both pre-existing host state
* `test/e2e/thin-client.test.ts` (1 fail): pre-existing 60s beforeEach timeout
To take advantage
Test plan
* `orphan_ratio` numerator == `gbrain orphans --count`
* `gbrain extract links --by-mention` and capture orphan-ratio delta (TODO-4)
🤖 Generated with Claude Code