
v0.41.29.0 feat(conversation-parser): bold-name-no-time builtin + fix(orphans): source-scoped orphan_ratio (supersedes #1613)#1620

Merged: garrytan merged 3 commits into master from garrytan/moab on May 29, 2026

@garrytan (Owner)

Supersedes #1613 (re-homed from the fork so CI can run with secrets; real names scrubbed for the privacy rule; orphan_ratio --source scoping bundled in).

What ships (two fixes, 3 bisect-friendly commits)

1. bold-name-no-time conversation parser builtin (the 14th)

Parses **Speaker:** text transcripts with no per-line timestamp — the shape Circleback / Granola / Zoom emit. Every prior builtin required a time anchor, so this shape matched nothing: a production brain had 104 conversation pages + 3,423 eligible pages silently extracting zero conversation-facts. This is the unlock for whole-brain conversation-facts extraction.

  • Messages anchor at T00:00:00Z of the frontmatter date (no fabricated wall-clock; line order preserves sequence), same convention as irc-classic.
  • regex /^\*\*(?!\[)(.+?):\*\*\s*(.*)$/: the colon-inside-bold (not declaration order) prevents shadowing bold-paren-time; the (?!\[) lookahead rejects telegram-bracket **[18:37] Name:** so disabling telegram-bracket yields an honest no_match instead of speaker="[18:37] Name".
  • new optional PatternEntry.score_full_body: **Label:** text is a common prose idiom, so a notes page with bold labels clustered in its first 10 lines previously scored 0.3 on the head pass (not below SCORING_HEAD_TRIGGER_THRESHOLD, so the full-body fallback never fired) and cleared the 0.05 floor. parse.ts now recomputes the winner's score over the full body before applying the floor, so such a page stays no_match.
  • scrubbed pre-existing real names from bold-paren-time test samples (privacy rule).
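
A minimal sketch of how the regex above behaves, assuming a simple line-by-line parser. Only the regex and the T00:00:00Z anchoring convention come from this description; the helper name `parseBoldNameNoTime`, the `Message` shape, and the `**Name** (10:32): text` line standing in for a bold-paren-time transcript are illustrative assumptions, not the code in parse.ts.

```typescript
// Regex quoted in the PR description.
const BOLD_NAME_NO_TIME = /^\*\*(?!\[)(.+?):\*\*\s*(.*)$/;

// Hypothetical message shape, for illustration only.
interface Message {
  speaker: string;
  text: string;
  timestamp: string;
}

// Hypothetical helper: keeps matching lines in order and anchors every
// message at midnight UTC of the frontmatter date (no invented wall-clock).
function parseBoldNameNoTime(body: string, frontmatterDate: string): Message[] {
  const messages: Message[] = [];
  for (const line of body.split("\n")) {
    const m = BOLD_NAME_NO_TIME.exec(line.trim());
    if (!m) continue; // not a "**Speaker:** text" line
    messages.push({
      speaker: m[1] ?? "",
      text: m[2] ?? "",
      timestamp: `${frontmatterDate}T00:00:00Z`,
    });
  }
  return messages;
}

const transcript = [
  "**Alice:** hello",        // matches: colon sits inside the bold span
  "**[18:37] Bob:** hi",     // rejected by the (?!\[) lookahead
  "**Carol** (10:32): late", // assumed bold-paren-time shape: no ":**", so no match
  "**Dana:** bye",
].join("\n");

const parsed = parseBoldNameNoTime(transcript, "2026-05-29");
```

Because every message shares the same timestamp, ordering has to come from array position, which is the "line order preserves sequence" point above.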

2. orphan_ratio / find_orphans source scoping

gbrain doctor --source <id> and gbrain orphans --source <id> now scope to that source instead of reporting brain-wide.

  • findOrphanPages(opts?: { sourceId?, sourceIds? }) on both engines scopes the candidate set (scalar = $1 / federated = ANY($1::text[])). Cross-source inbound links still count, so a page in X linked from Y is reachable (not an orphan of X).
  • Corrected the total_linkable denominator: excluded pages (templates/, scratch/) that have inbound links no longer inflate it and suppress warnings. Changes orphan_ratio output for every brain, in the accurate direction.
  • The find_orphans MCP op threads sourceScopeOpts(ctx), closing a cross-source read leak for source-bound OAuth clients (v0.34.1 source-isolation class).
  • When --source is passed explicitly and there are fewer than 100 entity pages, orphan_ratio reports the ratio with a low-scale caveat instead of a vacuous "ok". Thin-client doctor --source is deferred (TODOS.md).
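
The scoping rule above can be sketched in memory: the scope filters which pages are candidates, while inbound links are counted from every source. This is an illustration under stated assumptions, not either engine's implementation. The `Page` and `Link` shapes, the function name `findOrphanPagesSketch`, and globally unique slugs are all hypothetical simplifications.

```typescript
// Hypothetical in-memory shapes; the real engines work against SQL tables.
interface Page {
  slug: string;
  sourceId: string;
}
interface Link {
  fromSlug: string;
  toSlug: string;
}
interface ScopeOpts {
  sourceId?: string;
  sourceIds?: string[];
}

function findOrphanPagesSketch(pages: Page[], links: Link[], opts: ScopeOpts = {}): Page[] {
  // Federated list wins if present, else the scalar id, else no scoping.
  const scope =
    opts.sourceIds ?? (opts.sourceId !== undefined ? [opts.sourceId] : undefined);

  // Inbound links count regardless of which source they come from.
  const hasInbound = new Set(links.map((l) => l.toSlug));

  return pages.filter(
    (p) => (scope === undefined || scope.includes(p.sourceId)) && !hasInbound.has(p.slug),
  );
}

const pages: Page[] = [
  { slug: "x/a", sourceId: "X" },
  { slug: "x/b", sourceId: "X" },
  { slug: "y/c", sourceId: "Y" },
];
// The only link crosses sources: a page in Y links to a page in X.
const links: Link[] = [{ fromSlug: "y/c", toSlug: "x/a" }];

const orphansOfX = findOrphanPagesSketch(pages, links, { sourceId: "X" });
const orphansBrainWide = findOrphanPagesSketch(pages, links);
```

Here `x/a` is reachable through the link from source Y, so scoping to X reports only `x/b`.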

Review

  • /plan-eng-review (cleared) + /codex outside-voice (8 findings, 7 actioned + 1 confirmed-good). Codex caught the score_full_body floor gap, the bracket-timestamp mis-capture, the pre-existing denominator bug, and the find_orphans MCP leak.

Tests

  • bun run verify — 29 checks green (typecheck, privacy, conversation-parser eval).
  • conversation-parser + back-compat: 135 pass. orphan-area (incl. doctor-orphan-ratio + orphan-reduction E2E): 73 pass. engine-parity on real Postgres (scalar + federated parity): 11 pass.

🤖 Generated with Claude Code

garrytan and others added 3 commits May 29, 2026 02:18
…Granola/Zoom, no timestamp)

The 14th built-in pattern parses `**Speaker:** text` transcripts with NO
per-line timestamp — the shape Circleback / Granola / Zoom emit. Every prior
builtin required a time anchor, so this shape matched nothing: a production
brain had 104 conversation pages + 3,423 eligible pages silently extracting
zero facts. Messages anchor at T00:00:00Z of the frontmatter date (no
fabricated wall-clock; line order preserves sequence), same convention as
irc-classic.

Hardening beyond the original community proposal:
- regex `/^\*\*(?!\[)(.+?):\*\*\s*(.*)$/`: the colon-inside-bold (NOT
  declaration order) is what prevents shadowing bold-paren-time; the `(?!\[)`
  lookahead rejects telegram-bracket `**[18:37] Name:**` so disabling
  telegram-bracket yields an honest no_match instead of speaker="[18:37] Name".
- new optional PatternEntry.score_full_body: `**Label:** text` is a common
  prose idiom, so a notes page with bold labels clustered in its first 10
  lines scored 0.3 on the head pass (NOT < SCORING_HEAD_TRIGGER_THRESHOLD, so
  the full-body fallback never fired) and cleared the 0.05 floor. parse.ts now
  recomputes the winner's score over the full body before the floor, so such a
  page drops to its true low density and stays no_match.
- scrubbed pre-existing real names from bold-paren-time test_positive samples
  (privacy rule).
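
The score_full_body gap described above can be sketched as follows. This is a simplified model, not parse.ts: it assumes the score is the fraction of lines that match, assumes a value of 0.2 for SCORING_HEAD_TRIGGER_THRESHOLD (the commit message only implies it is at most 0.3), and assumes the floor test is `score >= 0.05`. The function names are hypothetical.

```typescript
const HEAD_LINES = 10;                      // head-pass size named in the commit message
const SCORE_FLOOR = 0.05;                   // floor named in the commit message
const SCORING_HEAD_TRIGGER_THRESHOLD = 0.2; // ASSUMED value, for illustration only

interface PatternEntry {
  id: string;
  regex: RegExp;
  score_full_body?: boolean;
}

// Assumed scoring rule: share of lines that match the pattern.
function density(lines: string[], regex: RegExp): number {
  if (lines.length === 0) return 0;
  return lines.filter((l) => regex.test(l)).length / lines.length;
}

function bestOf(lines: string[], patterns: PatternEntry[]) {
  let winner: PatternEntry | undefined;
  let score = 0;
  for (const p of patterns) {
    const s = density(lines, p.regex);
    if (s > score) {
      score = s;
      winner = p;
    }
  }
  return { winner, score };
}

function pickPattern(body: string, patterns: PatternEntry[]): string {
  const lines = body.split("\n");
  let { winner, score } = bestOf(lines.slice(0, HEAD_LINES), patterns);

  if (score < SCORING_HEAD_TRIGGER_THRESHOLD) {
    // Existing fallback: a weak head pass triggers a full-body rescore.
    ({ winner, score } = bestOf(lines, patterns));
  } else if (winner?.score_full_body) {
    // New step: a strong head score is re-checked against the whole page.
    score = density(lines, winner.regex);
  }
  return winner !== undefined && score >= SCORE_FLOOR ? winner.id : "no_match";
}

const regex = /^\*\*(?!\[)(.+?):\*\*\s*(.*)$/;
// Notes page: 3 bold labels in the first 10 lines, then 90 lines of prose.
const notesPage = [
  "**Goal:** ship", "**Owner:** team", "**Status:** open",
  ...Array.from({ length: 97 }, (_, i) => `prose line ${i}`),
].join("\n");

const withFlag = pickPattern(notesPage, [{ id: "bold-name-no-time", regex, score_full_body: true }]);
const withoutFlag = pickPattern(notesPage, [{ id: "bold-name-no-time", regex }]);
```

In this model the head pass scores 3/10, the full body scores 3/100, and only the flagged entry drops below the floor.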

Fixtures use placeholder names only. Pinned by new bold-name-no-time +
clustered-head no_match cases in parse.test.ts and the eval corpus.

Co-Authored-By: garrytan-agents <noreply@github.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…linkable denominator

`gbrain doctor --source <id>` and `gbrain orphans --source <id>` now scope
the orphan scan to that source instead of reporting brain-wide. Three fixes:

- findOrphanPages(opts?: { sourceId?, sourceIds? }) on both engines scopes the
  CANDIDATE set (scalar `= $1` or federated `= ANY($1::text[])`). Inbound links
  from ANY source still count, so a page in source X linked FROM source Y is
  reachable and NOT an orphan of X (the deliberate, less-surprising definition).
- corrected the total_linkable denominator in findOrphans: it now enumerates
  all live pages (scoped) and subtracts every excluded-by-slug page, not just
  excluded orphans. The old `total - excludedOrphans` left excluded NON-orphan
  pages (templates/, scratch/) with inbound links in the denominator, inflating
  it and suppressing warnings. Changes orphan_ratio output for every brain, in
  the accurate direction.
- the find_orphans MCP op threads sourceScopeOpts(ctx), closing a cross-source
  read leak where a source-bound OAuth client saw brain-wide orphans (v0.34.1
  source-isolation class).
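
A worked example of the denominator change, as a sketch. The two excluded prefixes (templates/, scratch/) and the old formula `total - excludedOrphans` come from the commit message; the `PageRow` shape, the sample slugs, and the assumption that the numerator counts only non-excluded orphans are illustrative.

```typescript
// Hypothetical row shape: a live page and whether anything links to it.
interface PageRow {
  slug: string;
  hasInbound: boolean;
}

const EXCLUDED_PREFIXES = ["templates/", "scratch/"];
const isExcluded = (slug: string): boolean =>
  EXCLUDED_PREFIXES.some((prefix) => slug.startsWith(prefix));

// Old behavior: only excluded pages that were also orphans were subtracted.
function oldDenominator(rows: PageRow[]): number {
  const excludedOrphans = rows.filter((r) => isExcluded(r.slug) && !r.hasInbound).length;
  return rows.length - excludedOrphans;
}

// New behavior: every excluded-by-slug page is subtracted, linked or not.
function newDenominator(rows: PageRow[]): number {
  return rows.filter((r) => !isExcluded(r.slug)).length;
}

// Assumed numerator: orphans among the non-excluded pages.
function orphanCount(rows: PageRow[]): number {
  return rows.filter((r) => !isExcluded(r.slug) && !r.hasInbound).length;
}

const sample: PageRow[] = [
  { slug: "templates/a", hasInbound: true },
  { slug: "templates/b", hasInbound: true },
  { slug: "scratch/c", hasInbound: true },
  { slug: "scratch/d", hasInbound: false },
  { slug: "notes/e", hasInbound: false },
  { slug: "notes/f", hasInbound: false },
  { slug: "notes/g", hasInbound: false },
  { slug: "notes/h", hasInbound: true },
  { slug: "notes/i", hasInbound: true },
  { slug: "notes/j", hasInbound: true },
];

const oldRatio = orphanCount(sample) / oldDenominator(sample); // 3 / 9
const newRatio = orphanCount(sample) / newDenominator(sample); // 3 / 6
```

On this sample the three linked template and scratch pages stay in the old denominator, so the old ratio is lower than the corrected one, which is how the old formula could suppress a warning.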

doctor uses an explicit `--source` flag parse (NOT resolveSourceWithTier, which
would scope bare invocations to a default), and under explicit --source reports
the ratio with a low-scale caveat below 100 entity pages instead of a vacuous
"ok". Thin-client doctor --source orphan_ratio deferred (TODOS.md).

Pinned by test/orphans-source-scope.test.ts (PGLite: scoping, cross-source
inbound, denominator, find_orphans op scope) + a Postgres↔PGLite parity case
in test/e2e/engine-parity.test.ts (scalar + federated binding).
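
The scalar and federated binding shapes that the parity case exercises could be produced by a helper along these lines. This is a hedged sketch: the function name `scopeClause`, the column name `p.source_id`, and the `TRUE` fallback for unscoped calls are assumptions; only the two predicate forms (`= $1` and `= ANY($1::text[])`) come from the commit message.

```typescript
interface ScopeOpts {
  sourceId?: string;
  sourceIds?: string[];
}

interface ScopeClause {
  sql: string;       // predicate fragment to splice into a WHERE clause
  params: unknown[]; // positional parameters, bound as $1
}

// Hypothetical helper: column name is an assumption, not the real schema.
function scopeClause(opts: ScopeOpts = {}, column = "p.source_id"): ScopeClause {
  if (opts.sourceIds !== undefined) {
    // Federated: one array parameter matched with ANY.
    return { sql: `${column} = ANY($1::text[])`, params: [opts.sourceIds] };
  }
  if (opts.sourceId !== undefined) {
    // Scalar: one text parameter matched with equality.
    return { sql: `${column} = $1`, params: [opts.sourceId] };
  }
  // Unscoped: no parameters, predicate is always true.
  return { sql: "TRUE", params: [] };
}

const scalar = scopeClause({ sourceId: "X" });
const federated = scopeClause({ sourceIds: ["X", "Y"] });
const unscoped = scopeClause();
```

Both scoped forms bind a single `$1`, so the surrounding query text stays the same apart from this predicate.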

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
VERSION + package.json → 0.41.29.0; CHANGELOG entry; CLAUDE.md conversation-parser
(13→14 patterns) + orphans source-scoping notes; regenerated llms bundles; TODOS
for thin-client doctor --source + check-test-real-names widening.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@garrytan merged commit 041d89b into master on May 29, 2026
20 checks passed
mgunnin added a commit to mgunnin/gbrain that referenced this pull request Jun 3, 2026
* upstream/master:
  v0.41.29.0 feat(conversation-parser): bold-name-no-time builtin + fix(orphans): source-scoped orphan_ratio (supersedes garrytan#1613) (garrytan#1620)
  v0.41.27.0 fix: withRetry self-heals on null singleton + facts:absorb drain + disconnect audit (closes garrytan#1570) (garrytan#1608)
  v0.41.27.0 fix(doctor): git-aware sync_freshness (supersedes garrytan#1564) (garrytan#1573)