fix(conversation-parser): add bold-name-no-timestamp builtin (Circleback/Granola/Zoom) by garrytan-agents · Pull Request #1613 · garrytan/gbrain

garrytan-agents · 2026-05-29T03:49:58Z

Problem

gbrain's conversation parser ships 13 builtin patterns in src/core/conversation-parser/builtins.ts — every one requires a per-line timestamp (an inline date or a time anchor like (HH:MM) / [HH:MM]).

Modern meeting-transcription tools — Circleback, Granola, Zoom — emit transcripts as **Speaker Name:** message text with no per-line timestamp:

**Garry Tan:** Okay, start on. And then weirdly like zoom doesn't...
**Participant 2:** he tried to reset it remotely the other night. Let me ask him.
**Garry Tan:** I mean it's really just like we need to get zoom to fix this.
**Participant 2:** Okay, let me.

With no time anchor, this shape matches no builtin. A single accidental anchor on a long page scores ~0.002, well below the SCORING_MIN_ACCEPTANCE gate of 0.05 in parse.ts, so the page parses to 0 messages and extracts zero conversation-facts.
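The arithmetic behind that ~0.002 figure can be sketched as follows. This is a minimal illustration that assumes the score behaves like a match density (matching lines divided by non-empty lines); the actual formula in parse.ts is not shown in this PR, and `densityScore` and `timeAnchored` are hypothetical stand-ins rather than names from the codebase.

```typescript
// Threshold value quoted in the PR description.
const SCORING_MIN_ACCEPTANCE = 0.05;

// Hypothetical stand-in for a timestamped builtin: needs a "(HH:MM)" anchor.
const timeAnchored = /^\*\*(.+?)\*\*\s*\((\d{1,2}:\d{2})\):\s*(.*)$/;

// Assumption: score = matching lines / non-empty lines.
function densityScore(body: string, pattern: RegExp): number {
  const lines = body.split("\n").filter((line) => line.trim().length > 0);
  if (lines.length === 0) return 0;
  const matched = lines.filter((line) => pattern.test(line)).length;
  return matched / lines.length;
}

// 499 untimestamped transcript lines plus one accidental time anchor.
const transcript = [
  ...Array.from({ length: 499 }, (_, i) => `**Participant ${(i % 2) + 1}:** line ${i}`),
  "**Participant 1** (12:30): back from lunch",
].join("\n");

const score = densityScore(transcript, timeAnchored); // 1 match out of 500 lines
const accepted = score >= SCORING_MIN_ACCEPTANCE;
console.log({ score, accepted });
```

Under this assumption, one stray match in 500 lines lands far below the 0.05 gate, so the page is rejected even though it is obviously a conversation.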

Real-world impact

In a production brain, the coverage scan reported conversation_format_coverage: 100% _no_match — 104/104 conversation pages plus 3,423 eligible pages matched no builtin pattern and silently yielded nothing.

The fix

New additive builtin bold-name-no-time:

regex /^\*\*(.+?):\*\*\s*(.*)$/ — captures the speaker inside the bold markers (**Name:**), message text in group 2.
No time capture. date_source: 'frontmatter' with hour_group undefined routes through parse.ts's existing no-time branch (the same convention as irc-classic), anchoring every message at 00:00:00 of the page's frontmatter date. No wall-clock time is fabricated; intra-day ordering is preserved by line order.
Declared after the timestamped bold patterns (imessage-slack, telegram-bracket, bold-paren-time) so it can never shadow them. The scorer tie-breaks on declaration index, and the regex requires the colon inside the bold markers, so the **Name** (time): paren-time shape (colon outside) still matches bold-paren-time, not this pattern.
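To make the shadowing argument concrete, here is a small sketch that runs the regex quoted above against the three bold shapes mentioned in this description. Only the regex comes from the PR; `matchBoldName`, `SpeakerLine`, and the sample lines are illustrative assumptions.

```typescript
// Regex as quoted in the PR description.
const boldNameNoTime = /^\*\*(.+?):\*\*\s*(.*)$/;

// Hypothetical result shape, for illustration only.
interface SpeakerLine {
  speaker: string;
  text: string;
}

function matchBoldName(line: string): SpeakerLine | null {
  const m = boldNameNoTime.exec(line);
  return m ? { speaker: m[1], text: m[2] } : null;
}

// Colon inside the bold markers: the target shape.
const plain = matchBoldName("**Participant 2:** Okay, let me.");

// Colon outside the bold markers (bold-paren-time shape): no match.
const parenTime = matchBoldName("**Participant 1** (12:30): hello");

// Bracketed time inside the bold markers (telegram-bracket shape):
// the colon is inside the markers, so this regex also matches it.
const bracket = matchBoldName("**[18:37] Participant 1:** hello");

console.log(plain, parenTime, bracket);
```

The paren-time line is rejected by the regex itself, while the bracket line is only kept away from this pattern by declaration order, which is why the ordering matters.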

No existing pattern is modified.

Tests

Added a dedicated bold-name-no-time test block in test/conversation-parser/parse.test.ts:

parses the real 4-line **Speaker:** text sample → 4 messages with correct speaker/text, anchored at 00:00:00 of the frontmatter date;
scores above the 0.05 acceptance floor on a pure bold-name transcript (epoch-default date when none supplied);
regression: **Garry** (HH:MM): text still matches bold-paren-time (not the new pattern);
regression: a pure telegram-bracket transcript still matches telegram-bracket (wins the score tie via lower declaration index).
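The first two cases can be pictured with a standalone sketch of the no-time anchoring. It is not the project's parser: `parseNoTime` and the `Message` shape are hypothetical, the sample uses placeholder speakers, and the date is invented for the example. It only assumes what the description states, namely that every message is stamped at 00:00:00 of the frontmatter date (epoch when none is supplied) and that sequence comes from line order.

```typescript
// Hypothetical message shape; the real types in parse.ts are not shown here.
interface Message {
  speaker: string;
  text: string;
  timestamp: string;
  index: number; // position in the transcript, preserves intra-day order
}

const pattern = /^\*\*(.+?):\*\*\s*(.*)$/;

// Defaults to the epoch date when no frontmatter date is supplied.
function parseNoTime(body: string, frontmatterDate: string = "1970-01-01"): Message[] {
  const anchor = `${frontmatterDate}T00:00:00Z`; // no wall-clock time is invented
  const messages: Message[] = [];
  for (const line of body.split("\n")) {
    const m = pattern.exec(line);
    if (!m) continue; // non-matching lines are skipped in this sketch
    messages.push({ speaker: m[1], text: m[2], timestamp: anchor, index: messages.length });
  }
  return messages;
}

const sample = [
  "**Participant 1:** Okay, start on.",
  "**Participant 2:** he tried to reset it remotely the other night.",
  "**Participant 1:** we need to get this fixed.",
  "**Participant 2:** Okay, let me.",
].join("\n");

const messages = parseNoTime(sample, "2026-05-28");
console.log(messages.map((msg) => `${msg.index} ${msg.timestamp} ${msg.speaker}`));
```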

Updated the existing disabledBuiltinIds test: disabling telegram-bracket now correctly falls through to bold-name-no-time (the bracket line's colon is inside the bold markers); disabling both yields the genuine no_match.

Full suite result:

bun test test/conversation-parser/ test/conversation-parser-cli.test.ts test/extract-conversation-facts.test.ts
131 pass / 134

The 3 failures are pre-existing on master and environment-driven. The test "probeLlmAvailability returns null when ANTHROPIC_API_KEY is unset" and two sibling "provider unavailable" fail-open tests all assume the key is absent, so they fail identically on clean master in this sandbox, where an ANTHROPIC_API_KEY is present in the environment. They are unrelated to this change. eval-conversation-parser-cli and extract-conversation-facts-workers are fully green (32/32). All 14 builtins (13 existing + new) pass their module-load validatePatternEntry checks and their test_positive parse loop.

Separately noticed (not in this PR)

doctor --source <id> does not scope the orphan_ratio check. In src/commands/doctor.ts (~line 4559), getOrphansData(engine, { includePseudo: false }) takes no source filter, so --source default still reports a brain-wide orphan ratio across all federated sources. Suggested follow-up: thread sourceId into getOrphansData. (Note only — intentionally not fixed here.)

…ack/Granola/Zoom)

Adds an additive 'bold-name-no-time' builtin pattern matching the `**Speaker:** text` shape that modern meeting-transcription tools (Circleback, Granola, Zoom) emit with NO per-line timestamp. All 13 prior builtins require a time anchor, so this shape scored ~0.002 (below the 0.05 SCORING_MIN_ACCEPTANCE gate) and parsed to zero messages -> zero conversation-facts.

The new pattern captures the speaker inside the bold markers, has no time capture, and uses date_source='frontmatter' with hour_group undefined so parse.ts's existing no-time branch (same convention as irc-classic) anchors every message at 00:00:00 of the frontmatter date. No wall-clock time is fabricated; intra-day ordering preserved by line order.

Declared AFTER imessage-slack/telegram-bracket/bold-paren-time so it can never shadow them (scorer tie-breaks on declaration index; its regex requires the colon INSIDE the bold markers so it cannot eat the `**Name** (time):` paren-time shape).

Adds tests + updates the disabledBuiltinIds test for the new fall-through. Full conversation-parser suite passing; the only failures are pre-existing env-driven LLM-availability tests (ANTHROPIC_API_KEY) unrelated to this change.

garrytan · 2026-05-29T09:19:04Z

Superseded by #1620 (landing as v0.41.29.0). Thank you for this — the bold-name-no-time pattern is the real unlock for the 104 conversation pages + 3,423 eligible pages that were silently extracting nothing.

Re-homed into a base-repo branch because fork PRs from this account don't get CI secrets. Folded in with these changes during the review (/plan-eng-review + /codex outside-voice):

Privacy scrub — test_positive samples used real names (Garry Tan, Alex Graveley), which the project privacy rule forbids in checked-in code; replaced with placeholders. Also scrubbed the same names that were already in bold-paren-time.
score_full_body guard — the regex matches any **Label:** text (a common prose idiom). Codex verified the 0.05 floor does NOT protect a notes page with bold labels clustered in its first 10 lines (head score 0.3 skips the rescore). Added a full-body acceptance recompute so such a page stays no_match instead of mis-parsing as a conversation.
(?!\[) lookahead — so disabling telegram-bracket yields an honest no_match instead of capturing speaker="[18:37] Name". The disabledBuiltinIds test stays at no_match.
Added an eval fixture + clustered-head adversarial regression.
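The two parser guards above can be illustrated with a short sketch. It assumes the score is a simple match density and that the head pass looks at the first 10 lines, as the review note suggests; `density`, `HEAD_LINES`, and the sample notes page are hypothetical, and the real recompute in parse.ts may differ in detail.

```typescript
// Pattern with the negative lookahead that rejects a leading "[" after the "**".
const hardened = /^\*\*(?!\[)(.+?):\*\*\s*(.*)$/;

const SCORING_MIN_ACCEPTANCE = 0.05; // floor quoted in the PR
const HEAD_LINES = 10; // assumption: head pass covers the first 10 lines

// Assumption: score = matching lines / total lines.
function density(lines: string[], pattern: RegExp): number {
  if (lines.length === 0) return 0;
  return lines.filter((line) => pattern.test(line)).length / lines.length;
}

// Guard 1: a telegram-bracket line no longer matches, so no bracketed "speaker".
const bracketMatches = hardened.test("**[18:37] Participant 1:** hello");

// Guard 2: a notes page with three bold labels up top, then 97 prose lines.
const notesPage = [
  "**Goal:** ship the parser fix",
  "**Owner:** Participant 1",
  "**Status:** in review",
  ...Array.from({ length: 97 }, (_, i) => `Plain prose line ${i}.`),
];

const headScore = density(notesPage.slice(0, HEAD_LINES), hardened); // 3 of 10
const fullScore = density(notesPage, hardened); // 3 of 100
const acceptedOnHead = headScore >= SCORING_MIN_ACCEPTANCE;
const acceptedOnFullBody = fullScore >= SCORING_MIN_ACCEPTANCE;

console.log({ bracketMatches, headScore, fullScore, acceptedOnHead, acceptedOnFullBody });
```

In this sketch the head score alone would clear the floor, while the full-body score does not, which is the mis-parse the recompute is meant to prevent.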

Your original bold-name-no-time pattern is preserved with Co-Authored-By on the parser commit. Bundled an unrelated orphan_ratio --source scoping fix into the same release. Closing in favor of #1620.

…(orphans): source-scoped orphan_ratio (supersedes #1613) (#1620)

* feat(conversation-parser): add bold-name-no-time builtin (Circleback/Granola/Zoom, no timestamp)

The 14th built-in pattern parses `**Speaker:** text` transcripts with NO per-line timestamp, the shape Circleback / Granola / Zoom emit. Every prior builtin required a time anchor, so this shape matched nothing: a production brain had 104 conversation pages + 3,423 eligible pages silently extracting zero facts. Messages anchor at T00:00:00Z of the frontmatter date (no fabricated wall-clock; line order preserves sequence), same convention as irc-classic.

Hardening beyond the original community proposal:

- regex `/^\*\*(?!\[)(.+?):\*\*\s*(.*)$/`: the colon-inside-bold (NOT declaration order) is what prevents shadowing bold-paren-time; the `(?!\[)` lookahead rejects telegram-bracket `**[18:37] Name:**` so disabling telegram-bracket yields an honest no_match instead of speaker="[18:37] Name".
- new optional PatternEntry.score_full_body: `**Label:** text` is a common prose idiom, so a notes page with bold labels clustered in its first 10 lines scored 0.3 on the head pass (NOT < SCORING_HEAD_TRIGGER_THRESHOLD, so the full-body fallback never fired) and cleared the 0.05 floor. parse.ts now recomputes the winner's score over the full body before the floor, so such a page drops to its true low density and stays no_match.
- scrubbed pre-existing real names from bold-paren-time test_positive samples (privacy rule). Fixtures use placeholder names only.

Pinned by new bold-name-no-time + clustered-head no_match cases in parse.test.ts and the eval corpus.

Co-Authored-By: garrytan-agents <noreply@github.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(orphans): scope orphan_ratio + find_orphans by source; fix total_linkable denominator

`gbrain doctor --source <id>` and `gbrain orphans --source <id>` now scope the orphan scan to that source instead of reporting brain-wide.

Three fixes:

- findOrphanPages(opts?: { sourceId?, sourceIds? }) on both engines scopes the CANDIDATE set (scalar `= $1` or federated `= ANY($1::text[])`). Inbound links from ANY source still count, so a page in source X linked FROM source Y is reachable and NOT an orphan of X (the deliberate, less-surprising definition).
- corrected the total_linkable denominator in findOrphans: it now enumerates all live pages (scoped) and subtracts every excluded-by-slug page, not just excluded orphans. The old `total - excludedOrphans` left excluded NON-orphan pages (templates/, scratch/) with inbound links in the denominator, inflating it and suppressing warnings. Changes orphan_ratio output for every brain, in the accurate direction.
- the find_orphans MCP op threads sourceScopeOpts(ctx), closing a cross-source read leak where a source-bound OAuth client saw brain-wide orphans (v0.34.1 source-isolation class).

doctor uses an explicit `--source` flag parse (NOT resolveSourceWithTier, which would scope bare invocations to a default), and under explicit --source reports the ratio with a low-scale caveat below 100 entity pages instead of a vacuous "ok". Thin-client doctor --source orphan_ratio deferred (TODOS.md).

Pinned by test/orphans-source-scope.test.ts (PGLite: scoping, cross-source inbound, denominator, find_orphans op scope) + a Postgres↔PGLite parity case in test/e2e/engine-parity.test.ts (scalar + federated binding).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs: v0.41.29.0 - bold-name-no-time + orphan source scoping

VERSION + package.json → 0.41.29.0; CHANGELOG entry; CLAUDE.md conversation-parser (13→14 patterns) + orphans source-scoping notes; regenerated llms bundles; TODOS for thin-client doctor --source + check-test-real-names widening.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: garrytan-agents <noreply@github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
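The orphan scoping rule and the denominator correction described in the fix(orphans) commit can be illustrated with an in-memory sketch. Everything here is hypothetical (the `Page` and `Link` shapes, `orphanStats`, the sample slugs); the real implementation runs as SQL on both engines. It also assumes orphan_ratio is reportable orphans divided by total_linkable, which the commit message implies but does not spell out.

```typescript
interface Page { slug: string; sourceId: string }
interface Link { from: string; to: string } // slugs

// Excluded-by-slug prefixes, taken from the examples in the commit message.
const EXCLUDED_PREFIXES = ["templates/", "scratch/"];
const isExcluded = (slug: string): boolean =>
  EXCLUDED_PREFIXES.some((prefix) => slug.startsWith(prefix));

function orphanStats(pages: Page[], links: Link[], sourceId?: string) {
  // Inbound links count no matter which source they originate from.
  const linkedTo = new Set(links.map((link) => link.to));
  // Only the candidate set is scoped to the requested source.
  const candidates = pages.filter((p) => sourceId === undefined || p.sourceId === sourceId);
  const orphans = candidates.filter((p) => !linkedTo.has(p.slug));
  const reportable = orphans.filter((p) => !isExcluded(p.slug));
  const excludedOrphans = orphans.length - reportable.length;
  const excludedAll = candidates.filter((p) => isExcluded(p.slug)).length;
  const oldTotalLinkable = candidates.length - excludedOrphans; // previous denominator
  const newTotalLinkable = candidates.length - excludedAll; // corrected denominator
  return {
    orphans: reportable.map((p) => p.slug),
    oldRatio: reportable.length / oldTotalLinkable,
    newRatio: reportable.length / newTotalLinkable,
  };
}

const pages: Page[] = [
  { slug: "x/a", sourceId: "X" }, // linked from source Y, so reachable
  { slug: "x/c", sourceId: "X" }, // no inbound links: an orphan of X
  { slug: "templates/note", sourceId: "X" }, // excluded, but has an inbound link
  { slug: "scratch/tmp", sourceId: "X" }, // excluded and orphaned
  { slug: "y/b", sourceId: "Y" },
];
const links: Link[] = [
  { from: "y/b", to: "x/a" },
  { from: "x/a", to: "templates/note" },
];

const stats = orphanStats(pages, links, "X");
console.log(stats);
```

With this data the old denominator still counts templates/note, giving a ratio of 1/3, while the corrected one gives 1/2, which matches the "inflated denominator suppresses warnings" description.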

* upstream/master:
  v0.41.29.0 feat(conversation-parser): bold-name-no-time builtin + fix(orphans): source-scoped orphan_ratio (supersedes garrytan#1613) (garrytan#1620)
  v0.41.27.0 fix: withRetry self-heals on null singleton + facts:absorb drain + disconnect audit (closes garrytan#1570) (garrytan#1608)
  v0.41.27.0 fix(doctor): git-aware sync_freshness (supersedes garrytan#1564) (garrytan#1573)

garrytan mentioned this pull request May 29, 2026

v0.41.29.0 feat(conversation-parser): bold-name-no-time builtin + fix(orphans): source-scoped orphan_ratio (supersedes #1613) #1620

Merged

garrytan closed this May 29, 2026

garrytan-agents wants to merge 1 commit into garrytan:master from garrytan-agents:fix/conversation-parser-bold-name-no-timestamp
