fix(conversation-parser): add bold-name-no-timestamp builtin (Circleback/Granola/Zoom)#1613
Closed
garrytan-agents wants to merge 1 commit into
Conversation
…ack/Granola/Zoom) Adds an additive 'bold-name-no-time' builtin pattern matching the `**Speaker:** text` shape that modern meeting-transcription tools (Circleback, Granola, Zoom) emit with NO per-line timestamp. All 13 prior builtins require a time anchor, so this shape scored ~0.002 (below the 0.05 SCORING_MIN_ACCEPTANCE gate) and parsed to zero messages -> zero conversation-facts. The new pattern captures the speaker inside the bold markers, has no time capture, and uses date_source='frontmatter' with hour_group undefined so parse.ts's existing no-time branch (same convention as irc-classic) anchors every message at 00:00:00 of the frontmatter date. No wall-clock time is fabricated; intra-day ordering preserved by line order. Declared AFTER imessage-slack/telegram-bracket/bold-paren-time so it can never shadow them (scorer tie-breaks on declaration index; its regex requires the colon INSIDE the bold markers so it cannot eat the `**Name** (time):` paren-time shape). Adds tests + updates the disabledBuiltinIds test for the new fall-through. Full conversation-parser suite passing; the only failures are pre-existing env-driven LLM-availability tests (ANTHROPIC_API_KEY) unrelated to this change.
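The scoring claim above (one stray time anchor on a long page lands near 0.002 and fails the 0.05 gate) can be illustrated with a minimal sketch. It assumes the score is simply the fraction of non-empty lines a pattern matches; the real scorer in `parse.ts` may weigh lines differently. `densityScore` and the `timeAnchored` regex are hypothetical stand-ins, and only the `0.05` floor is taken from the description.

```typescript
// Assumption: score = matched lines / non-empty lines. Only the 0.05 floor
// (SCORING_MIN_ACCEPTANCE) comes from the PR text; everything else is illustrative.
const SCORING_MIN_ACCEPTANCE = 0.05;

function densityScore(body: string, pattern: RegExp): number {
  const lines = body.split("\n").filter((line) => line.trim() !== "");
  if (lines.length === 0) return 0;
  const matched = lines.filter((line) => pattern.test(line)).length;
  return matched / lines.length;
}

// Hypothetical time-anchored pattern in the spirit of bold-paren-time:
// **Name** (HH:MM): text
const timeAnchored = /^\*\*(.+?)\*\*\s*\((\d{1,2}:\d{2})\):\s*(.*)$/;

// 499 untimestamped `**Speaker:** text` lines plus one line that happens
// to carry a time anchor.
const transcript = [
  ...Array.from({ length: 499 }, (_, i) => `**Speaker ${i % 2}:** line ${i}`),
  "**Speaker 0** (10:30): the one accidental anchor",
].join("\n");

const score = densityScore(transcript, timeAnchored); // 1 / 500
const accepted = score >= SCORING_MIN_ACCEPTANCE;

console.log({ score, accepted });
```

Under that assumption, a 500-line transcript with one accidental match scores 1/500 and is rejected, which is consistent with the zero-message outcome described.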
Owner
Superseded by #1620 (landing as v0.41.29.0). Thank you for this — the Re-homed into a base-repo branch because fork PRs from this account don't get CI secrets. Folded in with these changes during the review (
Your original |
garrytan
added a commit
that referenced
this pull request
May 29, 2026
…(orphans): source-scoped orphan_ratio (supersedes #1613) (#1620)

* feat(conversation-parser): add bold-name-no-time builtin (Circleback/Granola/Zoom, no timestamp)

The 14th built-in pattern parses `**Speaker:** text` transcripts with NO per-line timestamp, the shape Circleback / Granola / Zoom emit. Every prior builtin required a time anchor, so this shape matched nothing: a production brain had 104 conversation pages + 3,423 eligible pages silently extracting zero facts. Messages anchor at T00:00:00Z of the frontmatter date (no fabricated wall-clock; line order preserves sequence), same convention as irc-classic.

Hardening beyond the original community proposal:
- regex `/^\*\*(?!\[)(.+?):\*\*\s*(.*)$/`: the colon-inside-bold (NOT declaration order) is what prevents shadowing bold-paren-time; the `(?!\[)` lookahead rejects telegram-bracket `**[18:37] Name:**` so disabling telegram-bracket yields an honest no_match instead of speaker="[18:37] Name".
- new optional PatternEntry.score_full_body: `**Label:** text` is a common prose idiom, so a notes page with bold labels clustered in its first 10 lines scored 0.3 on the head pass (NOT < SCORING_HEAD_TRIGGER_THRESHOLD, so the full-body fallback never fired) and cleared the 0.05 floor. parse.ts now recomputes the winner's score over the full body before the floor, so such a page drops to its true low density and stays no_match.
- scrubbed pre-existing real names from bold-paren-time test_positive samples (privacy rule). Fixtures use placeholder names only.

Pinned by new bold-name-no-time + clustered-head no_match cases in parse.test.ts and the eval corpus.

Co-Authored-By: garrytan-agents <noreply@github.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(orphans): scope orphan_ratio + find_orphans by source; fix total_linkable denominator

`gbrain doctor --source <id>` and `gbrain orphans --source <id>` now scope the orphan scan to that source instead of reporting brain-wide.

Three fixes:
- findOrphanPages(opts?: { sourceId?, sourceIds? }) on both engines scopes the CANDIDATE set (scalar `= $1` or federated `= ANY($1::text[])`). Inbound links from ANY source still count, so a page in source X linked FROM source Y is reachable and NOT an orphan of X (the deliberate, less-surprising definition).
- corrected the total_linkable denominator in findOrphans: it now enumerates all live pages (scoped) and subtracts every excluded-by-slug page, not just excluded orphans. The old `total - excludedOrphans` left excluded NON-orphan pages (templates/, scratch/) with inbound links in the denominator, inflating it and suppressing warnings. Changes orphan_ratio output for every brain, in the accurate direction.
- the find_orphans MCP op threads sourceScopeOpts(ctx), closing a cross-source read leak where a source-bound OAuth client saw brain-wide orphans (v0.34.1 source-isolation class).

doctor uses an explicit `--source` flag parse (NOT resolveSourceWithTier, which would scope bare invocations to a default), and under explicit --source reports the ratio with a low-scale caveat below 100 entity pages instead of a vacuous "ok". Thin-client doctor --source orphan_ratio deferred (TODOS.md).

Pinned by test/orphans-source-scope.test.ts (PGLite: scoping, cross-source inbound, denominator, find_orphans op scope) + a Postgres↔PGLite parity case in test/e2e/engine-parity.test.ts (scalar + federated binding).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs: v0.41.29.0 - bold-name-no-time + orphan source scoping

VERSION + package.json → 0.41.29.0; CHANGELOG entry; CLAUDE.md conversation-parser (13→14 patterns) + orphans source-scoping notes; regenerated llms bundles; TODOS for thin-client doctor --source + check-test-real-names widening.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: garrytan-agents <noreply@github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
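The source scoping and the denominator correction described in the orphans commit can be sketched in memory as follows. This is not the engine code: `Page`, `Link`, `isExcluded`, and `orphanStats` are hypothetical names, slugs are assumed unique across sources, and the exclusion rule is assumed to be a simple `templates/` or `scratch/` prefix check.

```typescript
// Hypothetical in-memory model; the real implementation runs in SQL on both engines.
interface Page { slug: string; sourceId: string; }
interface Link { fromSlug: string; toSlug: string; }

// Assumption: excluded-by-slug means a templates/ or scratch/ prefix.
const isExcluded = (slug: string): boolean =>
  slug.startsWith("templates/") || slug.startsWith("scratch/");

function orphanStats(pages: Page[], links: Link[], sourceId?: string) {
  // Scope only the CANDIDATE set to the requested source.
  const candidates = pages.filter((p) => sourceId === undefined || p.sourceId === sourceId);
  // Inbound links from ANY source still make a page reachable.
  const linked = new Set(links.map((l) => l.toSlug));

  const included = candidates.filter((p) => !isExcluded(p.slug));
  const orphans = included.filter((p) => !linked.has(p.slug)).map((p) => p.slug);

  // Corrected denominator: drop every excluded page, orphan or not.
  const totalLinkable = included.length;
  // Old denominator: only excluded orphans were subtracted.
  const excludedOrphans = candidates.filter(
    (p) => isExcluded(p.slug) && !linked.has(p.slug),
  ).length;
  const oldDenominator = candidates.length - excludedOrphans;

  const ratio = totalLinkable === 0 ? 0 : orphans.length / totalLinkable;
  return { orphans, totalLinkable, oldDenominator, ratio };
}

const pages: Page[] = [
  { slug: "x/a", sourceId: "x" },          // linked from source y, so reachable
  { slug: "x/c", sourceId: "x" },          // no inbound links: orphan
  { slug: "templates/t1", sourceId: "x" }, // excluded, has an inbound link
  { slug: "scratch/s1", sourceId: "x" },   // excluded, no inbound links
  { slug: "y/b", sourceId: "y" },
];
const links: Link[] = [
  { fromSlug: "y/b", toSlug: "x/a" },
  { fromSlug: "x/a", toSlug: "templates/t1" },
];

const scoped = orphanStats(pages, links, "x");
console.log(scoped);
```

In this toy data the old formula keeps `templates/t1` in the denominator (3 instead of 2), so the ratio for source `x` would read 1/3 rather than 1/2, which is the inflation the commit message describes.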
mgunnin
added a commit
to mgunnin/gbrain
that referenced
this pull request
Jun 3, 2026
* upstream/master: v0.41.29.0 feat(conversation-parser): bold-name-no-time builtin + fix(orphans): source-scoped orphan_ratio (supersedes garrytan#1613) (garrytan#1620) v0.41.27.0 fix: withRetry self-heals on null singleton + facts:absorb drain + disconnect audit (closes garrytan#1570) (garrytan#1608) v0.41.27.0 fix(doctor): git-aware sync_freshness (supersedes garrytan#1564) (garrytan#1573)
Problem
gbrain's conversation parser ships 13 builtin patterns in `src/core/conversation-parser/builtins.ts`, and every one requires a per-line timestamp (an inline date or a time anchor like `(HH:MM)` / `[HH:MM]`).

Modern meeting-transcription tools (Circleback, Granola, Zoom) emit transcripts as `**Speaker Name:** message text` with no per-line timestamp.

With no time anchor, this shape matches no builtin. A single accidental anchor on a long page scores ~`0.002`, well below the `SCORING_MIN_ACCEPTANCE` gate of `0.05` in `parse.ts`, so the page parses to 0 messages and extracts zero conversation-facts.

Real-world impact
In a production brain, the coverage scan reported `conversation_format_coverage: 100% _no_match`: 104/104 conversation pages plus 3,423 eligible pages matched no builtin pattern and silently yielded nothing.

The fix
New additive builtin `bold-name-no-time`:
- `/^\*\*(.+?):\*\*\s*(.*)$/` captures the speaker inside the bold markers (`**Name:**`), with the message text in group 2.
- `date_source: 'frontmatter'` with `hour_group` undefined routes through `parse.ts`'s existing no-time branch (the same convention as `irc-classic`), anchoring every message at `00:00:00` of the page's frontmatter date. No wall-clock time is fabricated; intra-day ordering is preserved by line order.
- Declared after `imessage-slack`, `telegram-bracket`, and `bold-paren-time` so it can never shadow them. The scorer tie-breaks on declaration index, and the regex requires the colon inside the bold markers, so the `**Name** (time):` paren-time shape (colon outside) still matches `bold-paren-time`, not this pattern.

No existing pattern is modified.
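As a rough illustration of the matching behaviour described in this section, the sketch below runs the quoted regex over a few sample lines and stamps each hit with midnight of a frontmatter date. `parseBoldNameLines` and the `Message` shape are hypothetical, not the parser's real API, and the speaker names are placeholders. The second regex is the hardened variant quoted in the superseding commit earlier in this thread; it is included only to show what the `(?!\[)` lookahead changes.

```typescript
// Regex from the PR description: colon INSIDE the bold markers, no time capture.
const BOLD_NAME_NO_TIME = /^\*\*(.+?):\*\*\s*(.*)$/;
// Hardened variant from the superseding commit: rejects a leading "[".
const BOLD_NAME_NO_TIME_HARDENED = /^\*\*(?!\[)(.+?):\*\*\s*(.*)$/;

// Hypothetical message shape for this sketch only.
interface Message { speaker: string; text: string; timestamp: string; }

// Hypothetical helper: every match is anchored at midnight of the frontmatter
// date; line order carries the intra-day sequence.
function parseBoldNameLines(
  body: string,
  frontmatterDate: string,
  pattern: RegExp = BOLD_NAME_NO_TIME,
): Message[] {
  const messages: Message[] = [];
  for (const line of body.split("\n")) {
    const match = pattern.exec(line);
    if (!match) continue;
    const [, speaker = "", text = ""] = match;
    messages.push({ speaker, text, timestamp: `${frontmatterDate}T00:00:00Z` });
  }
  return messages;
}

const sample = [
  "**Alice Example:** Shall we start?",
  "**Bob Example:** Yes, go ahead.",
  "**Alice Example** (10:30): paren-time shape, colon outside the bold",
  "**[18:37] Carol Example:** telegram-bracket shape, colon inside the bold",
].join("\n");

const loose = parseBoldNameLines(sample, "2026-05-29");
const strict = parseBoldNameLines(sample, "2026-05-29", BOLD_NAME_NO_TIME_HARDENED);

console.log(loose.map((m) => m.speaker));  // includes "[18:37] Carol Example"
console.log(strict.map((m) => m.speaker)); // bracket line is skipped
```

The paren-time line is skipped by both regexes because no `:**` sequence appears in it. The bracket line is what separates them: the original regex accepts it with the bracketed time folded into the speaker name, while the hardened one leaves it unmatched.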
Tests
Added a dedicated `bold-name-no-time` test block in `test/conversation-parser/parse.test.ts`:
- a `**Speaker:** text` sample → 4 messages with correct speaker/text, anchored at `00:00:00` of the frontmatter date;
- a pure bold-name transcript clears the `0.05` acceptance floor (epoch-default date when none supplied);
- `**Garry** (HH:MM): text` still matches `bold-paren-time` (not the new pattern);
- a `telegram-bracket` transcript still matches `telegram-bracket` (wins the score tie via lower declaration index).

Updated the existing `disabledBuiltinIds` test: disabling `telegram-bracket` now correctly falls through to `bold-name-no-time` (the bracket line's colon is inside the bold markers); disabling both yields the genuine `no_match`.

Full suite result: the 3 failures are pre-existing on master and env-driven (`probeLlmAvailability returns null when ANTHROPIC_API_KEY is unset` and two sibling "provider unavailable" fail-open tests). They fail identically on clean `master` in this sandbox because an `ANTHROPIC_API_KEY` is present in the environment, and they are unrelated to this change. `eval-conversation-parser-cli` and `extract-conversation-facts-workers` are fully green (32/32). All 14 builtins (13 existing + new) pass their module-load `validatePatternEntry` checks and their `test_positive` parse loop.

Separately noticed (not in this PR)
`doctor --source <id>` does not scope the `orphan_ratio` check. In `src/commands/doctor.ts` (~line 4559), `getOrphansData(engine, { includePseudo: false })` takes no source filter, so `--source default` still reports a brain-wide orphan ratio across all federated sources. Suggested follow-up: thread `sourceId` into `getOrphansData`. (Note only; intentionally not fixed here.)