
fix(conversation-parser): add bold-name-no-timestamp builtin (Circleback/Granola/Zoom) #1613

Closed
garrytan-agents wants to merge 1 commit into garrytan:master from garrytan-agents:fix/conversation-parser-bold-name-no-timestamp

Conversation

@garrytan-agents

Contributor

Problem

gbrain's conversation parser ships 13 builtin patterns in src/core/conversation-parser/builtins.ts, and every one requires a per-line timestamp (an inline date or a time anchor like (HH:MM) / [HH:MM]).

Modern meeting-transcription tools — Circleback, Granola, Zoom — emit transcripts as **Speaker Name:** message text with no per-line timestamp:

**Garry Tan:** Okay, start on. And then weirdly like zoom doesn't...
**Participant 2:** he tried to reset it remotely the other night. Let me ask him.
**Garry Tan:** I mean it's really just like we need to get zoom to fix this.
**Participant 2:** Okay, let me.

With no time anchor, this shape matches no builtin. A single accidental anchor on a long page scores ~0.002, well below the SCORING_MIN_ACCEPTANCE gate of 0.05 in parse.ts, so the page parses to 0 messages and extracts zero conversation-facts.
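To make the arithmetic concrete, here is a minimal sketch that assumes the score is simply the fraction of non-empty lines a pattern matches. That formula, the helper name `densityScore`, and the stand-in `timeAnchor` regex are assumptions for illustration; the real scorer in parse.ts may weigh lines differently. Only the 0.05 floor comes from this PR.

```typescript
// Hypothetical density score: matched lines divided by non-empty lines.
// Only the 0.05 acceptance floor is taken from the PR text.
const MIN_ACCEPTANCE = 0.05;

function densityScore(body: string, pattern: RegExp): number {
  const lines = body.split("\n").filter((l) => l.trim().length > 0);
  if (lines.length === 0) return 0;
  const hits = lines.filter((l) => pattern.test(l)).length;
  return hits / lines.length;
}

// Stand-in for a time anchor such as "(12:30)" or "[12:30]" anywhere in a line.
const timeAnchor = /[(\[]\d{1,2}:\d{2}[)\]]/;

// 499 untimestamped bold-name lines plus one accidental "(12:30)".
const page = [
  ...Array.from({ length: 499 }, (_, i) => `**Speaker ${i % 2}:** line ${i}`),
  "**Speaker 0:** see you at (12:30)",
].join("\n");

const score = densityScore(page, timeAnchor);
console.log(score); // 0.002 (1 hit out of 500 lines)
console.log(score >= MIN_ACCEPTANCE); // false, so the page would be no_match
```

Under this assumed formula, one stray anchor in 500 lines lands at 0.002, which is the order of magnitude quoted above.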

Real-world impact

In a production brain, the coverage scan reported conversation_format_coverage: 100% no_match. 104/104 conversation pages plus 3,423 eligible pages matched no builtin pattern and silently yielded nothing.

The fix

New additive builtin bold-name-no-time:

  • regex /^\*\*(.+?):\*\*\s*(.*)$/ — captures the speaker inside the bold markers (**Name:**), message text in group 2.
  • No time capture. date_source: 'frontmatter' with hour_group undefined routes through parse.ts's existing no-time branch (the same convention as irc-classic), anchoring every message at 00:00:00 of the page's frontmatter date. No wall-clock time is fabricated; intra-day ordering is preserved by line order.
  • Declared after the timestamped bold patterns (imessage-slack, telegram-bracket, bold-paren-time) so it can never shadow them. The scorer tie-breaks on declaration index, and the regex requires the colon inside the bold markers, so the **Name** (time): paren-time shape (colon outside) still matches bold-paren-time, not this pattern.

No existing pattern is modified.
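The behavior of the new pattern can be sketched in isolation. The regex below is copied from the description above; the `Message` shape, the `parseBoldNameLines` helper, and the sample date are hypothetical and only illustrate the midnight anchoring, not the actual code path in parse.ts.

```typescript
// Regex copied from the PR description; everything else is illustrative.
const BOLD_NAME_NO_TIME = /^\*\*(.+?):\*\*\s*(.*)$/;

interface Message {
  speaker: string;
  text: string;
  timestamp: string; // always midnight of the page's frontmatter date
}

// Hypothetical helper: the real no-time branch in parse.ts may be shaped differently.
function parseBoldNameLines(body: string, frontmatterDate: string): Message[] {
  const messages: Message[] = [];
  for (const line of body.split("\n")) {
    const m = BOLD_NAME_NO_TIME.exec(line);
    if (!m) continue; // non-matching lines are skipped in this sketch
    messages.push({
      speaker: m[1].trim(),
      text: m[2],
      timestamp: `${frontmatterDate}T00:00:00Z`, // no wall-clock time is invented
    });
  }
  return messages;
}

const sample = [
  "**Speaker A:** Okay, let's start.",
  "**Speaker B:** Let me ask him.",
  "**Speaker A** (12:30): colon outside the bold markers",
].join("\n");

const parsed = parseBoldNameLines(sample, "2026-05-29");
console.log(parsed.length); // 2: the paren-time line is not captured
console.log(parsed[0].speaker, parsed[0].timestamp); // Speaker A 2026-05-29T00:00:00Z
```

The third sample line has its colon outside the bold markers, so the regex never finds `:**` and leaves that line for bold-paren-time.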

Tests

Added a dedicated bold-name-no-time test block in test/conversation-parser/parse.test.ts:

  • parses the real 4-line **Speaker:** text sample → 4 messages with correct speaker/text, anchored at 00:00:00 of the frontmatter date;
  • scores above the 0.05 acceptance floor on a pure bold-name transcript (epoch-default date when none supplied);
  • regression: **Garry** (HH:MM): text still matches bold-paren-time (not the new pattern);
  • regression: a pure telegram-bracket transcript still matches telegram-bracket (wins the score tie via lower declaration index).

Updated the existing disabledBuiltinIds test: disabling telegram-bracket now correctly falls through to bold-name-no-time (the bracket line's colon is inside the bold markers); disabling both yields the genuine no_match.
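The tie-break and fall-through can be sketched as follows. This is a simplified model, not the scorer in parse.ts: the `telegram-bracket` regex is a hypothetical stand-in, the score is assumed to be a plain match fraction, and `pickPattern` is an invented name. Only the `bold-name-no-time` regex and the 0.05 floor come from this PR.

```typescript
interface Pattern {
  id: string;
  regex: RegExp;
}

// Declaration order matters: earlier entries win score ties.
// The telegram-bracket regex is a hypothetical stand-in.
const PATTERNS: Pattern[] = [
  { id: "telegram-bracket", regex: /^\*\*\[\d{1,2}:\d{2}\]\s+(.+?):\*\*\s*(.*)$/ },
  { id: "bold-name-no-time", regex: /^\*\*(.+?):\*\*\s*(.*)$/ },
];

function pickPattern(lines: string[], disabledIds: string[] = [], floor = 0.05): string {
  let bestId = "no_match";
  let bestScore = 0;
  for (const p of PATTERNS) {
    if (disabledIds.includes(p.id)) continue;
    const hits = lines.filter((l) => p.regex.test(l)).length;
    const score = hits / Math.max(lines.length, 1);
    // Strict ">" keeps the earlier-declared pattern when scores tie.
    if (score >= floor && score > bestScore) {
      bestId = p.id;
      bestScore = score;
    }
  }
  return bestId;
}

const telegram = ["**[18:37] Speaker A:** hi", "**[18:38] Speaker B:** hello"];

console.log(pickPattern(telegram)); // telegram-bracket (wins the tie)
console.log(pickPattern(telegram, ["telegram-bracket"])); // bold-name-no-time
console.log(pickPattern(telegram, ["telegram-bracket", "bold-name-no-time"])); // no_match
```

Both patterns score 1.0 on the bracketed transcript, so the lower declaration index decides the winner; removing it exposes the looser pattern underneath.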

Full suite result:

bun test test/conversation-parser/ test/conversation-parser-cli.test.ts test/extract-conversation-facts.test.ts
131 pass / 134

The 3 failures are pre-existing on master and env-driven: the test that expects probeLlmAvailability to return null when ANTHROPIC_API_KEY is unset, plus two sibling "provider unavailable" fail-open tests. They fail identically on clean master in this sandbox because an ANTHROPIC_API_KEY is present in the environment, and they are unrelated to this change. eval-conversation-parser-cli and extract-conversation-facts-workers are fully green (32/32). All 14 builtins (13 existing + new) pass their module-load validatePatternEntry checks and their test_positive parse loop.


Separately noticed (not in this PR)

doctor --source <id> does not scope the orphan_ratio check. In src/commands/doctor.ts (~line 4559), getOrphansData(engine, { includePseudo: false }) takes no source filter, so --source default still reports a brain-wide orphan ratio across all federated sources. Suggested follow-up: thread sourceId into getOrphansData. (Note only — intentionally not fixed here.)
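A rough in-memory sketch of what threading a source filter could look like. The `Page` and `Link` shapes and the `orphanRatio` function are hypothetical, slugs are assumed unique across sources, and the sketch assumes inbound links from any source still make a page reachable. The real getOrphansData queries the engine and may differ.

```typescript
interface Page {
  slug: string;
  sourceId: string;
}

interface Link {
  from: string; // slug of the linking page
  to: string; // slug of the linked page
}

// Hypothetical model: scope only the candidate set, not the inbound links.
function orphanRatio(pages: Page[], links: Link[], sourceId?: string) {
  const linked = new Set(links.map((l) => l.to)); // inbound from any source counts
  const candidates =
    sourceId !== undefined ? pages.filter((p) => p.sourceId === sourceId) : pages;
  const orphans = candidates.filter((p) => !linked.has(p.slug)).map((p) => p.slug);
  const ratio = candidates.length === 0 ? 0 : orphans.length / candidates.length;
  return { orphans, ratio };
}

const pages: Page[] = [
  { slug: "a", sourceId: "default" },
  { slug: "b", sourceId: "default" },
  { slug: "c", sourceId: "team" },
];
const links: Link[] = [{ from: "c", to: "a" }];

console.log(orphanRatio(pages, links)); // brain-wide: orphans b and c, ratio 2/3
console.log(orphanRatio(pages, links, "default")); // scoped: orphan b only, ratio 0.5
```

In this toy data the brain-wide ratio is inflated by a page from another source, which is the symptom described in the note above.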

…ack/Granola/Zoom)

Adds an additive 'bold-name-no-time' builtin pattern matching the
`**Speaker:** text` shape that modern meeting-transcription tools
(Circleback, Granola, Zoom) emit with NO per-line timestamp.

All 13 prior builtins require a time anchor, so this shape scored
~0.002 (below the 0.05 SCORING_MIN_ACCEPTANCE gate) and parsed to zero
messages -> zero conversation-facts.

The new pattern captures the speaker inside the bold markers, has no
time capture, and uses date_source='frontmatter' with hour_group
undefined so parse.ts's existing no-time branch (same convention as
irc-classic) anchors every message at 00:00:00 of the frontmatter
date. No wall-clock time is fabricated; intra-day ordering preserved
by line order.

Declared AFTER imessage-slack/telegram-bracket/bold-paren-time so it
can never shadow them (scorer tie-breaks on declaration index; its
regex requires the colon INSIDE the bold markers so it cannot eat the
`**Name** (time):` paren-time shape).

Adds tests + updates the disabledBuiltinIds test for the new
fall-through. Full conversation-parser suite passing; the only
failures are pre-existing env-driven LLM-availability tests
(ANTHROPIC_API_KEY) unrelated to this change.
@garrytan

Owner

Superseded by #1620 (landing as v0.41.29.0). Thank you for this — the bold-name-no-time pattern is the real unlock for the 104 conversation pages + 3,423 eligible pages that were silently extracting nothing.

Re-homed into a base-repo branch because fork PRs from this account don't get CI secrets. Folded in with these changes during the review (/plan-eng-review + /codex outside-voice):

  • Privacy scrub: test_positive samples used real names (Garry Tan, Alex Graveley), which the project privacy rule forbids in checked-in code; replaced with placeholders. Also scrubbed the same names that were already in bold-paren-time.
  • score_full_body guard — the regex matches any **Label:** text (a common prose idiom). Codex verified the 0.05 floor does NOT protect a notes page with bold labels clustered in its first 10 lines (head score 0.3 skips the rescore). Added a full-body acceptance recompute so such a page stays no_match instead of mis-parsing as a conversation.
  • (?!\[) lookahead — so disabling telegram-bracket yields an honest no_match instead of capturing speaker="[18:37] Name". The disabledBuiltinIds test stays at no_match.
  • Added an eval fixture + clustered-head adversarial regression.
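The two hardening changes can be illustrated together. The regex is the hardened form quoted in the superseding commit message; the `density` helper, the 10-line head window as a constant, and the sample notes page are assumptions made for this sketch rather than the code in parse.ts.

```typescript
// Hardened regex as quoted in the superseding commit message.
const HARDENED = /^\*\*(?!\[)(.+?):\*\*\s*(.*)$/;

const FLOOR = 0.05; // acceptance floor from the PR
const HEAD_LINES = 10; // head window size mentioned in the review

// Hypothetical density helper: fraction of lines the pattern matches.
function density(lines: string[], re: RegExp): number {
  return lines.length === 0 ? 0 : lines.filter((l) => re.test(l)).length / lines.length;
}

// The lookahead rejects a bracketed timestamp right after the opening "**".
console.log(HARDENED.test("**[18:37] Speaker A:** hi")); // false
console.log(HARDENED.test("**Speaker A:** hi")); // true

// A notes page: three bold labels clustered at the top, then 97 prose lines.
const notes = [
  "**Status:** draft",
  "**Owner:** Speaker A",
  "**Due:** Friday",
  ...Array.from({ length: 97 }, (_, i) => `Plain prose line ${i}.`),
];

const headScore = density(notes.slice(0, HEAD_LINES), HARDENED); // 3 of 10
const fullScore = density(notes, HARDENED); // 3 of 100

console.log(headScore >= FLOOR); // true: the head pass alone would accept
console.log(fullScore >= FLOOR); // false: the full-body recompute rejects
```

Recomputing over the whole body before applying the floor is what turns the misleading 0.3 head score into the page's true density of 0.03.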

Your original bold-name-no-time pattern is preserved with Co-Authored-By on the parser commit. Bundled an unrelated orphan_ratio --source scoping fix into the same release. Closing in favor of #1620.

@garrytan garrytan closed this May 29, 2026
garrytan added a commit that referenced this pull request May 29, 2026
…(orphans): source-scoped orphan_ratio (supersedes #1613) (#1620)

* feat(conversation-parser): add bold-name-no-time builtin (Circleback/Granola/Zoom, no timestamp)

The 14th built-in pattern parses `**Speaker:** text` transcripts with NO
per-line timestamp — the shape Circleback / Granola / Zoom emit. Every prior
builtin required a time anchor, so this shape matched nothing: a production
brain had 104 conversation pages + 3,423 eligible pages silently extracting
zero facts. Messages anchor at T00:00:00Z of the frontmatter date (no
fabricated wall-clock; line order preserves sequence), same convention as
irc-classic.

Hardening beyond the original community proposal:
- regex `/^\*\*(?!\[)(.+?):\*\*\s*(.*)$/`: the colon-inside-bold (NOT
  declaration order) is what prevents shadowing bold-paren-time; the `(?!\[)`
  lookahead rejects telegram-bracket `**[18:37] Name:**` so disabling
  telegram-bracket yields an honest no_match instead of speaker="[18:37] Name".
- new optional PatternEntry.score_full_body: `**Label:** text` is a common
  prose idiom, so a notes page with bold labels clustered in its first 10
  lines scored 0.3 on the head pass (NOT < SCORING_HEAD_TRIGGER_THRESHOLD, so
  the full-body fallback never fired) and cleared the 0.05 floor. parse.ts now
  recomputes the winner's score over the full body before the floor, so such a
  page drops to its true low density and stays no_match.
- scrubbed pre-existing real names from bold-paren-time test_positive samples
  (privacy rule).

Fixtures use placeholder names only. Pinned by new bold-name-no-time +
clustered-head no_match cases in parse.test.ts and the eval corpus.

Co-Authored-By: garrytan-agents <noreply@github.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(orphans): scope orphan_ratio + find_orphans by source; fix total_linkable denominator

`gbrain doctor --source <id>` and `gbrain orphans --source <id>` now scope
the orphan scan to that source instead of reporting brain-wide. Three fixes:

- findOrphanPages(opts?: { sourceId?, sourceIds? }) on both engines scopes the
  CANDIDATE set (scalar `= $1` or federated `= ANY($1::text[])`). Inbound links
  from ANY source still count, so a page in source X linked FROM source Y is
  reachable and NOT an orphan of X (the deliberate, less-surprising definition).
- corrected the total_linkable denominator in findOrphans: it now enumerates
  all live pages (scoped) and subtracts every excluded-by-slug page, not just
  excluded orphans. The old `total - excludedOrphans` left excluded NON-orphan
  pages (templates/, scratch/) with inbound links in the denominator, inflating
  it and suppressing warnings. Changes orphan_ratio output for every brain, in
  the accurate direction.
- the find_orphans MCP op threads sourceScopeOpts(ctx), closing a cross-source
  read leak where a source-bound OAuth client saw brain-wide orphans (v0.34.1
  source-isolation class).

doctor uses an explicit `--source` flag parse (NOT resolveSourceWithTier, which
would scope bare invocations to a default), and under explicit --source reports
the ratio with a low-scale caveat below 100 entity pages instead of a vacuous
"ok". Thin-client doctor --source orphan_ratio deferred (TODOS.md).

Pinned by test/orphans-source-scope.test.ts (PGLite: scoping, cross-source
inbound, denominator, find_orphans op scope) + a Postgres↔PGLite parity case
in test/e2e/engine-parity.test.ts (scalar + federated binding).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs: v0.41.29.0 — bold-name-no-time + orphan source scoping

VERSION + package.json → 0.41.29.0; CHANGELOG entry; CLAUDE.md conversation-parser
(13→14 patterns) + orphans source-scoping notes; regenerated llms bundles; TODOS
for thin-client doctor --source + check-test-real-names widening.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: garrytan-agents <noreply@github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mgunnin added a commit to mgunnin/gbrain that referenced this pull request Jun 3, 2026
* upstream/master:
  v0.41.29.0 feat(conversation-parser): bold-name-no-time builtin + fix(orphans): source-scoped orphan_ratio (supersedes garrytan#1613) (garrytan#1620)
  v0.41.27.0 fix: withRetry self-heals on null singleton + facts:absorb drain + disconnect audit (closes garrytan#1570) (garrytan#1608)
  v0.41.27.0 fix(doctor): git-aware sync_freshness (supersedes garrytan#1564) (garrytan#1573)