feat: support bracket-time format in conversation facts parser by garrytan-agents · Pull Request #1461 · garrytan/gbrain

garrytan-agents · 2026-05-26T00:17:49Z

Problem

extract-conversation-facts only recognizes one message format:

**Speaker** (2026-05-25 6:30 PM): message text

Conversation pages synced from Telegram use a different bracket-time format:

**[18:37] 👤 G T:** message text
**[18:38] 🤖 Bot:** response text

In one production brain, 134 pages correctly typed as `conversation` yield 0 eligible segments because every line fails the regex match. Doctor reports the pages as backlog, but the extractor silently skips them all.

Solution

Add a second regex (BRACKET_TIME_RX) that matches the bracket-time format:

| Component | Format 1 (existing) | Format 2 (new) |
| --- | --- | --- |
| Pattern | `Name (YYYY-MM-DD H:MM AM/PM): text` | `[HH:MM] 👤 Name: text` |
| Date source | Inline in each line | Page frontmatter `date` field |
| Time format | 12-hour with AM/PM | 24-hour |
| Speaker prefix | None | Optional emoji (stripped) |

Changes

src/commands/extract-conversation-facts.ts

- New BRACKET_TIME_RX regex: `/^\*\*\[(\d{1,2}):(\d{2})\]\s+(.+?):\*\*\s*(.*)$/`
- cleanSpeaker() helper strips leading emoji + whitespace from speaker names using Unicode property escapes
- ParseConversationOpts.fallbackDate parameter supplies the date for time-only lines
- processPage() derives fallbackDate from page.frontmatter.date (preferred) or page.effective_date (fallback)
- Parser tries Format 1 first, then Format 2, then treats the line as a continuation, preserving full backward compatibility
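The parse order described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's actual code: `BRACKET_TIME_RX` is copied from the description, while `FULL_DATE_RX`, the `Message` shape, the timestamp string format, and the exact emoji character class inside `cleanSpeaker` are guesses made only to keep the example self-contained.

```typescript
// Sketch only. FULL_DATE_RX, Message, and the timestamp format are assumptions.
interface Message {
  speaker: string;
  timestamp: string; // local timestamp string, e.g. "2026-05-25T18:37:00"
  text: string;
}

interface ParseConversationOpts {
  fallbackDate?: string; // YYYY-MM-DD, used for time-only (Format 2) lines
}

// Format 1 (assumed shape): **Speaker** (2026-05-25 6:30 PM): message text
const FULL_DATE_RX =
  /^\*\*(.+?)\*\*\s+\((\d{4}-\d{2}-\d{2})\s+(\d{1,2}):(\d{2})\s*(AM|PM)\):\s*(.*)$/i;

// Format 2 (from the PR description): **[18:37] 👤 G T:** message text
const BRACKET_TIME_RX = /^\*\*\[(\d{1,2}):(\d{2})\]\s+(.+?):\*\*\s*(.*)$/;

// Strip leading emoji (plus variation selectors, joiners, whitespace) from a speaker name.
function cleanSpeaker(raw: string): string {
  return raw
    .replace(/^[\p{Extended_Pictographic}\p{Emoji_Modifier}\uFE0F\u200D\s]+/u, "")
    .trim();
}

function pad2(n: number): string {
  return n < 10 ? `0${n}` : String(n);
}

// Convert a 12-hour clock value to 24-hour (12 AM -> 0, 12 PM -> 12).
function to24h(hour12: number, meridiem: string): number {
  const h = hour12 % 12;
  return meridiem.toUpperCase() === "PM" ? h + 12 : h;
}

function parseConversationMessages(
  body: string,
  opts: ParseConversationOpts = {},
): Message[] {
  const fallbackDate = opts.fallbackDate ?? "1970-01-01";
  const messages: Message[] = [];

  for (const line of body.split("\n")) {
    // 1. Existing full-date format: the date is inline in the line.
    const full = FULL_DATE_RX.exec(line);
    if (full) {
      const [, speaker, date, hh, mm, meridiem, text] = full;
      const hour = pad2(to24h(Number(hh), meridiem));
      messages.push({ speaker: speaker.trim(), timestamp: `${date}T${hour}:${mm}:00`, text });
      continue;
    }

    // 2. Bracket-time format: the date comes from fallbackDate.
    const bracket = BRACKET_TIME_RX.exec(line);
    if (bracket) {
      const [, hh, mm, rawSpeaker, text] = bracket;
      messages.push({
        speaker: cleanSpeaker(rawSpeaker),
        timestamp: `${fallbackDate}T${pad2(Number(hh))}:${mm}:00`,
        text,
      });
      continue;
    }

    // 3. Anything else continues the previous message, if there is one.
    const last = messages[messages.length - 1];
    if (last && line.trim() !== "") last.text += `\n${line}`;
  }
  return messages;
}

const demoMessages = parseConversationMessages(
  "**[18:37] 👤 G T:** hello\n**[18:38] 🤖 Bot:** response",
  { fallbackDate: "2026-05-25" },
);
console.log(demoMessages);
```

Trying the stricter full-date pattern first means existing pages never change behavior; the bracket-time branch only sees lines that previously fell through as orphans.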

test/extract-conversation-facts.test.ts

6 new test cases:
- Bracket-time with 👤 emoji speaker prefix
- Bracket-time with 🤖 robot emoji
- Multi-line continuation after bracket-time anchor
- Fallback to 1970-01-01 when no fallbackDate provided
- Mixed Format 1 + Format 2 in the same body
- Bracket-time without emoji prefix

Testing

33/33 tests pass (bun test test/extract-conversation-facts.test.ts). All 27 existing tests unchanged; 6 new tests added.

Behavior Matrix

| Input | Before | After |
| --- | --- | --- |
| `Alice (2024-03-15 9:00 AM): hi` | ✅ Parsed | ✅ Parsed (unchanged) |
| `[18:37] 👤 G T: hello` | ❌ Orphan line | ✅ Parsed, speaker=`G T` |
| `[06:00] 🤖 Bot: response` | ❌ Orphan line | ✅ Parsed, speaker=`Bot` |
| `[22:15] Plain Name: text` | ❌ Orphan line | ✅ Parsed, speaker=`Plain Name` |
| Continuation lines after any format | ✅ Appended | ✅ Appended (unchanged) |

Add Format 2 (bracket-time) to parseConversationMessages alongside the existing full-date format. This enables fact extraction from Telegram conversation pages that use `**[HH:MM] 👤 Speaker:**` syntax instead of the `**Speaker** (YYYY-MM-DD H:MM AM/PM):` format.

Changes:
- New BRACKET_TIME_RX regex for `**[HH:MM] emoji Speaker:**` lines
- cleanSpeaker() strips leading emoji from speaker names
- ParseConversationOpts.fallbackDate supplies the date for time-only lines (derived from page frontmatter or effective_date)
- processPage() passes the page date as fallbackDate automatically
- 6 new tests covering emoji prefix, multi-line, mixed formats, and fallback behavior (33/33 pass)

garrytan · 2026-05-26T04:15:01Z

Thanks for catching this — the Telegram bracket-time format silently dropping 134 production conversation pages is a real bug, and the regex + frontmatter-date derivation in this PR is the right shape.

This PR is being superseded by a production-grade replacement (in progress, will land as v0.41.13.0). The replacement generalizes the fix: instead of adding one regex per format (this PR fixed Telegram; Discord, WhatsApp, Signal, IRC, plain email threads, and any future chat export would each need their own one-off PR), it ships:

- Built-in pattern registry: 12+ hand-vetted patterns covering iMessage/Slack, Telegram (2 variants), Discord (2 variants), WhatsApp (2 locales), Signal, IRC (2 variants), Matrix/Element, Slack-export-json, Teams-export, and email-thread. Each with module-load validation, quick_reject hints, explicit multi_line flag, and timezone_policy.
- User-declared patterns via `gbrain config set conversation_parser.patterns`, using a simple_pattern structured spec rather than raw user regex (arbitrary regex needs proper isolation we haven't built yet; deferred to v0.42+).
- Pattern-priority scoring: the orchestrator scores all candidate patterns across the first 10 lines per page and picks the highest match-rate + most date-consistent + most participant-stable. Defeats silent mis-routing when formats overlap.
- Opt-IN LLM polish + fallback for the long tail (privacy posture: private chat logs don't go to Anthropic by default).
- `gbrain eval conversation-parser` quality surface: synthetic + real-corpus-redacted fixture gate wired into `bun run verify` so parser regressions block PRs.
- `gbrain doctor` checks for coverage tracking + audit health.
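To make the pattern-priority scoring item concrete, here is a minimal sketch of scoring by match rate over the first 10 lines instead of taking the first pattern that matches. It is an assumption-heavy illustration: the `PatternEntry` fields, the `full-date` id, and `pickPattern` are hypothetical, and the date-consistency and participant-stability signals mentioned above are left out.

```typescript
// Hypothetical registry entry; only the "telegram-bracket" id appears in this thread.
interface PatternEntry {
  id: string;
  rx: RegExp; // must not carry the `g` flag, since test() would then be stateful
}

const PATTERNS: PatternEntry[] = [
  {
    id: "full-date", // hypothetical id for the existing format
    rx: /^\*\*(.+?)\*\*\s+\(\d{4}-\d{2}-\d{2}\s+\d{1,2}:\d{2}\s*(AM|PM)\):/i,
  },
  {
    id: "telegram-bracket",
    rx: /^\*\*\[(\d{1,2}):(\d{2})\]\s+(.+?):\*\*\s*(.*)$/,
  },
];

// Score every pattern on the first `sampleSize` lines and return the one with
// the highest match rate. Ties keep the earlier registry entry.
function pickPattern(
  lines: string[],
  patterns: PatternEntry[],
  sampleSize = 10,
): PatternEntry | null {
  const sample = lines.slice(0, sampleSize);
  if (sample.length === 0) return null;

  let best: PatternEntry | null = null;
  let bestRate = 0;
  for (const pattern of patterns) {
    const hits = sample.filter((line) => pattern.rx.test(line)).length;
    const rate = hits / sample.length;
    if (rate > bestRate) {
      best = pattern;
      bestRate = rate;
    }
  }
  return best; // null when no pattern matched any sampled line
}

const page = [
  "**[18:37] 👤 G T:** hello",
  "a continuation line",
  "**[18:38] 🤖 Bot:** response",
];
console.log(pickPattern(page, PATTERNS)?.id);
```

Scoring over a sample rather than stopping at the first hit is what keeps a loosely written pattern early in the registry from capturing pages that a later, better-fitting pattern should own.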

Your Telegram bracket-time regex (BRACKET_TIME_RX), cleanSpeaker helper, and ParseConversationOpts.fallbackDate shape survive verbatim — they become the telegram-bracket built-in pattern + DEFAULT_SPEAKER_CLEAN export. All 6 of your test cases pass against the new orchestrator. Co-Authored-By preserved in the new PR's merge commit.

The same wave also extracts a src/core/progressive-batch/ primitive (Wintermute-inspired ramp-up: trial 10 → 100 → 500 → full with verification at each stage) and retrofits 9 existing batch sites onto it. Future batch features inherit the discipline for free.
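The ramp shape (trial 10 → 100 → 500 → full, verifying at each stage) could be sketched as below. This is not the `src/core/progressive-batch/` API: `runProgressive`, `StageReport`, and `RampResult` are hypothetical names, the thresholds are treated as cumulative item counts (an assumption; the comment does not say), and the sketch is synchronous for brevity.

```typescript
// Hypothetical shapes for illustration only.
interface StageReport {
  expected: number; // items handed to this stage
  produced: number; // outputs the stage reports back
}

interface RampResult {
  processed: number; // items handed to processBatch so far
  completed: boolean; // false when a stage failed verification
  stagesRun: number;
}

function runProgressive<T>(
  items: T[],
  processBatch: (batch: T[]) => number,
  verify: (report: StageReport) => boolean,
  ramp: number[] = [10, 100, 500],
): RampResult {
  // Cumulative targets: keep thresholds below the total, then finish with "full".
  const targets = ramp.filter((n) => n < items.length).concat(items.length);

  let processed = 0;
  let stagesRun = 0;
  for (const target of targets) {
    const batch = items.slice(processed, target);
    const produced = processBatch(batch);
    processed = target;
    stagesRun += 1;

    // Verify this stage's own delta before widening the blast radius.
    if (!verify({ expected: batch.length, produced })) {
      return { processed, completed: false, stagesRun };
    }
  }
  return { processed, completed: true, stagesRun };
}

// Usage: a batch job that starts failing after the 10-item trial.
let calls = 0;
const result = runProgressive(
  Array.from({ length: 1000 }, (_, i) => i),
  (batch) => (calls++ === 0 ? batch.length : 0),
  ({ expected, produced }) => produced === expected,
);
console.log(result);
```

In the usage example the second stage produces nothing, so the run stops after 100 items instead of burning through all 1000.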

Closing in favor of the upcoming v0.41.13.0 PR.

…imitive (closes #1461) (#1510)

* v0.41.15.0 feat: conversation parser cathedral + progressive-batch primitive (closes #1461)

Replaces PR #1461's single-format Telegram regex with a 12-pattern built-in registry covering iMessage/Slack, Telegram (×2), Discord (×2), WhatsApp (×2 locales), Signal, Matrix/Element, IRC (×2), Teams. Each pattern is hand-vetted from public format docs (signal-cli, DiscordChatExporter, Telegram Desktop, WhatsApp export docs, Element matrix-archive, irssi/weechat defaults); module-load validation runs test_positive[] + test_negative[] for every pattern at startup so a typo makes gbrain refuse to start.

PR #1461 contributor's BRACKET_TIME_RX + cleanSpeaker survive verbatim as the `telegram-bracket` built-in pattern + DEFAULT_SPEAKER_CLEAN export. All 33 of their test cases pass against the new orchestrator.

Three layers per page (orchestrator chooses):
1. Built-in pattern registry (zero-cost, deterministic)
2. User-declared simple_pattern via config (deferred to v0.42+)
3. Opt-IN LLM polish + fallback (privacy-first; chat content goes to Anthropic only when user explicitly enables)

D18 priority scoring picks the highest-match-rate pattern across the first 10 lines (not first-wins) so overlapping formats don't silently mis-route. D5 multi_line per-pattern + D11 quick_reject prefix screen + D19 timezone_policy per-pattern complete the registry shape.

Companion: src/core/progressive-batch/ primitive (rule of three satisfied across 12+ ad-hoc cost-prompt sites). Wintermute-inspired ramp shape (trial 10 → 100 → 500 → full with verification at each stage), productionized with verifier+policy injection (callers describe HOW TO MEASURE SUCCESS, not WHEN TO WAIT FOR CTRL-C). D3 fail-closed budget gate: null tracker + null Policy.maxCostUsd → abort_cost_cap reason='no_budget_safety_net'. D20 discriminated Verifier union (output_count | idempotent_mutation | noop). extract-conversation-facts is the one proven consumer in v0.41.15.0; 9-site retrofit deferred to v0.41.16.0+ per TODOS.md.

Codex outside-voice review absorbed 8 substantive findings:
- Privacy posture (LLM polish/fallback flipped to opt-IN)
- ReDoS theater (dropped arbitrary user regex; v0.42+ uses RE2)
- LLM-inferred-regex persistence as silent-corruption machine
- Pattern priority scoring across first 10 lines
- Timezone policy on every PatternEntry
- Verifier shape discriminated union
- Behavior parity for sites that "jumped straight to full"
- Real-corpus-redacted fixture gap (v0.42+ TODO)

CI gates:
- bun run check:conversation-parser (13 fixtures, --no-llm, deterministic)
- bun run check:fixture-privacy (banned-token grep)

Doctor surfaces 3 new checks: conversation_format_coverage, progressive_batch_audit_health, conversation_parser_probe_health.

Tests: 198/198 across primitive + parser + LLM + nightly probe + eval CLI + debug CLI + doctor checks + migration v97 round-trip + E2E parser ↔ engine integration.

Real bug caught + fixed during gap audit: IdempotentMutationVerifier was comparing absolute mutated-count vs per-stage expected (failed silently on stage 2+); now uses per-stage delta semantics matching OutputCountVerifier.

Schema migration v97: conversation_parser_llm_cache table with (content_sha256, model_id, call_shape) composite key. NO inferred_patterns table (D17: silent-corruption machine).

Plan + 23 decisions + codex outside-voice absorption at ~/.claude/plans/system-instruction-you-are-working-cuddly-hollerith.md.

Co-Authored-By: garrytan-agents (PR #1461) <noreply@github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(check-privacy): allowlist scripts/check-fixture-privacy.sh

The new sibling privacy guard literally names the banned tokens in its BANNED_TOKENS array, the same meta-exception that check-privacy.sh itself gets. Without this allowlist entry, bun run verify rejects the file post-merge because the banned name appears in the rule-definition script.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: renumber v0.41.15.0 → v0.41.16.0 (queue drift)

Mechanical rename across all surfaces: VERSION, package.json, CHANGELOG (header + body refs), CLAUDE.md, TODOS.md, src/core/migrate.ts (migration v98 comment), all src/core/conversation-parser/* and src/core/progressive-batch/* file headers, all test/ headers, scripts/check-privacy.sh allowlist comment, llms-full.txt regenerated.

Audit clean: VERSION + package.json + CHANGELOG header all show 0.41.16.0. verify 24/24, touched tests 179/179.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: garrytan-agents (PR #1461) <noreply@github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
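The module-load validation described in the commit message above (every pattern checked against its own test_positive[] and test_negative[] at startup) might look something like this sketch. The field names `test_positive` and `test_negative` and the `telegram-bracket` id come from the commit message; the `rx` field, `validateRegistry`, and the sample strings are assumptions for illustration.

```typescript
// Hypothetical entry shape; the real PatternEntry likely carries more fields
// (quick_reject, multi_line, timezone_policy).
interface PatternEntry {
  id: string;
  rx: RegExp;
  test_positive: string[]; // lines the pattern must match
  test_negative: string[]; // lines the pattern must not match
}

const BUILTIN_PATTERNS: PatternEntry[] = [
  {
    id: "telegram-bracket",
    rx: /^\*\*\[(\d{1,2}):(\d{2})\]\s+(.+?):\*\*\s*(.*)$/,
    test_positive: ["**[18:37] 👤 G T:** message text"],
    test_negative: ["**Speaker** (2026-05-25 6:30 PM): message text"],
  },
];

// Throw on the first pattern that disagrees with its own fixtures, so a typo
// in a regex stops startup instead of silently dropping messages later.
function validateRegistry(patterns: PatternEntry[]): void {
  for (const pattern of patterns) {
    for (const sample of pattern.test_positive) {
      if (!pattern.rx.test(sample)) {
        throw new Error(`pattern ${pattern.id}: expected a match for ${JSON.stringify(sample)}`);
      }
    }
    for (const sample of pattern.test_negative) {
      if (pattern.rx.test(sample)) {
        throw new Error(`pattern ${pattern.id}: unexpected match for ${JSON.stringify(sample)}`);
      }
    }
  }
}

// Runs once when the module is loaded.
validateRegistry(BUILTIN_PATTERNS);
```

Keeping the fixtures next to the regex means the check costs a handful of `test()` calls per pattern at startup and needs no separate test run to catch a broken entry.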

* upstream/master:
  - v0.41.26.1 fix: lock-renewal cathedral - closes ~39 worker crashes/day (supersedes garrytan#1567) (garrytan#1572)
  - v0.41.26.0 fix: dream --source + ingest junk titles + emoji-crash (supersedes garrytan#1559, garrytan#1561) (garrytan#1571)
  - v0.41.25.0 perf(sync): batched deletes + global page-generation clock (supersedes garrytan#1538) (garrytan#1566)
  - v0.41.24.0 fix(conversation-parser): threshold gates + bold-paren-time pattern - 20,167 Circleback messages unblocked (closes garrytan#1533) (garrytan#1543)
  - v0.41.23.0 feat: extract operator surfaces + pack-driven extractables (garrytan#1541)
  - v0.41.22.1 feat: brainstorm/lsd judge fixes (closes garrytan#1540 end-to-end) (garrytan#1562)
  - v0.41.22.0 feat: type-unification cathedral - 94 types → 15 canonical (closes garrytan#1479) (garrytan#1542)
  - v0.41.21.0 feat(ops): 5 daily-driver pains fixed in one wave (garrytan#1545)
  - v0.41.20.0 feat: gbrain status + doctor --scope=brain (fix wave 2: items garrytan#6 + garrytan#7) (garrytan#1544)
  - feat: v0.41.19.0 Supavisor Retry Cathedral (garrytan#1537)
  - v0.41.18.0: gbrain onboard - the activation surface gbrain didn't have before (garrytan#1521)
  - v0.41.17.0 feat: --workers N on every bulk command + facts dim doctor parity (garrytan#1519)
  - v0.41.16.0 feat: conversation parser cathedral + progressive-batch primitive (closes garrytan#1461) (garrytan#1510)
  - v0.41.15.0 feat(sync): --timeout + --max-age + partial status (closes garrytan#1472 RFC) (garrytan#1506)

garrytan closed this May 26, 2026

garrytan mentioned this pull request May 26, 2026

v0.41.16.0 feat: conversation parser cathedral + progressive-batch primitive (closes #1461) #1510

Merged

8 tasks

garrytan-agents wants to merge 1 commit into garrytan:master from garrytan-agents:feat/conversation-parser-bracket-format


2 participants