feat: support bracket-time format in conversation facts parser #1461

Closed
garrytan-agents wants to merge 1 commit into garrytan:master from garrytan-agents:feat/conversation-parser-bracket-format

@garrytan-agents

Contributor

Problem

extract-conversation-facts only recognizes one message format:

**Speaker** (2026-05-25 6:30 PM): message text

Conversation pages synced from Telegram use a different bracket-time format:

**[18:37] 👤 G T:** message text
**[18:38] 🤖 Bot:** response text

A production brain with 134 conversation pages, all correctly typed as conversation, shows 0 eligible segments because every line fails the regex match. Doctor reports the pages as backlog, but the extractor silently skips them all.

Solution

Add a second regex (BRACKET_TIME_RX) that matches the bracket-time format:

| Component | Format 1 (existing) | Format 2 (new) |
| --- | --- | --- |
| Pattern | `**Name** (YYYY-MM-DD H:MM AM/PM): text` | `**[HH:MM] 👤 Name:** text` |
| Date source | Inline in each line | Page frontmatter `date` field |
| Time format | 12-hour with AM/PM | 24-hour |
| Speaker prefix | None | Optional emoji (stripped) |

Changes

src/commands/extract-conversation-facts.ts

  • New BRACKET_TIME_RX regex: /^\*\*\[(\d{1,2}):(\d{2})\]\s+(.+?):\*\*\s*(.*)$/
  • cleanSpeaker() helper strips leading emoji + whitespace from speaker names using Unicode property escapes
  • ParseConversationOpts.fallbackDate parameter supplies the date for time-only lines
  • processPage() derives fallbackDate from page.frontmatter.date (preferred) or page.effective_date (fallback)
  • Parser tries Format 1 first, then Format 2, then treats the line as a continuation — preserving full backward compatibility
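
The parsing order described in the list above can be sketched roughly as follows. This is an illustration, not the code in this PR: `BRACKET_TIME_RX` is quoted from the change list, while `FULL_DATE_RX`, the `ParsedMessage` shape, the timestamp string format, and the skipping of blank continuation lines are assumptions made for the example.

```typescript
// Hypothetical output shape; the real parser's message type is not shown in the PR.
interface ParsedMessage {
  speaker: string;
  timestamp: string; // assumed format: YYYY-MM-DDTHH:MM:00
  text: string;
}

interface ParseConversationOpts {
  fallbackDate?: string; // YYYY-MM-DD, used only by time-only (Format 2) lines
}

// Format 1: **Name** (YYYY-MM-DD H:MM AM/PM): text  (assumed regex, not from the PR)
const FULL_DATE_RX =
  /^\*\*(.+?)\*\*\s+\((\d{4}-\d{2}-\d{2})\s+(\d{1,2}):(\d{2})\s*(AM|PM)\):\s*(.*)$/i;
// Format 2: **[HH:MM] 👤 Name:** text  (regex as quoted in the change list)
const BRACKET_TIME_RX = /^\*\*\[(\d{1,2}):(\d{2})\]\s+(.+?):\*\*\s*(.*)$/;

// Strip leading emoji (plus modifiers, variation selectors, joiners) and whitespace.
function cleanSpeaker(raw: string): string {
  return raw
    .replace(/^[\p{Extended_Pictographic}\p{Emoji_Modifier}\uFE0F\u200D\s]+/u, "")
    .trim();
}

// Convert a 12-hour clock value to 24-hour (12 AM -> 0, 12 PM -> 12).
function to24Hour(hour12: number, meridiem: string): number {
  const h = hour12 % 12;
  return meridiem.toUpperCase() === "PM" ? h + 12 : h;
}

function pad2(n: number): string {
  return n < 10 ? `0${n}` : String(n);
}

function parseConversationMessages(
  body: string,
  opts: ParseConversationOpts = {},
): ParsedMessage[] {
  // The PR's tests mention a 1970-01-01 fallback when no date is supplied.
  const date = opts.fallbackDate ?? "1970-01-01";
  const messages: ParsedMessage[] = [];

  for (const line of body.split("\n")) {
    // 1. Try the existing full-date format first.
    const full = FULL_DATE_RX.exec(line);
    if (full) {
      const [, speaker, day, hh, mm, meridiem, text] = full;
      const hour = pad2(to24Hour(Number(hh), meridiem));
      messages.push({ speaker: speaker.trim(), timestamp: `${day}T${hour}:${mm}:00`, text });
      continue;
    }
    // 2. Then the bracket-time format, dated from fallbackDate.
    const bracket = BRACKET_TIME_RX.exec(line);
    if (bracket) {
      const [, hh, mm, rawSpeaker, text] = bracket;
      messages.push({
        speaker: cleanSpeaker(rawSpeaker),
        timestamp: `${date}T${pad2(Number(hh))}:${mm}:00`,
        text,
      });
      continue;
    }
    // 3. Otherwise treat the line as a continuation of the previous message.
    //    Skipping blank lines here is an assumption of this sketch.
    const last = messages[messages.length - 1];
    if (last && line.trim() !== "") last.text += `\n${line}`;
  }
  return messages;
}
```

Because Format 1 is tried first and a bracket-time line cannot satisfy `FULL_DATE_RX`, existing pages should parse exactly as before, and only previously orphaned lines change behavior.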

test/extract-conversation-facts.test.ts

  • 6 new test cases:
    • Bracket-time with 👤 emoji speaker prefix
    • Bracket-time with 🤖 robot emoji
    • Multi-line continuation after bracket-time anchor
    • Fallback to 1970-01-01 when no fallbackDate provided
    • Mixed Format 1 + Format 2 in the same body
    • Bracket-time without emoji prefix

Testing

33/33 tests pass (bun test test/extract-conversation-facts.test.ts). All 27 existing tests unchanged; 6 new tests added.

Behavior Matrix

| Input | Before | After |
| --- | --- | --- |
| `**Alice** (2024-03-15 9:00 AM): hi` | ✅ Parsed | ✅ Parsed (unchanged) |
| `**[18:37] 👤 G T:** hello` | ❌ Orphan line | ✅ Parsed, speaker=G T |
| `**[06:00] 🤖 Bot:** response` | ❌ Orphan line | ✅ Parsed, speaker=Bot |
| `**[22:15] Plain Name:** text` | ❌ Orphan line | ✅ Parsed, speaker=Plain Name |
| Continuation lines after any format | ✅ Appended | ✅ Appended (unchanged) |

Add Format 2 (bracket-time) to parseConversationMessages alongside the
existing full-date format. This enables fact extraction from Telegram
conversation pages that use **[HH:MM] 👤 Speaker:** syntax instead of
**Speaker** (YYYY-MM-DD H:MM AM/PM): format.

Changes:
- New BRACKET_TIME_RX regex for **[HH:MM] emoji Speaker:** lines
- cleanSpeaker() strips leading emoji from speaker names
- ParseConversationOpts.fallbackDate supplies the date for time-only
  lines (derived from page frontmatter or effective_date)
- processPage() passes the page date as fallbackDate automatically
- 6 new tests covering emoji prefix, multi-line, mixed formats, and
  fallback behavior (33/33 pass)
@garrytan

Owner

Thanks for catching this — the Telegram bracket-time format silently dropping 134 production conversation pages is a real bug, and the regex + frontmatter-date derivation in this PR is the right shape.

This PR is being superseded by a production-grade replacement (in progress, will land as v0.41.13.0). The replacement generalizes the fix: instead of adding one regex per format (this PR fixed Telegram; Discord, WhatsApp, Signal, IRC, plain email threads, and any future chat export would each need their own one-off PR), it ships:

  1. Built-in pattern registry — 12+ hand-vetted patterns covering iMessage/Slack, Telegram (2 variants), Discord (2 variants), WhatsApp (2 locales), Signal, IRC (2 variants), Matrix/Element, Slack-export-json, Teams-export, and email-thread. Each with module-load validation, quick_reject hints, explicit multi_line flag, and timezone_policy.

  2. User-declared patterns via gbrain config set conversation_parser.patterns — using a simple_pattern structured spec rather than raw user regex (arbitrary regex needs proper isolation we haven't built yet; deferred to v0.42+).

  3. Pattern-priority scoring — orchestrator scores all candidate patterns across the first 10 lines per page, picks highest match-rate + most date-consistent + most participant-stable. Defeats silent mis-routing when formats overlap.

  4. Opt-IN LLM polish + fallback for the long tail (privacy posture: private chat logs don't go to Anthropic by default).

  5. gbrain eval conversation-parser quality surface — synthetic + real-corpus-redacted fixture gate wired into bun run verify so parser regressions block PRs.

  6. gbrain doctor checks for coverage tracking + audit health.
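
To make item 3 concrete, here is a minimal sketch of the match-rate part of the scoring only. It assumes a hypothetical `CandidatePattern` shape and a `pickPattern` helper, neither of which is taken from the replacement PR. The date-consistency and participant-stability signals are left out, and sampling non-empty lines is an assumption.

```typescript
// Hypothetical candidate shape; the real registry entries carry more fields.
interface CandidatePattern {
  id: string;
  regex: RegExp; // must not use the `g` flag, since test() would then be stateful
}

interface PatternScore {
  id: string;
  matchRate: number; // fraction of sampled lines the pattern matched
}

// Score every candidate over the first `sampleSize` non-empty lines and return
// the best one, or null when nothing matches. Ties go to the earlier entry.
function pickPattern(
  body: string,
  patterns: CandidatePattern[],
  sampleSize = 10,
): PatternScore | null {
  const sample = body
    .split("\n")
    .filter((line) => line.trim() !== "")
    .slice(0, sampleSize);
  if (sample.length === 0) return null;

  let best: PatternScore | null = null;
  for (const pattern of patterns) {
    const hits = sample.filter((line) => pattern.regex.test(line)).length;
    const matchRate = hits / sample.length;
    if (matchRate > 0 && (best === null || matchRate > best.matchRate)) {
      best = { id: pattern.id, matchRate };
    }
  }
  return best;
}
```

The difference from a first-wins loop is that a pattern matching one stray line early in the page cannot claim the whole page when another pattern matches most of the sample.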

Your Telegram bracket-time regex (BRACKET_TIME_RX), cleanSpeaker helper, and ParseConversationOpts.fallbackDate shape survive verbatim — they become the telegram-bracket built-in pattern + DEFAULT_SPEAKER_CLEAN export. All 6 of your test cases pass against the new orchestrator. Co-Authored-By preserved in the new PR's merge commit.

The same wave also extracts a src/core/progressive-batch/ primitive (Wintermute-inspired ramp-up: trial 10 → 100 → 500 → full with verification at each stage) and retrofits 9 existing batch sites onto it. Future batch features inherit the discipline for free.
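
As a rough picture of that ramp shape, the sketch below runs a batch in growing stages and stops at the first stage whose verification fails. Everything here is hypothetical: `runProgressive`, `RampOutcome`, the synchronous callbacks, and the reading of 10 / 100 / 500 as cumulative cut-offs are assumptions for illustration, not the primitive's actual API.

```typescript
interface RampOutcome {
  attempted: number; // items handed to processBatch so far
  completed: boolean; // true only if every stage verified
  abortedAtStage?: number; // zero-based index of the failing stage
}

// Process `items` in stages (first 10, up to 100, up to 500, then the rest),
// verifying each stage against its own expected count before continuing.
function runProgressive<T>(
  items: T[],
  processBatch: (batch: T[]) => number, // returns how many outputs the batch produced
  verify: (expected: number, produced: number) => boolean,
  stageLimits: number[] = [10, 100, 500],
): RampOutcome {
  // Drop limits that exceed the input, then add a final "full" stage.
  const cutoffs = [...stageLimits.filter((n) => n < items.length), items.length];
  let attempted = 0;

  for (let stage = 0; stage < cutoffs.length; stage++) {
    const batch = items.slice(attempted, cutoffs[stage]);
    const produced = processBatch(batch);
    attempted += batch.length;
    // Compare per-stage numbers, not running totals.
    if (!verify(batch.length, produced)) {
      return { attempted, completed: false, abortedAtStage: stage };
    }
  }
  return { attempted, completed: true };
}
```

With 250 items and the default limits, this runs batches of 10, 90, and 150. A failure in the 10-item trial costs ten items' worth of work rather than the full run.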

Closing in favor of the upcoming v0.41.13.0 PR.

@garrytan garrytan closed this May 26, 2026
garrytan added a commit that referenced this pull request May 27, 2026
…imitive (closes #1461) (#1510)

* v0.41.15.0 feat: conversation parser cathedral + progressive-batch primitive (closes #1461)

Replaces PR #1461's single-format Telegram regex with a 12-pattern
built-in registry covering iMessage/Slack, Telegram (×2), Discord
(×2), WhatsApp (×2 locales), Signal, Matrix/Element, IRC (×2), Teams.
Each pattern is hand-vetted from public format docs (signal-cli,
DiscordChatExporter, Telegram Desktop, WhatsApp export docs, Element
matrix-archive, irssi/weechat defaults); module-load validation runs
test_positive[] + test_negative[] for every pattern at startup so a
typo makes gbrain refuse to start.
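
A minimal sketch of that module-load check, under stated assumptions: the `test_positive` and `test_negative` field names come from the text above, but the `PatternEntry` shape, `validateRegistry`, and `BUILT_IN_PATTERNS` are hypothetical names for illustration, and the sample strings are taken from the PR #1461 description.

```typescript
// Hypothetical entry shape; the real PatternEntry has more fields
// (quick_reject, multi_line, timezone_policy, ...).
interface PatternEntry {
  id: string;
  regex: RegExp; // must not use the `g` flag, since test() would then be stateful
  test_positive: string[]; // lines the pattern must match
  test_negative: string[]; // lines the pattern must not match
}

// Throw on the first fixture that disagrees with its pattern.
function validateRegistry(patterns: PatternEntry[]): void {
  for (const p of patterns) {
    for (const sample of p.test_positive) {
      if (!p.regex.test(sample)) {
        throw new Error(`pattern ${p.id}: expected a match for ${JSON.stringify(sample)}`);
      }
    }
    for (const sample of p.test_negative) {
      if (p.regex.test(sample)) {
        throw new Error(`pattern ${p.id}: unexpected match for ${JSON.stringify(sample)}`);
      }
    }
  }
}

const BUILT_IN_PATTERNS: PatternEntry[] = [
  {
    id: "telegram-bracket",
    regex: /^\*\*\[(\d{1,2}):(\d{2})\]\s+(.+?):\*\*\s*(.*)$/,
    test_positive: ["**[18:37] 👤 G T:** hello", "**[22:15] Plain Name:** text"],
    test_negative: ["**Alice** (2024-03-15 9:00 AM): hi", "plain continuation line"],
  },
];

// Runs when the module is imported, so a broken pattern fails at startup.
validateRegistry(BUILT_IN_PATTERNS);
```
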

PR #1461 contributor's BRACKET_TIME_RX + cleanSpeaker survive verbatim
as the `telegram-bracket` built-in pattern + DEFAULT_SPEAKER_CLEAN
export. All 33 of their test cases pass against the new orchestrator.

Three layers per page (orchestrator chooses):
  1. Built-in pattern registry (zero-cost, deterministic)
  2. User-declared simple_pattern via config (deferred to v0.42+)
  3. Opt-IN LLM polish + fallback (privacy-first; chat content goes
     to Anthropic only when user explicitly enables)

D18 priority scoring picks the highest-match-rate pattern across the
first 10 lines (not first-wins) so overlapping formats don't silently
mis-route. D5 multi_line per-pattern + D11 quick_reject prefix screen
+ D19 timezone_policy per-pattern complete the registry shape.

Companion: src/core/progressive-batch/ primitive (rule of three
satisfied across 12+ ad-hoc cost-prompt sites). Wintermute-inspired
ramp shape (trial 10 → 100 → 500 → full with verification at each
stage), productionized with verifier+policy injection (callers
describe HOW TO MEASURE SUCCESS, not WHEN TO WAIT FOR CTRL-C). D3
fail-closed budget gate: null tracker + null Policy.maxCostUsd →
abort_cost_cap reason='no_budget_safety_net'. D20 discriminated
Verifier union (output_count | idempotent_mutation | noop).
extract-conversation-facts is the one proven consumer in v0.41.15.0;
9-site retrofit deferred to v0.41.16.0+ per TODOS.md.

Codex outside-voice review absorbed 8 substantive findings:
  - Privacy posture (LLM polish/fallback flipped to opt-IN)
  - ReDoS theater (dropped arbitrary user regex; v0.42+ uses RE2)
  - LLM-inferred-regex persistence as silent-corruption machine
  - Pattern priority scoring across first 10 lines
  - Timezone policy on every PatternEntry
  - Verifier shape discriminated union
  - Behavior parity for sites that "jumped straight to full"
  - Real-corpus-redacted fixture gap (v0.42+ TODO)

CI gates:
  - bun run check:conversation-parser (13 fixtures, --no-llm, deterministic)
  - bun run check:fixture-privacy (banned-token grep)

Doctor surfaces 3 new checks: conversation_format_coverage,
progressive_batch_audit_health, conversation_parser_probe_health.

Tests: 198/198 across primitive + parser + LLM + nightly probe + eval
CLI + debug CLI + doctor checks + migration v97 round-trip + E2E
parser ↔ engine integration. Real bug caught + fixed during gap audit:
IdempotentMutationVerifier was comparing absolute mutated-count vs
per-stage expected (failed silently on stage 2+); now uses per-stage
delta semantics matching OutputCountVerifier.

Schema migration v97: conversation_parser_llm_cache table with
(content_sha256, model_id, call_shape) composite key. NO
inferred_patterns table (D17: silent-corruption machine).

Plan + 23 decisions + codex outside-voice absorption at
~/.claude/plans/system-instruction-you-are-working-cuddly-hollerith.md.

Co-Authored-By: garrytan-agents (PR #1461) <noreply@github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(check-privacy): allowlist scripts/check-fixture-privacy.sh

The new sibling privacy guard literally names the banned tokens in its
BANNED_TOKENS array — same meta-exception that check-privacy.sh itself
gets. Without this allowlist entry, bun run verify rejects the file
post-merge because the banned name appears in the rule-definition script.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: renumber v0.41.15.0 → v0.41.16.0 (queue drift)

Mechanical rename across all surfaces: VERSION, package.json,
CHANGELOG (header + body refs), CLAUDE.md, TODOS.md, src/core/
migrate.ts (migration v98 comment), all src/core/conversation-parser/*
and src/core/progressive-batch/* file headers, all test/ headers,
scripts/check-privacy.sh allowlist comment, llms-full.txt regenerated.

Audit clean: VERSION + package.json + CHANGELOG header all show
0.41.16.0. verify 24/24, touched tests 179/179.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: garrytan-agents (PR #1461) <noreply@github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mgunnin added a commit to mgunnin/gbrain that referenced this pull request May 28, 2026
* upstream/master:
  v0.41.26.1 fix: lock-renewal cathedral — closes ~39 worker crashes/day (supersedes garrytan#1567) (garrytan#1572)
  v0.41.26.0 fix: dream --source + ingest junk titles + emoji-crash (supersedes garrytan#1559, garrytan#1561) (garrytan#1571)
  v0.41.25.0 perf(sync): batched deletes + global page-generation clock (supersedes garrytan#1538) (garrytan#1566)
  v0.41.24.0 fix(conversation-parser): threshold gates + bold-paren-time pattern — 20,167 Circleback messages unblocked (closes garrytan#1533) (garrytan#1543)
  v0.41.23.0 feat: extract operator surfaces + pack-driven extractables (garrytan#1541)
  v0.41.22.1 feat: brainstorm/lsd judge fixes (closes garrytan#1540 end-to-end) (garrytan#1562)
  v0.41.22.0 feat: type-unification cathedral — 94 types → 15 canonical (closes garrytan#1479) (garrytan#1542)
  v0.41.21.0 feat(ops): 5 daily-driver pains fixed in one wave (garrytan#1545)
  v0.41.20.0 feat: gbrain status + doctor --scope=brain (fix wave 2: items garrytan#6 + garrytan#7) (garrytan#1544)
  feat: v0.41.19.0 Supavisor Retry Cathedral (garrytan#1537)
  v0.41.18.0: gbrain onboard — the activation surface gbrain didn't have before (garrytan#1521)
  v0.41.17.0 feat: --workers N on every bulk command + facts dim doctor parity (garrytan#1519)
  v0.41.16.0 feat: conversation parser cathedral + progressive-batch primitive (closes garrytan#1461) (garrytan#1510)
  v0.41.15.0 feat(sync): --timeout + --max-age + partial status (closes garrytan#1472 RFC) (garrytan#1506)
