feat: support bracket-time format in conversation facts parser#1461
feat: support bracket-time format in conversation facts parser#1461garrytan-agents wants to merge 1 commit into
Conversation
Add Format 2 (bracket-time) to parseConversationMessages alongside the existing full-date format. This enables fact extraction from Telegram conversation pages that use **[HH:MM] 👤 Speaker:** syntax instead of **Speaker** (YYYY-MM-DD H:MM AM/PM): format. Changes: - New BRACKET_TIME_RX regex for **[HH:MM] emoji Speaker:** lines - cleanSpeaker() strips leading emoji from speaker names - ParseConversationOpts.fallbackDate supplies the date for time-only lines (derived from page frontmatter or effective_date) - processPage() passes the page date as fallbackDate automatically - 6 new tests covering emoji prefix, multi-line, mixed formats, and fallback behavior (33/33 pass)
|
Thanks for catching this — the Telegram bracket-time format silently dropping 134 production conversation pages is a real bug, and the regex + frontmatter-date derivation in this PR is the right shape. This PR is being superseded by a production-grade replacement (in progress, will land as v0.41.13.0). The replacement generalizes the fix: instead of adding one regex per format (this PR fixed Telegram; Discord, WhatsApp, Signal, IRC, plain email threads, and any future chat export would each need their own one-off PR), it ships:
Your Telegram bracket-time regex ( The same wave also extracts a Closing in favor of the upcoming v0.41.13.0 PR. |
…imitive (closes #1461) (#1510) * v0.41.15.0 feat: conversation parser cathedral + progressive-batch primitive (closes #1461) Replaces PR #1461's single-format Telegram regex with a 12-pattern built-in registry covering iMessage/Slack, Telegram (×2), Discord (×2), WhatsApp (×2 locales), Signal, Matrix/Element, IRC (×2), Teams. Each pattern is hand-vetted from public format docs (signal-cli, DiscordChatExporter, Telegram Desktop, WhatsApp export docs, Element matrix-archive, irssi/weechat defaults); module-load validation runs test_positive[] + test_negative[] for every pattern at startup so a typo makes gbrain refuse to start. PR #1461 contributor's BRACKET_TIME_RX + cleanSpeaker survive verbatim as the `telegram-bracket` built-in pattern + DEFAULT_SPEAKER_CLEAN export. All 33 of their test cases pass against the new orchestrator. Three layers per page (orchestrator chooses): 1. Built-in pattern registry (zero-cost, deterministic) 2. User-declared simple_pattern via config (deferred to v0.42+) 3. Opt-IN LLM polish + fallback (privacy-first; chat content goes to Anthropic only when user explicitly enables) D18 priority scoring picks the highest-match-rate pattern across the first 10 lines (not first-wins) so overlapping formats don't silently mis-route. D5 multi_line per-pattern + D11 quick_reject prefix screen + D19 timezone_policy per-pattern complete the registry shape. Companion: src/core/progressive-batch/ primitive (rule of three satisfied across 12+ ad-hoc cost-prompt sites). Wintermute-inspired ramp shape (trial 10 → 100 → 500 → full with verification at each stage), productionized with verifier+policy injection (callers describe HOW TO MEASURE SUCCESS, not WHEN TO WAIT FOR CTRL-C). D3 fail-closed budget gate: null tracker + null Policy.maxCostUsd → abort_cost_cap reason='no_budget_safety_net'. D20 discriminated Verifier union (output_count | idempotent_mutation | noop). extract-conversation-facts is the one proven consumer in v0.41.15.0; 9-site retrofit deferred to v0.41.16.0+ per TODOS.md. Codex outside-voice review absorbed 8 substantive findings: - Privacy posture (LLM polish/fallback flipped to opt-IN) - ReDoS theater (dropped arbitrary user regex; v0.42+ uses RE2) - LLM-inferred-regex persistence as silent-corruption machine - Pattern priority scoring across first 10 lines - Timezone policy on every PatternEntry - Verifier shape discriminated union - Behavior parity for sites that "jumped straight to full" - Real-corpus-redacted fixture gap (v0.42+ TODO) CI gates: - bun run check:conversation-parser (13 fixtures, --no-llm, deterministic) - bun run check:fixture-privacy (banned-token grep) Doctor surfaces 3 new checks: conversation_format_coverage, progressive_batch_audit_health, conversation_parser_probe_health. Tests: 198/198 across primitive + parser + LLM + nightly probe + eval CLI + debug CLI + doctor checks + migration v97 round-trip + E2E parser ↔ engine integration. Real bug caught + fixed during gap audit: IdempotentMutationVerifier was comparing absolute mutated-count vs per-stage expected (failed silently on stage 2+); now uses per-stage delta semantics matching OutputCountVerifier. Schema migration v97: conversation_parser_llm_cache table with (content_sha256, model_id, call_shape) composite key. NO inferred_patterns table (D17: silent-corruption machine). Plan + 23 decisions + codex outside-voice absorption at ~/.claude/plans/system-instruction-you-are-working-cuddly-hollerith.md. Co-Authored-By: garrytan-agents (PR #1461) <noreply@github.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(check-privacy): allowlist scripts/check-fixture-privacy.sh The new sibling privacy guard literally names the banned tokens in its BANNED_TOKENS array — same meta-exception that check-privacy.sh itself gets. Without this allowlist entry, bun run verify rejects the file post-merge because the banned name appears in the rule-definition script. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: renumber v0.41.15.0 → v0.41.16.0 (queue drift) Mechanical rename across all surfaces: VERSION, package.json, CHANGELOG (header + body refs), CLAUDE.md, TODOS.md, src/core/ migrate.ts (migration v98 comment), all src/core/conversation-parser/* and src/core/progressive-batch/* file headers, all test/ headers, scripts/check-privacy.sh allowlist comment, llms-full.txt regenerated. Audit clean: VERSION + package.json + CHANGELOG header all show 0.41.16.0. verify 24/24, touched tests 179/179. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: garrytan-agents (PR #1461) <noreply@github.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* upstream/master: v0.41.26.1 fix: lock-renewal cathedral — closes ~39 worker crashes/day (supersedes garrytan#1567) (garrytan#1572) v0.41.26.0 fix: dream --source + ingest junk titles + emoji-crash (supersedes garrytan#1559, garrytan#1561) (garrytan#1571) v0.41.25.0 perf(sync): batched deletes + global page-generation clock (supersedes garrytan#1538) (garrytan#1566) v0.41.24.0 fix(conversation-parser): threshold gates + bold-paren-time pattern — 20,167 Circleback messages unblocked (closes garrytan#1533) (garrytan#1543) v0.41.23.0 feat: extract operator surfaces + pack-driven extractables (garrytan#1541) v0.41.22.1 feat: brainstorm/lsd judge fixes (closes garrytan#1540 end-to-end) (garrytan#1562) v0.41.22.0 feat: type-unification cathedral — 94 types → 15 canonical (closes garrytan#1479) (garrytan#1542) v0.41.21.0 feat(ops): 5 daily-driver pains fixed in one wave (garrytan#1545) v0.41.20.0 feat: gbrain status + doctor --scope=brain (fix wave 2: items garrytan#6 + garrytan#7) (garrytan#1544) feat: v0.41.19.0 Supavisor Retry Cathedral (garrytan#1537) v0.41.18.0: gbrain onboard — the activation surface gbrain didn't have before (garrytan#1521) v0.41.17.0 feat: --workers N on every bulk command + facts dim doctor parity (garrytan#1519) v0.41.16.0 feat: conversation parser cathedral + progressive-batch primitive (closes garrytan#1461) (garrytan#1510) v0.41.15.0 feat(sync): --timeout + --max-age + partial status (closes garrytan#1472 RFC) (garrytan#1506)
Problem
extract-conversation-factsonly recognizes one message format:Conversation pages synced from Telegram use a different bracket-time format:
A production brain with 134 conversation pages typed correctly as
conversationshows 0 eligible segments because every line fails the regex match. Doctor reports the pages as backlog, but the extractor silently skips them all.Solution
Add a second regex (
BRACKET_TIME_RX) that matches the bracket-time format:**Name** (YYYY-MM-DD H:MM AM/PM): text**[HH:MM] 👤 Name:** textdatefieldChanges
src/commands/extract-conversation-facts.tsBRACKET_TIME_RXregex:/^\*\*\[(\d{1,2}):(\d{2})\]\s+(.+?):\*\*\s*(.*)$/cleanSpeaker()helper strips leading emoji + whitespace from speaker names using Unicode property escapesParseConversationOpts.fallbackDateparameter supplies the date for time-only linesprocessPage()derivesfallbackDatefrompage.frontmatter.date(preferred) orpage.effective_date(fallback)test/extract-conversation-facts.test.tsfallbackDateprovidedTesting
33/33 tests pass (
bun test test/extract-conversation-facts.test.ts). All 27 existing tests unchanged; 6 new tests added.Behavior Matrix
**Alice** (2024-03-15 9:00 AM): hi**[18:37] 👤 G T:** helloG T**[06:00] 🤖 Bot:** responseBot**[22:15] Plain Name:** textPlain Name