v0.41.16.0 feat: conversation parser cathedral + progressive-batch primitive (closes #1461) by garrytan · Pull Request #1510 · garrytan/gbrain

garrytan · 2026-05-26T18:52:54Z

Summary

Replaces PR #1461's single-format Telegram regex with a 12-pattern built-in registry covering every common chat-export format, plus an opt-in LLM polish + fallback layer for the long tail. Same wave extracts a new progressive-batch primitive (Wintermute-inspired ramp-up: trial 10 → 100 → 500 → full with verification at each stage) after rule-of-three was satisfied across 12+ ad-hoc cost-prompt sites.
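The ramp shape (trial 10 → 100 → 500 → full, with verification at each stage) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code in src/core/progressive-batch/: `rampStages`, `runRamp`, and the `verify` callback signature are hypothetical names invented for this example.

```typescript
// Hypothetical sketch of a progressive-batch ramp: 10 -> 100 -> 500 -> full.
// None of these names are taken from the gbrain source.

type StageResult = { stage: number; processed: number; ok: boolean };

/** Cumulative stage targets for a given total, capped at the total. */
function rampStages(total: number, steps: number[] = [10, 100, 500]): number[] {
  const targets: number[] = [];
  for (const step of steps) {
    if (step < total) targets.push(step);
  }
  if (total > 0) targets.push(total); // the final stage always covers everything
  return targets;
}

/**
 * Process items stage by stage. After each stage the caller-supplied
 * verifier decides whether to continue; a failed verification stops the ramp.
 */
function runRamp<T>(
  items: T[],
  processOne: (item: T) => void,
  verify: (processedInStage: number, expectedInStage: number) => boolean,
): StageResult[] {
  const results: StageResult[] = [];
  let done = 0;
  for (const target of rampStages(items.length)) {
    const expected = target - done;
    let processed = 0;
    for (const item of items.slice(done, target)) {
      processOne(item);
      processed++;
    }
    const ok = verify(processed, expected);
    results.push({ stage: results.length + 1, processed, ok });
    if (!ok) break; // stop before touching the larger stages
    done = target;
  }
  return results;
}
```

With 1,200 items this yields stages of 10, 90, 400, and 700 items; a failure in the first stage means only 10 items were ever touched.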

The user impact: one real production brain has 134 Telegram-shaped pages of type `conversation` that the existing parser silently drops. After this upgrade, the dream cycle's conversation_facts_backfill phase extracts facts from them automatically. The same applies to any Discord, WhatsApp, Signal, IRC, Matrix, Teams, or email-thread page.

PR #1461 contributor's BRACKET_TIME_RX + cleanSpeaker survive verbatim as the telegram-bracket built-in pattern + DEFAULT_SPEAKER_CLEAN export. All 33 of their test cases pass against the new orchestrator. Co-Authored-By preserved.

The 12 built-in patterns

| id | sample |
| --- | --- |
| `imessage-slack` | `Alice (2024-03-15 9:00 AM): hi` |
| `telegram-bracket` | `[18:37] 👤 Alice: hi` (PR #1461 verbatim) |
| `telegram-text-export` | `Alice, [Mar 15, 2024 at 6:37:00 PM]` |
| `whatsapp-iso` | `[15/03/24, 18:37:00] Alice: hi` |
| `whatsapp-us` | `3/15/24, 6:37 PM - Alice: hi` |
| `discord-export` | `[03/15/2024 6:37 PM] Alice` + multi-line body |
| `discord-classic` | `Alice — Today at 18:37` + multi-line body |
| `signal-export` | `Alice (2024-03-15 18:37:00 UTC): hi` |
| `matrix-element` | `[18:37] @alice:matrix.org: hi` |
| `irc-classic` | `<alice> hi` |
| `irc-weechat` | `18:37 <alice> hi` |
| `teams-export` | `Alice, 3/15/2024 6:37 PM: hi` |
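To make the registry shape concrete, here is a minimal sketch of two entries and a startup self-check in the spirit of the `test_positive[]` / `test_negative[]` validation mentioned in the commit message. The regexes were written for this example from the sample lines in the table and are not the shipped patterns; `PatternEntrySketch`, `extract`, `validateRegistry`, and `parseWith` are hypothetical names.

```typescript
// Illustrative only: these regexes are NOT the ones shipped in
// src/core/conversation-parser/. Field names other than
// test_positive / test_negative are assumptions.

interface PatternEntrySketch {
  id: string;
  rx: RegExp;
  // Maps a successful match to speaker/text; group layout differs per pattern.
  extract: (m: RegExpExecArray) => { speaker: string; text: string };
  test_positive: string[];
  test_negative: string[];
}

const REGISTRY_SKETCH: PatternEntrySketch[] = [
  {
    id: "irc-weechat",
    rx: /^\d{1,2}:\d{2}\s+<([^>]+)>\s+(.*)$/,
    extract: (m) => ({ speaker: m[1] ?? "", text: m[2] ?? "" }),
    test_positive: ["18:37 <alice> hi"],
    test_negative: ["<alice> hi"],
  },
  {
    id: "whatsapp-us",
    rx: /^\d{1,2}\/\d{1,2}\/\d{2,4}, \d{1,2}:\d{2} [AP]M - ([^:]+): (.*)$/,
    extract: (m) => ({ speaker: m[1] ?? "", text: m[2] ?? "" }),
    test_positive: ["3/15/24, 6:37 PM - Alice: hi"],
    test_negative: ["[15/03/24, 18:37:00] Alice: hi"],
  },
];

/** Throw if any pattern misses a positive sample or matches a negative one. */
function validateRegistry(entries: PatternEntrySketch[]): void {
  for (const e of entries) {
    for (const s of e.test_positive) {
      if (!e.rx.test(s)) throw new Error(`${e.id}: expected match for "${s}"`);
    }
    for (const s of e.test_negative) {
      if (e.rx.test(s)) throw new Error(`${e.id}: unexpected match for "${s}"`);
    }
  }
}

/** Apply one entry to one line; returns null when the line does not match. */
function parseWith(entry: PatternEntrySketch, line: string) {
  const m = entry.rx.exec(line);
  return m ? { patternId: entry.id, ...entry.extract(m) } : null;
}

// Run the self-check at module load so a typo in a regex fails fast.
validateRegistry(REGISTRY_SKETCH);
```

Running the check at load time trades a slightly slower startup for the guarantee that a broken pattern never reaches real pages.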

Test Coverage

| Layer | Test file | Cases | Status |
| --- | --- | --- | --- |
| Primitive | `test/progressive-batch/orchestrator.test.ts` | 43 | ✅ |
| Primitive: retrofit-wrap | `test/progressive-batch/retrofit-wrap.test.ts` | 7 | ✅ |
| Parser orchestrator | `test/conversation-parser/parse.test.ts` | 39 (incl. 6 PR #1461 regression cases verbatim) | ✅ |
| LLM base | `test/conversation-parser/llm-base.test.ts` | 16 | ✅ |
| LLM fallback | `test/conversation-parser/llm-fallback.test.ts` | 6 | ✅ |
| LLM polish | `test/conversation-parser/llm-polish.test.ts` | 11 | ✅ |
| Nightly probe | `test/conversation-parser/nightly-probe.test.ts` | 7 | ✅ |
| Eval CLI | `test/eval-conversation-parser-cli.test.ts` | 15 (exit codes, --json, --no-llm) | ✅ |
| Debug CLI | `test/conversation-parser-cli.test.ts` | 10 | ✅ |
| Migration v98 | `test/migrations-v98.test.ts` | 7 (JSONB round-trip, CHECK, composite PK) | ✅ |
| Doctor checks | `test/doctor-v0_41_13_checks.test.ts` | 4 (3 new checks present + shape) | ✅ |
| E2E parser ↔ engine | `test/e2e/conversation-parser-pglite.test.ts` | 15 (12 built-ins through PGLite) | ✅ |
| Back-compat | `test/extract-conversation-facts.test.ts` | 27 (PR #1461 baseline) | ✅ |

Total: 350/350 across touched suites. bun run verify: 24/24 green.

Real bug caught + fixed during gap audit: IdempotentMutationVerifier was comparing absolute mutated-count vs per-stage expected (failed silently on stage 2+); now uses per-stage delta semantics matching OutputCountVerifier.
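The difference between absolute and per-stage delta comparison can be illustrated with a small sketch. The `Verifier` union below mirrors the three kinds named in this PR (`output_count | idempotent_mutation | noop`), but the field names, `makeStageChecker`, and `StageCheck` are assumptions made for this example; the real types in src/core/progressive-batch/ may look different.

```typescript
// Hypothetical shapes, not the gbrain source.
type Verifier =
  | { kind: "output_count"; countOutputs: () => number }
  | { kind: "idempotent_mutation"; countMutated: () => number }
  | { kind: "noop" };

interface StageCheck {
  ok: boolean;
  delta: number;
}

/**
 * Build a per-stage checker. It remembers the count seen at the end of the
 * previous stage and compares only the DELTA against the stage's expectation.
 */
function makeStageChecker(verifier: Verifier): (expectedInStage: number) => StageCheck {
  let previous = 0;
  return (expectedInStage: number): StageCheck => {
    if (verifier.kind === "noop") return { ok: true, delta: 0 };
    const current =
      verifier.kind === "output_count" ? verifier.countOutputs() : verifier.countMutated();
    const delta = current - previous;
    previous = current;
    return { ok: delta === expectedInStage, delta };
  };
}

// Usage: a cumulative counter of 10 after stage 1 and 100 after stage 2.
let mutatedSoFar = 0;
const checkStage = makeStageChecker({
  kind: "idempotent_mutation",
  countMutated: () => mutatedSoFar,
});
mutatedSoFar = 10;
const stage1 = checkStage(10); // delta 10 vs expected 10
mutatedSoFar = 100;
const stage2 = checkStage(90); // delta 90 vs expected 90
```

In stage 2 the cumulative count is 100 while the stage only expects 90, so comparing the absolute count against the per-stage expectation can no longer agree; comparing the delta does.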

Pre-Landing Review

Codex outside-voice review during planning absorbed 8 substantive technical findings (all adopted):

- Privacy posture: LLM polish/fallback flipped to opt-IN (chat content goes to Anthropic only when user explicitly enables)
- ReDoS theater: dropped arbitrary user regex; user patterns wait for v0.42+ with worker-isolated regex (safe-regex / RE2)
- LLM-inferred-regex persistence as silent-corruption machine: dropped entirely; LLM fallback parses for THIS page only, cached by content_hash
- Pattern priority scoring across first 10 lines (not first-wins): defeats overlap mis-routing
- Timezone policy per PatternEntry: `inline_utc | frontmatter_tz | utc_assumed_with_warn`; time-only formats warn when no frontmatter timezone
- Verifier shape discriminated union (`output_count | idempotent_mutation | noop`): fits reindex / embed / eval-contradictions / parser cleanly
- Behavior parity for retrofits: sites that previously jumped to full keep doing so; ramp is opt-in per-site via `interactiveAbortMs > 0`
- Real-corpus-redacted fixture gap: v0.42+ TODO
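The priority-scoring finding (score across the first 10 lines instead of first-wins) can be sketched as follows. This is an assumption-laden illustration: `pickPattern`, `Candidate`, the two regexes, and the tie-breaking rule (earlier candidate wins ties) are invented for the example and are not taken from the orchestrator.

```typescript
// Hypothetical sketch of match-rate scoring over a sample of leading lines.
interface Candidate {
  id: string;
  rx: RegExp; // must not use the /g flag, so test() stays stateless
}

/**
 * Score each candidate by the fraction of the first `sampleSize` non-empty
 * lines it matches and return the best one, or null when nothing matches.
 */
function pickPattern(
  lines: string[],
  candidates: Candidate[],
  sampleSize = 10,
): { id: string; rate: number } | null {
  const sample = lines.filter((l) => l.trim().length > 0).slice(0, sampleSize);
  if (sample.length === 0) return null;
  let best: { id: string; rate: number } | null = null;
  for (const c of candidates) {
    const hits = sample.filter((l) => c.rx.test(l)).length;
    const rate = hits / sample.length;
    // Strictly greater: on a tie the earlier candidate is kept (an assumption).
    if (rate > 0 && (best === null || rate > best.rate)) best = { id: c.id, rate };
  }
  return best;
}

// Example regexes written for this sketch, not the shipped patterns.
const candidatesSketch: Candidate[] = [
  { id: "bracket-time", rx: /^\[\d{1,2}:\d{2}\]\s+[^:]+:/ },
  { id: "whatsapp-iso", rx: /^\[\d{2}\/\d{2}\/\d{2}, \d{2}:\d{2}:\d{2}\] [^:]+:/ },
];

// A WhatsApp-style page whose first line happens to look like bracket-time.
const pageLines = [
  "[18:37] Bob: pasted quote",
  "[15/03/24, 18:37:00] Alice: hi",
  "[15/03/24, 18:38:10] Bob: hello",
  "[15/03/24, 18:39:02] Alice: how are you",
];
const chosen = pickPattern(pageLines, candidatesSketch);
```

A first-wins strategy would route this page to `bracket-time` because of line 1; scoring routes it to `whatsapp-iso`, which matches three of the four sampled lines.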

Plan Completion

23 decisions captured across plan-eng-review + 14 decisions in plan-mode. All resolved. Plan file at ~/.claude/plans/system-instruction-you-are-working-cuddly-hollerith.md. The 9-site progressive-batch retrofit is deferred to v0.41.16.0+ per D2 (one proven consumer ships now; bisectable retrofits land per-PR).

Documentation

CLAUDE.md extended with two new file-cluster annotations (src/core/conversation-parser/ and src/core/progressive-batch/). CHANGELOG.md carries the ELI10-lead-first v0.41.16.0 entry (v0.41.15.0 before the version rebump). TODOS.md filed 7 v0.42+ follow-up entries.

CI gates added

- `bun run check:conversation-parser`: 13-fixture deterministic eval gate with --no-llm (no API keys needed). Wired into `bun run verify`.
- `bun run check:fixture-privacy`: banned-token grep over test/fixtures/conversation-formats/. Wired into verify.
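The idea behind a banned-token gate can be sketched in a few lines. The actual check is described as a grep over the fixture directory; the function below is only an in-memory illustration, and `findBannedTokens`, the `Hit` shape, the case-insensitive matching, and the placeholder tokens are all assumptions.

```typescript
// Hypothetical in-memory version of a banned-token scan.
interface Hit {
  file: string;
  line: number; // 1-based line number
  token: string;
}

/** Return every (file, line, token) where a banned token appears. */
function findBannedTokens(files: Record<string, string>, bannedTokens: string[]): Hit[] {
  const hits: Hit[] = [];
  for (const file of Object.keys(files)) {
    const lines = (files[file] ?? "").split("\n");
    lines.forEach((text, idx) => {
      const lower = text.toLowerCase(); // case-insensitive match is an assumption
      for (const token of bannedTokens) {
        if (lower.includes(token.toLowerCase())) {
          hits.push({ file, line: idx + 1, token });
        }
      }
    });
  }
  return hits;
}

// Usage with placeholder tokens; a CI gate would fail when hits.length > 0.
const fixtureHits = findBannedTokens(
  {
    "whatsapp-us.txt": "3/15/24, 6:37 PM - Alice: hi\n3/15/24, 6:38 PM - REAL_NAME: hello",
    "irc-classic.txt": "<alice> hi",
  },
  ["real_name", "secret_project"],
);
```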

Closed in this wave

PR #1461 (feat: support bracket-time format in conversation facts parser): superseded with full context note. Contributor's regex + helper preserved verbatim; Co-Authored-By in the merge commit.

Test plan

- `bun run verify`: 24/24 green
- 350 unit tests across all touched suites pass
- E2E parser ↔ engine integration test (15 cases), hermetic via PGLite: all 12 built-ins import + parse correctly
- PR #1461's (feat: support bracket-time format in conversation facts parser) 33 test cases pass verbatim against the new orchestrator
- Migration v98 round-trip (7 cases): CHECK + composite PK + JSONB shape
- Codex outside-voice review: 14 findings, 13 absorbed
- Real bug caught during gap audit: IdempotentMutationVerifier per-stage delta semantics
- Master merged (v0.41.13.0 + v0.41.14.0); conflicts resolved per CLAUDE.md merge-recovery procedure; VERSION/package.json/CHANGELOG audit clean

🤖 Generated with Claude Code

…imitive (closes #1461)

Replaces PR #1461's single-format Telegram regex with a 12-pattern built-in registry covering iMessage/Slack, Telegram (×2), Discord (×2), WhatsApp (×2 locales), Signal, Matrix/Element, IRC (×2), Teams. Each pattern is hand-vetted from public format docs (signal-cli, DiscordChatExporter, Telegram Desktop, WhatsApp export docs, Element matrix-archive, irssi/weechat defaults); module-load validation runs test_positive[] + test_negative[] for every pattern at startup so a typo makes gbrain refuse to start.

PR #1461 contributor's BRACKET_TIME_RX + cleanSpeaker survive verbatim as the `telegram-bracket` built-in pattern + DEFAULT_SPEAKER_CLEAN export. All 33 of their test cases pass against the new orchestrator.

Three layers per page (orchestrator chooses):

1. Built-in pattern registry (zero-cost, deterministic)
2. User-declared simple_pattern via config (deferred to v0.42+)
3. Opt-IN LLM polish + fallback (privacy-first; chat content goes to Anthropic only when user explicitly enables)

D18 priority scoring picks the highest-match-rate pattern across the first 10 lines (not first-wins) so overlapping formats don't silently mis-route. D5 multi_line per-pattern + D11 quick_reject prefix screen + D19 timezone_policy per-pattern complete the registry shape.

Companion: src/core/progressive-batch/ primitive (rule of three satisfied across 12+ ad-hoc cost-prompt sites). Wintermute-inspired ramp shape (trial 10 → 100 → 500 → full with verification at each stage), productionized with verifier+policy injection (callers describe HOW TO MEASURE SUCCESS, not WHEN TO WAIT FOR CTRL-C). D3 fail-closed budget gate: null tracker + null Policy.maxCostUsd → abort_cost_cap reason='no_budget_safety_net'. D20 discriminated Verifier union (output_count | idempotent_mutation | noop). extract-conversation-facts is the one proven consumer in v0.41.15.0; 9-site retrofit deferred to v0.41.16.0+ per TODOS.md.

Codex outside-voice review absorbed 8 substantive findings:

- Privacy posture (LLM polish/fallback flipped to opt-IN)
- ReDoS theater (dropped arbitrary user regex; v0.42+ uses RE2)
- LLM-inferred-regex persistence as silent-corruption machine
- Pattern priority scoring across first 10 lines
- Timezone policy on every PatternEntry
- Verifier shape discriminated union
- Behavior parity for sites that "jumped straight to full"
- Real-corpus-redacted fixture gap (v0.42+ TODO)

CI gates:

- bun run check:conversation-parser (13 fixtures, --no-llm, deterministic)
- bun run check:fixture-privacy (banned-token grep)

Doctor surfaces 3 new checks: conversation_format_coverage, progressive_batch_audit_health, conversation_parser_probe_health.

Tests: 198/198 across primitive + parser + LLM + nightly probe + eval CLI + debug CLI + doctor checks + migration v97 round-trip + E2E parser ↔ engine integration.

Real bug caught + fixed during gap audit: IdempotentMutationVerifier was comparing absolute mutated-count vs per-stage expected (failed silently on stage 2+); now uses per-stage delta semantics matching OutputCountVerifier.

Schema migration v97: conversation_parser_llm_cache table with (content_sha256, model_id, call_shape) composite key. NO inferred_patterns table (D17: silent-corruption machine).

Plan + 23 decisions + codex outside-voice absorption at ~/.claude/plans/system-instruction-you-are-working-cuddly-hollerith.md.

Co-Authored-By: garrytan-agents (PR #1461) <noreply@github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
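The D3 fail-closed budget gate described in this commit message can be illustrated with a short sketch. Only the `abort_cost_cap` action, the `no_budget_safety_net` reason, and `Policy.maxCostUsd` come from the text above; `BudgetTracker`, `spentUsd`, `budgetGate`, and the `max_cost_exceeded` reason are hypothetical.

```typescript
// Hypothetical shapes for illustration; the real tracker/policy types may differ.
interface BudgetTracker {
  spentUsd(): number;
}
interface Policy {
  maxCostUsd: number | null;
}

type GateDecision =
  | { action: "proceed" }
  | { action: "abort_cost_cap"; reason: string };

/**
 * Fail closed: when there is neither a tracker nor a cost cap, nothing could
 * stop a runaway batch, so the gate refuses to start at all.
 */
function budgetGate(tracker: BudgetTracker | null, policy: Policy): GateDecision {
  if (tracker === null && policy.maxCostUsd === null) {
    return { action: "abort_cost_cap", reason: "no_budget_safety_net" };
  }
  if (
    tracker !== null &&
    policy.maxCostUsd !== null &&
    tracker.spentUsd() >= policy.maxCostUsd
  ) {
    // "max_cost_exceeded" is an invented reason string for this sketch.
    return { action: "abort_cost_cap", reason: "max_cost_exceeded" };
  }
  return { action: "proceed" };
}

// Usage: no tracker and no cap aborts; a cap with headroom proceeds.
const noNet = budgetGate(null, { maxCostUsd: null });
const withCap = budgetGate({ spentUsd: () => 1.25 }, { maxCostUsd: 5 });
```

The design choice worth noting is the default: a missing safety net is treated as an error instead of as permission to run uncapped.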

Conflict resolutions:

- VERSION → 0.41.15.0 (queue-aware: master claimed 13+14)
- package.json → 0.41.15.0
- CHANGELOG.md → my v0.41.15.0 entry on top, master's v0.41.14.0 + v0.41.13.0 entries preserved below
- TODOS.md → both sections preserved (mine = v0.41.15.0 conv-parser, master = v0.41.14.0 #1451 drift)
- scripts/run-verify-parallel.sh → keep all 3 checks (fixture-privacy + conversation-parser + resolver)
- src/cli.ts CLI_ONLY → merged set (conversation-parser + reindex)
- src/core/migrate.ts → master's v97 (pages_dedup_partial_index) kept; mine renumbered to v98 (conversation_parser_llm_cache_table)
- test/migrations-v97.test.ts → renamed to test/migrations-v98.test.ts to match

Regenerated llms-full.txt after merge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The new sibling privacy guard literally names the banned tokens in its BANNED_TOKENS array — same meta-exception that check-privacy.sh itself gets. Without this allowlist entry, bun run verify rejects the file post-merge because the banned name appears in the rule-definition script. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Mechanical rename across all surfaces: VERSION, package.json, CHANGELOG (header + body refs), CLAUDE.md, TODOS.md, src/core/migrate.ts (migration v98 comment), all src/core/conversation-parser/* and src/core/progressive-batch/* file headers, all test/ headers, scripts/check-privacy.sh allowlist comment, llms-full.txt regenerated.

Audit clean: VERSION + package.json + CHANGELOG header all show 0.41.16.0. verify 24/24, touched tests 179/179.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resolutions:

- VERSION → 0.41.16.0 (mine, higher)
- package.json → 0.41.16.0
- CHANGELOG → my v0.41.16.0 on top + master's v0.41.15.0 below
- TODOS → preserve both sections (mine = v0.41.16.0 conv-parser, master = v0.41.15.0 sync-reliability)
- src/core/migrate.ts → master's v98 (gbrain_cycle_locks_last_refreshed_at) kept; mine renumbered to v99 (conversation_parser_llm_cache_table)
- test/migrations-v98.test.ts → renamed to test/migrations-v99.test.ts to match

verify 24/24, 331/331 tests across touched suites + migrate.test.ts green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PR #1510 (garrytan/dynamic-regex-conversation-formats) claimed v0.41.16.0 on master in parallel. Advancing this wave to v0.41.17.0 so both can land cleanly.

Pure mechanical version bump:

- VERSION + package.json → 0.41.17.0
- CHANGELOG.md header + "To take advantage of v0.41.17.0" block
- TODOS.md section header + v0.41.18+ forward references
- CLAUDE.md inline version tags
- Regenerated llms-full.txt / llms.txt

No code changes. The actual workers cathedral feature set is unchanged from the two prior commits in this branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… parity (#1519)

* feat(worker-pool): shared sliding pool + bounded semaphore + PGLite-clamp wrapper

T1 + T2 of the v0.41.16.0 workers cathedral. New src/core/worker-pool.ts is the canonical primitive every --workers N bulk command in this wave (and future bulk commands) builds on. Atomic-claim invariant enforced by scripts/check-worker-pool-atomicity.sh (wired into bun run verify). BudgetExhausted bypass + AbortSignal composition baked into the helper so budget caps are a structural ceiling under concurrency, not a per-caller convention.

The new resolveWorkersWithClamp wrapper composes existing autoConcurrency with PGLite-clamp + per-(command, requested) stderr dedup. Deliberately NOT a modification to shared autoConcurrency (silent today, used by sync + import); embed.ts keeps GBRAIN_EMBED_CONCURRENCY || 20 default per codex #13.

23 + 12 + 9 = 44 hermetic tests pin every contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: structural + dim-check regression suites for v0.41.16.0 wave

- test/embed-helper-migration.test.ts (T3): asserts embed.ts's two sliding-pool sites are migrated to runSlidingPool, pre-migration shapes (let nextIdx = 0, Promise.all(Array.from(...))) are gone, GBRAIN_EMBED_CONCURRENCY || 20 default preserved, failureLabel threads page.slug. Per codex #16/#17 these are invariant assertions, not byte-equality on progress event ORDERING.
- test/embedding-dim-check-facts.test.ts (T6): readFactsEmbeddingDim covers vector(N) + halfvec(N), halfvec-before-vector regex ordering pinned (codex #19), buildFactsAlterRecipe emits DROP INDEX + ALTER USING + CREATE INDEX (codex #18, not bare REINDEX), FactsEmbeddingDimMismatchError tagged class shape, assertFactsEmbeddingDimMatchesConfig PGLite skip + Postgres absent-column skip, doctor check + insert-cast wiring assertions.
- test/extract-conversation-facts-workers.test.ts (T5): helper exports (extractConversationFactsLockId, PER_PAGE_LOCK_TTL_MINUTES), structural wiring (runSlidingPool, resolveWorkersWithClamp, withRefreshingLock, LockUnavailableError, delete-orphans-first before segment loop, preflight before pool, exit 3 when lock_skipped > 0), Minion handler round-trip.
- test/extract-workers.test.ts (T7): --workers wiring on all 3 inner fs-walk loops (extractForSlugs, extractLinksFromDir, extractTimelineFromDir) + CLI parse + opts threading through runExtractCore.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: rebump v0.41.16.0 → v0.41.17.0 (queue collision with PR #1510)

PR #1510 (garrytan/dynamic-regex-conversation-formats) claimed v0.41.16.0 on master in parallel. Advancing this wave to v0.41.17.0 so both can land cleanly.

Pure mechanical version bump:

- VERSION + package.json → 0.41.17.0
- CHANGELOG.md header + "To take advantage of v0.41.17.0" block
- TODOS.md section header + v0.41.18+ forward references
- CLAUDE.md inline version tags
- Regenerated llms-full.txt / llms.txt

No code changes. The actual workers cathedral feature set is unchanged from the two prior commits in this branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(test): search-image-column probes column dim at runtime

CI shard 5 failed on `searchVector column routing (v0.27.1)` with: `error: expected 1280 dimensions, not 1536`

The test had a hardcoded `fakeText1536` helper that seeded chunks at 1536-d vectors. Master's default embedding model switched from OpenAI text-embedding-3-large (1536) to ZeroEntropy zembed-1 (1280) so a fresh PGLite brain on CI now sizes content_chunks.embedding at 1280; the test's 1536-d INSERT trips pgvector's CheckExpectedDim.

Fix: probe `content_chunks.embedding` width via `readContentChunksEmbeddingDim(engine)` in `beforeAll`, store in `TEXT_DIM`, and build `fakeTextDefault(seed)` at that width. The test now passes regardless of which default ships (the model has flipped twice and may flip again). Local dev (1536 from older config) and CI fresh-install (1280 from new default) both pass.

Image-side vectors stay at 1024 (matches Voyage multimodal-3 + the column's fixed width on the image side).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(test): bump PGLite hook timeout for shard-4 deep-process files

facts-anti-loop.test.ts and ingest-capture.test.ts were timing out in CI shard 4 with "beforeEach/afterEach hook timed out" after the v0.41.16.0 master merge brought migration count to 99. When these files run deep in a shard process that has already created ~20 PGLite engines, the WASM cold-start + 95-migration replay legitimately exceeds bun's 5s default hook timeout (observed 5.6s and 7.3s locally when reproducing).

Bun's --timeout=60000 from scripts/test-shard.sh covers TEST timeouts but NOT hook timeouts; those default to 5s and must be set per-hook via the optional 2nd arg to beforeAll/afterAll.

Reproduced locally by running the first 21 shard-4 files via `head -21 /tmp/shard4-list.txt | xargs bun test`:

- Before fix: 179 pass, 2 fail (both with hook-timeout error)
- After fix: 198 pass, 0 fail (the 4 anti-loop + 15 ingest-capture tests recover)

Full shard 4 with fix: 955 pass, 0 fail. Full shard 5 with fix: 1261 pass, 0 fail.

Also added a defensive diagnostic to the two put_page tests: if facts_backstop is missing in the response payload, throw with the full payload + isError so future failures surface the actual handler error instead of a bare "expected {...} got undefined" assertion. No-op when the test passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
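The runtime dimension probe described in the search-image-column fix can be sketched as below. In the real test the width comes from `readContentChunksEmbeddingDim(engine)` against a live engine; here that is replaced by parsing a column type string, and `parseVectorDim` and `fakeVector` are hypothetical helpers written for this illustration.

```typescript
// Hypothetical helpers; no database engine is involved in this sketch.

/** Parse the width out of a pgvector column type such as "vector(1280)". */
function parseVectorDim(columnType: string): number | null {
  const m = /^(?:vector|halfvec)\((\d+)\)$/.exec(columnType.trim());
  return m ? Number(m[1]) : null;
}

/** Deterministic fake embedding of the requested width, varied by seed. */
function fakeVector(dim: number, seed: number): number[] {
  const v: number[] = [];
  for (let i = 0; i < dim; i++) v.push(Math.sin(seed + i));
  return v;
}

// Usage: size the fixture from whatever the column reports instead of
// hardcoding 1536, so the test survives a change of default model.
const probedDim = parseVectorDim("vector(1280)") ?? 0;
const seededChunk = fakeVector(probedDim, 1);
```

The point of the pattern is that the test asserts routing behavior, not a particular embedding width, so the width should be read from the schema under test.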

* upstream/master:

- v0.41.26.1 fix: lock-renewal cathedral — closes ~39 worker crashes/day (supersedes garrytan#1567) (garrytan#1572)
- v0.41.26.0 fix: dream --source + ingest junk titles + emoji-crash (supersedes garrytan#1559, garrytan#1561) (garrytan#1571)
- v0.41.25.0 perf(sync): batched deletes + global page-generation clock (supersedes garrytan#1538) (garrytan#1566)
- v0.41.24.0 fix(conversation-parser): threshold gates + bold-paren-time pattern — 20,167 Circleback messages unblocked (closes garrytan#1533) (garrytan#1543)
- v0.41.23.0 feat: extract operator surfaces + pack-driven extractables (garrytan#1541)
- v0.41.22.1 feat: brainstorm/lsd judge fixes (closes garrytan#1540 end-to-end) (garrytan#1562)
- v0.41.22.0 feat: type-unification cathedral — 94 types → 15 canonical (closes garrytan#1479) (garrytan#1542)
- v0.41.21.0 feat(ops): 5 daily-driver pains fixed in one wave (garrytan#1545)
- v0.41.20.0 feat: gbrain status + doctor --scope=brain (fix wave 2: items garrytan#6 + garrytan#7) (garrytan#1544)
- feat: v0.41.19.0 Supavisor Retry Cathedral (garrytan#1537)
- v0.41.18.0: gbrain onboard — the activation surface gbrain didn't have before (garrytan#1521)
- v0.41.17.0 feat: --workers N on every bulk command + facts dim doctor parity (garrytan#1519)
- v0.41.16.0 feat: conversation parser cathedral + progressive-batch primitive (closes garrytan#1461) (garrytan#1510)
- v0.41.15.0 feat(sync): --timeout + --max-age + partial status (closes garrytan#1472 RFC) (garrytan#1506)

garrytan and others added 4 commits May 26, 2026 11:49

garrytan changed the title ~~v0.41.15.0 feat: conversation parser cathedral + progressive-batch primitive (closes #1461)~~ v0.41.16.0 feat: conversation parser cathedral + progressive-batch primitive (closes #1461) May 26, 2026

garrytan merged commit f702ec0 into master May 27, 2026
21 checks passed

garrytan mentioned this pull request May 27, 2026

v0.41.17.0 feat: --workers N on every bulk command + facts dim doctor parity #1519

Merged

7 tasks

garrytan merged 5 commits into master from garrytan/dynamic-regex-conversation-formats
