v0.41.16.0 feat: conversation parser cathedral + progressive-batch primitive (closes #1461)#1510
Merged
Merged
Conversation
…imitive (closes #1461) Replaces PR #1461's single-format Telegram regex with a 12-pattern built-in registry covering iMessage/Slack, Telegram (×2), Discord (×2), WhatsApp (×2 locales), Signal, Matrix/Element, IRC (×2), Teams. Each pattern is hand-vetted from public format docs (signal-cli, DiscordChatExporter, Telegram Desktop, WhatsApp export docs, Element matrix-archive, irssi/weechat defaults); module-load validation runs test_positive[] + test_negative[] for every pattern at startup so a typo makes gbrain refuse to start. PR #1461 contributor's BRACKET_TIME_RX + cleanSpeaker survive verbatim as the `telegram-bracket` built-in pattern + DEFAULT_SPEAKER_CLEAN export. All 33 of their test cases pass against the new orchestrator. Three layers per page (orchestrator chooses): 1. Built-in pattern registry (zero-cost, deterministic) 2. User-declared simple_pattern via config (deferred to v0.42+) 3. Opt-IN LLM polish + fallback (privacy-first; chat content goes to Anthropic only when user explicitly enables) D18 priority scoring picks the highest-match-rate pattern across the first 10 lines (not first-wins) so overlapping formats don't silently mis-route. D5 multi_line per-pattern + D11 quick_reject prefix screen + D19 timezone_policy per-pattern complete the registry shape. Companion: src/core/progressive-batch/ primitive (rule of three satisfied across 12+ ad-hoc cost-prompt sites). Wintermute-inspired ramp shape (trial 10 → 100 → 500 → full with verification at each stage), productionized with verifier+policy injection (callers describe HOW TO MEASURE SUCCESS, not WHEN TO WAIT FOR CTRL-C). D3 fail-closed budget gate: null tracker + null Policy.maxCostUsd → abort_cost_cap reason='no_budget_safety_net'. D20 discriminated Verifier union (output_count | idempotent_mutation | noop). extract-conversation-facts is the one proven consumer in v0.41.15.0; 9-site retrofit deferred to v0.41.16.0+ per TODOS.md. Codex outside-voice review absorbed 8 substantive findings: - Privacy posture (LLM polish/fallback flipped to opt-IN) - ReDoS theater (dropped arbitrary user regex; v0.42+ uses RE2) - LLM-inferred-regex persistence as silent-corruption machine - Pattern priority scoring across first 10 lines - Timezone policy on every PatternEntry - Verifier shape discriminated union - Behavior parity for sites that "jumped straight to full" - Real-corpus-redacted fixture gap (v0.42+ TODO) CI gates: - bun run check:conversation-parser (13 fixtures, --no-llm, deterministic) - bun run check:fixture-privacy (banned-token grep) Doctor surfaces 3 new checks: conversation_format_coverage, progressive_batch_audit_health, conversation_parser_probe_health. Tests: 198/198 across primitive + parser + LLM + nightly probe + eval CLI + debug CLI + doctor checks + migration v97 round-trip + E2E parser ↔ engine integration. Real bug caught + fixed during gap audit: IdempotentMutationVerifier was comparing absolute mutated-count vs per-stage expected (failed silently on stage 2+); now uses per-stage delta semantics matching OutputCountVerifier. Schema migration v97: conversation_parser_llm_cache table with (content_sha256, model_id, call_shape) composite key. NO inferred_patterns table (D17: silent-corruption machine). Plan + 23 decisions + codex outside-voice absorption at ~/.claude/plans/system-instruction-you-are-working-cuddly-hollerith.md. Co-Authored-By: garrytan-agents (PR #1461) <noreply@github.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Conflict resolutions: - VERSION → 0.41.15.0 (queue-aware: master claimed 13+14) - package.json → 0.41.15.0 - CHANGELOG.md → my v0.41.15.0 entry on top, master's v0.41.14.0 + v0.41.13.0 entries preserved below - TODOS.md → both sections preserved (mine = v0.41.15.0 conv-parser, master = v0.41.14.0 #1451 drift) - scripts/run-verify-parallel.sh → keep all 3 checks (fixture-privacy + conversation-parser + resolver) - src/cli.ts CLI_ONLY → merged set (conversation-parser + reindex) - src/core/migrate.ts → master's v97 (pages_dedup_partial_index) kept; mine renumbered to v98 (conversation_parser_llm_cache_table) - test/migrations-v97.test.ts → renamed to test/migrations-v98.test.ts to match Regenerated llms-full.txt after merge. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The new sibling privacy guard literally names the banned tokens in its BANNED_TOKENS array — same meta-exception that check-privacy.sh itself gets. Without this allowlist entry, bun run verify rejects the file post-merge because the banned name appears in the rule-definition script. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mechanical rename across all surfaces: VERSION, package.json, CHANGELOG (header + body refs), CLAUDE.md, TODOS.md, src/core/ migrate.ts (migration v98 comment), all src/core/conversation-parser/* and src/core/progressive-batch/* file headers, all test/ headers, scripts/check-privacy.sh allowlist comment, llms-full.txt regenerated. Audit clean: VERSION + package.json + CHANGELOG header all show 0.41.16.0. verify 24/24, touched tests 179/179. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolutions: - VERSION → 0.41.16.0 (mine — higher) - package.json → 0.41.16.0 - CHANGELOG → my v0.41.16.0 on top + master's v0.41.15.0 below - TODOS → preserve both sections (mine = v0.41.16.0 conv-parser, master = v0.41.15.0 sync-reliability) - src/core/migrate.ts → master's v98 (gbrain_cycle_locks_last_refreshed_at) kept; mine renumbered to v99 (conversation_parser_llm_cache_table) - test/migrations-v98.test.ts → renamed to test/migrations-v99.test.ts to match verify 24/24, 331/331 tests across touched suites + migrate.test.ts green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
garrytan
added a commit
that referenced
this pull request
May 27, 2026
PR #1510 (garrytan/dynamic-regex-conversation-formats) claimed v0.41.16.0 on master in parallel. Advancing this wave to v0.41.17.0 so both can land cleanly. Pure mechanical version bump: - VERSION + package.json → 0.41.17.0 - CHANGELOG.md header + "To take advantage of v0.41.17.0" block - TODOS.md section header + v0.41.18+ forward references - CLAUDE.md inline version tags - Regenerated llms-full.txt / llms.txt No code changes. The actual workers cathedral feature set is unchanged from the two prior commits in this branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
7 tasks
garrytan
added a commit
that referenced
this pull request
May 27, 2026
… parity (#1519) * feat(worker-pool): shared sliding pool + bounded semaphore + PGLite-clamp wrapper T1 + T2 of the v0.41.16.0 workers cathedral. New src/core/worker-pool.ts is the canonical primitive every --workers N bulk command in this wave (and future bulk commands) builds on. Atomic-claim invariant enforced by scripts/check-worker-pool-atomicity.sh (wired into bun run verify). BudgetExhausted bypass + AbortSignal composition baked into the helper so budget caps are a structural ceiling under concurrency, not a per-caller convention. The new resolveWorkersWithClamp wrapper composes existing autoConcurrency with PGLite-clamp + per-(command, requested) stderr dedup. Deliberately NOT a modification to shared autoConcurrency (silent today, used by sync + import); embed.ts keeps GBRAIN_EMBED_CONCURRENCY || 20 default per codex #13. 23 + 12 + 9 = 44 hermetic tests pin every contract. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: structural + dim-check regression suites for v0.41.16.0 wave - test/embed-helper-migration.test.ts (T3): asserts embed.ts's two sliding-pool sites are migrated to runSlidingPool, pre-migration shapes (let nextIdx = 0, Promise.all(Array.from(...))) are gone, GBRAIN_EMBED_CONCURRENCY || 20 default preserved, failureLabel threads page.slug. Per codex #16/#17 these are invariant assertions, not byte-equality on progress event ORDERING. - test/embedding-dim-check-facts.test.ts (T6): readFactsEmbeddingDim covers vector(N) + halfvec(N), halfvec-before-vector regex ordering pinned (codex #19), buildFactsAlterRecipe emits DROP INDEX + ALTER USING + CREATE INDEX (codex #18, not bare REINDEX), FactsEmbeddingDimMismatchError tagged class shape, assertFactsEmbeddingDimMatchesConfig PGLite skip + Postgres absent- column skip, doctor check + insert-cast wiring assertions. - test/extract-conversation-facts-workers.test.ts (T5): helper exports (extractConversationFactsLockId, PER_PAGE_LOCK_TTL_MINUTES), structural wiring (runSlidingPool, resolveWorkersWithClamp, withRefreshingLock, LockUnavailableError, delete-orphans-first before segment loop, preflight before pool, exit 3 when lock_skipped > 0), Minion handler round-trip. - test/extract-workers.test.ts (T7): --workers wiring on all 3 inner fs-walk loops (extractForSlugs, extractLinksFromDir, extractTimelineFromDir) + CLI parse + opts threading through runExtractCore. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: rebump v0.41.16.0 → v0.41.17.0 (queue collision with PR #1510) PR #1510 (garrytan/dynamic-regex-conversation-formats) claimed v0.41.16.0 on master in parallel. Advancing this wave to v0.41.17.0 so both can land cleanly. Pure mechanical version bump: - VERSION + package.json → 0.41.17.0 - CHANGELOG.md header + "To take advantage of v0.41.17.0" block - TODOS.md section header + v0.41.18+ forward references - CLAUDE.md inline version tags - Regenerated llms-full.txt / llms.txt No code changes. The actual workers cathedral feature set is unchanged from the two prior commits in this branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(test): search-image-column probes column dim at runtime CI shard 5 failed on `searchVector column routing (v0.27.1)` with: error: expected 1280 dimensions, not 1536 The test had a hardcoded `fakeText1536` helper that seeded chunks at 1536-d vectors. Master's default embedding model switched from OpenAI text-embedding-3-large (1536) to ZeroEntropy zembed-1 (1280) so a fresh PGLite brain on CI now sizes content_chunks.embedding at 1280; the test's 1536-d INSERT trips pgvector's CheckExpectedDim. Fix: probe `content_chunks.embedding` width via `readContentChunksEmbeddingDim(engine)` in `beforeAll`, store in `TEXT_DIM`, and build `fakeTextDefault(seed)` at that width. The test now passes regardless of which default ships (the model has flipped twice and may flip again). Local dev (1536 from older config) and CI fresh-install (1280 from new default) both pass. Image-side vectors stay at 1024 (matches Voyage multimodal-3 + the column's fixed width on the image side). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(test): bump PGLite hook timeout for shard-4 deep-process files facts-anti-loop.test.ts and ingest-capture.test.ts were timing out in CI shard 4 with "beforeEach/afterEach hook timed out" after the v0.41.16.0 master merge brought migration count to 99. When these files run deep in a shard process that has already created ~20 PGLite engines, the WASM cold-start + 95-migration replay legitimately exceeds bun's 5s default hook timeout (observed 5.6s and 7.3s locally when reproducing). Bun's --timeout=60000 from scripts/test-shard.sh covers TEST timeouts but NOT hook timeouts; those default to 5s and must be set per-hook via the optional 2nd arg to beforeAll/afterAll. Reproduced locally by running the first 21 shard-4 files via head -21 /tmp/shard4-list.txt | xargs bun test → 179 pass, 2 fail (both with hook-timeout error) After fix: → 198 pass, 0 fail (the 4 anti-loop + 15 ingest-capture tests recover) Full shard 4 with fix: 955 pass, 0 fail. Full shard 5 with fix: 1261 pass, 0 fail. Also added a defensive diagnostic to the two put_page tests: if facts_backstop is missing in the response payload, throw with the full payload + isError so future failures surface the actual handler error instead of a bare "expected {...} got undefined" assertion. No-op when the test passes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mgunnin
added a commit
to mgunnin/gbrain
that referenced
this pull request
May 28, 2026
* upstream/master: v0.41.26.1 fix: lock-renewal cathedral — closes ~39 worker crashes/day (supersedes garrytan#1567) (garrytan#1572) v0.41.26.0 fix: dream --source + ingest junk titles + emoji-crash (supersedes garrytan#1559, garrytan#1561) (garrytan#1571) v0.41.25.0 perf(sync): batched deletes + global page-generation clock (supersedes garrytan#1538) (garrytan#1566) v0.41.24.0 fix(conversation-parser): threshold gates + bold-paren-time pattern — 20,167 Circleback messages unblocked (closes garrytan#1533) (garrytan#1543) v0.41.23.0 feat: extract operator surfaces + pack-driven extractables (garrytan#1541) v0.41.22.1 feat: brainstorm/lsd judge fixes (closes garrytan#1540 end-to-end) (garrytan#1562) v0.41.22.0 feat: type-unification cathedral — 94 types → 15 canonical (closes garrytan#1479) (garrytan#1542) v0.41.21.0 feat(ops): 5 daily-driver pains fixed in one wave (garrytan#1545) v0.41.20.0 feat: gbrain status + doctor --scope=brain (fix wave 2: items garrytan#6 + garrytan#7) (garrytan#1544) feat: v0.41.19.0 Supavisor Retry Cathedral (garrytan#1537) v0.41.18.0: gbrain onboard — the activation surface gbrain didn't have before (garrytan#1521) v0.41.17.0 feat: --workers N on every bulk command + facts dim doctor parity (garrytan#1519) v0.41.16.0 feat: conversation parser cathedral + progressive-batch primitive (closes garrytan#1461) (garrytan#1510) v0.41.15.0 feat(sync): --timeout + --max-age + partial status (closes garrytan#1472 RFC) (garrytan#1506)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Replaces PR #1461's single-format Telegram regex with a 12-pattern built-in registry covering every common chat-export format, plus an opt-in LLM polish + fallback layer for the long tail. Same wave extracts a new
progressive-batchprimitive (Wintermute-inspired ramp-up: trial 10 → 100 → 500 → full with verification at each stage) after rule-of-three was satisfied across 12+ ad-hoc cost-prompt sites.The user impact: a real production brain has 134 Telegram-shaped pages typed
conversationthat the existing parser silently drops. After this upgrade, the dream cycle'sconversation_facts_backfillphase extracts facts from them automatically. Same for any Discord, WhatsApp, Signal, IRC, Matrix, Teams, or email-thread page.PR #1461 contributor's
BRACKET_TIME_RX+cleanSpeakersurvive verbatim as thetelegram-bracketbuilt-in pattern +DEFAULT_SPEAKER_CLEANexport. All 33 of their test cases pass against the new orchestrator. Co-Authored-By preserved.The 12 built-in patterns
imessage-slack**Alice** (2024-03-15 9:00 AM): hitelegram-bracket**[18:37] 👤 Alice:** hi(PR #1461 verbatim)telegram-text-exportAlice, [Mar 15, 2024 at 6:37:00 PM]whatsapp-iso[15/03/24, 18:37:00] Alice: hiwhatsapp-us3/15/24, 6:37 PM - Alice: hidiscord-export[03/15/2024 6:37 PM] Alice+ multi-line bodydiscord-classicAlice — Today at 18:37+ multi-line bodysignal-exportAlice (2024-03-15 18:37:00 UTC): himatrix-element[18:37] @alice:matrix.org: hiirc-classic<alice> hiirc-weechat18:37 <alice> hiteams-exportAlice, 3/15/2024 6:37 PM: hiTest Coverage
test/progressive-batch/orchestrator.test.tstest/progressive-batch/retrofit-wrap.test.tstest/conversation-parser/parse.test.tstest/conversation-parser/llm-base.test.tstest/conversation-parser/llm-fallback.test.tstest/conversation-parser/llm-polish.test.tstest/conversation-parser/nightly-probe.test.tstest/eval-conversation-parser-cli.test.tstest/conversation-parser-cli.test.tstest/migrations-v98.test.tstest/doctor-v0_41_13_checks.test.tstest/e2e/conversation-parser-pglite.test.tstest/extract-conversation-facts.test.tsTotal: 350/350 across touched suites.
bun run verify: 24/24 green.Real bug caught + fixed during gap audit:
IdempotentMutationVerifierwas comparing absolute mutated-count vs per-stage expected (failed silently on stage 2+); now uses per-stage delta semantics matchingOutputCountVerifier.Pre-Landing Review
Codex outside-voice review during planning absorbed 8 substantive technical findings (all adopted):
inline_utc | frontmatter_tz | utc_assumed_with_warn; time-only formats warn when no frontmatter timezoneinteractiveAbortMs > 0Plan Completion
23 decisions captured across plan-eng-review + 14 decisions in plan-mode. All resolved. Plan file at
~/.claude/plans/system-instruction-you-are-working-cuddly-hollerith.md. The 9-site progressive-batch retrofit is deferred to v0.41.16.0+ per D2 (one proven consumer ships now; bisectable retrofits land per-PR).Documentation
CLAUDE.mdextended with two new file-cluster annotations (src/core/conversation-parser/andsrc/core/progressive-batch/).CHANGELOG.mdcarries the ELI10-lead-first v0.41.15.0 entry.TODOS.mdfiled 7 v0.42+ follow-up entries.CI gates added
bun run check:conversation-parser— 13-fixture deterministic eval gate with--no-llm(no API keys needed). Wired intobun run verify.bun run check:fixture-privacy— banned-token grep overtest/fixtures/conversation-formats/. Wired into verify.Closed in this wave
Test plan
bun run verify— 24/24 green🤖 Generated with Claude Code