
v0.41.16.0 feat: conversation parser cathedral + progressive-batch primitive (closes #1461) #1510

Merged
garrytan merged 5 commits into master from garrytan/dynamic-regex-conversation-formats
May 27, 2026

@garrytan

Owner

Summary

Replaces PR #1461's single-format Telegram regex with a 12-pattern built-in registry covering every common chat-export format, plus an opt-in LLM polish + fallback layer for the long tail. Same wave extracts a new progressive-batch primitive (Wintermute-inspired ramp-up: trial 10 → 100 → 500 → full with verification at each stage) after rule-of-three was satisfied across 12+ ad-hoc cost-prompt sites.

The user impact: a real production brain has 134 Telegram-shaped pages of type conversation that the existing parser silently drops. After this upgrade, the dream cycle's conversation_facts_backfill phase extracts facts from them automatically. The same applies to any Discord, WhatsApp, Signal, IRC, Matrix, Teams, or email-thread page.

PR #1461 contributor's BRACKET_TIME_RX + cleanSpeaker survive verbatim as the telegram-bracket built-in pattern + DEFAULT_SPEAKER_CLEAN export. All 33 of their test cases pass against the new orchestrator. Co-Authored-By preserved.
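
The ramp-up described in the summary (trial 10 → 100 → 500 → full, with verification at each stage) can be sketched roughly as follows. This is a minimal illustration, not the primitive's actual API: `rampStages`, `runRamp`, and the assumption that stage sizes are cumulative targets are all hypothetical.

```typescript
// Hypothetical sketch: names and shapes are illustrative, not gbrain's API.

/** Cumulative stage targets for a ramp: 10 → 100 → 500 → full (capped at total). */
function rampStages(total: number, trials: number[] = [10, 100, 500]): number[] {
  const stages = trials.filter((n) => n < total);
  stages.push(total); // the final stage always covers the whole batch
  return stages;
}

/**
 * Process items in growing stages. After each stage, `verify` inspects only
 * that stage's outputs; a failed check stops the run before it scales up.
 */
function runRamp<T, R>(
  items: T[],
  processItem: (item: T) => R,
  verify: (stageOutputs: R[]) => boolean,
): { processed: number; abortedAtStage: number | null } {
  let processed = 0;
  const stages = rampStages(items.length);
  for (let i = 0; i < stages.length; i++) {
    const target = stages[i] ?? items.length;
    const outputs = items.slice(processed, target).map(processItem);
    processed = target;
    if (!verify(outputs)) return { processed, abortedAtStage: i };
  }
  return { processed, abortedAtStage: null };
}
```

The point of the shape is that a bad prompt or a broken parser is caught after 10 or 100 items rather than after the full run has been paid for.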

The 12 built-in patterns

| id | sample |
| --- | --- |
| imessage-slack | `**Alice** (2024-03-15 9:00 AM): hi` |
| telegram-bracket | `**[18:37] 👤 Alice:** hi` (PR #1461 verbatim) |
| telegram-text-export | `Alice, [Mar 15, 2024 at 6:37:00 PM]` |
| whatsapp-iso | `[15/03/24, 18:37:00] Alice: hi` |
| whatsapp-us | `3/15/24, 6:37 PM - Alice: hi` |
| discord-export | `[03/15/2024 6:37 PM] Alice` + multi-line body |
| discord-classic | `Alice — Today at 18:37` + multi-line body |
| signal-export | `Alice (2024-03-15 18:37:00 UTC): hi` |
| matrix-element | `[18:37] @alice:matrix.org: hi` |
| irc-classic | `<alice> hi` |
| irc-weechat | `18:37 <alice> hi` |
| teams-export | `Alice, 3/15/2024 6:37 PM: hi` |
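
To make the samples concrete, here is a small sketch of how two of the rows above (whatsapp-iso and irc-weechat) might be matched. The regexes and the `ChatLine` shape were written for this illustration only and are not the ones shipped in the registry.

```typescript
// Illustrative only: these regexes are assumptions, not the built-in patterns.
type ChatLine = { speaker: string; time: string; text: string };

// Matches e.g. "[15/03/24, 18:37:00] Alice: hi"
const WHATSAPP_ISO_RX = /^\[(\d{2}\/\d{2}\/\d{2}), (\d{2}:\d{2}:\d{2})\] ([^:]+): (.*)$/;
// Matches e.g. "18:37 <alice> hi"
const IRC_WEECHAT_RX = /^(\d{2}:\d{2}) <([^>]+)> (.*)$/;

function parseWhatsappIso(line: string): ChatLine | null {
  const m = WHATSAPP_ISO_RX.exec(line);
  if (!m) return null;
  return { speaker: m[3] ?? "", time: `${m[1] ?? ""} ${m[2] ?? ""}`, text: m[4] ?? "" };
}

function parseIrcWeechat(line: string): ChatLine | null {
  const m = IRC_WEECHAT_RX.exec(line);
  if (!m) return null;
  return { speaker: m[2] ?? "", time: m[1] ?? "", text: m[3] ?? "" };
}
```

Note that the irc-weechat sample carries a time but no date, which is the kind of time-only format the timezone policy has to account for.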

Test Coverage

| Layer | Test file | Cases | Status |
| --- | --- | --- | --- |
| Primitive | test/progressive-batch/orchestrator.test.ts | 43 | |
| Primitive — retrofit-wrap | test/progressive-batch/retrofit-wrap.test.ts | 7 | |
| Parser orchestrator | test/conversation-parser/parse.test.ts | 39 (incl. 6 PR #1461 regression cases verbatim) | |
| LLM base | test/conversation-parser/llm-base.test.ts | 16 | |
| LLM fallback | test/conversation-parser/llm-fallback.test.ts | 6 | |
| LLM polish | test/conversation-parser/llm-polish.test.ts | 11 | |
| Nightly probe | test/conversation-parser/nightly-probe.test.ts | 7 | |
| Eval CLI | test/eval-conversation-parser-cli.test.ts | 15 (exit codes, --json, --no-llm) | |
| Debug CLI | test/conversation-parser-cli.test.ts | 10 | |
| Migration v98 | test/migrations-v98.test.ts | 7 (JSONB round-trip, CHECK, composite PK) | |
| Doctor checks | test/doctor-v0_41_13_checks.test.ts | 4 (3 new checks present + shape) | |
| E2E parser ↔ engine | test/e2e/conversation-parser-pglite.test.ts | 15 (12 built-ins through PGLite) | |
| Back-compat | test/extract-conversation-facts.test.ts | 27 (PR #1461 baseline) | |

Total: 350/350 across touched suites. bun run verify: 24/24 green.

Real bug caught + fixed during gap audit: IdempotentMutationVerifier was comparing absolute mutated-count vs per-stage expected (failed silently on stage 2+); now uses per-stage delta semantics matching OutputCountVerifier.
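
The difference between the two comparisons can be shown with a small sketch. The `StageCheck` shape and both function names are hypothetical; only the absolute-versus-delta distinction comes from the description above.

```typescript
// Hypothetical shapes for illustration; the real verifier interface may differ.
type StageCheck = {
  stageExpected: number; // rows this stage alone was expected to mutate
  mutatedBefore: number; // cumulative mutated count before the stage ran
  mutatedAfter: number;  // cumulative mutated count after the stage ran
};

// Pre-fix behaviour: cumulative total compared against a per-stage expectation.
function verifyAbsolute(c: StageCheck): boolean {
  return c.mutatedAfter === c.stageExpected;
}

// Post-fix behaviour: only the change produced by this stage is compared.
function verifyDelta(c: StageCheck): boolean {
  return c.mutatedAfter - c.mutatedBefore === c.stageExpected;
}
```

On the first stage the two checks agree because the count starts at zero; from the second stage onward the cumulative total no longer equals the per-stage expectation, so only the delta form stays correct.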

Pre-Landing Review

Codex outside-voice review during planning absorbed 8 substantive technical findings (all adopted):

  • Privacy posture — LLM polish/fallback flipped to opt-IN (chat content goes to Anthropic only when user explicitly enables)
  • ReDoS theater — dropped arbitrary user regex; user patterns wait for v0.42+ with worker-isolated regex (safe-regex / RE2)
  • LLM-inferred-regex persistence as silent-corruption machine — dropped entirely; LLM fallback parses for THIS page only, cached by content_hash
  • Pattern priority scoring across first 10 lines (not first-wins) — defeats overlap mis-routing
  • Timezone policy per PatternEntry: inline_utc | frontmatter_tz | utc_assumed_with_warn; time-only formats warn when no frontmatter timezone is set
  • Verifier shape discriminated union (output_count | idempotent_mutation | noop) — fits reindex / embed / eval-contradictions / parser cleanly
  • Behavior parity for retrofits — sites that previously jumped to full keep doing so; ramp is opt-in per-site via interactiveAbortMs > 0
  • Real-corpus-redacted fixture gap — v0.42+ TODO
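
As an illustration of the priority-scoring finding in the list above, the sketch below picks the pattern with the highest match rate over the first 10 lines instead of taking the first pattern that matches anything. `Candidate`, `pickPattern`, the two regexes, and the tie-break (earlier candidate wins) are assumptions made for this example.

```typescript
// Illustrative sketch; the real scoring function and tie-break may differ.
type Candidate = { id: string; rx: RegExp };

/** Return the id of the candidate matching the largest share of the sampled lines. */
function pickPattern(lines: string[], candidates: Candidate[], sampleSize = 10): string | null {
  const sample = lines.slice(0, sampleSize);
  if (sample.length === 0) return null;
  let bestId: string | null = null;
  let bestRate = 0;
  for (const c of candidates) {
    const hits = sample.filter((l) => c.rx.test(l)).length;
    const rate = hits / sample.length;
    // Strictly greater: on a tie the earlier candidate is kept (an assumption).
    if (rate > bestRate) {
      bestRate = rate;
      bestId = c.id;
    }
  }
  return bestId;
}

const demoCandidates: Candidate[] = [
  { id: "irc-classic", rx: /^<[^>]+> / },
  { id: "irc-weechat", rx: /^\d{2}:\d{2} <[^>]+> / },
];
const demoLines = ["<alice> hi", "18:37 <alice> hi", "18:38 <bob> yo", "18:39 <alice> ok"];
const chosen = pickPattern(demoLines, demoCandidates);
```

With first-wins routing the opening line would send this page to irc-classic; scoring the sample routes it to irc-weechat, which matches three of the four lines.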

Plan Completion

23 decisions captured across plan-eng-review + 14 decisions in plan-mode. All resolved. Plan file at ~/.claude/plans/system-instruction-you-are-working-cuddly-hollerith.md. The 9-site progressive-batch retrofit is deferred to v0.41.16.0+ per D2 (one proven consumer ships now; bisectable retrofits land per-PR).

Documentation

CLAUDE.md extended with two new file-cluster annotations (src/core/conversation-parser/ and src/core/progressive-batch/). CHANGELOG.md carries the ELI10-lead-first v0.41.16.0 entry. TODOS.md filed 7 v0.42+ follow-up entries.

CI gates added

  • bun run check:conversation-parser — 13-fixture deterministic eval gate with --no-llm (no API keys needed). Wired into bun run verify.
  • bun run check:fixture-privacy — banned-token grep over test/fixtures/conversation-formats/. Wired into verify.
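
A banned-token check of the kind the second gate describes could look roughly like this. It is a sketch over in-memory file contents rather than the actual script: `findBannedTokens`, the `PrivacyHit` shape, and case-insensitive substring matching are all assumptions.

```typescript
// Hypothetical sketch of a banned-token scan; not the actual check script.
type PrivacyHit = { file: string; line: number; token: string };

/** Scan file contents (path → text) and report every line containing a banned token. */
function findBannedTokens(files: Record<string, string>, bannedTokens: string[]): PrivacyHit[] {
  const hits: PrivacyHit[] = [];
  for (const file of Object.keys(files)) {
    const lines = (files[file] ?? "").split("\n");
    lines.forEach((text, idx) => {
      const lower = text.toLowerCase();
      for (const token of bannedTokens) {
        // Case-insensitive substring match (an assumption for this sketch).
        if (lower.includes(token.toLowerCase())) {
          hits.push({ file, line: idx + 1, token });
        }
      }
    });
  }
  return hits;
}
```

A gate built on this would fail the run whenever the returned list is non-empty.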

Closed in this wave

Test plan

  • bun run verify — 24/24 green
  • 350 unit tests across all touched suites pass
  • E2E parser ↔ engine integration test (15 cases) hermetic via PGLite — all 12 built-ins import + parse correctly
  • PR #1461's 33 test cases (feat: support bracket-time format in conversation facts parser) pass verbatim against the new orchestrator
  • Migration v98 round-trip (7 cases) — CHECK + composite PK + JSONB shape
  • Codex outside-voice review — 14 findings, 13 absorbed
  • Real bug caught during gap audit — IdempotentMutationVerifier per-stage delta semantics
  • Master merged (v0.41.13.0 + v0.41.14.0); conflicts resolved per CLAUDE.md merge-recovery procedure; VERSION/package.json/CHANGELOG audit clean

🤖 Generated with Claude Code

garrytan and others added 4 commits May 26, 2026 11:49
…imitive (closes #1461)

Replaces PR #1461's single-format Telegram regex with a 12-pattern
built-in registry covering iMessage/Slack, Telegram (×2), Discord
(×2), WhatsApp (×2 locales), Signal, Matrix/Element, IRC (×2), Teams.
Each pattern is hand-vetted from public format docs (signal-cli,
DiscordChatExporter, Telegram Desktop, WhatsApp export docs, Element
matrix-archive, irssi/weechat defaults); module-load validation runs
test_positive[] + test_negative[] for every pattern at startup so a
typo makes gbrain refuse to start.
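
The module-load validation described above could be sketched as follows. The `test_positive` and `test_negative` field names come from the commit message; the `PatternSpec` shape, `validateRegistry`, and the sample regex are assumptions for illustration.

```typescript
// Sketch only: the real PatternEntry has more fields than shown here.
type PatternSpec = {
  id: string;
  rx: RegExp;
  test_positive: string[]; // lines the pattern must match
  test_negative: string[]; // lines the pattern must not match
};

/** Throw on the first pattern whose self-tests fail, so a typo stops startup. */
function validateRegistry(registry: PatternSpec[]): void {
  for (const p of registry) {
    for (const s of p.test_positive) {
      if (!p.rx.test(s)) throw new Error(`${p.id}: expected a match for "${s}"`);
    }
    for (const s of p.test_negative) {
      if (p.rx.test(s)) throw new Error(`${p.id}: unexpected match for "${s}"`);
    }
  }
}

const ircWeechatSpec: PatternSpec = {
  id: "irc-weechat",
  rx: /^\d{2}:\d{2} <[^>]+> /,
  test_positive: ["18:37 <alice> hi"],
  test_negative: ["<alice> hi"],
};
validateRegistry([ircWeechatSpec]); // runs at import time in this sketch
```

Calling the validator at the top level of the registry module is what turns a broken regex into a startup failure rather than a silent parse miss.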

PR #1461 contributor's BRACKET_TIME_RX + cleanSpeaker survive verbatim
as the `telegram-bracket` built-in pattern + DEFAULT_SPEAKER_CLEAN
export. All 33 of their test cases pass against the new orchestrator.

Three layers per page (orchestrator chooses):
  1. Built-in pattern registry (zero-cost, deterministic)
  2. User-declared simple_pattern via config (deferred to v0.42+)
  3. Opt-IN LLM polish + fallback (privacy-first; chat content goes
     to Anthropic only when user explicitly enables)

D18 priority scoring picks the highest-match-rate pattern across the
first 10 lines (not first-wins) so overlapping formats don't silently
mis-route. D5 multi_line per-pattern + D11 quick_reject prefix screen
+ D19 timezone_policy per-pattern complete the registry shape.
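
One way to read the per-pattern timezone_policy is sketched below. The three policy names come from this PR; how each one resolves, the `resolveTimezone` helper, and the warning strings are interpretations made for this example.

```typescript
// The policy names are from the PR; the semantics below are an interpretation.
type TimezonePolicy = "inline_utc" | "frontmatter_tz" | "utc_assumed_with_warn";
type TzResolution = { tz: string; warning: string | null };

function resolveTimezone(policy: TimezonePolicy, frontmatterTz?: string): TzResolution {
  if (policy === "inline_utc") {
    // The timestamp itself carries UTC (e.g. "2024-03-15 18:37:00 UTC").
    return { tz: "UTC", warning: null };
  }
  if (policy === "frontmatter_tz" && frontmatterTz) {
    // The page's frontmatter declares the zone for its timestamps.
    return { tz: frontmatterTz, warning: null };
  }
  // No zone information is available: assume UTC, but say so.
  return { tz: "UTC", warning: "no timezone available; assuming UTC" };
}
```

Surfacing the warning rather than silently assuming UTC keeps time-only formats from producing confidently wrong timestamps.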

Companion: src/core/progressive-batch/ primitive (rule of three
satisfied across 12+ ad-hoc cost-prompt sites). Wintermute-inspired
ramp shape (trial 10 → 100 → 500 → full with verification at each
stage), productionized with verifier+policy injection (callers
describe HOW TO MEASURE SUCCESS, not WHEN TO WAIT FOR CTRL-C). D3
fail-closed budget gate: null tracker + null Policy.maxCostUsd →
abort_cost_cap reason='no_budget_safety_net'. D20 discriminated
Verifier union (output_count | idempotent_mutation | noop).
extract-conversation-facts is the one proven consumer in v0.41.15.0;
9-site retrofit deferred to v0.41.16.0+ per TODOS.md.
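
The fail-closed budget gate (D3) can be illustrated with a short sketch. The `abort_cost_cap` action and the `no_budget_safety_net` reason are taken from the text above; the `BudgetTracker` and `BatchPolicy` shapes, the `budgetGate` function, and the `over_budget` reason are hypothetical.

```typescript
// Hypothetical shapes; only the fail-closed rule itself comes from the PR.
type BudgetTracker = { spentUsd: number; capUsd: number };
type BatchPolicy = { maxCostUsd: number | null };
type GateDecision =
  | { action: "proceed" }
  | { action: "abort_cost_cap"; reason: string };

function budgetGate(
  tracker: BudgetTracker | null,
  policy: BatchPolicy,
  estimatedCostUsd: number,
): GateDecision {
  // Fail closed: with neither a tracker nor a policy cap there is no safety net.
  if (tracker === null && policy.maxCostUsd === null) {
    return { action: "abort_cost_cap", reason: "no_budget_safety_net" };
  }
  const limits: number[] = [];
  if (tracker !== null) limits.push(tracker.capUsd - tracker.spentUsd);
  if (policy.maxCostUsd !== null) limits.push(policy.maxCostUsd);
  const remaining = Math.min(...limits); // the tightest limit applies
  return estimatedCostUsd <= remaining
    ? { action: "proceed" }
    : { action: "abort_cost_cap", reason: "over_budget" }; // reason name is illustrative
}
```

The design choice is that a missing budget configuration is treated as an error rather than as unlimited spend.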

Codex outside-voice review absorbed 8 substantive findings:
  - Privacy posture (LLM polish/fallback flipped to opt-IN)
  - ReDoS theater (dropped arbitrary user regex; v0.42+ uses RE2)
  - LLM-inferred-regex persistence as silent-corruption machine
  - Pattern priority scoring across first 10 lines
  - Timezone policy on every PatternEntry
  - Verifier shape discriminated union
  - Behavior parity for sites that "jumped straight to full"
  - Real-corpus-redacted fixture gap (v0.42+ TODO)

CI gates:
  - bun run check:conversation-parser (13 fixtures, --no-llm, deterministic)
  - bun run check:fixture-privacy (banned-token grep)

Doctor surfaces 3 new checks: conversation_format_coverage,
progressive_batch_audit_health, conversation_parser_probe_health.

Tests: 198/198 across primitive + parser + LLM + nightly probe + eval
CLI + debug CLI + doctor checks + migration v97 round-trip + E2E
parser ↔ engine integration. Real bug caught + fixed during gap audit:
IdempotentMutationVerifier was comparing absolute mutated-count vs
per-stage expected (failed silently on stage 2+); now uses per-stage
delta semantics matching OutputCountVerifier.

Schema migration v97: conversation_parser_llm_cache table with
(content_sha256, model_id, call_shape) composite key. NO
inferred_patterns table (D17: silent-corruption machine).
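
The composite key means a cached LLM parse is reused only when the page content, the model, and the call shape all match. The in-memory sketch below shows that lookup behaviour; the column names follow the commit message, while the `LlmParseCache` class and the `"polish" | "fallback"` values for call_shape are assumptions, and the real table lives in the database rather than a Map.

```typescript
// In-memory illustration of composite-key lookup; not the actual storage layer.
type CallShape = "polish" | "fallback"; // assumed values
type CacheKey = { content_sha256: string; model_id: string; call_shape: CallShape };

class LlmParseCache<V> {
  private rows = new Map<string, V>();

  // Join the three key parts with a separator that cannot appear in any of them.
  private static keyOf(k: CacheKey): string {
    return [k.content_sha256, k.model_id, k.call_shape].join("\u0000");
  }

  get(k: CacheKey): V | undefined {
    return this.rows.get(LlmParseCache.keyOf(k));
  }

  set(k: CacheKey, value: V): void {
    this.rows.set(LlmParseCache.keyOf(k), value);
  }
}
```

Because the content hash is part of the key, editing a page invalidates its cached parse without any explicit eviction step.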

Plan + 23 decisions + codex outside-voice absorption at
~/.claude/plans/system-instruction-you-are-working-cuddly-hollerith.md.

Co-Authored-By: garrytan-agents (PR #1461) <noreply@github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Conflict resolutions:
- VERSION → 0.41.15.0 (queue-aware: master claimed 13+14)
- package.json → 0.41.15.0
- CHANGELOG.md → my v0.41.15.0 entry on top, master's v0.41.14.0 + v0.41.13.0 entries preserved below
- TODOS.md → both sections preserved (mine = v0.41.15.0 conv-parser, master = v0.41.14.0 #1451 drift)
- scripts/run-verify-parallel.sh → keep all 3 checks (fixture-privacy + conversation-parser + resolver)
- src/cli.ts CLI_ONLY → merged set (conversation-parser + reindex)
- src/core/migrate.ts → master's v97 (pages_dedup_partial_index) kept; mine renumbered to v98 (conversation_parser_llm_cache_table)
- test/migrations-v97.test.ts → renamed to test/migrations-v98.test.ts to match

Regenerated llms-full.txt after merge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The new sibling privacy guard literally names the banned tokens in its
BANNED_TOKENS array — same meta-exception that check-privacy.sh itself
gets. Without this allowlist entry, bun run verify rejects the file
post-merge because the banned name appears in the rule-definition script.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mechanical rename across all surfaces: VERSION, package.json,
CHANGELOG (header + body refs), CLAUDE.md, TODOS.md, src/core/
migrate.ts (migration v98 comment), all src/core/conversation-parser/*
and src/core/progressive-batch/* file headers, all test/ headers,
scripts/check-privacy.sh allowlist comment, llms-full.txt regenerated.

Audit clean: VERSION + package.json + CHANGELOG header all show
0.41.16.0. verify 24/24, touched tests 179/179.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
garrytan changed the title from "v0.41.15.0 feat: conversation parser cathedral + progressive-batch primitive (closes #1461)" to "v0.41.16.0 feat: conversation parser cathedral + progressive-batch primitive (closes #1461)" on May 26, 2026
Resolutions:
- VERSION → 0.41.16.0 (mine — higher)
- package.json → 0.41.16.0
- CHANGELOG → my v0.41.16.0 on top + master's v0.41.15.0 below
- TODOS → preserve both sections (mine = v0.41.16.0 conv-parser, master = v0.41.15.0 sync-reliability)
- src/core/migrate.ts → master's v98 (gbrain_cycle_locks_last_refreshed_at) kept; mine renumbered to v99 (conversation_parser_llm_cache_table)
- test/migrations-v98.test.ts → renamed to test/migrations-v99.test.ts to match

verify 24/24, 331/331 tests across touched suites + migrate.test.ts green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
garrytan added a commit that referenced this pull request May 27, 2026
PR #1510 (garrytan/dynamic-regex-conversation-formats) claimed v0.41.16.0
on master in parallel. Advancing this wave to v0.41.17.0 so both can land
cleanly. Pure mechanical version bump:

- VERSION + package.json → 0.41.17.0
- CHANGELOG.md header + "To take advantage of v0.41.17.0" block
- TODOS.md section header + v0.41.18+ forward references
- CLAUDE.md inline version tags
- Regenerated llms-full.txt / llms.txt

No code changes. The actual workers cathedral feature set is unchanged
from the two prior commits in this branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
garrytan merged commit f702ec0 into master on May 27, 2026
21 checks passed
garrytan added a commit that referenced this pull request May 27, 2026
… parity (#1519)

* feat(worker-pool): shared sliding pool + bounded semaphore + PGLite-clamp wrapper

T1 + T2 of the v0.41.16.0 workers cathedral. New src/core/worker-pool.ts is
the canonical primitive every --workers N bulk command in this wave (and
future bulk commands) builds on. Atomic-claim invariant enforced by
scripts/check-worker-pool-atomicity.sh (wired into bun run verify).
BudgetExhausted bypass + AbortSignal composition baked into the helper so
budget caps are a structural ceiling under concurrency, not a per-caller
convention.

The new resolveWorkersWithClamp wrapper composes existing autoConcurrency
with PGLite-clamp + per-(command, requested) stderr dedup. Deliberately
NOT a modification to shared autoConcurrency (silent today, used by sync
+ import); embed.ts keeps GBRAIN_EMBED_CONCURRENCY || 20 default per
codex #13.

23 + 12 + 9 = 44 hermetic tests pin every contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: structural + dim-check regression suites for v0.41.16.0 wave

- test/embed-helper-migration.test.ts (T3): asserts embed.ts's two
  sliding-pool sites are migrated to runSlidingPool, pre-migration
  shapes (let nextIdx = 0, Promise.all(Array.from(...))) are gone,
  GBRAIN_EMBED_CONCURRENCY || 20 default preserved, failureLabel
  threads page.slug. Per codex #16/#17 these are invariant assertions,
  not byte-equality on progress event ORDERING.
- test/embedding-dim-check-facts.test.ts (T6): readFactsEmbeddingDim
  covers vector(N) + halfvec(N), halfvec-before-vector regex ordering
  pinned (codex #19), buildFactsAlterRecipe emits DROP INDEX + ALTER
  USING + CREATE INDEX (codex #18, not bare REINDEX),
  FactsEmbeddingDimMismatchError tagged class shape,
  assertFactsEmbeddingDimMatchesConfig PGLite skip + Postgres absent-
  column skip, doctor check + insert-cast wiring assertions.
- test/extract-conversation-facts-workers.test.ts (T5): helper
  exports (extractConversationFactsLockId, PER_PAGE_LOCK_TTL_MINUTES),
  structural wiring (runSlidingPool, resolveWorkersWithClamp,
  withRefreshingLock, LockUnavailableError, delete-orphans-first
  before segment loop, preflight before pool, exit 3 when lock_skipped
  > 0), Minion handler round-trip.
- test/extract-workers.test.ts (T7): --workers wiring on all 3 inner
  fs-walk loops (extractForSlugs, extractLinksFromDir,
  extractTimelineFromDir) + CLI parse + opts threading through
  runExtractCore.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: rebump v0.41.16.0 → v0.41.17.0 (queue collision with PR #1510)

PR #1510 (garrytan/dynamic-regex-conversation-formats) claimed v0.41.16.0
on master in parallel. Advancing this wave to v0.41.17.0 so both can land
cleanly. Pure mechanical version bump:

- VERSION + package.json → 0.41.17.0
- CHANGELOG.md header + "To take advantage of v0.41.17.0" block
- TODOS.md section header + v0.41.18+ forward references
- CLAUDE.md inline version tags
- Regenerated llms-full.txt / llms.txt

No code changes. The actual workers cathedral feature set is unchanged
from the two prior commits in this branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(test): search-image-column probes column dim at runtime

CI shard 5 failed on `searchVector column routing (v0.27.1)` with:
  error: expected 1280 dimensions, not 1536

The test had a hardcoded `fakeText1536` helper that seeded chunks at
1536-d vectors. Master's default embedding model switched from OpenAI
text-embedding-3-large (1536) to ZeroEntropy zembed-1 (1280) so a fresh
PGLite brain on CI now sizes content_chunks.embedding at 1280; the
test's 1536-d INSERT trips pgvector's CheckExpectedDim.

Fix: probe `content_chunks.embedding` width via
`readContentChunksEmbeddingDim(engine)` in `beforeAll`, store in
`TEXT_DIM`, and build `fakeTextDefault(seed)` at that width. The test
now passes regardless of which default ships (the model has flipped
twice and may flip again). Local dev (1536 from older config) and CI
fresh-install (1280 from new default) both pass.

Image-side vectors stay at 1024 (matches Voyage multimodal-3 + the
column's fixed width on the image side).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(test): bump PGLite hook timeout for shard-4 deep-process files

facts-anti-loop.test.ts and ingest-capture.test.ts were timing out in CI
shard 4 with "beforeEach/afterEach hook timed out" after the v0.41.16.0
master merge brought migration count to 99. When these files run deep in
a shard process that has already created ~20 PGLite engines, the WASM
cold-start + 95-migration replay legitimately exceeds bun's 5s default
hook timeout (observed 5.6s and 7.3s locally when reproducing).

Bun's --timeout=60000 from scripts/test-shard.sh covers TEST timeouts
but NOT hook timeouts; those default to 5s and must be set per-hook via
the optional 2nd arg to beforeAll/afterAll.

Reproduced locally by running the first 21 shard-4 files via
  head -21 /tmp/shard4-list.txt | xargs bun test
  → 179 pass, 2 fail (both with hook-timeout error)

After fix:
  → 198 pass, 0 fail (the 4 anti-loop + 15 ingest-capture tests recover)

Full shard 4 with fix:  955 pass, 0 fail.
Full shard 5 with fix:  1261 pass, 0 fail.

Also added a defensive diagnostic to the two put_page tests: if
facts_backstop is missing in the response payload, throw with the full
payload + isError so future failures surface the actual handler error
instead of a bare "expected {...} got undefined" assertion. No-op when
the test passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mgunnin added a commit to mgunnin/gbrain that referenced this pull request May 28, 2026
* upstream/master:
  v0.41.26.1 fix: lock-renewal cathedral — closes ~39 worker crashes/day (supersedes garrytan#1567) (garrytan#1572)
  v0.41.26.0 fix: dream --source + ingest junk titles + emoji-crash (supersedes garrytan#1559, garrytan#1561) (garrytan#1571)
  v0.41.25.0 perf(sync): batched deletes + global page-generation clock (supersedes garrytan#1538) (garrytan#1566)
  v0.41.24.0 fix(conversation-parser): threshold gates + bold-paren-time pattern — 20,167 Circleback messages unblocked (closes garrytan#1533) (garrytan#1543)
  v0.41.23.0 feat: extract operator surfaces + pack-driven extractables (garrytan#1541)
  v0.41.22.1 feat: brainstorm/lsd judge fixes (closes garrytan#1540 end-to-end) (garrytan#1562)
  v0.41.22.0 feat: type-unification cathedral — 94 types → 15 canonical (closes garrytan#1479) (garrytan#1542)
  v0.41.21.0 feat(ops): 5 daily-driver pains fixed in one wave (garrytan#1545)
  v0.41.20.0 feat: gbrain status + doctor --scope=brain (fix wave 2: items garrytan#6 + garrytan#7) (garrytan#1544)
  feat: v0.41.19.0 Supavisor Retry Cathedral (garrytan#1537)
  v0.41.18.0: gbrain onboard — the activation surface gbrain didn't have before (garrytan#1521)
  v0.41.17.0 feat: --workers N on every bulk command + facts dim doctor parity (garrytan#1519)
  v0.41.16.0 feat: conversation parser cathedral + progressive-batch primitive (closes garrytan#1461) (garrytan#1510)
  v0.41.15.0 feat(sync): --timeout + --max-age + partial status (closes garrytan#1472 RFC) (garrytan#1506)