feat(cycle): generalize extract_atoms - page-based extraction + config-plane fallback #1414
Closed
garrytan-agents wants to merge 1 commit into
…nfig-plane fallback

Two bugs fixed, one capability added:

1. Config-plane mismatch (bug): extract_atoms read dream.synthesize.* config via loadConfig() (file-plane only), but `gbrain config set` writes to DB. Added resolveConfigStr() helper that falls through to engine.getConfig() when the file plane returns undefined. Matches the precedence model used by loadConfigWithEngine() elsewhere.

2. Transcript-only limitation (design gap): extract_atoms only processed raw .txt/.md transcripts from configured corpus directories. Brains with content already in the DB (meetings, sources, articles, videos, books, originals) had no extraction path. Added page-based discovery via engine.listPages() filtered by extractable types. Skips pages already marked atoms_extracted: true, imported from markdown-greenfield, or shorter than 500 chars. Caps at 50 pages per cycle to bound cost.

3. Dual-source merge: transcript discovery and page discovery run in parallel. Results are deduplicated by content hash before LLM calls. Atom frontmatter distinguishes origin (source_path for transcripts, source_slug for pages) so backlinks work for both.

After successful extraction, source pages get atoms_extracted: true stamped in frontmatter, preventing re-processing on subsequent cycles.

Budget cap ($0.30/run default) applies across both sources uniformly.

Backward-compatible: transcript-only workflows unchanged. Page discovery is additive. Test seams (_transcripts, _skipPageDiscovery) preserved.

13 new tests covering: page extraction, config fallback, dedup, budget cap, skip-already-extracted, skip-greenfield, skip-short, source_slug vs source_path metadata, backward compat.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
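The file-then-DB fallback described in item 1 can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: only the names `resolveConfigStr` and `engine.getConfig()` come from the commit message, while the `FileConfig` and `ConfigEngine` types, the `readFilePlane` helper, the nested-object layout of the file config, and the async, null-returning signature of `getConfig` are all assumptions.

```typescript
// Hypothetical shapes; the real gbrain types are not shown in this PR.
type FileConfig = { [key: string]: unknown };

interface ConfigEngine {
  // Assumed signature: resolves to the stored DB value, or null when the key is unset.
  getConfig(key: string): Promise<string | null>;
}

// Walk a dotted key such as "dream.synthesize.session_corpus_dir" through nested objects.
// (Hypothetical helper; assumes the file config is stored as nested JSON.)
function readFilePlane(config: FileConfig, dottedKey: string): string | undefined {
  let node: unknown = config;
  for (const part of dottedKey.split(".")) {
    if (typeof node !== "object" || node === null) return undefined;
    node = (node as Record<string, unknown>)[part];
  }
  return typeof node === "string" ? node : undefined;
}

// File plane wins; the DB plane is consulted only when the file plane has no value.
async function resolveConfigStr(
  fileConfig: FileConfig,
  engine: ConfigEngine,
  key: string,
): Promise<string | undefined> {
  const fromFile = readFilePlane(fileConfig, key);
  if (fromFile !== undefined) return fromFile;
  const fromDb = await engine.getConfig(key);
  return fromDb ?? undefined;
}
```

With this shape, a key written only through `gbrain config set` is still visible to the phase, and a key present in both planes resolves to the file value.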
Closing in favor of fix-wave incorporation on garrytan/pr-1414-1416-1421 (shipping as v0.41.2.1). Rebuilt with /plan-eng-review + codex outside-voice structural improvements:
Thank you for the careful PR description and the 13 tests; they were the design source for the rebuild. The wave ships with full credit on the merged commit.
garrytan added a commit that referenced this pull request on May 25, 2026
…mpotency + ze-switch env-gate (#1445)

* fix-wave: dream.* DB merge + batch retry + extract_atoms idempotency + ze-switch env-gate + doctor check

Closes PRs #1414, #1416, #1421 (rebuilt from designs by @garrytan-agents with structural improvements from /plan-eng-review + codex outside-voice).

Three production reliability fixes in one wave:

1. dream.* DB-config merge (closes PR #1416 silent-config gap)
   - loadConfigWithEngine() sparse-merge extends with 7 dream.* keys
   - File > DB > defaults precedence (no GBRAIN_DREAM_* env vars)
   - extract-atoms switches to loadConfigWithEngine() so DB-plane keys reach it

2. Batch retry on transient connection drops (closes PR #1416 ~30%-loss bug)
   - withRetry() pure primitive exported from src/commands/extract.ts
   - 6 flush() sites snapshot-before-clear with onRetry callback
   - Reuses isRetryableConnError from src/core/retry-matcher.ts
   - retry-matcher extended with GBrainError{problem:'No database connection'}

3. extract_atoms source-hash idempotency + page-based discovery (closes #1414)
   - One raw SQL with NOT EXISTS subquery replaces 6 listPages + N atom checks
   - sourceId threaded through every putPage call (codex caught real bug)
   - NULL content_hash filter + dream_generated exclusion + transcript-side idempotency
   - cycle.ts passes union of syncPagesAffected + synthesizeWrittenSlugs

4. ze-switch pre-apply + pre-resume env-override gate (closes PR #1421)
   - Gate fires FIRST in apply AND resume; zero setConfig calls on refusal
   - ASCII warning box (no Unicode per repo D10)
   - --ignore-env-override escape hatch for power users
   - ApplyResult extended with refused variant

5. doctor embedding_env_override check (defense-in-depth for #1421)
   - Cross-surface parity: buildChecks() + doctorReportRemote()
   - Uses Check.details (not Check.issues per codex schema review)

Co-Authored-By: garrytan-agents <garrytan-agents@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.41.10.0)

Adds 61 new tests across 5 new files pinning the fix-wave contracts:
   - test/extract-batch-retry.test.ts (16 cases): withRetry primitive + snapshot contract
   - test/extract-atoms-page-discovery.test.ts (17 cases): discovery SQL + dual-source idempotency
   - test/ze-switch-env-override.test.ts (17 cases): env-gate apply + resume + ZERO-setConfig assertion
   - test/doctor-embedding-env-override.test.ts (7 cases): cross-surface parity
   - test/e2e/extract-atoms-discovery-sql.test.ts (4 cases): real-Postgres parity for raw SQL

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(test): pin gateway to 1536-dim in 2 PGLite tests that hardcode 1536-vector inserts

CI shards 1 + 4 failed persistently (not flake; confirmed via retry) after the v0.41.6.0 merge with this error:

   error: expected 1280 dimensions, not 1536
   file: "vector.c", routine: "CheckExpectedDim"

Two test files insert 1536-dim Float32Array vectors into `content_chunks.embedding` / `facts.embedding`, but v0.41.5.0 flipped `DEFAULT_EMBEDDING_DIMENSIONS` from 1536 to 1280 (ZE Matryoshka default). On a fresh CI bun process where no prior test pre-configured the gateway, `initSchema()` sizes the vector column at vector(1280) and the inserts throw.

Locally this is hidden when an earlier test file in the shard happens to have called `configureGateway({embedding_dimensions: 1536})`; that state leaks forward through bun's shared process. The v0.41.6.0 LPT shard re-balancing reordered files so these two ran cold, surfacing the latent bug.

Fix follows the canonical hermetic pattern from test/consolidate-valid-until.test.ts:23-34: pin the gateway to 1536d in beforeAll, reset in afterAll. Test is now isolated from shard ordering.

   test/search-types-filter.test.ts: shard 1 fail
   test/operations-find-trajectory.test.ts: shard 4 (6 fails)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: empty commit to trigger CI

* chore: trigger CI again

* chore: renumber v0.41.10.0 -> v0.41.10.1

Per request: version slot moved to .1 micro tier to leave .0 available for unrelated wave landing on master.

---------

Co-authored-by: garrytan-agents <garrytan-agents@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
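The batch-retry item in the fix-wave (a `withRetry()` primitive plus snapshot-before-clear at each `flush()` site) might look something like the sketch below. The real signatures in src/commands/extract.ts are not shown on this page, so the `RetryOptions` shape, the `sleep` helper, the `flush` signature, and the zero-delay default are assumptions made for illustration only.

```typescript
// Hypothetical options; the real withRetry() signature is not shown in this PR.
interface RetryOptions {
  maxAttempts: number;
  isRetryable: (err: unknown) => boolean;
  onRetry?: (attempt: number, err: unknown) => void;
  delayMs?: number;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Run fn, retrying only while the error is classified as transient.
async function withRetry<T>(fn: () => Promise<T>, opts: RetryOptions): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Give up on non-transient errors or once attempts are exhausted.
      if (attempt >= opts.maxAttempts || !opts.isRetryable(err)) throw err;
      opts.onRetry?.(attempt, err);
      await sleep(opts.delayMs ?? 0);
    }
  }
}

// Snapshot-before-clear: copy the batch, empty the buffer, then retry the write
// against the copy, so every attempt sees the same rows.
async function flush<Row>(
  buffer: Row[],
  write: (rows: Row[]) => Promise<void>,
  opts: RetryOptions,
): Promise<number> {
  const snapshot = buffer.slice();
  buffer.length = 0;
  await withRetry(() => write(snapshot), opts);
  return snapshot.length;
}
```

Taking the snapshot before clearing means a dropped connection on the first attempt does not discard the batch; the retry writes the same rows again.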
Problem
`extract_atoms` (v0.41 T5) has two issues that prevent it from working on production brains:

1. Config-plane mismatch
The phase reads `dream.synthesize.session_corpus_dir` via `loadConfig()`, which only reads the file plane (`~/.gbrain/config.json` + env vars). But `gbrain config set dream.synthesize.*` writes to the DB config table. Result: the config appears set, but the phase never sees it and silently returns "no transcripts to process."

2. Transcript-only design
The phase ONLY processes raw files from filesystem corpus directories. A brain with thousands of existing pages (meetings, sources, articles) has no extraction path: all that content sits in the DB, but `extract_atoms` can't reach it.

Solution
Fix 1: `resolveConfigStr()` helper
Walks file config first, falls through to `engine.getConfig(key)` for DB-plane. Matches precedence model used by `loadConfigWithEngine()` elsewhere.

Fix 2: Page-based discovery
`discoverExtractablePages()` queries `engine.listPages()` for extractable types (`meeting`, `source`, `article`, `video`, `book`, `original`). Filters out already-processed pages (`atoms_extracted: true`). Caps at 50 pages/cycle.

Fix 3: Dual-source merge
Transcript + page discovery run together. Deduplicated by content hash. Atom frontmatter distinguishes origin (`source_path` for transcripts, `source_slug` for pages). Source pages get `atoms_extracted: true` after extraction.

Testing
13 new tests, all passing on the first run. They cover: page extraction, config fallback, dedup, budget cap, skip-already-extracted, skip-greenfield, skip-short, metadata correctness, backward compat.
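The filtering rules from Fix 2 and the content-hash dedup from Fix 3 can be illustrated over in-memory records as below. This is a sketch under stated assumptions rather than the PR's implementation: the `PageRecord` and `Candidate` shapes, the `imported_from` frontmatter field used to represent the markdown-greenfield skip, the precomputed `contentHash` field, and the `dedupeByContentHash` helper are all hypothetical, and the real `discoverExtractablePages()` queries `engine.listPages()` rather than taking an array.

```typescript
// Hypothetical record shape; field names are illustrative, not the gbrain schema.
interface PageRecord {
  slug: string;
  type: string;
  body: string;
  contentHash: string;
  frontmatter: { atoms_extracted?: boolean; imported_from?: string };
}

const EXTRACTABLE_TYPES = new Set(["meeting", "source", "article", "video", "book", "original"]);
const MIN_BODY_CHARS = 500; // pages shorter than this are skipped
const MAX_PAGES_PER_CYCLE = 50; // cap to bound cost per cycle

// In-memory stand-in for page discovery: apply each skip rule, then cap the result.
function discoverExtractablePages(pages: PageRecord[]): PageRecord[] {
  return pages
    .filter((p) => EXTRACTABLE_TYPES.has(p.type))
    .filter((p) => p.frontmatter.atoms_extracted !== true)
    .filter((p) => p.frontmatter.imported_from !== "markdown-greenfield")
    .filter((p) => p.body.length >= MIN_BODY_CHARS)
    .slice(0, MAX_PAGES_PER_CYCLE);
}

// A candidate from either source; the origin-specific field feeds atom frontmatter.
type Candidate =
  | { origin: "transcript"; source_path: string; contentHash: string }
  | { origin: "page"; source_slug: string; contentHash: string };

// Keep the first candidate seen for each content hash, preserving input order.
function dedupeByContentHash(candidates: Candidate[]): Candidate[] {
  const seen = new Set<string>();
  const unique: Candidate[] = [];
  for (const c of candidates) {
    if (seen.has(c.contentHash)) continue;
    seen.add(c.contentHash);
    unique.push(c);
  }
  return unique;
}
```

Deduplicating before any LLM call means content that exists both as a transcript file and as a DB page is charged against the budget cap only once.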