feat(cycle): generalize extract_atoms — page-based extraction + config-plane fallback by garrytan-agents · Pull Request #1414 · garrytan/gbrain

garrytan-agents · 2026-05-25T07:16:57Z

Problem

extract_atoms (v0.41 T5) has two issues that prevent it from working on production brains:

1. Config-plane mismatch

The phase reads dream.synthesize.session_corpus_dir via loadConfig(), which only reads the file plane (~/.gbrain/config.json + env vars). But gbrain config set dream.synthesize.* writes to the DB config table. Result: config appears set but the phase never sees it — silently returns "no transcripts to process."

2. Transcript-only design

The phase ONLY processes raw files from filesystem corpus directories. A brain with thousands of existing pages (meetings, sources, articles) has no extraction path — all that content sits in the DB but extract_atoms can't reach it.

Solution

Fix 1: `resolveConfigStr()` helper

Walks file config first, falls through to engine.getConfig(key) for DB-plane. Matches precedence model used by loadConfigWithEngine() elsewhere.
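
The fallback order described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the `FileConfig` and `ConfigEngine` shapes, the nested-object layout of the file config, and the `lookupDotted` helper are all assumptions made for the example. Only the names `resolveConfigStr` and `engine.getConfig(key)` come from the description.

```typescript
// Hypothetical shapes for illustration; the real gbrain types may differ.
type FileConfig = Record<string, unknown>;

interface ConfigEngine {
  // Assumed to resolve to null/undefined when the key is not set in the DB plane.
  getConfig(key: string): Promise<string | null | undefined>;
}

// Walk a dotted key such as "dream.synthesize.session_corpus_dir" through a nested object.
// (Assumes the file config is nested; a flat key/value layout would need a plain lookup.)
function lookupDotted(config: FileConfig, key: string): unknown {
  let node: unknown = config;
  for (const part of key.split(".")) {
    if (typeof node !== "object" || node === null) return undefined;
    node = (node as Record<string, unknown>)[part];
  }
  return node;
}

// File plane wins; the DB plane is consulted only when the file plane has no string value.
async function resolveConfigStr(
  fileConfig: FileConfig,
  engine: ConfigEngine,
  key: string,
): Promise<string | undefined> {
  const fromFile = lookupDotted(fileConfig, key);
  if (typeof fromFile === "string" && fromFile.length > 0) return fromFile;
  const fromDb = await engine.getConfig(key);
  return fromDb ?? undefined;
}
```

With a helper of this shape, a value written by gbrain config set would be picked up whenever the file plane has no entry for the same key, while an explicit file-plane value still takes precedence.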

Fix 2: Page-based discovery

discoverExtractablePages() queries engine.listPages() for extractable types (meeting, source, article, video, book, original). Filters out already-processed pages (atoms_extracted: true). Caps at 50 pages/cycle.
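
The filtering rules can be illustrated with a pure function over an in-memory page list. This sketch deliberately avoids guessing the signature of engine.listPages(): the `PageLike` shape and the `selectExtractablePages` name are hypothetical. The type list, the atoms_extracted marker, and the 50-page cap come from the description above; the markdown-greenfield and 500-character skips come from the commit message further down.

```typescript
// Hypothetical page shape; the real gbrain page type is not shown in this PR.
interface PageLike {
  slug: string;
  type: string;
  body: string;
  frontmatter: Record<string, unknown>;
}

const EXTRACTABLE_TYPES = new Set(["meeting", "source", "article", "video", "book", "original"]);
const MAX_PAGES_PER_CYCLE = 50; // cap stated in the PR description
const MIN_BODY_CHARS = 500; // threshold stated in the commit message

// Return at most `limit` pages that still need atom extraction, in input order.
function selectExtractablePages(pages: PageLike[], limit: number = MAX_PAGES_PER_CYCLE): PageLike[] {
  const selected: PageLike[] = [];
  for (const page of pages) {
    if (selected.length >= limit) break;
    if (!EXTRACTABLE_TYPES.has(page.type)) continue; // wrong page type
    if (page.frontmatter["atoms_extracted"] === true) continue; // already processed
    if (page.frontmatter["imported_from"] === "markdown-greenfield") continue; // greenfield import
    if (page.body.length < MIN_BODY_CHARS) continue; // too short to be worth an LLM call
    selected.push(page);
  }
  return selected;
}
```

Because the cap is applied after filtering, a cycle in this sketch always spends its budget on pages that are actually eligible rather than on the first 50 rows returned.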

Fix 3: Dual-source merge

Transcript + page discovery run together. Deduplicated by content hash. Atom frontmatter distinguishes origin (source_path for transcripts, source_slug for pages). Source pages get atoms_extracted: true after extraction.
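
A rough sketch of the merge step, assuming each discovered item already carries a content hash. The `ExtractionSource` union and both function names are hypothetical. Only the dedup-by-hash behavior and the source_path / source_slug frontmatter keys come from the description. Listing transcripts first, so that a transcript wins when a page has the same hash, is an assumption of this example.

```typescript
// Hypothetical union describing where a piece of extractable text came from.
type ExtractionSource =
  | { kind: "transcript"; path: string; contentHash: string; text: string }
  | { kind: "page"; slug: string; contentHash: string; text: string };

// Merge both discovery results, keeping the first item seen for each content hash.
function mergeSources(transcripts: ExtractionSource[], pages: ExtractionSource[]): ExtractionSource[] {
  const seen = new Set<string>();
  const merged: ExtractionSource[] = [];
  for (const source of transcripts.concat(pages)) {
    if (seen.has(source.contentHash)) continue; // identical content: one LLM call is enough
    seen.add(source.contentHash);
    merged.push(source);
  }
  return merged;
}

// Frontmatter fragment that records the origin of an atom.
function originFrontmatter(source: ExtractionSource): Record<string, string> {
  return source.kind === "transcript"
    ? { source_path: source.path }
    : { source_slug: source.slug };
}
```

Deduplicating before any LLM call means the budget cap is spent only on distinct content, whichever source it arrived from.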

Testing

13 new tests — all pass first-green. Covers: page extraction, config fallback, dedup, budget cap, skip-already-extracted, skip-greenfield, skip-short, metadata correctness, backward compat.

…nfig-plane fallback

Two bugs fixed, one capability added:

1. Config-plane mismatch (bug): extract_atoms read dream.synthesize.* config via loadConfig() (file-plane only), but `gbrain config set` writes to DB. Added resolveConfigStr() helper that falls through to engine.getConfig() when the file plane returns undefined. Matches the precedence model used by loadConfigWithEngine() elsewhere.

2. Transcript-only limitation (design gap): extract_atoms only processed raw .txt/.md transcripts from configured corpus directories. Brains with content already in the DB (meetings, sources, articles, videos, books, originals) had no extraction path. Added page-based discovery via engine.listPages() filtered by extractable types. Skips pages already marked atoms_extracted: true, imported from markdown-greenfield, or shorter than 500 chars. Caps at 50 pages per cycle to bound cost.

3. Dual-source merge: transcript discovery and page discovery run in parallel. Results are deduplicated by content hash before LLM calls. Atom frontmatter distinguishes origin (source_path for transcripts, source_slug for pages) so backlinks work for both. After successful extraction, source pages get atoms_extracted: true stamped in frontmatter, preventing re-processing on subsequent cycles.

Budget cap ($0.30/run default) applies across both sources uniformly.

Backward-compatible: transcript-only workflows unchanged. Page discovery is additive. Test seams (_transcripts, _skipPageDiscovery) preserved.

13 new tests covering: page extraction, config fallback, dedup, budget cap, skip-already-extracted, skip-greenfield, skip-short, source_slug vs source_path metadata, backward compat.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

garrytan · 2026-05-25T18:33:43Z

Closing in favor of fix-wave incorporation on garrytan/pr-1414-1416-1421 (shipping as v0.41.2.1). Rebuilt with /plan-eng-review + codex outside-voice structural improvements:

Source-hash existence check (idempotency-via-data; survives gbrain sync --force; also fixes the pre-existing transcript-side date-stamp duplicate bug) replaces the frontmatter atoms_extracted: true marker.
Single raw SQL with NOT EXISTS subquery (one round-trip) replaces the 6 listPages iterations + per-candidate atom-existence checks.
sourceId threaded through every atom putPage call (codex caught a real bug here that would have routed non-default-source atoms to 'default').
Discovery SQL filters out dream_generated:true to prevent self-consumption, content_hash IS NOT NULL to prevent crashes, and imported_from='markdown-greenfield' as before.
cycle.ts threads UNION of syncPagesAffected + synthesizeWrittenSlugs as affectedSlugs (incremental cycles now see pages synthesize just wrote).
The systemic dream.* config merge from #1416 (fix: batch retry for PgBouncer connection drops + dream.* DB config merge) obviates the per-call-site resolveConfigStr() helper.
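
The source-hash existence check can be modeled in memory to show why it is idempotent. This is not the raw SQL from the rebuild: it is a sketch of the same logic, and the `CandidatePage` and `AtomRecord` shapes, the `sourceHash` field name, and the function name are all hypothetical. The filters (non-null content hash, not dream-generated, not a markdown-greenfield import) are the ones listed above.

```typescript
// Hypothetical row shapes standing in for what the discovery SQL would read.
interface CandidatePage {
  slug: string;
  contentHash: string | null;
  dreamGenerated: boolean;
  importedFrom?: string;
}

interface AtomRecord {
  slug: string;
  sourceHash: string; // hash of the source content this atom was extracted from (assumed field)
}

// In-memory equivalent of "select candidates WHERE NOT EXISTS (an atom with the same source hash)".
function pagesNeedingExtraction(candidates: CandidatePage[], atoms: AtomRecord[]): CandidatePage[] {
  const extractedHashes = new Set(atoms.map((atom) => atom.sourceHash));
  return candidates.filter(
    (page) =>
      page.contentHash !== null && // a NULL hash cannot be matched and would crash later steps
      !page.dreamGenerated && // never feed dream output back into extraction
      page.importedFrom !== "markdown-greenfield" &&
      !extractedHashes.has(page.contentHash), // the existence check itself
  );
}
```

Since eligibility here is derived from the atoms that exist rather than from a marker on the source page, rewriting or re-syncing the page's frontmatter does not change the outcome. A page becomes eligible again only when its content hash changes.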

Thank you for the careful PR description and the 13 tests — they were the design source for the rebuild. The wave ships with full credit on the merged commit.

@garrytan-agents

…mpotency + ze-switch env-gate (#1445)

* fix-wave: dream.* DB merge + batch retry + extract_atoms idempotency + ze-switch env-gate + doctor check

Closes PRs #1414, #1416, #1421 (rebuilt from designs by @garrytan-agents with structural improvements from /plan-eng-review + codex outside-voice).

Three production reliability fixes in one wave:

1. dream.* DB-config merge (closes PR #1416 silent-config gap)
   - loadConfigWithEngine() sparse-merge extends with 7 dream.* keys
   - File > DB > defaults precedence (no GBRAIN_DREAM_* env vars)
   - extract-atoms switches to loadConfigWithEngine() so DB-plane keys reach it

2. Batch retry on transient connection drops (closes PR #1416 ~30%-loss bug)
   - withRetry() pure primitive exported from src/commands/extract.ts
   - 6 flush() sites snapshot-before-clear with onRetry callback
   - Reuses isRetryableConnError from src/core/retry-matcher.ts
   - retry-matcher extended with GBrainError{problem:'No database connection'}

3. extract_atoms source-hash idempotency + page-based discovery (closes #1414)
   - One raw SQL with NOT EXISTS subquery replaces 6 listPages + N atom checks
   - sourceId threaded through every putPage call (codex caught real bug)
   - NULL content_hash filter + dream_generated exclusion + transcript-side idempotency
   - cycle.ts passes union of syncPagesAffected + synthesizeWrittenSlugs

4. ze-switch pre-apply + pre-resume env-override gate (closes PR #1421)
   - Gate fires FIRST in apply AND resume; zero setConfig calls on refusal
   - ASCII warning box (no Unicode per repo D10)
   - --ignore-env-override escape hatch for power users
   - ApplyResult extended with refused variant

5. doctor embedding_env_override check (defense-in-depth for #1421)
   - Cross-surface parity: buildChecks() + doctorReportRemote()
   - Uses Check.details (not Check.issues per codex schema review)

Co-Authored-By: garrytan-agents <garrytan-agents@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.41.10.0)

Adds 61 new tests across 5 new files pinning the fix-wave contracts:
- test/extract-batch-retry.test.ts (16 cases): withRetry primitive + snapshot contract
- test/extract-atoms-page-discovery.test.ts (17 cases): discovery SQL + dual-source idempotency
- test/ze-switch-env-override.test.ts (17 cases): env-gate apply + resume + ZERO-setConfig assertion
- test/doctor-embedding-env-override.test.ts (7 cases): cross-surface parity
- test/e2e/extract-atoms-discovery-sql.test.ts (4 cases): real-Postgres parity for raw SQL

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(test): pin gateway to 1536-dim in 2 PGLite tests that hardcode 1536-vector inserts

CI shards 1 + 4 failed persistently (not a flake; confirmed via retry) after the v0.41.6.0 merge with this error:

    error: expected 1280 dimensions, not 1536
    file: "vector.c", routine: "CheckExpectedDim"

Two test files insert 1536-dim Float32Array vectors into `content_chunks.embedding` / `facts.embedding`, but v0.41.5.0 flipped `DEFAULT_EMBEDDING_DIMENSIONS` from 1536 to 1280 (ZE Matryoshka default). On a fresh CI bun process where no prior test pre-configured the gateway, `initSchema()` sizes the vector column at vector(1280) and the inserts throw.

Locally this is hidden when an earlier test file in the shard happens to have called `configureGateway({embedding_dimensions: 1536})`: that state leaks forward through bun's shared process. The v0.41.6.0 LPT shard re-balancing reordered files so these two ran cold, surfacing the latent bug.

Fix follows the canonical hermetic pattern from test/consolidate-valid-until.test.ts:23-34: pin the gateway to 1536d in beforeAll, reset in afterAll. Test is now isolated from shard ordering.

- test/search-types-filter.test.ts: shard 1 fail
- test/operations-find-trajectory.test.ts: shard 4 (6 fails)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: empty commit to trigger CI

* chore: trigger CI again

* chore: renumber v0.41.10.0 -> v0.41.10.1

Per request: version slot moved to .1 micro tier to leave .0 available for unrelated wave landing on master.

---------

Co-authored-by: garrytan-agents <garrytan-agents@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
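
The batch-retry item in the commit message (a withRetry() primitive plus snapshot-before-clear at each flush() site) can be sketched as below. This is an illustration of the pattern only, not the code in src/commands/extract.ts: the `RetryOptions` shape, the `BatchBuffer` class, and the absence of any backoff delay are assumptions. `isRetryable` stands in for isRetryableConnError, whose real signature is not shown here.

```typescript
// Hypothetical options bag; the real withRetry() signature may differ.
interface RetryOptions {
  attempts: number;
  isRetryable: (err: unknown) => boolean;
  onRetry?: (err: unknown, attempt: number) => void;
}

// Run fn, retrying only on errors the caller classifies as transient.
async function withRetry<T>(fn: () => Promise<T>, opts: RetryOptions): Promise<T> {
  const attempts = Math.max(1, opts.attempts);
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Give up on the last attempt or on a non-transient error.
      if (attempt >= attempts || !opts.isRetryable(err)) throw err;
      opts.onRetry?.(err, attempt);
    }
  }
}

// A buffer whose flush() snapshots the pending rows before clearing them,
// so a retried write re-sends the same batch instead of an emptied one.
class BatchBuffer<T> {
  private pending: T[] = [];
  private readonly write: (rows: T[]) => Promise<void>;
  private readonly retry: RetryOptions;

  constructor(write: (rows: T[]) => Promise<void>, retry: RetryOptions) {
    this.write = write;
    this.retry = retry;
  }

  add(row: T): void {
    this.pending.push(row);
  }

  async flush(): Promise<void> {
    if (this.pending.length === 0) return;
    const snapshot = this.pending; // snapshot first...
    this.pending = []; // ...then clear, so new rows go to a fresh batch
    await withRetry(() => this.write(snapshot), this.retry);
  }
}
```

The point of the snapshot is that the retried closure captures the batch itself rather than a reference to a buffer that has since been reset, which is one plausible way a dropped connection could otherwise lose rows.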

garrytan-agents mentioned this pull request May 25, 2026

fix: batch retry for PgBouncer connection drops + dream.* DB config merge #1416

Closed

garrytan closed this May 25, 2026

garrytan mentioned this pull request May 25, 2026

v0.41.10.1 fix-wave: dream.* config + batch retry + extract_atoms idempotency + ze-switch env-gate #1445

Merged

5 tasks

garrytan-agents wants to merge 1 commit into garrytan:master from garrytan-agents:fix/extract-atoms-general-purpose
