feat(cycle): generalize extract_atoms — page-based extraction + config-plane fallback #1414

Closed
garrytan-agents wants to merge 1 commit into garrytan:master from garrytan-agents:fix/extract-atoms-general-purpose

@garrytan-agents

Contributor

Problem

extract_atoms (v0.41 T5) has two issues that prevent it from working on production brains:

1. Config-plane mismatch

The phase reads dream.synthesize.session_corpus_dir via loadConfig(), which only reads the file plane (~/.gbrain/config.json + env vars). But gbrain config set dream.synthesize.* writes to the DB config table. Result: config appears set but the phase never sees it — silently returns "no transcripts to process."

2. Transcript-only design

The phase ONLY processes raw files from filesystem corpus directories. A brain with thousands of existing pages (meetings, sources, articles) has no extraction path — all that content sits in the DB but extract_atoms can't reach it.

Solution

Fix 1: resolveConfigStr() helper

Walks file config first, falls through to engine.getConfig(key) for DB-plane. Matches precedence model used by loadConfigWithEngine() elsewhere.
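
A minimal sketch of the fallback order described above, not the code from this PR. The `FileConfig` and `ConfigEngine` shapes, the `readDotted` helper, and the three-argument signature are assumptions made for illustration; only the names `resolveConfigStr` and `engine.getConfig(key)` come from the description.

```typescript
// Hypothetical shapes: the real gbrain types are not shown in this PR.
type FileConfig = Record<string, unknown>;

interface ConfigEngine {
  // Assumed to be an async lookup against the DB config table.
  getConfig(key: string): Promise<string | null | undefined>;
}

// Walk a dotted key such as "dream.synthesize.session_corpus_dir"
// through a nested object, returning undefined on any missing segment.
function readDotted(obj: FileConfig, key: string): unknown {
  let cur: unknown = obj;
  for (const part of key.split(".")) {
    if (cur === null || typeof cur !== "object") return undefined;
    cur = (cur as Record<string, unknown>)[part];
  }
  return cur;
}

// File plane wins; the DB plane is consulted only when the file plane
// has no non-empty string for the key.
async function resolveConfigStr(
  fileConfig: FileConfig,
  engine: ConfigEngine,
  key: string,
): Promise<string | undefined> {
  const fromFile = readDotted(fileConfig, key);
  if (typeof fromFile === "string" && fromFile.length > 0) return fromFile;
  const fromDb = await engine.getConfig(key);
  return typeof fromDb === "string" && fromDb.length > 0 ? fromDb : undefined;
}
```

With this order, a value written by `gbrain config set` is visible to the phase whenever the file plane leaves the key unset, which is the failure case described under Problem 1.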

Fix 2: Page-based discovery

discoverExtractablePages() queries engine.listPages() for extractable types (meeting, source, article, video, book, original). Filters out already-processed pages (atoms_extracted: true). Caps at 50 pages/cycle.
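
The filtering rules can be sketched as a pure function over an in-memory page list. This is an illustration under assumptions: the `Page` shape and the `selectExtractablePages` name are hypothetical, and the greenfield and 500-character rules are taken from the commit message further down rather than from this paragraph.

```typescript
// Hypothetical page shape; field names follow the PR text where it gives them.
interface Page {
  slug: string;
  type: string;
  body: string;
  frontmatter: { atoms_extracted?: boolean; imported_from?: string };
}

const EXTRACTABLE_TYPES = new Set([
  "meeting", "source", "article", "video", "book", "original",
]);
const MIN_BODY_CHARS = 500;      // pages shorter than this are skipped
const MAX_PAGES_PER_CYCLE = 50;  // cap that bounds cost per cycle

// Apply every skip rule, then cap the result.
function selectExtractablePages(pages: readonly Page[]): Page[] {
  return pages
    .filter((p) => EXTRACTABLE_TYPES.has(p.type))
    .filter((p) => p.frontmatter.atoms_extracted !== true)
    .filter((p) => p.frontmatter.imported_from !== "markdown-greenfield")
    .filter((p) => p.body.length >= MIN_BODY_CHARS)
    .slice(0, MAX_PAGES_PER_CYCLE);
}
```

Applying the cap last means skipped pages never consume a slot in the 50-page budget.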

Fix 3: Dual-source merge

Transcript + page discovery run together. Deduplicated by content hash. Atom frontmatter distinguishes origin (source_path for transcripts, source_slug for pages). Source pages get atoms_extracted: true after extraction.
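
A sketch of the merge step, assuming SHA-256 as the content hash (the PR does not name the algorithm). The `Candidate` union and the helper names are hypothetical; `source_path` and `source_slug` are the frontmatter keys named above.

```typescript
import { createHash } from "node:crypto";

// Hypothetical candidate shape: one variant per discovery source.
type Candidate =
  | { kind: "transcript"; source_path: string; content: string }
  | { kind: "page"; source_slug: string; content: string };

// Assumption: SHA-256 over the UTF-8 text stands in for the real content hash.
function contentHash(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// Keep the first candidate seen for each distinct content hash.
function dedupeByContentHash(candidates: readonly Candidate[]): Candidate[] {
  const seen = new Set<string>();
  const unique: Candidate[] = [];
  for (const c of candidates) {
    const h = contentHash(c.content);
    if (seen.has(h)) continue;
    seen.add(h);
    unique.push(c);
  }
  return unique;
}

// Origin marker written into atom frontmatter so backlinks resolve for both sources.
function originFrontmatter(c: Candidate): Record<string, string> {
  return c.kind === "transcript"
    ? { source_path: c.source_path }
    : { source_slug: c.source_slug };
}
```

Deduplicating before any LLM call means the same text reaching the phase as both a transcript file and a DB page is only paid for once.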

Testing

13 new tests — all pass first-green. Covers: page extraction, config fallback, dedup, budget cap, skip-already-extracted, skip-greenfield, skip-short, metadata correctness, backward compat.

…nfig-plane fallback

Two bugs fixed, one capability added:

1. Config-plane mismatch (bug): extract_atoms read dream.synthesize.*
   config via loadConfig() (file-plane only), but `gbrain config set`
   writes to DB. Added resolveConfigStr() helper that falls through to
   engine.getConfig() when the file plane returns undefined. Matches the
   precedence model used by loadConfigWithEngine() elsewhere.

2. Transcript-only limitation (design gap): extract_atoms only processed
   raw .txt/.md transcripts from configured corpus directories. Brains
   with content already in the DB (meetings, sources, articles, videos,
   books, originals) had no extraction path. Added page-based discovery
   via engine.listPages() filtered by extractable types. Skips pages
   already marked atoms_extracted: true, imported from markdown-greenfield,
   or shorter than 500 chars. Caps at 50 pages per cycle to bound cost.

3. Dual-source merge: transcript discovery and page discovery run in
   parallel. Results are deduplicated by content hash before LLM calls.
   Atom frontmatter distinguishes origin (source_path for transcripts,
   source_slug for pages) so backlinks work for both.

After successful extraction, source pages get atoms_extracted: true
stamped in frontmatter, preventing re-processing on subsequent cycles.

Budget cap ($0.30/run default) applies across both sources uniformly.

Backward-compatible: transcript-only workflows unchanged. Page discovery
is additive. Test seams (_transcripts, _skipPageDiscovery) preserved.

13 new tests covering: page extraction, config fallback, dedup, budget
cap, skip-already-extracted, skip-greenfield, skip-short, source_slug
vs source_path metadata, backward compat.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@garrytan

Owner

Closing in favor of fix-wave incorporation on garrytan/pr-1414-1416-1421 (shipping as v0.41.2.1). Rebuilt with /plan-eng-review + codex outside-voice structural improvements:

  • Source-hash existence check (idempotency-via-data; survives gbrain sync --force; also fixes the pre-existing transcript-side date-stamp duplicate bug) replaces the frontmatter atoms_extracted: true marker.
  • Single raw SQL with NOT EXISTS subquery (one round-trip) replaces the 6 listPages iterations + per-candidate atom-existence checks.
  • sourceId threaded through every atom putPage call (codex caught a real bug here that would have routed non-default-source atoms to 'default').
  • Discovery SQL filters out dream_generated:true to prevent self-consumption, content_hash IS NOT NULL to prevent crashes, and imported_from='markdown-greenfield' as before.
  • cycle.ts threads UNION of syncPagesAffected + synthesizeWrittenSlugs as affectedSlugs (incremental cycles now see pages synthesize just wrote).
  • The systemic dream.* config merge from #1416 (fix: batch retry for PgBouncer connection drops + dream.* DB config merge) obviates the per-call-site resolveConfigStr() helper.
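
The discovery filters in the bullets above can be modelled in memory as follows. This is a sketch of the logic only, not the raw SQL that shipped: the row shapes, the `source_hash` column on atoms, and the function name are assumptions, while `content_hash`, `dream_generated`, and `imported_from` are the fields named in the bullets.

```typescript
// Hypothetical row shapes standing in for the pages and atoms tables.
interface PageRow {
  slug: string;
  content_hash: string | null;
  dream_generated: boolean;
  imported_from: string | null;
}

interface AtomRow {
  source_hash: string; // assumed: hash of the source content the atom was extracted from
}

// In-memory equivalent of a NOT EXISTS subquery: a page qualifies only when
// no atom already records its current content hash.
function pagesNeedingExtraction(
  pages: readonly PageRow[],
  atoms: readonly AtomRow[],
): PageRow[] {
  const extracted = new Set(atoms.map((a) => a.source_hash));
  return pages.filter(
    (p) =>
      p.content_hash !== null &&                     // NULL hashes would crash downstream
      !p.dream_generated &&                          // avoid self-consumption
      p.imported_from !== "markdown-greenfield" &&   // same exclusion as before
      !extracted.has(p.content_hash),
  );
}
```

Because the check reads the atoms themselves instead of a frontmatter flag on the source page, rewriting the page's frontmatter cannot cause re-extraction, while a change to the page content produces a new hash and makes the page eligible again.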

Thank you for the careful PR description and the 13 tests — they were the design source for the rebuild. The wave ships with full credit on the merged commit.

@garrytan closed this May 25, 2026
garrytan added a commit that referenced this pull request May 25, 2026
…mpotency + ze-switch env-gate (#1445)

* fix-wave: dream.* DB merge + batch retry + extract_atoms idempotency + ze-switch env-gate + doctor check

Closes PRs #1414, #1416, #1421 (rebuilt from designs by @garrytan-agents
with structural improvements from /plan-eng-review + codex outside-voice).

Three production reliability fixes in one wave:

1. dream.* DB-config merge (closes PR #1416 silent-config gap)
   - loadConfigWithEngine() sparse-merge extends with 7 dream.* keys
   - File > DB > defaults precedence (no GBRAIN_DREAM_* env vars)
   - extract-atoms switches to loadConfigWithEngine() so DB-plane keys reach it

2. Batch retry on transient connection drops (closes PR #1416 ~30%-loss bug)
   - withRetry() pure primitive exported from src/commands/extract.ts
   - 6 flush() sites snapshot-before-clear with onRetry callback
   - Reuses isRetryableConnError from src/core/retry-matcher.ts
   - retry-matcher extended with GBrainError{problem:'No database connection'}

3. extract_atoms source-hash idempotency + page-based discovery (closes #1414)
   - One raw SQL with NOT EXISTS subquery replaces 6 listPages + N atom checks
   - sourceId threaded through every putPage call (codex caught real bug)
   - NULL content_hash filter + dream_generated exclusion + transcript-side idempotency
   - cycle.ts passes union of syncPagesAffected + synthesizeWrittenSlugs

4. ze-switch pre-apply + pre-resume env-override gate (closes PR #1421)
   - Gate fires FIRST in apply AND resume; zero setConfig calls on refusal
   - ASCII warning box (no Unicode per repo D10)
   - --ignore-env-override escape hatch for power users
   - ApplyResult extended with refused variant

5. doctor embedding_env_override check (defense-in-depth for #1421)
   - Cross-surface parity: buildChecks() + doctorReportRemote()
   - Uses Check.details (not Check.issues per codex schema review)

Co-Authored-By: garrytan-agents <garrytan-agents@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
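
The batch-retry contract from item 2 of the commit message above can be sketched as below. The signatures are hypothetical: the real `withRetry()` in src/commands/extract.ts and `isRetryableConnError` are not shown on this page, so the options object, the attempt count, and the `flush` helper are assumptions.

```typescript
// Hypothetical options; the real primitive's parameters are not shown here.
interface RetryOptions {
  attempts: number;                                  // total tries, including the first
  isRetryable: (err: unknown) => boolean;            // e.g. a connection-drop matcher
  onRetry?: (err: unknown, attempt: number) => void; // observer hook called before each retry
}

// Re-run fn while it fails with a retryable error and attempts remain.
async function withRetry<T>(fn: () => Promise<T>, opts: RetryOptions): Promise<T> {
  let lastErr: unknown;
  for (let attempt = 1; attempt <= opts.attempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (attempt === opts.attempts || !opts.isRetryable(err)) throw err;
      opts.onRetry?.(err, attempt);
    }
  }
  throw lastErr;
}

// Snapshot-before-clear: copy the batch out of the buffer first, so every
// retry re-sends the same rows even if new rows are buffered in the meantime.
async function flush<Row>(
  buffer: Row[],
  write: (rows: Row[]) => Promise<void>,
  opts: RetryOptions,
): Promise<void> {
  const snapshot = buffer.splice(0, buffer.length);
  if (snapshot.length === 0) return;
  await withRetry(() => write(snapshot), opts);
}
```

Taking the snapshot before clearing is what keeps a dropped connection from discarding a batch: the rows live in `snapshot` until `write` succeeds or the error is rethrown.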

* chore: bump version and changelog (v0.41.10.0)

Adds 61 new tests across 5 new files pinning the fix-wave contracts:
- test/extract-batch-retry.test.ts (16 cases) — withRetry primitive + snapshot contract
- test/extract-atoms-page-discovery.test.ts (17 cases) — discovery SQL + dual-source idempotency
- test/ze-switch-env-override.test.ts (17 cases) — env-gate apply + resume + ZERO-setConfig assertion
- test/doctor-embedding-env-override.test.ts (7 cases) — cross-surface parity
- test/e2e/extract-atoms-discovery-sql.test.ts (4 cases) — real-Postgres parity for raw SQL

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(test): pin gateway to 1536-dim in 2 PGLite tests that hardcode 1536-vector inserts

CI shards 1 + 4 failed persistently (not flake — confirmed via retry) after the
v0.41.6.0 merge with this error:

  error: expected 1280 dimensions, not 1536
  file: "vector.c", routine: "CheckExpectedDim"

Two test files insert 1536-dim Float32Array vectors into `content_chunks.embedding`
/ `facts.embedding`, but v0.41.5.0 flipped `DEFAULT_EMBEDDING_DIMENSIONS` from
1536 to 1280 (ZE Matryoshka default). On a fresh CI bun process where no prior
test pre-configured the gateway, `initSchema()` sizes the vector column at
vector(1280) and the inserts throw.

Locally this is hidden when an earlier test file in the shard happens to have
called `configureGateway({embedding_dimensions: 1536})` — that state leaks
forward through bun's shared process. The v0.41.6.0 LPT shard re-balancing
reordered files so these two ran cold, surfacing the latent bug.

Fix follows the canonical hermetic pattern from
test/consolidate-valid-until.test.ts:23-34: pin the gateway to 1536d in
beforeAll, reset in afterAll. Test is now isolated from shard ordering.

  test/search-types-filter.test.ts     — shard 1 fail
  test/operations-find-trajectory.test.ts — shard 4 (6 fails)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: empty commit to trigger CI

* chore: trigger CI again

* chore: renumber v0.41.10.0 -> v0.41.10.1

Per request — version slot moved to .1 micro tier to leave .0 available
for unrelated wave landing on master.

---------

Co-authored-by: garrytan-agents <garrytan-agents@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>