
v0.42.7.0 feat(extract): link/timeline extraction freshness watermark — gbrain extract --stale + doctor lag check (#1696) #1755

Merged: garrytan merged 7 commits into master from garrytan/extract-in-default-loop on Jun 2, 2026
garrytan (Owner) commented Jun 2, 2026

Closes #1696.

What this ships

Extraction is the silent third leg of sync → extract → embed: it reads page text and builds the typed-edge graph (founded, works_at, invested_in, advises) + dated timeline that gbrain think, graph traversal, and link search depend on. Plain gbrain sync only extracts changed pages, so a brain with autopilot off accumulated a links table that was ~99.7% untyped mentions — and nothing surfaced it. A per-page freshness watermark (pages.links_extracted_at, migration v112) plus three things built on it close that gap:

  • gbrain extract --stale [--source-id <id>] [--catch-up] [--dry-run] [--json] — incremental DB-source link+timeline sweep over pages whose extraction is stale (never extracted, edited since, or extractor version bumped). Works on checkout-less Postgres/Supabase brains. Small byte-bounded batches, non-swallowing flush, stamp-after-flush so a crash re-extracts idempotently.
  • links_extraction_lag doctor check (local + remote) — warn-only by default (>20%, GBRAIN_EXTRACTION_LAG_WARN_PCT), hard-fail only via GBRAIN_EXTRACTION_LAG_FAIL_PCT. Vacuous-skip <100 pages; pre-v112 brains graceful-skip.
  • gbrain sync --no-extract + an end-of-sync stderr nudge (fires on synced|first_sync|up_to_date so the initial import surfaces its backlog — the largest-backlog moment).
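The threshold behavior of the links_extraction_lag check can be sketched roughly as follows. This is an illustration under assumptions, not the code in src/commands/doctor.ts: the function and type names are hypothetical, and the strict `>` comparison on the fail threshold is a guess that mirrors the ">20%" wording used for the warn threshold. Only the two environment variable names, the 20% default, and the 100-page floor come from the description above.

```typescript
type LagStatus = "skip" | "ok" | "warn" | "fail";

interface LagThresholds {
  warnPct: number;        // warn when stale% is strictly above this
  failPct: number | null; // null means hard-fail is disabled (the default)
  minPages: number;       // below this page count the check is skipped
}

// Parse a percentage from an env var; anything unset or invalid yields null.
function parsePct(raw: string | undefined): number | null {
  if (raw === undefined || raw.trim() === "") return null;
  const n = Number(raw);
  return Number.isFinite(n) && n >= 0 && n <= 100 ? n : null;
}

// Hypothetical helper: build thresholds from an env-like object.
function thresholdsFromEnv(env: Record<string, string | undefined>): LagThresholds {
  return {
    warnPct: parsePct(env["GBRAIN_EXTRACTION_LAG_WARN_PCT"]) ?? 20,
    failPct: parsePct(env["GBRAIN_EXTRACTION_LAG_FAIL_PCT"]),
    minPages: 100,
  };
}

// Classify the lag. Cross-multiplying avoids floating-point percentage math.
function classifyLag(stale: number, total: number, t: LagThresholds): LagStatus {
  if (total < t.minPages) return "skip"; // vacuous on tiny brains
  if (t.failPct !== null && stale * 100 > t.failPct * total) return "fail";
  if (stale * 100 > t.warnPct * total) return "warn";
  return "ok";
}

const defaults = thresholdsFromEnv({});
const strict = thresholdsFromEnv({ GBRAIN_EXTRACTION_LAG_FAIL_PCT: "50" });
```

With the defaults, exactly 20% stale stays "ok" and only a configured fail threshold can ever produce "fail", which matches the warn-only-by-default intent.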

Stale predicate: links_extracted_at IS NULL OR links_extracted_at < versionTs::timestamptz OR updated_at > links_extracted_at. The three arms correspond to never extracted, extractor version bumped, and edited since the last extraction. The updated_at arm catches the MCP put_page / sync --no-extract "imported ≠ curated" path. Three new BrainEngine methods (countStalePagesForExtraction / listStalePagesForExtraction / markPagesExtractedBatch) ship with Postgres↔PGLite parity + bootstrap probes; migration v112 adds the column + composite (source_id, links_extracted_at) index with no backfill (so the real backlog surfaces on first gbrain doctor).
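For readers who prefer code to SQL, the predicate can be restated in memory as below. This is a sketch only: `PageStamp` and `isStaleForExtraction` are hypothetical names for illustration, and the real check runs in the database through the engine methods listed above.

```typescript
// Hypothetical in-memory shape for the two columns the predicate reads.
interface PageStamp {
  updatedAt: Date;               // pages.updated_at
  linksExtractedAt: Date | null; // pages.links_extracted_at (null = never extracted)
}

// Mirrors: links_extracted_at IS NULL
//       OR links_extracted_at < versionTs
//       OR updated_at > links_extracted_at
function isStaleForExtraction(page: PageStamp, versionTs: Date): boolean {
  if (page.linksExtractedAt === null) return true;                          // never extracted
  if (page.linksExtractedAt.getTime() < versionTs.getTime()) return true;   // extractor bumped
  return page.updatedAt.getTime() > page.linksExtractedAt.getTime();        // edited since
}

const versionTs = new Date("2026-06-01T00:00:00Z");
const fresh: PageStamp = {
  updatedAt: new Date("2026-06-01T10:00:00Z"),
  linksExtractedAt: new Date("2026-06-01T10:00:00Z"),
};
const editedSince: PageStamp = { ...fresh, updatedAt: new Date("2026-06-02T00:00:00Z") };
```

Note that a page stamped with exactly its own updated_at is fresh, because the third arm uses a strict comparison.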

Review

Reviewed as-built (implementation was complete when the session resumed):

  • Eng review: architecture clean; 3 findings, all folded: OOM byte-bound on transcript-heavy brains (default batch 100→25), missing crash-contract test, warn-% threshold de-dup.
  • Codex: 8 findings; 6 folded (including the stamp race: stamp with the read updated_at, not now(); first-sync nudge gated too narrowly; links-only stamp hid timeline staleness; threshold off-by-one), 2 filed as TODOs with rationale (the pre-existing repo-wide DROP INDEX CONCURRENTLY-in-DO pattern; add-only edge reconciliation needs a provenance column). Neither TODO is a #1696 regression.
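Two of the folded findings are easier to see in a small simulation: the byte-bounded batching and the stamp race. The sketch below is an in-memory illustration under assumptions, not the shipped sweep. All names are hypothetical, string length stands in for byte size, the limits are arbitrary, and the version arm of the stale predicate is omitted for brevity.

```typescript
interface Page {
  slug: string;
  body: string;
  updatedAt: number;               // epoch ms, stands in for pages.updated_at
  linksExtractedAt: number | null; // stands in for pages.links_extracted_at
}

// Cap each batch by row count AND by total body size, so a few very large
// transcripts cannot blow up memory. An oversized page still gets its own batch.
function byteBoundedBatches(pages: Page[], maxRows: number, maxBytes: number): Page[][] {
  const batches: Page[][] = [];
  let current: Page[] = [];
  let bytes = 0;
  for (const p of pages) {
    const size = p.body.length; // approximation of byte size
    if (current.length > 0 && (current.length >= maxRows || bytes + size > maxBytes)) {
      batches.push(current);
      current = [];
      bytes = 0;
    }
    current.push(p);
    bytes += size;
  }
  if (current.length > 0) batches.push(current);
  return batches;
}

const isStale = (p: Page): boolean =>
  p.linksExtractedAt === null || p.updatedAt > p.linksExtractedAt;

// Simulate an edit landing between "read the batch" and "stamp the batch".
function simulate(stampWith: "read" | "now"): boolean {
  const row: Page = { slug: "people/alice", body: "text", updatedAt: 100, linksExtractedAt: null };
  const read = { ...row }; // snapshot taken when the batch was listed
  row.updatedAt = 150;     // concurrent edit mid-sweep
  const now = 200;         // wall clock at stamp time
  row.linksExtractedAt = stampWith === "read" ? read.updatedAt : now;
  return isStale(row);
}

const mk = (slug: string): Page => ({ slug, body: "abcd", updatedAt: 1, linksExtractedAt: null });
const batches = byteBoundedBatches([mk("a"), mk("b"), mk("c")], 25, 8);
const staleAfterReadStamp = simulate("read");
const staleAfterNowStamp = simulate("now");
```

Stamping with the read value leaves the concurrently edited page stale, so the next sweep picks it up; stamping with now() would mark it fresh and lose the edit.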

Tests

  • 12,514 pass / 0 fail / 0 skip (full parallel suite, merged with master incl. skillopt).
  • New: test/extract-stale.test.ts (incl. CDX-1 edited-after-stamp regression + crash-contract), test/sync-inline-extract-stamps.serial.test.ts (IRON-RULE), test/sync-nudge-status-gate.test.ts, test/doctor-links-extraction-lag.test.ts, engine-parity (Postgres↔PGLite) for the 3 methods + v112 round-trip.
  • typecheck clean; verify 29/29.

Drive-by

test(audit): batch-retry-audit.test.ts's ENOENT case forgot its GBRAIN_AUDIT_DIR override and read the real ~/.gbrain/audit, flaking on any machine with audit history (pre-existing since v0.41.19.0, unrelated to #1696). Separate commit.
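The hermetic pattern behind this fix can be sketched as below. `pruneAuditDir` is a hypothetical stand-in, not the real pruneOldBatchRetryAuditFiles (whose signature is not shown in this PR); the point is only that the test passes an explicit GBRAIN_AUDIT_DIR pointing at a path that is guaranteed not to exist, so the ENOENT branch is exercised without touching ~/.gbrain/audit.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical stand-in for the prune helper: counts files in the audit dir
// and treats a missing directory as a no-op.
function pruneAuditDir(env: Record<string, string | undefined>): { kept: number; removed: number } {
  const dir = env["GBRAIN_AUDIT_DIR"] ?? path.join(os.homedir(), ".gbrain", "audit");
  let names: string[];
  try {
    names = fs.readdirSync(dir);
  } catch (err) {
    if ((err as { code?: string }).code === "ENOENT") return { kept: 0, removed: 0 };
    throw err;
  }
  return { kept: names.length, removed: 0 };
}

// mkdtempSync creates a fresh unique directory; a child path under it that
// is never created is therefore guaranteed to be missing.
const base = fs.mkdtempSync(path.join(os.tmpdir(), "audit-test-"));
const missingDir = path.join(base, "does-not-exist");
const result = pruneAuditDir({ GBRAIN_AUDIT_DIR: missingDir });
fs.rmSync(base, { recursive: true, force: true }); // clean up the temp parent
```

Without the override, the same call would fall back to the developer's real audit directory, which is the flake described above.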

🤖 Generated with Claude Code

garrytan and others added 6 commits June 1, 2026 09:11
Closes the "imported != curated" gap: plain `gbrain sync` only extracts
CHANGED pages, so a brain with autopilot off accumulated a links table that
was ~99.7% untyped `mentions` with nothing surfacing it. Adds a per-page
freshness watermark (pages.links_extracted_at, migration v112) and three
things built on it:

- `gbrain extract --stale [--source-id] [--catch-up] [--dry-run] [--json]`:
  incremental DB-source link+timeline sweep over pages whose extraction is
  stale (never extracted, edited since, or extractor version bumped). Small
  byte-bounded batches, non-swallowing flush, stamp-after-flush so a crash
  re-extracts idempotently. Stamps with the row's READ updated_at (not now())
  so a concurrent edit during the sweep stays stale instead of being lost.
- `links_extraction_lag` doctor check (local + remote): warn-only by default
  (>20%), hard-fail only via GBRAIN_EXTRACTION_LAG_FAIL_PCT. Vacuous-skip
  <100 pages; pre-v112 brains graceful-skip.
- `gbrain sync --no-extract` flag + end-of-sync nudge (fires on
  synced|first_sync|up_to_date so the initial import surfaces its backlog).

Three new BrainEngine methods (countStalePagesForExtraction /
listStalePagesForExtraction / markPagesExtractedBatch) with Postgres<->PGLite
parity + bootstrap probes. Schema parity: schema.sql + regenerated
pglite-schema.ts + schema-embedded.ts + bootstrap-coverage test. Migration
v112 (composite (source_id, links_extracted_at) index, no backfill so the
real backlog surfaces on first doctor run).

The "no-op when audit dir does not exist (ENOENT)" case called
pruneOldBatchRetryAuditFiles without a GBRAIN_AUDIT_DIR override, so it read
the developer's real ~/.gbrain/audit and flaked (kept>0) on any machine with
prior gbrain audit history. Point it at a guaranteed-nonexistent temp path so
it tests the real missing-dir branch hermetically — matching the file
header's "never touches ~/.gbrain/audit" contract. Pre-existing flake
(introduced by v0.41.19.0 #1537), unrelated to #1696.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-default-loop

# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
…-default-loop

# Conflicts:
#	CHANGELOG.md
#	TODOS.md
#	test/audit/batch-retry-audit.test.ts
garrytan changed the title from "v0.42.2.0 feat(extract): link/timeline extraction freshness watermark — gbrain extract --stale + doctor lag check (#1696)" to "v0.42.7.0 feat(extract): link/timeline extraction freshness watermark — gbrain extract --stale + doctor lag check (#1696)" on Jun 2, 2026
…-default-loop

# Conflicts:
#	CHANGELOG.md
#	TODOS.md
#	VERSION
#	package.json
#	src/commands/doctor.ts
#	test/e2e/engine-parity.test.ts
@garrytan garrytan merged commit ca68a55 into master Jun 2, 2026
21 checks passed
mgunnin added a commit to mgunnin/gbrain that referenced this pull request Jun 3, 2026
* upstream/master:
  v0.42.8.0 feat: content-quality gate on sync — quarantine junk + flag boilerplate (garrytan#1699) (garrytan#1756)
  v0.42.7.0 feat(extract): link/timeline extraction freshness watermark — gbrain extract --stale + doctor lag check (garrytan#1696) (garrytan#1755)
  v0.42.6.0 feat(enrich): gbrain enrich --thin — brain-internal grounded synthesis for stub pages (garrytan#1700) (garrytan#1757)
  v0.42.5.0 fix(minions): RSS watchdog opacity + pooler-reap self-heal + silent lens backlog + cycle lint DB-disconnect (garrytan#1678) (garrytan#1735)
  v0.42.4.0 fix: think --model fails loud — slash-form ids + never persist empty synthesis (garrytan#1698) (garrytan#1736)
  v0.42.3.0 feat(search): autocut — score-discontinuity result-sizing (garrytan#1663 wave 1) (garrytan#1682)
  v0.42.2.0 feat: gbrain connect — one-command Claude Code onboarding from a bearer token (garrytan#1683)
Development

Successfully merging this pull request may close these issues.

extract is not wired into the default maintenance loop — brains silently accumulate 0% link/timeline coverage
