v0.42.7.0 feat(extract): link/timeline extraction freshness watermark — gbrain extract --stale + doctor lag check (#1696) by garrytan · Pull Request #1755 · garrytan/gbrain

garrytan · 2026-06-02T02:10:20Z

Closes #1696.

What this ships

Extraction is the silent third leg of sync → extract → embed: it reads page text and builds the typed-edge graph (founded, works_at, invested_in, advises) + dated timeline that gbrain think, graph traversal, and link search depend on. Plain gbrain sync only extracts changed pages, so a brain with autopilot off accumulated a links table that was ~99.7% untyped mentions — and nothing surfaced it. A per-page freshness watermark (pages.links_extracted_at, migration v112) plus three things built on it close that gap:

gbrain extract --stale [--source-id <id>] [--catch-up] [--dry-run] [--json] — incremental DB-source link+timeline sweep over pages whose extraction is stale (never extracted, edited since, or extractor version bumped). Works on checkout-less Postgres/Supabase brains. Small byte-bounded batches, non-swallowing flush, stamp-after-flush so a crash re-extracts idempotently.
links_extraction_lag doctor check (local + remote) — warn-only by default (>20%, GBRAIN_EXTRACTION_LAG_WARN_PCT), hard-fail only via GBRAIN_EXTRACTION_LAG_FAIL_PCT. Vacuous-skip <100 pages; pre-v112 brains graceful-skip.
gbrain sync --no-extract + an end-of-sync stderr nudge (fires on synced|first_sync|up_to_date so the initial import surfaces its backlog — the largest-backlog moment).
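
The threshold behavior of the links_extraction_lag check described above can be sketched roughly as follows. This is a minimal illustration, not the code in src/commands/doctor.ts: the function and type names are hypothetical, the environment is passed in as a plain object, and the strict `>` comparison on the fail threshold is an assumption (the description only states `>20%` for the warn side).

```typescript
// Hypothetical sketch of the warn/fail/skip decision; names are illustrative.
type LagStatus = "ok" | "warn" | "fail" | "skipped";
type Env = Record<string, string | undefined>;

const MIN_PAGES = 100; // below this the check is vacuous and is skipped
const DEFAULT_WARN_PCT = 20; // warn-only default from the PR description

// Parse a percentage from an env value; anything invalid counts as "unset".
function parsePct(raw: string | undefined): number | undefined {
  if (raw === undefined || raw.trim() === "") return undefined;
  const n = Number(raw);
  return Number.isFinite(n) && n >= 0 && n <= 100 ? n : undefined;
}

function classifyExtractionLag(
  totalPages: number,
  stalePages: number,
  env: Env,
): { status: LagStatus; lagPct: number } {
  if (totalPages < MIN_PAGES) return { status: "skipped", lagPct: 0 };

  const lagPct = (stalePages * 100) / totalPages;
  const warnPct = parsePct(env.GBRAIN_EXTRACTION_LAG_WARN_PCT) ?? DEFAULT_WARN_PCT;
  // Hard-fail is opt-in: with no FAIL_PCT set, the worst outcome is "warn".
  const failPct = parsePct(env.GBRAIN_EXTRACTION_LAG_FAIL_PCT);

  if (failPct !== undefined && lagPct > failPct) return { status: "fail", lagPct };
  if (lagPct > warnPct) return { status: "warn", lagPct };
  return { status: "ok", lagPct };
}
```

Under these assumptions a fully stale 1,000-page brain only warns unless GBRAIN_EXTRACTION_LAG_FAIL_PCT is set, which matches the warn-only default.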

Stale predicate: links_extracted_at IS NULL OR links_extracted_at < versionTs::timestamptz OR updated_at > links_extracted_at. The updated_at arm catches the MCP put_page / sync --no-extract "imported ≠ curated" path. Three new BrainEngine methods (countStalePagesForExtraction / listStalePagesForExtraction / markPagesExtractedBatch) with Postgres↔PGLite parity + bootstrap probes; migration v112 adds the column + composite (source_id, links_extracted_at) index with no backfill (so the real backlog surfaces on first gbrain doctor).
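
For readers who prefer code to SQL, the three arms of the stale predicate can be mirrored in a small in-memory function. This is only a sketch: the `PageRow` shape and the function name are hypothetical, and the real check runs as SQL inside the engine methods.

```typescript
// Hypothetical in-memory mirror of the SQL stale predicate.
interface PageRow {
  updatedAt: Date;
  linksExtractedAt: Date | null; // null means the page was never extracted
}

function isExtractionStale(page: PageRow, versionTs: Date): boolean {
  // Arm 1: never extracted.
  if (page.linksExtractedAt === null) return true;
  // Arm 2: extracted before the current extractor version was introduced.
  if (page.linksExtractedAt.getTime() < versionTs.getTime()) return true;
  // Arm 3: edited after the last extraction (put_page, sync --no-extract).
  return page.updatedAt.getTime() > page.linksExtractedAt.getTime();
}
```

A page stamped with its own updated_at is fresh under the third arm, since the comparison is strict.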

Review

Reviewed as-built (implementation was complete when the session resumed):

Eng review: architecture clean; 3 findings, all folded: OOM byte-bound on transcript-heavy brains (default batch 100→25), missing crash-contract test, warn-% threshold de-dup.
Codex: 8 findings; 6 folded (including stamp-race: stamp with read updated_at not now(); first-sync nudge gated too narrowly; links-only stamp hid timeline staleness; threshold off-by-one), 2 filed as TODOs with rationale (pre-existing DROP INDEX CONCURRENTLY-in-DO repo-wide pattern; add-only edge reconciliation needs a provenance column; neither is a #1696 regression).
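
Two of the folded findings (byte-bounded batches and stamping with the read updated_at) are easier to see in code. The sketch below is an assumption-laden illustration rather than the shipped sweep: `StalePage`, `Store`, `splitIntoBatches`, and `sweep` are hypothetical names, `byteLength` is assumed to be supplied by the database, and the 1 MiB byte cap is invented for the example (only the batch size of 25 comes from the review notes).

```typescript
// Hypothetical sketch of a byte-bounded, stamp-after-flush sweep.
interface StalePage {
  id: number;
  byteLength: number; // assumed to come from the DB alongside the page text
  updatedAt: Date; // the updated_at value READ when the page was listed
}

interface Store {
  flush(batch: StalePage[]): Promise<void>; // writes links + timeline; throws on failure
  stamp(stamps: { id: number; extractedAt: Date }[]): Promise<void>;
}

// Close a batch when adding the next page would exceed either bound.
// A single oversized page still gets a batch of its own.
function splitIntoBatches(pages: StalePage[], maxPages: number, maxBytes: number): StalePage[][] {
  const batches: StalePage[][] = [];
  let batch: StalePage[] = [];
  let bytes = 0;
  for (const page of pages) {
    const full = batch.length >= maxPages || bytes + page.byteLength > maxBytes;
    if (batch.length > 0 && full) {
      batches.push(batch);
      batch = [];
      bytes = 0;
    }
    batch.push(page);
    bytes += page.byteLength;
  }
  if (batch.length > 0) batches.push(batch);
  return batches;
}

async function sweep(
  pages: StalePage[],
  store: Store,
  maxPages = 25,
  maxBytes = 1024 * 1024,
): Promise<number> {
  let done = 0;
  for (const batch of splitIntoBatches(pages, maxPages, maxBytes)) {
    // Flush errors propagate: a failed batch is never stamped, so a crash
    // here leaves its pages stale and the next run re-extracts them.
    await store.flush(batch);
    // Stamp with the updated_at that was read, not the current time, so an
    // edit landing mid-sweep still satisfies updated_at > links_extracted_at.
    await store.stamp(batch.map((p) => ({ id: p.id, extractedAt: p.updatedAt })));
    done += batch.length;
  }
  return done;
}
```

Bounding by bytes as well as page count is what keeps a batch of very large transcript pages from being held in memory at once.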

Tests

12,514 pass / 0 fail / 0 skip (full parallel suite, merged with master incl. skillopt).
New: test/extract-stale.test.ts (incl. CDX-1 edited-after-stamp regression + crash-contract), test/sync-inline-extract-stamps.serial.test.ts (IRON-RULE), test/sync-nudge-status-gate.test.ts, test/doctor-links-extraction-lag.test.ts, engine-parity (Postgres↔PGLite) for the 3 methods + v112 round-trip.
typecheck clean; verify 29/29.

Drive-by

test(audit): batch-retry-audit.test.ts's ENOENT case forgot its GBRAIN_AUDIT_DIR override and read the real ~/.gbrain/audit, flaking on any machine with audit history (pre-existing since v0.41.19.0, unrelated to #1696). Separate commit.

🤖 Generated with Claude Code

Closes the "imported != curated" gap: plain `gbrain sync` only extracts CHANGED pages, so a brain with autopilot off accumulated a links table that was ~99.7% untyped `mentions` with nothing surfacing it. Adds a per-page freshness watermark (pages.links_extracted_at, migration v112) and three things built on it:

- `gbrain extract --stale [--source-id] [--catch-up] [--dry-run] [--json]`: incremental DB-source link+timeline sweep over pages whose extraction is stale (never extracted, edited since, or extractor version bumped). Small byte-bounded batches, non-swallowing flush, stamp-after-flush so a crash re-extracts idempotently. Stamps with the row's READ updated_at (not now()) so a concurrent edit during the sweep stays stale instead of being lost.
- `links_extraction_lag` doctor check (local + remote): warn-only by default (>20%), hard-fail only via GBRAIN_EXTRACTION_LAG_FAIL_PCT. Vacuous-skip <100 pages; pre-v112 brains graceful-skip.
- `gbrain sync --no-extract` flag + end-of-sync nudge (fires on synced|first_sync|up_to_date so the initial import surfaces its backlog).

Three new BrainEngine methods (countStalePagesForExtraction / listStalePagesForExtraction / markPagesExtractedBatch) with Postgres<->PGLite parity + bootstrap probes. Schema parity: schema.sql + regenerated pglite-schema.ts + schema-embedded.ts + bootstrap-coverage test. Migration v112 (composite (source_id, links_extracted_at) index, no backfill so the real backlog surfaces on first doctor run).

The "no-op when audit dir does not exist (ENOENT)" case called pruneOldBatchRetryAuditFiles without a GBRAIN_AUDIT_DIR override, so it read the developer's real ~/.gbrain/audit and flaked (kept>0) on any machine with prior gbrain audit history. Point it at a guaranteed-nonexistent temp path so it tests the real missing-dir branch hermetically — matching the file header's "never touches ~/.gbrain/audit" contract. Pre-existing flake (introduced by v0.41.19.0 #1537), unrelated to #1696.
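
The hermetic-path trick from this commit message can be sketched as below. This is an illustration only: `pruneAuditDir` is a hypothetical stand-in that takes the directory as an argument, whereas the real pruneOldBatchRetryAuditFiles and its GBRAIN_AUDIT_DIR handling are not shown in this PR text. The point is the way the missing directory is constructed.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical stand-in for the pruning helper: a missing directory is a no-op.
function pruneAuditDir(dir: string): { kept: number; removed: number } {
  let entries: string[];
  try {
    entries = fs.readdirSync(dir);
  } catch (err) {
    if ((err as { code?: string }).code === "ENOENT") return { kept: 0, removed: 0 };
    throw err;
  }
  return { kept: entries.length, removed: 0 };
}

// A child of a freshly created temp directory cannot already exist, so this
// path exercises the real ENOENT branch without touching ~/.gbrain/audit.
function nonexistentTempDir(): string {
  const parent = fs.mkdtempSync(path.join(os.tmpdir(), "audit-test-"));
  return path.join(parent, "does-not-exist");
}
```

A test would then point the override at `nonexistentTempDir()` and assert that nothing was kept or removed.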

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


* upstream/master:
  v0.42.8.0 feat: content-quality gate on sync — quarantine junk + flag boilerplate (garrytan#1699) (garrytan#1756)
  v0.42.7.0 feat(extract): link/timeline extraction freshness watermark — gbrain extract --stale + doctor lag check (garrytan#1696) (garrytan#1755)
  v0.42.6.0 feat(enrich): gbrain enrich --thin — brain-internal grounded synthesis for stub pages (garrytan#1700) (garrytan#1757)
  v0.42.5.0 fix(minions): RSS watchdog opacity + pooler-reap self-heal + silent lens backlog + cycle lint DB-disconnect (garrytan#1678) (garrytan#1735)
  v0.42.4.0 fix: think --model fails loud — slash-form ids + never persist empty synthesis (garrytan#1698) (garrytan#1736)
  v0.42.3.0 feat(search): autocut — score-discontinuity result-sizing (garrytan#1663 wave 1) (garrytan#1682)
  v0.42.2.0 feat: gbrain connect — one-command Claude Code onboarding from a bearer token (garrytan#1683)

garrytan and others added 6 commits June 1, 2026 09:11

chore: bump version and changelog (v0.42.2.0)

956035f

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Merge remote-tracking branch 'origin/master' into garrytan/extract-in…

58d128d

…-default-loop # Conflicts: # CHANGELOG.md # VERSION # package.json

docs: CLAUDE.md key-files entry for the #1696 extract-stale wave + re…

01b36e7

…gen llms-full

Merge remote-tracking branch 'origin/master' into garrytan/extract-in…

cd41302

…-default-loop # Conflicts: # CHANGELOG.md # TODOS.md # test/audit/batch-retry-audit.test.ts

garrytan changed the title ~~v0.42.2.0 feat(extract): link/timeline extraction freshness watermark — gbrain extract --stale + doctor lag check (#1696)~~ v0.42.7.0 feat(extract): link/timeline extraction freshness watermark — gbrain extract --stale + doctor lag check (#1696) Jun 2, 2026

Merge remote-tracking branch 'origin/master' into garrytan/extract-in…

fba3736

…-default-loop # Conflicts: # CHANGELOG.md # TODOS.md # VERSION # package.json # src/commands/doctor.ts # test/e2e/engine-parity.test.ts

garrytan merged commit ca68a55 into master Jun 2, 2026
21 checks passed

garrytan mentioned this pull request Jun 8, 2026

fix: clear stale extraction and health metrics #1850

Closed

garrytan merged 7 commits into master from garrytan/extract-in-default-loop
