v0.42.7.0 feat(extract): link/timeline extraction freshness watermark — gbrain extract --stale + doctor lag check (#1696)#1755
Merged
Closes the "imported != curated" gap: plain `gbrain sync` only extracts CHANGED pages, so a brain with autopilot off accumulated a links table that was ~99.7% untyped `mentions` with nothing surfacing it. Adds a per-page freshness watermark (pages.links_extracted_at, migration v112) and three things built on it:

- `gbrain extract --stale [--source-id] [--catch-up] [--dry-run] [--json]`: incremental DB-source link+timeline sweep over pages whose extraction is stale (never extracted, edited since, or extractor version bumped). Small byte-bounded batches, non-swallowing flush, stamp-after-flush so a crash re-extracts idempotently. Stamps with the row's READ updated_at (not now()) so a concurrent edit during the sweep stays stale instead of being lost.
- `links_extraction_lag` doctor check (local + remote): warn-only by default (>20%), hard-fail only via GBRAIN_EXTRACTION_LAG_FAIL_PCT. Vacuous-skip <100 pages; pre-v112 brains graceful-skip.
- `gbrain sync --no-extract` flag + end-of-sync nudge (fires on synced|first_sync|up_to_date so the initial import surfaces its backlog).

Three new BrainEngine methods (countStalePagesForExtraction / listStalePagesForExtraction / markPagesExtractedBatch) with Postgres<->PGLite parity + bootstrap probes.

Schema parity: schema.sql + regenerated pglite-schema.ts + schema-embedded.ts + bootstrap-coverage test.

Migration v112 (composite (source_id, links_extracted_at) index, no backfill so the real backlog surfaces on first doctor run).
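The staleness rule and the "stamp with the READ updated_at" choice described in this commit message can be illustrated with a small in-memory sketch. Everything here is an assumption for illustration: `PageRow`, `isStale`, and `stampWithReadUpdatedAt` are hypothetical names, not gbrain's engine methods, and the real check runs as SQL against `pages`. The sketch also uses an extractor version timestamp older than the page, since the source does not show how the real stamp interacts with the version arm.

```typescript
// Hypothetical in-memory model of a page row. Only the two timestamp
// columns (updated_at, links_extracted_at) are named in the source.
interface PageRow {
  slug: string;
  updatedAt: Date;
  linksExtractedAt: Date | null;
}

// Three arms: never extracted, stamped before the extractor version
// timestamp, or edited after the last stamp.
function isStale(page: PageRow, versionTs: Date): boolean {
  if (page.linksExtractedAt === null) return true;
  if (page.linksExtractedAt.getTime() < versionTs.getTime()) return true;
  return page.updatedAt.getTime() > page.linksExtractedAt.getTime();
}

// Stamp with the updated_at value observed when the row was read,
// not with the wall-clock time at which the stamp is written.
function stampWithReadUpdatedAt(page: PageRow, readUpdatedAt: Date): void {
  page.linksExtractedAt = readUpdatedAt;
}

const versionTs = new Date("2026-01-01T00:00:00Z");
const page: PageRow = {
  slug: "people/alice",
  updatedAt: new Date("2026-05-01T10:00:00Z"),
  linksExtractedAt: null,
};

// The sweep reads the row, then a concurrent edit lands before the stamp.
const readUpdatedAt = page.updatedAt;
page.updatedAt = new Date("2026-05-01T10:05:00Z");
stampWithReadUpdatedAt(page, readUpdatedAt);
const staleAfterReadStamp = isStale(page, versionTs);

// Counter-example: a now()-style stamp taken after the edit hides it.
const nowStamped: PageRow = {
  ...page,
  linksExtractedAt: new Date("2026-05-01T10:06:00Z"),
};
const staleAfterNowStamp = isStale(nowStamped, versionTs);

console.log({ staleAfterReadStamp, staleAfterNowStamp });
```

With the read-time stamp the concurrently edited page is still reported stale and gets picked up by the next sweep; with a now()-style stamp it would look fresh even though the edit was never extracted.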
The "no-op when audit dir does not exist (ENOENT)" case called pruneOldBatchRetryAuditFiles without a GBRAIN_AUDIT_DIR override, so it read the developer's real ~/.gbrain/audit and flaked (kept>0) on any machine with prior gbrain audit history. Point it at a guaranteed-nonexistent temp path so it tests the real missing-dir branch hermetically — matching the file header's "never touches ~/.gbrain/audit" contract. Pre-existing flake (introduced by v0.41.19.0 #1537), unrelated to #1696.
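A minimal sketch of the hermetic pattern this commit describes: point the audit directory at a path that is guaranteed not to exist, so the missing-directory branch is exercised without reading `~/.gbrain/audit`. The helpers `resolveAuditDir`, `pruneAuditDir`, and `nonexistentAuditDir` are hypothetical stand-ins; the signature and behavior of the real `pruneOldBatchRetryAuditFiles` are not shown in the source.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Assumed resolution order: the env override wins, else the real home dir.
function resolveAuditDir(): string {
  return process.env.GBRAIN_AUDIT_DIR ?? path.join(os.homedir(), ".gbrain", "audit");
}

// Hypothetical stand-in for a prune helper. It only counts entries and
// treats a missing directory (ENOENT) as a no-op.
function pruneAuditDir(): { kept: number; removed: number } {
  let entries: string[];
  try {
    entries = fs.readdirSync(resolveAuditDir());
  } catch (err) {
    if ((err as { code?: string }).code === "ENOENT") {
      return { kept: 0, removed: 0 };
    }
    throw err;
  }
  return { kept: entries.length, removed: 0 };
}

// A fresh unique temp directory plus a child that is never created
// yields a path that cannot already exist.
function nonexistentAuditDir(): string {
  const base = fs.mkdtempSync(path.join(os.tmpdir(), "audit-enoent-"));
  return path.join(base, "does-not-exist");
}

const previous = process.env.GBRAIN_AUDIT_DIR;
const missingDir = nonexistentAuditDir();
process.env.GBRAIN_AUDIT_DIR = missingDir;
const result = pruneAuditDir();

// Restore the environment so later tests see the original value.
if (previous === undefined) {
  delete process.env.GBRAIN_AUDIT_DIR;
} else {
  process.env.GBRAIN_AUDIT_DIR = previous;
}

console.log(missingDir, result);
```

Without the override, the same call would resolve to the developer's real audit directory and report `kept > 0` on any machine with audit history, which is the flake described above.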
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-default-loop # Conflicts: # CHANGELOG.md # VERSION # package.json
…-default-loop # Conflicts: # CHANGELOG.md # TODOS.md # test/audit/batch-retry-audit.test.ts
…-default-loop # Conflicts: # CHANGELOG.md # TODOS.md # VERSION # package.json # src/commands/doctor.ts # test/e2e/engine-parity.test.ts
mgunnin added a commit to mgunnin/gbrain that referenced this pull request on Jun 3, 2026
* upstream/master: v0.42.8.0 feat: content-quality gate on sync — quarantine junk + flag boilerplate (garrytan#1699) (garrytan#1756) v0.42.7.0 feat(extract): link/timeline extraction freshness watermark — gbrain extract --stale + doctor lag check (garrytan#1696) (garrytan#1755) v0.42.6.0 feat(enrich): gbrain enrich --thin — brain-internal grounded synthesis for stub pages (garrytan#1700) (garrytan#1757) v0.42.5.0 fix(minions): RSS watchdog opacity + pooler-reap self-heal + silent lens backlog + cycle lint DB-disconnect (garrytan#1678) (garrytan#1735) v0.42.4.0 fix: think --model fails loud — slash-form ids + never persist empty synthesis (garrytan#1698) (garrytan#1736) v0.42.3.0 feat(search): autocut — score-discontinuity result-sizing (garrytan#1663 wave 1) (garrytan#1682) v0.42.2.0 feat: gbrain connect — one-command Claude Code onboarding from a bearer token (garrytan#1683)
Closes #1696.
What this ships
Extraction is the silent third leg of sync → extract → embed: it reads page text and builds the typed-edge graph (`founded`, `works_at`, `invested_in`, `advises`) plus the dated timeline that `gbrain think`, graph traversal, and link search depend on. Plain `gbrain sync` only extracts changed pages, so a brain with autopilot off accumulated a links table that was ~99.7% untyped `mentions`, and nothing surfaced it. A per-page freshness watermark (`pages.links_extracted_at`, migration v112) plus three things built on it close that gap:

- `gbrain extract --stale [--source-id <id>] [--catch-up] [--dry-run] [--json]`: incremental DB-source link+timeline sweep over pages whose extraction is stale (never extracted, edited since, or extractor version bumped). Works on checkout-less Postgres/Supabase brains. Small byte-bounded batches, non-swallowing flush, stamp-after-flush so a crash re-extracts idempotently.
- `links_extraction_lag` doctor check (local + remote): warn-only by default (>20%, `GBRAIN_EXTRACTION_LAG_WARN_PCT`), hard-fail only via `GBRAIN_EXTRACTION_LAG_FAIL_PCT`. Vacuous-skip <100 pages; pre-v112 brains graceful-skip.
- `gbrain sync --no-extract` + an end-of-sync stderr nudge (fires on `synced|first_sync|up_to_date` so the initial import, the largest-backlog moment, surfaces its backlog).

Stale predicate: `links_extracted_at IS NULL OR links_extracted_at < versionTs::timestamptz OR updated_at > links_extracted_at`. The `updated_at` arm catches the MCP `put_page` / `sync --no-extract` "imported ≠ curated" path. Three new BrainEngine methods (`countStalePagesForExtraction` / `listStalePagesForExtraction` / `markPagesExtractedBatch`) with Postgres↔PGLite parity + bootstrap probes; migration v112 adds the column + composite `(source_id, links_extracted_at)` index with no backfill (so the real backlog surfaces on first `gbrain doctor`).

Review
Reviewed as-built (implementation was complete when the session resumed):

`updated_at` not `now()`; first-sync nudge gated too narrowly; links-only stamp hid timeline staleness; threshold off-by-one), 2 filed as TODOs with rationale (pre-existing `DROP INDEX CONCURRENTLY`-in-`DO` repo-wide pattern; add-only edge reconciliation needs a provenance column; neither is a regression from #1696, "extract is not wired into the default maintenance loop: brains silently accumulate 0% link/timeline coverage").

Tests

`test/extract-stale.test.ts` (incl. CDX-1 edited-after-stamp regression + crash-contract), `test/sync-inline-extract-stamps.serial.test.ts` (IRON-RULE), `test/sync-nudge-status-gate.test.ts`, `test/doctor-links-extraction-lag.test.ts`, engine-parity (Postgres↔PGLite) for the 3 methods + v112 round-trip. `verify` 29/29.

Drive-by

`test(audit)`: `batch-retry-audit.test.ts`'s ENOENT case forgot its `GBRAIN_AUDIT_DIR` override and read the real `~/.gbrain/audit`, flaking on any machine with audit history (pre-existing since v0.41.19.0, unrelated to #1696). Separate commit.

🤖 Generated with Claude Code
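For reference, the `links_extraction_lag` thresholds listed under "What this ships" can be sketched as a small classifier. This is an illustration under stated assumptions, not the code in `src/commands/doctor.ts`: `classifyLag`, `LagInput`, and `LagStatus` are hypothetical names, the strict `>` on the fail threshold is assumed by analogy with the documented `>20%` warn rule, and the pre-v112 graceful skip is not modeled.

```typescript
type LagStatus = "skip" | "ok" | "warn" | "fail";

interface LagInput {
  totalPages: number;
  stalePages: number;
  warnPct?: number; // assumed to mirror GBRAIN_EXTRACTION_LAG_WARN_PCT, default 20
  failPct?: number; // assumed to mirror GBRAIN_EXTRACTION_LAG_FAIL_PCT, unset = never fail
}

function classifyLag(input: LagInput): { status: LagStatus; lagPct: number } {
  const { totalPages, stalePages, warnPct = 20, failPct } = input;

  // Vacuous skip: too few pages for the percentage to mean anything.
  if (totalPages < 100) return { status: "skip", lagPct: 0 };

  const lagPct = (stalePages * 100) / totalPages;

  // Compare by cross-multiplying so that exactly 20% is not pushed over
  // the threshold by floating-point rounding (0.2 * 100 !== 20 in JS).
  const exceeds = (pct: number): boolean => stalePages * 100 > pct * totalPages;

  if (failPct !== undefined && exceeds(failPct)) return { status: "fail", lagPct };
  if (exceeds(warnPct)) return { status: "warn", lagPct };
  return { status: "ok", lagPct };
}

const small = classifyLag({ totalPages: 50, stalePages: 50 });
const atThreshold = classifyLag({ totalPages: 1000, stalePages: 200 });
const justOver = classifyLag({ totalPages: 1000, stalePages: 201 });
const warnOnly = classifyLag({ totalPages: 1000, stalePages: 900 });
const hardFail = classifyLag({ totalPages: 1000, stalePages: 900, failPct: 50 });

console.log(small.status, atThreshold.status, justOver.status, warnOnly.status, hardFail.status);
```

Keeping the fail threshold opt-in means a brain with a large backlog still gets a warning from `gbrain doctor` by default, and only fails the check when an operator has explicitly set a fail percentage.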