Symptom (real-world, 280K-page brain)
On a production brain of ~280K pages, pages.edges_backfilled_at was NULL on every page, and the link table was 99.6% untyped mentions (1,042,187 of 1,045,858 links). Only 3,671 typed edges existed in total, including:
mentions: 1042187
attended: 887
related_to: 774
founded: 251
works_at: 221
invested_in: 70
advises: 39
The brain had been ingesting for months. extract all had effectively never run at scale, so the graph could find entity names in text but had almost no semantic relationships (founded, works_at, invested_in, advises) to reason over. gbrain think, graph traversal, and link-based retrieval were all running on a graph that was structurally a bag of mentions.
After manually running extract all once, typed edges jumped to founded 7,232 / invested_in 3,016 / works_at 2,417 / advises 1,430: a gain of roughly 13,500 typed edges across these four types (581 before, 14,095 after) from a pass that had simply never been triggered.
Root cause
autopilot's cycle is sync → extract → embed, but when autopilot isn't installed/running (the common case for CLI-first or externally-cron'd deployments), nothing ever calls extract. sync and embed are the steps people wire into their own cron because they're the obvious 'ingest' and 'make searchable' steps. extract is the silent third leg, and there is no warning, health-check failure, or doctor signal when it's been skipped.
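For externally-cron'd deployments, one stopgap is a small wrapper that runs all three legs in autopilot's order. The sketch below is an illustration, not part of gbrain: `run_cycle` and `CYCLE` are hypothetical names, and the exact `gbrain sync` and `gbrain embed --stale` command lines are assumptions (only `gbrain extract all` and the `embed --stale` flag are quoted in this issue). Because `extract --stale` does not exist yet, each cycle pays for a full extract pass.

```python
import subprocess
import sys

# Assumed command lines. `gbrain extract all` is quoted in this issue;
# the `gbrain sync` and `gbrain embed --stale` invocations are assumptions.
CYCLE = [
    ["gbrain", "sync"],
    ["gbrain", "extract", "all"],
    ["gbrain", "embed", "--stale"],
]


def run_cycle(run=subprocess.run, steps=CYCLE):
    """Run each step in order; stop at the first failure and return its exit code."""
    for cmd in steps:
        result = run(cmd)
        if result.returncode != 0:
            # Surface the skipped leg instead of failing silently.
            print(
                f"[cycle] {' '.join(cmd)} exited with {result.returncode}",
                file=sys.stderr,
            )
            return result.returncode
    return 0
```

A cron entry would then call a script that ends with `sys.exit(run_cycle())`. The `run` parameter exists only so the ordering can be exercised without gbrain installed.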
Proposed fixes (any subset)
- doctor check for extraction lag. Add a health signal: % of non-tweet/non-digest pages with edges_backfilled_at older than EDGE_EXTRACTOR_VERSION_TS or NULL. Fail (or warn loudly) above a threshold (e.g. >20%). This is the cheapest high-leverage fix: it makes the gap visible instead of silent.
- Fold extract into sync by default with an opt-out flag (sync --no-extract), so the canonical 'pull new content' command also types its edges. sync already knows which pages are new/changed, so this is incremental, not a full re-scan.
- Emit a one-line stderr nudge at the end of sync when extraction coverage is low: [sync] N pages have no extracted edges - run 'gbrain extract all' or enable autopilot.
- extract --stale mirroring embed --stale: only process pages whose edges_backfilled_at is NULL or older than the extractor version timestamp. Makes the incremental case trivial to cron.
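The doctor check, the sync nudge, and extract --stale all reduce to the same staleness predicate, which the following sketch tries to make concrete. It is only an illustration under stated assumptions: it uses Python's sqlite3 in place of the real database, the function names are hypothetical, the `id` column and the timestamp value are made up, and the non-tweet/non-digest filter is omitted because this issue does not name the page-type column.

```python
import sqlite3

# Hypothetical values: the issue names EDGE_EXTRACTOR_VERSION_TS and suggests
# a ~20% threshold, but this concrete timestamp is invented for the sketch.
EDGE_EXTRACTOR_VERSION_TS = "2025-01-01T00:00:00Z"
LAG_THRESHOLD = 0.20

# A page is stale if it was never extracted or was extracted by an older extractor.
# ISO-8601 strings in one format compare correctly as text.
STALE_PREDICATE = "(edges_backfilled_at IS NULL OR edges_backfilled_at < ?)"


def extraction_lag(conn, version_ts=EDGE_EXTRACTOR_VERSION_TS):
    """Return (stale, total) over non-deleted pages."""
    row = conn.execute(
        "SELECT COALESCE(SUM(CASE WHEN " + STALE_PREDICATE + " THEN 1 ELSE 0 END), 0), "
        "COUNT(*) FROM pages WHERE deleted_at IS NULL",
        (version_ts,),
    ).fetchone()
    return row[0], row[1]


def doctor_status(stale, total, threshold=LAG_THRESHOLD):
    """'fail' when the stale share exceeds the threshold, else 'ok'."""
    if total == 0:
        return "ok"
    return "fail" if stale / total > threshold else "ok"


def stale_page_ids(conn, version_ts=EDGE_EXTRACTOR_VERSION_TS, limit=1000):
    """Pages an `extract --stale` pass would pick up (assumes an `id` column)."""
    rows = conn.execute(
        "SELECT id FROM pages WHERE deleted_at IS NULL AND "
        + STALE_PREDICATE + " ORDER BY id LIMIT ?",
        (version_ts, limit),
    ).fetchall()
    return [r[0] for r in rows]


def sync_nudge(stale):
    """One-line stderr message for the end of sync, or None when nothing is stale."""
    if stale == 0:
        return None
    return (
        f"[sync] {stale} pages have no extracted edges - "
        "run 'gbrain extract all' or enable autopilot."
    )
```

Sharing one predicate between the health signal and the incremental selector means a page counts as lagging in doctor exactly when extract --stale would process it, so the two cannot drift apart.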
Why this matters
The published evaluation calls out 'imported ≠ embedded or curated' and 'useful operation requires ongoing jobs and monitoring' as the core limitation. This is the sharpest instance: a brain can look healthy (pages present, embeddings fresh, search returns hits) while the graph is structurally empty, and nothing surfaces it. Making extraction-lag a first-class doctor signal closes the gap between 'imported' and 'curated'.
Repro
SELECT count(*) FILTER (WHERE edges_backfilled_at IS NULL) AS unextracted,
count(*) AS total
FROM pages WHERE deleted_at IS NULL;
-- unextracted == total on an un-extracted brain
SELECT link_type, count(*) FROM links GROUP BY link_type ORDER BY count(*) DESC;
-- 'mentions' dominates ~99%+ if extract never ran