
extract is not wired into the default maintenance loop — brains silently accumulate 0% link/timeline coverage #1696

@garrytan

Symptom (real-world, 280K-page brain)

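On a production brain of ~280K pages, pages.edges_backfilled_at was NULL on every page, and about 99.6% of the link table was untyped mentions (1,042,187 of 1,045,858 links). Only 3,671 links carried any other type, and the four semantic types (founded, works_at, invested_in, advises) totaled just 581. Link counts by type, as reported: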

mentions: 1042187
attended: 887
related_to: 774
founded: 251
works_at: 221
invested_in: 70
advises: 39
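
As a quick sanity check on the counts above, the shares can be recomputed directly. This is only an arithmetic sketch: the numbers are copied from the listing, and the variable names are illustrative rather than anything gbrain defines.

```python
# Link counts by type, copied from the listing above.
link_counts = {
    "mentions": 1_042_187,
    "attended": 887,
    "related_to": 774,
    "founded": 251,
    "works_at": 221,
    "invested_in": 70,
    "advises": 39,
}
# Total link count reported in the issue. It is larger than the sum of the
# listed types, so some link types are not shown in the listing.
total_links = 1_045_858

mentions_share = link_counts["mentions"] / total_links
non_mention_links = total_links - link_counts["mentions"]
semantic_types = ("founded", "works_at", "invested_in", "advises")
semantic_links = sum(link_counts[t] for t in semantic_types)

print(f"mentions share of all links: {mentions_share:.2%}")
print(f"links with any other type:   {non_mention_links}")
print(f"semantic typed edges:        {semantic_links}")
```

The last two lines come out to 3,671 and 581: of more than a million links, only 581 express one of the four relationships the graph is meant to reason over.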

The brain had been ingesting for months. extract all had effectively never run at scale, so the graph could find entity names in text but had almost no semantic relationships (founded, works_at, invested_in, advises) to reason over. gbrain think, graph traversal, and link-based retrieval were all running on a graph that was structurally a bag of mentions.

After manually running extract all once, typed edges jumped to founded 7,232 / invested_in 3,016 / works_at 2,417 / advises 1,430: a gain of more than 12,500 typed edges from a pass that had simply never been triggered.

Root cause

autopilot's cycle is sync → extract → embed, but when autopilot isn't installed/running (the common case for CLI-first or externally-cron'd deployments), nothing ever calls extract. sync and embed are the steps people wire into their own cron because they're the obvious 'ingest' and 'make searchable' steps. extract is the silent third leg, and there is no warning, health-check failure, or doctor signal when it's been skipped.
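
For deployments that drive maintenance from their own cron, a wrapper that runs all three legs in autopilot's order might look like the sketch below. The exact invocations are assumptions: only 'gbrain extract all' is spelled out in this issue, while 'gbrain sync' and 'gbrain embed --stale' are inferred from the step names used here. The run_cycle helper and its injectable runner are hypothetical, not part of gbrain.

```python
import subprocess

# Assumed command lines, in the order autopilot uses: sync, extract, embed.
# Only "gbrain extract all" appears verbatim in the issue.
MAINTENANCE_STEPS = [
    ["gbrain", "sync"],
    ["gbrain", "extract", "all"],
    ["gbrain", "embed", "--stale"],
]


def run_cycle(run):
    """Run each step in order and stop at the first non-zero exit code.

    `run` takes a command (list of strings) and returns its exit code.
    Returns the commands that completed successfully, as strings.
    """
    completed = []
    for cmd in MAINTENANCE_STEPS:
        if run(cmd) != 0:
            break
        completed.append(" ".join(cmd))
    return completed


def run_with_subprocess(cmd):
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(cmd, check=False).returncode


# From a cron-invoked script:
#     run_cycle(run_with_subprocess)
```

Stopping at the first failure is a design choice in this sketch, so that extract never runs against a half-finished sync. A deployment that prefers best-effort behavior could log the failure and continue instead.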

Proposed fixes (any subset)

  1. doctor check for extraction lag. Add a health signal: the percentage of non-tweet/non-digest pages whose edges_backfilled_at is NULL or older than EDGE_EXTRACTOR_VERSION_TS. Fail (or warn loudly) above a threshold (e.g. >20%). This is the cheapest high-leverage fix: it makes the gap visible instead of silent.
  2. Fold extract into sync by default with an opt-out flag (sync --no-extract), so the canonical 'pull new content' command also types its edges. sync already knows which pages are new/changed, so this is incremental, not a full re-scan.
  3. Emit a one-line stderr nudge at the end of sync when extraction coverage is low: [sync] N pages have no extracted edges — run 'gbrain extract all' or enable autopilot.
  4. extract --stale mirroring embed --stale: only process pages whose edges_backfilled_at is NULL or older than the extractor version timestamp. Makes the incremental case trivial to cron.
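
Fixes 1 and 4 share one predicate: a page is stale when edges_backfilled_at is NULL or older than the extractor version timestamp. The sketch below shows that predicate driving both a doctor-style lag check and a stale-page selection. It runs against an in-memory SQLite table, not a real brain. The `type` column, the 'tweet'/'digest' values, the ISO-string timestamps, the placeholder value of EDGE_EXTRACTOR_VERSION_TS, and all function names are assumptions made for illustration.

```python
import sqlite3

# EDGE_EXTRACTOR_VERSION_TS is named in the issue; this value is a placeholder.
EDGE_EXTRACTOR_VERSION_TS = "2025-01-01T00:00:00Z"
LAG_THRESHOLD = 0.20  # the ">20%" example threshold from fix 1

# Shared by the doctor check (fix 1) and the stale selection (fix 4).
STALE = "(edges_backfilled_at IS NULL OR edges_backfilled_at < :ts)"
PARAMS = {"ts": EDGE_EXTRACTOR_VERSION_TS}


def extraction_lag(conn):
    """Fraction of live, non-tweet/non-digest pages with missing or outdated edges."""
    stale, total = conn.execute(
        f"""
        SELECT COALESCE(SUM(CASE WHEN {STALE} THEN 1 ELSE 0 END), 0), COUNT(*)
        FROM pages
        WHERE deleted_at IS NULL AND type NOT IN ('tweet', 'digest')
        """,
        PARAMS,
    ).fetchone()
    return stale / total if total else 0.0


def doctor_status(lag):
    """Map a lag fraction to a health result."""
    return "fail" if lag > LAG_THRESHOLD else "ok"


def stale_page_ids(conn):
    """Pages an incremental 'extract --stale' pass would need to process."""
    rows = conn.execute(
        f"SELECT id FROM pages WHERE deleted_at IS NULL AND {STALE} ORDER BY id",
        PARAMS,
    )
    return [row[0] for row in rows]


# Toy data: a hypothetical, simplified pages table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pages (id INTEGER PRIMARY KEY, type TEXT NOT NULL,"
    " edges_backfilled_at TEXT, deleted_at TEXT)"
)
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?, ?)",
    [
        (1, "note", None, None),                     # never extracted
        (2, "note", "2024-06-01T00:00:00Z", None),   # extracted by an older version
        (3, "note", "2025-03-01T00:00:00Z", None),   # up to date
        (4, "tweet", None, None),                    # excluded from the lag metric
        (5, "note", None, "2025-02-01T00:00:00Z"),   # deleted, ignored everywhere
    ],
)

lag = extraction_lag(conn)
print(doctor_status(lag), f"{lag:.0%}", stale_page_ids(conn))
```

On the toy data, two of the three eligible pages are stale, so the check fails. The stale selection also returns the tweet page, because fix 4 as written does not filter by page type. Whether the two fixes should share the type filter is an open design question.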

Why this matters

The published evaluation calls out 'imported ≠ embedded or curated' and 'useful operation requires ongoing jobs and monitoring' as the core limitation. This is the sharpest instance: a brain can look healthy (pages present, embeddings fresh, search returns hits) while the graph is structurally empty, and nothing surfaces it. Making extraction-lag a first-class doctor signal closes the gap between 'imported' and 'curated'.

Repro

SELECT count(*) FILTER (WHERE edges_backfilled_at IS NULL) AS unextracted,
       count(*) AS total
FROM pages WHERE deleted_at IS NULL;
-- unextracted == total on an un-extracted brain

SELECT link_type, count(*) FROM links GROUP BY link_type ORDER BY count(*) DESC;
-- 'mentions' dominates ~99%+ if extract never ran
