extract is not wired into the default maintenance loop — brains silently accumulate 0% link/timeline coverage

## Symptom (real-world, 280K-page brain)

On a production brain of ~280K pages, `pages.edges_backfilled_at` was **NULL on every single page** and the link table was **99.6% untyped `mentions`** (1,042,187 of 1,045,858 links). Only 3,671 typed edges existed in total. Counts by link type included:

```
mentions: 1042187
attended: 887
related_to: 774
founded: 251
works_at: 221
invested_in: 70
advises: 39
```

The brain had been ingesting for months. `extract all` had effectively **never run at scale**, so the graph could find entity *names* in text but had almost no semantic relationships (`founded`, `works_at`, `invested_in`, `advises`) to reason over. `gbrain think`, graph traversal, and link-based retrieval were all running on a graph that was structurally a bag of mentions.

After manually running `extract all` once, typed edges jumped to founded 7,232 / invested_in 3,016 / works_at 2,417 / advises 1,430. Against the pre-extract counts above, that is roughly **+13,500 typed edges** across these four types alone (14,095 vs. 581), from a pass that had simply never been triggered.

## Root cause

`autopilot`'s cycle is `sync → extract → embed`, but when autopilot isn't installed/running (the common case for CLI-first or externally-cron'd deployments), nothing ever calls `extract`. `sync` and `embed` are the steps people wire into their own cron because they're the obvious 'ingest' and 'make searchable' steps. `extract` is the silent third leg, and there is no warning, health-check failure, or `doctor` signal when it's been skipped.

## Proposed fixes (any subset)

1. **`doctor` check for extraction lag.** Add a health signal: the percentage of non-tweet/non-digest pages whose `edges_backfilled_at` is NULL or older than `EDGE_EXTRACTOR_VERSION_TS`. Fail (or warn loudly) above a threshold (e.g. >20%). This is the cheapest high-leverage fix: it makes the gap *visible* instead of silent.
2. **Fold `extract` into `sync` by default** with an opt-out flag (`sync --no-extract`), so the canonical 'pull new content' command also types its edges. `sync` already knows which pages are new/changed, so this is incremental, not a full re-scan.
3. **Emit a one-line stderr nudge** at the end of `sync` when extraction coverage is low: `[sync] N pages have no extracted edges — run 'gbrain extract all' or enable autopilot`.
4. **`extract --stale`** mirroring `embed --stale`: only process pages whose `edges_backfilled_at` is NULL or older than the extractor version timestamp. Makes the incremental case trivial to cron.
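A minimal sketch of the staleness predicate shared by fixes 1 and 4, using Python's stdlib `sqlite3` as a stand-in for the real database. The `type` column, its `'tweet'`/`'digest'` values, the timestamp format, and the function names are all assumptions made for illustration; only `edges_backfilled_at`, `deleted_at`, and the >20% example threshold come from this issue.

```python
import sqlite3

# Hypothetical stand-in for EDGE_EXTRACTOR_VERSION_TS.
# ISO-8601 UTC strings compare correctly as plain text.
EXTRACTOR_VERSION_TS = "2025-01-01T00:00:00Z"
LAG_THRESHOLD = 0.20  # the ">20%" example threshold from fix 1

# A page is stale if it was never extracted, or was extracted
# before the current extractor version.
STALE_SQL = """
SELECT
  SUM(CASE WHEN edges_backfilled_at IS NULL
             OR edges_backfilled_at < :version_ts THEN 1 ELSE 0 END) AS stale,
  COUNT(*) AS total
FROM pages
WHERE deleted_at IS NULL
  AND type NOT IN ('tweet', 'digest')  -- assumed column name and values
"""


def extraction_lag(conn, version_ts=EXTRACTOR_VERSION_TS):
    """Return the fraction of eligible pages with missing or outdated edges."""
    stale, total = conn.execute(STALE_SQL, {"version_ts": version_ts}).fetchone()
    if not total:
        return 0.0  # no eligible pages, so nothing is lagging
    return (stale or 0) / total


def doctor_check(conn, threshold=LAG_THRESHOLD):
    """Return ("FAIL" | "OK", lag) for a doctor-style health signal."""
    lag = extraction_lag(conn)
    return ("FAIL" if lag > threshold else "OK"), lag


def build_demo_db():
    """Tiny in-memory database with made-up rows, for demonstration only."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pages (id INTEGER PRIMARY KEY, type TEXT, "
        "edges_backfilled_at TEXT, deleted_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO pages VALUES (?, ?, ?, ?)",
        [
            (1, "person", None, None),                     # never extracted
            (2, "company", "2024-06-01T00:00:00Z", None),  # older than extractor
            (3, "person", "2025-03-01T00:00:00Z", None),   # fresh
            (4, "tweet", None, None),                      # excluded by type
            (5, "person", None, "2025-02-01T00:00:00Z"),   # soft-deleted
            (6, "meeting", "2025-02-01T00:00:00Z", None),  # fresh
        ],
    )
    return conn


if __name__ == "__main__":
    status, lag = doctor_check(build_demo_db())
    print(f"extraction lag: {lag:.0%} -> {status}")
```

The same `WHERE` predicate could drive an `extract --stale` selection, so the health check and the incremental job would agree on what counts as unextracted.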

## Why this matters

The published evaluation calls out 'imported ≠ embedded or curated' and 'useful operation requires ongoing jobs and monitoring' as the core limitation. This is the sharpest instance: a brain can look healthy (pages present, embeddings fresh, search returns hits) while the *graph* is structurally empty, and nothing surfaces it. Making extraction-lag a first-class `doctor` signal closes the gap between 'imported' and 'curated'.

## Repro

```sql
SELECT count(*) FILTER (WHERE edges_backfilled_at IS NULL) AS unextracted,
       count(*) AS total
FROM pages WHERE deleted_at IS NULL;
-- unextracted == total on an un-extracted brain

SELECT link_type, count(*) FROM links GROUP BY link_type ORDER BY count(*) DESC;
-- 'mentions' dominates ~99%+ if extract never ran
```
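The second query's output can be reduced to a single number for monitoring. The sketch below is illustrative only: the function names and the 0.99 cutoff are assumptions (the cutoff mirrors the "~99%+" comment above), and the sample rows are the per-type counts listed under Symptom.

```python
def mentions_share(rows):
    """rows: (link_type, count) pairs, as returned by the GROUP BY query."""
    total = sum(count for _, count in rows)
    if total == 0:
        return 0.0  # empty link table: nothing to measure
    mentions = sum(count for link_type, count in rows if link_type == "mentions")
    return mentions / total


def looks_unextracted(rows, threshold=0.99):
    """True when untyped `mentions` links dominate (assumed cutoff)."""
    return mentions_share(rows) >= threshold


if __name__ == "__main__":
    # Counts by link type as listed in the Symptom section.
    before = [
        ("mentions", 1042187),
        ("attended", 887),
        ("related_to", 774),
        ("founded", 251),
        ("works_at", 221),
        ("invested_in", 70),
        ("advises", 39),
    ]
    print(f"mentions share: {mentions_share(before):.2%}")
    print("looks unextracted:", looks_unextracted(before))
```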

