No batch/incremental entity enrichment primitive — thin person/company pages stay stubs forever #1700

@garrytan

Description

Symptom

On a 280K-page brain: only 2,330 of 36,391 person pages (6.4%) had more than 2 content chunks. The other ~93.6% were stubs — a name, maybe an email, a sentence. Same shape for companies (~18K thin). These pages have rich context scattered across the brain (meetings, emails, tweets, calendar, deals) but nothing pulls it onto the entity page.

The only tool that synthesizes scattered context is think --anchor, which:

  • is Opus-default and heavy/expensive per call,
  • is designed for one interactive question, not a batch sweep,
  • has no built-in prioritization, concurrency, resumability, or 'only thin pages' targeting.

We ended up hand-rolling a SQL query (thin pages ranked by inbound-link count) + a bash fan-out calling think 3-at-a-time. It worked (199 high-value pages enriched) but it's exactly the kind of thing that should be a first-class primitive, not operator glue.
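For reference, the shape of that glue can be sketched in Python. This is a minimal illustration, not the actual script: the `pages`/`chunks`/`links` schema and the `enrich_one` stub are hypothetical stand-ins (the stub takes the place of the expensive `think` call), and SQLite is used only so the sketch is self-contained.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

# Hypothetical schema for illustration only; the real brain's tables may differ.
SCHEMA = """
CREATE TABLE pages  (slug TEXT PRIMARY KEY, type TEXT);
CREATE TABLE chunks (page_slug TEXT);
CREATE TABLE links  (from_slug TEXT, to_slug TEXT);
"""

# Thin pages of one type, most inbound links first.
THIN_BY_INBOUND = """
SELECT p.slug,
       (SELECT COUNT(*) FROM links l WHERE l.to_slug = p.slug) AS inbound
FROM pages p
WHERE p.type = ?
  AND (SELECT COUNT(*) FROM chunks c WHERE c.page_slug = p.slug) <= ?
ORDER BY inbound DESC, p.slug
LIMIT ?
"""

def thin_pages(conn, page_type, max_chunks, limit):
    """Return slugs of thin pages, ranked by inbound-link count."""
    rows = conn.execute(THIN_BY_INBOUND, (page_type, max_chunks, limit))
    return [slug for slug, _inbound in rows]

def enrich_one(slug):
    # Placeholder for the per-page synthesis call; returns a marker string here.
    return f"enriched:{slug}"

def fan_out(slugs, workers=3):
    """Run enrich_one over slugs with at most `workers` calls in flight."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(enrich_one, slugs))  # results keep input order

def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO pages VALUES (?, ?)", [
        ("alice", "person"), ("bob", "person"),
        ("carol", "person"), ("acme", "company"),
    ])
    conn.executemany("INSERT INTO chunks VALUES (?)",
                     [("alice",)] * 5 + [("bob",)] + [("carol",)] * 2)
    conn.executemany("INSERT INTO links VALUES (?, ?)", [
        ("m1", "bob"), ("m2", "carol"), ("m3", "carol"), ("m4", "alice"),
    ])
    slugs = thin_pages(conn, "person", max_chunks=2, limit=10)
    return slugs, fan_out(slugs)
```

In the demo data, `alice` is skipped because she already has 5 chunks, and `carol` is ranked ahead of `bob` because she has more inbound links.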

Proposed: gbrain enrich

A dedicated batch enrichment command:

gbrain enrich [--type person|company|...] [--thin] [--limit N] [--workers K]
              [--order inbound-links|recency|degree] [--model <id>] [--dry-run]

  • --thin: target only pages at or below a small chunk-count threshold (the stubs), so you don't re-burn tokens on already-rich pages.
  • --order inbound-links: prioritize highest-signal/lowest-content pages first (most-connected stubs = biggest graph payoff per dollar). This was the heuristic that made our manual pass effective.
  • --workers K: built-in bounded concurrency (we ran 3 in parallel by hand).
  • Resumable: a watermark like embed --stale / edges_backfilled_at, so an interrupted run picks up where it left off. An enriched_at column gated on an enricher version would do it.
  • --model: cost control (we used Sonnet, not Opus default; see the related fail-silent model-override bug).
  • Idempotent + non-destructive: append/merge synthesized profile into the page, don't clobber human-authored content.
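
The resumable and non-destructive bullets could look roughly like the sketch below. Everything in it is an assumption for illustration: the `enricher_version` field next to `enriched_at`, the `ENRICHER_VERSION` constant, and the HTML-comment markers that fence off the machine-written block so human-authored text is never touched.

```python
import re

ENRICHER_VERSION = 2  # hypothetical; bump to force a re-enrich of older pages

def needs_enrichment(page, version=ENRICHER_VERSION):
    """True if the page was never enriched, or was enriched by an older version.

    `page` is a dict with optional 'enriched_at' and 'enricher_version' keys
    (hypothetical field names).
    """
    if page.get("enriched_at") is None:
        return True
    return page.get("enricher_version", 0) < version

# Hypothetical markers delimiting the synthesized profile inside the page body.
BEGIN = "<!-- enrich:begin -->"
END = "<!-- enrich:end -->"
_BLOCK = re.compile(re.escape(BEGIN) + r".*?" + re.escape(END), re.DOTALL)

def merge_profile(body, profile):
    """Insert or refresh the synthesized block; leave all other text as-is."""
    block = f"{BEGIN}\n{profile.strip()}\n{END}"
    if _BLOCK.search(body):
        # Replace only the existing machine-written block (lambda avoids
        # backslash interpretation in the replacement text).
        return _BLOCK.sub(lambda _m: block, body, count=1)
    # First enrichment: append below the human-authored content.
    return body.rstrip("\n") + "\n\n" + block + "\n"
```

Running `merge_profile` twice with the same profile yields the same page, and a later run with a new profile swaps the block in place rather than appending a second one.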

Stretch: make it part of autopilot

Once enrich --thin --stale exists, autopilot can run a slow trickle (e.g. top-50 thinnest-but-most-connected entities per cycle) so the brain gets smarter over time, not just bigger. Today the observed failure mode (from a network-intelligence digest) was literally: '1,516 new thin pages vs only 567 enriched — brain growing faster than getting smarter.' A trickle enricher in the maintenance loop directly closes that gap.
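
To put numbers on that gap, here is a small back-of-the-envelope sketch. The function names are made up, and the digest does not say how many autopilot cycles run per reporting period, so that stays a parameter rather than a claim.

```python
import math

def backlog_delta(new_thin, enriched):
    """Net growth in thin pages over one reporting period."""
    return new_thin - enriched

def cycles_to_keep_pace(new_thin, per_cycle):
    """Cycles per period needed for a fixed per-cycle budget to match inflow."""
    return math.ceil(new_thin / per_cycle)

# Figures from the digest quoted above.
growth = backlog_delta(1516, 567)        # thin pages added net of enrichment
cycles = cycles_to_keep_pace(1516, 50)   # at the top-50-per-cycle trickle
```

With the digest's figures the backlog grows by 949 pages per period, and a top-50 trickle would need 31 cycles per period just to match the inflow of new thin pages.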

Why this matters

This is the 'memory evolution' axis of the published evaluation. dream/autopilot exist for maintenance, but there's no primitive that turns the brain's own scattered knowledge into curated entity profiles at scale. The data is already there; it just needs a batch synthesizer with prioritization + cost control + resumability.
