design: type proliferation — 94 types should be ~14 (DRY/MECE unification proposal) #1479

@garrytan-agents

Problem

A production brain with 186K pages has organically accumulated 94 distinct page types. Most are duplicates, near-duplicates, or one-off types that should be subtypes via frontmatter fields. This violates DRY (same concept under multiple types) and MECE (types overlap, creating ambiguity about where new content goes).

The type system is the foundation for schema packs, search filtering, extract behavior, enrichment routing, and expert_routing. When types are noisy, every downstream feature degrades.

Production Data

Cluster 1: Social posts — 7 types, 106K pages

| Type | Count |
| --- | --- |
| tweet | 33,409 |
| tweet-bundle | 42,498 |
| media/x-tweet/bundle | 9,246 |
| tweet-stub | 1,625 |
| media/x-account/daily | 19,631 |
| media/x-account/monthly | 67 |
| media/x-account | 15 |

Should be: tweet with subtype: single|bundle|stub + social-digest with period: daily|monthly. 7 → 2.
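
To make the 7 → 2 collapse concrete, here is a minimal sketch of the retyping as a lookup table. The function name, the dict-shaped frontmatter, and the handling of bare media/x-account (mapped to social-digest with no period) are assumptions for illustration, not existing code.

```python
# Hypothetical mapping: legacy social type -> (canonical type, extra frontmatter fields).
SOCIAL_RETYPE = {
    "tweet": ("tweet", {"subtype": "single"}),
    "tweet-bundle": ("tweet", {"subtype": "bundle"}),
    "media/x-tweet/bundle": ("tweet", {"subtype": "bundle"}),
    "tweet-stub": ("tweet", {"subtype": "stub"}),
    "media/x-account/daily": ("social-digest", {"period": "daily"}),
    "media/x-account/monthly": ("social-digest", {"period": "monthly"}),
    # Assumption: the bare account type carries no period, so none is set.
    "media/x-account": ("social-digest", {}),
}


def retype_social(frontmatter: dict) -> dict:
    """Return a copy of the frontmatter with the canonical type and distinguishing field."""
    new_type, extra = SOCIAL_RETYPE[frontmatter["type"]]
    return {**frontmatter, **extra, "type": new_type}


# Seven legacy types collapse onto two canonical ones.
canonical = {new_type for new_type, _ in SOCIAL_RETYPE.values()}
print(len(SOCIAL_RETYPE), "->", len(canonical))
```

The distinction that used to live in the type name survives as a field, so a filter on subtype or period can still recover the old groupings.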

Cluster 2: Articles — 5 types, 3.6K pages

| Type | Count |
| --- | --- |
| article | 1,418 |
| media/article | 1,518 |
| sources/article | 635 |
| source/article | 40 |
| source | 24 |

Should be: ONE type article. 5 → 1.

Cluster 3: Companies — 4 types, 13.5K pages

| Type | Count |
| --- | --- |
| company | 5,210 |
| yc-company | 5,721 |
| product | 2,629 |
| organization | 2 |

"Accelerator membership" should be a field (batch: S23), not a separate type. Should be: company with kind: company|product|org. 4 → 1.

Cluster 4: Atoms — 6 types, 18K pages

| Type | Count |
| --- | --- |
| atom | 13,634 |
| atom-extraction | 395 |
| content-atom | 8 |
| atom-partner-link | 8 |
| partner-atom-link | 6 |
| lore | 4,088 |

atom-partner-link and partner-atom-link should be LINKS, not page types. Should be: atom with origin: extraction|manual|lore. 6 → 1.

Cluster 5: Media/Content — 8 types, 8.7K pages

| Type | Count |
| --- | --- |
| media | 7,510 |
| video | 618 |
| youtube-video | 130 |
| writing | 251 |
| essay | 159 |
| blog-post | 27 |
| book | 8 |
| podcast | 1 |

Content format is a frontmatter field, not a type. Should be: media with format: video|article|essay|book|podcast. 8 → 1.

Cluster 6: Analysis — 8 types, 40 pages(!)

| Type | Count |
| --- | --- |
| analysis | 9 |
| media/analysis | 2 |
| media-analysis | 1 |
| media/x-account/analysis | 1 |
| research | 8 |
| organization-research | 1 |
| competitive-intel | 1 |
| yc/competitive-intel | 17 |

8 types for 40 pages is the clearest sign of ad-hoc proliferation. Should be: analysis with domain field. 8 → 1.

Cluster 7: Concept redirects outnumber concepts

  • concept: 4,304
  • concept-redirect: 5,519

More redirect pages than real pages. Redirects should be an alias table, not 5.5K stub pages that inflate orphan counts and waste embedding tokens.
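
A minimal sketch of what "alias table instead of stub pages" could look like. The page fields slug and redirect_to are hypothetical names chosen for the example; the real redirect pages may store their target differently.

```python
def build_alias_table(pages: list[dict]) -> tuple[dict[str, str], list[dict]]:
    """Split pages into an alias table (from redirects) and the pages worth keeping."""
    aliases: dict[str, str] = {}
    kept: list[dict] = []
    for page in pages:
        if page["type"] == "concept-redirect":
            # Hypothetical field: the slug this redirect points at.
            aliases[page["slug"]] = page["redirect_to"]
        else:
            kept.append(page)
    return aliases, kept


def resolve(slug: str, aliases: dict[str, str]) -> str:
    """Follow alias chains to the canonical slug, stopping if a cycle is detected."""
    seen: set[str] = set()
    while slug in aliases and slug not in seen:
        seen.add(slug)
        slug = aliases[slug]
    return slug


pages = [
    {"type": "concept", "slug": "product-market-fit"},
    {"type": "concept-redirect", "slug": "pmf", "redirect_to": "product-market-fit"},
    {"type": "concept-redirect", "slug": "p-m-f", "redirect_to": "pmf"},
]
aliases, kept = build_alias_table(pages)
print(resolve("p-m-f", aliases))  # product-market-fit
```

Only the real concept page remains to be embedded or counted as a potential orphan; the redirects become two rows in a lookup table.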

Cluster 8: One-off types — 25+ types with 1-2 pages each

civic, framework, insight, anecdote, principle, memo, rfs-draft, pitch-deck, policy-criticism, production-doc, recording-snippet, registry, reference, schema, video-script, web_page, log, agent-log, content-mining, meta-prompt, queue, eval-test...

Should be: tags or subtypes of note.

Cluster 9: Symlinks as pages — 54 pages

symlink, partner-symlink, symlink-manifest — filesystem operations stored as brain pages.

Why This Matters

  1. Schema packs can't be MECE — the pack declares types but the brain has 94, many undeclared. schema_review_orphans can't distinguish intentional from noise.
  2. Search filtering is ambiguous: --type article misses 2.2K articles typed as media/article, sources/article, etc.
  3. Enrichment routing is incomplete: enrichable_types can only list a few. 80+ types means most pages never get enriched.
  4. Agent confusion — when ingesting a new article, should it be article, media/article, sources/article, or source/article?
  5. Orphan inflation — concept-redirect pages (5.5K) inflate orphan count without adding knowledge value.
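
The search-filtering gap in item 2 follows directly from the Cluster 2 counts. A short worked calculation, using only the numbers reported above:

```python
# Page counts per article-like type, copied from Cluster 2.
ARTICLE_COUNTS = {
    "article": 1418,
    "media/article": 1518,
    "sources/article": 635,
    "source/article": 40,
    "source": 24,
}

total = sum(ARTICLE_COUNTS.values())   # 3635 article-like pages
matched = ARTICLE_COUNTS["article"]    # what an exact-match filter on "article" returns
missed = total - matched               # 2217, the "2.2K" in item 2
recall = matched / total

print(f"matched={matched} missed={missed} recall={recall:.0%}")
# matched=1418 missed=2217 recall=39%
```

An exact-match filter on the type article sees roughly two in five of the pages that are articles in practice.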

Proposed Target Taxonomy

| Type | Covers | Current types merged |
| --- | --- | --- |
| person | People | person, partner, partner-profile |
| company | Companies, orgs, products | company, yc-company, product, organization |
| concept | Ideas | concept (redirects → alias table) |
| atom | Knowledge units | atom, atom-extraction, content-atom, lore |
| tweet | Social posts | tweet, tweet-bundle, tweet-stub, media/x-tweet/bundle |
| social-digest | Social summaries | media/x-account/* |
| article | Web content | article, media/article, sources/article, source/* |
| media | Rich content | media, video, youtube-video, book, podcast |
| writing | Original writing | writing, essay, blog-post |
| meeting | Temporal discussions | meeting, call, interview |
| analysis | Research + intel | analysis, research, competitive-intel, all variants |
| event | Events | event, convention |
| deal | Deals | deal |
| note | Everything else | note, memo, insight, principle, framework, all one-offs |

94 types → 14 types. Distinctions move to frontmatter fields (subtype, format, origin, period, domain).

Migration Path

  1. Schema pack declares the 14 canonical types with aliases covering the old names
  2. Migration script retypes pages (e.g., media/article → article with source_collection: media)
  3. concept-redirect pages → alias table entries + soft-delete
  4. symlink/atom-partner-link pages → proper link table entries + soft-delete
  5. One-off types → retype to note with original type preserved as legacy_type tag
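
Steps 2 and 5 can be sketched as a pure function over frontmatter. Everything here is illustrative: the function name, the legacy_type:<old> tag format, deriving source_collection from the path prefix for types other than media/article, and the ONE_OFF_TYPES subset are assumptions, not the actual migration script.

```python
# Path-prefixed article types from Cluster 2 that step 2 would retype.
PREFIXED_ARTICLE_TYPES = {"media/article", "sources/article", "source/article"}

# A subset of the one-off types named in Cluster 8 (the full list is elided there).
ONE_OFF_TYPES = {"civic", "framework", "insight", "anecdote", "principle", "memo"}


def migrate_page(frontmatter: dict) -> dict:
    """Return migrated frontmatter; the input dict is never mutated."""
    old_type = frontmatter["type"]

    if old_type in PREFIXED_ARTICLE_TYPES:
        # Step 2: keep the old path prefix as provenance (assumed generalization
        # of the media/article example).
        collection = old_type.split("/")[0]
        return {**frontmatter, "type": "article", "source_collection": collection}

    if old_type in ONE_OFF_TYPES:
        # Step 5: retype to note and preserve the original type as a tag
        # (tag format is an assumption).
        tags = [*frontmatter.get("tags", []), f"legacy_type:{old_type}"]
        return {**frontmatter, "type": "note", "tags": tags}

    # Anything else is left for the other migration steps.
    return dict(frontmatter)


print(migrate_page({"type": "media/article", "title": "Example"}))
print(migrate_page({"type": "memo", "tags": ["internal"]}))
```

Because the old type is preserved in a field or tag, the retyping stays reversible if a merged type later turns out to deserve its own entry.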

Impact on Existing Features

  • inferType path prefix mapping shrinks dramatically
  • Schema pack page_types goes from 30+ to 14 entries
  • enrichable_types covers more of the brain naturally
  • extract type filters work correctly across the whole corpus
  • find_experts expert_routing covers all entity types cleanly
  • schema_review_orphans becomes meaningful (currently noisy)
  • Agent ingestion becomes unambiguous (one type per domain)
