Problem
A production brain with 186K pages has organically accumulated 94 distinct page types. Most are duplicates, near-duplicates, or one-off types that should be subtypes via frontmatter fields. This violates DRY (same concept under multiple types) and MECE (types overlap, creating ambiguity about where new content goes).
The type system is the foundation for schema packs, search filtering, extract behavior, enrichment routing, and expert_routing. When types are noisy, every downstream feature degrades.
Production Data
Cluster 1: Social posts — 7 types, 106K pages
| Type | Count |
| --- | --- |
| tweet | 33,409 |
| tweet-bundle | 42,498 |
| media/x-tweet/bundle | 9,246 |
| tweet-stub | 1,625 |
| media/x-account/daily | 19,631 |
| media/x-account/monthly | 67 |
| media/x-account | 15 |
Should be: tweet with subtype: single|bundle|stub + social-digest with period: daily|monthly. 7 → 2.
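As a rough illustration of the 7 → 2 collapse, the sketch below rewrites a page's frontmatter, assuming frontmatter is available as a plain dict. `SOCIAL_RETYPE` and `retype_social` are hypothetical names, and two mappings are assumptions the issue does not spell out: plain `tweet` becomes `subtype: single`, and bare `media/x-account` becomes a `social-digest` with no `period`.

```python
# Hypothetical mapping: legacy type -> (canonical type, extra frontmatter fields).
SOCIAL_RETYPE = {
    "tweet": ("tweet", {"subtype": "single"}),  # assumption: plain tweets are "single"
    "tweet-bundle": ("tweet", {"subtype": "bundle"}),
    "media/x-tweet/bundle": ("tweet", {"subtype": "bundle"}),
    "tweet-stub": ("tweet", {"subtype": "stub"}),
    "media/x-account/daily": ("social-digest", {"period": "daily"}),
    "media/x-account/monthly": ("social-digest", {"period": "monthly"}),
    # Assumption: the bare account type carries no period.
    "media/x-account": ("social-digest", {}),
}


def retype_social(frontmatter: dict) -> dict:
    """Return a copy of the frontmatter with the canonical type and fields applied."""
    legacy = frontmatter.get("type")
    if legacy not in SOCIAL_RETYPE:
        return dict(frontmatter)  # not a social type: leave it untouched
    canonical, fields = SOCIAL_RETYPE[legacy]
    return {**frontmatter, "type": canonical, **fields}
```

All other frontmatter keys pass through unchanged, so the rewrite only touches `type` and the new distinguishing field.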
Cluster 2: Articles — 5 types, 3.6K pages
| Type | Count |
| --- | --- |
| article | 1,418 |
| media/article | 1,518 |
| sources/article | 635 |
| source/article | 40 |
| source | 24 |
Should be: ONE type article. 5 → 1.
Cluster 3: Companies — 4 types, 13.5K pages
| Type | Count |
| --- | --- |
| company | 5,210 |
| yc-company | 5,721 |
| product | 2,629 |
| organization | 2 |
"Accelerator membership" should be a field (batch: S23), not a separate type. Should be: company with kind: company|product|org. 4 → 1.
Cluster 4: Atoms — 6 types, 18K pages
| Type | Count |
| --- | --- |
| atom | 13,634 |
| atom-extraction | 395 |
| content-atom | 8 |
| atom-partner-link | 8 |
| partner-atom-link | 6 |
| lore | 4,088 |
atom-partner-link and partner-atom-link should be LINKS, not page types. Should be: atom with origin: extraction|manual|lore. 6 → 1.
Cluster 5: Media/Content — 8 types, 8.7K pages
| Type | Count |
| --- | --- |
| media | 7,510 |
| video | 618 |
| youtube-video | 130 |
| writing | 251 |
| essay | 159 |
| blog-post | 27 |
| book | 8 |
| podcast | 1 |
Content format is a frontmatter field, not a type. Should be: media with format: video|article|essay|book|podcast. 8 → 1.
Cluster 6: Analysis — 8 types, 40 pages(!)
| Type | Count |
| --- | --- |
| analysis | 9 |
| media/analysis | 2 |
| media-analysis | 1 |
| media/x-account/analysis | 1 |
| research | 8 |
| organization-research | 1 |
| competitive-intel | 1 |
| yc/competitive-intel | 17 |
8 types for 40 pages is the clearest sign of ad-hoc proliferation. Should be: analysis with domain field. 8 → 1.
Cluster 7: Concept redirects outnumber concepts
- concept: 4,304
- concept-redirect: 5,519
More redirect pages than real pages. Redirects should be an alias table, not 5.5K stub pages that inflate orphan counts and waste embedding tokens.
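A minimal sketch of what an alias table lookup could look like in place of stub pages, assuming aliases are stored as a mapping from redirect slug to target slug. `resolve_slug` and the example slugs are hypothetical; the real storage layer is not described in this issue.

```python
def resolve_slug(slug: str, aliases: dict, max_hops: int = 10) -> str:
    """Follow alias entries until a slug with no alias (a real page) is reached."""
    seen = set()
    while slug in aliases:
        # Guard against cycles and runaway chains in a hand-edited alias table.
        if slug in seen or len(seen) >= max_hops:
            raise ValueError(f"alias cycle or chain too long at {slug!r}")
        seen.add(slug)
        slug = aliases[slug]
    return slug


# Hypothetical alias rows replacing two concept-redirect pages.
aliases = {
    "llms": "large-language-models",
    "large-language-model": "large-language-models",
}
canonical = resolve_slug("llms", aliases)
```

Because aliases live in a table rather than as pages, they are never embedded and never show up in orphan counts.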
Cluster 8: One-off types — 25+ types with 1-2 pages each
civic, framework, insight, anecdote, principle, memo, rfs-draft, pitch-deck, policy-criticism, production-doc, recording-snippet, registry, reference, schema, video-script, web_page, log, agent-log, content-mining, meta-prompt, queue, eval-test...
Should be: tags or subtypes of note.
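One way to surface one-off types before migrating is a simple count threshold. The sketch below is illustrative only: `sparse_types` is a hypothetical helper, and the sample data is the Cluster 6 table, since per-type counts for Cluster 8 are not listed here.

```python
def sparse_types(type_counts: dict, max_pages: int = 2) -> list:
    """Return the types that have at most max_pages pages, sorted by name."""
    return sorted(t for t, n in type_counts.items() if n <= max_pages)


# Counts copied from the Cluster 6 table.
analysis_counts = {
    "analysis": 9,
    "media/analysis": 2,
    "media-analysis": 1,
    "media/x-account/analysis": 1,
    "research": 8,
    "organization-research": 1,
    "competitive-intel": 1,
    "yc/competitive-intel": 17,
}
one_offs = sparse_types(analysis_counts)
```

On the Cluster 6 data this flags five of the eight types, each holding one or two pages.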
Cluster 9: Symlinks as pages — 54 pages
symlink, partner-symlink, symlink-manifest — filesystem operations stored as brain pages.
Why This Matters
- Schema packs can't be MECE: the pack declares types but the brain has 94, many undeclared. schema_review_orphans can't distinguish intentional types from noise.
- Search filtering is ambiguous: --type article misses 2.2K articles typed as media/article, sources/article, etc.
- Enrichment routing is incomplete: enrichable_types can only list a few. 80+ types means most pages never get enriched.
- Agent confusion: when ingesting a new article, should it be article, media/article, sources/article, or source/article?
- Orphan inflation: concept-redirect pages (5.5K) inflate orphan count without adding knowledge value.
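The search-filtering gap can be reproduced from the Cluster 2 counts. The sketch below compares an exact type match with an alias-expanded one; `expand_type_filter`, `count_matches`, and the alias structure are hypothetical and do not describe how the actual `--type` flag is implemented.

```python
# Counts copied from the Cluster 2 table.
ARTICLE_COUNTS = {
    "article": 1418,
    "media/article": 1518,
    "sources/article": 635,
    "source/article": 40,
    "source": 24,
}

# Hypothetical alias set: every legacy name that should match "--type article".
TYPE_ALIASES = {"article": set(ARTICLE_COUNTS)}


def expand_type_filter(requested: str, aliases: dict) -> set:
    """Expand a requested type into the set of stored types it should match."""
    return aliases.get(requested, {requested})


def count_matches(type_counts: dict, types: set) -> int:
    """Count the pages whose stored type is in the given set."""
    return sum(type_counts.get(t, 0) for t in types)


exact = count_matches(ARTICLE_COUNTS, {"article"})
expanded = count_matches(ARTICLE_COUNTS, expand_type_filter("article", TYPE_ALIASES))
missed = expanded - exact
```

With these counts an exact match finds 1,418 pages and the expanded filter finds 3,635, so 2,217 articles are invisible to the current filter.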
Proposed Target Taxonomy
| Type | Covers | Current types merged |
| --- | --- | --- |
| person | People | person, partner, partner-profile |
| company | Companies, orgs, products | company, yc-company, product, organization |
| concept | Ideas | concept (redirects → alias table) |
| atom | Knowledge units | atom, atom-extraction, content-atom, lore |
| tweet | Social posts | tweet, tweet-bundle, tweet-stub, media/x-tweet/bundle |
| social-digest | Social summaries | media/x-account/* |
| article | Web content | article, media/article, sources/article, source/* |
| media | Rich content | media, video, youtube-video, book, podcast |
| writing | Original writing | writing, essay, blog-post |
| meeting | Temporal discussions | meeting, call, interview |
| analysis | Research + intel | analysis, research, competitive-intel, all variants |
| event | Events | event, convention |
| deal | Deals | deal |
| note | Everything else | note, memo, insight, principle, framework, all one-offs |
94 types → 14 types. Distinctions move to frontmatter fields (subtype, kind, format, origin, period, domain).
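Whether a set of type declarations is actually MECE can be checked mechanically. The sketch below assumes a pack can be reduced to a mapping from canonical type to its alias list; that shape, and the names `build_alias_index` and `undeclared_types`, are assumptions for illustration and not the real schema pack format.

```python
def build_alias_index(pack: dict) -> dict:
    """Map every canonical name and alias to its canonical type.

    Raises if one name is claimed by two types (mutual exclusivity).
    """
    index = {}
    for canonical, aliases in pack.items():
        for name in [canonical, *aliases]:
            owner = index.setdefault(name, canonical)
            if owner != canonical:
                raise ValueError(
                    f"{name!r} is claimed by both {owner!r} and {canonical!r}"
                )
    return index


def undeclared_types(pack: dict, observed: list) -> list:
    """Return observed types the pack does not cover (collective exhaustiveness)."""
    index = build_alias_index(pack)
    return sorted(set(observed) - set(index))


# A two-row excerpt of the target taxonomy, written in the assumed shape.
pack_excerpt = {
    "article": ["media/article", "sources/article", "source/article", "source"],
    "writing": ["essay", "blog-post"],
}
gaps = undeclared_types(pack_excerpt, ["essay", "media/article", "civic"])
```

An empty result from `undeclared_types` over every type observed in the brain would mean the pack is exhaustive; any overlap between alias lists fails fast when the index is built.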
Migration Path
- Schema pack declares the 14 canonical types with aliases covering the old names
- Migration script retypes pages (e.g., media/article → article with source_collection: media)
- concept-redirect pages → alias table entries + soft-delete
- symlink/atom-partner-link pages → proper link table entries + soft-delete
- One-off types → retype to note with original type preserved as legacy_type tag
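The steps above could be dry-run as a per-page planning function before anything is written. This is a sketch under stated assumptions: `plan_page`, the action names, the `legacy_type:<name>` tag format, and deriving `source_collection` from the leading path segment are all hypothetical. `symlink-manifest` is left out of the link set because the migration path does not say how it should be handled.

```python
# Types the migration path turns into table rows instead of pages.
REDIRECT_TYPES = {"concept-redirect"}
LINK_TYPES = {"symlink", "partner-symlink", "atom-partner-link", "partner-atom-link"}


def plan_page(page: dict, alias_index: dict) -> dict:
    """Decide what the migration would do with one page, without touching storage."""
    legacy = page["type"]
    slug = page["slug"]
    if legacy in REDIRECT_TYPES:
        return {"action": "alias_entry", "slug": slug, "soft_delete": True}
    if legacy in LINK_TYPES:
        return {"action": "link_entry", "slug": slug, "soft_delete": True}
    canonical = alias_index.get(legacy)
    if canonical is None:
        # One-off type: fall back to note and keep the old name as a tag.
        tags = list(page.get("tags", [])) + [f"legacy_type:{legacy}"]
        return {"action": "retype", "slug": slug, "type": "note", "tags": tags}
    if canonical == legacy:
        return {"action": "keep", "slug": slug}
    plan = {"action": "retype", "slug": slug, "type": canonical}
    # Assumption: a path-style legacy type such as media/article records its
    # leading segment as source_collection.
    if "/" in legacy:
        plan["source_collection"] = legacy.split("/", 1)[0]
    return plan


alias_index = {"article": "article", "media/article": "article"}
plan = plan_page({"slug": "some-post", "type": "media/article"}, alias_index)
```

Returning a plan instead of mutating pages keeps the migration reviewable: the plans can be tallied per action and spot-checked before any retype or soft-delete runs.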
Impact on Existing Features
- inferType path prefix mapping shrinks dramatically
- Schema pack page_types goes from 30+ to 14 entries
- enrichable_types covers more of the brain naturally
- extract type filters work correctly across the whole corpus
- find_experts expert_routing covers all entity types cleanly
- schema_review_orphans becomes meaningful (currently noisy)
- Agent ingestion becomes unambiguous (one type per domain)
Related
- gbrain onboard: guided agent onboarding with migration prompts #1383 (migration prompts would drive this)