Symptom
Two junk pages landed in a production brain via sync:
- media/articles/mahanforcalifornia-com-spending: the body was a Cloudflare 'checking your browser' CAPTCHA interstitial, ingested as if it were the article.
- sources/articles/mahanforcalifornia-com-spending-...: an 890K-character / 171-chunk blob of mostly boilerplate, chunked and embedded at full cost.
Both had to be soft-deleted by hand (SQL + chunk removal). The only guard sync applies today is a file-size skip (stat.size > 5_000_000). There is no detection for:
- Cloudflare / 'Just a moment' / 'Checking your browser' / 'Enable JavaScript' interstitials
- 'Verify you are human' / hCaptcha / reCAPTCHA challenge pages
- HTTP error bodies (403/429/503 pages saved as content)
- Pages that are >90% navigation/boilerplate with little prose
- Near-duplicate of an existing page
Root cause
The ingest path treats 'we fetched bytes' as 'we got content.' A CAPTCHA wall returns 200 + HTML, passes the 5MB size check, and flows straight into pages + chunks + embeddings. Garbage in, embedded garbage out — and it costs embedding tokens and pollutes retrieval.
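To make the gap concrete, here is a minimal sketch of a size-only guard. The function name `passesSizeGuard` and the sample body are hypothetical; only the 5_000_000-byte threshold comes from the report above, and the real sync code may be structured differently.

```typescript
// Hypothetical stand-in for the size-only guard; not the project's actual code.
const MAX_FILE_BYTES = 5_000_000;

function passesSizeGuard(sizeBytes: number): boolean {
  return sizeBytes <= MAX_FILE_BYTES;
}

// A challenge page is typically a small HTML document served with status 200.
const captchaBody =
  "<html><head><title>Just a moment...</title></head>" +
  "<body>Checking your browser before accessing the site.</body></html>";

// Both sizes assume mostly ASCII content, so chars are roughly bytes.
const captchaBytes = captchaBody.length;
const blobBytes = 890_000;

console.log(passesSizeGuard(captchaBytes)); // true: the interstitial is ingested
console.log(passesSizeGuard(blobBytes)); // true: the boilerplate blob is ingested too
```

Neither junk page comes anywhere near the byte limit, so a size check alone cannot tell them apart from real articles.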
Proposed fixes
- Interstitial/blocklist detector before write. A cheap regex/heuristic pass on the incoming body: flag and quarantine (don't hard-fail; route to sources/_quarantine/ or set a quality_flag) when the body matches known interstitial signatures: Just a moment..., Checking your browser before accessing, cf-browser-verification, Enable JavaScript and cookies to continue, Verify you are human, Attention Required! | Cloudflare, Access denied + Ray ID, etc. This is exactly the kind of rote, deterministic signal where a regex is earned.
- Min-content / max-boilerplate ratio. Reject (quarantine) pages below N chars of extractable prose, or where prose:markup ratio is below a threshold. Catches the 'empty shell' and 'all nav' cases.
- Configurable max page chars (separate from the 5MB file-size skip), e.g. warn/split above 200K chars for a single page, since a legit article is rarely 890K chars. The current 5MB byte limit is far too loose to catch an 890K-char junk blob.
- Quarantine, don't silently drop. Write rejected items to a quarantine surface with the reason, so the operator can review false positives. Silent dropping trades one failure mode for another.
- doctor signal: count of quarantined/low-quality pages so operators see ingestion health.
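The fixes above could be combined into a single pre-write gate, sketched below. Everything here is an assumption for illustration: the `Verdict` shape, the function names, and the thresholds (500 prose chars, a 0.1 prose ratio, 200K max chars) are placeholders to be tuned, and the prose extraction is deliberately crude. The signature list is taken from the bullets above.

```typescript
// Sketch only: names, thresholds, and the Verdict shape are assumptions, not the project's API.
type Reason = "interstitial" | "too_long" | "too_little_prose" | "mostly_markup";
type Verdict = { ok: true } | { ok: false; reason: Reason; detail: string };

// Signatures from the proposal, matched case-insensitively.
const INTERSTITIAL_SIGNATURES: RegExp[] = [
  /just a moment\.\.\./i,
  /checking your browser before accessing/i,
  /cf-browser-verification/i,
  /enable javascript and cookies to continue/i,
  /verify you are human/i,
  /attention required! \| cloudflare/i,
  /access denied[\s\S]*ray id/i,
];

const MIN_PROSE_CHARS = 500; // placeholder for "N chars of extractable prose"
const MIN_PROSE_RATIO = 0.1; // placeholder prose-to-total ratio
const MAX_PAGE_CHARS = 200_000; // per-page limit, separate from the byte-size skip

// Crude prose extraction: drop script/style blocks and tags, collapse whitespace.
function extractProse(body: string): string {
  return body
    .replace(/<(script|style)\b[\s\S]*?<\/\1>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Returns a verdict with a reason so rejected items can be quarantined, not dropped.
function checkQuality(body: string): Verdict {
  for (const sig of INTERSTITIAL_SIGNATURES) {
    if (sig.test(body)) return { ok: false, reason: "interstitial", detail: sig.source };
  }
  if (body.length > MAX_PAGE_CHARS) {
    return { ok: false, reason: "too_long", detail: `${body.length} chars` };
  }
  const prose = extractProse(body);
  if (prose.length < MIN_PROSE_CHARS) {
    return { ok: false, reason: "too_little_prose", detail: `${prose.length} prose chars` };
  }
  const ratio = prose.length / body.length;
  if (ratio < MIN_PROSE_RATIO) {
    return { ok: false, reason: "mostly_markup", detail: `ratio ${ratio.toFixed(3)}` };
  }
  return { ok: true };
}

// Tally for a doctor-style health signal: quarantined pages grouped by reason.
function countByReason(verdicts: Verdict[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const v of verdicts) {
    if (!v.ok) counts[v.reason] = (counts[v.reason] ?? 0) + 1;
  }
  return counts;
}

const sample = "<html><title>Just a moment...</title><body></body></html>";
console.log(checkQuality(sample)); // rejected with reason "interstitial"
```

Because a legitimate article can quote a phrase like "verify you are human", a signature hit is only a reason to quarantine for review, which is why the sketch returns a reason and detail rather than throwing.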
Why this matters
This is the concrete mechanism behind the published 'imported ≠ embedded or curated' limitation. A quality gate at the sync boundary is the single highest-leverage place to enforce the 'brain is a temple' principle — it's cheaper to reject junk at ingest than to detect and soft-delete it later (and to re-embed everything around it).