sync has no content-quality gate — Cloudflare/CAPTCHA interstitials and junk pages get ingested as real content #1699

@garrytan

Symptom

Two junk pages landed in a production brain via sync:

  • media/articles/mahanforcalifornia-com-spending — body was a Cloudflare 'checking your browser' CAPTCHA interstitial, ingested as if it were the article.
  • sources/articles/mahanforcalifornia-com-spending-... — an 890K-character / 171-chunk blob of mostly boilerplate, chunked and embedded at full cost.

Both had to be soft-deleted by hand (SQL + chunk removal). The only guard sync applies today is a file-size skip (stat.size > 5_000_000). There is no detection for:

  • Cloudflare / 'Just a moment' / 'Checking your browser' / 'Enable JavaScript' interstitials
  • 'Verify you are human' / hCaptcha / reCAPTCHA challenge pages
  • HTTP error bodies (403/429/503 pages saved as content)
  • Pages that are >90% navigation/boilerplate with little prose
  • Near-duplicates of existing pages

Root cause

The ingest path treats 'we fetched bytes' as 'we got content.' A CAPTCHA wall returns 200 + HTML, passes the 5MB size check, and flows straight into pages + chunks + embeddings. Garbage in, embedded garbage out — and it costs embedding tokens and pollutes retrieval.
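To make the gap concrete, here is a minimal sketch of a size-only guard. Only the 5_000_000 threshold comes from the current behavior described above; the helper name passesSizeGuard and the sample HTML are hypothetical and are not the actual sync code.

```typescript
// Threshold taken from the guard described in this issue (stat.size > 5_000_000).
const MAX_FILE_BYTES = 5_000_000;

// Hypothetical helper: true when a size-only guard would let the file through.
function passesSizeGuard(sizeBytes: number): boolean {
  return sizeBytes <= MAX_FILE_BYTES;
}

// Illustrative challenge page. It is pure ASCII, so chars == bytes here.
const interstitialHtml =
  "<html><head><title>Just a moment...</title></head>" +
  "<body>Checking your browser before accessing the site.</body></html>";

console.log(passesSizeGuard(interstitialHtml.length)); // true: a tiny CAPTCHA page passes
console.log(passesSizeGuard(890_000)); // true: an 890K-char ASCII blob is still far under 5MB
```

Both junk cases pass because the guard measures bytes, not content.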

Proposed fixes

  1. Interstitial/blocklist detector before write. A cheap regex/heuristic pass on the incoming body: flag and quarantine rather than hard-fail (route to a sources/_quarantine/ directory or set a quality_flag) when the body matches known interstitial signatures: Just a moment..., Checking your browser before accessing, cf-browser-verification, Enable JavaScript and cookies to continue, Verify you are human, Attention Required! | Cloudflare, Access denied + Ray ID, etc. This is exactly the kind of rote, deterministic signal where a regex is earned.
  2. Min-content / max-boilerplate ratio. Reject (quarantine) pages below N chars of extractable prose, or where prose:markup ratio is below a threshold. Catches the 'empty shell' and 'all nav' cases.
  3. Configurable max page chars (separate from the 5MB file size skip), e.g. warn/split above 200K chars for a single page, since a legit article is rarely 890K chars. The current 5MB byte limit is far too loose to catch an 890K-char junk blob.
  4. Quarantine, don't silently drop. Write rejected items to a quarantine surface with the reason, so the operator can review false positives. Silent dropping trades one failure mode for another.
  5. doctor signal: count of quarantined/low-quality pages so operators see ingestion health.
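A rough sketch of how fixes 1 through 5 could fit together as a single gate, assuming the body is available as a string before write. Every name here (Verdict, GateOptions, qualityGate, countQuarantined) and every threshold except the 200K example from fix 3 is hypothetical. Oversized pages are quarantined for simplicity, where fix 3 suggests warn/split, and the prose extraction is deliberately crude.

```typescript
// Hypothetical names and thresholds throughout; tune against real data.
type Verdict =
  | { action: "accept" }
  | { action: "quarantine"; reason: string };

interface GateOptions {
  minProseChars: number; // fix 2: minimum extractable prose
  minProseRatio: number; // fix 2: prose chars / total chars
  maxPageChars: number; // fix 3: separate from the 5MB byte skip
}

const DEFAULTS: GateOptions = {
  minProseChars: 500,
  minProseRatio: 0.1,
  maxPageChars: 200_000,
};

// Signature phrases quoted from the list in fix 1.
const SIGNATURES: { name: string; pattern: RegExp }[] = [
  { name: "just-a-moment", pattern: /just a moment\.\.\./i },
  { name: "checking-browser", pattern: /checking your browser before accessing/i },
  { name: "cf-browser-verification", pattern: /cf-browser-verification/i },
  { name: "enable-js-cookies", pattern: /enable javascript and cookies to continue/i },
  { name: "verify-human", pattern: /verify you are human/i },
  { name: "attention-required", pattern: /attention required! \| cloudflare/i },
];

// Returns the name of the first matching signature, or null.
function matchInterstitial(body: string): string | null {
  for (const { name, pattern } of SIGNATURES) {
    if (pattern.test(body)) return name;
  }
  // "Access denied" only counts when a Ray ID is also present.
  if (/access denied/i.test(body) && /ray id/i.test(body)) {
    return "access-denied-ray-id";
  }
  return null;
}

// Crude prose extraction: drop script/style blocks, then tags, then collapse whitespace.
function extractProse(body: string): string {
  return body
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, " ")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Fix 4: never drop silently; every rejection carries a reason for operator review.
function qualityGate(body: string, opts: GateOptions = DEFAULTS): Verdict {
  const signature = matchInterstitial(body);
  if (signature) return { action: "quarantine", reason: `interstitial: ${signature}` };

  if (body.length > opts.maxPageChars) {
    return { action: "quarantine", reason: `oversized: ${body.length} chars` };
  }
  const prose = extractProse(body);
  if (prose.length < opts.minProseChars) {
    return { action: "quarantine", reason: `thin: ${prose.length} prose chars` };
  }
  const ratio = body.length > 0 ? prose.length / body.length : 0;
  if (ratio < opts.minProseRatio) {
    return { action: "quarantine", reason: `boilerplate: prose ratio ${ratio.toFixed(2)}` };
  }
  return { action: "accept" };
}

// Fix 5: the number a doctor-style health check could report.
function countQuarantined(verdicts: Verdict[]): number {
  return verdicts.filter((v) => v.action === "quarantine").length;
}

const captcha =
  "<html><title>Just a moment...</title><body>Checking your browser</body></html>";
console.log(qualityGate(captcha)); // quarantined, reason "interstitial: just-a-moment"
```

One known weakness: a legitimate article that quotes a signature phrase would also be quarantined. That is the false-positive case the quarantine surface in fix 4 exists for; restricting signature matches to short bodies would be another option.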

Why this matters

This is the concrete mechanism behind the published 'imported ≠ embedded or curated' limitation. A quality gate at the sync boundary is the single highest-leverage place to enforce the 'brain is a temple' principle — it's cheaper to reject junk at ingest than to detect and soft-delete it later (and to re-embed everything around it).
