v0.42.8.0 feat: content-quality gate on sync — quarantine junk + flag boilerplate (#1699) #1756

Merged
garrytan merged 5 commits into master from garrytan/sync-content-quality-gate on Jun 2, 2026

@garrytan garrytan commented Jun 2, 2026

Summary

Content-quality gate on sync (issue #1699). Two junk pages had been landing via gbrain sync — a Cloudflare "checking your browser" interstitial ingested as the article, and an 890K-char boilerplate wall chunked + embedded at full cost. This adds a gate at the importFromContent narrow waist (every ingest path: sync, import, put_page, capture, webhook) with one rule: hide only the unambiguous crap; for anything fuzzy, keep it usable and warn the agent.

Three-tier disposition

  • Quarantine (hide) — high-confidence junk (Cloudflare/CAPTCHA interstitial patterns + operator literals). Lands hidden from search, zero chunks, reviewable. Configurable to reject (throw → sync-failure).
  • Flag (warn, stay searchable) — fuzzy markup-heavy pages (prose-vs-markup ratio, warn-tier byte window, code excluded from the ratio, page_kind: code exempt). Page stays fully searchable; the agent gets a content_flag warning. Legit tables / API-docs / books never disappear.
  • Soft-block + flag — oversize pages keep the existing embed_skip behavior plus a content_flag: oversized warning surfaced via get_page.
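The three tiers above can be sketched as a single classifier. This is a minimal illustration, not the gate's actual code: the type names, the two junk patterns, the tag-based markup ratio, and the thresholds are all assumptions, and the warn-tier byte window is omitted. Junk is checked first on the assumption that it takes precedence over oversize.

```typescript
// Hypothetical types: none of these names come from the gbrain source.
type Disposition =
  | { kind: "ok" }
  | { kind: "quarantine"; reason: string }
  | { kind: "flag"; flag: "markup_heavy" | "oversized" };

interface GateConfig {
  maxMarkupRatio: number; // assumed: flag when the markup share exceeds this
  maxChars: number; // assumed oversize threshold, in characters
  proseCheckEnabled: boolean;
}

// Illustrative high-confidence patterns only; the real pattern list is not shown in the PR.
const JUNK_PATTERNS: RegExp[] = [
  /checking your browser/i,
  /verify you are (a )?human/i,
];

// Share of non-code characters that sit inside HTML-style tags.
// Fenced code is removed first so it does not count toward either side.
function markupRatio(content: string): number {
  const withoutCode = content.replace(/```[\s\S]*?```/g, "");
  if (withoutCode.length === 0) return 0;
  const tags = withoutCode.match(/<[^>]+>/g) ?? [];
  const markupChars = tags.reduce((n, t) => n + t.length, 0);
  return markupChars / withoutCode.length;
}

function assess(content: string, pageKind: string, cfg: GateConfig): Disposition {
  // Tier 1: unambiguous junk is hidden.
  for (const p of JUNK_PATTERNS) {
    if (p.test(content)) return { kind: "quarantine", reason: p.source };
  }
  // Tier 3: oversize pages stay visible but carry a warning.
  if (content.length > cfg.maxChars) return { kind: "flag", flag: "oversized" };
  // Tier 2: fuzzy markup-heavy pages are flagged; code pages are exempt.
  if (
    cfg.proseCheckEnabled &&
    pageKind !== "code" &&
    markupRatio(content) > cfg.maxMarkupRatio
  ) {
    return { kind: "flag", flag: "markup_heavy" };
  }
  return { kind: "ok" };
}
```

Only the first branch hides a page; both flag branches leave it searchable, which is the "keep it usable and warn the agent" rule from the summary.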

Agent-warning channel: SearchResult.content_flag (stamped in hybridSearch and the keyword-only search op) and a top-level content_flag on the get_page op, so an agent sees the warning whether it searches or fetches a page directly.
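A rough sketch of what stamping the warning onto search results might look like. The result shape, the `stampFlags` helper, and the `FlagLookup` signature are hypothetical; the real stampContentFlags and getContentFlagsByPageIds signatures are not shown in this PR, and the real lookup is presumably asynchronous.

```typescript
// Assumed result shape; only `content_flag` is named in the PR.
interface SearchResult {
  page_id: number;
  slug: string;
  score: number;
  content_flag?: string;
}

// Hypothetical batched lookup: page ids in, flag per flagged page out.
type FlagLookup = (pageIds: number[]) => Map<number, string>;

// One lookup for the whole result set, so stamping never becomes an N+1.
function stampFlags(results: SearchResult[], lookup: FlagLookup): SearchResult[] {
  if (results.length === 0) return results;
  const ids = Array.from(new Set(results.map((r) => r.page_id)));
  const flags = lookup(ids);
  return results.map((r) => {
    const flag = flags.get(r.page_id);
    // Unflagged results are returned untouched, with no content_flag key.
    return flag === undefined ? r : { ...r, content_flag: flag };
  });
}
```

Routing every search path through one helper of this kind is what keeps a path such as the keyword-only op from silently skipping the stamp.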

Operator surface: gbrain quarantine list [--include-flagged] / clear [--force] / scan [--apply]; doctor quarantined_pages + flagged_pages checks (run on Postgres AND PGLite); gbrain sources audit disposition-aware; gbrain lint markup-heavy rule; config keys content_sanity.junk_disposition / max_markup_ratio / prose_check_enabled.
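For orientation, the three config keys could be resolved roughly like this. The resolver name, the flat key lookup, the default values, and the (0, 1] range for the ratio are all assumptions; only the key names and the quarantine/reject choice come from this PR.

```typescript
type JunkDisposition = "quarantine" | "reject";

interface ContentSanityConfig {
  junk_disposition: JunkDisposition;
  max_markup_ratio: number;
  prose_check_enabled: boolean;
}

// Placeholder defaults for illustration; the shipped defaults are not stated in the PR.
const DEFAULTS: ContentSanityConfig = {
  junk_disposition: "quarantine",
  max_markup_ratio: 0.8,
  prose_check_enabled: true,
};

// Hypothetical resolver: invalid or missing values fall back to the defaults.
function resolveContentSanity(raw: Record<string, unknown>): ContentSanityConfig {
  const cfg: ContentSanityConfig = { ...DEFAULTS };

  const d = raw["content_sanity.junk_disposition"];
  if (d === "quarantine" || d === "reject") cfg.junk_disposition = d;

  // Accepts numbers or numeric strings (e.g. values read from an env var).
  const r = Number(raw["content_sanity.max_markup_ratio"]);
  if (Number.isFinite(r) && r > 0 && r <= 1) cfg.max_markup_ratio = r;

  const p = raw["content_sanity.prose_check_enabled"];
  if (typeof p === "boolean") cfg.prose_check_enabled = p;
  else if (p === "true" || p === "false") cfg.prose_check_enabled = p === "true";

  return cfg;
}
```

Having every caller go through one resolver also avoids the kind of drift where a dry run and an apply run end up using different thresholds.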

New files: src/core/quarantine.ts (both markers, sibling of embed-skip.ts), src/commands/quarantine.ts (CLI). No schema migration — markers are frontmatter JSONB.

Test Coverage

Comprehensive. ~298 tests across the diff's files pass on the merged tree. New + extended: test/content-sanity.test.ts (assessProse, confidence-split, A1/A2 FP guard, oversize, junk+oversize precedence), test/quarantine.test.ts (both markers + filter fragments), test/quarantine-cli.test.ts (list/clear/scan + getContentFlagsByPageIds), test/e2e/quarantine-search-exclusion.test.ts (quarantined absent / flagged present-with-content_flag / get_page surfaces content_flag / clear re-surfaces), plus extensions to import-file / audit / config / doctor / lint / sql-ranking tests.

Coverage audit: 82% at first pass; the three cheapest gaps (markup-heavy lint rule, getContentFlagsByPageIds direct test, max_markup_ratio config/env) were closed before ship.

Note: the full parallel suite OOM-kills on the dev machine (PGLite WASM memory accumulation); verification was done via bun run verify (29 guards) + typecheck + targeted/low-concurrency clean runs + full Postgres E2E, all green.

Pre-Landing Review

4 specialists dispatched (testing, maintainability, security, performance). Security: clean. No CRITICAL findings from the structured pass. Fixed before ship:

  • DRY/dead-code (multi-specialist confirmed): QUARANTINE_FILTER_FRAGMENT was unused while buildVisibilityClause rebuilt it inline → made quarantineFilterFragment(alias) the single source of truth + drift-guard test.
  • runScan dry-run used different thresholds than --apply → now resolves the same config.
  • classifyEventType doc note; clear-re-quarantine / nothing-to-clear / scan --limit tests added.
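The single-source-of-truth fix from the first bullet might look roughly like the following. The function names are the ones mentioned above, but their signatures and bodies here are assumed, as are the `frontmatter` column name and the `quarantine` marker key.

```typescript
// Assumed marker location: a `quarantine` key inside a JSONB `frontmatter` column.
// Signature and body are illustrative, not copied from src/core/quarantine.ts.
function quarantineFilterFragment(alias: string): string {
  // The alias is interpolated into SQL, so accept plain identifiers only.
  if (!/^[A-Za-z_][A-Za-z0-9_]*$/.test(alias)) {
    throw new Error(`unsafe SQL alias: ${alias}`);
  }
  return `(${alias}.frontmatter -> 'quarantine') IS NULL`;
}

// Callers compose the fragment instead of rebuilding it inline.
function buildVisibilityClause(alias: string, extra: string[] = []): string {
  return [quarantineFilterFragment(alias), ...extra].join(" AND ");
}

// Drift guard in miniature: the clause must contain the fragment verbatim.
function visibilityClauseUsesFragment(alias: string): boolean {
  return buildVisibilityClause(alias).includes(quarantineFilterFragment(alias));
}
```

A test asserting `visibilityClauseUsesFragment("p")` fails as soon as someone reintroduces an inline copy that diverges from the shared fragment.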

Performance findings (manual-CLI N+1, GIN-index residual filter) were low-confidence on non-hot paths; noted, not blocking.

Adversarial Review

Both the Claude-adversarial and Codex passes were run. Each caught a serious bug that the structured pass and the other model missed; all were fixed and regression-tested:

  • CRITICAL (Claude): a remote write-scoped put_page could plant quarantine/content_flag frontmatter on clean content to silently hide a page or inject text into the agent-warning channel. Fix: gate-owned markers stripped from untrusted (remote) frontmatter, matching the CV6 fail-closed posture.
  • P0 (Codex): the gate's assessed_at timestamp leaked into content_hash, so every flagged page re-chunked + re-embedded on every sync forever. Fix: gate markers excluded from the hash (same class as the captured_at/ingested_at fix); flagged pages now re-sync as unchanged.
  • P1 (Codex): the keyword-only search op bypassed the content_flag stamp. Fix: stampContentFlags wired into that path.
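The first two fixes share one idea: gate-owned markers are never trusted from outside and never count as content. A minimal sketch under stated assumptions: the helper names are hypothetical, the marker keys are assumed to be `quarantine` and `content_flag`, assessed_at is assumed to live inside the marker, and FNV-1a stands in for whatever hash the project actually uses.

```typescript
type Frontmatter = Record<string, unknown>;

// Assumed key names for the gate-owned markers.
const GATE_OWNED_KEYS: string[] = ["quarantine", "content_flag"];

// Returns a copy of the frontmatter without any gate-owned keys.
function stripGateMarkers(fm: Frontmatter): Frontmatter {
  const out: Frontmatter = {};
  for (const k of Object.keys(fm)) {
    if (GATE_OWNED_KEYS.indexOf(k) === -1) out[k] = fm[k];
  }
  return out;
}

// Fail closed: untrusted (remote) writers cannot plant or spoof markers.
function sanitizeFrontmatter(fm: Frontmatter, trusted: boolean): Frontmatter {
  return trusted ? { ...fm } : stripGateMarkers(fm);
}

// 32-bit FNV-1a, used here only as a dependency-free stand-in hash.
function fnv1a(s: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16);
}

// Markers (and the assessed_at timestamp inside them) are excluded, so a
// flagged page hashes the same on every sync and is treated as unchanged.
function contentHash(body: string, fm: Frontmatter): string {
  const stable = stripGateMarkers(fm);
  const keys = Object.keys(stable).sort();
  const canonical = JSON.stringify(keys.map((k) => [k, stable[k]]));
  return fnv1a(body + "\n" + canonical);
}
```

With this shape, re-assessing a page can refresh its marker timestamp without changing the hash, so the page is not re-chunked or re-embedded.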

Eval Results

No prompt-related files changed — evals skipped.

Plan Completion

Plan at ~/.claude/plans/system-instruction-you-are-working-functional-gem.md. All implementation items DONE or intentional-CHANGED (e.g. reject-mode reuses PAGE_JUNK_PATTERN rather than a separate PAGE_QUARANTINE token; quarantine clear ships --force/--no-embed semantics). The one NOT-DONE the audit flagged (get_page surfacing content_flag) was implemented before ship.

Documentation

Docs updated for the v0.42.8.0 content-quality gate (#1699):

  • CLAUDE.md — Key Files: added src/core/quarantine.ts, src/commands/quarantine.ts, and extension entries for content-sanity.ts, import-file.ts, the search/operations/types agent-warning channel, the doctor checks, and the lint/sources extensions. Commands: new v0.42.8.0 subsection covering the CLI, the automatic sync gate, the warning channel, and the three new content_sanity config keys.
  • llms-full.txt — regenerated from CLAUDE.md (bun run build:llms); CI drift guard test/build-llms.test.ts passes 7/7.
  • CHANGELOG.md — 0.42.8.0 entry written.
  • README.md / docs/ — no change (content-sanity / quarantine is operator-maintenance surface, never documented in README).

Test plan

  • bun run verify — 29 checks green (typecheck + guards)
  • ~298 unit/e2e tests across the diff's files pass (merged tree)
  • Full Postgres E2E green (engine-parity, search-quality, quarantine-search-exclusion, jsonb, sync)

🤖 Generated with Claude Code

garrytan and others added 5 commits June 1, 2026 19:11
…te (#1699)

Three-tier disposition at the importFromContent narrow waist:
- High-confidence junk (Cloudflare/CAPTCHA interstitial patterns + operator
  literals) -> quarantine (hidden from search, zero chunks) or reject.
- Fuzzy markup-heavy (prose-vs-markup ratio, warn-tier window, code-exempt)
  -> content_flag marker, stays searchable, agent warned.
- Oversize -> existing embed_skip soft-block + content_flag:oversized warning.

Agent-warning channel: SearchResult.content_flag (stamped in hybridSearch +
the keyword-only search op) and a top-level content_flag on get_page.
New quarantine.ts markers, gbrain quarantine CLI (list/clear/scan), doctor
quarantined_pages + flagged_pages checks (engine.executeRaw, works on PGLite),
sources-audit disposition awareness, markup-heavy lint rule, config keys.

Security: gate-owned markers stripped from untrusted (remote MCP) frontmatter
so a write-scoped client can't hide pages or inject the warning channel.
Markers excluded from content_hash so flagged pages don't re-embed every sync.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nt-quality-gate

# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
#	src/cli.ts
…0.42.8.0

Add CLAUDE.md Key Files + Commands entries for the #1699 content-quality
gate: src/core/quarantine.ts, gbrain quarantine CLI (list/clear/scan),
the agent-warning channel (SearchResult.content_flag + get_page), doctor
quarantined_pages/flagged_pages checks, the markup-heavy lint rule,
sources-audit disposition awareness, and the three new content_sanity
config keys. Regenerate llms-full.txt from CLAUDE.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nt-quality-gate

# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
#	src/cli.ts
@garrytan garrytan merged commit 0bfe0d0 into master Jun 2, 2026
21 checks passed
mgunnin added a commit to mgunnin/gbrain that referenced this pull request Jun 3, 2026
* upstream/master:
  v0.42.8.0 feat: content-quality gate on sync — quarantine junk + flag boilerplate (garrytan#1699) (garrytan#1756)
  v0.42.7.0 feat(extract): link/timeline extraction freshness watermark — gbrain extract --stale + doctor lag check (garrytan#1696) (garrytan#1755)
  v0.42.6.0 feat(enrich): gbrain enrich --thin — brain-internal grounded synthesis for stub pages (garrytan#1700) (garrytan#1757)
  v0.42.5.0 fix(minions): RSS watchdog opacity + pooler-reap self-heal + silent lens backlog + cycle lint DB-disconnect (garrytan#1678) (garrytan#1735)
  v0.42.4.0 fix: think --model fails loud — slash-form ids + never persist empty synthesis (garrytan#1698) (garrytan#1736)
  v0.42.3.0 feat(search): autocut — score-discontinuity result-sizing (garrytan#1663 wave 1) (garrytan#1682)
  v0.42.2.0 feat: gbrain connect — one-command Claude Code onboarding from a bearer token (garrytan#1683)