v0.42.8.0 feat: content-quality gate on sync — quarantine junk + flag boilerplate (#1699)#1756
Merged
…te (#1699)

Three-tier disposition at the importFromContent narrow waist:
- High-confidence junk (Cloudflare/CAPTCHA interstitial patterns + operator literals) -> quarantine (hidden from search, zero chunks) or reject.
- Fuzzy markup-heavy (prose-vs-markup ratio, warn-tier window, code-exempt) -> content_flag marker, stays searchable, agent warned.
- Oversize -> existing embed_skip soft-block + content_flag:oversized warning.

Agent-warning channel: SearchResult.content_flag (stamped in hybridSearch + the keyword-only search op) and a top-level content_flag on get_page.

New quarantine.ts markers, gbrain quarantine CLI (list/clear/scan), doctor quarantined_pages + flagged_pages checks (engine.executeRaw, works on PGLite), sources-audit disposition awareness, markup-heavy lint rule, config keys.

Security: gate-owned markers stripped from untrusted (remote MCP) frontmatter so a write-scoped client can't hide pages or inject the warning channel.

Markers excluded from content_hash so flagged pages don't re-embed every sync.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nt-quality-gate # Conflicts: # CHANGELOG.md # VERSION # package.json # src/cli.ts
…0.42.8.0 Add CLAUDE.md Key Files + Commands entries for the #1699 content-quality gate: src/core/quarantine.ts, gbrain quarantine CLI (list/clear/scan), the agent-warning channel (SearchResult.content_flag + get_page), doctor quarantined_pages/flagged_pages checks, the markup-heavy lint rule, sources-audit disposition awareness, and the three new content_sanity config keys. Regenerate llms-full.txt from CLAUDE.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nt-quality-gate # Conflicts: # CHANGELOG.md # VERSION # package.json # src/cli.ts
mgunnin added a commit to mgunnin/gbrain that referenced this pull request on Jun 3, 2026
* upstream/master: v0.42.8.0 feat: content-quality gate on sync — quarantine junk + flag boilerplate (garrytan#1699) (garrytan#1756) v0.42.7.0 feat(extract): link/timeline extraction freshness watermark — gbrain extract --stale + doctor lag check (garrytan#1696) (garrytan#1755) v0.42.6.0 feat(enrich): gbrain enrich --thin — brain-internal grounded synthesis for stub pages (garrytan#1700) (garrytan#1757) v0.42.5.0 fix(minions): RSS watchdog opacity + pooler-reap self-heal + silent lens backlog + cycle lint DB-disconnect (garrytan#1678) (garrytan#1735) v0.42.4.0 fix: think --model fails loud — slash-form ids + never persist empty synthesis (garrytan#1698) (garrytan#1736) v0.42.3.0 feat(search): autocut — score-discontinuity result-sizing (garrytan#1663 wave 1) (garrytan#1682) v0.42.2.0 feat: gbrain connect — one-command Claude Code onboarding from a bearer token (garrytan#1683)
Summary
Content-quality gate on sync (issue #1699). Two junk pages had been landing via gbrain sync: a Cloudflare "checking your browser" interstitial ingested as the article, and an 890K-char boilerplate wall chunked + embedded at full cost. This adds a gate at the importFromContent narrow waist (every ingest path: sync, import, put_page, capture, webhook) with one rule: hide only the unambiguous crap; for anything fuzzy, keep it usable and warn the agent.

Three-tier disposition

- High-confidence junk (Cloudflare/CAPTCHA interstitial patterns + operator literals): quarantine (hidden from search, zero chunks) or reject (throw → sync-failure).
- Fuzzy markup-heavy (prose-vs-markup ratio, warn-tier window; page_kind: code exempt): page stays fully searchable; the agent gets a content_flag warning. Legit tables / API-docs / books never disappear.
- Oversize: existing embed_skip behavior plus a content_flag: oversized warning surfaced via get_page.

Agent-warning channel: SearchResult.content_flag (stamped in hybridSearch and the keyword-only search op) and a top-level content_flag on the get_page op, so an agent sees the warning whether it searches or fetches a page directly.

Operator surface: gbrain quarantine list [--include-flagged] / clear [--force] / scan [--apply]; doctor quarantined_pages + flagged_pages checks (run on Postgres AND PGLite); gbrain sources audit is disposition-aware; gbrain lint gains a markup-heavy rule; config keys content_sanity.junk_disposition / max_markup_ratio / prose_check_enabled.

New files: src/core/quarantine.ts (both markers, sibling of embed-skip.ts), src/commands/quarantine.ts (CLI). No schema migration: markers are frontmatter JSONB.

Test Coverage
Comprehensive. ~298 tests across the diff's files pass on the merged tree. New + extended: test/content-sanity.test.ts (assessProse, confidence-split, A1/A2 FP guard, oversize, junk+oversize precedence), test/quarantine.test.ts (both markers + filter fragments), test/quarantine-cli.test.ts (list/clear/scan + getContentFlagsByPageIds), test/e2e/quarantine-search-exclusion.test.ts (quarantined absent / flagged present-with-content_flag / get_page surfaces content_flag / clear re-surfaces), plus extensions to import-file / audit / config / doctor / lint / sql-ranking tests.

Coverage audit: 82% at first pass; the three cheapest gaps (markup-heavy lint rule, getContentFlagsByPageIds direct test, max_markup_ratio config/env) were closed before ship.

Note: the full parallel suite OOM-kills on the dev machine (PGLite WASM memory accumulation); verification was done via bun run verify (29 guards) + typecheck + targeted/low-concurrency clean runs + full Postgres E2E, all green.

Pre-Landing Review
4 specialists dispatched (testing, maintainability, security, performance). Security: clean. No CRITICAL findings from the structured pass. Fixed before ship:
- QUARANTINE_FILTER_FRAGMENT was unused while buildVisibilityClause rebuilt it inline → made quarantineFilterFragment(alias) the single source of truth + drift-guard test.
- … --apply → now resolves the same config.
- … --limit tests added.

Performance findings (manual-CLI N+1, GIN-index residual filter) were low-confidence on non-hot paths; noted, not blocking.
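The single-source-of-truth fix for the quarantine filter might look roughly like the sketch below. It is an illustration under stated assumptions, not the code in src/core/quarantine.ts: only the names quarantineFilterFragment and buildVisibilityClause come from this description, while the SQL text, the frontmatter column name, the quarantine key, and the deleted_at condition are hypothetical.

```typescript
// Hypothetical sketch: one function owns the quarantine SQL fragment, and the
// visibility clause composes it instead of re-typing the condition inline.

// The alias is interpolated into SQL text, so only plain identifiers pass.
const ALIAS_RE = /^[A-Za-z_][A-Za-z0-9_]*$/;

// Assumption: the marker lives under a `quarantine` key in a JSONB
// `frontmatter` column. Both names are illustrative.
function quarantineFilterFragment(alias: string): string {
  if (!ALIAS_RE.test(alias)) {
    throw new Error(`invalid SQL alias: ${alias}`);
  }
  // True for rows that carry no quarantine marker.
  return `(${alias}.frontmatter -> 'quarantine') IS NULL`;
}

// Assumption: visibility also excludes soft-deleted rows. The real clause may
// contain different conditions; the point is that it calls the helper.
function buildVisibilityClause(alias: string): string {
  const notQuarantined = quarantineFilterFragment(alias);
  return `(${alias}.deleted_at IS NULL AND ${notQuarantined})`;
}
```

A drift guard in this style would simply assert that the output of buildVisibilityClause contains the output of quarantineFilterFragment for the same alias, so an inline copy cannot silently diverge.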
Adversarial Review
Both Claude-adversarial and Codex were run. Each caught a serious bug the structured pass and the other model missed; all were fixed + regression-tested:
- put_page could plant quarantine/content_flag frontmatter on clean content to silently hide a page or inject text into the agent-warning channel. Fix: gate-owned markers stripped from untrusted (remote) frontmatter, matching the CV6 fail-closed posture.
- The assessed_at timestamp leaked into content_hash, so every flagged page re-chunked + re-embedded on every sync forever. Fix: gate markers excluded from the hash (same class as the captured_at/ingested_at fix); flagged pages now re-sync as unchanged.
- The keyword-only search op bypassed the content_flag stamp. Fix: stampContentFlags wired into that path.

Eval Results

No prompt-related files changed; evals skipped.
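Two of the adversarial-review fixes (stripping gate-owned markers from untrusted frontmatter, and keeping those markers out of content_hash) both come down to treating the marker keys as a known set. The sketch below shows that idea only. The function names, the trusted flag, the use of SHA-256, and the assumption that the markers are top-level frontmatter keys are all hypothetical and not taken from the PR's code.

```typescript
import { createHash } from "node:crypto";

type Frontmatter = Record<string, unknown>;

// Keys owned by the content-quality gate. The key names follow the PR text;
// treating them as top-level frontmatter keys is an assumption.
const GATE_OWNED_KEYS: readonly string[] = ["quarantine", "content_flag"];

// Return a copy of the frontmatter without any gate-owned markers.
function stripGateMarkers(fm: Frontmatter): Frontmatter {
  const out: Frontmatter = {};
  for (const [key, value] of Object.entries(fm)) {
    if (!GATE_OWNED_KEYS.includes(key)) out[key] = value;
  }
  return out;
}

// Fail closed: untrusted (remote) writers never get to set gate markers.
// `trusted` is a hypothetical flag standing in for the real trust check.
function sanitizeIncoming(fm: Frontmatter, trusted: boolean): Frontmatter {
  return trusted ? fm : stripGateMarkers(fm);
}

// Hash the body plus frontmatter with markers removed and keys sorted, so a
// changing assessed_at inside a marker cannot change the hash.
function contentHash(body: string, fm: Frontmatter): string {
  const stable = stripGateMarkers(fm);
  const sorted = Object.keys(stable)
    .sort()
    .map((key) => [key, stable[key]]);
  return createHash("sha256")
    .update(JSON.stringify(sorted))
    .update("\n")
    .update(body)
    .digest("hex");
}
```

With this shape, two syncs of the same page that differ only in a marker's assessed_at value hash identically, which is the property the fix relies on to treat flagged pages as unchanged.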
Plan Completion
Plan at ~/.claude/plans/system-instruction-you-are-working-functional-gem.md. All implementation items DONE or intentional-CHANGED (e.g. reject-mode reuses PAGE_JUNK_PATTERN rather than a separate PAGE_QUARANTINE token; quarantine clear ships --force/--no-embed semantics). The one NOT-DONE the audit flagged (get_page surfacing content_flag) was implemented before ship.

Documentation
Docs updated for the v0.42.8.0 content-quality gate (#1699):
- CLAUDE.md Key Files: entries for src/core/quarantine.ts, src/commands/quarantine.ts, and extension entries for content-sanity.ts, import-file.ts, the search/operations/types agent-warning channel, the doctor checks, and the lint/sources extensions. Commands: new v0.42.8.0 subsection covering the CLI, the automatic sync gate, the warning channel, and the three new content_sanity config keys.
- llms-full.txt regenerated (bun run build:llms); CI drift guard test/build-llms.test.ts passes 7/7.

Test plan

- bun run verify: 29 checks green (typecheck + guards)

🤖 Generated with Claude Code