fix: expand error_page_title pattern to catch scraper artifacts at ingest#1561

Closed
garrytan-agents wants to merge 4 commits into garrytan:master from garrytan-agents:fix/content-sanity-titles
@garrytan-agents

Contributor

232 scraper error pages (202 from straylight-brain) persisted in the DB because the error_page_title content-sanity pattern only matched bare numeric codes and 'page not found'. Expanded it to catch 'Just a moment...', 'Access Denied', 'Error', 'Forbidden', 'Service Unavailable', and 'Robot Check'. All patterns are anchored, so legitimate pages won't trip.

root and others added 4 commits May 24, 2026 09:16
The judgeSignificance trimming (slice at 4000 chars) could split a
UTF-16 surrogate pair when an emoji sits exactly at the boundary,
producing a lone high surrogate that Anthropic's JSON parser rejects
with 'no low surrogate in string'.

Add safeSliceEnd() helper that backs up by one char when the cut lands
between a high and low surrogate. Apply to:
- judgeSignificance transcript trimming (the direct cause)
- findBoundary hard-split fallback (defense-in-depth)

Fixes: dream cycle SYNTH_PHASE_FAIL on 2026-05-24 caused by
🤖 emoji at pos 3999 in telegram/2026-05-20-topic-1-topic-1.md

`gbrain dream --source default` silently ignored the --source flag.
The flag was never parsed in parseArgs and never forwarded to runCycle.
This meant the cycle completed but last_full_cycle_at was never written
to the source's config JSONB, so doctor's cycle_freshness check always
reported stale cycles — even when dream ran successfully.

Changes:
- Parse --source and --max-pages in dream's parseArgs
- Forward sourceId and maxPages to runCycle opts
- Document both flags in --help

Without this fix, only `gbrain autopilot` (which uses its own fanout
logic) could write the cycle timestamp. Running `gbrain dream --source X`
via cron or manually would never update freshness.

The existing pattern only matched bare numeric codes (403, 404, etc.)
and 'page not found'. In practice, most scraper error pages land with
titles like 'Just a moment...', 'Access Denied', 'Error', 'Forbidden',
'Service Unavailable' — none of which matched.

This allowed 232 error pages (202 from straylight-brain) to persist in
the DB, triggering doctor's content_sanity_audit_recent check on every
run and inflating page counts.

Expanded patterns:
- 'Error' (bare), 'Forbidden', 'Access Denied', 'Service Unavailable'
- 'Robot Check', 'Verify you are human'
- 'Just a moment...' (Cloudflare challenge — dedicated pattern)

All patterns are anchored (^$) so they only match when the ENTIRE title
is the error string. A legitimate page titled 'How to Handle Access
Denied Errors' won't trip.
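
The effect of anchoring can be sketched as follows. This is a minimal illustration, not the project's regex: the `ANCHORED` and `UNANCHORED` constants and the `isJunkTitle` helper are hypothetical names, and the phrase list is only a subset of the titles mentioned above.

```typescript
// Hypothetical patterns for illustration; the real content-sanity regex may differ.
const PHRASES = "forbidden|access denied|service unavailable|robot check|verify you are human";
const ANCHORED = new RegExp(`^(${PHRASES})$`, "i");
const UNANCHORED = new RegExp(`(${PHRASES})`, "i");

// Returns true when the title matches the given pattern.
function isJunkTitle(title: string, pattern: RegExp = ANCHORED): boolean {
  // Trim first so stray whitespace from a scraper does not defeat the anchors.
  return pattern.test(title.trim());
}

const sampleTitles = [
  "Access Denied",
  "How to Handle Access Denied Errors",
  "Forbidden Knowledge",
];

for (const t of sampleTitles) {
  console.log(`${t}: anchored=${isJunkTitle(t)} unanchored=${isJunkTitle(t, UNANCHORED)}`);
}
```

With the anchors in place only the first sample is flagged. Without them, all three would be rejected at ingest.
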
@garrytan

Owner

Superseded by #1571: same intent, with three differences. (1) Distinct pattern names: error_page_title keeps the bare numeric codes and the new title-only phrases (Forbidden / Access Denied / Service Unavailable / Robot Check / Verify You Are Human); the Cloudflare title gets its own name (cloudflare_challenge_title) so audit JSONL aggregation stays diagnosable (the original PR reused error_page_title for both and collapsed the audit signal). (2) Drops the bare error matcher — too aggressive on legitimate concept/taxonomy pages titled exactly 'Error'. (3) Over-match regression guard tests: 'How to Handle Access Denied Errors', 'Error Boundary in React', 'Service Unavailable Pattern', 'Forbidden Knowledge' all verified to still ingest cleanly.

The shared surrogate-safe commit ships via the canonical safeSplitIndex helper instead of safeSliceEnd — see #1571's commit 02b1f5c for the explanation.

gbrain pages audit-junk-titles legacy cleanup is filed as v0.41+ TODO-V13-C (full spec preserved). Deferred from this PR per cross-model tension review for ship-and-validate-matchers-first discipline — destructive cleanup gets its own observation window after the matcher proves itself in production for ~1 week.

Thank you for catching the bug.

@garrytan garrytan closed this May 27, 2026
garrytan added a commit that referenced this pull request May 27, 2026
…persedes #1559, #1561) (#1571)

* fix: dream --source/--source-id plumbs sourceId to runCycle (supersedes #1559)

Closes the silent-no-op class where `gbrain dream --source <id>` ran
the cycle but never wrote `last_full_cycle_at`, leaving
`gbrain doctor`'s cycle_freshness check stuck red forever.

Changes to src/commands/dream.ts:
- DreamArgs.source field; parseArgs recognizes --source <id> AND the
  --source-id alias (matches v0.37.7.0 #1167 naming across
  import/extract/graph-query)
- Argv validation: missing value → exit 2; repeated different values
  → exit 2; --source X --source-id Y conflict → exit 2; same-value
  repetition → accepted
- --help short-circuit ordering preserved with IRON-RULE comment +
  structural test guard
- runDream engine-null guard: --source requires a connected brain
- runDream resolveSourceId → archived-source guard via fetchSource
  from src/core/sources-load.ts (single-row SELECT that projects
  archived + handles pre-v0.26.5 schema via isUndefinedColumnError)
- Typed-error try/catch via isResolverUserError predicate: only
  swallows known resolver-user errors; TypeError / postgres errors
  propagate uncaught with stack trace so genuine programmer bugs
  aren't hidden behind operator-error UX
- Forwarded sourceId to runCycle; existing v0.38 writeback at
  cycle.ts:1947-1967 now actually fires
- --help text documents both flag names
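
The argv validation rules listed above could look roughly like the sketch below. It is an illustration only: `parseSourceFlag` and `ParseResult` are hypothetical names, the function returns a result object instead of exiting the process so it stays testable, and treating a following `--flag` token as a missing value is an assumption.

```typescript
// Hypothetical result type; the real DreamArgs shape is not shown in this PR.
type ParseResult =
  | { ok: true; source: string | undefined }
  | { ok: false; exitCode: 2; message: string };

function parseSourceFlag(argv: string[]): ParseResult {
  let source: string | undefined;
  for (let i = 0; i < argv.length; i++) {
    const flag = argv[i];
    if (flag !== "--source" && flag !== "--source-id") continue;

    const value = argv[i + 1];
    // Missing value (end of argv, or another flag in the value slot): usage error.
    if (value === undefined || value.startsWith("--")) {
      return { ok: false, exitCode: 2, message: `${flag} requires a value` };
    }
    // Two different values, whether via the same flag or the alias, conflict.
    if (source !== undefined && source !== value) {
      return { ok: false, exitCode: 2, message: `conflicting source values: ${source} vs ${value}` };
    }
    source = value; // same-value repetition is accepted
    i++; // skip the value we just consumed
  }
  return { ok: true, source };
}

console.log(parseSourceFlag(["--source", "default"]));
console.log(parseSourceFlag(["--source", "a", "--source-id", "b"]));
```
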

Tests:
- test/dream-cli-flags.test.ts: structural assertions for new flags,
  help text, IRON-RULE comment guard, resolver/predicate wiring
- test/dream.test.ts: 13 PGLite integration cases covering happy
  path (the regression that closes PR #1559), back-compat, alias
  equivalence, all argv edge cases, engine-null, archived,
  --help short-circuit ordering, T3 typed-error propagation, and
  D5 end-to-end dream→checkCycleFreshness column-name drift guard

Plan + 11 decisions: ~/.claude/plans/system-instruction-you-are-working-starry-papert.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: judgeSignificance uses canonical safeSplitIndex (closes #1559/#1561 emoji crash)

Closes the 2026-05-24 production SYNTH_PHASE_FAIL: 🤖 (U+1F916,
surrogate pair U+D83E U+DD16) at offset 3999 in a long telegram
transcript made the raw 4000-char slice produce a lone high
surrogate; Anthropic's JSON parser rejected the payload with "no
low surrogate in string"; the synthesize phase failed.
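
The failure mode can be reproduced in a few lines. The sketch below is not the project's `safeSplitIndex`, whose exact behavior is not shown here; `surrogateSafeCut` is a hypothetical helper that only handles the single case described above, a cut landing between the two halves of a surrogate pair.

```typescript
// Hypothetical helper: returns an index <= cut that never falls between
// a high surrogate and the low surrogate that follows it.
function surrogateSafeCut(text: string, cut: number): number {
  if (cut <= 0) return 0;
  if (cut >= text.length) return text.length;
  const before = text.charCodeAt(cut - 1);
  const after = text.charCodeAt(cut);
  const isHigh = before >= 0xd800 && before <= 0xdbff;
  const isLow = after >= 0xdc00 && after <= 0xdfff;
  return isHigh && isLow ? cut - 1 : cut;
}

// U+1F916 written as its surrogate pair, placed so the high half sits at offset 3999.
const transcript = "a".repeat(3999) + "\uD83E\uDD16" + "b".repeat(100);

const raw = transcript.slice(0, 4000);
const safe = transcript.slice(0, surrogateSafeCut(transcript, 4000));

console.log(raw.charCodeAt(3999).toString(16)); // d83e (a lone high surrogate)
console.log(safe.length); // 3999 (the emoji is dropped whole rather than split)
```
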

Changes to src/core/cycle/synthesize.ts:
- judgeSignificance head+tail slice routed through safeSplitIndex
  from src/core/text-safe.ts (already imported)
- Did NOT introduce safeSliceEnd from PRs #1559+#1561 — that helper
  re-introduces the case-3 bug src/core/text-safe.ts:18-21 documents
- Did NOT touch findBoundary — master already routes through
  safeSplitIndex per the v0.42.0.0 wave

Tests in test/cycle-synthesize.test.ts:
- New describe('judgeSignificance — UTF-16 safety') block
- test.each over head boundaries (offsets 3998-4001) AND tail
  boundaries (offsets 3999-4002) for an 8001-char content with
  the robot emoji placed at each
- Primary assertion: explicit unpaired-surrogate scan over the
  captured prompt (NOT JSON.stringify per codex C-11 — V8/JSCore
  do not throw on lone surrogates, so that assertion was weak)
- Sub-8000 short-content branch case: no slicing, emoji passes
  through unchanged
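
An explicit unpaired-surrogate scan of the kind the primary assertion describes might look like this. It is a sketch under assumptions: `hasUnpairedSurrogate` is a hypothetical name, and the actual test code is not shown in this PR.

```typescript
// Hypothetical scanner: true if the string contains a high surrogate without a
// following low surrogate, or a low surrogate without a preceding high one.
function hasUnpairedSurrogate(text: string): boolean {
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code >= 0xd800 && code <= 0xdbff) {
      const next = text.charCodeAt(i + 1); // NaN when i + 1 is past the end
      if (!(next >= 0xdc00 && next <= 0xdfff)) return true;
      i++; // valid pair: skip the low half
    } else if (code >= 0xdc00 && code <= 0xdfff) {
      return true; // low surrogate that was not consumed as part of a pair
    }
  }
  return false;
}

const lone = "ok\uD83E";
const paired = "ok\uD83E\uDD16";

// JSON.stringify completes without throwing on the lone surrogate,
// which is why it makes a weak assertion for this bug.
JSON.stringify(lone);

console.log(hasUnpairedSurrogate(lone)); // true
console.log(hasUnpairedSurrogate(paired)); // false
```
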

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: expand error_page_title + add cloudflare_challenge_title (supersedes #1561)

Closes the bug class where scraper error pages with titles like
"Forbidden", "Access Denied", "Service Unavailable", "Robot Check",
and "Just a moment..." were slipping through the ingest gate
because the matcher only caught bare numeric codes (403/404/500...)
and "page not found". 232+ pages observed (202+ from straylight-
brain) were inflating page counts and tripping
content_sanity_audit_recent on every doctor run.

Changes to src/core/content-sanity.ts BUILT_IN_JUNK_PATTERNS:
- Expanded error_page_title regex to also catch forbidden,
  access denied, service unavailable, robot check, verify you are
  human (case-insensitive, anchored — so long-form essays about
  these topics still ingest fine)
- New cloudflare_challenge_title pattern with DISTINCT name from
  error_page_title (PR #1561 collapsed both into one name and lost
  audit signal — the new name preserves diagnosability in
  ~/.gbrain/audit/content-sanity-YYYY-Www.jsonl and doctor's
  content_sanity_audit_recent aggregation)
- Dropped PR #1561's bare-`error` matcher — too aggressive on
  legitimate concept/taxonomy pages titled exactly "Error"
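
Why distinct names matter for aggregation can be sketched as below. The `JunkPattern` shape, both regexes, and the `matchJunkTitle` and `countByPattern` helpers are assumptions for illustration; the real `BUILT_IN_JUNK_PATTERNS` entries may be structured differently, and the numeric-code and "page not found" alternatives are omitted here.

```typescript
// Hypothetical pattern shape; only the two pattern names come from the PR.
interface JunkPattern {
  name: string;
  regex: RegExp;
}

const PATTERNS: JunkPattern[] = [
  {
    name: "error_page_title",
    regex: /^(forbidden|access denied|service unavailable|robot check|verify you are human)$/i,
  },
  {
    name: "cloudflare_challenge_title",
    regex: /^just a moment(\.{3}|…)?$/i,
  },
];

// Returns the name of the first matching pattern, or null if the title is clean.
function matchJunkTitle(title: string): string | null {
  const trimmed = title.trim();
  for (const p of PATTERNS) {
    if (p.regex.test(trimmed)) return p.name;
  }
  return null;
}

// Groups blocked titles by pattern name, as an audit aggregation might.
function countByPattern(titles: string[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const t of titles) {
    const name = matchJunkTitle(t);
    if (name !== null) counts[name] = (counts[name] ?? 0) + 1;
  }
  return counts;
}

console.log(
  countByPattern(["Forbidden", "Just a moment...", "Just a moment...", "Error", "Forbidden Knowledge"]),
);
```

Because the Cloudflare challenge has its own name, the two "Just a moment..." hits are counted separately from the single "Forbidden" hit, while the bare "Error" title and "Forbidden Knowledge" pass through.
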

Tests:
- test/content-sanity.test.ts: pattern-count locked at 7, new
  matches via test.each, over-match regression guard (legitimate
  prose titled "How to Handle Access Denied Errors" / "Error
  Boundary in React" etc. must pass), audit-name distinctness
  pinned
- test/import-file-content-sanity.test.ts: end-to-end
  ContentSanityBlockError via importFromContent for each new
  pattern family (D6 — assessor wiring coverage, not just regex)

Out of scope, filed in TODOS.md as TODO-V13-C: gbrain pages
audit-junk-titles legacy-cleanup command. Dropped from this PR
per codex outside-voice tension (T1) for ship-and-validate-
matchers-first discipline. The 200+ pre-existing scraper pages
already in the DB will get the destructive-cleanup operator
surface after ~1 week of production observation against this
matcher.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump v0.41.23.0 + CHANGELOG + follow-up TODOs

VERSION + package.json bump to 0.41.23.0.

CHANGELOG voice: ELI10 lead naming the bug ("`gbrain dream --source
<id>` finally counts as a cycle"), then per-fix detail, then a
"To take advantage of v0.41.23.0" operator-action block and itemized
changes.

TODOS.md v0.41.23.x follow-ups:
- TODO-V13-A (P2): --max-pages plumbing (PR #1559's flag, deferred
  because CycleOpts has no maxPages field today)
- TODO-V13-B (P3): --source vs --source-id flag-name unification
  across all CLI commands
- TODO-V13-C (P2): gbrain pages audit-junk-titles legacy cleanup
  (deferred for ~1 week of matcher production observation)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump v0.41.25.0 → v0.41.26.0 (leave headroom for in-flight PR)

Master shipped v0.41.23.0 + v0.41.24.0 mid-review; this branch
originally bumped to v0.41.25.0 post-merge. User flagged v0.41.26.0
to leave a slot open for another in-flight PR. No code changes;
VERSION + package.json + CHANGELOG header + "To take advantage"
section updated in lockstep.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mgunnin added a commit to mgunnin/gbrain that referenced this pull request May 28, 2026
* upstream/master:
  v0.41.26.1 fix: lock-renewal cathedral — closes ~39 worker crashes/day (supersedes garrytan#1567) (garrytan#1572)
  v0.41.26.0 fix: dream --source + ingest junk titles + emoji-crash (supersedes garrytan#1559, garrytan#1561) (garrytan#1571)
  v0.41.25.0 perf(sync): batched deletes + global page-generation clock (supersedes garrytan#1538) (garrytan#1566)
  v0.41.24.0 fix(conversation-parser): threshold gates + bold-paren-time pattern — 20,167 Circleback messages unblocked (closes garrytan#1533) (garrytan#1543)
  v0.41.23.0 feat: extract operator surfaces + pack-driven extractables (garrytan#1541)
  v0.41.22.1 feat: brainstorm/lsd judge fixes (closes garrytan#1540 end-to-end) (garrytan#1562)
  v0.41.22.0 feat: type-unification cathedral — 94 types → 15 canonical (closes garrytan#1479) (garrytan#1542)
  v0.41.21.0 feat(ops): 5 daily-driver pains fixed in one wave (garrytan#1545)
  v0.41.20.0 feat: gbrain status + doctor --scope=brain (fix wave 2: items garrytan#6 + garrytan#7) (garrytan#1544)
  feat: v0.41.19.0 Supavisor Retry Cathedral (garrytan#1537)
  v0.41.18.0: gbrain onboard — the activation surface gbrain didn't have before (garrytan#1521)
  v0.41.17.0 feat: --workers N on every bulk command + facts dim doctor parity (garrytan#1519)
  v0.41.16.0 feat: conversation parser cathedral + progressive-batch primitive (closes garrytan#1461) (garrytan#1510)
  v0.41.15.0 feat(sync): --timeout + --max-age + partial status (closes garrytan#1472 RFC) (garrytan#1506)