fix: expand error_page_title pattern to catch scraper artifacts at ingest#1561
fix: expand error_page_title pattern to catch scraper artifacts at ingest#1561garrytan-agents wants to merge 4 commits into
Conversation
The judgeSignificance trimming (slice at 4000 chars) could split a UTF-16 surrogate pair when an emoji sits exactly at the boundary, producing a lone high surrogate that Anthropic's JSON parser rejects with 'no low surrogate in string'. Add safeSliceEnd() helper that backs up by one char when the cut lands between a high and low surrogate. Apply to: - judgeSignificance transcript trimming (the direct cause) - findBoundary hard-split fallback (defense-in-depth) Fixes: dream cycle SYNTH_PHASE_FAIL on 2026-05-24 caused by 🤖 emoji at pos 3999 in telegram/2026-05-20-topic-1-topic-1.md
`gbrain dream --source default` silently ignored the --source flag. The flag was never parsed in parseArgs and never forwarded to runCycle. This meant the cycle completed but last_full_cycle_at was never written to the source's config JSONB, so doctor's cycle_freshness check always reported stale cycles — even when dream ran successfully. Changes: - Parse --source and --max-pages in dream's parseArgs - Forward sourceId and maxPages to runCycle opts - Document both flags in --help Without this fix, only `gbrain autopilot` (which uses its own fanout logic) could write the cycle timestamp. Running `gbrain dream --source X` via cron or manually would never update freshness.
The existing pattern only matched bare numeric codes (403, 404, etc.) and 'page not found'. In practice, most scraper error pages land with titles like 'Just a moment...', 'Access Denied', 'Error', 'Forbidden', 'Service Unavailable' — none of which matched. This allowed 232 error pages (202 from straylight-brain) to persist in the DB, triggering doctor's content_sanity_audit_recent check on every run and inflating page counts. Expanded patterns: - 'Error' (bare), 'Forbidden', 'Access Denied', 'Service Unavailable' - 'Robot Check', 'Verify you are human' - 'Just a moment...' (Cloudflare challenge — dedicated pattern) All patterns are anchored (^$) so they only match when the ENTIRE title is the error string. A legitimate page titled 'How to Handle Access Denied Errors' won't trip.
|
Superseded by #1571: same intent, with three differences. (1) Distinct pattern names: The shared surrogate-safe commit ships via the canonical
Thank you for catching the bug. |
…persedes #1559, #1561) (#1571) * fix: dream --source/--source-id plumbs sourceId to runCycle (supersedes #1559) Closes the silent-no-op class where `gbrain dream --source <id>` ran the cycle but never wrote `last_full_cycle_at`, leaving `gbrain doctor`'s cycle_freshness check stuck red forever. Changes to src/commands/dream.ts: - DreamArgs.source field; parseArgs recognizes --source <id> AND the --source-id alias (matches v0.37.7.0 #1167 naming across import/extract/graph-query) - Argv validation: missing value → exit 2; repeated different values → exit 2; --source X --source-id Y conflict → exit 2; same-value repetition → accepted - --help short-circuit ordering preserved with IRON-RULE comment + structural test guard - runDream engine-null guard: --source requires a connected brain - runDream resolveSourceId → archived-source guard via fetchSource from src/core/sources-load.ts (single-row SELECT that projects archived + handles pre-v0.26.5 schema via isUndefinedColumnError) - Typed-error try/catch via isResolverUserError predicate: only swallows known resolver-user errors; TypeError / postgres errors propagate uncaught with stack trace so genuine programmer bugs aren't hidden behind operator-error UX - Forwarded sourceId to runCycle; existing v0.38 writeback at cycle.ts:1947-1967 now actually fires - --help text documents both flag names Tests: - test/dream-cli-flags.test.ts: structural assertions for new flags, help text, IRON-RULE comment guard, resolver/predicate wiring - test/dream.test.ts: 13 PGLite integration cases covering happy path (the regression that closes PR #1559), back-compat, alias equivalence, all argv edge cases, engine-null, archived, --help short-circuit ordering, T3 typed-error propagation, and D5 end-to-end dream→checkCycleFreshness column-name drift guard Plan + 11 decisions: ~/.claude/plans/system-instruction-you-are-working-starry-papert.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: judgeSignificance uses canonical safeSplitIndex (closes #1559/#1561 emoji crash) Closes the 2026-05-24 production SYNTH_PHASE_FAIL: 🤖 (U+1F916, surrogate pair U+D83E U+DD16) at offset 3999 in a long telegram transcript made the raw 4000-char slice produce a lone high surrogate; Anthropic's JSON parser rejected the payload with "no low surrogate in string"; the synthesize phase failed. Changes to src/core/cycle/synthesize.ts: - judgeSignificance head+tail slice routed through safeSplitIndex from src/core/text-safe.ts (already imported) - Did NOT introduce safeSliceEnd from PRs #1559+#1561 — that helper re-introduces the case-3 bug src/core/text-safe.ts:18-21 documents - Did NOT touch findBoundary — master already routes through safeSplitIndex per the v0.42.0.0 wave Tests in test/cycle-synthesize.test.ts: - New describe('judgeSignificance — UTF-16 safety') block - test.each over head boundaries (offsets 3998-4001) AND tail boundaries (offsets 3999-4002) for an 8001-char content with the robot emoji placed at each - Primary assertion: explicit unpaired-surrogate scan over the captured prompt (NOT JSON.stringify per codex C-11 — V8/JSCore do not throw on lone surrogates, so that assertion was weak) - Sub-8000 short-content branch case: no slicing, emoji passes through unchanged Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: expand error_page_title + add cloudflare_challenge_title (supersedes #1561) Closes the bug class where scraper error pages with titles like "Forbidden", "Access Denied", "Service Unavailable", "Robot Check", and "Just a moment..." were slipping through the ingest gate because the matcher only caught bare numeric codes (403/404/500...) and "page not found". 232+ pages observed (202+ from straylight- brain) were inflating page counts and tripping content_sanity_audit_recent on every doctor run. Changes to src/core/content-sanity.ts BUILT_IN_JUNK_PATTERNS: - Expanded error_page_title regex to also catch forbidden, access denied, service unavailable, robot check, verify you are human (case-insensitive, anchored — so long-form essays about these topics still ingest fine) - New cloudflare_challenge_title pattern with DISTINCT name from error_page_title (PR #1561 collapsed both into one name and lost audit signal — the new name preserves diagnosability in ~/.gbrain/audit/content-sanity-YYYY-Www.jsonl and doctor's content_sanity_audit_recent aggregation) - Dropped PR #1561's bare-`error` matcher — too aggressive on legitimate concept/taxonomy pages titled exactly "Error" Tests: - test/content-sanity.test.ts: pattern-count locked at 7, new matches via test.each, over-match regression guard (legitimate prose titled "How to Handle Access Denied Errors" / "Error Boundary in React" etc. must pass), audit-name distinctness pinned - test/import-file-content-sanity.test.ts: end-to-end ContentSanityBlockError via importFromContent for each new pattern family (D6 — assessor wiring coverage, not just regex) Out of scope, filed in TODOS.md as TODO-V13-C: gbrain pages audit-junk-titles legacy-cleanup command. Dropped from this PR per codex outside-voice tension (T1) for ship-and-validate- matchers-first discipline. The 200+ pre-existing scraper pages already in the DB will get the destructive-cleanup operator surface after ~1 week of production observation against this matcher. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: bump v0.41.23.0 + CHANGELOG + follow-up TODOs VERSION + package.json bump to 0.41.23.0. CHANGELOG voice: ELI10 lead naming the bug ("`gbrain dream --source <id>` finally counts as a cycle"), then per-fix detail, then a "To take advantage of v0.41.23.0" operator-action block and itemized changes. TODOS.md v0.41.23.x follow-ups: - TODO-V13-A (P2): --max-pages plumbing (PR #1559's flag, deferred because CycleOpts has no maxPages field today) - TODO-V13-B (P3): --source vs --source-id flag-name unification across all CLI commands - TODO-V13-C (P2): gbrain pages audit-junk-titles legacy cleanup (deferred for ~1 week of matcher production observation) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: bump v0.41.25.0 → v0.41.26.0 (leave headroom for in-flight PR) Master shipped v0.41.23.0 + v0.41.24.0 mid-review; this branch originally bumped to v0.41.25.0 post-merge. User flagged v0.41.26.0 to leave a slot open for another in-flight PR. No code changes; VERSION + package.json + CHANGELOG header + "To take advantage" section updated in lockstep. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* upstream/master: v0.41.26.1 fix: lock-renewal cathedral — closes ~39 worker crashes/day (supersedes garrytan#1567) (garrytan#1572) v0.41.26.0 fix: dream --source + ingest junk titles + emoji-crash (supersedes garrytan#1559, garrytan#1561) (garrytan#1571) v0.41.25.0 perf(sync): batched deletes + global page-generation clock (supersedes garrytan#1538) (garrytan#1566) v0.41.24.0 fix(conversation-parser): threshold gates + bold-paren-time pattern — 20,167 Circleback messages unblocked (closes garrytan#1533) (garrytan#1543) v0.41.23.0 feat: extract operator surfaces + pack-driven extractables (garrytan#1541) v0.41.22.1 feat: brainstorm/lsd judge fixes (closes garrytan#1540 end-to-end) (garrytan#1562) v0.41.22.0 feat: type-unification cathedral — 94 types → 15 canonical (closes garrytan#1479) (garrytan#1542) v0.41.21.0 feat(ops): 5 daily-driver pains fixed in one wave (garrytan#1545) v0.41.20.0 feat: gbrain status + doctor --scope=brain (fix wave 2: items garrytan#6 + garrytan#7) (garrytan#1544) feat: v0.41.19.0 Supavisor Retry Cathedral (garrytan#1537) v0.41.18.0: gbrain onboard — the activation surface gbrain didn't have before (garrytan#1521) v0.41.17.0 feat: --workers N on every bulk command + facts dim doctor parity (garrytan#1519) v0.41.16.0 feat: conversation parser cathedral + progressive-batch primitive (closes garrytan#1461) (garrytan#1510) v0.41.15.0 feat(sync): --timeout + --max-age + partial status (closes garrytan#1472 RFC) (garrytan#1506)
232 scraper error pages (202 from straylight-brain) persisted in the DB because the error_page_title content-sanity pattern only matched bare numeric codes. Expanded to catch Just a moment, Access Denied, Error, Forbidden, Service Unavailable, Robot Check. All anchored so legitimate pages wont trip.