fix: add timeout to doctor frontmatter_integrity check#1287
Closed
garrytan-agents wants to merge 1 commit into
Closed
fix: add timeout to doctor frontmatter_integrity check#1287garrytan-agents wants to merge 1 commit into
garrytan-agents wants to merge 1 commit into
Conversation
On brains with 200K+ pages, the frontmatter scan walks every .md file on disk across all registered sources. This synchronous FS walk can take minutes (observed: >60s on a 216K-page brain with 3 sources), causing the doctor command to appear hung. scanBrainSources already supports an AbortSignal via opts.signal — the walkDir callback checks signal.aborted on every file, and the source loop breaks on abort. This commit passes AbortSignal.timeout (default 30s) from the doctor caller so the check degrades gracefully instead of blocking the entire health report. Configurable via GBRAIN_DOCTOR_FM_TIMEOUT_MS for brains that need more or less time. When the timeout fires, doctor reports a warn with instructions to run the full scan directly.
Merged
7 tasks
garrytan
added a commit
that referenced
this pull request
May 22, 2026
…cing (supersedes #1287) (#1297) * fix(frontmatter): prune vendor dirs at descent + bounded wall-clock with partial-state surfacing Two production-grade fixes for the v0.38.2.0 wave (supersedes PR #1287). Root cause Fix 1 (the bug that hung gbrain doctor on 216K-page brains): both brain-writer.ts:walkDir and frontmatter.ts:collectFiles recursed into every subdirectory without calling pruneDir, the canonical descent-time pruner used by sync/extract/transcript-discovery since v0.35.5.0. On brains that double as code workspaces, the walkers stat'd hundreds of thousands of entries under node_modules / .git / .obsidian / *.raw / ops that isSyncable filtered out at the leaf — paying the IO cost for nothing. Wiring pruneDir at descent (with the v0.37.7.0 #1169 submodule-gitfile check) eliminates the bulk of the wall-clock pain. Fix 2 (codex outside-voice C1): AbortSignal.timeout cannot interrupt the synchronous walker — readdirSync / lstatSync / readFileSync block the event loop, so timer callbacks never fire mid-walk. The load-bearing wall-clock bound is now a deadline check inside scanOneSource's visit callback (Date.now() > opts.deadline). AbortSignal still works at source boundaries. Shape changes (codex C2 + C4): - ScanOpts: + deadline?: number, + dbPageCountForSource hook, + visitDir test seam - PerSourceReport: + status: 'scanned' | 'partial' | 'skipped', + files_scanned, + db_page_count - AuditReport: + partial: boolean, + aborted_at_source: string | null - ok = grandTotal === 0 && !partial (a clean prefix from a timed-out scan no longer falsely reports clean) walkDir + collectFiles now exported with an optional visitDir callback for the regression suite. Production callers don't pass it. Tests: - test/brain-writer-walk-prune.test.ts (new, 12 cases): visitDir-based descent-time pruning assertions for both walkers. Pins the property output-based tests can't catch (isSyncable rejects vendor files at the leaf — so a test checking only output passes under the original bug). - test/brain-writer-partial-scan.test.ts (new, 5 cases): deadline + partial state + ok-after-abort + numerator/denominator coverage. Uses deadline, NOT AbortSignal, since codex C1 proved abort can't interrupt sync. - test/brain-writer.test.ts: existing "abort mid-scan" test refit to the new partial-state contract (per_source has 'skipped' entries instead of being empty — gives doctor visibility into which sources weren't checked). - test/migrations-v0_22_4.test.ts: AuditReport fixture extended with the new required fields. Plan + cross-model review: ~/.claude/plans/system-instruction-you-are-working-hidden-lollipop.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(doctor): wire deadline + partial-state into frontmatter_integrity check Adopts the v0.38.2.0 ScanBrainSources surface in doctor's frontmatter_integrity check. - AbortSignal.timeout(fmTimeoutMs) for between-source bound. - deadline = Date.now() + fmTimeoutMs (the load-bearing mid-walk bound — codex C1 caught that AbortSignal alone can't fire inside the sync walker). - GBRAIN_DOCTOR_FM_TIMEOUT_MS env override (default 30000ms; invalid values fall back to default rather than crash). - Per-source DB denominator via SELECT COUNT(*) FROM pages WHERE source_id = $1 AND deleted_at IS NULL (codex C3: deleted_at filter so soft-deleted pages don't inflate the count). - Honest partial-render: "PARTIAL — scanned ~N files (source has ~M pages in DB), K issue(s) so far" instead of "scanned ~N of M pages" (codex C3 — the two populations are overlapping but not identical sets). - "NOT SCANNED (timeout — run gbrain frontmatter validate <id>)" per skipped source so the user knows which sources didn't get checked. - Catch block simplified to "unexpected error only" (codex D4 — the AbortError special case from PR #1287 was unreachable in a sync walker). Tests: test/doctor-frontmatter-partial.test.ts (new, 11 cases) — structural source-grep pins on every load-bearing render string plus the simplified- catch contract. Behavioral coverage is deferred to the heavy script (tests/heavy/frontmatter_scan_wallclock.sh, T6) because runDoctor calls process.exit unconditionally and can't be driven from bun:test directly; refactoring runDoctor to return rather than exit is a separate TODO. Plan + cross-model review: ~/.claude/plans/system-instruction-you-are-working-hidden-lollipop.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: v0.38.2.0 release notes, Phase 2 design sketch, heavy wall-clock smoke - CHANGELOG.md: ELI10-lead-first release entry per CLAUDE.md voice rules. Names the user-visible behavior change, the per-source partial render, the performance numbers table, the "things to watch" caveats. Credits @garrytan-agents for PR #1287's diagnosis. - VERSION + package.json: 0.37.11.0 -> 0.38.2.0. - docs/architecture/frontmatter-scan-incremental.md: Phase 2 design sketch for DB-backed scan state. Schema, migration shape, writer paths (sync-side UPSERT + incremental scan + autopilot cycle phase), doctor reader, sequencing concerns, two-phase rollout plan. Starting point for the follow-up PR — sub-second steady-state doctor needs incremental state, but the schema migration carries its own contract surface (forward-reference bootstrap, schema-drift E2E, PGLite-vs-Postgres parity) that deserves its own focused PR. - tests/heavy/frontmatter_scan_wallclock.sh (new, manual / nightly per tests/heavy/README.md): seeds a synthetic 60K-file brain (10K real + 50K under node_modules/) and asserts gbrain doctor completes in <15s with frontmatter_integrity: ok. Codex C7 caught that the original plan's 1500-file budget was too small to be a meaningful guard — at that scale the test passes BEFORE AND AFTER the fix, proving nothing. 60K is the minimum that catches the descent-into-vendor-trees regression. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: adversarial review followups — CLI hint, deadline-vs-await race, between-source breadcrumb Codex adversarial review caught 4 real bugs in the v0.38.2.0 wave. All four fixed before ship. #1 (user-facing): `gbrain frontmatter validate` takes a filesystem PATH, not a source id. Pre-fix the NOT SCANNED hint pointed users at `gbrain frontmatter validate src-a` — which would fail with "no such directory", breaking the very remediation this PR ships to give them. Fix: render `src.source_path` instead. #2 (correctness): between sources, `await dbPageCountForSource(src.id)` ran unchecked. A slow query could blow past the deadline, then scanOneSource was still called and returned `status='partial'` with `files_scanned=0` — misleading ("partial scan" when actually zero files were scanned). Fix: add a post-await deadline re-check; mark source + remainder as 'skipped' if the budget already burned. #3 (UX): when the outer-loop deadline check fired BETWEEN sources, `aborted_at_source` stayed null and the doctor message said "PARTIAL SCAN" with no source name. Fix: stamp `aborted_at_source` with the source we were about to start. #4 (correctness): the COUNT query had no per-call deadline. A wedged Postgres pool could make a single COUNT hang past the budget and defeat the wall-clock guarantee. Fix: Promise.race against the remaining deadline; on timeout, resolve null and the post-await re-check (#2) marks the source skipped. Tests: 3 new regression cases in brain-writer-partial-scan.test.ts pinning the fixed contracts (skipped-vs-partial under slow COUNT, hanging COUNT within deadline, aborted_at_source before any source starts). 8648 pass / 0 fail across the full suite. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(README): refresh production-brain stats — 8.2x pages, 5.6x people, 7.4x companies Pre-update line (months stale): "17,888 pages, 4,383 people, 723 companies, 21 cron jobs running autonomously, built in 12 days." Fresh counts from ~/git/brain (the wintermute production brain): - pages: 17,888 → 146,646 (8.2x) - people: 4,383 → 24,585 (5.6x) - companies: 723 → 5,339 (7.4x) - cron jobs running: 21 → 66 (113 total, 66 enabled per ~/git/wintermute/workspace/ops/cron-snapshot.json) Dropped "built in 12 days" — at 146K pages the initial-velocity claim is stale narrative that no longer matches the current scale story. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Owner
|
Closing as superseded — this fix shipped in v0.38.2.0 via #1297 ("fix(doctor): bounded frontmatter scan + partial-state surfacing (supersedes #1287)"), which lands the same change in |
mgunnin
added a commit
to mgunnin/gbrain
that referenced
this pull request
May 28, 2026
* upstream/master: v0.38.2.0 fix(doctor): bounded frontmatter scan + partial-state surfacing (supersedes garrytan#1287) (garrytan#1297) v0.38.1.0 feat(agents): provider-agnostic subagent loop + remote MCP dispatch + budget meter (garrytan#1289) v0.38.0.0 ingestion cathedral — gbrain capture + write-through + IngestionSource contract (garrytan#1275) v0.37.11.0: fresh-install PGLite embedding setup fix wave (garrytan#1286) v0.37.10.0 feat(init): env-detection + interactive picker + preflight invariants (garrytan#1278) v0.37.9.0 fix(frontmatter): canonical-style normalization for tag arrays (garrytan#1252) v0.37.8.0 feat: voyage-code-3 discoverability + reindex-code cost-preview fix (garrytan#1267) v0.37.7.0 fix wave: federated brains + autopilot safety + OAuth confidential clients (garrytan#1253) v0.37.6.0 feat(ai): OpenRouter recipe + generic default_headers seam (cherry-pick garrytan#1210) (garrytan#1246) v0.37.5.0 fix(markdown): YAML-aware NESTED_QUOTES validator (stops flagging valid YAML) (garrytan#1229) feat: pgGraph-inspired CI scaffolding wave (v0.37.4.0) (garrytan#1228) v0.37.3.0 feat: skill_brain_first doctor check + auto-fix + declarative opt-out (supersedes garrytan#1206) (garrytan#1215) v0.37.2.0: takes_resolution_consistency CHECK accepts 'unresolvable' (garrytan#1211) v0.37.1.0 feat: brainstorm + lsd — bisociation idea generator grounded in your own brain (garrytan#1214) v0.37.0.0 feat(skillpack): registry cathedral — third-party publish + install + 10/10 quality bar (garrytan#1208) v0.36.6.0 feat: cross-modal search wave (text↔image + unified column + LLM intent) (garrytan#1165)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
On brains with 200K+ pages,
gbrain doctorhangs indefinitely during thefrontmatter_integritycheck. The check callsscanBrainSources, which synchronously walks every.mdfile on disk across all registered federated sources. On a production brain with 216K pages and 3 sources (default + zion-brain + media-corpus), this walk takes >60s and makes the doctor command appear hung.The monitoring system that calls
gbrain doctoron a cron has a 60s timeout — so the frontmatter check causes the entire health report to fail, masking all other checks.Root Cause
scanBrainSourcesalready supports anAbortSignalviaopts.signal— thewalkDircallback checkssignal.abortedon every file (line 430 of brain-writer.ts), and the source loop breaks on abort (line 382). However, the doctor caller indoctor.tsnever passes a signal, so the scan runs unbounded.Solution
Pass
AbortSignal.timeout(30000)from the doctor caller toscanBrainSources. When the timeout fires:warnstatus with instructions to run the full scan directlyThe timeout is configurable via
GBRAIN_DOCTOR_FM_TIMEOUT_MS(default: 30s) for brains that need more or less time.Changes
src/commands/doctor.ts(frontmatter_integrity section):AbortSignal.timeout(fmTimeoutMs)before the scanscanBrainSources(engine, { signal: fmAbort })AbortErrorin catch block and report actionable messageResults
gbrain frontmatter validatedirectlyTesting
gbrain frontmatter validate <path>still works independently for targeted repair