v0.38.2.0 fix(doctor): bounded frontmatter scan + partial-state surfacing (supersedes #1287) #1297
Merged
…ith partial-state surfacing

Two production-grade fixes for the v0.38.2.0 wave (supersedes PR #1287).

Root cause

Fix 1 (the bug that hung gbrain doctor on 216K-page brains): both brain-writer.ts:walkDir and frontmatter.ts:collectFiles recursed into every subdirectory without calling pruneDir, the canonical descent-time pruner used by sync/extract/transcript-discovery since v0.35.5.0. On brains that double as code workspaces, the walkers stat'd hundreds of thousands of entries under node_modules / .git / .obsidian / *.raw / ops that isSyncable filtered out at the leaf, paying the IO cost for nothing. Wiring pruneDir at descent (with the v0.37.7.0 #1169 submodule-gitfile check) eliminates the bulk of the wall-clock pain.

Fix 2 (codex outside-voice C1): AbortSignal.timeout cannot interrupt the synchronous walker: readdirSync / lstatSync / readFileSync block the event loop, so timer callbacks never fire mid-walk. The load-bearing wall-clock bound is now a deadline check inside scanOneSource's visit callback (Date.now() > opts.deadline). AbortSignal still works at source boundaries.

Shape changes (codex C2 + C4):
- ScanOpts: + deadline?: number, + dbPageCountForSource hook, + visitDir test seam
- PerSourceReport: + status: 'scanned' | 'partial' | 'skipped', + files_scanned, + db_page_count
- AuditReport: + partial: boolean, + aborted_at_source: string | null
- ok = grandTotal === 0 && !partial (a clean prefix from a timed-out scan no longer falsely reports clean)

walkDir + collectFiles now exported with an optional visitDir callback for the regression suite. Production callers don't pass it.

Tests:
- test/brain-writer-walk-prune.test.ts (new, 12 cases): visitDir-based descent-time pruning assertions for both walkers. Pins the property output-based tests can't catch (isSyncable rejects vendor files at the leaf, so a test checking only output passes under the original bug).
- test/brain-writer-partial-scan.test.ts (new, 5 cases): deadline + partial state + ok-after-abort + numerator/denominator coverage. Uses deadline, NOT AbortSignal, since codex C1 proved abort can't interrupt sync.
- test/brain-writer.test.ts: existing "abort mid-scan" test refit to the new partial-state contract (per_source has 'skipped' entries instead of being empty, which gives doctor visibility into which sources weren't checked).
- test/migrations-v0_22_4.test.ts: AuditReport fixture extended with the new required fields.

Plan + cross-model review: ~/.claude/plans/system-instruction-you-are-working-hidden-lollipop.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
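The two fixes in the commit above (prune at descent, then bound the synchronous walk with a deadline check) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `shouldPrune` is a hypothetical stand-in for the real `pruneDir`, `walkMarkdown` is not the actual `walkDir` or `collectFiles`, and the `.md` filter only approximates `isSyncable`. Only the `deadline`, `visitDir`, and `Date.now() > opts.deadline` ideas come from the commit message.

```typescript
import { lstatSync, mkdirSync, mkdtempSync, readdirSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical stand-in for pruneDir; the real rule set lives in gbrain.
const PRUNED = new Set(["node_modules", ".git", ".obsidian", "ops"]);
function shouldPrune(name: string): boolean {
  return PRUNED.has(name) || name.endsWith(".raw");
}

interface WalkOpts {
  deadline?: number; // absolute epoch milliseconds
  visitDir?: (dir: string) => void; // test seam: called for every directory entered
}

interface WalkResult {
  files: string[];
  partial: boolean; // true when the deadline cut the walk short
}

// Synchronous walk. Timers cannot fire while this loop runs, so the
// wall-clock bound has to be an explicit check inside the loop.
function walkMarkdown(root: string, opts: WalkOpts = {}): WalkResult {
  const result: WalkResult = { files: [], partial: false };
  const stack = [root];
  while (stack.length > 0) {
    const dir = stack.pop()!;
    opts.visitDir?.(dir);
    for (const name of readdirSync(dir)) {
      if (opts.deadline !== undefined && Date.now() > opts.deadline) {
        result.partial = true;
        return result;
      }
      const entry = join(dir, name);
      const st = lstatSync(entry);
      if (st.isDirectory()) {
        // Prune at descent: a pruned directory is never read or stat'd further.
        if (!shouldPrune(name)) stack.push(entry);
      } else if (st.isFile() && name.endsWith(".md")) {
        result.files.push(entry);
      }
    }
  }
  return result;
}

// Demo tree: one real page plus one vendored file under node_modules/.
const root = mkdtempSync(join(tmpdir(), "walk-sketch-"));
mkdirSync(join(root, "notes"));
mkdirSync(join(root, "node_modules", "pkg"), { recursive: true });
writeFileSync(join(root, "notes", "a.md"), "---\ntitle: a\n---\n");
writeFileSync(join(root, "node_modules", "pkg", "README.md"), "vendor");

const visited: string[] = [];
const complete = walkMarkdown(root, { visitDir: (d) => visited.push(d) });
const expired = walkMarkdown(root, { deadline: Date.now() - 1 });
console.log(complete.files.length, visited.length, expired.partial);
```

Recording directories through `visitDir` is what lets a test distinguish "never descended" from "descended and filtered at the leaf"; an assertion on `files` alone would pass either way.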
… check

Adopts the v0.38.2.0 ScanBrainSources surface in doctor's frontmatter_integrity check.

- AbortSignal.timeout(fmTimeoutMs) for between-source bound.
- deadline = Date.now() + fmTimeoutMs (the load-bearing mid-walk bound; codex C1 caught that AbortSignal alone can't fire inside the sync walker).
- GBRAIN_DOCTOR_FM_TIMEOUT_MS env override (default 30000ms; invalid values fall back to default rather than crash).
- Per-source DB denominator via SELECT COUNT(*) FROM pages WHERE source_id = $1 AND deleted_at IS NULL (codex C3: deleted_at filter so soft-deleted pages don't inflate the count).
- Honest partial-render: "PARTIAL — scanned ~N files (source has ~M pages in DB), K issue(s) so far" instead of "scanned ~N of M pages" (codex C3: the two populations are overlapping but not identical sets).
- "NOT SCANNED (timeout — run gbrain frontmatter validate <id>)" per skipped source so the user knows which sources didn't get checked.
- Catch block simplified to "unexpected error only" (codex D4: the AbortError special case from PR #1287 was unreachable in a sync walker).

Tests: test/doctor-frontmatter-partial.test.ts (new, 11 cases): structural source-grep pins on every load-bearing render string plus the simplified-catch contract. Behavioral coverage is deferred to the heavy script (tests/heavy/frontmatter_scan_wallclock.sh, T6) because runDoctor calls process.exit unconditionally and can't be driven from bun:test directly; refactoring runDoctor to return rather than exit is a separate TODO.

Plan + cross-model review: ~/.claude/plans/system-instruction-you-are-working-hidden-lollipop.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
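The doctor-side behavior listed in the commit above (env override with a safe fallback, per-source rendering, and the `ok` rule) might look roughly like the sketch below. The helper names `parseTimeoutMs`, `renderSource`, and `isOk`, the `total` field, the `source_id`/`source_path` fields on the report, and the wording of the fully scanned line are all hypothetical. Treating empty, non-numeric, and non-positive values as "invalid" is an assumption, since the commit does not define the term. The skipped line prints a filesystem path, as in the final PR description. The two render strings and the 30000 ms default are taken from this PR.

```typescript
type Status = "scanned" | "partial" | "skipped";

interface PerSourceReport {
  source_id: string;
  source_path: string;
  status: Status;
  files_scanned: number;
  db_page_count: number | null; // null when the COUNT query failed or timed out
  total: number; // hypothetical field: issues found in this source
}

const DEFAULT_FM_TIMEOUT_MS = 30_000;

// Invalid overrides fall back to the default instead of crashing.
function parseTimeoutMs(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === "") return DEFAULT_FM_TIMEOUT_MS;
  const n = Number(raw);
  return Number.isFinite(n) && n > 0 ? n : DEFAULT_FM_TIMEOUT_MS;
}

function renderSource(r: PerSourceReport): string {
  switch (r.status) {
    case "skipped":
      return `${r.source_id}: NOT SCANNED (timeout — run gbrain frontmatter validate ${r.source_path})`;
    case "partial": {
      // Files on disk and pages in the DB are overlapping but different
      // populations, so the DB count is shown as context, not as "N of M".
      const db = r.db_page_count === null ? "" : ` (source has ~${r.db_page_count} pages in DB)`;
      return `${r.source_id}: PARTIAL — scanned ~${r.files_scanned} files${db}, ${r.total} issue(s) so far`;
    }
    case "scanned":
      return `${r.source_id}: ${r.total} issue(s)`; // wording is an assumption
  }
}

// A clean prefix of a timed-out scan must not report clean.
function isOk(grandTotal: number, partial: boolean): boolean {
  return grandTotal === 0 && !partial;
}

const partialLine = renderSource({
  source_id: "src-b",
  source_path: "/brains/src-b",
  status: "partial",
  files_scanned: 42000,
  db_page_count: 200000,
  total: 14,
});
console.log(partialLine);
console.log(parseTimeoutMs("not-a-number"), isOk(0, true));
```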
…k smoke

- CHANGELOG.md: ELI10-lead-first release entry per CLAUDE.md voice rules. Names the user-visible behavior change, the per-source partial render, the performance numbers table, the "things to watch" caveats. Credits @garrytan-agents for PR #1287's diagnosis.
- VERSION + package.json: 0.37.11.0 -> 0.38.2.0.
- docs/architecture/frontmatter-scan-incremental.md: Phase 2 design sketch for DB-backed scan state. Schema, migration shape, writer paths (sync-side UPSERT + incremental scan + autopilot cycle phase), doctor reader, sequencing concerns, two-phase rollout plan. Starting point for the follow-up PR: sub-second steady-state doctor needs incremental state, but the schema migration carries its own contract surface (forward-reference bootstrap, schema-drift E2E, PGLite-vs-Postgres parity) that deserves its own focused PR.
- tests/heavy/frontmatter_scan_wallclock.sh (new, manual / nightly per tests/heavy/README.md): seeds a synthetic 60K-file brain (10K real + 50K under node_modules/) and asserts gbrain doctor completes in <15s with frontmatter_integrity: ok. Codex C7 caught that the original plan's 1500-file budget was too small to be a meaningful guard: at that scale the test passes BEFORE AND AFTER the fix, proving nothing. 60K is the minimum that catches the descent-into-vendor-trees regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…doctor-hang # Conflicts: # CHANGELOG.md # VERSION # package.json
… between-source breadcrumb

Codex adversarial review caught 4 real bugs in the v0.38.2.0 wave. All four fixed before ship.

#1 (user-facing): `gbrain frontmatter validate` takes a filesystem PATH, not a source id. Pre-fix the NOT SCANNED hint pointed users at `gbrain frontmatter validate src-a`, which would fail with "no such directory", breaking the very remediation this PR ships to give them. Fix: render `src.source_path` instead.

#2 (correctness): between sources, `await dbPageCountForSource(src.id)` ran unchecked. A slow query could blow past the deadline, then scanOneSource was still called and returned `status='partial'` with `files_scanned=0`, which is misleading ("partial scan" when actually zero files were scanned). Fix: add a post-await deadline re-check; mark source + remainder as 'skipped' if the budget already burned.

#3 (UX): when the outer-loop deadline check fired BETWEEN sources, `aborted_at_source` stayed null and the doctor message said "PARTIAL SCAN" with no source name. Fix: stamp `aborted_at_source` with the source we were about to start.

#4 (correctness): the COUNT query had no per-call deadline. A wedged Postgres pool could make a single COUNT hang past the budget and defeat the wall-clock guarantee. Fix: Promise.race against the remaining deadline; on timeout, resolve null and the post-await re-check (#2) marks the source skipped.

Tests: 3 new regression cases in brain-writer-partial-scan.test.ts pinning the fixed contracts (skipped-vs-partial under slow COUNT, hanging COUNT within deadline, aborted_at_source before any source starts). 8648 pass / 0 fail across the full suite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
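Fixes #2 and #4 above combine two ideas: race the COUNT against the time left in the budget, then re-check the deadline after the await before starting a scan. A minimal sketch under stated assumptions follows; `withDeadline` and `countThenDecide` are hypothetical names, and the real `dbPageCountForSource` hook and the scan loop around it are not shown in this PR text.

```typescript
// Resolve to null if `work` fails or does not settle before `deadline`
// (absolute epoch milliseconds). The timer is always cleared afterwards.
async function withDeadline<T>(work: Promise<T>, deadline: number): Promise<T | null> {
  // A failed query degrades to "no denominator" rather than an error.
  const safe: Promise<T | null> = work.catch(() => null);
  const remaining = deadline - Date.now();
  if (remaining <= 0) return null;
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<null>((resolve) => {
    timer = setTimeout(() => resolve(null), remaining);
  });
  try {
    return await Promise.race([safe, timeout]);
  } finally {
    clearTimeout(timer);
  }
}

interface CountDecision {
  dbPageCount: number | null;
  proceed: boolean; // false means: mark this source and the remainder 'skipped'
}

async function countThenDecide(
  sourceId: string,
  countFor: (id: string) => Promise<number>, // hypothetical COUNT hook
  deadline: number,
): Promise<CountDecision> {
  const dbPageCount = await withDeadline(countFor(sourceId), deadline);
  // Post-await re-check: the await itself may have consumed the budget.
  const proceed = !(Date.now() > deadline);
  return { dbPageCount, proceed };
}

// Usage: a fast COUNT inside a generous budget.
countThenDecide("src-a", async () => 42, Date.now() + 5000).then((d) => {
  console.log(d.dbPageCount, d.proceed);
});
```

Resolving to null instead of rejecting keeps the caller on a single code path: a missing denominator and an exhausted budget are both ordinary return values that the post-await check can act on.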
…doctor-hang # Conflicts: # CHANGELOG.md # VERSION # package.json
…e, 7.4x companies

Pre-update line (months stale): "17,888 pages, 4,383 people, 723 companies, 21 cron jobs running autonomously, built in 12 days."

Fresh counts from ~/git/brain (the wintermute production brain):
- pages: 17,888 → 146,646 (8.2x)
- people: 4,383 → 24,585 (5.6x)
- companies: 723 → 5,339 (7.4x)
- cron jobs running: 21 → 66 (113 total, 66 enabled per ~/git/wintermute/workspace/ops/cron-snapshot.json)

Dropped "built in 12 days": at 146K pages the initial-velocity claim is stale narrative that no longer matches the current scale story.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
garrytan added a commit that referenced this pull request on May 22, 2026
… v86)

Master shipped v0.38.2.0 (#1297 doctor frontmatter scan) and v0.39.0.0 (#1283 brainstorm cost cathedral). v0.39.0.0 claimed migration v86 with `page_links_view_alias`. The v0.40.2.0 trajectory-routing wave's `facts_event_type_column` migration renumbers v86 → v87.

All references updated in:
- src/core/migrate.ts: migration entry now v87, renumber comment notes the full v81→v82→v86→v87 history across three master merges.
- src/core/engine.ts, src/core/pglite-engine.ts, src/core/postgres-engine.ts: inline comments bumped to v87.
- test/migrate.test.ts: my describe blocks (11 structural + 4 round-trip cases) bumped to v87. LATEST_VERSION assertion bumped to >= 87.
- CLAUDE.md: v0.40.2.0 entry mentions v87. Master's v0.39.0.0 references to v86 (page_links_view_alias) preserved intact.
- CHANGELOG: reconstructed cleanly, with the v0.40.2.0 entry at top with v87 reference and master's v0.39.0.0 + v0.38.2.0 + v0.38.1.0 entries inserted in order below.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced May 24, 2026
mgunnin added a commit to mgunnin/gbrain that referenced this pull request on May 28, 2026
* upstream/master:
  v0.38.2.0 fix(doctor): bounded frontmatter scan + partial-state surfacing (supersedes garrytan#1287) (garrytan#1297)
  v0.38.1.0 feat(agents): provider-agnostic subagent loop + remote MCP dispatch + budget meter (garrytan#1289)
  v0.38.0.0 ingestion cathedral — gbrain capture + write-through + IngestionSource contract (garrytan#1275)
  v0.37.11.0: fresh-install PGLite embedding setup fix wave (garrytan#1286)
  v0.37.10.0 feat(init): env-detection + interactive picker + preflight invariants (garrytan#1278)
  v0.37.9.0 fix(frontmatter): canonical-style normalization for tag arrays (garrytan#1252)
  v0.37.8.0 feat: voyage-code-3 discoverability + reindex-code cost-preview fix (garrytan#1267)
  v0.37.7.0 fix wave: federated brains + autopilot safety + OAuth confidential clients (garrytan#1253)
  v0.37.6.0 feat(ai): OpenRouter recipe + generic default_headers seam (cherry-pick garrytan#1210) (garrytan#1246)
  v0.37.5.0 fix(markdown): YAML-aware NESTED_QUOTES validator (stops flagging valid YAML) (garrytan#1229)
  feat: pgGraph-inspired CI scaffolding wave (v0.37.4.0) (garrytan#1228)
  v0.37.3.0 feat: skill_brain_first doctor check + auto-fix + declarative opt-out (supersedes garrytan#1206) (garrytan#1215)
  v0.37.2.0: takes_resolution_consistency CHECK accepts 'unresolvable' (garrytan#1211)
  v0.37.1.0 feat: brainstorm + lsd — bisociation idea generator grounded in your own brain (garrytan#1214)
  v0.37.0.0 feat(skillpack): registry cathedral — third-party publish + install + 10/10 quality bar (garrytan#1208)
  v0.36.6.0 feat: cross-modal search wave (text↔image + unified column + LLM intent) (garrytan#1165)
Summary
Fixes the production hang where `gbrain doctor` froze indefinitely on large brains (216K+ pages reported in #1287) and either timed out the cron monitor or made the rest of the health report unreadable. Supersedes #1287's 10-line `AbortSignal.timeout` band-aid with a root-cause fix plus a real bounded-wallclock safety net.

Root cause: the disk walker in `brain-writer.ts:walkDir` (and its twin `frontmatter.ts:collectFiles`) didn't call `pruneDir`, the canonical descent-time pruner that sync/extract/transcript-discovery have used since v0.35.5.0. Both walkers descended into `node_modules/`, `.git/`, `.obsidian/`, `*.raw/`, and `ops/` on every doctor tick, stat'ing hundreds of thousands of vendor entries that `isSyncable` then filtered at the leaf: pure IO waste.

Bounded wallclock: `AbortSignal.timeout` alone can't interrupt the synchronous walker (sync `readdirSync`/`lstatSync`/`readFileSync` block the event loop, so timer callbacks never fire mid-walk; codex outside-voice caught this during plan-eng-review). The load-bearing bound is now `deadline?: number` plumbed into `ScanOpts` and checked per-file inside `scanOneSource`. AbortSignal stays as the between-source backstop.

Honest partial-state signal: when the deadline fires, per-source `status: 'scanned' | 'partial' | 'skipped'` + `files_scanned` numerator + DB `db_page_count` denominator give doctor everything it needs to render `src-b: PARTIAL — scanned ~42000 files (source has ~200000 pages in DB), 14 issue(s) so far` and `src-c: NOT SCANNED (timeout — run gbrain frontmatter validate <path>)`. No more black-box "scan timed out."

Test Coverage
Coverage gate: PASS (100%).
Pre-Landing Review
No structural issues. SQL parameterized (`$1` binding, no interpolation). Defensive error handling on the COUNT query (returns null on failure, doctor renders bare counts). Catch block simplified to "unexpected error only": codex D4 caught that the pre-existing AbortError branch was unreachable in a sync walker.

Adversarial Review
Codex caught 4 real bugs in the v0.38.2.0 wave during /ship Step 11. All four fixed before this PR was opened (commit b1e9778):
1. `NOT SCANNED` message said `gbrain frontmatter validate ${src.source_id}` but the command takes a filesystem PATH. Pre-fix the remediation hint would have failed with "no such directory", breaking the exact users this PR ships to help. Fixed: render `src.source_path` instead.
2. `await dbPageCountForSource()` ran unchecked. A slow COUNT (saturated pool, missing index) could blow past the deadline, then `scanOneSource` was called anyway and reported `status='partial'` with `files_scanned=0`, which is misleading. Fixed: post-await deadline re-check; mark source + remainder as 'skipped' when the budget burned during the await.
3. `aborted_at_source` null when outer-loop deadline fired. When the deadline fired BETWEEN sources, the breadcrumb stayed null and doctor's "PARTIAL SCAN" message had no source name. Fixed: stamp the source we were about to start.
4. The COUNT query had no per-call deadline. A wedged Postgres pool could make a single COUNT hang past the budget and defeat the wall-clock guarantee. Fixed: `Promise.race` against the remaining deadline; on timeout, resolve null and the post-await re-check (#2) marks the source skipped.

All four pinned by regression tests in `test/brain-writer-partial-scan.test.ts`.

Plan Completion
- `src/core/brain-writer.ts:walkDir` exports + pruneDir + visitDir
- `src/commands/frontmatter.ts:collectFiles` exports + pruneDir + visitDir
- `ok` calculation fix
- `tests/heavy/frontmatter_scan_wallclock.sh` (60K-file synthetic brain)

7/8 DONE, 1 deferred-by-design.
NOT in scope (Phase 2 follow-ups)
- … `docs/architecture/frontmatter-scan-incremental.md` for the follow-up PR.
- … `walkBrainTree` and have walkDir + collectFiles + any future caller share it. Right long-term answer per the v0.35.5.0 walker-unification pattern.
- … `AbortSignal.timeout` work properly inside the walker. The deadline-check approach gives the same wall-clock guarantee without rippling through callers.
- `runDoctor` refactor to return rather than `process.exit`. Would let unit tests drive runDoctor end-to-end. Current unit coverage is structural (source-grep) + the heavy script's subprocess run.

TODOs
No TODOS.md items completed by this diff.
Documentation
Doc sync via `/document-release` subagent hit a fast-mode rate limit during /ship. Will run `/document-release` manually after this PR merges to refresh CLAUDE.md key-files annotations for brain-writer.ts / frontmatter.ts / doctor.ts.

Test plan
- `bun run typecheck` clean
- `bun run verify` clean (all 17 pre-test gates)
- `bun run test`: 8648 pass / 0 fail across 8-shard parallel + serial
- `tests/heavy/frontmatter_scan_wallclock.sh` against synthetic 60K-file brain (manual pre-ship; script ready, runs in <60s)

🤖 Generated with Claude Code