v0.38.2.0 fix(doctor): bounded frontmatter scan + partial-state surfacing (supersedes #1287) #1297

Merged
garrytan merged 7 commits into master from garrytan/v0.38.2.0-doctor-hang
May 22, 2026

@garrytan

Owner

Summary

Fixes the production hang where gbrain doctor froze indefinitely on large brains (216K+ pages reported in #1287) and either timed out the cron monitor or made the rest of the health report unreadable. Supersedes #1287's 10-line AbortSignal.timeout band-aid with a root-cause fix plus a real bounded-wallclock safety net.

Root cause: the disk walker in brain-writer.ts:walkDir (and its twin frontmatter.ts:collectFiles) didn't call pruneDir, the canonical descent-time pruner that sync/extract/transcript-discovery have used since v0.35.5.0. Both walkers descended into node_modules/, .git/, .obsidian/, *.raw/, and ops/ on every doctor tick, stat'ing hundreds of thousands of vendor entries that isSyncable then filtered at the leaf — pure IO waste.
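
A minimal sketch of descent-time pruning, to make the root cause concrete. It is not the repo's walker: `shouldPrune` is a hypothetical stand-in for `pruneDir` (whose real signature and rule set are not shown in this PR), `walkMarkdown` is an illustrative name, and symlink handling is omitted. The optional `visitDir` callback mirrors the test seam described in this PR.

```typescript
import { mkdtempSync, mkdirSync, writeFileSync, readdirSync, rmSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical stand-in for pruneDir: directory names the walker never enters.
const PRUNED_NAMES = new Set(["node_modules", ".git", ".obsidian", "ops"]);
function shouldPrune(dirName: string): boolean {
  return PRUNED_NAMES.has(dirName) || dirName.endsWith(".raw");
}

// Iterative walker that decides at descent time, before any readdir on the
// pruned subtree, so vendor trees cost zero IO.
function walkMarkdown(root: string, visitDir?: (dir: string) => void): string[] {
  const files: string[] = [];
  const stack: string[] = [root];
  while (stack.length > 0) {
    const dir = stack.pop()!;
    visitDir?.(dir); // test seam: records every directory actually entered
    for (const entry of readdirSync(dir, { withFileTypes: true })) {
      if (entry.isDirectory()) {
        if (!shouldPrune(entry.name)) stack.push(join(dir, entry.name));
      } else if (entry.isFile() && entry.name.endsWith(".md")) {
        files.push(join(dir, entry.name));
      }
    }
  }
  return files;
}

// Tiny synthetic brain: one real note plus one vendor file.
const root = mkdtempSync(join(tmpdir(), "walk-prune-"));
mkdirSync(join(root, "notes"));
mkdirSync(join(root, "node_modules", "pkg"), { recursive: true });
writeFileSync(join(root, "notes", "a.md"), "---\ntitle: a\n---\n");
writeFileSync(join(root, "node_modules", "pkg", "README.md"), "# vendor\n");

const visited: string[] = [];
const found = walkMarkdown(root, (dir) => visited.push(dir));
const enteredVendor = visited.some((d) => d.startsWith(join(root, "node_modules")));
rmSync(root, { recursive: true, force: true });
```

The `visited` list is the interesting assertion target: a leaf-level filter can hide vendor files from the output while the walker still pays to enumerate them, so only a record of entered directories shows whether the descent was actually skipped.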

Bounded wallclock: AbortSignal.timeout alone can't interrupt the synchronous walker (sync readdirSync/lstatSync/readFileSync block the event loop, so timer callbacks never fire mid-walk — codex outside-voice caught this during plan-eng-review). The load-bearing bound is now deadline?: number plumbed into ScanOpts and checked per-file inside scanOneSource. AbortSignal stays as the between-source backstop.
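
The per-file deadline check can be sketched as follows. This is an illustration under assumptions rather than the code in `scanOneSource`: the `scanFiles` name, the `checkFile` callback, and the injectable `now` clock are hypothetical, and the clock is injected only so the example is deterministic. The comparison follows the `Date.now() > opts.deadline` form quoted in the commit message.

```typescript
type ScanStatus = "scanned" | "partial" | "skipped";

interface SourceScan {
  status: ScanStatus;
  files_scanned: number;
}

// Synchronous scan loop. A timer callback cannot run while this loop holds the
// event loop, so the loop polls the clock itself before each file.
function scanFiles(
  files: string[],
  checkFile: (path: string) => void,
  deadline?: number,
  now: () => number = Date.now, // hypothetical seam for deterministic tests
): SourceScan {
  let filesScanned = 0;
  for (const file of files) {
    if (deadline !== undefined && now() > deadline) {
      // Budget exhausted mid-walk: report what was covered so far.
      return { status: "partial", files_scanned: filesScanned };
    }
    checkFile(file);
    filesScanned++;
  }
  return { status: "scanned", files_scanned: filesScanned };
}

// Fake clock: every file "costs" 10 ms of synchronous work; the budget is 25 ms.
let clock = 0;
const bounded = scanFiles(
  ["a.md", "b.md", "c.md", "d.md"],
  () => { clock += 10; },
  25,
  () => clock,
);

// Without a deadline the same loop runs to completion.
const unbounded = scanFiles(["a.md", "b.md"], () => {});
```

With this fake clock the check passes before the first three files (at 0, 10, and 20 ms) and trips before the fourth (30 ms), so `bounded` comes back as `partial` with three files scanned.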

Honest partial-state signal: when the deadline fires, per-source status: 'scanned' | 'partial' | 'skipped' + files_scanned numerator + DB db_page_count denominator give doctor everything it needs to render src-b: PARTIAL — scanned ~42000 files (source has ~200000 pages in DB), 14 issue(s) so far and src-c: NOT SCANNED (timeout — run gbrain frontmatter validate <path>). No more black-box "scan timed out."
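
A sketch of how the per-source fields could map onto those render strings. The `PARTIAL` and `NOT SCANNED` formats are copied from the examples above; everything else is an assumption for illustration: the `renderSource` name, the `issues` field, the line emitted for a fully scanned source, and the choice to drop the parenthetical when `db_page_count` is null.

```typescript
interface PerSourceReport {
  source_id: string;
  source_path: string;
  status: "scanned" | "partial" | "skipped";
  files_scanned: number;
  db_page_count: number | null; // null when the COUNT query failed or timed out
  issues: number; // hypothetical field name for the per-source issue count
}

function renderSource(r: PerSourceReport): string {
  const dash = "\u2014"; // the dash character used in the render strings
  switch (r.status) {
    case "partial": {
      // Assumption: with no denominator available, show the bare numerator.
      const denominator =
        r.db_page_count === null ? "" : ` (source has ~${r.db_page_count} pages in DB)`;
      return `${r.source_id}: PARTIAL ${dash} scanned ~${r.files_scanned} files${denominator}, ${r.issues} issue(s) so far`;
    }
    case "skipped":
      // The hint takes a filesystem path, not a source id.
      return `${r.source_id}: NOT SCANNED (timeout ${dash} run gbrain frontmatter validate ${r.source_path})`;
    case "scanned":
      // Hypothetical format; this PR does not quote the fully-scanned line.
      return `${r.source_id}: ${r.issues} issue(s)`;
  }
}

const partialLine = renderSource({
  source_id: "src-b",
  source_path: "/brains/b",
  status: "partial",
  files_scanned: 42000,
  db_page_count: 200000,
  issues: 14,
});

const skippedLine = renderSource({
  source_id: "src-c",
  source_path: "/brains/c",
  status: "skipped",
  files_scanned: 0,
  db_page_count: null,
  issues: 0,
});
```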

Test Coverage

CODE PATHS                                            COVERAGE
[+] src/core/brain-writer.ts:walkDir
  ├── pruneDir gate (every dir name)                  ★★★ 5 cases
  ├── submodule gitfile detection                     ★★★ 1 case
  ├── regular subdir descent                          ★★★ 1 case
  └── existing symlink + visited-set                  ★★★ pre-existing

[+] src/core/brain-writer.ts:scanBrainSources
  ├── full scan (no abort)                            ★★★ 1 case + pre-existing
  ├── deadline mid-walk → partial+skipped             ★★★ 1 case
  ├── deadline between sources → all skipped          ★★★ 1 case
  ├── AbortSignal pre-scan → all skipped              ★★★ 1 case
  ├── ok = false when partial (codex C2)              ★★★ 1 case
  ├── files_scanned numerator (codex C4)              ★★★ 1 case
  ├── COUNT(*) hook failure → null degrade            ★★★ 1 case
  ├── deadline-vs-await race (codex adv #2)           ★★★ 1 case (new)
  ├── between-source aborted_at_source (codex adv #3) ★★★ 1 case (new)
  └── hanging COUNT racing against deadline (codex adv #4) ★★★ 1 case (new)

[+] src/commands/frontmatter.ts:collectFiles
  └── pruneDir gate (parity with walkDir)             ★★★ 5 cases + 1 single-file

[+] src/commands/doctor.ts:frontmatter_integrity     STRUCTURAL (source-grep)
  ├── deadline + AbortSignal both wired               ★★  pinned
  ├── DB COUNT denominator query shape                ★★  pinned
  ├── partial render strings (PARTIAL/NOT SCANNED)    ★★  pinned
  ├── files_scanned + db_page_count plumbing          ★★  pinned
  ├── catch simplified (no DOMException branch)       ★★  pinned
  └── CLI hint uses source_path not source_id (codex adv #1)  ★★  pinned (new)

COVERAGE: 20/20 paths tested (100%)
QUALITY: ★★★ 17, ★★ 6
GAPS: 0 unit-level. 1 behavioral (runDoctor end-to-end — covered by heavy script).
TESTS: 8403 → 8648 (+245 from master merge, +37 mine).

Coverage gate: PASS (100%).

Pre-Landing Review

No structural issues. SQL parameterized ($1 binding, no interpolation). Defensive error handling on the COUNT query (returns null on failure, doctor renders bare counts). Catch block simplified to "unexpected error only" — codex D4 caught the pre-existing AbortError branch was unreachable in a sync walker.

Adversarial Review

Codex caught 4 real bugs in the v0.38.2.0 wave during /ship Step 11. All four fixed before this PR was opened (commit b1e9778):

  1. CLI hint pointed at the wrong shape. NOT SCANNED message said gbrain frontmatter validate ${src.source_id} but the command takes a filesystem PATH. Pre-fix the remediation hint would have failed with "no such directory" — breaking the exact users this PR ships to help. Fixed: render src.source_path instead.
  2. Deadline-vs-await race. Between sources, await dbPageCountForSource() ran unchecked. A slow COUNT (saturated pool, missing index) could blow past the deadline, then scanOneSource was called anyway and reported status='partial' with files_scanned=0 — misleading. Fixed: post-await deadline re-check; mark source + remainder as 'skipped' when the budget burned during the await.
  3. aborted_at_source null when outer-loop deadline fired. When the deadline fired BETWEEN sources, the breadcrumb stayed null and doctor's "PARTIAL SCAN" message had no source name. Fixed: stamp the source we were about to start.
  4. COUNT query had no per-call deadline. A wedged Postgres pool could make a single COUNT hang past the budget. Fixed: Promise.race against the remaining deadline; on timeout, resolve null and the post-await re-check (item 2 above) marks the source skipped.

All four pinned by regression tests in test/brain-writer-partial-scan.test.ts.
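
Fixes 2 and 4 work together, and a rough sketch may help show how. The helper names `countWithDeadline` and `prepareSource` are hypothetical, and `count` stands in for the `dbPageCountForSource` hook; the real code may be structured differently.

```typescript
// Race a COUNT-style promise against the time left in the budget.
// Resolves null on timeout or on query failure, and never rejects.
function countWithDeadline(
  count: () => Promise<number>,
  deadline: number,
): Promise<number | null> {
  const remainingMs = Math.max(0, deadline - Date.now());
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<null>((resolve) => {
    timer = setTimeout(() => resolve(null), remainingMs);
  });
  // Promise.resolve().then(...) also captures a synchronous throw from count().
  const guarded: Promise<number | null> = Promise.resolve()
    .then(count)
    .catch(() => null);
  // Clear the timer so a fast COUNT does not leave a pending timeout behind.
  return Promise.race([guarded, timeout]).finally(() => clearTimeout(timer));
}

// Post-await re-check: if the budget burned while waiting on the COUNT,
// the caller should mark this source (and the remainder) as skipped rather
// than start a walk that would report partial with zero files scanned.
async function prepareSource(
  count: () => Promise<number>,
  deadline: number,
): Promise<{ proceed: boolean; db_page_count: number | null }> {
  const db_page_count = await countWithDeadline(count, deadline);
  const proceed = !(Date.now() > deadline);
  return { proceed, db_page_count };
}
```

Because a pending promise cannot be cancelled, the race only stops the scan from waiting on it; a wedged query may still be outstanding on the pool after the helper resolves null.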

Plan Completion

ID  Status              Evidence
T1  DONE                src/core/brain-writer.ts:walkDir exports + pruneDir + visitDir
T2  DONE                src/commands/frontmatter.ts:collectFiles exports + pruneDir + visitDir
T3  DONE                ScanOpts.deadline + PerSourceReport status/files_scanned/db_page_count + AuditReport partial/aborted_at_source + ok calculation fix
T4  DONE                doctor.ts deadline + denominator + honest render + simplified catch
T5  DONE                4 test files, 36 cases (33 plan + 3 adversarial regression)
T6  DONE                tests/heavy/frontmatter_scan_wallclock.sh (60K-file synthetic brain)
T7  DONE                CHANGELOG + VERSION + package.json + Phase 2 design sketch
T8  DEFERRED-by-design  Close PR #1287 AFTER this merges (codex C9)

7/8 DONE, 1 deferred-by-design.

NOT in scope (Phase 2 follow-ups)

  • DB-backed scan state for sub-second steady-state doctor. Right architecture but its own contract surface (schema migration + forward-reference bootstrap + schema-drift E2E + PGLite-vs-Postgres parity). Design captured in docs/architecture/frontmatter-scan-incremental.md for the follow-up PR.
  • Walker canonicalization. Extract one exported walkBrainTree and have walkDir + collectFiles + any future caller share it. Right long-term answer per the v0.35.5.0 walker-unification pattern.
  • Async / fs.promises walker refactor. Would let AbortSignal.timeout work properly inside the walker. The deadline-check approach gives the same wall-clock guarantee without rippling through callers.
  • runDoctor refactor to return rather than process.exit. Would let unit tests drive runDoctor end-to-end. Current unit coverage is structural (source-grep) + the heavy script's subprocess run.

TODOs

No TODOS.md items completed by this diff.

Documentation

Doc sync via /document-release subagent hit a fast-mode rate limit during /ship. Will run /document-release manually after this PR merges to refresh CLAUDE.md key-files annotations for brain-writer.ts / frontmatter.ts / doctor.ts.

Test plan

  • bun run typecheck clean
  • bun run verify clean (all 17 pre-test gates)
  • bun run test — 8648 pass / 0 fail across 8-shard parallel + serial
  • Codex adversarial review — 4 real bugs caught, all fixed pre-PR with regression tests
  • tests/heavy/frontmatter_scan_wallclock.sh against synthetic 60K-file brain (manual pre-ship — script ready, runs in <60s)
  • Smoke against the originally-reported 216K-page brain (manual post-merge — confirms cron monitor stays green)
  • After merge: close PR #1287 (fix: add timeout to doctor frontmatter_integrity check) with thanks to @garrytan-agents pointing at this branch

🤖 Generated with Claude Code

garrytan and others added 7 commits May 22, 2026 08:41
…ith partial-state surfacing

Two production-grade fixes for the v0.38.2.0 wave (supersedes PR #1287).

Root cause Fix 1 (the bug that hung gbrain doctor on 216K-page brains): both
brain-writer.ts:walkDir and frontmatter.ts:collectFiles recursed into every
subdirectory without calling pruneDir, the canonical descent-time pruner
used by sync/extract/transcript-discovery since v0.35.5.0. On brains that
double as code workspaces, the walkers stat'd hundreds of thousands of
entries under node_modules / .git / .obsidian / *.raw / ops that isSyncable
filtered out at the leaf — paying the IO cost for nothing. Wiring pruneDir
at descent (with the v0.37.7.0 #1169 submodule-gitfile check) eliminates
the bulk of the wall-clock pain.

Fix 2 (codex outside-voice C1): AbortSignal.timeout cannot interrupt the
synchronous walker — readdirSync / lstatSync / readFileSync block the event
loop, so timer callbacks never fire mid-walk. The load-bearing wall-clock
bound is now a deadline check inside scanOneSource's visit callback
(Date.now() > opts.deadline). AbortSignal still works at source boundaries.

Shape changes (codex C2 + C4):
- ScanOpts: + deadline?: number, + dbPageCountForSource hook, + visitDir test seam
- PerSourceReport: + status: 'scanned' | 'partial' | 'skipped', + files_scanned, + db_page_count
- AuditReport: + partial: boolean, + aborted_at_source: string | null
- ok = grandTotal === 0 && !partial (a clean prefix from a timed-out scan
  no longer falsely reports clean)
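
The shape changes and the `ok` rule can be sketched like this. Field names follow the list above where it gives them; `total` (a per-source issue count), `source_id` on the per-source report, the `buildAuditReport` helper, and deriving `aborted_at_source` from the first source that did not finish are illustrative assumptions.

```typescript
type ScanStatus = "scanned" | "partial" | "skipped";

interface PerSourceReport {
  source_id: string;
  status: ScanStatus;
  files_scanned: number;
  db_page_count: number | null;
  total: number; // hypothetical: issues found in this source
}

interface AuditReport {
  ok: boolean;
  partial: boolean;
  aborted_at_source: string | null;
  per_source: PerSourceReport[];
}

function buildAuditReport(perSource: PerSourceReport[]): AuditReport {
  const grandTotal = perSource.reduce((sum, s) => sum + s.total, 0);
  // Assumption: the first source that did not finish is where the scan stopped.
  const firstIncomplete = perSource.find((s) => s.status !== "scanned");
  const partial = firstIncomplete !== undefined;
  return {
    // Zero issues only counts as clean when every source was fully scanned.
    ok: grandTotal === 0 && !partial,
    partial,
    aborted_at_source: firstIncomplete ? firstIncomplete.source_id : null,
    per_source: perSource,
  };
}

// A clean prefix followed by a skipped source must not report ok.
const cleanPrefix = buildAuditReport([
  { source_id: "src-a", status: "scanned", files_scanned: 100, db_page_count: 100, total: 0 },
  { source_id: "src-b", status: "skipped", files_scanned: 0, db_page_count: null, total: 0 },
]);

const fullyClean = buildAuditReport([
  { source_id: "src-a", status: "scanned", files_scanned: 100, db_page_count: 100, total: 0 },
]);
```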

walkDir + collectFiles now exported with an optional visitDir callback for
the regression suite. Production callers don't pass it.

Tests:
- test/brain-writer-walk-prune.test.ts (new, 12 cases): visitDir-based
  descent-time pruning assertions for both walkers. Pins the property
  output-based tests can't catch (isSyncable rejects vendor files at
  the leaf — so a test checking only output passes under the original bug).
- test/brain-writer-partial-scan.test.ts (new, 5 cases): deadline + partial
  state + ok-after-abort + numerator/denominator coverage. Uses deadline,
  NOT AbortSignal, since codex C1 proved abort can't interrupt sync.
- test/brain-writer.test.ts: existing "abort mid-scan" test refit to the
  new partial-state contract (per_source has 'skipped' entries instead of
  being empty — gives doctor visibility into which sources weren't checked).
- test/migrations-v0_22_4.test.ts: AuditReport fixture extended with the
  new required fields.

Plan + cross-model review: ~/.claude/plans/system-instruction-you-are-working-hidden-lollipop.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… check

Adopts the v0.38.2.0 ScanBrainSources surface in doctor's frontmatter_integrity
check.

- AbortSignal.timeout(fmTimeoutMs) for between-source bound.
- deadline = Date.now() + fmTimeoutMs (the load-bearing mid-walk bound —
  codex C1 caught that AbortSignal alone can't fire inside the sync walker).
- GBRAIN_DOCTOR_FM_TIMEOUT_MS env override (default 30000ms; invalid values
  fall back to default rather than crash).
- Per-source DB denominator via SELECT COUNT(*) FROM pages WHERE source_id = $1
  AND deleted_at IS NULL (codex C3: deleted_at filter so soft-deleted pages
  don't inflate the count).
- Honest partial-render: "PARTIAL — scanned ~N files (source has ~M pages in
  DB), K issue(s) so far" instead of "scanned ~N of M pages" (codex C3 — the
  two populations are overlapping but not identical sets).
- "NOT SCANNED (timeout — run gbrain frontmatter validate <id>)" per skipped
  source so the user knows which sources didn't get checked.
- Catch block simplified to "unexpected error only" (codex D4 — the
  AbortError special case from PR #1287 was unreachable in a sync walker).
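
One way the env override with fallback could look, as a sketch only. The variable name and the 30000 ms default come from the list above; the `resolveFmTimeoutMs` helper and the definition of "invalid" (unset, empty, non-numeric, non-finite, zero, or negative) are assumptions, not the shipped parsing rules.

```typescript
const DEFAULT_FM_TIMEOUT_MS = 30_000;

// Parse the raw env value; anything unusable falls back to the default
// instead of throwing, so a bad override cannot crash doctor.
function resolveFmTimeoutMs(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === "") return DEFAULT_FM_TIMEOUT_MS;
  const parsed = Number(raw);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_FM_TIMEOUT_MS;
}

const fromUnset = resolveFmTimeoutMs(undefined);
const fromValid = resolveFmTimeoutMs("5000");
const fromGarbage = resolveFmTimeoutMs("abc");
const fromNegative = resolveFmTimeoutMs("-1");
```

A caller would pass `process.env.GBRAIN_DOCTOR_FM_TIMEOUT_MS` and use the result for both bounds, `AbortSignal.timeout(fmTimeoutMs)` and `Date.now() + fmTimeoutMs`.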

Tests: test/doctor-frontmatter-partial.test.ts (new, 11 cases) — structural
source-grep pins on every load-bearing render string plus the simplified-
catch contract. Behavioral coverage is deferred to the heavy script
(tests/heavy/frontmatter_scan_wallclock.sh, T6) because runDoctor calls
process.exit unconditionally and can't be driven from bun:test directly;
refactoring runDoctor to return rather than exit is a separate TODO.

Plan + cross-model review: ~/.claude/plans/system-instruction-you-are-working-hidden-lollipop.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…k smoke

- CHANGELOG.md: ELI10-lead-first release entry per CLAUDE.md voice rules.
  Names the user-visible behavior change, the per-source partial render, the
  performance numbers table, the "things to watch" caveats. Credits
  @garrytan-agents for PR #1287's diagnosis.
- VERSION + package.json: 0.37.11.0 -> 0.38.2.0.
- docs/architecture/frontmatter-scan-incremental.md: Phase 2 design sketch
  for DB-backed scan state. Schema, migration shape, writer paths
  (sync-side UPSERT + incremental scan + autopilot cycle phase), doctor
  reader, sequencing concerns, two-phase rollout plan. Starting point for
  the follow-up PR — sub-second steady-state doctor needs incremental
  state, but the schema migration carries its own contract surface
  (forward-reference bootstrap, schema-drift E2E, PGLite-vs-Postgres
  parity) that deserves its own focused PR.
- tests/heavy/frontmatter_scan_wallclock.sh (new, manual / nightly per
  tests/heavy/README.md): seeds a synthetic 60K-file brain (10K real + 50K
  under node_modules/) and asserts gbrain doctor completes in <15s with
  frontmatter_integrity: ok. Codex C7 caught that the original plan's
  1500-file budget was too small to be a meaningful guard — at that scale
  the test passes BEFORE AND AFTER the fix, proving nothing. 60K is the
  minimum that catches the descent-into-vendor-trees regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…doctor-hang

# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
… between-source breadcrumb

Codex adversarial review caught 4 real bugs in the v0.38.2.0 wave. All four
fixed before ship.

#1 (user-facing): `gbrain frontmatter validate` takes a filesystem PATH, not a
source id. Pre-fix the NOT SCANNED hint pointed users at
`gbrain frontmatter validate src-a` — which would fail with "no such
directory", breaking the very remediation this PR ships to give them. Fix:
render `src.source_path` instead.

#2 (correctness): between sources, `await dbPageCountForSource(src.id)` ran
unchecked. A slow query could blow past the deadline, then scanOneSource was
still called and returned `status='partial'` with `files_scanned=0` —
misleading ("partial scan" when actually zero files were scanned). Fix: add a
post-await deadline re-check; mark source + remainder as 'skipped' if the
budget already burned.

#3 (UX): when the outer-loop deadline check fired BETWEEN sources,
`aborted_at_source` stayed null and the doctor message said "PARTIAL SCAN"
with no source name. Fix: stamp `aborted_at_source` with the source we were
about to start.

#4 (correctness): the COUNT query had no per-call deadline. A wedged
Postgres pool could make a single COUNT hang past the budget and defeat the
wall-clock guarantee. Fix: Promise.race against the remaining deadline; on
timeout, resolve null and the post-await re-check (#2) marks the source
skipped.

Tests: 3 new regression cases in brain-writer-partial-scan.test.ts pinning
the fixed contracts (skipped-vs-partial under slow COUNT, hanging COUNT
within deadline, aborted_at_source before any source starts). 8648 pass /
0 fail across the full suite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…doctor-hang

# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
…e, 7.4x companies

Pre-update line (months stale): "17,888 pages, 4,383 people, 723 companies, 21
cron jobs running autonomously, built in 12 days."

Fresh counts from ~/git/brain (the wintermute production brain):
- pages: 17,888 → 146,646 (8.2x)
- people: 4,383 → 24,585 (5.6x)
- companies: 723 → 5,339 (7.4x)
- cron jobs running: 21 → 66 (113 total, 66 enabled per ~/git/wintermute/workspace/ops/cron-snapshot.json)

Dropped "built in 12 days" — at 146K pages the initial-velocity claim is
stale narrative that no longer matches the current scale story.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@garrytan garrytan merged commit 3de06b6 into master May 22, 2026
8 checks passed
garrytan added a commit that referenced this pull request May 22, 2026
… v86)

Master shipped v0.38.2.0 (#1297 doctor frontmatter scan) and v0.39.0.0
(#1283 brainstorm cost cathedral). v0.39.0.0 claimed migration v86
with `page_links_view_alias`.

The v0.40.2.0 trajectory-routing wave's `facts_event_type_column`
migration renumbers v86 → v87. All references updated in:
  - src/core/migrate.ts: migration entry now v87, renumber comment
    notes the full v81→v82→v86→v87 history across three master merges.
  - src/core/engine.ts, src/core/pglite-engine.ts,
    src/core/postgres-engine.ts: inline comments bumped to v87.
  - test/migrate.test.ts: my describe blocks (11 structural + 4
    round-trip cases) bumped to v87. LATEST_VERSION assertion bumped
    to >= 87.
  - CLAUDE.md: v0.40.2.0 entry mentions v87. Master's v0.39.0.0
    references to v86 (page_links_view_alias) preserved intact.
  - CHANGELOG: reconstructed cleanly — v0.40.2.0 entry at top with
    v87 reference, master's v0.39.0.0 + v0.38.2.0 + v0.38.1.0
    entries inserted in order below.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mgunnin added a commit to mgunnin/gbrain that referenced this pull request May 28, 2026
* upstream/master:
  v0.38.2.0 fix(doctor): bounded frontmatter scan + partial-state surfacing (supersedes garrytan#1287) (garrytan#1297)
  v0.38.1.0 feat(agents): provider-agnostic subagent loop + remote MCP dispatch + budget meter (garrytan#1289)
  v0.38.0.0 ingestion cathedral — gbrain capture + write-through + IngestionSource contract (garrytan#1275)
  v0.37.11.0: fresh-install PGLite embedding setup fix wave (garrytan#1286)
  v0.37.10.0 feat(init): env-detection + interactive picker + preflight invariants (garrytan#1278)
  v0.37.9.0 fix(frontmatter): canonical-style normalization for tag arrays (garrytan#1252)
  v0.37.8.0 feat: voyage-code-3 discoverability + reindex-code cost-preview fix (garrytan#1267)
  v0.37.7.0 fix wave: federated brains + autopilot safety + OAuth confidential clients (garrytan#1253)
  v0.37.6.0 feat(ai): OpenRouter recipe + generic default_headers seam (cherry-pick garrytan#1210) (garrytan#1246)
  v0.37.5.0 fix(markdown): YAML-aware NESTED_QUOTES validator (stops flagging valid YAML) (garrytan#1229)
  feat: pgGraph-inspired CI scaffolding wave (v0.37.4.0) (garrytan#1228)
  v0.37.3.0 feat: skill_brain_first doctor check + auto-fix + declarative opt-out (supersedes garrytan#1206) (garrytan#1215)
  v0.37.2.0: takes_resolution_consistency CHECK accepts 'unresolvable' (garrytan#1211)
  v0.37.1.0 feat: brainstorm + lsd — bisociation idea generator grounded in your own brain (garrytan#1214)
  v0.37.0.0 feat(skillpack): registry cathedral — third-party publish + install + 10/10 quality bar (garrytan#1208)
  v0.36.6.0 feat: cross-modal search wave (text↔image + unified column + LLM intent) (garrytan#1165)