Skip to content

fix: add timeout to doctor frontmatter_integrity check#1287

Closed
garrytan-agents wants to merge 1 commit into
garrytan:masterfrom
garrytan-agents:fix/doctor-frontmatter-timeout
Closed

fix: add timeout to doctor frontmatter_integrity check#1287
garrytan-agents wants to merge 1 commit into
garrytan:masterfrom
garrytan-agents:fix/doctor-frontmatter-timeout

Conversation

@garrytan-agents

Copy link
Copy Markdown
Contributor

Problem

On brains with 200K+ pages, gbrain doctor hangs indefinitely during the frontmatter_integrity check. The check calls scanBrainSources, which synchronously walks every .md file on disk across all registered federated sources. On a production brain with 216K pages and 3 sources (default + zion-brain + media-corpus), this walk takes >60s and makes the doctor command appear hung.

The monitoring system that calls gbrain doctor on a cron has a 60s timeout — so the frontmatter check causes the entire health report to fail, masking all other checks.

Root Cause

scanBrainSources already supports an AbortSignal via opts.signal — the walkDir callback checks signal.aborted on every file (line 430 of brain-writer.ts), and the source loop breaks on abort (line 382). However, the doctor caller in doctor.ts never passes a signal, so the scan runs unbounded.

Solution

Pass AbortSignal.timeout(30000) from the doctor caller to scanBrainSources. When the timeout fires:

  • The walk stops cleanly at the next file boundary
  • Doctor reports a warn status with instructions to run the full scan directly
  • All other doctor checks continue normally

The timeout is configurable via GBRAIN_DOCTOR_FM_TIMEOUT_MS (default: 30s) for brains that need more or less time.

Changes

src/commands/doctor.ts (frontmatter_integrity section):

  • Create AbortSignal.timeout(fmTimeoutMs) before the scan
  • Pass it to scanBrainSources(engine, { signal: fmAbort })
  • Detect AbortError in catch block and report actionable message

Results

Metric Before After
Doctor with 216K pages Hangs >60s, killed by monitoring timeout Completes in ~45s, frontmatter reports warn
Doctor with <50K pages ~30s (no issue) ~30s (timeout never fires)
gbrain frontmatter validate directly Unaffected Unaffected

Testing

  • Verified on a production brain with 216K pages across 3 federated sources
  • Doctor completes in <50s with the timeout
  • Frontmatter check reports warn with actionable fix instructions
  • Full gbrain frontmatter validate <path> still works independently for targeted repair

On brains with 200K+ pages, the frontmatter scan walks every .md file
on disk across all registered sources. This synchronous FS walk can
take minutes (observed: >60s on a 216K-page brain with 3 sources),
causing the doctor command to appear hung.

scanBrainSources already supports an AbortSignal via opts.signal —
the walkDir callback checks signal.aborted on every file, and the
source loop breaks on abort. This commit passes AbortSignal.timeout
(default 30s) from the doctor caller so the check degrades gracefully
instead of blocking the entire health report.

Configurable via GBRAIN_DOCTOR_FM_TIMEOUT_MS for brains that need
more or less time. When the timeout fires, doctor reports a warn
with instructions to run the full scan directly.
garrytan added a commit that referenced this pull request May 22, 2026
…cing (supersedes #1287) (#1297)

* fix(frontmatter): prune vendor dirs at descent + bounded wall-clock with partial-state surfacing

Two production-grade fixes for the v0.38.2.0 wave (supersedes PR #1287).

Root cause Fix 1 (the bug that hung gbrain doctor on 216K-page brains): both
brain-writer.ts:walkDir and frontmatter.ts:collectFiles recursed into every
subdirectory without calling pruneDir, the canonical descent-time pruner
used by sync/extract/transcript-discovery since v0.35.5.0. On brains that
double as code workspaces, the walkers stat'd hundreds of thousands of
entries under node_modules / .git / .obsidian / *.raw / ops that isSyncable
filtered out at the leaf — paying the IO cost for nothing. Wiring pruneDir
at descent (with the v0.37.7.0 #1169 submodule-gitfile check) eliminates
the bulk of the wall-clock pain.

Fix 2 (codex outside-voice C1): AbortSignal.timeout cannot interrupt the
synchronous walker — readdirSync / lstatSync / readFileSync block the event
loop, so timer callbacks never fire mid-walk. The load-bearing wall-clock
bound is now a deadline check inside scanOneSource's visit callback
(Date.now() > opts.deadline). AbortSignal still works at source boundaries.

Shape changes (codex C2 + C4):
- ScanOpts: + deadline?: number, + dbPageCountForSource hook, + visitDir test seam
- PerSourceReport: + status: 'scanned' | 'partial' | 'skipped', + files_scanned, + db_page_count
- AuditReport: + partial: boolean, + aborted_at_source: string | null
- ok = grandTotal === 0 && !partial (a clean prefix from a timed-out scan
  no longer falsely reports clean)

walkDir + collectFiles now exported with an optional visitDir callback for
the regression suite. Production callers don't pass it.

Tests:
- test/brain-writer-walk-prune.test.ts (new, 12 cases): visitDir-based
  descent-time pruning assertions for both walkers. Pins the property
  output-based tests can't catch (isSyncable rejects vendor files at
  the leaf — so a test checking only output passes under the original bug).
- test/brain-writer-partial-scan.test.ts (new, 5 cases): deadline + partial
  state + ok-after-abort + numerator/denominator coverage. Uses deadline,
  NOT AbortSignal, since codex C1 proved abort can't interrupt sync.
- test/brain-writer.test.ts: existing "abort mid-scan" test refit to the
  new partial-state contract (per_source has 'skipped' entries instead of
  being empty — gives doctor visibility into which sources weren't checked).
- test/migrations-v0_22_4.test.ts: AuditReport fixture extended with the
  new required fields.

Plan + cross-model review: ~/.claude/plans/system-instruction-you-are-working-hidden-lollipop.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(doctor): wire deadline + partial-state into frontmatter_integrity check

Adopts the v0.38.2.0 ScanBrainSources surface in doctor's frontmatter_integrity
check.

- AbortSignal.timeout(fmTimeoutMs) for between-source bound.
- deadline = Date.now() + fmTimeoutMs (the load-bearing mid-walk bound —
  codex C1 caught that AbortSignal alone can't fire inside the sync walker).
- GBRAIN_DOCTOR_FM_TIMEOUT_MS env override (default 30000ms; invalid values
  fall back to default rather than crash).
- Per-source DB denominator via SELECT COUNT(*) FROM pages WHERE source_id = $1
  AND deleted_at IS NULL (codex C3: deleted_at filter so soft-deleted pages
  don't inflate the count).
- Honest partial-render: "PARTIAL — scanned ~N files (source has ~M pages in
  DB), K issue(s) so far" instead of "scanned ~N of M pages" (codex C3 — the
  two populations are overlapping but not identical sets).
- "NOT SCANNED (timeout — run gbrain frontmatter validate <id>)" per skipped
  source so the user knows which sources didn't get checked.
- Catch block simplified to "unexpected error only" (codex D4 — the
  AbortError special case from PR #1287 was unreachable in a sync walker).

Tests: test/doctor-frontmatter-partial.test.ts (new, 11 cases) — structural
source-grep pins on every load-bearing render string plus the simplified-
catch contract. Behavioral coverage is deferred to the heavy script
(tests/heavy/frontmatter_scan_wallclock.sh, T6) because runDoctor calls
process.exit unconditionally and can't be driven from bun:test directly;
refactoring runDoctor to return rather than exit is a separate TODO.

Plan + cross-model review: ~/.claude/plans/system-instruction-you-are-working-hidden-lollipop.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: v0.38.2.0 release notes, Phase 2 design sketch, heavy wall-clock smoke

- CHANGELOG.md: ELI10-lead-first release entry per CLAUDE.md voice rules.
  Names the user-visible behavior change, the per-source partial render, the
  performance numbers table, the "things to watch" caveats. Credits
  @garrytan-agents for PR #1287's diagnosis.
- VERSION + package.json: 0.37.11.0 -> 0.38.2.0.
- docs/architecture/frontmatter-scan-incremental.md: Phase 2 design sketch
  for DB-backed scan state. Schema, migration shape, writer paths
  (sync-side UPSERT + incremental scan + autopilot cycle phase), doctor
  reader, sequencing concerns, two-phase rollout plan. Starting point for
  the follow-up PR — sub-second steady-state doctor needs incremental
  state, but the schema migration carries its own contract surface
  (forward-reference bootstrap, schema-drift E2E, PGLite-vs-Postgres
  parity) that deserves its own focused PR.
- tests/heavy/frontmatter_scan_wallclock.sh (new, manual / nightly per
  tests/heavy/README.md): seeds a synthetic 60K-file brain (10K real + 50K
  under node_modules/) and asserts gbrain doctor completes in <15s with
  frontmatter_integrity: ok. Codex C7 caught that the original plan's
  1500-file budget was too small to be a meaningful guard — at that scale
  the test passes BEFORE AND AFTER the fix, proving nothing. 60K is the
  minimum that catches the descent-into-vendor-trees regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: adversarial review followups — CLI hint, deadline-vs-await race, between-source breadcrumb

Codex adversarial review caught 4 real bugs in the v0.38.2.0 wave. All four
fixed before ship.

#1 (user-facing): `gbrain frontmatter validate` takes a filesystem PATH, not a
source id. Pre-fix the NOT SCANNED hint pointed users at
`gbrain frontmatter validate src-a` — which would fail with "no such
directory", breaking the very remediation this PR ships to give them. Fix:
render `src.source_path` instead.

#2 (correctness): between sources, `await dbPageCountForSource(src.id)` ran
unchecked. A slow query could blow past the deadline, then scanOneSource was
still called and returned `status='partial'` with `files_scanned=0` —
misleading ("partial scan" when actually zero files were scanned). Fix: add a
post-await deadline re-check; mark source + remainder as 'skipped' if the
budget already burned.

#3 (UX): when the outer-loop deadline check fired BETWEEN sources,
`aborted_at_source` stayed null and the doctor message said "PARTIAL SCAN"
with no source name. Fix: stamp `aborted_at_source` with the source we were
about to start.

#4 (correctness): the COUNT query had no per-call deadline. A wedged
Postgres pool could make a single COUNT hang past the budget and defeat the
wall-clock guarantee. Fix: Promise.race against the remaining deadline; on
timeout, resolve null and the post-await re-check (#2) marks the source
skipped.

Tests: 3 new regression cases in brain-writer-partial-scan.test.ts pinning
the fixed contracts (skipped-vs-partial under slow COUNT, hanging COUNT
within deadline, aborted_at_source before any source starts). 8648 pass /
0 fail across the full suite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(README): refresh production-brain stats — 8.2x pages, 5.6x people, 7.4x companies

Pre-update line (months stale): "17,888 pages, 4,383 people, 723 companies, 21
cron jobs running autonomously, built in 12 days."

Fresh counts from ~/git/brain (the wintermute production brain):
- pages: 17,888 → 146,646 (8.2x)
- people: 4,383 → 24,585 (5.6x)
- companies: 723 → 5,339 (7.4x)
- cron jobs running: 21 → 66 (113 total, 66 enabled per ~/git/wintermute/workspace/ops/cron-snapshot.json)

Dropped "built in 12 days" — at 146K pages the initial-velocity claim is
stale narrative that no longer matches the current scale story.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@garrytan

Copy link
Copy Markdown
Owner

Closing as superseded — this fix shipped in v0.38.2.0 via #1297 ("fix(doctor): bounded frontmatter scan + partial-state surfacing (supersedes #1287)"), which lands the same change in src/commands/doctor.ts:3697-3705 (fmTimeoutMs + AbortSignal.timeout(fmTimeoutMs) + report.partial partial-scan surfacing). Thanks for the production-brain timeout catch.

@garrytan garrytan closed this May 24, 2026
mgunnin added a commit to mgunnin/gbrain that referenced this pull request May 28, 2026
* upstream/master:
  v0.38.2.0 fix(doctor): bounded frontmatter scan + partial-state surfacing (supersedes garrytan#1287) (garrytan#1297)
  v0.38.1.0 feat(agents): provider-agnostic subagent loop + remote MCP dispatch + budget meter (garrytan#1289)
  v0.38.0.0 ingestion cathedral — gbrain capture + write-through + IngestionSource contract (garrytan#1275)
  v0.37.11.0: fresh-install PGLite embedding setup fix wave (garrytan#1286)
  v0.37.10.0 feat(init): env-detection + interactive picker + preflight invariants (garrytan#1278)
  v0.37.9.0 fix(frontmatter): canonical-style normalization for tag arrays (garrytan#1252)
  v0.37.8.0 feat: voyage-code-3 discoverability + reindex-code cost-preview fix (garrytan#1267)
  v0.37.7.0 fix wave: federated brains + autopilot safety + OAuth confidential clients (garrytan#1253)
  v0.37.6.0 feat(ai): OpenRouter recipe + generic default_headers seam (cherry-pick garrytan#1210) (garrytan#1246)
  v0.37.5.0 fix(markdown): YAML-aware NESTED_QUOTES validator (stops flagging valid YAML) (garrytan#1229)
  feat: pgGraph-inspired CI scaffolding wave (v0.37.4.0) (garrytan#1228)
  v0.37.3.0 feat: skill_brain_first doctor check + auto-fix + declarative opt-out (supersedes garrytan#1206) (garrytan#1215)
  v0.37.2.0: takes_resolution_consistency CHECK accepts 'unresolvable' (garrytan#1211)
  v0.37.1.0 feat: brainstorm + lsd — bisociation idea generator grounded in your own brain (garrytan#1214)
  v0.37.0.0 feat(skillpack): registry cathedral — third-party publish + install + 10/10 quality bar (garrytan#1208)
  v0.36.6.0 feat: cross-modal search wave (text↔image + unified column + LLM intent) (garrytan#1165)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants