v0.41.27.0 fix(doctor): git-aware sync_freshness (supersedes #1564)#1573
Merged
Conversation
src/core/git-head.ts is a single-responsibility module: probe git HEAD + working-tree clean state for a local_path, compare against the last_commit SHA the DB stored at last sync completion. Designed for reuse (autopilot's per-source dispatch will want the same gate, filed as v0.41.27.1+ TODO). Shell-injection safe: uses execFileSync with array args. The superseded community PR #1564 used execSync through /bin/sh -c with JSON.stringify for shell-escape, which is unsafe — JSON.stringify escapes for JSON, not shell. Fail-open contract: every error path returns false, preserving the caller's prior time-based behavior. Two test seams (_setGitHeadProbeForTests, _setGitCleanProbeForTests) match the last-retrieved.ts precedent so unit tests stay parallel-eligible (no mock.module per CLAUDE.md R2). Companion test suite: 14 cases including a load-bearing shell-injection regression guard that runs real execFileSync against '/nonexistent/$(touch <sentinel>)/repo' and asserts the sentinel file is never created.
checkSyncFreshness gains opts.localOnly: boolean. When local CLI caller passes true, doctor short-circuits the staleness warning iff HEAD == sources.last_commit AND working tree is clean AND sources.chunker_version matches CURRENT (mirrors sync.ts:1057+1075's own "do work?" predicate, so doctor and sync agree). Inline SELECT widens to carry last_commit + chunker_version (columns already exist; no schema migration). Three-bucket count math (unchanged_count + synced_recently_count + stale_count === sources.length) populates Check.details for dashboards / JSON consumers. OK-message reshape: - all-unchanged → "All N up to date (no new commits since last sync)" - mixed → "N source(s): X synced recently, Y unchanged since last sync" - all-recent → "All N synced recently" (back-compat). Trust boundary preserved (Codex P0-1): runDoctor (local CLI, trusted) passes localOnly: true; doctorReportRemote (HTTP MCP, untrusted) keeps the default false. Default-false is fail-closed — a future caller that forgets the opt gets the safe no-probe behavior. checkCycleFreshness is INTENTIONALLY NOT touched (Codex P0-2): last_commit == HEAD answers "new commits to sync?" but cannot answer "did the full cycle complete?" — silencing cycle warns on git-clean sources would mask the case where sync ran but extract/embed/ consolidate/synthesize failed. Test coverage: 9 new cases in test/doctor.test.ts including - HEAD-match short-circuit + cold-path message - HEAD-mismatch warn - NULL last_commit (legacy data) → warn - non-git local_path (probe returns null) → warn (fail-open) - 3-source mixed bucket invariant (sum === length) - chunker-version mismatch warn (dirty-tree-clean still gates) - dirty-tree warn (HEAD-match still gates) - D4 regression guard: localOnly=false (default) NEVER calls probes All 87 tests across git-head + doctor + cycle-freshness suites pass. bun run verify clean (28/28 checks). Co-Authored-By: garrytan-agents <me@garrytan.com>
Supersedes community PR #1564. Co-Authored-By preserved. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ness-git-check # Conflicts: # CHANGELOG.md # VERSION # package.json
CI failed on the v0.41.27.0 ship after merging origin/master because
test/query-cache-knobs-hash.test.ts predates the v0.36.2.0 flip of
DEFAULT_EMBEDDING_DIMENSIONS from 1536 → 1280 (ZE Matryoshka).
Without an explicit gateway pin, initSchema() sizes query_cache.embedding
at halfvec(1280), but the test fixture (makeEmbedding at line 44) emits
1536-dim unit vectors. Result: 5 cases in the
"SemanticQueryCache cross-mode isolation (CDX-4 hotfix)" describe
crashed with "expected 1280 dimensions, not 1536".
Fix mirrors test/consolidate-valid-until.test.ts (the canonical
gateway-pin pattern that landed when the default flipped). resetGateway()
+ configureGateway({embedding_dimensions: 1536}) in beforeAll forces the
schema to size at halfvec(1536) regardless of cross-file gateway state.
resetGateway() in afterAll restores defaults so the next file in the
shard isn't poisoned.
Verified: 9/9 cases pass; bun run verify 29/29 clean.
garrytan
added a commit
that referenced
this pull request
May 29, 2026
Master shipped v0.41.27.0 (#1573 git-aware sync_freshness check) which claimed the version slot this branch originally targeted. This branch already rebumped to v0.41.28.0 in the prior commit; merging master in: - VERSION stays 0.41.28.0 (next slot above master's 0.41.27.0) - CHANGELOG.md: my v0.41.28.0 entry on top; master's v0.41.27.0 entry preserved verbatim below - package.json: synced to 0.41.28.0 - src/commands/doctor.ts, CLAUDE.md, llms-full.txt: auto-merged (master's sync_freshness check + my batch_retry_health disconnect audit extension both land; no overlap) - New from master: src/core/git-head.ts + test/core/git-head.test.ts No #1570-surface conflicts — master's sync_freshness work is orthogonal to the retry/facts/disconnect-audit changes.
garrytan
added a commit
that referenced
this pull request
May 29, 2026
… drain + disconnect audit (closes #1570) (#1608) * merge master: rebump v0.41.25.0 → v0.41.27.0 (queue collision) Master shipped v0.41.25.0 (#1538 batched sync deletes) and v0.41.26.0 (#1571 dream --source fix) while this branch was in flight. Conflict resolution rebumps to the next available slot. - VERSION: 0.41.25.0 → 0.41.27.0 - package.json: synced - CHANGELOG.md: my v0.41.27.0 entry placed above master's v0.41.26.0 and v0.41.25.0; in-entry version references updated 0.41.25.0 → 0.41.27.0 and forward-references bumped to v0.41.28+. - TODOS.md: kept master's v0.41.20.x section + my v0.41.27.0+ follow-ups No source-file conflicts during the merge. * feat(diagnostics): db-disconnect audit + doctor surface (v0.41.27.0) Instruments every db.disconnect() and PostgresEngine.disconnect() call with a JSONL audit record so the next user-reported #1570 cycle gives us the offender's caller stack instead of the symptomatic "No database connection" error. Audit shape (~/.gbrain/audit/db-disconnect-YYYY-Www.jsonl): {ts, engine_kind, connection_style, caller_stack[], command, pid} - src/core/audit/db-disconnect-audit.ts (NEW): the audit writer, built on the v0.40.4.0 createAuditWriter cathedral. Captures a 6-frame stack via new Error().stack so the offender is readable without spending stderr noise. - src/core/db.ts: logDbDisconnect call at the top of disconnect() (best-effort; never blocks the real teardown). - src/core/postgres-engine.ts: same instrumentation in PostgresEngine.disconnect() — distinguishes 'module' vs 'instance' connection_style so we can tell legitimate worker-pool teardowns apart from the load-bearing module-singleton class. - src/commands/doctor.ts: extends batch_retry_health to surface 24h disconnect count + most-recent caller stack. Warns when the caller frame isn't a known CLI-exit frame (e.g. cli.ts's finally block at the end of an op-dispatch). This is the diagnostic that tells v0.41.28+ where to apply the real ownership fix. - test/db-disconnect-audit.test.ts: unit coverage for the audit writer + caller-stack capture + JSONL shape. - test/e2e/db-singleton-shared-recovery.test.ts: real-Postgres regression that exercises the singleton-null path end-to-end. Refs #1570 * feat(retry): self-heal on null singleton — closes #1570 symptom (v0.41.27.0) withRetry gains an opt-in reconnect callback that fires between the isRetryableConnError classification and the inter-attempt sleep. PostgresEngine.batchRetry injects this.reconnect() — race-safe via the existing _reconnecting guard, handles module and instance pools. Closes the production loss reported in #1570: dream cycles on Supabase no longer drop ~150 link rows per cycle when the singleton goes null mid-batch. The retry now rebuilds the connection between attempts so the second try has somewhere to write to. - src/core/retry.ts: WithRetryOpts gains `reconnect?: () => Promise<void>`. Awaited in the catch branch. onRetry is also now awaited (back-compat- safe: every existing in-tree caller is a sync arrow). Reconnect failures propagate as the real cause — replaces the symptomatic "No database connection" error with whatever the connect() throw was, so operators see the truth. - src/core/postgres-engine.ts:batchRetry — injects `reconnect: () => this.reconnect()`. Covers all 9 batch-retry call sites (addLinksBatch, addTimelineEntriesBatch, upsertChunks, plus the 6 caller-supplied auditSite labels in extract / sync / reindex). - test/core/retry-reconnect.test.ts: 8 hermetic cases pinning the contract — reconnect fires before sleep, only on retryable errors, back-compat when omitted, signal-aborted bypasses reconnect, onRetry is awaited, full success path end-to-end. The deeper bug (who's calling disconnect mid-cycle) is left unaddressed in this commit by design — the diagnostic instrumentation in the prior commit will tell us in the next production run. Refs #1570 * feat(facts): drainPending() + CLI await before disconnect (v0.41.27.0) Closes the silent 'No database connection' tail-end errors after gbrain capture / put_page: the facts:absorb fire-and-forget queue sometimes outlived the CLI process's connection lifetime, so absorb attempts after engine.disconnect() landed in stderr as the GBrainError shape. - src/core/facts/queue.ts: new drainPending({timeout: 1000}) method distinct from shutdown(). Stops accepting new enqueues, awaits in-flight settle, bounded by timeout, returns count of unfinished. Semantically different from shutdown() (which aborts in-flight) so the symptom — drop work that hasn't started yet but let in-flight work finish — matches what CLI exit actually needs. - src/cli.ts: op-dispatch finally block awaits the drain BEFORE engine.disconnect(). Bounded 1s. Opt-out env GBRAIN_NO_FACTS_DRAIN for callers that don't enqueue (keeps fast-exit paths fast). Mirrors the v0.41.8.0 awaitPendingLastRetrievedWrites pattern. - test/facts-queue-drain-pending.test.ts: 6 hermetic cases — empty drain returns immediately, single in-flight settles, timeout bounds wait, shutdown-after-drain is idempotent, post-drain enqueues are dropped, signal-aborted skips waiting. Refs #1570 * docs: update project documentation for v0.41.27.0 README.md: added troubleshooting entry for the v0.41.27.0 retry-reconnect + facts:absorb drain fix (closes #1570), pointing operators at `gbrain doctor --json` to find the offending disconnect caller. CLAUDE.md: extended `src/core/retry.ts` entry with the new optional `reconnect` callback (v0.41.27.0); added two new Key Files entries for `src/core/audit/db-disconnect-audit.ts` (the diagnostic half of the "instrument first, fix later" pivot) and `FactsQueue.drainPending`; extended `doctor.ts:checkBatchRetryHealth` entry with the in-place extension that surfaces 24h disconnect-call count. llms-full.txt: regenerated to absorb CLAUDE.md edits. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * chore: rebump v0.41.27.0 → v0.41.28.0 (queue collision with #1573) Master shipped v0.41.27.0 (#1573 git-aware sync_freshness) claiming the same slot. Rebump to the next available version. - VERSION + package.json → 0.41.28.0 - CHANGELOG.md: my entry header + in-entry refs 0.41.27.0 → 0.41.28.0 - TODOS.md: my #1570 follow-up section header + body refs bumped * test: pin gateway in put-page-provenance + embedding-dim-check (CI shard fix) Both files failed on CI shards 1 and 8 under the cross-file gateway-state leak class (CLAUDE.md "Test-isolation lint and helpers"). The v0.41.28.0 merge reshuffled the weight-based shard bin-packing, landing a gateway-mutating sibling ahead of these two victims in the same `bun test` process. Mechanism: - put-page-provenance: put_page embeds via the gateway. A sibling left the gateway configured with OpenAI + the CI placeholder `sk-test` (captured at configureGateway time, survives the withEnv restore as cached gateway state). put_page's embed then fired against live OpenAI and 401'd. The bunfig legacy-embedding preload's beforeEach only re-applies legacy when the gateway was RESET — it does NOT correct a sibling that configured a different LIVE config. - embedding-dim-check: initSchema builds the content_chunks vector column at the gateway's configured dim. A sibling leaking ZE/1280 made the column 1280-d, so `expect(dims).toBe(1536)` failed. Fix (victim-side pinning, the escape hatch the preload documents): - Both: configure the gateway explicitly in beforeAll BEFORE initSchema (OpenAI/1536), resetGateway() in afterAll so neither leaks onward. - put-page-provenance also stubs the embed transport via __setEmbedTransportForTests so embed is deterministic and offline; a dummy OPENAI_API_KEY is supplied in the gateway env because instantiateEmbedding builds the OpenAI client (key check) BEFORE the stubbed transport is reached — the stub then intercepts the actual call so the key never leaves the process. Verified: CI shards 1 (1337 pass) + 8 (905 pass) green with OPENAI_API_KEY unset, plus adversarial sibling orderings (gateway.test / doctor-ze-checks preceding). Typecheck + check-test-isolation clean. --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
mgunnin
added a commit
to mgunnin/gbrain
that referenced
this pull request
Jun 3, 2026
* upstream/master: v0.41.29.0 feat(conversation-parser): bold-name-no-time builtin + fix(orphans): source-scoped orphan_ratio (supersedes garrytan#1613) (garrytan#1620) v0.41.27.0 fix: withRetry self-heals on null singleton + facts:absorb drain + disconnect audit (closes garrytan#1570) (garrytan#1608) v0.41.27.0 fix(doctor): git-aware sync_freshness (supersedes garrytan#1564) (garrytan#1573)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
gbrain doctorstops crying wolf about sources that have no new commits. Production-quality rebuild of community PR #1564 with the following structural additions Codex outside-voice caught:Foundation primitive (
src/core/git-head.ts) — single sharedisSourceUnchangedSinceSync(localPath, lastCommit, opts?)module with two probe seams. Shell-injection safe viaexecFileSyncwith array args (the superseded PR usedexecSyncthrough/bin/sh -cwithJSON.stringifyfor shell-escape — unsafe). Fail-open on every error. Reusable by autopilot's per-source dispatch (filed as v0.41.27.1+ TODO).Doctor wire (
src/commands/doctor.ts:checkSyncFreshness) —opts.localOnly: booleanthread;runDoctoropts in,doctorReportRemote(HTTP MCP) keeps defaultfalse(preserves v0.32.4 trust boundary). Narrowed predicate mirrorssync.ts:1057+1075: HEAD == last_commit AND working-tree clean AND chunker version matches CURRENT — so doctor and sync agree on "is there work to do?". Three-bucket count math populatesCheck.detailsfor dashboards. OK message reshape: all-unchanged → "All N up to date (no new commits since last sync)"; mixed → "N source(s): X synced recently, Y unchanged since last sync".checkCycleFreshnessintentionally NOT touched (Codex P0-2): HEAD-match cannot answer "did the full cycle complete?" — silencing cycle warns on git-clean sources would mask the case where sync ran but extract/embed/consolidate/synthesize failed.Test Coverage
test/core/git-head.test.ts(NEW)test/doctor.test.ts(sync_freshness git short-circuitdescribe)test/doctor-cycle-freshness.test.ts87/87 pass on changed surface.
bun run verify28/28 green.Pre-Landing Review
Eng review CLEAR (PLAN); Codex review CLEAR (after 5 findings absorbed). Plan + decision trail at
~/.claude/plans/system-instruction-you-are-working-eager-bird.md. 7 architectural decisions (D1-D7) locked via per-finding AskUserQuestion gates:src/core/git-head.ts) for reuseCheck.detailslocalOnlyoptPlan Completion
All 6 plan tasks (T1-T6) complete: foundation primitive, unit tests, doctor wire, integration tests, version bump + lockfile + docs regen, CLAUDE.md key-files annotation.
TODOS
Three deferred-by-design follow-ups documented in plan (none filed as TODOS.md entries; conditional on real demand):
isSourceUnchangedSinceSync(waits for measurement showing daemon CPU matters on unchanged sources)gbrain status/gbrain sources statussurfaces the per-bucket countsGBRAIN_DOCTOR_IGNORE_DIRTY_TREE=1opt-outSupersedes
Closes #1564 (
@garrytan-agents). Co-Authored-By preserved on commit 2. The surrogate-pair fix from commit78b93f3fin that PR is NOT brought over — already on master viasafeSplitIndexatsynthesize.ts:192(v0.42.0.0 wave).Test plan
bun test test/core/git-head.test.ts— 14/14 passbun test test/doctor.test.ts— 64/64 pass (12 existing sync_freshness + 9 new + 43 unrelated, no regressions)bun test test/doctor-cycle-freshness.test.ts— back-compat preservedbun run verify— 28/28 pre-test gate checks greenbun run typecheck— clean🤖 Generated with Claude Code