Skip to content

v0.41.27.0 fix(doctor): git-aware sync_freshness (supersedes #1564)#1573

Merged
garrytan merged 5 commits into
masterfrom
garrytan/sync-freshness-git-check
May 29, 2026
Merged

v0.41.27.0 fix(doctor): git-aware sync_freshness (supersedes #1564)#1573
garrytan merged 5 commits into
masterfrom
garrytan/sync-freshness-git-check

Conversation

@garrytan

Copy link
Copy Markdown
Owner

Summary

gbrain doctor stops crying wolf about sources that have no new commits. Production-quality rebuild of community PR #1564 with the following structural additions Codex outside-voice caught:

Foundation primitive (src/core/git-head.ts) — single shared isSourceUnchangedSinceSync(localPath, lastCommit, opts?) module with two probe seams. Shell-injection safe via execFileSync with array args (the superseded PR used execSync through /bin/sh -c with JSON.stringify for shell-escape — unsafe). Fail-open on every error. Reusable by autopilot's per-source dispatch (filed as v0.41.27.1+ TODO).

Doctor wire (src/commands/doctor.ts:checkSyncFreshness)opts.localOnly: boolean thread; runDoctor opts in, doctorReportRemote (HTTP MCP) keeps default false (preserves v0.32.4 trust boundary). Narrowed predicate mirrors sync.ts:1057+1075: HEAD == last_commit AND working-tree clean AND chunker version matches CURRENT — so doctor and sync agree on "is there work to do?". Three-bucket count math populates Check.details for dashboards. OK message reshape: all-unchanged → "All N up to date (no new commits since last sync)"; mixed → "N source(s): X synced recently, Y unchanged since last sync".

checkCycleFreshness intentionally NOT touched (Codex P0-2): HEAD-match cannot answer "did the full cycle complete?" — silencing cycle warns on git-clean sources would mask the case where sync ran but extract/embed/consolidate/synthesize failed.

Test Coverage

Surface Cases Status
test/core/git-head.test.ts (NEW) 14 incl. real-execFileSync shell-injection regression guard
test/doctor.test.ts (sync_freshness git short-circuit describe) 9 incl. D4 fail-closed-default regression + 3-source bucket invariant
test/doctor-cycle-freshness.test.ts unchanged (D5: cycle untouched) ✓ back-compat

87/87 pass on changed surface. bun run verify 28/28 green.

Pre-Landing Review

Eng review CLEAR (PLAN); Codex review CLEAR (after 5 findings absorbed). Plan + decision trail at ~/.claude/plans/system-instruction-you-are-working-eager-bird.md. 7 architectural decisions (D1-D7) locked via per-finding AskUserQuestion gates:

  • D1 module location: own file (src/core/git-head.ts) for reuse
  • D2 ok-message reshape + Check.details
  • D3 dissolved by D5 (no longer needed)
  • D4 — Codex P0-1: trust-boundary preservation via localOnly opt
  • D5 — Codex P0-2: git short-circuit NEVER applies to cycle freshness
  • D6 — Codex P1-5: three-bucket honest count math, invariant pinned
  • D7 — Codex P1-4: narrowed predicate (chunker + working-tree-clean)

Plan Completion

All 6 plan tasks (T1-T6) complete: foundation primitive, unit tests, doctor wire, integration tests, version bump + lockfile + docs regen, CLAUDE.md key-files annotation.

TODOS

Three deferred-by-design follow-ups documented in plan (none filed as TODOS.md entries; conditional on real demand):

  • Autopilot per-source dispatch consumes isSourceUnchangedSinceSync (waits for measurement showing daemon CPU matters on unchanged sources)
  • gbrain status / gbrain sources status surfaces the per-bucket counts
  • GBRAIN_DOCTOR_IGNORE_DIRTY_TREE=1 opt-out

Supersedes

Closes #1564 (@garrytan-agents). Co-Authored-By preserved on commit 2. The surrogate-pair fix from commit 78b93f3f in that PR is NOT brought over — already on master via safeSplitIndex at synthesize.ts:192 (v0.42.0.0 wave).

Test plan

  • bun test test/core/git-head.test.ts — 14/14 pass
  • bun test test/doctor.test.ts — 64/64 pass (12 existing sync_freshness + 9 new + 43 unrelated, no regressions)
  • bun test test/doctor-cycle-freshness.test.ts — back-compat preserved
  • bun run verify — 28/28 pre-test gate checks green
  • bun run typecheck — clean

🤖 Generated with Claude Code

garrytan and others added 5 commits May 27, 2026 10:16
src/core/git-head.ts is a single-responsibility module: probe git HEAD +
working-tree clean state for a local_path, compare against the
last_commit SHA the DB stored at last sync completion. Designed for
reuse (autopilot's per-source dispatch will want the same gate, filed
as v0.41.27.1+ TODO).

Shell-injection safe: uses execFileSync with array args. The
superseded community PR #1564 used execSync through /bin/sh -c with
JSON.stringify for shell-escape, which is unsafe — JSON.stringify
escapes for JSON, not shell.

Fail-open contract: every error path returns false, preserving the
caller's prior time-based behavior. Two test seams
(_setGitHeadProbeForTests, _setGitCleanProbeForTests) match the
last-retrieved.ts precedent so unit tests stay parallel-eligible (no
mock.module per CLAUDE.md R2).

Companion test suite: 14 cases including a load-bearing
shell-injection regression guard that runs real execFileSync against
'/nonexistent/$(touch <sentinel>)/repo' and asserts the sentinel
file is never created.
checkSyncFreshness gains opts.localOnly: boolean. When local CLI
caller passes true, doctor short-circuits the staleness warning iff
HEAD == sources.last_commit AND working tree is clean AND
sources.chunker_version matches CURRENT (mirrors sync.ts:1057+1075's
own "do work?" predicate, so doctor and sync agree).

Inline SELECT widens to carry last_commit + chunker_version (columns
already exist; no schema migration). Three-bucket count math
(unchanged_count + synced_recently_count + stale_count ===
sources.length) populates Check.details for dashboards / JSON
consumers. OK-message reshape:
 - all-unchanged → "All N up to date (no new commits since last sync)"
 - mixed        → "N source(s): X synced recently, Y unchanged since last sync"
 - all-recent   → "All N synced recently" (back-compat).

Trust boundary preserved (Codex P0-1): runDoctor (local CLI, trusted)
passes localOnly: true; doctorReportRemote (HTTP MCP, untrusted)
keeps the default false. Default-false is fail-closed — a future
caller that forgets the opt gets the safe no-probe behavior.

checkCycleFreshness is INTENTIONALLY NOT touched (Codex P0-2):
last_commit == HEAD answers "new commits to sync?" but cannot answer
"did the full cycle complete?" — silencing cycle warns on git-clean
sources would mask the case where sync ran but extract/embed/
consolidate/synthesize failed.

Test coverage: 9 new cases in test/doctor.test.ts including
 - HEAD-match short-circuit + cold-path message
 - HEAD-mismatch warn
 - NULL last_commit (legacy data) → warn
 - non-git local_path (probe returns null) → warn (fail-open)
 - 3-source mixed bucket invariant (sum === length)
 - chunker-version mismatch warn (dirty-tree-clean still gates)
 - dirty-tree warn (HEAD-match still gates)
 - D4 regression guard: localOnly=false (default) NEVER calls probes

All 87 tests across git-head + doctor + cycle-freshness suites pass.
bun run verify clean (28/28 checks).

Co-Authored-By: garrytan-agents <me@garrytan.com>
Supersedes community PR #1564. Co-Authored-By preserved.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ness-git-check

# Conflicts:
#	CHANGELOG.md
#	VERSION
#	package.json
CI failed on the v0.41.27.0 ship after merging origin/master because
test/query-cache-knobs-hash.test.ts predates the v0.36.2.0 flip of
DEFAULT_EMBEDDING_DIMENSIONS from 1536 → 1280 (ZE Matryoshka).

Without an explicit gateway pin, initSchema() sizes query_cache.embedding
at halfvec(1280), but the test fixture (makeEmbedding at line 44) emits
1536-dim unit vectors. Result: 5 cases in the
"SemanticQueryCache cross-mode isolation (CDX-4 hotfix)" describe
crashed with "expected 1280 dimensions, not 1536".

Fix mirrors test/consolidate-valid-until.test.ts (the canonical
gateway-pin pattern that landed when the default flipped). resetGateway()
+ configureGateway({embedding_dimensions: 1536}) in beforeAll forces the
schema to size at halfvec(1536) regardless of cross-file gateway state.
resetGateway() in afterAll restores defaults so the next file in the
shard isn't poisoned.

Verified: 9/9 cases pass; bun run verify 29/29 clean.
@garrytan garrytan merged commit cb1b5f9 into master May 29, 2026
21 checks passed
garrytan added a commit that referenced this pull request May 29, 2026
Master shipped v0.41.27.0 (#1573 git-aware sync_freshness) claiming the
same slot. Rebump to the next available version.

- VERSION + package.json → 0.41.28.0
- CHANGELOG.md: my entry header + in-entry refs 0.41.27.0 → 0.41.28.0
- TODOS.md: my #1570 follow-up section header + body refs bumped
garrytan added a commit that referenced this pull request May 29, 2026
Master shipped v0.41.27.0 (#1573 git-aware sync_freshness check) which
claimed the version slot this branch originally targeted. This branch
already rebumped to v0.41.28.0 in the prior commit; merging master in:

- VERSION stays 0.41.28.0 (next slot above master's 0.41.27.0)
- CHANGELOG.md: my v0.41.28.0 entry on top; master's v0.41.27.0 entry
  preserved verbatim below
- package.json: synced to 0.41.28.0
- src/commands/doctor.ts, CLAUDE.md, llms-full.txt: auto-merged
  (master's sync_freshness check + my batch_retry_health disconnect
  audit extension both land; no overlap)
- New from master: src/core/git-head.ts + test/core/git-head.test.ts

No #1570-surface conflicts — master's sync_freshness work is orthogonal
to the retry/facts/disconnect-audit changes.
garrytan added a commit that referenced this pull request May 29, 2026
… drain + disconnect audit (closes #1570) (#1608)

* merge master: rebump v0.41.25.0 → v0.41.27.0 (queue collision)

Master shipped v0.41.25.0 (#1538 batched sync deletes) and v0.41.26.0
(#1571 dream --source fix) while this branch was in flight. Conflict
resolution rebumps to the next available slot.

- VERSION: 0.41.25.0 → 0.41.27.0
- package.json: synced
- CHANGELOG.md: my v0.41.27.0 entry placed above master's v0.41.26.0
  and v0.41.25.0; in-entry version references updated 0.41.25.0 →
  0.41.27.0 and forward-references bumped to v0.41.28+.
- TODOS.md: kept master's v0.41.20.x section + my v0.41.27.0+ follow-ups

No source-file conflicts during the merge.

* feat(diagnostics): db-disconnect audit + doctor surface (v0.41.27.0)

Instruments every db.disconnect() and PostgresEngine.disconnect() call
with a JSONL audit record so the next user-reported #1570 cycle gives
us the offender's caller stack instead of the symptomatic
"No database connection" error.

Audit shape (~/.gbrain/audit/db-disconnect-YYYY-Www.jsonl):
  {ts, engine_kind, connection_style, caller_stack[], command, pid}

- src/core/audit/db-disconnect-audit.ts (NEW): the audit writer,
  built on the v0.40.4.0 createAuditWriter cathedral. Captures a
  6-frame stack via new Error().stack so the offender is readable
  without spending stderr noise.
- src/core/db.ts: logDbDisconnect call at the top of disconnect()
  (best-effort; never blocks the real teardown).
- src/core/postgres-engine.ts: same instrumentation in
  PostgresEngine.disconnect() — distinguishes 'module' vs 'instance'
  connection_style so we can tell legitimate worker-pool teardowns
  apart from the load-bearing module-singleton class.
- src/commands/doctor.ts: extends batch_retry_health to surface
  24h disconnect count + most-recent caller stack. Warns when the
  caller frame isn't a known CLI-exit frame (e.g. cli.ts's finally
  block at the end of an op-dispatch). This is the diagnostic that
  tells v0.41.28+ where to apply the real ownership fix.
- test/db-disconnect-audit.test.ts: unit coverage for the audit
  writer + caller-stack capture + JSONL shape.
- test/e2e/db-singleton-shared-recovery.test.ts: real-Postgres
  regression that exercises the singleton-null path end-to-end.

Refs #1570

* feat(retry): self-heal on null singleton — closes #1570 symptom (v0.41.27.0)

withRetry gains an opt-in reconnect callback that fires between the
isRetryableConnError classification and the inter-attempt sleep.
PostgresEngine.batchRetry injects this.reconnect() — race-safe via
the existing _reconnecting guard, handles module and instance pools.

Closes the production loss reported in #1570: dream cycles on Supabase
no longer drop ~150 link rows per cycle when the singleton goes null
mid-batch. The retry now rebuilds the connection between attempts so
the second try has somewhere to write to.

- src/core/retry.ts: WithRetryOpts gains `reconnect?: () => Promise<void>`.
  Awaited in the catch branch. onRetry is also now awaited (back-compat-
  safe: every existing in-tree caller is a sync arrow). Reconnect
  failures propagate as the real cause — replaces the symptomatic
  "No database connection" error with whatever the connect() throw
  was, so operators see the truth.
- src/core/postgres-engine.ts:batchRetry — injects
  `reconnect: () => this.reconnect()`. Covers all 9 batch-retry call
  sites (addLinksBatch, addTimelineEntriesBatch, upsertChunks, plus
  the 6 caller-supplied auditSite labels in extract / sync / reindex).
- test/core/retry-reconnect.test.ts: 8 hermetic cases pinning the
  contract — reconnect fires before sleep, only on retryable errors,
  back-compat when omitted, signal-aborted bypasses reconnect,
  onRetry is awaited, full success path end-to-end.

The deeper bug (who's calling disconnect mid-cycle) is left
unaddressed in this commit by design — the diagnostic instrumentation
in the prior commit will tell us in the next production run.

Refs #1570

* feat(facts): drainPending() + CLI await before disconnect (v0.41.27.0)

Closes the silent 'No database connection' tail-end errors after
gbrain capture / put_page: the facts:absorb fire-and-forget queue
sometimes outlived the CLI process's connection lifetime, so absorb
attempts after engine.disconnect() landed in stderr as the
GBrainError shape.

- src/core/facts/queue.ts: new drainPending({timeout: 1000}) method
  distinct from shutdown(). Stops accepting new enqueues, awaits
  in-flight settle, bounded by timeout, returns count of unfinished.
  Semantically different from shutdown() (which aborts in-flight)
  so the symptom — drop work that hasn't started yet but let
  in-flight work finish — matches what CLI exit actually needs.
- src/cli.ts: op-dispatch finally block awaits the drain BEFORE
  engine.disconnect(). Bounded 1s. Opt-out env GBRAIN_NO_FACTS_DRAIN
  for callers that don't enqueue (keeps fast-exit paths fast).
  Mirrors the v0.41.8.0 awaitPendingLastRetrievedWrites pattern.
- test/facts-queue-drain-pending.test.ts: 6 hermetic cases — empty
  drain returns immediately, single in-flight settles, timeout
  bounds wait, shutdown-after-drain is idempotent, post-drain
  enqueues are dropped, signal-aborted skips waiting.

Refs #1570

* docs: update project documentation for v0.41.27.0

README.md: added troubleshooting entry for the v0.41.27.0 retry-reconnect
+ facts:absorb drain fix (closes #1570), pointing operators at
`gbrain doctor --json` to find the offending disconnect caller.

CLAUDE.md: extended `src/core/retry.ts` entry with the new optional
`reconnect` callback (v0.41.27.0); added two new Key Files entries for
`src/core/audit/db-disconnect-audit.ts` (the diagnostic half of the
"instrument first, fix later" pivot) and `FactsQueue.drainPending`;
extended `doctor.ts:checkBatchRetryHealth` entry with the in-place
extension that surfaces 24h disconnect-call count.

llms-full.txt: regenerated to absorb CLAUDE.md edits.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore: rebump v0.41.27.0 → v0.41.28.0 (queue collision with #1573)

Master shipped v0.41.27.0 (#1573 git-aware sync_freshness) claiming the
same slot. Rebump to the next available version.

- VERSION + package.json → 0.41.28.0
- CHANGELOG.md: my entry header + in-entry refs 0.41.27.0 → 0.41.28.0
- TODOS.md: my #1570 follow-up section header + body refs bumped

* test: pin gateway in put-page-provenance + embedding-dim-check (CI shard fix)

Both files failed on CI shards 1 and 8 under the cross-file gateway-state
leak class (CLAUDE.md "Test-isolation lint and helpers"). The v0.41.28.0
merge reshuffled the weight-based shard bin-packing, landing a
gateway-mutating sibling ahead of these two victims in the same `bun test`
process.

Mechanism:
- put-page-provenance: put_page embeds via the gateway. A sibling left
  the gateway configured with OpenAI + the CI placeholder `sk-test`
  (captured at configureGateway time, survives the withEnv restore as
  cached gateway state). put_page's embed then fired against live OpenAI
  and 401'd. The bunfig legacy-embedding preload's beforeEach only
  re-applies legacy when the gateway was RESET — it does NOT correct a
  sibling that configured a different LIVE config.
- embedding-dim-check: initSchema builds the content_chunks vector column
  at the gateway's configured dim. A sibling leaking ZE/1280 made the
  column 1280-d, so `expect(dims).toBe(1536)` failed.

Fix (victim-side pinning, the escape hatch the preload documents):
- Both: configure the gateway explicitly in beforeAll BEFORE initSchema
  (OpenAI/1536), resetGateway() in afterAll so neither leaks onward.
- put-page-provenance also stubs the embed transport via
  __setEmbedTransportForTests so embed is deterministic and offline; a
  dummy OPENAI_API_KEY is supplied in the gateway env because
  instantiateEmbedding builds the OpenAI client (key check) BEFORE the
  stubbed transport is reached — the stub then intercepts the actual
  call so the key never leaves the process.

Verified: CI shards 1 (1337 pass) + 8 (905 pass) green with
OPENAI_API_KEY unset, plus adversarial sibling orderings (gateway.test /
doctor-ze-checks preceding). Typecheck + check-test-isolation clean.

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
mgunnin added a commit to mgunnin/gbrain that referenced this pull request Jun 3, 2026
* upstream/master:
  v0.41.29.0 feat(conversation-parser): bold-name-no-time builtin + fix(orphans): source-scoped orphan_ratio (supersedes garrytan#1613) (garrytan#1620)
  v0.41.27.0 fix: withRetry self-heals on null singleton + facts:absorb drain + disconnect audit (closes garrytan#1570) (garrytan#1608)
  v0.41.27.0 fix(doctor): git-aware sync_freshness (supersedes garrytan#1564) (garrytan#1573)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant