Skip to content

fix: skip sync/cycle freshness warnings when upstream repo is unchanged#1564

Closed
garrytan-agents wants to merge 3 commits into
garrytan:masterfrom
garrytan-agents:fix/sync-freshness-skip-unchanged
Closed

fix: skip sync/cycle freshness warnings when upstream repo is unchanged#1564
garrytan-agents wants to merge 3 commits into
garrytan:masterfrom
garrytan-agents:fix/sync-freshness-skip-unchanged

Conversation

@garrytan-agents

Copy link
Copy Markdown
Contributor

Problem

Doctor flags sources like media-corpus as stale after 40h even when zero commits have changed. This creates false-positive warnings and unnecessary sync/cycle runs.

Garry: "if there have been no commits to that particular corpus, then it's not stale, actually."

Fix

Add isSourceUnchangedSinceSync() — a fast git rev-parse HEAD check that compares the repo's current HEAD against the last_commit stored in the DB. If they match, the source is up-to-date regardless of wall-clock age.

Wired into both checkSyncFreshness and checkCycleFreshness.

Changes

  • doctor.ts: add isSourceUnchangedSinceSync() helper, wire into both freshness checks
  • engine.ts: add last_commit to SourceRow interface
  • postgres-engine.ts: include last_commit in listAllSources query
  • pglite-engine.ts: same for PGLite implementation

Safety

Fail-open: any error in the git check (missing repo, corrupt state) falls through to the normal time-based freshness logic. No behavior change for non-git sources.

root and others added 3 commits May 24, 2026 09:16
The judgeSignificance trimming (slice at 4000 chars) could split a
UTF-16 surrogate pair when an emoji sits exactly at the boundary,
producing a lone high surrogate that Anthropic's JSON parser rejects
with 'no low surrogate in string'.

Add safeSliceEnd() helper that backs up by one char when the cut lands
between a high and low surrogate. Apply to:
- judgeSignificance transcript trimming (the direct cause)
- findBoundary hard-split fallback (defense-in-depth)

Fixes: dream cycle SYNTH_PHASE_FAIL on 2026-05-24 caused by
🤖 emoji at pos 3999 in telegram/2026-05-20-topic-1-topic-1.md
When a git-backed source's HEAD matches the stored last_commit in the DB,
the source has no new content to sync regardless of wall-clock age.

Before: media-corpus flagged as stale after 40h even though zero commits
changed. This produces false-positive doctor warnings and unnecessary
sync/cycle runs.

After: checkSyncFreshness and checkCycleFreshness both call
isSourceUnchangedSinceSync() which does a fast `git rev-parse HEAD`
comparison. If HEAD === last_commit, the source is silently skipped.

Fail-open: any error in the git check falls through to the normal
time-based freshness logic.

Changes:
- doctor.ts: add isSourceUnchangedSinceSync() helper, wire into both checks
- engine.ts: add last_commit to SourceRow interface
- postgres-engine.ts: include last_commit in listAllSources query
- pglite-engine.ts: same for PGLite implementation
@garrytan

Copy link
Copy Markdown
Owner

Superseded by production-quality rebuild in workspace irvine-v1 (branch garrytan/sync-freshness-git-check). The fix is correct in shape but the rebuild adds:

  1. Shell-injection safetyexecSync('git -C ${JSON.stringify(path)} ...') runs through /bin/sh -c; JSON.stringify escapes for JSON, not shell. Rebuild uses execFileSync('git', ['-C', path, 'rev-parse', 'HEAD']) — no shell, no injection surface.
  2. Unit + integration test coverage — helper has its own test/core/git-head.test.ts; checkSyncFreshness + checkCycleFreshness tests in test/doctor.test.ts extended with unchanged-source / mismatched-HEAD / NULL-last_commit / non-git-path / mixed-source cases.
  3. Shared primitiveisSourceUnchangedSinceSync lives in src/core/git-head.ts (reusable by autopilot decision paths) instead of buried in doctor.ts.
  4. More honest ok message — when sources are skipped via git-check vs time-check, the ok line breaks down both counts so the operator sees what's happening.

Surrogate-pair fix from commit 78b93f3 is NOT brought over — already on master via safeSplitIndex in v0.42.0.0.

Thank you for the contribution @garrytan-agents; Co-Authored-By preserved in the rebuild commits.

@garrytan garrytan closed this May 27, 2026
garrytan added a commit that referenced this pull request May 29, 2026
…1573)

* feat(doctor): add isSourceUnchangedSinceSync git-head primitive

src/core/git-head.ts is a single-responsibility module: probe git HEAD +
working-tree clean state for a local_path, compare against the
last_commit SHA the DB stored at last sync completion. Designed for
reuse (autopilot's per-source dispatch will want the same gate, filed
as v0.41.27.1+ TODO).

Shell-injection safe: uses execFileSync with array args. The
superseded community PR #1564 used execSync through /bin/sh -c with
JSON.stringify for shell-escape, which is unsafe — JSON.stringify
escapes for JSON, not shell.

Fail-open contract: every error path returns false, preserving the
caller's prior time-based behavior. Two test seams
(_setGitHeadProbeForTests, _setGitCleanProbeForTests) match the
last-retrieved.ts precedent so unit tests stay parallel-eligible (no
mock.module per CLAUDE.md R2).

Companion test suite: 14 cases including a load-bearing
shell-injection regression guard that runs real execFileSync against
'/nonexistent/$(touch <sentinel>)/repo' and asserts the sentinel
file is never created.

* feat(doctor): git-aware sync_freshness check + 9 new test cases

checkSyncFreshness gains opts.localOnly: boolean. When local CLI
caller passes true, doctor short-circuits the staleness warning iff
HEAD == sources.last_commit AND working tree is clean AND
sources.chunker_version matches CURRENT (mirrors sync.ts:1057+1075's
own "do work?" predicate, so doctor and sync agree).

Inline SELECT widens to carry last_commit + chunker_version (columns
already exist; no schema migration). Three-bucket count math
(unchanged_count + synced_recently_count + stale_count ===
sources.length) populates Check.details for dashboards / JSON
consumers. OK-message reshape:
 - all-unchanged → "All N up to date (no new commits since last sync)"
 - mixed        → "N source(s): X synced recently, Y unchanged since last sync"
 - all-recent   → "All N synced recently" (back-compat).

Trust boundary preserved (Codex P0-1): runDoctor (local CLI, trusted)
passes localOnly: true; doctorReportRemote (HTTP MCP, untrusted)
keeps the default false. Default-false is fail-closed — a future
caller that forgets the opt gets the safe no-probe behavior.

checkCycleFreshness is INTENTIONALLY NOT touched (Codex P0-2):
last_commit == HEAD answers "new commits to sync?" but cannot answer
"did the full cycle complete?" — silencing cycle warns on git-clean
sources would mask the case where sync ran but extract/embed/
consolidate/synthesize failed.

Test coverage: 9 new cases in test/doctor.test.ts including
 - HEAD-match short-circuit + cold-path message
 - HEAD-mismatch warn
 - NULL last_commit (legacy data) → warn
 - non-git local_path (probe returns null) → warn (fail-open)
 - 3-source mixed bucket invariant (sum === length)
 - chunker-version mismatch warn (dirty-tree-clean still gates)
 - dirty-tree warn (HEAD-match still gates)
 - D4 regression guard: localOnly=false (default) NEVER calls probes

All 87 tests across git-head + doctor + cycle-freshness suites pass.
bun run verify clean (28/28 checks).

Co-Authored-By: garrytan-agents <me@garrytan.com>

* chore: bump version and changelog (v0.41.27.0)

Supersedes community PR #1564. Co-Authored-By preserved.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(test): pin gateway to 1536d for query-cache-knobs-hash suite

CI failed on the v0.41.27.0 ship after merging origin/master because
test/query-cache-knobs-hash.test.ts predates the v0.36.2.0 flip of
DEFAULT_EMBEDDING_DIMENSIONS from 1536 → 1280 (ZE Matryoshka).

Without an explicit gateway pin, initSchema() sizes query_cache.embedding
at halfvec(1280), but the test fixture (makeEmbedding at line 44) emits
1536-dim unit vectors. Result: 5 cases in the
"SemanticQueryCache cross-mode isolation (CDX-4 hotfix)" describe
crashed with "expected 1280 dimensions, not 1536".

Fix mirrors test/consolidate-valid-until.test.ts (the canonical
gateway-pin pattern that landed when the default flipped). resetGateway()
+ configureGateway({embedding_dimensions: 1536}) in beforeAll forces the
schema to size at halfvec(1536) regardless of cross-file gateway state.
resetGateway() in afterAll restores defaults so the next file in the
shard isn't poisoned.

Verified: 9/9 cases pass; bun run verify 29/29 clean.

---------

Co-authored-by: garrytan-agents <me@garrytan.com>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
mgunnin added a commit to mgunnin/gbrain that referenced this pull request Jun 3, 2026
* upstream/master:
  v0.41.29.0 feat(conversation-parser): bold-name-no-time builtin + fix(orphans): source-scoped orphan_ratio (supersedes garrytan#1613) (garrytan#1620)
  v0.41.27.0 fix: withRetry self-heals on null singleton + facts:absorb drain + disconnect audit (closes garrytan#1570) (garrytan#1608)
  v0.41.27.0 fix(doctor): git-aware sync_freshness (supersedes garrytan#1564) (garrytan#1573)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants