fix: skip sync/cycle freshness warnings when upstream repo is unchanged#1564
Closed
garrytan-agents wants to merge 3 commits into
Closed
fix: skip sync/cycle freshness warnings when upstream repo is unchanged#1564garrytan-agents wants to merge 3 commits into
garrytan-agents wants to merge 3 commits into
Conversation
The judgeSignificance trimming (slice at 4000 chars) could split a UTF-16 surrogate pair when an emoji sits exactly at the boundary, producing a lone high surrogate that Anthropic's JSON parser rejects with 'no low surrogate in string'. Add safeSliceEnd() helper that backs up by one char when the cut lands between a high and low surrogate. Apply to: - judgeSignificance transcript trimming (the direct cause) - findBoundary hard-split fallback (defense-in-depth) Fixes: dream cycle SYNTH_PHASE_FAIL on 2026-05-24 caused by 🤖 emoji at pos 3999 in telegram/2026-05-20-topic-1-topic-1.md
When a git-backed source's HEAD matches the stored last_commit in the DB, the source has no new content to sync regardless of wall-clock age. Before: media-corpus flagged as stale after 40h even though zero commits changed. This produces false-positive doctor warnings and unnecessary sync/cycle runs. After: checkSyncFreshness and checkCycleFreshness both call isSourceUnchangedSinceSync() which does a fast `git rev-parse HEAD` comparison. If HEAD === last_commit, the source is silently skipped. Fail-open: any error in the git check falls through to the normal time-based freshness logic. Changes: - doctor.ts: add isSourceUnchangedSinceSync() helper, wire into both checks - engine.ts: add last_commit to SourceRow interface - postgres-engine.ts: include last_commit in listAllSources query - pglite-engine.ts: same for PGLite implementation
Owner
|
Superseded by production-quality rebuild in workspace
Surrogate-pair fix from commit 78b93f3 is NOT brought over — already on master via Thank you for the contribution @garrytan-agents; Co-Authored-By preserved in the rebuild commits. |
5 tasks
garrytan
added a commit
that referenced
this pull request
May 29, 2026
…1573) * feat(doctor): add isSourceUnchangedSinceSync git-head primitive src/core/git-head.ts is a single-responsibility module: probe git HEAD + working-tree clean state for a local_path, compare against the last_commit SHA the DB stored at last sync completion. Designed for reuse (autopilot's per-source dispatch will want the same gate, filed as v0.41.27.1+ TODO). Shell-injection safe: uses execFileSync with array args. The superseded community PR #1564 used execSync through /bin/sh -c with JSON.stringify for shell-escape, which is unsafe — JSON.stringify escapes for JSON, not shell. Fail-open contract: every error path returns false, preserving the caller's prior time-based behavior. Two test seams (_setGitHeadProbeForTests, _setGitCleanProbeForTests) match the last-retrieved.ts precedent so unit tests stay parallel-eligible (no mock.module per CLAUDE.md R2). Companion test suite: 14 cases including a load-bearing shell-injection regression guard that runs real execFileSync against '/nonexistent/$(touch <sentinel>)/repo' and asserts the sentinel file is never created. * feat(doctor): git-aware sync_freshness check + 9 new test cases checkSyncFreshness gains opts.localOnly: boolean. When local CLI caller passes true, doctor short-circuits the staleness warning iff HEAD == sources.last_commit AND working tree is clean AND sources.chunker_version matches CURRENT (mirrors sync.ts:1057+1075's own "do work?" predicate, so doctor and sync agree). Inline SELECT widens to carry last_commit + chunker_version (columns already exist; no schema migration). Three-bucket count math (unchanged_count + synced_recently_count + stale_count === sources.length) populates Check.details for dashboards / JSON consumers. OK-message reshape: - all-unchanged → "All N up to date (no new commits since last sync)" - mixed → "N source(s): X synced recently, Y unchanged since last sync" - all-recent → "All N synced recently" (back-compat). Trust boundary preserved (Codex P0-1): runDoctor (local CLI, trusted) passes localOnly: true; doctorReportRemote (HTTP MCP, untrusted) keeps the default false. Default-false is fail-closed — a future caller that forgets the opt gets the safe no-probe behavior. checkCycleFreshness is INTENTIONALLY NOT touched (Codex P0-2): last_commit == HEAD answers "new commits to sync?" but cannot answer "did the full cycle complete?" — silencing cycle warns on git-clean sources would mask the case where sync ran but extract/embed/ consolidate/synthesize failed. Test coverage: 9 new cases in test/doctor.test.ts including - HEAD-match short-circuit + cold-path message - HEAD-mismatch warn - NULL last_commit (legacy data) → warn - non-git local_path (probe returns null) → warn (fail-open) - 3-source mixed bucket invariant (sum === length) - chunker-version mismatch warn (dirty-tree-clean still gates) - dirty-tree warn (HEAD-match still gates) - D4 regression guard: localOnly=false (default) NEVER calls probes All 87 tests across git-head + doctor + cycle-freshness suites pass. bun run verify clean (28/28 checks). Co-Authored-By: garrytan-agents <me@garrytan.com> * chore: bump version and changelog (v0.41.27.0) Supersedes community PR #1564. Co-Authored-By preserved. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(test): pin gateway to 1536d for query-cache-knobs-hash suite CI failed on the v0.41.27.0 ship after merging origin/master because test/query-cache-knobs-hash.test.ts predates the v0.36.2.0 flip of DEFAULT_EMBEDDING_DIMENSIONS from 1536 → 1280 (ZE Matryoshka). Without an explicit gateway pin, initSchema() sizes query_cache.embedding at halfvec(1280), but the test fixture (makeEmbedding at line 44) emits 1536-dim unit vectors. Result: 5 cases in the "SemanticQueryCache cross-mode isolation (CDX-4 hotfix)" describe crashed with "expected 1280 dimensions, not 1536". Fix mirrors test/consolidate-valid-until.test.ts (the canonical gateway-pin pattern that landed when the default flipped). resetGateway() + configureGateway({embedding_dimensions: 1536}) in beforeAll forces the schema to size at halfvec(1536) regardless of cross-file gateway state. resetGateway() in afterAll restores defaults so the next file in the shard isn't poisoned. Verified: 9/9 cases pass; bun run verify 29/29 clean. --------- Co-authored-by: garrytan-agents <me@garrytan.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
mgunnin
added a commit
to mgunnin/gbrain
that referenced
this pull request
Jun 3, 2026
* upstream/master: v0.41.29.0 feat(conversation-parser): bold-name-no-time builtin + fix(orphans): source-scoped orphan_ratio (supersedes garrytan#1613) (garrytan#1620) v0.41.27.0 fix: withRetry self-heals on null singleton + facts:absorb drain + disconnect audit (closes garrytan#1570) (garrytan#1608) v0.41.27.0 fix(doctor): git-aware sync_freshness (supersedes garrytan#1564) (garrytan#1573)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Doctor flags sources like
media-corpusas stale after 40h even when zero commits have changed. This creates false-positive warnings and unnecessary sync/cycle runs.Garry: "if there have been no commits to that particular corpus, then it's not stale, actually."
Fix
Add
isSourceUnchangedSinceSync()— a fastgit rev-parse HEADcheck that compares the repo's current HEAD against thelast_commitstored in the DB. If they match, the source is up-to-date regardless of wall-clock age.Wired into both
checkSyncFreshnessandcheckCycleFreshness.Changes
doctor.ts: addisSourceUnchangedSinceSync()helper, wire into both freshness checksengine.ts: addlast_committoSourceRowinterfacepostgres-engine.ts: includelast_commitinlistAllSourcesquerypglite-engine.ts: same for PGLite implementationSafety
Fail-open: any error in the git check (missing repo, corrupt state) falls through to the normal time-based freshness logic. No behavior change for non-git sources.