fix(staleness): measure sync lag relative to newest committed content (not wall-clock) #1623

Closed
garrytan-agents wants to merge 1 commit into garrytan:master from garrytan-agents:fix/content-relative-staleness

@garrytan-agents

Contributor

Problem

Source staleness in gbrain doctor (and sources status) was measured as raw wall-clock time since the last sync:

lag_seconds = now - last_sync_at

This flags quiet, fully-caught-up repos as severely stale even when nothing new has been committed since the last sync. A federated source that hasn't received a commit in days is not stale for search purposes — the sync has everything the repo contains. But the wall-clock metric kept escalating, producing false SEVERELY STALE alerts.
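
To make the failure mode concrete, here is a minimal sketch of the wall-clock metric. The function name, the constant, and the timestamps are illustrative assumptions, not the project's code; the 72h threshold is taken from the alert text in the error log.

```typescript
// Hypothetical names; a sketch of the old wall-clock metric only.
const SEVERE_THRESHOLD_SECONDS = 72 * 3600; // assumed from the ">72h" alert text

function wallClockLagSeconds(nowMs: number, lastSyncMs: number): number {
  return Math.floor((nowMs - lastSyncMs) / 1000);
}

// Arbitrary epoch values: the repo was synced once, then nothing was committed.
const exampleLastSyncMs = 1_700_000_000_000;
const exampleNowMs = exampleLastSyncMs + 86 * 3600 * 1000; // 86h later

// Lag grows with elapsed time alone, so a quiet repo crosses the threshold.
const exampleLag = wallClockLagSeconds(exampleNowMs, exampleLastSyncMs);
const exampleSeverelyStale = exampleLag > SEVERE_THRESHOLD_SECONDS;
```

Nothing in the inputs describes the repo's contents, which is why the metric cannot tell a quiet repo from a neglected one.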

Error Log (anonymized)

A daily health check repeatedly fired against a low-churn source:

GBrain Doctor — Score: 0/100
Source Health:
  - source-x — SEVERELY STALE (~86h since last sync, last seen <date> 21:14 UTC)
Threshold exceeded: >72h

Yet the underlying repo was caught up:

last_sync_at:  <date+1> 21:14 UTC
HEAD commit:   <date> 17:48 UTC   (older than last_sync_at → nothing new to sync)
working tree:  only untracked dirs (?? companies/, ?? media/) — never committed

The newest committed content predated the last sync by ~28h, so the source was fully synced. The alert was pure wall-clock noise.

Root Cause

Three independent consumers all derived staleness from now - last_sync_at:

  1. buildSyncStatusReport (sync.ts) → staleness_hours / dashboard
  2. computeAllSourceMetrics + isSourceStale (source-health.ts) → lag_seconds, federation health
  3. checkSyncFreshness + checkCycleFreshness (doctor.ts)

None compared against what the repo actually contains.

What We Tried

  • Considered raising the threshold — rejected. A higher threshold just delays the false positive; it doesn't fix the semantics. A genuinely stale active repo would also be masked.
  • Considered counting untracked files as "content" — rejected after observing that untracked dirs (?? companies/, ?? media/) inflated the lag. Untracked files are not part of the repo; "last committed to the repo" must mean committed/tracked state only.
  • Final approach — make lag content-relative: 0 when the newest committed/tracked content is at or before last_sync_at, else the wall-clock delta. Fall back to wall-clock when content can't be probed (non-git / unreadable path) so detection never regresses.
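
The "tracked state only" rule can be sketched as a small parser over `git status --porcelain -z` output. This is an illustration under assumptions, not the PR's implementation: `trackedChangedPaths` is a hypothetical name, and the input is assumed to be porcelain v1 output where each entry is `XY <path>` terminated by NUL and rename/copy entries carry the original path as an extra NUL-terminated field.

```typescript
// Hypothetical helper: list tracked, changed paths from
// `git status --porcelain -z` output, dropping untracked ("??") entries.
function trackedChangedPaths(porcelainZ: string): string[] {
  const fields = porcelainZ.split("\0");
  const paths: string[] = [];
  for (let i = 0; i < fields.length; i++) {
    const entry = fields[i] ?? "";
    if (entry.length < 4) continue; // trailing empty field or malformed entry
    const status = entry.slice(0, 2); // two-character XY status code
    const path = entry.slice(3); // path follows "XY "
    // Rename/copy entries are followed by the original path; skip that field.
    if (status.includes("R") || status.includes("C")) i++;
    if (status === "??" || status === "!!") continue; // untracked or ignored
    paths.push(path);
  }
  return paths;
}
```

With the working tree from the error log (only `?? companies/` and `?? media/`), the function returns an empty list, so untracked directories contribute nothing to the lag.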

Solution

New helpers in src/core/source-health.ts:

  • newestContentMs(localPath) — newest committed/tracked mtime: git log -1 --format=%ct (HEAD commit time) combined with the newest tracked working-tree modification (git status --porcelain -z, excluding untracked ?? entries). Returns null for non-git/unreadable paths.
  • contentRelativeLagSeconds(localPath, lastSyncMs, nowMs):
    • null when last_sync_at is unknown.
    • Negative wall-clock (future last_sync_at) is surfaced as-is so clock-skew detection upstream still fires.
    • When content can be probed: 0 if newest content ≤ last sync (caught up), else wall-clock delta.
    • When content can't be probed: wall-clock fallback.
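
The decision rules listed above can be sketched as a pure function. This is a simplified illustration, not the PR's code: the real helper takes `localPath` and probes git itself, whereas here the probed value is passed in as `newestContentMs` (with `null` standing for "could not probe"), and the function name is hypothetical.

```typescript
// Sketch of the content-relative lag rules; names are hypothetical.
function contentRelativeLagSketch(
  newestContentMs: number | null, // null = non-git or unreadable path
  lastSyncMs: number | null, // null = last_sync_at unknown
  nowMs: number,
): number | null {
  if (lastSyncMs === null) return null;
  const wallClock = Math.floor((nowMs - lastSyncMs) / 1000);
  // Future last_sync_at: surface the negative value so skew checks still fire.
  if (wallClock < 0) return wallClock;
  // Content cannot be probed: fall back to the old wall-clock metric.
  if (newestContentMs === null) return wallClock;
  // Caught up when the newest content is at or before the last sync.
  return newestContentMs <= lastSyncMs ? 0 : wallClock;
}
```

Checking for negative wall-clock before looking at content keeps clock-skew reporting independent of whether the path is a git repo.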

Wired into all three consumers. In checkSyncFreshness, it composes cleanly with the v0.41.27.0 localOnly git short-circuit — the short-circuit is a strict "definitely unchanged" gate (HEAD === last_commit + clean tree + chunker match) for the local CLI; content-relative lag covers the general/remote path where the short-circuit doesn't fire. cycle_freshness is also made content-relative (a source with no new committed content doesn't need re-cycling).

Behavior Matrix

| Scenario | Newest committed content | Old metric | New metric |
| --- | --- | --- | --- |
| Quiet repo, caught up | ≤ last_sync | grows forever → stale | 0 (ok) |
| Active repo, behind | > last_sync | wall-clock | wall-clock (stale) |
| Non-git / unreadable path | n/a | wall-clock | wall-clock (unchanged) |
| Future last_sync (clock skew) | any | negative → skew warn | negative → skew warn (unchanged) |

Results

Re-running sources status --json against a real low-churn source after the fix:

before:  lag ≈ 87.6h  → SEVERELY STALE (false positive)
after:   lag = 0.0h   → ok (newest commit predates last sync = caught up)

Other active sources continue to report correct non-zero lag.

Testing

  • tsc --noEmit: clean.
  • New regression tests in test/source-health.test.ts (quiet-repo → 0, behind-repo → wall-clock, non-git fallback, tracked-edit counts, untracked excluded) and test/sync-all-parallel.test.ts (caught-up → 0 staleness).
  • Full local run of source-health, sync-all-parallel, doctor-cycle-freshness, doctor: 115 pass / 0 fail.

Source staleness previously used wall-clock (now - last_sync_at), which
flagged quiet/caught-up repos as severely stale even when nothing new had
been committed since the last sync.

Lag is now content-relative: 0 when the newest tracked/committed content is
at or before last_sync_at; otherwise the wall-clock delta. Untracked files
(git status '??') are excluded — they are not part of the repo. Falls back
to wall-clock when content can't be probed (non-git/unreadable path) so
detection never regresses. Negative wall-clock (future last_sync_at) is
surfaced for clock-skew detection.

Wired into buildSyncStatusReport, computeAllSourceMetrics/isSourceStale,
and checkSyncFreshness/checkCycleFreshness. Regression tests added.
@garrytan

Owner

Superseded by #1656, re-implemented in the base repo (garrytan-agents PRs run in a fork without secret access, so CI can't gate them — base-repo branch fixes that).

Two substantive changes vs this PR:

  1. Correctness: staleness keys off the commit HASH (HEAD == last_commit, untracked ignored), not a content timestamp. The timestamp comparison here false-reports "caught up" when HEAD moves to an older-dated commit (rebase preserving dates, branch rewind, imported old commit). #1656 also drops the fragile git status --porcelain mtime parse.
  2. Trust boundary: this PR wired a live git subprocess into three remote-reachable consumers (doctorReportRemote's checkSyncFreshness + federation_health, and the get_status_snapshot MCP op), re-opening the v0.41.27.0 boundary (no subprocess on a DB-supplied local_path from remote callers). #1656 keeps local on live git and routes remote through a durable sources.newest_content_at column written at sync time, so the boundary stays intact AND the remote false-positive is still fixed.
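
The correctness point in item 1 can be illustrated with two comparators side by side. The types and function names below are hypothetical and the values are made up for illustration; the sketch only shows why a timestamp comparison can miss a moved HEAD while a hash comparison cannot.

```typescript
// Hypothetical shapes for illustration only.
interface RepoState {
  headHash: string;
  headCommitMs: number; // committer time of HEAD
}
interface SyncAnchor {
  lastCommit: string; // commit hash recorded at the last sync
  lastSyncMs: number;
}

// Timestamp rule: caught up if HEAD's commit time is not after the sync.
function caughtUpByTimestamp(repo: RepoState, anchor: SyncAnchor): boolean {
  return repo.headCommitMs <= anchor.lastSyncMs;
}

// Hash rule: caught up only if HEAD is exactly the commit that was synced.
function caughtUpByHash(repo: RepoState, anchor: SyncAnchor): boolean {
  return repo.headHash === anchor.lastCommit;
}

// HEAD has been rewound to a different, older-dated commit since the sync.
const syncedAnchor: SyncAnchor = { lastCommit: "aaa111", lastSyncMs: 1_000_000 };
const rewoundRepo: RepoState = { headHash: "bbb222", headCommitMs: 500_000 };
```

For `rewoundRepo`, the timestamp rule reports caught up even though the synced commit is no longer HEAD; the hash rule reports a mismatch.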

Your root-cause analysis (untracked dirs defeating the clean-tree gate) was exactly right and is the headline fix. Thanks for it.

@garrytan garrytan closed this May 30, 2026
garrytan added a commit that referenced this pull request May 30, 2026
…1623) (#1656)

* fix(staleness): commit-relative sync staleness (HEAD-hash local, durable column remote)

Quiet, fully-caught-up repos no longer false-alarm as SEVERELY STALE in
gbrain doctor / sources status. Staleness now means "is there committed
content the sync hasn't ingested?" not raw wall-clock since the last sync.

- git-head.ts: requireCleanWorkingTree gains 'ignore-untracked' mode (git
  status --porcelain --untracked-files=no). Untracked dirs no longer defeat
  the freshness short-circuit — sync's incremental path keys off the commit
  diff and never imports untracked files, so doctor agrees with sync.
- source-health.ts: newestCommitMs (HEAD committer time) + pure
  lagFromContentMs comparator; computeAllSourceMetrics {probeContent} routes
  local→live commit-hash, remote→stored column. Dead isSourceStale removed.
- migration v108 sources.newest_content_at + fresh-schema blobs.
- sync.ts: writeSyncAnchor stamps newest_content_at atomically with
  last_commit/last_sync_at; buildSyncStatusReport (remote get_status_snapshot)
  reads the column — no git subprocess (v0.41.27.0 trust boundary intact).
- doctor.ts: checkSyncFreshness short-circuit ignores untracked; remote path
  reads the column; clock-skew check stays on raw wall-clock.

Local consumers probe live git (catch HEAD moving to an old-dated commit, which
a timestamp compare would miss); remote consumers read the durable column so a
remote-callable endpoint never shells out to a DB-supplied local_path.
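
A rough sketch of the local/remote split described above, under stated assumptions: the `SourceRowSketch` shape, the `Caller` type, and the injected `probeHead` callback (standing in for a live git HEAD lookup) are all hypothetical, and returning `null` for "cannot tell" is an assumption about how a caller might fall back.

```typescript
type Caller = "local" | "remote";

// Hypothetical row shape; only newest_content_at is named in the commit message.
interface SourceRowSketch {
  localPath: string;
  lastCommit: string | null;
  lastSyncMs: number | null;
  newestContentAtMs: number | null; // sources.newest_content_at, written at sync time
}

function isCaughtUpSketch(
  row: SourceRowSketch,
  caller: Caller,
  probeHead: (localPath: string) => string | null, // stand-in for live git
): boolean | null {
  if (caller === "local") {
    // Local callers may shell out: compare the live HEAD hash to the anchor.
    const head = probeHead(row.localPath);
    if (head === null || row.lastCommit === null) return null;
    return head === row.lastCommit;
  }
  // Remote callers never touch localPath; they read the stored column only.
  if (row.newestContentAtMs === null || row.lastSyncMs === null) return null;
  return row.newestContentAtMs <= row.lastSyncMs;
}
```

Injecting the probe makes the boundary checkable in a test: on the remote path the callback is never invoked.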

Supersedes #1623 (re-implemented in base repo with the trust boundary preserved).

Co-Authored-By: t <t@t>

* chore(ci): offload tests to on-demand cloud runners from a local CLI

scripts/ship-remote-tests.sh pushes the branch, dispatches the test workflow,
and blocks on `gh run watch --exit-status` — a local caller (human or agent)
awaits the GitHub run exactly like a local `bun run test`, with a real pass/fail
exit code. Frees a load-saturated local machine (many Conductor agents running
their own bun-test suites at once → load avg 120 on 16 cores → PGLite OOM/crawl).

test.yml gains workflow_dispatch so the suite can be triggered from any branch.

* chore: bump version and changelog (v0.41.32.0)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: t <t@t>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
rayers added a commit to rayers/gbrain that referenced this pull request May 31, 2026
Resolve writeSyncAnchor signature conflict: PR garrytan#1430's pullFailed param and
upstream garrytan#1623's newestContentEpochMs param both added a 5th positional arg to
the same function. Merged to take both — newest_content_at (git-intrinsic HEAD
committer time) stamps regardless of pull outcome; last_sync_at (the
"observed upstream" freshness signal) stays gated on pullFailed. All 3 call
sites pass both args. tsc --noEmit clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
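
The merged stamping rule described in this commit message can be sketched as follows. This is an illustration only: the function name, the return shape, and the convention that an omitted field means "leave the stored value unchanged" are assumptions, and the real writeSyncAnchor takes positional arguments not shown here.

```typescript
// Hypothetical return shape: an omitted field is left unchanged in storage.
interface AnchorFieldsSketch {
  newest_content_at: Date;
  last_sync_at?: Date;
}

function anchorFieldsSketch(
  now: Date,
  pullFailed: boolean,
  newestContentEpochMs: number, // HEAD committer time in epoch ms
): AnchorFieldsSketch {
  // newest_content_at is git-intrinsic, so it is stamped regardless of the pull.
  const fields: AnchorFieldsSketch = {
    newest_content_at: new Date(newestContentEpochMs),
  };
  // last_sync_at means "observed upstream", so a failed pull must not advance it.
  if (!pullFailed) fields.last_sync_at = now;
  return fields;
}
```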
mgunnin added a commit to mgunnin/gbrain that referenced this pull request Jun 3, 2026
* upstream/master:
  v0.41.36.0 feat(mcp): publish agent skills (list_skills / get_skill) for thin clients (garrytan#1661)
  v0.41.35.0 feat(guardrails): vendor-neutral content guardrail seams (supersedes garrytan#1652) (garrytan#1660)
  v0.41.34.0 feat(search): retrieval cathedral — max-pool + title + alias + evidence (garrytan#1657)
  v0.41.33.0 feat(search): intent-aware adaptive return-sizing + agent-facing query param (garrytan#1640)
  v0.41.32.0 fix(staleness): commit-relative sync staleness (supersedes garrytan#1623) (garrytan#1656)
  v0.41.31.0 feat(embed): delta-aware sync --all cost gate + real stale-embedding semantics (garrytan#1632)
  v0.41.30.0 fix(brainstorm/lsd): --save writes the advertised .md file via canonical ingestion path (garrytan#1655)

# Conflicts:
#	src/core/operations.ts