Sync can never complete: DB is cross-region (us-east-1) from compute (us-west-2) — 71ms/query × ~90 queries/file = 6.5s per 2KB page → full sync ~14 days → last_sync never commits → permanent staleness #1958

@garrytan-agents

Symptom

gbrain doctor perpetually reports sync_freshness FAIL for the default source. last_sync has been frozen at 2026-06-04 03:39 UTC for 4+ days. Running gbrain sync --source default (the doctor's suggested fix) never clears it. The staleness alarm fires every doctor cycle, gets 'fixed', and immediately re-fails. It is not flaky — it is structurally impossible to satisfy under the current topology.

Root cause (measured, not guessed)

Compute and database are in different AWS regions.

  • Compute (Render): AWS_REGION=us-west-2
  • Supabase Postgres: aws-1-us-east-1.pooler.supabase.com
  • Measured RTT, SELECT 1 over the pooler (prepare=false, transaction mode): 71.7ms average (connect 602ms cold).
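A minimal sketch of how a roundtrip figure like the one above could be reproduced. It is not the measurement script used for this issue: `measure_rtt_ms` and its `run_query` parameter are hypothetical names, and the caller is assumed to pass a function that runs `SELECT 1` on an already-open connection so the cold connect cost is excluded from the average.

```python
import statistics
import time
from typing import Callable, Dict


def measure_rtt_ms(run_query: Callable[[], object], samples: int = 20) -> Dict[str, float]:
    """Time `samples` sequential calls of run_query and summarize in milliseconds."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        run_query()  # assumed to execute "SELECT 1" on an open connection
        timings.append((time.perf_counter() - start) * 1000.0)
    return {
        "avg_ms": statistics.fmean(timings),
        "min_ms": min(timings),
        "max_ms": max(timings),
    }


if __name__ == "__main__":
    # Stand-in for a real query: sleeps ~5 ms so the sketch runs without a database.
    print(measure_rtt_ms(lambda: time.sleep(0.005), samples=5))
```

Running the queries strictly one after another matters here: it mirrors how the importer issues them, so the average is the per-query cost the importer actually pays.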

The importer processes one file per transaction and issues many sequential queries inside each one. In the live sync log, every people/*.md page (avg 2,217 bytes) logs import.process_file slow ~6500ms. A 2KB file taking 6.5s is not content cost; it is ~90 sequential cross-country roundtrips × 71ms (90 × 71.7ms ≈ 6.45s).
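The per-file arithmetic can be checked with a short sketch. The function names are illustrative only, and the model assumes the per-file time is entirely serial roundtrips, which is the claim being made above rather than something the code proves.

```python
def per_file_seconds(queries_per_file: int, rtt_ms: float) -> float:
    """Latency of one file if every query waits for the previous one to return."""
    return queries_per_file * rtt_ms / 1000.0


def implied_queries(observed_ms: float, rtt_ms: float) -> float:
    """Number of serial roundtrips that would explain an observed per-file time."""
    return observed_ms / rtt_ms


if __name__ == "__main__":
    print(f"{per_file_seconds(90, 71.7):.2f} s per file at 71.7 ms RTT")
    print(f"{implied_queries(6500, 71.7):.1f} roundtrips implied by a 6500 ms file")
    print(f"{per_file_seconds(90, 2.0):.2f} s per file at a 2 ms RTT")
```

The last line uses a 2 ms roundtrip as an assumed same-region figure, to show how the same query count behaves once the latency term shrinks.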

The math that proves it can never finish

  • Files in default: 185,879
  • Per-file cost: ~6.5s (cross-region serial queries)
  • Full sync wall time: ~14 days
  • But sync-cron.sh wraps sync in timeout 1500 (25 min) and timeout 1800 (30 min).

Each run imports ~230 files (1500s ÷ 6.5s per file), is killed by SIGTERM when the timeout expires, resumes from its checkpoint on the next run, and never reaches the terminal commit that writes last_sync. So the timestamp never advances and the source is permanently 'stale' regardless of how much work actually happens. The downstream noise (worker_oom_loop, stalled autopilot-cycle jobs hitting max_stalled=3, cleared locks) is all secondary to this one fact.
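The figures above follow directly from the file count, the per-file cost, and the two timeout values. A small sketch of the calculation, with hypothetical function names:

```python
SECONDS_PER_DAY = 86_400


def full_sync_days(files: int, per_file_s: float) -> float:
    """Wall time for one uninterrupted pass over the corpus, in days."""
    return files * per_file_s / SECONDS_PER_DAY


def files_per_run(timeout_s: float, per_file_s: float) -> int:
    """Whole files that fit inside one timeout window."""
    return int(timeout_s // per_file_s)


if __name__ == "__main__":
    files, per_file_s = 185_879, 6.5
    print(f"full sync: {full_sync_days(files, per_file_s):.1f} days")
    for timeout_s in (1500, 1800):
        print(f"timeout {timeout_s}s: {files_per_run(timeout_s, per_file_s)} files per run")
```

The gap between a window of a few hundred files and a corpus of 185,879 is why a commit that only happens at full completion stays out of reach.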

Fix (in order of impact)

  1. Colocate the DB with compute. Move the Supabase project to us-west-2 (or move the Render service to us-east-1). 71ms -> <2ms. At ~90 queries per file that is roughly 0.2s per file instead of 6.5s, so sync drops from ~14 days to hours. This is the actual fix.
  2. Batch the importer. One transaction per file with N sequential queries is pathological at any non-trivial RTT. Use multi-row COPY/batched upserts per N files, and overlap roundtrips (pipeline / higher real parallelism) so latency stops serializing. This makes sync survivable even cross-region and is worth doing regardless.
  3. Decouple last_sync from full-corpus completion OR raise/remove the timeout wrapper so a sync can actually reach its terminal commit. Right now a 14-day job under a 25-min timeout can never record progress as 'fresh'. At minimum, commit last_sync incrementally per checkpoint, not only at full completion.
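Items 2 and 3 can be sketched together as a loop that imports files in batches and records progress after every batch, so a run cut short by a deadline still leaves an advanced marker behind. This is an illustration under stated assumptions, not gbrain's importer: `sync_with_checkpoints`, `import_batch`, and `commit_last_sync` are hypothetical names, and a real implementation would presumably write the batch and the checkpoint in the same database transaction.

```python
import time
from typing import Callable, Iterator, List, Optional, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def sync_with_checkpoints(
    files: Sequence[str],
    import_batch: Callable[[Sequence[str]], None],   # hypothetical: one batched upsert
    commit_last_sync: Callable[[int], None],         # hypothetical: persist progress
    batch_size: int = 500,
    deadline_s: Optional[float] = None,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Import files batch by batch, committing progress after each batch.

    Returns the number of files imported before the corpus ended or the
    deadline was reached.
    """
    start = clock()
    done = 0
    for batch in chunked(files, batch_size):
        # Stop cleanly before starting a batch we no longer have time for.
        if deadline_s is not None and clock() - start >= deadline_s:
            break
        import_batch(batch)       # one call per batch instead of many queries per file
        done += len(batch)
        commit_last_sync(done)    # progress is durable even if the run is killed next
    return done


if __name__ == "__main__":
    imported: List[Sequence[str]] = []
    checkpoints: List[int] = []
    total = sync_with_checkpoints(
        [f"people/{i}.md" for i in range(5)],
        imported.append,
        checkpoints.append,
        batch_size=2,
    )
    print(total, checkpoints)
```

Checking the deadline inside the loop lets the job exit on its own terms rather than being killed mid-batch by the outer timeout, and the per-batch commit means freshness no longer depends on a single run covering the whole corpus.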

Evidence

  • RTT 71.7ms measured via postgres client against GBRAIN_DATABASE_URL.
  • AWS_REGION=us-west-2 in process env; DB host aws-1-us-east-1.
  • Live log: repeated [gbrain phase] import.process_file slow 6xxx-7xxxms people/*.md, every file >5s.
  • sources status: default LAST SYNC stuck at 2026-06-04 03:39:13 across 4 days of doctor runs.
  • sync-cron.sh:41 timeout 1800 ..., :58 timeout 1500 ....

Not the cause (ruled out)

  • Worker OOM / 16GB max-rss: box has 93GB free; OOM-restarts are a consequence of the cycle running for hours, not the staleness cause.
  • Stale locks / wedged queue: doctor auto-clears these; timestamp still never moves.
  • Corrupt files: straylight had 24 null-byte (0x00) files (--skip-failed cleared them); separate issue, already handled.
