Symptom
gbrain doctor perpetually reports sync_freshness FAIL for the default source. last_sync has been frozen at 2026-06-04 03:39 UTC for 4+ days. Running gbrain sync --source default (the doctor's suggested fix) never clears it. The staleness alarm fires every doctor cycle, gets 'fixed', and immediately re-fails. It is not flaky — it is structurally impossible to satisfy under the current topology.
Root cause (measured, not guessed)
Compute and database are in different AWS regions.
- Compute (Render): AWS_REGION=us-west-2
- Supabase Postgres: aws-1-us-east-1.pooler.supabase.com
- Measured RTT, SELECT 1 over the pooler (prepare=false, transaction mode): 71.7ms average (connect 602ms cold).
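A measurement like the RTT figure above can be reproduced with a small timing harness. The sketch below is not the script that produced the numbers in this report; measure_rtt and run_query are hypothetical names, and run_query is assumed to be wired by the caller to a Postgres client so that each call executes SELECT 1 on an already-open connection.

```python
import statistics
import time
from typing import Callable, Dict


def measure_rtt(run_query: Callable[[], None], samples: int = 20) -> Dict[str, float]:
    """Time repeated calls to run_query and summarize them in milliseconds.

    run_query is a hypothetical hook (an assumption, not a gbrain API): it
    should execute one trivial query, e.g. SELECT 1, on an open connection
    so that connect time is excluded from the samples.
    """
    if samples < 1:
        raise ValueError("samples must be >= 1")
    timings_ms = []
    for _ in range(samples):
        start = time.perf_counter()
        run_query()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "samples": float(len(timings_ms)),
        "avg_ms": statistics.fmean(timings_ms),
        "min_ms": min(timings_ms),
        "max_ms": max(timings_ms),
    }
```

Timing the cold connect separately from the steady-state queries keeps the two figures (602ms connect, 71.7ms per query) from blurring into one average.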
The importer processes one file per transaction, issuing multiple sequential queries for each. In the live sync log, every people/*.md page (avg 2,217 bytes) logs import.process_file slow ~6500ms. A 2KB file taking 6.5s is not content cost: it is ~90 sequential cross-country roundtrips × 71ms.
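The roundtrip estimate can be checked with a few lines of arithmetic. This is a rough model under one assumption: per-file time is dominated by network latency, so server-side execution time is ignored. The 2 ms figure is the colocated target named in the fix list, not a measurement.

```python
RTT_MS = 71.7         # measured average RTT from this report
PER_FILE_MS = 6500.0  # typical import.process_file time from the sync log


def implied_roundtrips(per_file_ms: float, rtt_ms: float) -> float:
    """Sequential roundtrips needed to explain per_file_ms if latency dominates."""
    return per_file_ms / rtt_ms


def per_file_ms_at(rtt_ms: float, roundtrips: float) -> float:
    """Projected per-file time for the same query count at a different RTT."""
    return roundtrips * rtt_ms


roundtrips = implied_roundtrips(PER_FILE_MS, RTT_MS)
print(f"implied roundtrips per file: {roundtrips:.1f}")
print(f"per-file cost at 2 ms RTT:   {per_file_ms_at(2.0, roundtrips):.0f} ms")
```

With the measured values this gives roughly 90 roundtrips per file, and the same query pattern at 2 ms RTT would cost under 200 ms per file.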
The math that proves it can never finish
- Files in default: 185,879
- Per-file cost: ~6.5s (cross-region serial queries)
- Full sync wall time: ~14 days
- But sync-cron.sh wraps sync in timeout 1500 (25 min) and timeout 1800 (30 min).
Each run imports ~230 files, gets SIGTERM'd, resumes from checkpoint next run, and never reaches the terminal commit that writes last_sync. So the timestamp never advances and the source is permanently 'stale' regardless of how much work actually happens. The downstream noise (worker_oom_loop, stalled autopilot-cycle jobs hitting max_stalled=3, cleared locks) is all secondary to this one fact.
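The figures above follow directly from the file count, the per-file cost, and the timeout values. A short sketch that reproduces them (function names are illustrative only):

```python
FILES = 185_879    # files in the default source
PER_FILE_S = 6.5   # observed per-file import cost, in seconds


def full_sync_days(files: int, per_file_s: float) -> float:
    """Wall time for one uninterrupted serial pass over the corpus, in days."""
    return files * per_file_s / 86_400


def files_per_run(timeout_s: int, per_file_s: float) -> int:
    """Whole files that fit inside one timeout window."""
    return int(timeout_s // per_file_s)


days = full_sync_days(FILES, PER_FILE_S)
per_run_25 = files_per_run(1500, PER_FILE_S)
per_run_30 = files_per_run(1800, PER_FILE_S)

print(f"full sync: {days:.1f} days")
print(f"files per 25-min run: {per_run_25} ({per_run_25 / FILES:.2%} of corpus)")
print(f"files per 30-min run: {per_run_30} ({per_run_30 / FILES:.2%} of corpus)")
```

Each timeout window covers well under 1% of the corpus, which is why no single run gets anywhere near the terminal commit.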
Fix (in order of impact)
- Colocate the DB with compute. Move the Supabase project to us-west-2 (or move the Render service to us-east-1). 71ms -> <2ms. Sync drops from ~14 days to hours. This is the actual fix.
- Batch the importer. One transaction per file with N sequential queries is pathological at any non-trivial RTT. Use multi-row COPY/batched upserts per N files, and overlap roundtrips (pipeline / higher real parallelism) so latency stops serializing. This makes sync survivable even cross-region and is worth doing regardless.
- Decouple last_sync from full-corpus completion OR raise/remove the timeout wrapper so a sync can actually reach its terminal commit. Right now a 14-day job under a 25-min timeout can never record progress as 'fresh'. At minimum, commit last_sync incrementally per checkpoint, not only at full completion.
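The third fix, recording last_sync at every checkpoint, could look roughly like the loop below. This is a minimal sketch and not gbrain's importer: SyncState, sync_with_checkpoints, and import_batch are hypothetical names, the state is held in memory rather than in Postgres, and import_batch stands in for whatever batched upsert the importer ends up using.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class SyncState:
    """Hypothetical stand-in for the row that holds the sync checkpoint."""
    next_index: int = 0                # first file not yet imported
    last_sync: Optional[float] = None  # epoch seconds of last recorded progress


def sync_with_checkpoints(
    files: Sequence[str],
    state: SyncState,
    import_batch: Callable[[Sequence[str]], None],
    batch_size: int,
    budget_s: float,
    now: Callable[[], float] = time.monotonic,
    wall_clock: Callable[[], float] = time.time,
) -> bool:
    """Import files in batches until done or the time budget runs out.

    Returns True once the whole corpus has been imported.
    """
    deadline = now() + budget_s
    while state.next_index < len(files):
        if now() >= deadline:
            return False  # stop cleanly; progress so far is already recorded
        batch = files[state.next_index:state.next_index + batch_size]
        import_batch(batch)
        # Record progress after every batch, not only at full completion.
        state.next_index += len(batch)
        state.last_sync = wall_clock()
    return True
```

Because the loop checks its own budget and exits between batches, a run that stops early still leaves last_sync advanced, and the next run resumes from state.next_index. In a real importer the checkpoint and last_sync update would presumably be written in the same transaction as the batch.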
Evidence
- RTT 71.7ms measured via postgres client against GBRAIN_DATABASE_URL.
- AWS_REGION=us-west-2 in process env; DB host aws-1-us-east-1.
- Live log: repeated [gbrain phase] import.process_file slow 6xxx-7xxxms people/*.md, every file >5s.
- sources status: default LAST SYNC stuck at 2026-06-04 03:39:13 across 4 days of doctor runs.
- sync-cron.sh:41 timeout 1800 ..., :58 timeout 1500 ....
Not the cause (ruled out)
- Worker OOM / 16GB max-rss: box has 93GB free; OOM-restarts are a consequence of the cycle running for hours, not the staleness cause.
- Stale locks / wedged queue: doctor auto-clears these; timestamp still never moves.
- Corrupt files: straylight had 24 null-byte (0x00) files (--skip-failed cleared them); separate issue, already handled.