fix: embed --stale pulls all chunks every cycle (3 TB egress regression) #775
kyledeanjackson wants to merge 3 commits into
…PageSlugs

Two related changes that lay the groundwork for fixing a hot-loop egress regression in `embed --stale`:

1. `getChunks(slug)` previously did `SELECT cc.*`, which includes the 1536-dim `embedding` vector column. `rowToChunk()` already defaults to `includeEmbedding=false` and discards the column on parse, so the bytes were pulled across the wire only to be thrown away. Switched to an explicit projection that excludes `embedding`. Callers that actually need the vector (re-rank, similarity) already have a dedicated `getChunksWithEmbeddings()` method.

2. New `listStalePageSlugs()` returns the slugs of pages with at least one chunk where `embedding IS NULL`. Used by the next commit's `embed --stale` fast-path to avoid iterating every page in the brain when nothing is stale.

Both engines updated for parity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`gbrain embed --stale` previously iterated every page in the brain and called `getChunks(slug)` on each one before filtering chunks where `embedded_at IS NULL`. On a steady-state brain (everything already embedded) this is wasted work: every cycle does N `getChunks` round-trips just to discover there's nothing to do.

In production this manifested as 3 TB/month of Postgres egress on a fully-embedded brain (~682 pages) when the daemon polled rapidly. The autopilot plist's KeepAlive=true (separate fix) was the trigger; this is the underlying multiplier.

Fix: when `staleOnly=true`, query `listStalePageSlugs()` first. If empty, return immediately. If non-empty, iterate only those pages, not every page in the brain.

Behavior changes intentionally:
- `pages_processed` and `total_chunks` in `--stale` mode now reflect the filtered (stale-only) set, not the entire brain. Test updated to assert the new semantics.

The non-`--stale` path (`--all`) is unchanged.

Combined with the `getChunks` projection fix in the previous commit, egress per cycle drops from ~100 MB to a few hundred bytes when the brain is fully embedded.

Verified on a 682-page brain: cycle time 0.9s, log shows "Embedded 0 chunks across 0 pages" instead of "Embedded 0 chunks across 682 pages".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…=true

The launchd plist generated by `gbrain autopilot --install` set KeepAlive=true with no internal sleep in the wrapper script. launchd restarts the wrapper as soon as it exits, so each cycle runs back-to-back with effectively no delay.

On one machine this produced ~4200 cycles/day (~one cycle every 20s) against a fully-embedded brain. Combined with two query-side bugs (fixed in the prior two commits), it drove a 3 TB/month Postgres egress overage.

Switching to StartInterval=300 caps the cadence at 288 cycles/day (one cycle per 5 minutes) regardless of how fast a single cycle exits. This is the correct launchd primitive for "run periodically on a schedule"; KeepAlive is for "respawn if the process dies", which the wrapper isn't. Tunable per-user by editing the generated plist directly.

Existing user installs need to either re-run `gbrain autopilot --install` (which regenerates the plist) or hand-edit the deployed plist to swap KeepAlive for StartInterval.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Thanks for this contribution, and apologies for the slow triage.

We did a full pass over the entire PR backlog. gbrain has moved fast, and the maintainer's larger "cathedral" rewrites have superseded a big share of community PRs: the AI gateway + recipes + user_provided_models system replaced almost all individual provider PRs; #1805 fixed the whole Postgres module-singleton class; #1542 unified the type taxonomy; #1657 the retrieval path; #1802 the doctor; and so on.

We're closing this one in that cleanup: either the fix already landed on master, it duplicates another PR or merged change, or it's outside the current merge bar. Where a closed PR carried a genuinely valuable idea, we've recorded it in docs/designs/COMMUNITY_IDEAS.md so nothing good is lost (a few may graduate into TODOs).

Please don't read the close as a judgment of the work; thank you for contributing. If you believe the underlying issue is still live on the latest master, reopen with a quick note and we'll take another look. 🙏
TL;DR
Three compounding bugs in `embed --stale` + the autopilot launchd plist
caused ~3 TB/month of Postgres egress on a fully-embedded brain
(~682 pages) because every autopilot cycle re-pulled all chunks across
the wire just to discover there was nothing to embed. This PR fixes
all three layers and adds an early-exit fast path so steady-state
brains do near-zero work per cycle.
Verified on production data — autopilot cycle time goes from ~10s
fetching ~100 MB of vectors to <1s pulling a few hundred bytes.
What I observed
Supabase pooler-egress on a 329-page brain (~1443 chunks) climbed to
3,062 GB / 250 GB quota = 1,225% in a single billing cycle. Chart
showed a sharp transition: zero egress before the day the brain hit
100% embedded coverage, then 400–600 GB/day sustained.
Cached egress: 0 (nothing was being served from the pooler cache).
Storage: 0 (not file traffic). Realtime: 0 (no WebSocket fanout).
Edge functions: 0. Pure PostgREST/pooler row traffic.
Realtime concurrent connections peaked at 4 — so it wasn't volume
from many clients, it was a small number of clients pulling the same
rows over and over.
Root cause
Three independent contributors, each amplifying the others:
1. `KeepAlive=true` autopilot is a hot loop, not a periodic task
The plist generated by `gbrain autopilot --install` had no `StartInterval`, no internal `sleep`, just `KeepAlive=true`. launchd restarts the wrapper as soon as it exits, so cycles run back-to-back.
On one user's machine: 254,050 cycles in ~60 days = ~4,233 per day
= one every ~20 seconds.
Log evidence (every cycle):
`KeepAlive` is the wrong launchd primitive for a periodic task; that's what `StartInterval` is for.
2. `embedAll` calls `getChunks` for every page, then filters in memory
When `staleOnly` is true and the brain is fully embedded, `toEmbed` is empty for every page, but every page's chunks are pulled across the wire first, just to be discarded.
3. `getChunks` SELECTs the embedding column unnecessarily
`rowToChunk()` defaults to `includeEmbedding=false` and discards the vector after fetching. So the bytes were pulled across the network only to be thrown away. A separate `getChunksWithEmbeddings()` already exists for the legitimate caller (`migrate-engine.ts`).
The fix
Three small commits, each addressing one layer:
1. `fix(engine)`: `getChunks` projects only the columns `rowToChunk` actually reads. Adds `listStalePageSlugs()` engine method (one query: `SELECT DISTINCT page_id FROM content_chunks WHERE embedding IS NULL`).
2. `fix(embed)`: `embedAll(staleOnly=true)` calls `listStalePageSlugs()` first. If empty, log + return. Otherwise filter `pages` to only those with stale chunks, then iterate normally. The non-`--stale` path is unchanged.
3. `fix(autopilot)`: plist template uses `StartInterval=300` instead of `KeepAlive=true`. ~288 cycles/day max. Tunable per-user.
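The early-exit flow from the second commit can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the shape of the `Engine` interface, the `listAllPageSlugs` method, the `embedChunk` callback, the `Chunk` fields, and the in-memory `FakeEngine` are all assumptions made for the example. Only `getChunks`, `listStalePageSlugs`, `staleOnly`, `toEmbed`, `pages_processed`, and `total_chunks` are names taken from the description above.

```typescript
// Hypothetical chunk shape; the real type has more fields.
interface Chunk {
  id: number;
  embedded: boolean;
}

// Assumed engine surface. getChunks and listStalePageSlugs are named in this
// PR; listAllPageSlugs is an invented stand-in for the non-stale page listing.
interface Engine {
  listAllPageSlugs(): Promise<string[]>;
  listStalePageSlugs(): Promise<string[]>;
  getChunks(slug: string): Promise<Chunk[]>;
}

interface EmbedResult {
  pages_processed: number;
  total_chunks: number;
  embedded: number;
}

async function embedAll(
  engine: Engine,
  staleOnly: boolean,
  embedChunk: (chunk: Chunk) => Promise<void>,
): Promise<EmbedResult> {
  const result: EmbedResult = { pages_processed: 0, total_chunks: 0, embedded: 0 };

  // Fast path: one small query decides which pages are worth visiting.
  const slugs = staleOnly
    ? await engine.listStalePageSlugs()
    : await engine.listAllPageSlugs();
  if (slugs.length === 0) return result; // nothing stale: no getChunks calls at all

  for (const slug of slugs) {
    const chunks = await engine.getChunks(slug);
    const toEmbed = staleOnly ? chunks.filter((c) => !c.embedded) : chunks;
    for (const chunk of toEmbed) {
      await embedChunk(chunk);
      result.embedded += 1;
    }
    result.pages_processed += 1;
    // Assumption: total_chunks counts every chunk on the pages visited.
    result.total_chunks += chunks.length;
  }
  return result;
}

// In-memory stand-in that counts getChunks round-trips.
class FakeEngine implements Engine {
  getChunksCalls = 0;
  pages: Map<string, Chunk[]>;

  constructor(pages: Map<string, Chunk[]>) {
    this.pages = pages;
  }

  async listAllPageSlugs(): Promise<string[]> {
    return Array.from(this.pages.keys());
  }

  async listStalePageSlugs(): Promise<string[]> {
    return Array.from(this.pages.entries())
      .filter(([, chunks]) => chunks.some((c) => !c.embedded))
      .map(([slug]) => slug);
  }

  async getChunks(slug: string): Promise<Chunk[]> {
    this.getChunksCalls += 1;
    return this.pages.get(slug) ?? [];
  }
}
```

With a fake engine whose pages are all embedded, `embedAll(engine, true, ...)` returns zeroed counters and `getChunksCalls` stays at 0, which is the steady-state behavior the PR is after.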
Verification
Tests
Full suite: 2086 pass / 18 fail / 3 errors. The 18 failures are all
pre-existing `beforeEach hook timed out` / `PGLite not connected` flakes in
`dream.test.ts`, `orphans.test.ts`, `multi-source-integration.test.ts`, etc.,
confirmed by re-running on `main` (clean tree) where the same files all pass
in isolation. None touch any of the files in this PR.
Real-world
Tested against the production brain (682 pages, fully embedded):
Six pages had transient stale chunks from a recent ingest. Old code
would have done 682 round-trips returning vectors; new code did one
small query and returned in <1s. After re-embedding those 6 pages,
subsequent runs early-exit:
Production cutover
Re-enabled the autopilot with the patched code on the same machine
that was bleeding 400–600 GB/day. The Supabase egress chart will
confirm the fix over the next 24h; the chart and a follow-up will
land in the linked issue.
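As a rough sanity check, the figures quoted in this PR are mutually consistent. The sketch below only does arithmetic on numbers stated above (254,050 cycles in ~60 days, ~100 MB per cycle, 3,062 GB against a 250 GB quota, `StartInterval=300`). The helper names are made up for illustration, and decimal units (1 GB = 1000 MB) are an assumption.

```typescript
const SECONDS_PER_DAY = 24 * 60 * 60;

// Average cycles per day over an observation window.
function cyclesPerDay(totalCycles: number, days: number): number {
  return totalCycles / days;
}

// Daily egress in GB, assuming every cycle moves the same number of MB.
function dailyEgressGB(cycles: number, mbPerCycle: number): number {
  return (cycles * mbPerCycle) / 1000;
}

// Roughly 4,234 cycles/day; the PR quotes ~4,233 (the 60 days is approximate).
const observedCycles = cyclesPerDay(254050, 60);

// About 20 seconds between cycles under KeepAlive=true.
const secondsBetweenCycles = SECONDS_PER_DAY / observedCycles;

// About 423 GB/day at ~100 MB per cycle, inside the observed 400 to 600 GB/day band.
const egressBeforeGB = dailyEgressGB(observedCycles, 100);

// StartInterval=300 allows at most 288 cycles/day.
const cappedCycles = SECONDS_PER_DAY / 300;

// 3,062 GB against a 250 GB quota is about 1,225%.
const quotaUsedPercent = (3062 / 250) * 100;
```

The per-cycle payload is the PR's own estimate, so the daily figure is an order-of-magnitude check rather than a measurement.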
Behavior changes (intentional)
`EmbedResult` semantics change in `--stale` mode. Fields compared (before vs. after): `pages_processed`, `total_chunks`, `skipped`, `embedded`, and `would_embed` (dry-run).
The new semantics are arguably more useful: they describe work done,
not work considered. The `--all` path is unchanged.
JSDoc on `EmbedResult` updated to call this out.
Migration notes for existing installs
The `--install` template change only affects new installs. Existing users have plists with `KeepAlive=true` already deployed. They can:
- Run `gbrain autopilot --uninstall && gbrain autopilot --install` to regenerate from the new template, OR
- Hand-edit `~/Library/LaunchAgents/com.gbrain.autopilot.plist` to replace `<key>KeepAlive</key><true/>` with `<key>StartInterval</key><integer>300</integer>`, then `launchctl bootout gui/$UID/com.gbrain.autopilot && launchctl bootstrap gui/$UID ~/Library/LaunchAgents/com.gbrain.autopilot.plist`.
A doctor / heal step that detects the legacy plist and rewrites it
in place would be a nice follow-up. Happy to add it if you'd like.
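Such a heal step might boil down to a string rewrite of the plist. The following is a hypothetical sketch, not part of this PR: the function name `migrateAutopilotPlist` and its return shape are invented here, and it assumes the legacy plist contains the `KeepAlive` key directly followed by `<true/>`, with optional whitespace between them. Reading and writing the file, and the `launchctl` reload, are left out.

```typescript
interface PlistMigration {
  changed: boolean;
  xml: string;
}

// Matches <key>KeepAlive</key> followed by <true/>, allowing whitespace or
// newlines between the two elements. Not global, so test() keeps no state.
const LEGACY_KEEPALIVE = /<key>KeepAlive<\/key>\s*<true\s*\/>/;

function migrateAutopilotPlist(xml: string, intervalSeconds: number = 300): PlistMigration {
  if (!LEGACY_KEEPALIVE.test(xml)) {
    // Already migrated, or not a legacy plist: leave it untouched.
    return { changed: false, xml };
  }
  const replacement = `<key>StartInterval</key><integer>${intervalSeconds}</integer>`;
  return { changed: true, xml: xml.replace(LEGACY_KEEPALIVE, replacement) };
}
```

Returning a `changed` flag keeps the rewrite idempotent, so a caller could reload the agent only when something was actually modified. The sketch does not handle a plist that already contains both keys.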
Out of scope
`searchVector()` already projects correctly, no change needed there
fix is to not fetch the column at all