fix: `embed --stale` pulls all chunks every cycle (3 TB egress regression) by kyledeanjackson · Pull Request #775 · garrytan/gbrain

kyledeanjackson · 2026-05-09T05:18:51Z

TL;DR

Three compounding bugs in embed --stale + the autopilot launchd plist
caused ~3 TB/month of Postgres egress on a fully-embedded brain
(~682 pages) because every autopilot cycle re-pulled all chunks across
the wire just to discover there was nothing to embed. This PR fixes
all three layers and adds an early-exit fast path so steady-state
brains do near-zero work per cycle.

Verified on production data — autopilot cycle time goes from ~10s
fetching ~100 MB of vectors to <1s pulling a few hundred bytes.

What I observed

Supabase pooler-egress on a 329-page brain (~1443 chunks) climbed to
3,062 GB / 250 GB quota = 1,225% in a single billing cycle. Chart
showed a sharp transition: zero egress before the day the brain hit
100% embedded coverage, then 400–600 GB/day sustained.

03 May  04 May  05 May  06 May  07 May  08 May
 200GB   480GB   450GB   410GB   450GB   610GB

Cached egress: 0 (nothing was being served from the pooler cache).
Storage: 0 (not file traffic). Realtime: 0 (no WebSocket fanout).
Edge functions: 0. Pure PostgREST/pooler row traffic.

Realtime concurrent connections peaked at 4 — so it wasn't volume
from many clients, it was a small number of clients pulling the same
rows over and over.

Root cause

Autopilot:  KeepAlive=true + no sleep    →  ~4200 cycles/day
   ×
embedAll:   iterates all 682 pages       →  682 getChunks/cycle
   ×
getChunks:  SELECT cc.* including        →  ~30 KB/chunk over the wire
            the 1536-dim embedding          (vector marshalled as JSON
            column                          text balloons vs. binary)

= ~100 MB per cycle × 4200 cycles/day ≈ 420 GB/day  ✓ matches chart
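
The multiplication above can be reproduced in a few lines. This is a back-of-the-envelope sketch that uses only the rounded figures quoted in this PR; the function and constant names are illustrative and not part of gbrain.

```typescript
// Back-of-the-envelope egress model using the rounded figures from this PR.
// All names here are illustrative; nothing below is gbrain code.

const MB_PER_CYCLE = 100;        // ~100 MB of chunk rows pulled per cycle
const SECONDS_PER_DAY = 86_400;

/** Daily egress in GB for a given number of cycles per day (1 GB = 1000 MB). */
function dailyEgressGB(cyclesPerDay: number, mbPerCycle: number = MB_PER_CYCLE): number {
  return (cyclesPerDay * mbPerCycle) / 1000;
}

/** Upper bound on cycles per day when launchd fires every `intervalSeconds`. */
function maxCyclesPerDay(intervalSeconds: number): number {
  return Math.floor(SECONDS_PER_DAY / intervalSeconds);
}

const hotLoop = dailyEgressGB(4200);                        // KeepAlive=true hot loop
const throttledOnly = dailyEgressGB(maxCyclesPerDay(300));  // StartInterval=300, queries unfixed

console.log(`hot loop:       ${hotLoop} GB/day`);
console.log(`throttled only: ${throttledOnly} GB/day`);
```

Under these assumptions, throttling alone would still move roughly 28.8 GB/day on a fully embedded brain, which over 30 days exceeds the 250 GB quota. That is why the query-side layers need fixing as well.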

Three independent contributors, each amplifying the others:

1. `KeepAlive=true` autopilot is a hot loop, not a periodic task

The plist generated by gbrain autopilot --install had no
StartInterval, no internal sleep, just KeepAlive=true. launchd
restarts the wrapper as soon as it exits, so cycles run back-to-back.
On one user's machine: 254,050 cycles in ~60 days = ~4,233 per day
= one every ~20 seconds.

Log evidence (every cycle):

[autopilot wrapper] starting; openai_key=set
[autopilot wrapper] sync source=default …
[autopilot wrapper] sync source=bh-brain …
[autopilot wrapper] sync source=bh-vault …
[autopilot wrapper] embed --stale --all
Embedded 0 chunks across 682 pages    ← all the work, none of the value
[autopilot wrapper] cycle complete

KeepAlive is the wrong launchd primitive for a periodic task — that's
what StartInterval is for.

2. `embedAll` calls `getChunks` for every page, then filters in memory

async function embedAll(engine, staleOnly, …) {
  const pages = await engine.listPages({ limit: 100000 });
  // ...
  async function embedOnePage(page) {
    const chunks = await engine.getChunks(page.slug);   // 682 round-trips
    const toEmbed = staleOnly
      ? chunks.filter(c => !c.embedded_at)              // filter AFTER fetch
      : chunks;
    // ...
  }
}

When staleOnly is true and the brain is fully embedded, toEmbed
is empty for every page — but every page's chunks are pulled across
the wire first, just to be discarded.

3. `getChunks` SELECTs the embedding column unnecessarily

async getChunks(slug: string): Promise<Chunk[]> {
  const rows = await sql`
    SELECT cc.* FROM content_chunks cc       -- ← includes 1536-dim vector
    JOIN pages p ON p.id = cc.page_id
    WHERE p.slug = ${slug}
    ORDER BY cc.chunk_index
  `;
  return rows.map((r) => rowToChunk(r));     // rowToChunk(includeEmbedding=false)
}                                              // already drops it on parse

rowToChunk() defaults to includeEmbedding=false and discards the
vector after fetching. So the bytes were pulled across the network
only to be thrown away. A separate getChunksWithEmbeddings() already
exists for the legitimate caller (migrate-engine.ts).
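
To get a feel for why the embedding column dominates the payload, the sketch below compares the binary size of one 1536-dimension float vector with a text rendering of the same values. It is an approximation only: JSON.stringify stands in for the wire text format, and the values are synthetic, so the exact byte counts will differ from what pgvector and postgres-js produce.

```typescript
// Rough size comparison for one 1536-dimension embedding.
// Assumption: JSON.stringify stands in for the text encoding of a vector;
// pgvector's actual text output will differ in length per element.

const DIMENSIONS = 1536;

// Deterministic stand-in values; real embeddings come from the model.
const vector = new Float32Array(DIMENSIONS).map((_, i) => Math.sin(i + 1));

// 32-bit float storage: 4 bytes per dimension.
const binaryBytes = vector.byteLength;

// Text form "[0.84...,0.90...,...]". The output is pure ASCII, so the
// character count equals the byte count.
const textBytes = JSON.stringify(Array.from(vector)).length;

console.log({ binaryBytes, textBytes, ratio: (textBytes / binaryBytes).toFixed(1) });
```

Whatever the exact ratio, the column is never read by rowToChunk(), so leaving it out of the projection removes the cost entirely.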

The fix

Three small commits, each addressing one layer:

- fix(engine): getChunks projects only the columns rowToChunk
  actually reads. Adds a listStalePageSlugs() engine method (one
  query: SELECT DISTINCT page_id FROM content_chunks WHERE embedding IS NULL).
- fix(embed): embedAll(staleOnly=true) calls listStalePageSlugs()
  first. If the result is empty, log and return. Otherwise filter
  pages to only those with stale chunks, then iterate normally. The
  path without --stale is unchanged.
- fix(autopilot): plist template uses StartInterval=300 instead of
  KeepAlive=true. ~288 cycles/day max. Tunable per-user.
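
The fast path can be sketched against an in-memory stand-in engine. Only the method names listPages, getChunks, and listStalePageSlugs come from this PR; their exact signatures, the Chunk shape, the embedStale function, and the fake engine are assumptions made for illustration, not the code in the commits.

```typescript
// Sketch of the --stale fast path against an in-memory stand-in engine.
// Only listPages, getChunks, and listStalePageSlugs are named in this PR;
// their signatures, the Chunk shape, and embedStale itself are assumptions.

interface Chunk { chunk_index: number; embedded_at: Date | null; }
interface Page { slug: string; }

interface Engine {
  listPages(): Promise<Page[]>;
  getChunks(slug: string): Promise<Chunk[]>;
  listStalePageSlugs(): Promise<string[]>;
}

/** Returns the slugs visited and how many chunks still need embedding. */
async function embedStale(engine: Engine): Promise<{ visited: string[]; toEmbed: number }> {
  const stale = new Set(await engine.listStalePageSlugs());
  if (stale.size === 0) {
    // Early exit: one small query, no chunk rows fetched.
    return { visited: [], toEmbed: 0 };
  }
  // Visit only the pages that reported at least one stale chunk.
  const pages = (await engine.listPages()).filter((p) => stale.has(p.slug));
  let toEmbed = 0;
  for (const page of pages) {
    const chunks = await engine.getChunks(page.slug);
    toEmbed += chunks.filter((c) => !c.embedded_at).length;
  }
  return { visited: pages.map((p) => p.slug), toEmbed };
}

/** In-memory engine that counts getChunks calls, for demonstration only. */
function makeFakeEngine(data: Record<string, Chunk[]>) {
  let getChunksCalls = 0;
  const engine: Engine = {
    listPages: async () => Object.keys(data).map((slug) => ({ slug })),
    getChunks: async (slug) => {
      getChunksCalls++;
      return data[slug] ?? [];
    },
    listStalePageSlugs: async () =>
      Object.keys(data).filter((slug) =>
        (data[slug] ?? []).some((c) => c.embedded_at === null),
      ),
  };
  return { engine, calls: () => getChunksCalls };
}
```

With a fully embedded data set, embedStale resolves without a single getChunks call; with one stale page out of many, it makes exactly one. The real embedAll also performs the embedding and builds an EmbedResult, which this sketch omits.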

Verification

Tests

$ bun run typecheck
$ bun test test/embed.test.ts test/autopilot-install.test.ts test/pglite-engine.test.ts
 96 pass  0 fail  189 expect() calls

Full suite: 2086 pass / 18 fail / 3 errors. The 18 failures are all
pre-existing "beforeEach hook timed out" / "PGLite not connected"
flakes in dream.test.ts, orphans.test.ts,
multi-source-integration.test.ts, etc. Confirmed by re-running on main
(clean tree), where the same files all pass in isolation. None of the
failing tests touch any of the files in this PR.

Real-world

Tested against the production brain (682 pages, fully embedded):

$ time gbrain embed --stale
[embed.pages] start
[embed.pages] 6/6 (100%)
Embedded 0 chunks across 6 pages
[embed.pages] 6/6 (100%) done

real    0m0.893s

Six pages had transient stale chunks from a recent ingest. Old code
would have done 682 round-trips returning vectors; new code did one
small query and returned in <1s. After re-embedding those 6 pages,
subsequent runs early-exit:

$ time gbrain embed --stale
Embedded 0 chunks across 0 pages

real    0m0.02s

Production cutover

Re-enabled the autopilot with the patched code on the same machine
that was bleeding 400–600 GB/day. The Supabase egress chart will
confirm the fix over the next 24h; the chart and a follow-up will
land in the linked issue.

Behavior changes (intentional)

EmbedResult semantics change in --stale mode:

| Field | Old (`--stale`) | New (`--stale`) |
| --- | --- | --- |
| `pages_processed` | every page in brain | pages with at least one stale chunk |
| `total_chunks` | every chunk in those pages | only chunks on stale pages |
| `skipped` | every already-embedded chunk anywhere | skipped chunks on visited pages |
| `embedded` | unchanged | unchanged |
| `would_embed` (dry-run) | unchanged | unchanged |

The new semantics are arguably more useful — they describe work done,
not work considered. The --all path is unchanged.

JSDoc on EmbedResult updated to call this out.
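
A small worked example may make the counter change in the table above concrete. The field names come from this PR; the PageChunks shape and both helper functions are hypothetical and exist only to show how the same brain yields different numbers under the old and new `--stale` semantics.

```typescript
// Illustration of the counter semantics described above.
// Field names come from this PR; PageChunks and the helpers are hypothetical.

interface StaleCounters {
  pages_processed: number;
  total_chunks: number;
  skipped: number;
}

// One boolean per chunk: true means the chunk is already embedded.
type PageChunks = { slug: string; embedded: boolean[] };

function countChunks(pages: PageChunks[]): StaleCounters {
  const all = pages.reduce<boolean[]>((acc, p) => acc.concat(p.embedded), []);
  return {
    pages_processed: pages.length,
    total_chunks: all.length,
    skipped: all.filter(Boolean).length, // already embedded, so skipped
  };
}

/** Old --stale behaviour: every page in the brain is visited. */
const oldCounters = (brain: PageChunks[]): StaleCounters => countChunks(brain);

/** New --stale behaviour: only pages with at least one stale chunk are visited. */
const newCounters = (brain: PageChunks[]): StaleCounters =>
  countChunks(brain.filter((p) => p.embedded.some((e) => !e)));

const brain: PageChunks[] = [
  { slug: "a", embedded: [true, true] },
  { slug: "b", embedded: [true, false] },
];

console.log(oldCounters(brain)); // { pages_processed: 2, total_chunks: 4, skipped: 3 }
console.log(newCounters(brain)); // { pages_processed: 1, total_chunks: 2, skipped: 1 }
```

Anything that reads these counters to estimate brain size would need to switch to a different source after this change.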

Migration notes for existing installs

The --install template change only affects new installs. Existing
users have plists with KeepAlive=true already deployed. They can:

- Easy: gbrain autopilot --uninstall && gbrain autopilot --install
  to regenerate from the new template, OR
- Manual: edit ~/Library/LaunchAgents/com.gbrain.autopilot.plist
  to replace <key>KeepAlive</key><true/> with
  <key>StartInterval</key><integer>300</integer>, then run
  launchctl bootout gui/$UID/com.gbrain.autopilot && launchctl bootstrap gui/$UID ~/Library/LaunchAgents/com.gbrain.autopilot.plist

A doctor / heal step that detects the legacy plist and rewrites it
in place would be a nice follow-up. Happy to add it if you'd like.
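
As a rough idea of what such a follow-up could look like, here is a hypothetical sketch of the text rewrite only. rewriteLegacyPlist is not a gbrain function, and the sample plist is a trimmed stand-in rather than the generated template; the sketch performs the same swap as the manual step above and nothing more.

```typescript
// Hypothetical sketch of the plist rewrite a doctor/heal step could perform.
// rewriteLegacyPlist is not a gbrain function; it only does the text swap
// described in the manual migration step.

const LEGACY_KEEPALIVE = /<key>KeepAlive<\/key>\s*<true\s*\/>/;

/** Returns the rewritten plist XML, or null if no legacy KeepAlive entry is found. */
function rewriteLegacyPlist(xml: string, intervalSeconds: number = 300): string | null {
  if (!LEGACY_KEEPALIVE.test(xml)) return null;
  return xml.replace(
    LEGACY_KEEPALIVE,
    `<key>StartInterval</key><integer>${intervalSeconds}</integer>`,
  );
}

// Trimmed stand-in for a deployed legacy plist body.
const legacy = [
  "<dict>",
  "  <key>Label</key><string>com.gbrain.autopilot</string>",
  "  <key>KeepAlive</key>",
  "  <true/>",
  "</dict>",
].join("\n");

console.log(rewriteLegacyPlist(legacy));
```

The sketch only handles the boolean `<true/>` form of KeepAlive and returns null once the plist has been migrated, so running it twice is harmless. A real step would still need to reload the agent with launchctl afterwards, as in the manual instructions.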

Out of scope

- pgvector index tuning (HNSW vs. IVFFLAT): separate concern
- Search/query path: searchVector() already projects correctly,
  no change needed there
- Per-row JSON-vs-binary encoding: postgres-js handles this; the
  fix is to not fetch the column at all


…PageSlugs

Two related changes that lay the groundwork for fixing a hot-loop
egress regression in `embed --stale`:

1. `getChunks(slug)` previously did `SELECT cc.*` which includes the
   1536-dim `embedding` vector column. `rowToChunk()` already defaults
   to `includeEmbedding=false` and discards the column on parse, so the
   bytes were pulled across the wire only to be thrown away. Switched
   to an explicit projection that excludes `embedding`. Callers that
   actually need the vector (re-rank, similarity) already have a
   dedicated `getChunksWithEmbeddings()` method.

2. New `listStalePageSlugs()` returns the slugs of pages with at least
   one chunk where `embedding IS NULL`. Used by the next commit's
   embed --stale fast-path to avoid iterating every page in the brain
   when nothing is stale.

Both engines updated for parity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`gbrain embed --stale` previously iterated every page in the brain and
called `getChunks(slug)` on each one before filtering chunks where
`embedded_at IS NULL`. On a steady-state brain (everything already
embedded) this is wasted work: every cycle does N getChunks round-trips
just to discover there's nothing to do.

In production this manifested as 3 TB/month of Postgres egress on a
fully-embedded brain (~682 pages) when the daemon polled rapidly. The
autopilot plist's KeepAlive=true (separate fix) was the trigger; this
is the underlying multiplier.

Fix: when `staleOnly=true`, query `listStalePageSlugs()` first. If
empty, return immediately. If non-empty, iterate only those pages, not
every page in the brain.

Behavior changes intentionally:

- `pages_processed` and `total_chunks` in `--stale` mode now reflect
  the filtered (stale-only) set, not the entire brain. Test updated to
  assert the new semantics.

The non-`--stale` path (`--all`) is unchanged.

Combined with the `getChunks` projection fix in the previous commit,
egress per cycle drops from ~100 MB to a few hundred bytes when the
brain is fully embedded. Verified on a 682-page brain: cycle time 0.9s,
log shows "Embedded 0 chunks across 0 pages" instead of "Embedded 0
chunks across 682 pages".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…=true

The launchd plist generated by `gbrain autopilot --install` set
KeepAlive=true with no internal sleep in the wrapper script. launchd
restarts the wrapper as soon as it exits, so each cycle runs
back-to-back with effectively no delay.

On one machine this produced ~4200 cycles/day (~one cycle every 20s)
against a fully-embedded brain. Combined with two query-side bugs
(fixed in the prior two commits), it drove a 3 TB/month Postgres egress
overage.

Switching to StartInterval=300 caps the cadence at 288 cycles/day (one
cycle per 5 minutes) regardless of how fast a single cycle exits. This
is the correct launchd primitive for "run periodically on a schedule";
KeepAlive is for "respawn if the process dies", which the wrapper
isn't. Tunable per-user by editing the generated plist directly.

Existing user installs need to either re-run `gbrain autopilot
--install` (which regenerates the plist) or hand-edit the deployed
plist to swap KeepAlive for StartInterval.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

garrytan · 2026-06-08T03:00:00Z

Thanks for this contribution — and apologies for the slow triage. We did a full pass over the entire PR backlog. gbrain has moved fast, and the maintainer's larger "cathedral" rewrites have superseded a big share of community PRs: the AI gateway + recipes + user_provided_models system replaced almost all individual provider PRs; #1805 fixed the whole Postgres module-singleton class; #1542 unified the type taxonomy; #1657 the retrieval path; #1802 the doctor; and so on.

We're closing this one in that cleanup — either the fix already landed on master, it duplicates another PR or merged change, or it's outside the current merge bar. Where a closed PR carried a genuinely valuable idea, we've recorded it in docs/designs/COMMUNITY_IDEAS.md so nothing good is lost (a few may graduate into TODOs).

Please don't read the close as a judgment of the work — thank you for contributing. If you believe the underlying issue is still live on the latest master, reopen with a quick note and we'll take another look. 🙏

kyledeanjackson and others added 3 commits May 9, 2026 13:16

100yenadmin mentioned this pull request May 12, 2026

bug: embed --stale --source should scope stale chunk queries by source_id #929

Open

kyledeanjackson mentioned this pull request May 14, 2026

sync --source X attributes pages to source_id='default' instead of X #978

Open

garrytan closed this Jun 8, 2026

kyledeanjackson wants to merge 3 commits into garrytan:master from kyledeanjackson:fix/embed-stale-egress
