fix(embed): server-side staleness filter — fixes 270x egress amplification #409

Closed

atrevino47 wants to merge 1 commit into garrytan:master from atrevino47:fix/embed-stale-egress-sql-side-filter

Conversation

@atrevino47

Problem

gbrain embed --stale (used by autopilot every cycle and recommended by gbrain doctor) walks every page in the brain even when 0 chunks need embedding. On a 1.5K-page / 8K-chunk Supabase-backed brain that is 100% embedded, a single invocation pulls ~76 MB from Postgres (the pages table + every chunk's vector(1536) embedding column) and discards all of it after a client-side .filter(c => !c.embedded_at). With autopilot firing this every 5–10 minutes, plus a 2-hour cron also running embed --stale, this compounds to 20–80 GB/day of pure-waste egress. We hit Supabase's 5 GB free-tier bandwidth ceiling at 102 GB used (2058% over) twice in one week (2026-04-19 and again on 2026-04-24).

Relevant lines in master (v0.20.4, commit 11abb24):

  • src/commands/embed.ts:223 — `const pages = await engine.listPages({ limit: 100000 });` — pulls every page row
  • src/commands/embed.ts:237 — `const chunks = await engine.getChunks(page.slug);` — pulls every chunk, including the 1536-dim embedding vector
  • src/commands/embed.ts:238-240 — `staleOnly ? chunks.filter(c => !c.embedded_at) : chunks` — the filter runs client-side, after the bytes have already crossed the wire

rowToChunk (src/core/utils.ts) defaults to includeEmbedding=false, so the embedding bytes are received over the wire and then immediately dropped in TypeScript. Pure waste.
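For concreteness, the pre-fix flow reduces to roughly the following (a paraphrase of the lines cited above, not a verbatim copy of src/commands/embed.ts; BrainEngine is the project's engine interface):

```ts
// Paraphrased sketch of the v0.20.4 embedAll path for `--stale`.
async function embedAllOld(engine: BrainEngine, staleOnly: boolean) {
  // Every page row crosses the wire, even when nothing is stale.
  const pages = await engine.listPages({ limit: 100000 });
  for (const page of pages) {
    // Every chunk crosses the wire too, including the vector(1536) embedding (~8 KB/row)...
    const chunks = await engine.getChunks(page.slug);
    // ...and only then is the staleness filter applied, client-side.
    const toEmbed = staleOnly ? chunks.filter((c) => !c.embedded_at) : chunks;
    if (toEmbed.length === 0) continue; // common case: everything above was wasted egress
    // embed toEmbed and write back via upsertChunks...
  }
}
```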

Fix

Two new BrainEngine methods, identical implementations on Postgres + PGLite:

  • countStaleChunks(): Promise<number> — a single `SELECT count(*) FROM content_chunks WHERE embedding IS NULL`. ~50 bytes on the wire. Pre-flight short-circuit for embed --stale.
  • listStaleChunks(): Promise<StaleChunkRow[]> — `SELECT slug, chunk_index, chunk_text, chunk_source, model, token_count FROM content_chunks JOIN pages WHERE embedding IS NULL ORDER BY p.id, chunk_index LIMIT 100000`. Excludes the (always-NULL on stale rows) embedding column, so each row costs ~1.5 KB instead of ~8 KB. Both are sketched below.
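A minimal sketch of what the Postgres side amounts to, assuming the engines' existing postgres.js tagged-template client. The join key shown (pages.id = content_chunks.page_id) and the local StaleChunkRow shape are assumptions for illustration; the real query text in the PR may differ.

```ts
import type { Sql } from "postgres";

// Field list taken from the SELECT described above; the real StaleChunkRow type lives in the PR.
interface StaleChunkRow {
  slug: string; chunk_index: number; chunk_text: string;
  chunk_source: string; model: string; token_count: number;
}

async function countStaleChunks(sql: Sql): Promise<number> {
  // A single integer comes back: ~50 bytes on the wire.
  const [row] = await sql`SELECT count(*)::int AS n FROM content_chunks WHERE embedding IS NULL`;
  return row.n;
}

async function listStaleChunks(sql: Sql): Promise<StaleChunkRow[]> {
  // Stale rows only; the (always-NULL) embedding column is never selected.
  const rows = await sql`
    SELECT p.slug, c.chunk_index, c.chunk_text, c.chunk_source, c.model, c.token_count
    FROM content_chunks c
    JOIN pages p ON p.id = c.page_id      -- join key assumed for this sketch
    WHERE c.embedding IS NULL
    ORDER BY p.id, c.chunk_index
    LIMIT 100000`;
  return rows as unknown as StaleChunkRow[];
}
```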

embedAll is forked into two paths:

  • staleOnly === true → new embedAllStale using the SQL-side filter. Pre-flight count short-circuits when 0 stale.
  • staleOnly === false (i.e. --all) → existing behavior verbatim. The user explicitly asked for re-embed-everything.

Staleness predicate is embedding IS NULL, not embedded_at IS NULL, because the bulk-import path could previously leave embedded_at populated while embedding was NULL (see consistency fix below). embedding IS NULL is the truth source for "this chunk needs an embedding."
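Put together, the forked entry point behaves roughly like this (function and option names approximate the description above, not the PR's literal code):

```ts
// Sketch of the fork: --all keeps the old walk verbatim, --stale takes the SQL-side path.
async function embedAll(engine: BrainEngine, opts: { staleOnly: boolean }) {
  if (!opts.staleOnly) return embedAllFull(engine); // --all: existing behavior, untouched
  return embedAllStale(engine);
}

async function embedAllStale(engine: BrainEngine) {
  // Pre-flight short-circuit: the autopilot common case exits here for ~50 bytes.
  const staleCount = await engine.countStaleChunks();
  if (staleCount === 0) {
    console.log("Embedded 0 chunks (0 stale found)");
    return;
  }
  // Only stale rows, without the embedding column, cross the wire.
  const stale = await engine.listStaleChunks();
  // Group by slug, embed in batches, then merge-upsert per slug (see "Functional
  // contract preserved" below for why the merge matters).
}
```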

Wire-cost impact

| State | Before | After | Reduction |
| --- | --- | --- | --- |
| 0 stale chunks (autopilot common case) | ~76 MB | ~50 bytes | ~1,500,000× |
| 100 stale across 10 pages (post-rotation) | ~76 MB | ~150 KB | ~500× |
| 8K stale across 1.5K pages (cold start) | ~76 MB | ~12 MB | ~6× |

Functional contract preserved

The fix preserves the existing contract — every chunk where the embedding is missing still gets embedded; nothing is skipped; non-stale chunks on partially-stale pages are NOT lost. Code-flow proof:

  1. embed --stale's only side effect is "for every chunk where the embedding is NULL, call embedBatch and write back via upsertChunks with the new embedding (and embedded_at = now())". Verified by reading src/commands/embed.ts:216-308.
  2. New chunks for new pages are never introduced by embedAll. They're added by importFromContent (src/core/import-file.ts:99-154) which chunks AND embeds inline before the DB transaction commits — verified by tracing gbrain sync, gbrain import, and MCP put_page paths. embed --stale's only role is the defensive sweep for chunks that lack an embedding.
  3. The new path preserves non-stale chunks on partially-stale pages: it re-fetches the full chunk set for each stale slug and passes it to upsertChunks, with embedding: undefined for non-stale chunks (which the engine's existing COALESCE(EXCLUDED.embedding, content_chunks.embedding) clause leaves untouched).
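A sketch of the per-slug merge from point 3, assuming upsertChunks keeps its existing COALESCE semantics; the helper name and field shapes here are illustrative:

```ts
// Preserve non-stale chunks on a partially-stale page: re-fetch the full chunk set,
// attach the freshly computed embeddings, and leave `embedding` undefined everywhere
// else so COALESCE(EXCLUDED.embedding, content_chunks.embedding) keeps the stored vectors.
async function upsertSlugWithFreshEmbeddings(
  engine: BrainEngine,
  slug: string,
  fresh: Map<number, number[]>, // chunk_index -> newly computed 1536-dim embedding
) {
  const existing = await engine.getChunks(slug); // the one extra round-trip per stale slug
  const merged = existing.map((c) =>
    fresh.has(c.chunk_index)
      ? { ...c, embedding: fresh.get(c.chunk_index)!, embedded_at: new Date() }
      : { ...c, embedding: undefined }, // non-stale: the engine leaves the existing row untouched
  );
  await engine.upsertChunks(slug, merged);
}
```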

Bonus consistency fix in upsertChunks

While tracing the staleness predicate, I found a quiet bookkeeping bug in both postgres-engine.ts and pglite-engine.ts: when chunk_text changes and no new embedding is supplied, the existing CASE clause correctly clears embedding to NULL — but embedded_at was set via COALESCE(EXCLUDED.embedded_at, content_chunks.embedded_at), which kept the old timestamp. That left rows with embedding IS NULL but embedded_at IS NOT NULL, so any staleness predicate on embedded_at would silently miss them.

Fixed at write time: when text changes and no new embedding is supplied, both columns reset to NULL together. This is why the new staleness predicate uses embedding IS NULL (truth source today, even on databases written before this fix) and the consistency fix keeps both columns honest going forward.
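For reference, the intended write-time behavior looks roughly like the following conflict clause (illustrative SQL held in a TypeScript constant; the conflict key and exact clause layout in postgres-engine.ts / pglite-engine.ts are assumptions here):

```ts
// Both columns now reset together when chunk_text changed and no new embedding arrived.
const onConflictUpdate = `
  ON CONFLICT (slug, chunk_index) DO UPDATE SET
    chunk_text = EXCLUDED.chunk_text,
    embedding = CASE
      WHEN content_chunks.chunk_text IS DISTINCT FROM EXCLUDED.chunk_text
           AND EXCLUDED.embedding IS NULL THEN NULL
      ELSE COALESCE(EXCLUDED.embedding, content_chunks.embedding)
    END,
    -- previously: COALESCE(EXCLUDED.embedded_at, content_chunks.embedded_at), which kept
    -- the stale timestamp; now embedded_at follows the same CASE as embedding.
    embedded_at = CASE
      WHEN content_chunks.chunk_text IS DISTINCT FROM EXCLUDED.chunk_text
           AND EXCLUDED.embedding IS NULL THEN NULL
      ELSE COALESCE(EXCLUDED.embedded_at, content_chunks.embedded_at)
    END
`;
```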

Migration impact

None. No schema changes, no DB migration, no backfill, no config file changes. embedded_at and embedding have been on content_chunks since v1 of the schema (src/schema.sql:79-94); no version of gbrain in the wild lacks them.

Risk / things to look at in review

  • upsertChunks re-fetch on partially-stale pages: the new path re-fetches getChunks(slug) per slug so it can preserve non-stale chunks through the upsert. This costs one extra round-trip per stale slug. In a typical post-rotation case (10 stale slugs × ~5 KB of existing chunk rows, fetched without embeddings, ≈ 50 KB total) this is negligible next to the savings. It could be eliminated in a follow-up by adding a partialUpdate: boolean flag to upsertChunks that skips the implicit DELETE step.
  • Behavioral change for --stale callers when count==0: the previous code emitted "Embedded 0 chunks across N pages"; the new code emits "Embedded 0 chunks (0 stale found)". Searching the repo confirms only test/embed.test.ts matched the old wording; updated.
  • --all path is byte-identical to before — explicit regression test added.
  • listStaleChunks LIMIT 100000: identical to the existing listPages({ limit: 100000 }) cap. Brains with >100K stale chunks (extremely rare; would only happen after a mass rotation or a restore without embeddings) won't be fully drained in a single call but will converge over multiple runs, since each run embeds and clears up to 100K. A pagination parameter can be added later if needed.

Tests

Added to test/embed.test.ts (4 new):

  • zero stale chunks: countStaleChunks short-circuits, listPages never called — proves the egress fix.
  • N stale chunks across M pages: only stale slugs re-fetched, exact stale set embedded, non-stale chunks preserved — proves correctness + the merge logic doesn't wipe fresh chunks (this is the regression risk).
  • --stale dry-run: counts stale via countStaleChunks, reports via listStaleChunks, no embedBatch or upsertChunks.
  • --all (non-stale) path is byte-identical: walks listPages and embeds every chunk — regression guard for backwards compat.

Updated existing --stale tests to mock countStaleChunks + listStaleChunks for the new path. Test counts: 11 pass / 0 fail in test/embed.test.ts. Full suite: 2422 pass; one pre-existing flake in test/e2e/minions-shell-pglite.test.ts (mock module ordering issue) confirmed present on stock master before this PR.
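As a rough illustration of the first new case (the repo runs its suite with bun test; the mock and assertion shapes below are guesses at the test's structure, not its actual code — embedAll here is the fork sketched earlier):

```ts
import { test, expect, mock } from "bun:test";

test("embed --stale short-circuits when nothing is stale", async () => {
  const engine = {
    countStaleChunks: mock(async () => 0),
    listStaleChunks: mock(async () => []),
    // The egress fix in one assertion: the zero-stale path must never walk pages.
    listPages: mock(async () => { throw new Error("listPages must not be called"); }),
  };
  await embedAll(engine as any, { staleOnly: true });
  expect(engine.countStaleChunks).toHaveBeenCalledTimes(1);
  expect(engine.listPages).not.toHaveBeenCalled();
});
```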

Reference

Empirical incident: Supabase 102 GB used / 5 GB free-tier ceiling = 2058% on 2026-04-24. Same bandwidth pattern as 2026-04-19. Three independent investigations (manual code read, GPT-via-Codex audit, clean-room subagent) converged on the same bug + same fix.

embed --stale walked listPages + per-page getChunks (incl. vector(1536)
embedding column) on every call, then client-side-filtered for chunks
where embedding was missing. On a 1.5K-page brain at 100% coverage, ~76 MB
pulled per call, all discarded. With autopilot firing every 5-10 min plus
a 2h cron, this hit Supabase's 5 GB free-tier ceiling at 102 GB used
(2058% over) twice in one week.

Two new BrainEngine methods replace the page walk with a SQL-side filter:
- countStaleChunks(): single SELECT count(*) WHERE embedding IS NULL.
  Pre-flight short-circuit; ~50 bytes wire when 0 stale.
- listStaleChunks(): slug + chunk_index + chunk_text + chunk_source +
  model + token_count for stale rows only. Excludes the (NULL) embedding
  column. Bounded by LIMIT 100000 mirroring listPages.

embedAll forks: staleOnly=true takes the new SQL-side path
(embedAllStale); staleOnly=false (--all) keeps existing behavior verbatim.

embedAllStale preserves non-stale chunks on partially-stale pages: it
re-fetches existing chunks per stale slug and merges (embedding=undefined
for non-stale → COALESCE preserves existing). Without the merge, the
upsertChunks != ALL filter would delete non-stale chunks. Re-fetch cost
is bounded by stale slug count; the autopilot common case (0 stale)
never reaches this path.

Predicate uses `embedding IS NULL`, not `embedded_at IS NULL`. The bulk-
import path could leave embedded_at populated while embedding was NULL
(see upsertChunks consistency fix below), so `embedding IS NULL` is the
truth source for "this chunk needs an embedding".

Also fixes the upsertChunks consistency bug in both engines: when
chunk_text changes and no new embedding is supplied, embedding correctly
clears to NULL but embedded_at kept its old timestamp. New behavior
resets BOTH columns together, keeping write-time honesty.

Wire-cost impact (measured against current behavior on a 1.5K-page brain):
- 0 stale chunks (autopilot common case): ~76 MB → ~50 bytes (~1.5M× reduction)
- 100 stale across 10 pages: ~76 MB → ~150 KB (~500× reduction)
- 8K stale across 1.5K pages (cold start): ~76 MB → ~12 MB (~6× reduction)

Tests: 4 new in test/embed.test.ts (zero-stale short-circuit; N-stale-
across-M-pages with non-stale preservation; --stale dry-run; --all path
byte-identical). Existing --stale tests updated for the new mock surface.

Migration impact: none. embedded_at and embedding columns have been on
content_chunks since schema inception.
garrytan added a commit that referenced this pull request Apr 26, 2026
…409) (#447)

* fix: propagate AbortSignal to runCycle + worker force-eviction safety net

Root cause: autopilot-cycle handler called runCycle() without passing
the job's AbortSignal. When the per-job timeout fired abort(), runCycle
never checked it and kept grinding through extract (54,605 pages).
The executeJob promise never resolved, inFlight never decremented, and
the worker thought it was at capacity forever — 98 jobs piled up waiting
with 0 active while a live worker sat idle.

Three-layer fix:

1. CycleOpts.signal: new optional AbortSignal field. runCycle checks it
   between every phase via checkAborted(). A timed-out cycle now bails
   after the current phase completes instead of running all 6 phases.

2. autopilot-cycle handler: passes job.signal to runCycle so the abort
   actually propagates.

3. Worker safety net: 30s after the abort fires, if the handler still
   hasn't resolved, force-evict from inFlight and mark as dead in DB.
   This is the last-resort escape hatch for any handler that ignores
   AbortSignal — the worker resumes claiming new jobs instead of
   wedging forever.
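A minimal sketch of the checkAborted pattern from point 1 above (the real CycleOpts type and phase list live in src/core/cycle.ts; the error text and helper layout here are illustrative):

```ts
// Bail between phases instead of grinding through all 6 after the job timeout fires.
function checkAborted(signal?: AbortSignal): void {
  if (signal?.aborted) throw new Error("cycle aborted between phases");
}

interface CycleOpts { signal?: AbortSignal /* ...existing fields... */ }

async function runCycle(opts: CycleOpts) {
  checkAborted(opts.signal);
  await runPhaseSync(opts);
  checkAborted(opts.signal); // a timed-out cycle exits after the current phase completes
  await runPhaseExtract(opts);
  // ...remaining phases, each preceded by checkAborted(opts.signal)
}
```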

Incident: 2026-04-24, 98 waiting / 0 active / worker alive but idle.
143 existing minions tests pass unchanged.

* test: abort signal propagation + worker recovery regression tests

16 new tests across 3 files covering the 2026-04-24 worker wedge:

test/minions.test.ts (6 new, 149 total):
  - handler receiving abort signal exits cleanly
  - handler ignoring abort still gets signal delivered
  - worker claims new jobs after timeout (no wedge) ← key regression
  - checkAborted pattern: undefined/non-aborted/aborted signals

test/cycle-abort.test.ts (7 new):
  - CycleOpts.signal type contract
  - runCycle accepts signal without error
  - runCycle bails on pre-aborted signal
  - runCycle bails mid-flight when signal fires between phases
  - Source-level guard: jobs.ts passes job.signal to runCycle
  - Source-level guard: worker.ts has force-eviction safety net
  - Source-level guard: cycle.ts has checkAborted between all 6 phases

test/e2e/worker-abort-recovery.test.ts (3 new):
  - worker recovers from timed-out handler and processes next job
  - concurrency=2 processes parallel jobs during timeout
  - multiple sequential timeouts don't permanently wedge worker

All 159 tests pass.

* perf: incremental extract — only process slugs that sync touched

The autopilot-cycle runs every 5 min. Its extract phase was doing a full
filesystem walk of ALL markdown files (54K+) — twice (links + timeline).
On a brain this size, extract alone exceeded the 600s job timeout,
producing zero useful writes.

Fix: sync already returns pagesAffected (the slugs it added/modified).
Pipe that list through to extract. When provided, extract reads ONLY
those files instead of walking the entire brain directory.

- Add ExtractOpts.slugs for targeted extraction
- Add extractForSlugs() — single-pass links + timeline for specific slugs
- cycle.ts: capture sync's pagesAffected, pass to runPhaseExtract
- If sync didn't run or failed, extract falls back to full walk (safe)
- If pagesAffected is empty (nothing changed), extract returns instantly

Expected improvement: 54K file reads → ~10-50 per cycle. The full walk
is still available via CLI `gbrain extract` and on first-run.

* fix: connection resilience for minion supervisor + worker

Three fixes for the minion supervisor dying silently when PgBouncer rotates:

1. PostgresEngine: executeRaw retries once on connection-class errors
   (ECONNREFUSED, password auth failed, connection terminated, etc.)
   by tearing down the poisoned pool and creating a fresh one via
   reconnect(). Prevents cascading failures when Supabase bounces.

2. Supervisor: tracks consecutive health check failures. After 3 in a
   row, emits health_warn with reason=db_connection_degraded and attempts
   engine.reconnect() if available. Resets counter on success.

3. Supervisor: worker_exited events now include likely_cause field:
   SIGKILL → oom_or_external_kill, SIGTERM → graceful_shutdown,
   code=1 → runtime_error. Makes it trivial to distinguish OOM kills
   from connection deaths in logs.

Tests: 23 new tests covering connection error detection, reconnect
guard against concurrent reconnects, retry-once-not-infinite-loop,
health failure tracking, and exit classification.

* fix(db): set session timeouts on every connection to kill orphan backends

Prevents the failure mode from #361: a single autopilot UPDATE on
minion_jobs can leave a pooler backend in state='active'/ClientRead
for 24h+, holding a RowExclusiveLock that blocks every subsequent
ALTER TABLE minion_jobs. The stuck backend never times out on its
own because Supabase Micro has no default idle_in_transaction_session_timeout
and autovacuum can't reap sessions that hold active locks.

Fix: deliver statement_timeout + idle_in_transaction_session_timeout
as startup parameters via postgres.js's `connection` option, applied
automatically on every new backend connection. Works correctly on
both session-mode and transaction-mode PgBouncer poolers (startup
params persist for the backend's lifetime, unlike SET commands
which transaction-mode PgBouncer strips between transactions).

Defaults chosen conservatively so they don't interfere with bulk
work like multi-minute embed passes or CREATE INDEX on large pages
tables:
  - statement_timeout: '5min'
  - idle_in_transaction_session_timeout: '2min'

Each overridable per-GUC via env var (GBRAIN_STATEMENT_TIMEOUT,
GBRAIN_IDLE_TX_TIMEOUT). Set any to '0' or 'off' to disable.
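A sketch of how startup-parameter delivery with postgres.js looks, using the defaults and env override names described above (resolveSessionTimeouts itself is paraphrased, not quoted, and the '0'/'off' disable handling is omitted):

```ts
import postgres from "postgres";

// GUCs sent as startup parameters persist for the backend's lifetime, even behind
// transaction-mode PgBouncer (unlike SET, which the pooler strips between transactions).
function resolveSessionTimeouts() {
  return {
    statement_timeout: process.env.GBRAIN_STATEMENT_TIMEOUT ?? "5min",
    idle_in_transaction_session_timeout: process.env.GBRAIN_IDLE_TX_TIMEOUT ?? "2min",
  };
}

const sql = postgres(process.env.DATABASE_URL!, {
  connection: resolveSessionTimeouts(), // postgres.js `connection` option sends startup params
});
```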

client_connection_check_interval is the specific GUC that would
kill the observed state='active'/ClientRead case, but it's
Postgres 14+ and some managed poolers reject unknown startup
parameters. Made it opt-in only via GBRAIN_CLIENT_CHECK_INTERVAL
for users who know their Postgres supports it.

Applied in both the module-level singleton connect (src/core/db.ts)
and the per-engine-instance pool used by `gbrain jobs work`
(src/core/postgres-engine.ts) via a shared resolveSessionTimeouts()
helper.

Tests: 5 new cases in migrate.test.ts covering defaults, env
overrides, '0'/'off' disable, and multi-GUC disable. 39/39 pass
(34 pre-existing + 5 new).

Closes #361.

Co-Authored-By: orendi84 <orendigergo@gmail.com>

* fix(embed): server-side staleness filter for embed --stale (v0.20.5)

embed --stale walked listPages + per-page getChunks (incl. vector(1536)
embedding column) on every call, then client-side-filtered for chunks
where embedding was missing. On a 1.5K-page brain at 100% coverage, ~76 MB
pulled per call, all discarded. With autopilot firing every 5-10 min plus
a 2h cron, this hit Supabase's 5 GB free-tier ceiling at 102 GB used
(2058% over) twice in one week.

Two new BrainEngine methods replace the page walk with a SQL-side filter:
- countStaleChunks(): single SELECT count(*) WHERE embedding IS NULL.
  Pre-flight short-circuit; ~50 bytes wire when 0 stale.
- listStaleChunks(): slug + chunk_index + chunk_text + chunk_source +
  model + token_count for stale rows only. Excludes the (NULL) embedding
  column. Bounded by LIMIT 100000 mirroring listPages.

embedAll forks: staleOnly=true takes the new SQL-side path
(embedAllStale); staleOnly=false (--all) keeps existing behavior verbatim.

embedAllStale preserves non-stale chunks on partially-stale pages: it
re-fetches existing chunks per stale slug and merges (embedding=undefined
for non-stale → COALESCE preserves existing). Without the merge, the
upsertChunks != ALL filter would delete non-stale chunks. Re-fetch cost
is bounded by stale slug count; the autopilot common case (0 stale)
never reaches this path.

Predicate uses `embedding IS NULL`, not `embedded_at IS NULL`. The bulk-
import path could leave embedded_at populated while embedding was NULL
(see upsertChunks consistency fix below), so `embedding IS NULL` is the
truth source for "this chunk needs an embedding".

Also fixes the upsertChunks consistency bug in both engines: when
chunk_text changes and no new embedding is supplied, embedding correctly
clears to NULL but embedded_at kept its old timestamp. New behavior
resets BOTH columns together, keeping write-time honesty.

Wire-cost impact (measured against current behavior on a 1.5K-page brain):
- 0 stale chunks (autopilot common case): ~76 MB → ~50 bytes (~1.5M× reduction)
- 100 stale across 10 pages: ~76 MB → ~150 KB (~500× reduction)
- 8K stale across 1.5K pages (cold start): ~76 MB → ~12 MB (~6× reduction)

Tests: 4 new in test/embed.test.ts (zero-stale short-circuit; N-stale-
across-M-pages with non-stale preservation; --stale dry-run; --all path
byte-identical). Existing --stale tests updated for the new mock surface.

Migration impact: none. embedded_at and embedding columns have been on
content_chunks since schema inception.

Co-Authored-By: atrevino47 <atbuster47@gmail.com>

* chore(wave): post-merge tightening — drop executeRaw retry (D3) + gate noExtract (F2)

- Drop #406's per-call executeRaw retry wrapper. The regex idempotence
  boundary is unsound (writable CTEs, side-effecting SELECTs). Recovery
  now happens at the supervisor level via 3-strikes-then-reconnect.
- Update db.ts: setSessionDefaults becomes a back-compat no-op.
  resolveSessionTimeouts (from #363) is the source of truth, sending
  GUCs as startup parameters that survive PgBouncer transaction mode.
  Bumped idle_in_transaction default from 2min to 5min to match v0.21.0
  posture.
- Gate noExtract in cycle's runPhaseSync on whether extract phase is
  scheduled. Avoids silently dropping extraction when the user runs
  `gbrain dream --phase sync` (Codex F2).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(db): rephrase docstring to avoid false-positive in test source-grep

The migrate.test.ts structural check counts `SET idle_in_transaction_session_timeout`
matches in source. The literal string in this docstring was tripping it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test: backfill regression guards for #417, D3, F2 (Step 5)

15 new test cases across 3 files, ~250 LOC, all PGLite/in-memory:

test/extract-incremental.test.ts (NEW, 8 cases for #417):
- slugs: [] returns immediately (early-return)
- slugs: undefined falls through to full-walk
- slugs: [a, b] reads only those files
- Slug whose file no longer exists is silently skipped
- Mode filter (links) skips timeline extraction
- dryRun: true does not invoke addLinksBatch / addTimelineEntriesBatch
- BATCH_SIZE flush — >100 candidate links exercise mid-iteration flush
- Full-slug-set resolution — link to file outside changed set still resolves

test/core/cycle.test.ts (4 new cases for #417 + Codex F2):
- cycle threads sync.pagesAffected into extract phase as the slugs argument
- extract phase falls back to full walk when sync was skipped
- F2 guard: full cycle (sync + extract) sets noExtract=true on sync
- F2 guard: phases:[sync] only sets noExtract=false (no silent extract drop)

test/connection-resilience.test.ts (3 new cases for D3):
- PostgresEngine.executeRaw is a single-statement passthrough (no try/catch)
- PostgresEngine.reconnect() still exists for supervisor-driven recovery
- Supervisor still has the 3-strikes-then-reconnect path

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(wave): v0.21.1 release notes + 3 follow-up TODOs + CLAUDE.md updates

CHANGELOG.md: segment-aware entry per CEO-review D1 — 'For everyone'
section (#417 incremental extract, #403 cycle abort) leads, 'For Postgres /
Supabase users' section (#406, #363, #409) follows. Production proof
point as a sidebar, not the lead.

TODOS.md: 3 follow-up items per Eng-review D6:
  1. Caller-opt-in retry for executeRaw (D3 follow-up)
  2. Replace walkMarkdownFiles with engine.getAllSlugs() (F1 follow-up)
  3. err.code-based connection-error matching (B1 follow-up)

CLAUDE.md: 6 file-reference updates for the wave's behavioral additions
(postgres-engine, db, cycle, worker, supervisor, embed, extract).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(release): bump version 0.21.1 → 0.22.1 + document version locations

User-explicit version override on /ship: ship as v0.22.1 (MINOR jump from
master's 0.21.0) instead of the v0.21.1 PATCH the wave originally targeted.
The wave bundles 5 production fixes which is meaningful enough to clear a
MINOR version, even though the API surface is additive.

Files updated to 0.22.1:
- VERSION (single source of truth)
- package.json (Bun/npm version)
- CHANGELOG.md (release header + "To take advantage of v0.22.1" block)
- TODOS.md (3 follow-up TODOs reference the version that filed them)
- CLAUDE.md (Key Files annotations cite the release that introduced behavior)

Also adds a "Version locations" section to CLAUDE.md documenting all five
required files plus the auto-derived (bun.lock, llms-full.txt) and
historical (skills/migrations/v*.md, src/commands/migrations/v*.ts,
test/migrations-v*.test.ts) categories. Future /ship runs and the
auto-update agent now have a canonical list of where versions live.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(test): unbreak CI typecheck — annotate signal as AbortSignal | undefined

CI's `bun run typecheck` step was failing with TS2339 at
test/minions.test.ts:2026 — `const signal = undefined` narrows to literal
`undefined`, which has no `.aborted` property, so `signal?.aborted`
doesn't compile.

Fix uses `as AbortSignal | undefined` to preserve the union type. A
plain type annotation gets narrowed back via control-flow analysis; the
`as` cast doesn't. Runtime behavior is unchanged — the optional-chain
still short-circuits as intended.

Verified: bunx tsc --noEmit → exit 0; the 3 checkAborted cases still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(doctor): forward-progress override for stale minions partials

The minions_migration check reads ~/.gbrain/migrations/completed.jsonl
and flags any version that has a `partial` entry without a matching
`complete`. Long-lived installs accumulate partial records from
historical stopgap runs (notably v0.11.0). Without time decay or
forward-progress detection, the FAIL flag fires forever once any
partial lands, even on installs that have been running clean at
v0.22+ for months.

Concrete failure: test/e2e/mechanical.test.ts "gbrain doctor exits 0
on healthy DB" was flaking on dev machines whose ~/.gbrain/ carried
v0.11.0 partials from earlier in the day. The fresh test DB had
nothing wrong with it; doctor was just reading host filesystem state
that bled in via $HOME.

Fix: a partial vX.Y.Z is treated as stale (not stuck) if any vA.B.C
where A.B.C >= X.Y.Z has a `complete` entry anywhere in the file.
The reasoning: if a newer migration successfully landed, the install
has clearly moved past the older partial. compareVersions() from
src/commands/migrations/index.ts handles the semver compare.
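The override in sketch form (compareVersions is the real helper named above; the function name and record shape here are assumptions):

```ts
interface MigrationRecord { version: string; kind: "partial" | "complete" } // shape assumed

// A partial vX.Y.Z is stale (not stuck) if any version >= X.Y.Z has a `complete`
// entry anywhere in completed.jsonl: the install has clearly moved past it.
function isStalePartial(partialVersion: string, records: MigrationRecord[]): boolean {
  return records.some(
    (r) => r.kind === "complete" && compareVersions(r.version, partialVersion) >= 0,
  );
}
```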

Cases preserved:
- v0.10 complete + v0.11 partial → still FAILs (older complete doesn't
  supersede newer partial)
- v0.16 partial alone → still FAILs (no override exists)
- Fresh install (no completed.jsonl) → no warning
- Real partial-then-complete-same-version → no warning

Cases now fixed:
- v0.16 complete + v0.11 partial → no FAIL (forward progress made;
  the v0.11 record is stale)

Two regression tests in test/doctor-minions-check.test.ts cover both
directions of the override (when it fires, when it doesn't).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(docs): regenerate llms-full.txt after CLAUDE.md updates

CI's build-llms regen-drift guard caught that llms-full.txt was stale
relative to CLAUDE.md after the wave's documentation commits (the
"Version locations" section + 6 file-reference annotations for the
wave's behavioral additions).

CLAUDE.md notes that llms-full.txt is auto-derived — bumped via
'bun run build:llms' when CLAUDE.md's file-references change. This
commit catches up.

llms.txt is unchanged; the curated index doesn't pull from CLAUDE.md's
file-reference body. Only llms-full.txt (the inlined single-fetch
bundle) needed regeneration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: root <root@localhost>
Co-authored-by: orendi84 <orendigergo@gmail.com>
Co-authored-by: atrevino47 <atbuster47@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
garrytan added a commit that referenced this pull request Apr 26, 2026
Master shipped v0.22.1 with 5 prod hotfixes (PRs #417/#403/#406/#363/#409)
while this branch was open. Merging cleanly:

Conflicts resolved:
- VERSION: kept 0.22.2 (this branch's slot, master is now 0.22.1)
- package.json: kept 0.22.2
- CHANGELOG.md: v0.22.2 entry on top, master's v0.22.1 + earlier entries below.
  Also stripped a stray "=======" leftover from the prior merge resolution.
- test/minions.test.ts: kept both blocks — my v0.22.2 watchdog + connectWithRetry
  describes (11 cases) AND master's v0.20.5 abort-signal-propagation describe
  (added by PR #403 cycle-abort).

Auto-merged cleanly:
- src/core/minions/worker.ts: my watchdog (checkMemoryLimit, gracefulShutdown,
  periodic timer, jobsCompleted, gracefulShutdownFired) coexists with master's
  AbortSignal cycle-abort plumbing (PR #403). Different code paths.
- src/core/minions/supervisor.ts: my maxRssMb default (2048) + spawn arg
  injection coexists with master's consecutiveHealthFailures + engine.reconnect
  (PR #406). Different layers (boot-time vs runtime).
- src/core/db.ts: my connectWithRetry + isRetryableDbConnectError coexists with
  master's resolveSessionTimeouts + setSessionDefaults shim (PR #363).
  Different concerns (connect-retry vs session-GUC delivery).
- src/commands/jobs.ts: my parseMaxRssFlag + work/supervisor flag plumbing
  coexists with master's jobs-list/extract changes.
- src/commands/autopilot.ts: my maxWaiting:1 + stable-run reset coexists with
  master's incremental-extract changes (PR #417).

PR #406's reconnect is at health-check level (engine.reconnect after 3
consecutive failures); my connectWithRetry is at boot/cold-start level.
Complementary, not duplicative — both layers ship.

Verification:
- bun build: clean, gbrain 0.22.2
- bun test test/minions.test.ts test/supervisor.test.ts: 174/174 pass
  (was 168 pre-merge; +6 from master's new abort-signal cases)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@garrytan
Owner

Closing — Server-side stale filter shipped in v0.22.1 (credited to @atrevino47).

Thanks for the report. If anything still reproduces on the latest release, please reopen with the version + repro.

