perf: incremental extract — only process slugs that sync touched by garrytan · Pull Request #417 · garrytan/gbrain

garrytan · 2026-04-25T13:41:25Z

Problem

On a brain with 54K+ pages, the autopilot-cycle extract phase is a performance cliff. Every 5-minute cycle:

Walks the entire brain directory — walkMarkdownFiles() does a recursive readdirSync + readFileSync on every .md file
Does it twice — once for link extraction (extract.links_fs), once for timeline extraction (extract.timeline_fs)
Writes zero rows most cycles — the DB uses ON CONFLICT (upsert), so unchanged files produce no new data

Observed impact in production (54,461 pages):

Extract phase alone consumed 600s+ per cycle (the full job timeout)
Jobs regularly died with timeout exceeded errors
Dead autopilot-cycle jobs held the Postgres cycle lock for their full 30-min TTL
All subsequent cycles returned skipped: cycle_already_running
Meanwhile, the cron submitted a new autopilot-cycle every 5 minutes → queue backlog grew to 20-35 waiting jobs
Shell jobs (enrichment, data pipelines, etc.) starved because autopilot-cycle monopolized worker slots

Timeline of incidents:

Queue depth hit 36 waiting + 3 stalled during overnight monitoring
Worker had to be force-killed (SIGKILL — SIGTERM did not work because the zombie extract handler held the event loop)
After killing the worker, the Postgres cycle lock persisted for 30 more minutes, blocking all cycles
Pattern repeated across multiple cycles over 24h+

Root cause: runPhaseExtract() in cycle.ts calls runExtractCore(engine, { mode: 'all', dir: brainDir }) which calls extractLinksFromDir() + extractTimelineFromDir() — both invoke walkMarkdownFiles(brainDir) which does a full recursive directory walk + readFileSync on every .md file. The sync phase already returns pagesAffected: string[] but extract ignores it.

Error Log

Full error output from dead autopilot-cycle jobs:

Job #1286: autopilot-cycle (DEAD after 0 attempts)
  Queue: default | Priority: 0
  Attempts: 0/2 (started: 1)
  Started: 2026-04-24T22:20:03.661Z
  Finished: 2026-04-24T22:30:15.149Z
  Error: timeout exceeded
  Data: {"repoPath":"/data/brain"}

Job #1353: autopilot-cycle (DEAD after 0 attempts)
  Started: 2026-04-25T01:25:06.699Z
  Finished: 2026-04-25T01:35:27.521Z
  Error: timeout exceeded

Job #1365: autopilot-cycle (DEAD after 0 attempts)
  Started: 2026-04-25T01:40:03.305Z
  Finished: 2026-04-25T01:50:27.542Z
  Error: timeout exceeded

Worker log showing the extract phase grinding through all 54K pages before timing out:

[extract.timeline_fs] 45645/54461 (83%)
[extract.timeline_fs] 46595/54461 (85%)
[extract.timeline_fs] 47858/54461 (87%)
[extract.timeline_fs] 49498/54461 (90%)
[extract.timeline_fs] 51889/54461 (95%)
[extract.timeline_fs] 53918/54461 (99%)
[extract.timeline_fs] 54461/54461 (100%) done
Timeline: created 0 entries from 54461 pages
[cycle.extract] done
[cycle.embed] start
Job 1290 (autopilot-cycle) hit per-job timeout (600000ms), aborting
Job 1290 (autopilot-cycle) did not exit within 30s of abort. Force-evicting
from inFlight to unblock worker. The handler is still running but the worker
will claim new jobs.

Queue state showing the resulting backlog:

Job Stats (last 24h):
  Type           Total   Done    Failed   Dead   Avg Time
  shell          73      59      0        12     542.6s
  autopilot-cycle 288     173     0        3      131.3s

  Queue health: 36 waiting, 3 active, 3 stalled

What We Tried

Killed the stuck worker (SIGTERM) — did not work. The zombie extract handler held the event loop. Required SIGKILL.
Cancelled stalled + duplicate queued jobs — cleared the backlog (20 duplicate autopilot-cycle jobs cancelled, 925 old completed jobs pruned). Queue went to 0 waiting. But the root cause remained: the next cycle would walk 54K files again.
Manually released the Postgres cycle lock — the killed worker left a live lock row with 30-min TTL. Had to DELETE FROM gbrain_cycle_locks to unblock new cycles. This confirmed the lock-leak pattern.
Restarted worker using jobs supervisor (auto-crash-restart instead of bare jobs work) — improved resilience but didn't fix the performance problem. Supervisor successfully restarted after the SIGKILL, but the next autopilot-cycle still took 1106.4s on the first run (full walk).
Implemented incremental extract (this PR) — piped sync's pagesAffected to extract. Immediate improvement: 0.2s per cycle instead of 600s+. The 54K walk only happens on first run or CLI invocation.

Hotfix: Deployed directly to the production instance for immediate relief. This PR is the clean version of the same change for merging to master.

Solution

Pipe the sync phase's pagesAffected list into the extract phase. When provided, extract reads only those specific files instead of walking the entire brain.

Changes

src/commands/extract.ts:

Add slugs?: string[] to ExtractOpts for incremental mode
Add extractForSlugs() — processes only the specified slugs in a single pass (combined links + timeline), vs the full-walk path which reads every file twice
When slugs is an empty array (nothing changed), return immediately with zero counts
When slugs is undefined (CLI usage, first run), fall back to the existing full walk — no behavior change for gbrain extract CLI
Still builds the full slug set via walkMarkdownFiles() for link resolution (resolveSlug needs all valid targets), but this is a single readdir traversal, not 54K readFileSync calls

src/core/cycle.ts:

runPhaseSync now returns pagesAffected from the sync result (via SyncPhaseResult interface extension)
runCycle captures syncPagesAffected after Phase 3 and passes it to Phase 4
runPhaseExtract accepts optional changedSlugs parameter
If sync didn't run or failed, syncPagesAffected is undefined → extract falls back to full walk (safe default)
Extract phase summary now reports (incremental: N slugs) for observability

Behavior matrix

Scenario	Extract behavior
Normal autopilot-cycle (sync ran, 3 pages changed)	Reads 3 files, skips 54K
Normal autopilot-cycle (sync ran, nothing changed)	Returns immediately, 0 file reads
Autopilot-cycle (sync phase not in phase list)	Full walk (syncPagesAffected is undefined)
Autopilot-cycle (sync failed)	Full walk (safe fallback)
CLI `gbrain extract all`	Full walk (slugs not passed)
CLI `gbrain extract --source db`	Unchanged (DB path, not FS)

Results

Before and after on a 54,461-page brain:

Metric	Before	After
Extract phase duration	600s+ (often timeout)	0.2s (typical cycle)
Files read per cycle	54,461 × 2 = 108,922	0-10 (only changed)
Autopilot-cycle total	10-18 minutes	< 1 second
Queue depth (observed)	20-36 waiting	0 waiting
Dead jobs from timeout	3-5 per day	0

Job log comparison:

# Before (full walk): 1106.4 seconds
Job #1312: autopilot-cycle (COMPLETED) Time: 1106.4s

# After (incremental): 0.2 seconds
Job #1316: autopilot-cycle (COMPLETED) Time: 0.2s

Incremental extract in action (3 files changed by sync):

[cycle.sync] done
[cycle.extract] start
[extract.incremental] start
[extract.incremental] 3/3 (100%)
[extract.incremental] 3/3 (100%) done
Incremental extract: created 0 link(s), 0 timeline entries from 3/3 page(s)
[cycle.extract] done

Testing

bun run typecheck — clean
bun test -- extract — 124 tests pass, 0 failures
Production hotfix deployed and running stable for 12+ hours

The autopilot-cycle runs every 5 min. Its extract phase was doing a full filesystem walk of ALL markdown files (54K+) — twice (links + timeline). On a brain this size, extract alone exceeded the 600s job timeout, producing zero useful writes. Fix: sync already returns pagesAffected (the slugs it added/modified). Pipe that list through to extract. When provided, extract reads ONLY those files instead of walking the entire brain directory. - Add ExtractOpts.slugs for targeted extraction - Add extractForSlugs() — single-pass links + timeline for specific slugs - cycle.ts: capture sync's pagesAffected, pass to runPhaseExtract - If sync didn't run or failed, extract falls back to full walk (safe) - If pagesAffected is empty (nothing changed), extract returns instantly Expected improvement: 54K file reads → ~10-50 per cycle. The full walk is still available via CLI `gbrain extract` and on first-run.

…409) (#447) * fix: propagate AbortSignal to runCycle + worker force-eviction safety net Root cause: autopilot-cycle handler called runCycle() without passing the job's AbortSignal. When the per-job timeout fired abort(), runCycle never checked it and kept grinding through extract (54,605 pages). The executeJob promise never resolved, inFlight never decremented, and the worker thought it was at capacity forever — 98 jobs piled up waiting with 0 active while a live worker sat idle. Three-layer fix: 1. CycleOpts.signal: new optional AbortSignal field. runCycle checks it between every phase via checkAborted(). A timed-out cycle now bails after the current phase completes instead of running all 6 phases. 2. autopilot-cycle handler: passes job.signal to runCycle so the abort actually propagates. 3. Worker safety net: 30s after the abort fires, if the handler still hasn't resolved, force-evict from inFlight and mark as dead in DB. This is the last-resort escape hatch for any handler that ignores AbortSignal — the worker resumes claiming new jobs instead of wedging forever. Incident: 2026-04-24, 98 waiting / 0 active / worker alive but idle. 143 existing minions tests pass unchanged. * test: abort signal propagation + worker recovery regression tests 16 new tests across 3 files covering the 2026-04-24 worker wedge: test/minions.test.ts (6 new, 149 total): - handler receiving abort signal exits cleanly - handler ignoring abort still gets signal delivered - worker claims new jobs after timeout (no wedge) ← key regression - checkAborted pattern: undefined/non-aborted/aborted signals test/cycle-abort.test.ts (7 new): - CycleOpts.signal type contract - runCycle accepts signal without error - runCycle bails on pre-aborted signal - runCycle bails mid-flight when signal fires between phases - Source-level guard: jobs.ts passes job.signal to runCycle - Source-level guard: worker.ts has force-eviction safety net - Source-level guard: cycle.ts has checkAborted between all 6 phases test/e2e/worker-abort-recovery.test.ts (3 new): - worker recovers from timed-out handler and processes next job - concurrency=2 processes parallel jobs during timeout - multiple sequential timeouts don't permanently wedge worker All 159 tests pass. * perf: incremental extract — only process slugs that sync touched The autopilot-cycle runs every 5 min. Its extract phase was doing a full filesystem walk of ALL markdown files (54K+) — twice (links + timeline). On a brain this size, extract alone exceeded the 600s job timeout, producing zero useful writes. Fix: sync already returns pagesAffected (the slugs it added/modified). Pipe that list through to extract. When provided, extract reads ONLY those files instead of walking the entire brain directory. - Add ExtractOpts.slugs for targeted extraction - Add extractForSlugs() — single-pass links + timeline for specific slugs - cycle.ts: capture sync's pagesAffected, pass to runPhaseExtract - If sync didn't run or failed, extract falls back to full walk (safe) - If pagesAffected is empty (nothing changed), extract returns instantly Expected improvement: 54K file reads → ~10-50 per cycle. The full walk is still available via CLI `gbrain extract` and on first-run. * fix: connection resilience for minion supervisor + worker Three fixes for the minion supervisor dying silently when PgBouncer rotates: 1. PostgresEngine: executeRaw retries once on connection-class errors (ECONNREFUSED, password auth failed, connection terminated, etc.) by tearing down the poisoned pool and creating a fresh one via reconnect(). Prevents cascading failures when Supabase bounces. 2. Supervisor: tracks consecutive health check failures. After 3 in a row, emits health_warn with reason=db_connection_degraded and attempts engine.reconnect() if available. Resets counter on success. 3. Supervisor: worker_exited events now include likely_cause field: SIGKILL → oom_or_external_kill, SIGTERM → graceful_shutdown, code=1 → runtime_error. Makes it trivial to distinguish OOM kills from connection deaths in logs. Tests: 23 new tests covering connection error detection, reconnect guard against concurrent reconnects, retry-once-not-infinite-loop, health failure tracking, and exit classification. * fix(db): set session timeouts on every connection to kill orphan backends Prevents the failure mode from #361: a single autopilot UPDATE on minion_jobs can leave a pooler backend in state='active'/ClientRead for 24h+, holding a RowExclusiveLock that blocks every subsequent ALTER TABLE minion_jobs. The stuck backend never times out on its own because Supabase Micro has no default idle_in_transaction_session_timeout and autovacuum can't reap sessions that hold active locks. Fix: deliver statement_timeout + idle_in_transaction_session_timeout as startup parameters via postgres.js's `connection` option, applied automatically on every new backend connection. Works correctly on both session-mode and transaction-mode PgBouncer poolers (startup params persist for the backend's lifetime, unlike SET commands which transaction-mode PgBouncer strips between transactions). Defaults chosen conservatively so they don't interfere with bulk work like multi-minute embed passes or CREATE INDEX on large pages tables: - statement_timeout: '5min' - idle_in_transaction_session_timeout: '2min' Each overridable per-GUC via env var (GBRAIN_STATEMENT_TIMEOUT, GBRAIN_IDLE_TX_TIMEOUT). Set any to '0' or 'off' to disable. client_connection_check_interval is the specific GUC that would kill the observed state='active'/ClientRead case, but it's Postgres 14+ and some managed poolers reject unknown startup parameters. Made it opt-in only via GBRAIN_CLIENT_CHECK_INTERVAL for users who know their Postgres supports it. Applied in both the module-level singleton connect (src/core/db.ts) and the per-engine-instance pool used by `gbrain jobs work` (src/core/postgres-engine.ts) via a shared resolveSessionTimeouts() helper. Tests: 5 new cases in migrate.test.ts covering defaults, env overrides, '0'/'off' disable, and multi-GUC disable. 39/39 pass (34 pre-existing + 5 new). Closes #361. Co-Authored-By: orendi84 <orendigergo@gmail.com> * fix(embed): server-side staleness filter for embed --stale (v0.20.5) embed --stale walked listPages + per-page getChunks (incl. vector(1536) embedding column) on every call, then client-side-filtered for chunks where embedding was missing. On a 1.5K-page brain at 100% coverage, ~76 MB pulled per call, all discarded. With autopilot firing every 5-10 min plus a 2h cron, this hit Supabase's 5 GB free-tier ceiling at 102 GB used (2058% over) twice in one week. Two new BrainEngine methods replace the page walk with a SQL-side filter: - countStaleChunks(): single SELECT count(*) WHERE embedding IS NULL. Pre-flight short-circuit; ~50 bytes wire when 0 stale. - listStaleChunks(): slug + chunk_index + chunk_text + chunk_source + model + token_count for stale rows only. Excludes the (NULL) embedding column. Bounded by LIMIT 100000 mirroring listPages. embedAll forks: staleOnly=true takes the new SQL-side path (embedAllStale); staleOnly=false (--all) keeps existing behavior verbatim. embedAllStale preserves non-stale chunks on partially-stale pages: it re-fetches existing chunks per stale slug and merges (embedding=undefined for non-stale → COALESCE preserves existing). Without the merge, the upsertChunks != ALL filter would delete non-stale chunks. Re-fetch cost is bounded by stale slug count; the autopilot common case (0 stale) never reaches this path. Predicate uses `embedding IS NULL`, not `embedded_at IS NULL`. The bulk- import path could leave embedded_at populated while embedding was NULL (see upsertChunks consistency fix below), so `embedding IS NULL` is the truth source for "this chunk needs an embedding". Also fixes the upsertChunks consistency bug in both engines: when chunk_text changes and no new embedding is supplied, embedding correctly clears to NULL but embedded_at kept its old timestamp. New behavior resets BOTH columns together, keeping write-time honesty. Wire-cost impact (measured against current behavior on a 1.5K-page brain): - 0 stale chunks (autopilot common case): ~76 MB → ~50 bytes (~1.5M× reduction) - 100 stale across 10 pages: ~76 MB → ~150 KB (~500× reduction) - 8K stale across 1.5K pages (cold start): ~76 MB → ~12 MB (~6× reduction) Tests: 4 new in test/embed.test.ts (zero-stale short-circuit; N-stale- across-M-pages with non-stale preservation; --stale dry-run; --all path byte-identical). Existing --stale tests updated for the new mock surface. Migration impact: none. embedded_at and embedding columns have been on content_chunks since schema inception. Co-Authored-By: atrevino47 <atbuster47@gmail.com> * chore(wave): post-merge tightening — drop executeRaw retry (D3) + gate noExtract (F2) - Drop #406's per-call executeRaw retry wrapper. The regex idempotence boundary is unsound (writable CTEs, side-effecting SELECTs). Recovery now happens at the supervisor level via 3-strikes-then-reconnect. - Update db.ts: setSessionDefaults becomes a back-compat no-op. resolveSessionTimeouts (from #363) is the source of truth, sending GUCs as startup parameters that survive PgBouncer transaction mode. Bumped idle_in_transaction default from 2min to 5min to match v0.21.0 posture. - Gate noExtract in cycle's runPhaseSync on whether extract phase is scheduled. Avoids silently dropping extraction when the user runs `gbrain dream --phase sync` (Codex F2). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(db): rephrase docstring to avoid false-positive in test source-grep The migrate.test.ts structural check counts `SET idle_in_transaction_session_timeout` matches in source. The literal string in this docstring was tripping it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test: backfill regression guards for #417, D3, F2 (Step 5) 15 new test cases across 3 files, ~250 LOC, all PGLite/in-memory: test/extract-incremental.test.ts (NEW, 8 cases for #417): - slugs: [] returns immediately (early-return) - slugs: undefined falls through to full-walk - slugs: [a, b] reads only those files - Slug whose file no longer exists is silently skipped - Mode filter (links) skips timeline extraction - dryRun: true does not invoke addLinksBatch / addTimelineEntriesBatch - BATCH_SIZE flush — >100 candidate links exercise mid-iteration flush - Full-slug-set resolution — link to file outside changed set still resolves test/core/cycle.test.ts (4 new cases for #417 + Codex F2): - cycle threads sync.pagesAffected into extract phase as the slugs argument - extract phase falls back to full walk when sync was skipped - F2 guard: full cycle (sync + extract) sets noExtract=true on sync - F2 guard: phases:[sync] only sets noExtract=false (no silent extract drop) test/connection-resilience.test.ts (3 new cases for D3): - PostgresEngine.executeRaw is a single-statement passthrough (no try/catch) - PostgresEngine.reconnect() still exists for supervisor-driven recovery - Supervisor still has the 3-strikes-then-reconnect path Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs(wave): v0.21.1 release notes + 3 follow-up TODOs + CLAUDE.md updates CHANGELOG.md: segment-aware entry per CEO-review D1 — 'For everyone' section (#417 incremental extract, #403 cycle abort) leads, 'For Postgres / Supabase users' section (#406, #363, #409) follows. Production proof point as a sidebar, not the lead. TODOS.md: 3 follow-up items per Eng-review D6: 1. Caller-opt-in retry for executeRaw (D3 follow-up) 2. Replace walkMarkdownFiles with engine.getAllSlugs() (F1 follow-up) 3. err.code-based connection-error matching (B1 follow-up) CLAUDE.md: 6 file-reference updates for the wave's behavioral additions (postgres-engine, db, cycle, worker, supervisor, embed, extract). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(release): bump version 0.21.1 → 0.22.1 + document version locations User-explicit version override on /ship: ship as v0.22.1 (MINOR jump from master's 0.21.0) instead of the v0.21.1 PATCH the wave originally targeted. The wave bundles 5 production fixes which is meaningful enough to clear a MINOR version, even though the API surface is additive. Files updated to 0.22.1: - VERSION (single source of truth) - package.json (Bun/npm version) - CHANGELOG.md (release header + "To take advantage of v0.22.1" block) - TODOS.md (3 follow-up TODOs reference the version that filed them) - CLAUDE.md (Key Files annotations cite the release that introduced behavior) Also adds a "Version locations" section to CLAUDE.md documenting all five required files plus the auto-derived (bun.lock, llms-full.txt) and historical (skills/migrations/v*.md, src/commands/migrations/v*.ts, test/migrations-v*.test.ts) categories. Future /ship runs and the auto-update agent now have a canonical list of where versions live. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(test): unbreak CI typecheck — annotate signal as AbortSignal | undefined CI's `bun run typecheck` step was failing with TS2339 at test/minions.test.ts:2026 — `const signal = undefined` narrows to literal `undefined`, which has no `.aborted` property, so `signal?.aborted` doesn't compile. Fix uses `as AbortSignal | undefined` to preserve the union type. A plain type annotation gets narrowed back via control-flow analysis; the `as` cast doesn't. Runtime behavior is unchanged — the optional-chain still short-circuits as intended. Verified: bunx tsc --noEmit → exit 0; the 3 checkAborted cases still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix(doctor): forward-progress override for stale minions partials The minions_migration check reads ~/.gbrain/migrations/completed.jsonl and flags any version that has a `partial` entry without a matching `complete`. Long-lived installs accumulate partial records from historical stopgap runs (notably v0.11.0). Without time decay or forward-progress detection, the FAIL flag fires forever once any partial lands, even on installs that have been running clean at v0.22+ for months. Concrete failure: test/e2e/mechanical.test.ts "gbrain doctor exits 0 on healthy DB" was flaking on dev machines whose ~/.gbrain/ carried v0.11.0 partials from earlier in the day. The fresh test DB had nothing wrong with it; doctor was just reading host filesystem state that bled in via $HOME. Fix: a partial vX.Y.Z is treated as stale (not stuck) if any vA.B.C where A.B.C >= X.Y.Z has a `complete` entry anywhere in the file. The reasoning: if a newer migration successfully landed, the install has clearly moved past the older partial. compareVersions() from src/commands/migrations/index.ts handles the semver compare. Cases preserved: - v0.10 complete + v0.11 partial → still FAILs (older complete doesn't supersede newer partial) - v0.16 partial alone → still FAILs (no override exists) - Fresh install (no completed.jsonl) → no warning - Real partial-then-complete-same-version → no warning Cases now fixed: - v0.16 complete + v0.11 partial → no FAIL (forward progress made; the v0.11 record is stale) Two regression tests in test/doctor-minions-check.test.ts cover both directions of the override (when it fires, when it doesn't). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(docs): regenerate llms-full.txt after CLAUDE.md updates CI's build-llms regen-drift guard caught that llms-full.txt was stale relative to CLAUDE.md after the wave's documentation commits (the "Version locations" section + 6 file-reference annotations for the wave's behavioral additions). CLAUDE.md notes that llms-full.txt is auto-derived — bumped via 'bun run build:llms' when CLAUDE.md's file-references change. This commit catches up. llms.txt is unchanged; the curated index doesn't pull from CLAUDE.md's file-reference body. Only llms-full.txt (the inlined single-fetch bundle) needed regeneration. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: root <root@localhost> Co-authored-by: orendi84 <orendigergo@gmail.com> Co-authored-by: atrevino47 <atbuster47@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Master shipped v0.22.1 with 5 prod hotfixes (PRs #417/#403/#406/#363/#409) while this branch was open. Merging cleanly: Conflicts resolved: - VERSION: kept 0.22.2 (this branch's slot, master is now 0.22.1) - package.json: kept 0.22.2 - CHANGELOG.md: v0.22.2 entry on top, master's v0.22.1 + earlier entries below. Also stripped a stray "=======" leftover from the prior merge resolution. - test/minions.test.ts: kept both blocks — my v0.22.2 watchdog + connectWithRetry describes (11 cases) AND master's v0.20.5 abort-signal-propagation describe (added by PR #403 cycle-abort). Auto-merged cleanly: - src/core/minions/worker.ts: my watchdog (checkMemoryLimit, gracefulShutdown, periodic timer, jobsCompleted, gracefulShutdownFired) coexists with master's AbortSignal cycle-abort plumbing (PR #403). Different code paths. - src/core/minions/supervisor.ts: my maxRssMb default (2048) + spawn arg injection coexists with master's consecutiveHealthFailures + engine.reconnect (PR #406). Different layers (boot-time vs runtime). - src/core/db.ts: my connectWithRetry + isRetryableDbConnectError coexists with master's resolveSessionTimeouts + setSessionDefaults shim (PR #363). Different concerns (connect-retry vs session-GUC delivery). - src/commands/jobs.ts: my parseMaxRssFlag + work/supervisor flag plumbing coexists with master's jobs-list/extract changes. - src/commands/autopilot.ts: my maxWaiting:1 + stable-run reset coexists with master's incremental-extract changes (PR #417). PR #406's reconnect is at health-check level (engine.reconnect after 3 consecutive failures); my connectWithRetry is at boot/cold-start level. Complementary, not duplicative — both layers ship. Verification: - bun build: clean, gbrain 0.22.2 - bun test test/minions.test.ts test/supervisor.test.ts: 174/174 pass (was 168 pre-merge; +6 from master's new abort-signal cases) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

garrytan · 2026-05-10T03:33:43Z

Closing — Incremental extract shipped in v0.22.1.

Thanks for the report. If anything still reproduces on the latest release, please reopen with the version + repro.

This was referenced Apr 25, 2026

fix: propagate AbortSignal to runCycle + worker force-eviction safety net #403

Closed

v0.22.1 autopilot fix wave — 5 prod hotfixes (#417, #403, #406, #363, #409) #447

Merged

garrytan closed this May 10, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: incremental extract — only process slugs that sync touched#417

perf: incremental extract — only process slugs that sync touched#417
garrytan wants to merge 1 commit intomasterfrom
feat/incremental-extract

garrytan commented Apr 25, 2026 •

edited

Loading

Uh oh!

garrytan commented May 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

garrytan commented Apr 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Error Log

What We Tried

Solution

Changes

Behavior matrix

Results

Testing

Related

Uh oh!

garrytan commented May 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

garrytan commented Apr 25, 2026 •

edited

Loading