v0.42.41.0 fix: triage wave — 6 data-loss/availability fixes + 9 community PRs#2128

Merged
garrytan merged 47 commits into master from garrytan/triage-gbrain-prs-issues
Jun 12, 2026
@garrytan

Owner

Summary

A triage of open reports surfaced six bugs with no fix yet plus a batch of community PRs. This ships them together as one wave (v0.42.41.0), each authored fix with a regression test, each community PR reviewed against the cross-cutting invariants (fail-closed trust, source isolation, JSONB-no-stringify, engine parity, migration discipline).

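Authored critical fixes (no prior PR): #1928 (facts wipe), #2018 (write-through leak), #2034 (reconnect loop), #2058 (WAL lock), #2038 (timeline migration drift), #2057 (timeline silent-empty).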

Community PRs merged (each reviewed): #2064 (ignore cwd .env DATABASE_URL, closes #427), #2052 + #2020 (sync walker honors .gitignore, prunes vendor/dist/build/venv), #2033 (asymmetric embedding input_type across the AI SDK, supersedes #1400), #2074 (atomic updateSourceConfig JSONB merge), #2075 (doctor correctness pass), #2009 + #2072 (OAuth scope-default + legacy token source grants, shared scope parser), #2073 (search 10s force-exit + scoped call-graph + migration v116).

Test Coverage

Each authored fix ships a regression test on both engines where engine-affecting:

Integrated targeted gate across all 6 fixes + the 9 PRs' test files: 327 pass / 0 fail. bun run typecheck clean on the full merged tree. The full sharded unit suite runs in CI as the authoritative gate (the single-process bun test OOMs locally; the repo shards for this reason).

Pre-Landing Review

Plan hardened by /plan-eng-review + a codex outside-voice pass (caught that 4 of 6 authored fixes were wrong/incomplete as first drafted — corrected before any code). A second codex adversarial pass on the as-built diff surfaced 4 more real issues, all fixed in this PR:

Eval Results

No prompt-related files changed — evals skipped.

Plan Completion

All 8 plan tasks complete (T1–T6 authored fixes, T7 PR wave, T8 ship). Plan + GSTACK REVIEW REPORT at ~/.claude/plans/system-instruction-you-are-working-zany-thacker.md.

TODOS

Filed 4 deferred follow-ups (eng-reviewed as separate scope): #1994 (supervisor backoff), #1963 (PGLite statement_timeout), #2050 (drain-worker self-deadlock), and a name-keyed migration ledger (#2038 structural follow-up).

Documentation

Updated docs/architecture/KEY_FILES.md to current-state truth (no release-history clauses, per reference-doc discipline): write-through per-source local_path (#2018), reconnect() as a required lifecycle method (#2034), migration v116 + the always-run dedup self-heal (#2073/#2038), new entries for timeline-dedup-repair.ts + timeline_dedup_index doctor check (#2038) and pglite-lock.ts heartbeat + GBRAIN_PGLITE_LOCK_STEAL_GRACE_SECONDS (#2058), and extract-facts.ts cli:-protection + net-deletion warn (#1928). bun run build:llms + freshness/current-state guards green.

Closes

Closes #1928, #2018, #2034, #2058, #2038, #2057, #2027, #427

Test plan

  • Integrated targeted gate: 327 pass / 0 fail (6 fixes + 9 PRs' tests)
  • Per-fix regression tests green on both engines (18/18 after adversarial fixes)
  • bun run typecheck clean on the full merged tree
  • Full sharded unit suite green in CI (authoritative — do not merge until green)

🤖 Generated with Claude Code

Austin Arnett and others added 30 commits June 9, 2026 11:39
When a client omits `scope` on /authorize, the authorize() grant computed
`(params.scopes || []).filter(...)` → the empty set. That empty grant was
written to oauth_codes and propagated into the access AND refresh tokens, so
every request failed `insufficient_scope` even though the client was
registered with e.g. `read write`. Because refresh inherits the stored grant,
it never self-healed — reconnecting just minted another empty-scoped token.

Some MCP connectors (observed with Claude Desktop) omit `scope` on /authorize,
so they hit this on every connection.

Fix: when no scope is requested, default to the client's full registered scope
(RFC 6749 §3.3 permits a server default). This mirrors exchangeClientCredentials,
which already does `requestedScope ? ... : allowedScopes`. The result is still
clamped to the allowed set, so an explicit over-broad request cannot escalate.

Adds test/oauth-authorize-scope-default.test.ts covering: omitted/empty →
inherits full grant; explicit subset honored; clamp preserved (over-broad and
disallowed-only requests cannot escalate or trigger inheritance).
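
A minimal sketch of the default-then-clamp rule described above. The helper name `resolveGrantedScopes` and its signature are hypothetical; the real authorize() code is not shown in this PR.

```typescript
// Hypothetical helper illustrating the rule; not the actual authorize() code.
function resolveGrantedScopes(
  requested: string[] | undefined,
  registered: string[],
): string[] {
  const allowed = new Set(registered);
  // Omitted or empty request: fall back to the client's full registered scope.
  const base = requested && requested.length > 0 ? requested : registered;
  // Always clamp to the allowed set, so an over-broad request cannot escalate.
  return base.filter((scope) => allowed.has(scope));
}
```

Note that a disallowed-only request stays empty after the clamp: inheritance is keyed on the request being absent, not on the clamped result being empty.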

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
collectSyncableFiles (first-sync walker) and the incremental PRUNE_DIR_NAMES
set skipped node_modules but not Python venv/. On a Python repo the walker
descended into venv/ (thousands of files); the resulting slug collisions
crashed putPage's INSERT ... ON CONFLICT ... RETURNING with
"undefined is not an object (evaluating 'row.deleted_at')".

Add `venv` alongside node_modules in both the import.ts inline skip and
PRUNE_DIR_NAMES. venv is the Python equivalent of node_modules.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…re body (#1400)

dimsProviderOptions() threads input_type ('query' | 'document') into
providerOptions.openaiCompatible for asymmetric models (ZE zembed-1,
Voyage v3+), but the AI SDK's openai-compatible adapter validates
providerOptions against a fixed schema and silently drops the field
before building the HTTP body. Every embedQuery() was therefore encoded
document-side: the ZE shim's hard default fired ('document'), Voyage and
local openai-compat servers got no input_type at all, and asymmetric
retrieval silently collapsed toward surface-token overlap — while the
providerOptions-level contract test stayed green.

Fix: an AsyncLocalStorage (same pattern as __budgetStore) populated in
embedSubBatch() only when providerOptions actually threads an
input_type, read at body-rewrite time by the fetch shims:
- zeroEntropyCompatFetch: recovers the threaded value; document default
  preserved for ingest paths.
- voyageCompatFetch: opt-in like the dims.ts Voyage branch — inject only
  when threaded; the field stays off the wire otherwise.
- NEW openAICompatAsymmetricFetch: fallthrough default for every other
  openai-compatible recipe (llama-server, litellm, ollama, ...) — the
  canonical local/proxy paths for asymmetric models. Strict pass-through
  when nothing was threaded, so symmetric deployments see zero wire
  change; recipes with their own compat fetch (azure) keep it via the
  compat.fetch ?? precedence.
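
A rough sketch of the AsyncLocalStorage hand-off described above, reduced to the body rewrite. The names `inputTypeStore`, `withInputType`, and `rewriteEmbeddingBody` are hypothetical, and the real shims wrap fetch rather than a string; this only shows the "inject when threaded, pass through otherwise" shape.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

type InputType = "query" | "document";

// Hypothetical store; populated only when an input_type was actually threaded.
const inputTypeStore = new AsyncLocalStorage<{ inputType: InputType }>();

// Run `fn` with an input_type visible to anything called inside it.
function withInputType<T>(inputType: InputType, fn: () => T): T {
  return inputTypeStore.run({ inputType }, fn);
}

// Rewrite an outbound JSON body. Strict pass-through when nothing was threaded,
// so symmetric deployments see the same bytes on the wire.
function rewriteEmbeddingBody(body: string): string {
  const threaded = inputTypeStore.getStore();
  if (!threaded) return body;
  const parsed = JSON.parse(body) as Record<string, unknown>;
  return JSON.stringify({ ...parsed, input_type: threaded.inputType });
}
```

The point of the store is that the value bypasses the SDK's providerOptions schema entirely, so the adapter has no opportunity to drop it.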

KNOBS_HASH_VERSION bumped 10→11: cached query_cache rows were keyed on
document-side query vectors; pre-fix rows must not be served to post-fix
lookups (same convention as the v=3 embedding-provider bump). One-time
global cold-miss on upgrade; refills within cache.ttl_seconds.

Tests: test/embed-input-type-wire.test.ts runs the REAL SDK transport
with a mocked global fetch and asserts on the outbound body — the only
layer where this regression is observable. Covers ZE hosted, llama-server,
litellm, ollama (query + document sides) and pins the pass-through for
non-asymmetric models and Voyage's opt-in shape. 4 of the original 7
assertions fail on master, proving the pin. One structural pin in
test/ai/zeroentropy-compat-fetch.test.ts updated to the new line shape
(same semantic); KEY_FILES.md gateway.ts entry updated to the new truth.

Supersedes #1400 (closed unmerged) — same ALS mechanism, extended to
Voyage + all openai-compatible recipes. Credit to @billy-armstrong for
the original diagnosis.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
collectSyncableFiles (the full-sync / dry-run enumerator) reimplemented its
own directory skip list inline (node_modules || ops), bypassing the canonical
pruneDir gate and ignoring .gitignore entirely. On a Laravel/PHP repo this
descended into vendor/ (~50k Composer files), storage/, and public/build/,
trying to import 52k dependency/build files and flooding the index with
library internals (a 35-min sync that never finished, killed by the watchdog
at 3%).

- collectSyncableFiles now enumerates via `git ls-files --cached --others
  --exclude-standard` when dir is a git work tree, so the walk honors
  .gitignore (tracked + untracked-not-ignored). Falls back to the FS walk for
  non-git dirs. EroLab: 52164 -> 1028 files.
- The FS fallback now prunes through the canonical pruneDir() instead of a
  drifted inline list, so the two skip lists can't diverge again.
- PRUNE_DIR_NAMES gains vendor/dist/build (dependency + build-output trees).

Addresses #1483 (.gbrainignore), #1159 (--respect-gitignore), and the
maintainer's #1942 vendor/dist/build prune. Walker regression suites
(sync-walker-symlink, brain-writer-walk-prune, sync, sync-walker-submodule)
green: 90 pass.
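
To illustrate the FS-fallback pruning, here is a small sketch over an in-memory tree instead of the real filesystem. `pruneDir` and `PRUNE_DIR_NAMES` are named in this PR, but their bodies below are assumptions (the real set may hold more entries), and `walk` and `Tree` are hypothetical.

```typescript
// Directory names taken from the commit messages; the real set may differ.
const PRUNE_DIR_NAMES = new Set(["node_modules", "venv", "vendor", "dist", "build"]);

// Single gate shared by every walker, so skip lists cannot drift apart.
function pruneDir(name: string): boolean {
  return PRUNE_DIR_NAMES.has(name);
}

// In-memory stand-in for a directory: null marks a file, an object a subdirectory.
type Tree = { [name: string]: Tree | null };

// Collect file paths, never descending into a pruned directory.
function walk(tree: Tree, prefix = ""): string[] {
  const files: string[] = [];
  for (const [name, child] of Object.entries(tree)) {
    const path = prefix ? `${prefix}/${name}` : name;
    if (child === null) files.push(path);
    else if (!pruneDir(name)) files.push(...walk(child, path));
  }
  return files;
}
```

Pruning at the directory boundary matters for cost as well as correctness: a pruned vendor/ tree is never enumerated at all.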
Bun merges .env files from the process cwd into process.env before any
user code runs. loadConfig() prefers env DATABASE_URL over
~/.gbrain/config.json, so any gbrain invocation from inside a web-app
checkout silently retargets the brain at that app's database — reads go
to the wrong DB and apply-migrations can write gbrain's schema into a
production app database (#427).

effectiveEnvDatabaseUrl() re-parses the .env files Bun auto-loads from
cwd and treats a DATABASE_URL whose value matches one of them as
file-origin: ignored, with a one-time stderr notice. GBRAIN_DATABASE_URL
and genuinely exported DATABASE_URLs are honored unchanged, so the
operator escape hatch and the e2e suite's env-provided URL keep working.
Applied at loadConfig, getDbUrlSource (doctor parity), init
--non-interactive, and migrate --to.
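
A simplified sketch of the file-origin check described above. `effectiveEnvDatabaseUrl` is the name used in this PR, but this signature, the `parseDotEnv` helper, and the assumption that GBRAIN_DATABASE_URL takes precedence are illustrative only; the one-time stderr notice is omitted.

```typescript
// Minimal KEY=VALUE parser (hypothetical). Real .env files allow more syntax.
function parseDotEnv(text: string): Record<string, string> {
  const out: Record<string, string> = {};
  for (const line of text.split(/\r?\n/)) {
    const m = line.match(/^\s*([A-Za-z_][A-Za-z0-9_]*)\s*=\s*(.*)$/);
    if (!m) continue; // blank lines, comments, anything unparseable
    out[m[1]] = m[2].trim().replace(/^(['"])(.*)\1$/, "$2"); // strip matching quotes
  }
  return out;
}

// Decide which env-provided URL, if any, should be honored.
function effectiveEnvDatabaseUrl(
  env: Record<string, string | undefined>,
  cwdEnvFileContents: string[],
): string | undefined {
  // Assumption: the gbrain-specific variable is the operator escape hatch.
  if (env.GBRAIN_DATABASE_URL) return env.GBRAIN_DATABASE_URL;
  const url = env.DATABASE_URL;
  if (!url) return undefined;
  // A value that matches a cwd .env file is treated as file-origin and ignored.
  const fromFile = cwdEnvFileContents.some(
    (text) => parseDotEnv(text).DATABASE_URL === url,
  );
  return fromFile ? undefined : url;
}
```

Matching by value means an exported DATABASE_URL that happens to equal the file's value is also ignored, which errs on the fail-closed side.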

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ore the op body

The 10s force-exit timer in the shared-op dispatch was armed BEFORE the
try block, so any op whose handler ran past 10s wall-clock was killed
mid-flight with process.exit(0) and zero stdout. On a slow Postgres
pooler (6-10s per fresh connection) a healthy `gbrain search` was
force-exited every time — an empty 'success' indistinguishable from no
results. The v0.42.20.0 exitCode honor can't help: a mid-op kill fires
before any error path sets exitCode.

Move the arming into the finally (teardown entry), matching the
fall-through owner-disconnect site later in main(): the timer still
bounds a hung drain/disconnect (the C13 contract) but can no longer
kill a slow-but-progressing op. Verified on a transaction-pooler
Supabase brain: search went from 0 bytes/exit 0 at 10s to real results
at ~21s.
importCodeFile built CodeEdgeInput rows without source_id, so every
edge landed NULL. getCallersOf/getCalleesOf filter
`AND source_id = <scoped>` whenever a worktree pin or --source is in
play — NULL never matches, so scoped call-graph queries silently
returned 0 rows on multi-source brains even though the edges existed
(2,122 edges, 26 targeting the probed symbol, count 0 returned).

One-line fix: carry the sourceId already in scope into the edge input.
Existing NULL rows backfill with:
  UPDATE code_edges_symbol e SET source_id = p.source_id
    FROM content_chunks c JOIN pages p ON p.id = c.page_id
   WHERE c.id = e.from_chunk_id AND e.source_id IS NULL;
(same for code_edges_chunk). Verified: code-callers returns 21 callers
where it returned 0.
The Postgres recipe ordered ALTER COLUMN TYPE vector(N) before the
UPDATE that clears stale embeddings. pgvector refuses to cast existing
vectors across dimensions ('expected 1024 dimensions, not 1536'), so
the recipe as written aborts the transaction on any brain that has
embeddings — which is every brain doing this migration. Swap the steps:
NULLs cast fine.
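
The corrected ordering can be pinned with a tiny recipe builder. This is a sketch: `buildDimensionChangeRecipe` is a hypothetical name, the table and column (`content_chunks`, `embedding`) are taken from other commit messages in this PR, and the real recipe may include further steps such as index rebuilds.

```typescript
// Returns the SQL statements in the order they must run.
function buildDimensionChangeRecipe(newDims: number): string[] {
  if (!Number.isInteger(newDims) || newDims <= 0) {
    throw new RangeError(`invalid dimension: ${newDims}`);
  }
  return [
    // 1. Wipe first: pgvector will not cast existing vectors across dimensions.
    "UPDATE content_chunks SET embedding = NULL;",
    // 2. Then change the column type: NULLs cast fine.
    `ALTER TABLE content_chunks ALTER COLUMN embedding TYPE vector(${newDims});`,
  ];
}
```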
… review)

With the hard-deadline timer correctly scoped to teardown, a genuinely
wedged read handler (hung pooler connection mid-query) would hang the
CLI forever — the #1633 zombie class the old pre-try timer accidentally
bounded at 10s. Reads now get a generous withTimeout (180s default, far
above any healthy slow-pooler run; --timeout=Ns overrides; exit 124 with
the teardown finally still draining + disconnecting). Writes/admin stay
unbounded: a long import/embed must never be killed by a default.
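
For readers unfamiliar with the pattern, a race-based timeout looks roughly like the sketch below. This is not the repo's withTimeout, whose signature is not shown here; the error message is an assumption, and the exit-124 handling and teardown are left out.

```typescript
// Resolve or reject with `work`, unless `ms` elapses first.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  // Promise.race only stops waiting. If the deadline wins, `work` is abandoned,
  // not cancelled, and anything it holds open stays open.
  return Promise.race([work, deadline]).finally(() => clearTimeout(timer));
}
```

Because the losing promise keeps running, the caller still needs its own teardown and exit path after a timeout.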
… default

Review catch: 'sourceId ?? null' fixed the scoped path but left the
unscoped one (reindex --code without --source, importCodeFile callers
without opts.sourceId) stranding edges at NULL while their pages land
under the schema default (pages.source_id DEFAULT 'default') — so
getCallersOf(sym, { sourceId: 'default' }) missed them. Same bug,
other door. Fallback is now 'default'.
…lter

Review catch: the doc fix corrected docs/embedding-migrations.md, but
embeddingMismatchMessage still PRINTED the broken order — ALTER before
UPDATE ... SET embedding = NULL — and linked to the now-contradicting
doc. pgvector refuses to cast existing vectors across dimensions, so
the printed recipe aborted on any brain that has embeddings. Swap the
steps and say why inline.
…l_qualified

1. Backfill: edges written before the stamping fix sit at source_id=NULL
   and stay invisible to scoped call-graph queries until repaired. Derive
   each edge's source from its own from_chunk's page (pages.source_id is
   NOT NULL DEFAULT 'default'). Same SQL verified live on a 2,122-edge
   production brain.
2. Indexes: getCalleesOf filters both edge tables on from_symbol_qualified,
   which had no index — every callee lookup was a seq scan, amplified
   per-BFS-node by the recursive code walk. With NULL edges repaired,
   scoped walks actually expand, so the latent cost becomes real.
   Mirrored into src/schema.sql; schema-embedded.ts regenerated.
…order

The 'Why we don't do this automatically' list still said alter-then-wipe;
reorder to wipe-then-alter and replace the fragile 'step 3' numeric
cross-reference with a name-based one.
…t, recipe order

- import-code-edges-source-id: scoped import stamps edges + scoped
  getCallersOf/getCalleesOf match (verified failing pre-fix), plus the
  unscoped-import case asserting 'default' stamping.
- cli-force-exit-teardown-arming: structural pin — the hard-deadline
  timer arms inside the finally (teardown entry), never before the op
  body; daemon guard, unref, clearTimeout intact.
- embedding-dim-check: recipe order pinned — UPDATE precedes ALTER so
  the printed SQL can't drift from docs/embedding-migrations.md again.
…ntext too

Adversarial review, two findings on the new timeout path:
1. On timeout the finally drained, disconnected, then CLEARED the
   hard-deadline timer — removing the only backstop while the abandoned
   handler (withTimeout races, it does not cancel) can hold ref'd
   sockets/SDK timers that keep Bun's loop alive: 'timed out' printed,
   process immortal — the zombie class this branch exists to kill,
   resurrected through its own fix. The finally now exits explicitly
   after teardown completes on the timeout path.
2. makeContext does DB I/O (resolveSourceId) for EVERY op and sat
   outside any bound — a pooler wedge at context build hung reads,
   writes, and admin alike. It now shares the same wallclock bound.
…unscoped chunk fan-out

Adversarial review: txOpts used truthiness while the edge stamp used
nullish — sourceId:'' put pages under 'default' but stamped edges '',
FK-violating against sources(id) and silently dropping the file's whole
call graph in the best-effort catch. The unscoped getChunks could also
fan out to same-slug chunks from another source. One normalized
edgeSourceId (sourceId || 'default') now drives both the chunk lookup
and the stamp.
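
The truthiness-versus-nullish mismatch is easy to reproduce in isolation. The function names below are hypothetical; only the `sourceId || 'default'` expression comes from this commit.

```typescript
const DEFAULT_SOURCE = "default";

// Nullish fallback: keeps the empty string, which is not a valid source id.
function nullishFallback(sourceId: string | null | undefined): string {
  return sourceId ?? DEFAULT_SOURCE;
}

// Truthiness fallback: the empty string also maps to the default source, so the
// page write and the edge stamp agree.
function normalizeEdgeSourceId(sourceId: string | null | undefined): string {
  return sourceId || DEFAULT_SOURCE;
}
```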
…(both engines)

Adversarial review: addCodeEdges still wrote e.source_id ?? null, so any
future caller that forgets the field reintroduces invisible NULL edges
the day after the v116 backfill runs. A NULL source_id is invisible to
every scoped call-graph query; default to the schema-default source the
way the pages table does. Applied to both engines (parity).
… alters

Adversarial review: buildFactsAlterRecipe shipped the same defect class
this branch fixes for content_chunks 350 lines up — a cross-dimension
ALTER ... USING cast that pgvector refuses while rows hold old-width
vectors. Dimension changes now wipe first (the facts pipeline re-embeds
on next write); same-dim type swaps (halfvec <-> vector) keep the
lossless cast and PRESERVE data. Both behaviors pinned by tests.
Marks the v0.42.20.0 'decouple the op-dispatch force-exit timer' follow-up
complete — this branch ships exactly that decoupling.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…inate lost-update race

## Problem

`updateSourceConfig` used a read-then-write pattern: read the current
`config` row, normalize it in JavaScript, then write the merged result
back with `SET config = <normalized> || <patch>`.

Under concurrent callers (two background autopilot/cycle paths patching
different keys simultaneously), both callers can read the same stale
row. The later `SET config = ...` then clobbers the earlier patch,
silently dropping whatever keys the first caller wrote. Reproduced
at 21/25 lost-update events under real Postgres with parallel callers.

## Fix

Fold the normalization and merge into a single atomic `UPDATE … SET
config = CASE … END || patch` statement. Because the `SET` expression
evaluates against the row-locked latest version of `config`, there is
no snapshot window between the read and the write. Concurrent callers
now converge correctly (50/50 clean in reproduction test).

The `CASE` also normalizes historical bad JSONB shapes inline:
- `object` — used as-is
- `string` — double-encoded config; inner text parsed with the SQL
  `IS JSON` guard (Postgres 16+) so unparseable strings fall back to
  `{}` instead of raising `invalid input syntax for type json`
- `array` — array of patch objects aggregated into a flat object via
  `jsonb_object_agg`
- anything else — falls back to `{}`
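
The four normalization branches can be modeled in TypeScript to make the intended result of each shape concrete. This is only a model of the semantics, not the SQL itself and not atomic; `normalizeConfig` and `mergeConfig` are hypothetical names, and treating a string that parses to a non-object as `{}` is an assumption.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };
type JsonObject = { [key: string]: Json };

function isPlainObject(value: Json): value is JsonObject {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Mirrors the CASE branches: object, string, array, anything else.
function normalizeConfig(stored: Json): JsonObject {
  if (isPlainObject(stored)) return stored;
  if (typeof stored === "string") {
    try {
      const inner = JSON.parse(stored) as Json; // double-encoded config
      return isPlainObject(inner) ? inner : {};
    } catch {
      return {}; // unparseable text falls back instead of throwing
    }
  }
  if (Array.isArray(stored)) {
    const flat: JsonObject = {};
    for (const item of stored) if (isPlainObject(item)) Object.assign(flat, item);
    return flat;
  }
  return {};
}

// Shallow merge, like the JSONB `||` operator on two objects.
function mergeConfig(stored: Json, patch: JsonObject): JsonObject {
  return { ...normalizeConfig(stored), ...patch };
}
```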

`pglite-engine.updateSourceConfig` already used an atomic `||` merge;
this change brings postgres-engine to parity.

## Test

Added two assertions to `test/list-all-sources.test.ts`:
1. JSONB string holding non-JSON text normalizes to `{}` (no cast throw)
2. JSONB string holding double-encoded valid JSON is parsed then merged
…aph coverage, exit code, gateway guard

## 1. Stale lock break hints cover gbrain-cycle: keys

The doctor stale-lock report only recognized `gbrain-sync:` lock prefixes;
everything else fell back to `gbrain sync --break-lock`, which is wrong for
dream/autopilot cycle locks. A `gbrain-cycle:<source>` or `gbrain-cycle`
lock now suggests `gbrain dream --break-lock [--source <name>]`, and
unknown lock shapes fall back to `gbrain doctor` instead of a
misleading sync command.
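
A sketch of the prefix-to-hint mapping described above. `breakLockHint` is a hypothetical name, and the exact hint text for `gbrain-sync:` keys is assumed to be unchanged from before.

```typescript
// Map a stale lock key to the command that can break it.
function breakLockHint(lockKey: string): string {
  if (lockKey.startsWith("gbrain-sync:")) return "gbrain sync --break-lock";
  if (lockKey === "gbrain-cycle") return "gbrain dream --break-lock";
  if (lockKey.startsWith("gbrain-cycle:")) {
    const source = lockKey.slice("gbrain-cycle:".length);
    return `gbrain dream --break-lock --source ${source}`;
  }
  // Unknown shapes: point at doctor rather than a misleading sync command.
  return "gbrain doctor";
}
```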

## 2. content_sanity_audit_recent counts reject and quarantine as hard failures

v0.42 renamed the hard disposition path: rejected pages emit a `reject`
event and quarantined junk pages emit `quarantine`; `hard_block` is now
only the pre-v0.42 legacy alias. The status check only counted `hard_block`,
so fresh `reject` / `quarantine` events from the new path cleared as `ok`
whenever fewer than 10 events existed. The check now sums all three for the
hard count, and `soft_block + flag` for the soft count.

## 3. graph_coverage excludes test fixture entity pages from the denominator

Brains seeded with code sources (e.g. a sync of the gbrain repo itself)
could accumulate test fixture pages typed as `entity` / `person`. Including
these in the entity-count denominator diluted coverage and produced spurious
warnings ("Entity link coverage 0%, timeline 0%") on knowledge-only brains
with no real entity pages. The check now queries a per-entity stats CTE that
excludes `tools/gbrain/test/*` slugs and the `templates/new-person` stub,
with an additional guard for the all-fixture case (`eligibleEntityCount = 0`).

## 4. process.exitCode instead of process.exit at doctor main exit point

`process.exit(hasFail ? 1 : 0)` was a hard kill that prevented cleanup
handlers (Bun unload events, open DB connections) from running. Using
`process.exitCode = hasFail ? 1 : 0` defers the actual termination until
the end of the event loop, allowing cleanup to complete.

## 5. checkSubagentCapability exported for test seams + gateway loop guard

The function was private, making it untestable in isolation. It is now
exported. Additionally, users running gbrain with a non-Anthropic chat model
via `agent.use_gateway_loop=true` no longer receive a spurious warning that
`ANTHROPIC_API_KEY` is missing — subagents route via the gateway loop in
that configuration and do not need the key directly.

## Tests

Doctor test suite: 77 pass, 0 fail (no regressions).
…nect() parity (#2034)

Engine-layer API for two cycle/availability fixes that share these files:
- deleteFactsForPage gains optional excludeSourcePrefixes so the fence
  reconcile can protect non-fence facts (e.g. cli: conversation facts).
- reconnect(ctx?) is now a first-class BrainEngine method on both engines
  (PostgresEngine already had it; PGLite gains config capture + reconnect)
  so callers stop using disconnect()+bare connect().

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The fence reconcile delete-then-reinsert wiped cli:-origin facts (no fence to
recreate them); a failed-sync full walk turned it brain-wide (1829 rows, 0
reinserted, status ok). Now: exclude cli: rows from the wipe, do NOT inherit
the failed-sync->full-walk fallback for this destructive phase, and warn on
net-negative reconcile.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…onnect() (#2034)

The autopilot health-probe recovery called connect() with no args after
disconnect(), losing the startup config (database_url undefined -> FATAL
restart-loop on every DB blip) and opening a null-pool window. Both call sites
now use engine.reconnect(), which restores the captured config.
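
The failure mode and the fix can be shown with a toy engine. This is a synchronous simplification with hypothetical names (`SketchEngine`, `EngineConfig`); the real engines are asynchronous and hold actual connection pools.

```typescript
interface EngineConfig {
  database_url: string;
}

class SketchEngine {
  private captured: EngineConfig | undefined;
  connected = false;

  connect(config?: EngineConfig): void {
    // A bare connect() has no URL to work with, which is the restart-loop bug.
    if (!config || !config.database_url) {
      throw new Error("FATAL: database_url undefined");
    }
    this.captured = config; // remember the startup config for later recovery
    this.connected = true;
  }

  disconnect(): void {
    this.connected = false;
  }

  // Restores the captured config instead of relying on the caller to pass it.
  reconnect(): void {
    if (!this.captured) throw new Error("reconnect() called before connect()");
    this.disconnect();
    this.connect(this.captured);
  }
}
```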

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… the global repo (#2018)

put_page write-through resolved the disk target from the global sync.repo_path,
so a default-source page (local_path NULL) got written into an unrelated
federated source's working tree. Now it uses the assigned source's own
local_path; NULL local_path skips (no leak); the global path is used only as a
sole-source fallback.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…tolen (#2058)

A live holder's lock was force-removed after 5min age alone, letting a second
process share the single-writer data dir -> WAL corruption. The lock now
heartbeats while held; a holder is reaped only when its PID is dead OR its
heartbeat went stale past the steal grace. Pairs PID liveness with heartbeat
age to also defeat PID reuse.
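
The reaping rule reduces to a small predicate. The record shape, field names, and `canStealLock` are hypothetical; PID liveness is injected so the sketch stays independent of any OS call.

```typescript
interface LockRecord {
  pid: number;
  heartbeatAtMs: number; // last time the holder refreshed the lock
}

// A holder is reaped only when its PID is dead OR its heartbeat is stale past
// the steal grace. Lock age alone is never enough.
function canStealLock(
  lock: LockRecord,
  nowMs: number,
  stealGraceMs: number,
  isPidAlive: (pid: number) => boolean,
): boolean {
  if (!isPidAlive(lock.pid)) return true;
  return nowMs - lock.heartbeatAtMs > stealGraceMs;
}
```

The heartbeat check is what covers PID reuse: an unrelated process that inherited the PID looks alive but never refreshes the lock.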

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A migration renumbered during a merge (v102) could be recorded-as-applied
without its DDL running, leaving the 3-column index so every timeline write
failed the 4-column ON CONFLICT. runMigrations now always runs a shape-keyed
drift repair (dedupe-then-rebuild) even when no migration is pending, and
doctor surfaces the drift.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ound-trip (#2057)

The meetings extractor's bare catch {} hid a brain-wide timeline-write failure
(0 entries, no error). It now counts + surfaces batch errors. Adds a Date-bearing
batch regression test proving the #1861 jsonb_to_recordset refactor already
fixed the original ::text[] cast failure.
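
A sketch of counting batch failures instead of swallowing them. `writeBatches` and `exitCodeFor` are hypothetical names; `batch_errors` is the field name used later in this PR, and the exit-1 rule matches the CLI change described there.

```typescript
interface BatchResult {
  inserted: number;
  batch_errors: string[];
}

// Attempt every batch; record failures rather than hiding them.
function writeBatches<T>(
  batches: T[][],
  insert: (batch: T[]) => number,
): BatchResult {
  const result: BatchResult = { inserted: 0, batch_errors: [] };
  for (const batch of batches) {
    try {
      result.inserted += insert(batch);
    } catch (err) {
      // A bare `catch {}` here would report a total failure as "0 entries".
      result.batch_errors.push(err instanceof Error ? err.message : String(err));
    }
  }
  return result;
}

// Callers can turn any recorded failure into a non-zero exit.
function exitCodeFor(result: BatchResult): number {
  return result.batch_errors.length > 0 ? 1 : 0;
}
```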

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
garrytan and others added 13 commits June 11, 2026 20:49
…nv (#427)

Reviewed: config-layer only (config.ts, init.ts, migrate-engine.ts); disjoint
from the authored fixes. Prevents a cwd .env DATABASE_URL from silently
retargeting the brain at the wrong DB.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ist/build

Reviewed: import.ts + sync.ts walker scoping; disjoint from authored fixes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Reviewed: same walker as #2052; venv glob addition. Resolved on top of #2052.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

# Conflicts:
#	src/commands/import.ts
#	src/core/sync.ts
…SDK (supersedes #1400)

Reviewed: gateway.ts AsyncLocalStorage + search/mode.ts; disjoint from authored
fixes. Chosen over #2083 (whole openai-compatible recipe class vs ZeroEntropy-only).
Credit @billy-armstrong for the original #1400 diagnosis.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…Config

Reviewed: single-statement config || patch eliminates the read-then-write
lost-update race. Touches updateSourceConfig (disjoint from the deleteFactsForPage
region the authored #1928 fix changed).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…nt sanity, graph coverage, exit code, gateway guard

Reviewed: doctor.ts only; the five fixes sit in different checks than the
authored #2038 timeline_dedup_index check.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…full registered grant

Reviewed (A6): omitted/empty scope both safely default to the client's
REGISTERED scope and stay clamped to it (no escalation); fixes the empty-grant
that never self-heals. Verified the result can't exceed allowedScopes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Reviewed (A7): extracts parseLegacyTokenScope into src/core/legacy-token-scope.ts
shared by the HTTP transport + OAuth provider (no duplicate logic). Closes the
OAuth-provider gap where legacy permissions.source_id grants collapsed to 'default'.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…-graph 0-rows + dim-migration order

Reviewed: CLI read-timeout wraps connectEngine (line 249) per the timeout-layer
learning; migration v116 (code_edges source_id backfill + from_symbol_qualified
index) coexists with the authored #2038 drift repair in runMigrations; typecheck
clean. CHANGELOG version collision resolved (master's 0.42.39.0 Retrieval Reflex
kept; #2073's notes fold into the wave entry at ship).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Triage fix wave: 6 authored critical fixes (#1928 facts wipe, #2018
write-through leak, #2034 reconnect loop, #2058 WAL lock, #2038 timeline
migration drift, #2057 timeline silent-empty) + community PRs #2064 #2052
#2020 #2033 #2074 #2075 #2009 #2072 #2073. TODOS: deferred #1994 #1963 #2050.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Codex as-built review of the authored fixes surfaced 4 real issues:
- #2058: add a pid+acquired_at ownership token. A stale holder reaped + replaced
  past the grace must NOT let its resumed heartbeat refresh, nor releaseLock
  remove, the NEW owner's lock (re-opened the concurrent-writer hole). Heartbeat
  and release now verify the on-disk lock is still ours. + regression test.
- #1928: the destructive-full-walk guard keyed off phases.includes('sync'),
  which wrongly suppressed a legitimate full reconcile when sync was SKIPPED
  (no engine / no brainDir). Key off a syncAttempted flag set only when sync
  actually ran.
- #2038: dedupe keeps MIN(id) not MIN(ctid) — deterministic and consistent with
  the existing v-migration lower-id rule.
- #2057: the extract CLI caller now surfaces batch_errors (stderr + exit 1)
  instead of printing a clean success over failed inserts.
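
The ownership check from the #2058 item above might look like the following sketch. The `pid` + `acquired_at` pair is from the commit message; the type and function names are hypothetical.

```typescript
interface LockOwner {
  pid: number;
  acquired_at: string;
}

// Heartbeat and release both call this before touching the on-disk lock, so a
// reaped holder that resumes cannot refresh or remove the new owner's lock.
function isStillOurs(onDisk: LockOwner | null, mine: LockOwner): boolean {
  return (
    onDisk !== null &&
    onDisk.pid === mine.pid &&
    onDisk.acquired_at === mine.acquired_at
  );
}
```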

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Update KEY_FILES.md to current-state truth for the shipped fixes (no
release-history clauses, per the reference-doc discipline):

- write-through.ts (#2018): resolves the disk target from the assigned
  source's own local_path; sole-source falls back to sync.repo_path,
  multi-source skips with source_has_no_local_path rather than leak.
- engine.ts (#2034): reconnect() is now a REQUIRED lifecycle method on
  both engines; config-restoring, never disconnect()+bare connect().
- migrate.ts (#2073): document v116 edge source_id backfill + callee
  index, and the always-run (version-counter-blind) timeline dedup
  self-heal.
- new entry for timeline-dedup-repair.ts (#2038) + the
  timeline_dedup_index doctor check.
- new entry for pglite-lock.ts (#2058): heartbeat + steal-grace
  (GBRAIN_PGLITE_LOCK_STEAL_GRACE_SECONDS) so a live holder is never
  stolen.
- extract-facts.ts (#1928): cli:-fact protection, no failed-sync
  full-walk inheritance, net_fact_deletion warn floor.

bun run build:llms re-run (KEY_FILES is link-only so bundles unchanged);
freshness + current-state guards green.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
garrytan and others added 4 commits June 11, 2026 23:01
…leak guard

The first #2018 fix skipped any no-local_path source on a multi-source brain,
which broke the legitimate nested layout (a source without its own tree nests
under the host repo at .sources/<id>/ — pinned by put-page-write-through.test).
Narrow the guard: a no-local_path source nests under sync.repo_path as before;
only SKIP when sync.repo_path is literally another source's own local_path
(the actual leak — writing there pollutes that sibling's repo). Caught by the
sharded suite.
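
The narrowed guard can be sketched as a pure decision function. Everything here is illustrative: `resolveWriteRoot` and the types are hypothetical, only the root directory is returned (the `.sources/<id>/` nesting is assumed to happen elsewhere), and the skip reason reuses the `source_has_no_local_path` name from this PR.

```typescript
interface SourceRow {
  id: string;
  local_path: string | null;
}

type WriteDecision =
  | { kind: "write"; root: string }
  | { kind: "skip"; reason: string };

function resolveWriteRoot(
  source: SourceRow,
  allSources: SourceRow[],
  syncRepoPath: string | null,
): WriteDecision {
  // A source with its own tree always writes there.
  if (source.local_path) return { kind: "write", root: source.local_path };
  if (!syncRepoPath) return { kind: "skip", reason: "source_has_no_local_path" };
  // The actual leak: the global path is literally a sibling source's own tree.
  const ownedBySibling = allSources.some(
    (other) => other.id !== source.id && other.local_path === syncRepoPath,
  );
  if (ownedBySibling) return { kind: "skip", reason: "source_has_no_local_path" };
  // Otherwise the source lives under the host repo, as before.
  return { kind: "write", root: syncRepoPath };
}
```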

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
CI `verify` flagged 3 intra-process isolation violations in the tests added
this wave (the parallel runner shares one process per shard):
- pglite-lock.test.ts: the GBRAIN_PGLITE_LOCK_STEAL_GRACE_SECONDS mutation now
  goes through withEnv() instead of a raw process.env write (R1).
- pglite-reconnect: renamed to *.serial.test.ts — it creates per-test engines
  to exercise the connect/reconnect lifecycle, which doesn't fit the shared
  beforeAll-engine model (R3/R4).
verify is now 30/30; both files green.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
CI serial-tests + test(5) caught two in-branch regressions from the #2034
PGLite reconnect():
- worker/queue claim-error recovery + their renewLock e2e test assume PGLite
  reconnect is absent/no-op (queue.ts documents it). Making it a real
  disconnect+reopen wiped an in-memory engine's state mid-job. reconnect() now
  no-ops for in-memory (no database_path) — file-backed still re-opens the dir
  (state persists on disk). Restores the documented worker assumption.
- connection-resilience 'Supervisor still has the 3-strikes-then-reconnect
  path' pinned the removed unsafe-cast text; updated to assert the direct
  this.engine.reconnect() call.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
#2033's embed-input-type-wire.test.ts configures a 1280-dim embedding gateway;
the active dimension survived into engine-find-trajectory when CI's 10-way
hash-disjoint sharding co-located them (this branch's added files reshuffled the
assignment), failing 7 trajectory tests with 'expected 1280 dimensions, not
1536'. resetGateway() in afterEach clears the gateway but the dimension still
leaked. The test file mutates global gateway/embedding state, so it belongs in
the serial lane (own bun process, true isolation) by the repo's own definition.
Root-caused by reproducing the exact failing pair locally.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@garrytan garrytan merged commit 7c27fa1 into master Jun 12, 2026
21 checks passed

8 participants