
feat(sync): opt-in --respect-gitignore for the full-import walker (#1073) #1159

Open
jetsetterfl wants to merge 1 commit into garrytan:master from jetsetterfl:fix/1073-respect-gitignore-walker


@jetsetterfl


Closes #1073.

Problem

collectSyncableFiles (src/commands/import.ts) only skips dot-dirs / node_modules / ops. The full / first / --full import therefore walks gitignored build output (dist/, out/, coverage/, __pycache__/, …) and admits every file in there as "code" — CODE_EXTENSIONS covers .json / .yaml / .toml / .html / .css. On repos with large gitignored trees this bloats the DB, raises embedding cost, pollutes code-def / semantic search with stale fixtures/bundles, and can wedge the chunker on a pathological file (e.g. a single-line giant JSON). This is the stall reported in #1073. The incremental sync path is already git-based (git ls-files --exclude-standard, sync.ts) and excludes these — only the full-import walker diverges.

Change (opt-in, default off)

  • collectSyncableFiles gains respectGitignore in CollectOpts. When set and the root is a git work tree, git ls-files -o -i --exclude-standard --directory builds the ignored set; entries are pruned at descent time so a fully-ignored directory is never recursed into — addressing the walk-stall, not just emit-time filtering. Non-git root / git unavailable → empty set → zero overhead, exact legacy behavior.
  • Default OFF. Preserves behavior for dotfile/secret brains that deliberately keep content out of git but want it brain-searchable (the explicit "why not just default-on" case in #1073, "sync --strategy code enhancement: honor .gitignore (opt-in flag)").
  • Surface area:
    • gbrain sync --respect-gitignore (and --no-respect-gitignore to override an enabled config)
    • sync.respect_gitignore config knob (CLI flag wins)
    • gbrain import --respect-gitignore
    • threaded into the sync --all cost preview so the estimate matches what will actually be walked
    • documented in gbrain --help
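
For illustration, the descent-time pruning described in the first bullet can be sketched as below. This is a minimal sketch, not the code in the diff: `parseIgnored`, `walk`, and the in-memory `Tree` type are hypothetical, and it assumes the newline-separated stdout of `git ls-files -o -i --exclude-standard --directory`, where an ignored directory is listed with a trailing slash.

```typescript
// Hypothetical sketch: names and the in-memory tree are illustrative only.
type Tree = { [name: string]: Tree | null }; // null marks a file

// Parse the stdout of `git ls-files -o -i --exclude-standard --directory`.
// Ignored directories arrive with a trailing slash, ignored files without one.
function parseIgnored(stdout: string): Set<string> {
  const ignored = new Set<string>();
  for (const line of stdout.split('\n')) {
    const entry = line.trim();
    if (entry) ignored.add(entry.replace(/\/$/, ''));
  }
  return ignored;
}

// Check the ignored set BEFORE descending, so the children of an ignored
// directory are never visited at all.
function walk(tree: Tree, ignored: Set<string>, prefix = '', out: string[] = []): string[] {
  for (const [name, child] of Object.entries(tree)) {
    const rel = prefix ? `${prefix}/${name}` : name;
    if (ignored.has(rel)) continue; // prune at descent time
    if (child === null) out.push(rel);
    else walk(child, ignored, rel, out);
  }
  return out;
}

const repo: Tree = {
  src: { 'app.ts': null },
  dist: { 'bundle.js': null },
  coverage: { 'lcov.info': null },
};
const kept = walk(repo, parseIgnored('dist/\ncoverage/\n'));
// kept holds only "src/app.ts"; an empty ignored set keeps all three files.
```

An empty set (non-git root, or git unavailable) makes the `ignored.has` check a no-op, which is the legacy behavior the bullet describes.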

Mechanism note

#1073 suggested git ls-files --cached --others --exclude-standard (an allowlist gated at emit time). I used the ignored set with --directory instead so whole gitignored trees are pruned before recursing — the issue's core pain is the walker stalling on tens of thousands of files, which directory-level pruning avoids and emit-time allowlisting does not. All existing walker hardenings (symlink/inode-cycle/max-depth) are untouched. Happy to switch to the allowlist shape if you'd rather.

Tests

test/import-walker.test.ts:

  • default (off) still admits gitignored dist/bundle.js — legacy behavior preserved
  • opt-in prunes dist/ + coverage/ while keeping src/app.ts
  • opt-in on a non-git dir falls back gracefully (no prune, no throw)

bun run typecheck clean; bun test test/import-walker.test.ts → 7 pass / 0 fail.

Follow-ups (intentionally out of scope)

--ignore-from FILE and .gbrainignore / sync.exclude (#449) — this PR lands the easy win for repos that already express the intent in .gitignore.

🤖 Generated with Claude Code

…rrytan#1073)

collectSyncableFiles only skipped dot-dirs / node_modules / ops, so the
full / first / `--full` import walked gitignored build output (dist/,
out/, coverage/, __pycache__/, ...) and admitted every file in there as
"code" (CODE_EXTENSIONS covers .json/.yaml/.toml/.html/.css). On repos
with large gitignored trees this bloats the DB, raises embedding cost,
pollutes search with stale fixtures, and can wedge the chunker on a
pathological file — exactly the stall reported in garrytan#1073. The incremental
sync path is already git-based and excludes these; this makes the
full-import walker consistent with it, opt-in.

- `collectSyncableFiles` gains `respectGitignore` (CollectOpts). When set
  and the root is a git work tree, `git ls-files -o -i --exclude-standard
  --directory` builds the ignored set; entries are pruned at *descent*
  time so a huge gitignored dir is never recursed into (addresses the
  walk-stall, not just emit-time filtering). Empty set / non-git root /
  no git => zero overhead, legacy behavior.
- Default OFF. Preserves behavior for dotfile/secret brains that
  deliberately keep content out of git but want it brain-searchable
  (the "why not just default-on" case in garrytan#1073).
- Wired through: `gbrain sync --respect-gitignore`
  (and `--no-respect-gitignore` to override an enabled config),
  `sync.respect_gitignore` config knob (flag wins), `gbrain import
  --respect-gitignore`, and the `sync --all` cost preview so the
  estimate matches what will actually be walked.
- Tests: default still admits gitignored output (legacy preserved),
  opt-in prunes dist/ + coverage/ while keeping src/, non-git root
  falls back gracefully without throwing.

`--ignore-from FILE` and `.gbrainignore` (garrytan#449) are intentionally left
as follow-ups; this lands the easy win for repos that already express
the intent in .gitignore.
garrytan added a commit that referenced this pull request May 21, 2026
…dential clients (#1253)

* fix(reindex-frontmatter): connect engine before query (#1225)

`createEngine()` from src/core/engine-factory.ts only constructs the
engine; callers MUST call connect() before any executeRaw. The
reindex-frontmatter CLI was constructing the engine and going
straight to countAffected, which crashed on PGLite with "PGLite not
connected. Call connect() first." even on --dry-run.

Fix follows the existing-command pattern (src/commands/auth.ts,
src/commands/backfill.ts, src/commands/integrity.ts all do the
same): pass toEngineConfig(cfg) into both createEngine() AND
engine.connect(), then engine.initSchema() (idempotent on a current
schema, ~1ms cost).

Pre-fix verification: codex outside-voice CF5 flagged the related
"can't import connectEngine from cli.ts" misdirection in the
original fix plan. This implementation uses the canonical sibling
pattern instead.

Regression test pinned at test/reindex-frontmatter-connect.test.ts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump VERSION to 0.37.7.0 + stub CHANGELOG

v0.37.5.0 claimed by #1229 (warsaw-v4); v0.37.6.0 by #1246
(OpenRouter recipe). v0.37.7.0 is the next free slot for this
fix wave.

CHANGELOG entry stubbed in user-facing voice per CLAUDE.md
"CHANGELOG voice + release-summary format" — ELI10 lead-first,
real fix details below. The "## To take advantage of v0.37.7.0"
block follows the v0.13+ self-repair pattern from CLAUDE.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(subagent): short-circuit terminal-on-resume (#1151)

Bug: when the worker resumed a subagent job whose persisted last
message was an assistant turn with text-only content (no tool_use
blocks), the replay reconciler at subagent.ts:241-247 had no
branch for that case. The main loop then called messages.create
against a conversation ending in assistant role, which Sonnet 4.6+
rejects with HTTP 400 "This model does not support assistant
message prefill." 3 retries later → dead-letter, despite all the
job's work having committed in earlier turns.

@zscgeek's bug report pinned this exactly: dream-cycle Otter
corpus runs hit ~7% dead-letter rate, every dead job's last
subagent_messages row was a text-only synthesis summary listing
slugs that already existed in `pages`. Their proposed fix mirrors
this implementation.

Fix: add an else branch to the assistant-tail check that mirrors
the live-loop terminal logic at subagent.ts:440-447 — reconstruct
finalText from the persisted text blocks, return
stop_reason='end_turn' immediately. No LLM call, no schema change.
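
A minimal sketch of the added branch, for illustration only. The message and block shapes, and the names `reconcileOnResume` and `ResumeDecision`, are hypothetical and not taken from subagent.ts; joining text blocks with an empty separator is also an assumption.

```typescript
// Hypothetical shapes: the persisted message schema may differ.
type TextBlock = { type: 'text'; text: string };
type ToolUseBlock = { type: 'tool_use'; id: string };
type Block = TextBlock | ToolUseBlock;
type Message = { role: 'user' | 'assistant'; content: Block[] };

type ResumeDecision =
  | { kind: 'continue' }
  | { kind: 'replay_tools'; toolUseIds: string[] }
  | { kind: 'terminal'; finalText: string; stop_reason: 'end_turn' };

const isText = (b: Block): b is TextBlock => b.type === 'text';
const isToolUse = (b: Block): b is ToolUseBlock => b.type === 'tool_use';

function reconcileOnResume(history: Message[]): ResumeDecision {
  const last = history[history.length - 1];
  if (!last || last.role !== 'assistant') return { kind: 'continue' };

  // Existing branch: an assistant tail with tool_use blocks still needs results.
  const toolUses = last.content.filter(isToolUse);
  if (toolUses.length > 0) {
    return { kind: 'replay_tools', toolUseIds: toolUses.map((b) => b.id) };
  }

  // New branch: a text-only assistant tail means the job already finished,
  // so return the final text without calling the model again.
  const finalText = last.content.filter(isText).map((b) => b.text).join('');
  return { kind: 'terminal', finalText, stop_reason: 'end_turn' };
}
```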

Two new regression cases:
  - text-only terminal on resume returns immediately with zero
    messages.create calls
  - tool-use replay path unchanged (existing behavior preserved)

Codex outside-voice (CF13) initially flagged this fix as
mis-targeted, claiming subagent.ts already handled the case.
/investigate run revealed the live-loop terminal at :440-447 was
covered but the REPLAY-path terminal at :241-247 was missing —
both branches need symmetric handling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(autopilot): scope lockfile to GBRAIN_HOME (#1226)

The autopilot lockfile was hardcoded at `~/.gbrain/autopilot.lock`
(via `process.env.HOME`), bypassing GBRAIN_HOME. Two brains pointed
at different GBRAIN_HOME directories still wrote to the same global
lockfile; one would silently take over the other on each restart.

Fix: route through `gbrainPath('autopilot.lock')` from
src/core/config.ts (imported aliased as gbrainHomePath since the
local `gbrainPath` var in installAutopilot references the CLI
binary path). The mkdirSync(`~/.gbrain`) call also routes through
the helper so the directory is created in the right place too.

Co-authored with @rafaelreis-r — same fix shape as PR #1227,
re-implemented against current master per the wave's
"re-implement, credit, close" workflow.

Tests cover: one GBRAIN_HOME → one canonical lock; two
GBRAIN_HOME values → two distinct locks; default fall-through
still works.

Co-Authored-By: rafaelreis-r <noreply@github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(graph-query): foreign-edge footer + --include-foreign (#1153)

The graph-query CLI silently dropped edges to pages in other sources
on federated brains. Users had no signal those edges existed unless
they read the source code.

Fix:
- New --include-foreign flag (off by default, preserves the existing
  scoping contract; on = explicit cross-source traversal).
- After every traversal, count edges from rootSlug whose target page
  lives in a different source. When count > 0 AND user didn't opt in,
  emit a stderr footer:
    `(N edge(s) to foreign-source pages hidden; pass --include-foreign
     to include them)`
- The "no edges found" path also runs the count + footer so users
  discover foreign edges even when scoped traversal returned nothing.
- Thin-client path skips the count (engine query not available);
  future T1 work threads source resolution through MCP for that path.
- Table-name correctness in count SQL: the links table is `links`
  (not `page_links`); JOIN both endpoints to pages and compare
  source_id, NULL-safe via `IS NOT NULL` guards on both sides.
- Fail-open on missing source_id column for pre-v0.18 brains: return 0
  (no foreign edges to report) instead of throwing.
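
The count-and-footer logic above, sketched in memory rather than SQL. The `Page` and `Link` shapes and both function names are hypothetical, and the singular/plural wording is an assumption based on the `edge(s)` placeholder and the pluralization test.

```typescript
// Hypothetical in-memory stand-ins for the pages and links tables.
type Page = { id: number; source_id: string | null };
type Link = { from_id: number; to_id: number };

// Count edges from the root whose target lives in a different source.
// NULL-safe: an unknown source on either side never counts as foreign.
function countForeignEdges(root: Page, links: Link[], pages: Map<number, Page>): number {
  if (root.source_id === null) return 0;
  let count = 0;
  for (const link of links) {
    if (link.from_id !== root.id) continue;
    const target = pages.get(link.to_id);
    if (target && target.source_id !== null && target.source_id !== root.source_id) count++;
  }
  return count;
}

// Footer is emitted only when foreign edges exist AND the user did not opt in.
function foreignFooter(count: number, includeForeign: boolean): string | null {
  if (count === 0 || includeForeign) return null;
  const noun = count === 1 ? 'edge' : 'edges';
  return `(${count} ${noun} to foreign-source pages hidden; pass --include-foreign to include them)`;
}
```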

4 new test cases: footer fires on scoped query with foreign edge,
--include-foreign suppresses footer, zero-foreign no-footer case,
pluralization regression guard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(sources): `gbrain sources current` + tier attribution (#1222)

Federated-brain users running destructive ops (extract, import,
purge) need a way to verify which source they're targeting BEFORE
the op runs. Pre-fix, the only way was to grep config files or run
the op with --dry-run and inspect output.

New command:
  gbrain sources current             # human output
  gbrain sources current --json      # machine-readable
  gbrain sources current --source X  # show what an explicit --source
                                     # X would resolve to (validates
                                     # X exists in the sources table)

Output names BOTH the resolved source id AND which tier of the 6-tier
resolution chain won (flag / env / dotfile / local_path /
brain_default / seed_default), plus a `detail` line naming the
winning signal (e.g. "GBRAIN_SOURCE=dept-x" or ".gbrain-source" or
"/work/gstack/src").

Implementation:
- New `resolveSourceWithTier()` in source-resolver.ts as an additive
  variant of `resolveSourceId()`. Walks the same 6 steps in the same
  order; just returns `{ source_id, tier, detail? }` instead of bare
  string. Existing `resolveSourceId()` unchanged — all callers
  continue working.
- New `SOURCE_TIER_NAMES` const + `SourceTier` type export so the
  CLI, doctor (Tier 5 follow-up), and future MCP consumers share one
  vocabulary instead of inlining strings.
- Help text updated; `current` subcommand registered in dispatcher.
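
A rough sketch of the tier ladder, assuming each tier's signal has already been looked up by the caller. `TIERS`, `Signals`, and `pickTier` are hypothetical names for illustration, not the real `resolveSourceWithTier()` signature; only the tier names and their order come from the description above.

```typescript
// Tier names in priority order, as listed in the commit message.
const TIERS = ['flag', 'env', 'dotfile', 'local_path', 'brain_default', 'seed_default'] as const;
type Tier = (typeof TIERS)[number];

// Hypothetical input: at most one pre-resolved signal per tier.
type Signals = Partial<Record<Tier, { source_id: string; detail?: string }>>;
type Resolved = { source_id: string; tier: Tier; detail?: string };

// First tier with a signal wins; the seed default is the final fallback.
function pickTier(signals: Signals, seed = 'default'): Resolved {
  for (const tier of TIERS) {
    const hit = signals[tier];
    if (hit) return { source_id: hit.source_id, tier, detail: hit.detail };
  }
  return { source_id: seed, tier: 'seed_default' };
}

const resolved = pickTier({
  env: { source_id: 'dept-x', detail: 'GBRAIN_SOURCE=dept-x' },
  dotfile: { source_id: 'other', detail: '.gbrain-source' },
});
// resolved.tier is 'env': the env signal outranks the dotfile.
```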

11 new tests pin the 6-tier ladder + priority semantics. Existing
19 source-resolver tests still pass (regression preserved).

Per codex CF3 (the existing src/core/source-resolver.ts was missed
in the original plan). Re-uses the existing helper instead of
inventing a duplicate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(extract): --source-id scopes extraction to one brain source (#1204)

Federated brain users running `gbrain extract` had no way to scope
extraction to one source. The DB path walks all sources together via
listAllPageRefs(), which is correct for cross-source resolution but
sometimes the user wants to extract per-source explicitly (e.g.
re-running extract on a specific source after a manual import).

The pre-existing `--source` flag is the data-source axis (fs|db) and
can't be repurposed. New flag `--source-id <id>` joins it on the
brain-source-id axis:

  gbrain extract all --source db --source-id alpha
    -> walks only alpha-source pages; extracts links + timeline
       from those, into the alpha source

Important: the resolver maps (allSlugs + slugToSources) stay built
from the FULL listAllPageRefs result, not the scoped subset. This
ensures qualified cross-source wikilinks like `[[other-src:slug]]`
still resolve correctly even when the extract walk is scoped — the
filter is on which pages we extract FROM, not what we can resolve TO.

Threaded through both `extractLinksFromDB` and `extractTimelineFromDB`
with backward-compat: callers passing no opts get the old behavior.
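
To illustrate the "extract FROM vs. resolve TO" split, a minimal sketch with a hypothetical `planExtract` helper and `PageRef` shape (the real functions take different arguments):

```typescript
// Hypothetical page reference, standing in for listAllPageRefs() rows.
type PageRef = { slug: string; source_id: string };

function planExtract(allRefs: PageRef[], sourceId?: string) {
  // Resolver maps are ALWAYS built from the full set, so qualified
  // cross-source wikilinks still resolve when the walk is scoped.
  const allSlugs = new Set(allRefs.map((r) => r.slug));
  const slugToSources = new Map<string, string[]>();
  for (const r of allRefs) {
    const sources = slugToSources.get(r.slug) ?? [];
    sources.push(r.source_id);
    slugToSources.set(r.slug, sources);
  }
  // Only the set of pages we extract FROM is filtered.
  const walk = sourceId ? allRefs.filter((r) => r.source_id === sourceId) : allRefs;
  return { allSlugs, slugToSources, walk };
}
```

Callers that pass no `sourceId` walk every page, matching the backward-compatible default.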

4 new test cases pin: walks-all-without-flag baseline,
alpha-only-when-scoped-to-alpha, beta-only-when-scoped-to-beta,
empty-set-on-unknown-source.

Note: #1204's wider "silent 0 links" report on federated brains has
additional facets beyond this flag (resolver path edge cases on
overlapping slugs). The scoped-walk fix gives users an explicit
workaround AND closes the per-source extraction gap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(todos): file v0.37.7.0 follow-ups (#1173, #1204, T5N)

Three items deferred from v0.37.7.0:

1. #1173 .sql indexing — verify-first gate found
   tree-sitter-sql.wasm missing from src/assets/wasm/grammars/.
   Dedicated wave needed: vendor the wasm, add .sql to walker
   filter, address slug-shape collision with #1172.

2. #1204 deeper investigation — wave added --source-id flag as
   workaround. Underlying silent-zero-links bug on unscoped
   federated extracts needs its own /investigate pass against
   a cross-source-duplicate-slug fixture.

3. Tier 5N doctor sweep for dead-lettered subagent jobs matching
   the #1151 fingerprint. Deferred to v0.37.8+ behind the islamabad
   doctor.ts conflict resolution.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(sync): walker skips git submodule directories (#1169)

Sync walker descended into git submodules and indexed their markdown
content as if it belonged to the parent brain. Users with submodules
in their brain repo saw foreign content in their pages table.

Fix: pruneDir gains an optional `parentDir` arg. When set, the helper
stats `<parentDir>/<name>/.git` and skips the directory if `.git`
exists as a FILE (gitfile pointer — the canonical submodule shape).
Directories containing `.git` as a DIRECTORY (a real nested repo,
not a submodule) are descended into; the inner `.git` dir itself is
then dot-prefix-excluded.
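
The file-versus-directory distinction can be sketched as follows. This is illustrative only: `shouldSkipAsSubmodule` and the injected `probe` are hypothetical, and a real probe would stat the path on disk (for example with `fs.statSync`) and map a missing entry to `'missing'`.

```typescript
type GitEntryKind = 'file' | 'dir' | 'missing';

// `.git` as a FILE is the gitfile pointer a submodule checkout uses: skip.
// `.git` as a DIRECTORY is a real nested repo: descend.
function shouldSkipAsSubmodule(
  parentDir: string,
  name: string,
  probe: (path: string) => GitEntryKind, // hypothetical injected stat
): boolean {
  return probe(`${parentDir}/${name}/.git`) === 'file';
}

// Example probe over a fixed table instead of the real filesystem.
const layout: Record<string, GitEntryKind> = {
  '/brain/vendor-notes/.git': 'file', // submodule
  '/brain/nested-repo/.git': 'dir',   // nested repo
};
const probeLayout = (p: string): GitEntryKind => layout[p] ?? 'missing';
```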

Callers updated to pass parentDir:
- src/commands/extract.ts walkMarkdownFiles
- src/core/cycle/transcript-discovery.ts walker

Back-compat preserved: existing pruneDir(name) callers without
parentDir get the pre-v0.37.7.0 behavior unchanged.

Companion `.gitignore`-respect feature from PR #1159 (@jetsetterfl)
NOT in this wave — it would require adding the `ignore` npm package
as a dep, which the plan's "no new deps in this PR" gate excludes.
Filed as follow-up TODO for a dedicated wave.

5 new test cases pin the submodule shape + back-compat + nested-repo
ambiguity. Existing extract-fs / extract-db tests unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(brain-routing): document 6-tier source resolution chain (#1222)

The convention skill didn't have a tier-by-tier reference for how
gbrain resolves the active source. Users running federated brains
had to read the source code to know which signal wins.

Added:
- Canonical 6-tier table (flag → env → dotfile → local_path →
  brain_default → seed_default) matching src/core/source-resolver.ts.
- Pointer to `gbrain sources current` (new in v0.37.7.0) as the
  verification command.
- The CLI-layer trust boundary note: operations.ts handlers don't
  read env/dotfile (preserves v0.34.1.0 source-isolation work for
  MCP callers).
- Per-command flag map: --source, --source-id (extract), and
  --include-foreign (graph-query).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(import): --source-id flag routes pages to a brain source (#1167)

`gbrain import --source dept-x ./pages` silently fell back to the
default source because the CLI parser never consumed --source. PR
#707's design intent excluded the flag explicitly; users had no
signal their pages were going to the wrong place. #1167 + #1222
filed the regression.

Fix: parse `--source-id <id>` (matching v0.37.7.0 extract.ts T2's
naming convention — --source-id stays out of conflict with future
axes that may want --source). When set, the flag value wins over
any programmatic opts.sourceId; back-compat preserved for callers
that pass sourceId via opts only.

Also threaded into the positional-dir arg parser's flagValues set
so `--source-id <value> <dir>` doesn't treat <value> as the dir.
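
A minimal sketch of why the value-taking flag has to be registered with the positional parser. `FLAG_VALUES` and `parseImportArgs` are hypothetical names; the real parser in import.ts handles more flags.

```typescript
// Flags that consume the next token as their value (hypothetical subset).
const FLAG_VALUES = new Set(['--source-id']);

function parseImportArgs(argv: string[]): { dir?: string; sourceId?: string } {
  let dir: string | undefined;
  let sourceId: string | undefined;
  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (FLAG_VALUES.has(arg)) {
      const value = argv[++i]; // consume the value so it is never seen as the dir
      if (arg === '--source-id') sourceId = value;
    } else if (!arg.startsWith('-') && dir === undefined) {
      dir = arg; // first bare token is the positional dir
    }
  }
  return { dir, sourceId };
}
```

Without the `FLAG_VALUES` check, `dept-x` in `--source-id dept-x ./pages` would be the first bare token and would be taken as the directory.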

Note on related surfaces:
- `gbrain query "X" --source_id dept-x` already routed correctly
  via the operations.ts query op (added in v0.34) — no fix needed.
- `gbrain extract --source-id <id>` shipped in T2.
- `gbrain sync --source <id>` already worked (pre-existing).
- `gbrain sources current` (shipped in T4) is the verification
  tool — run it before destructive ops to confirm routing.

Closes the silent-fallback for the import path. Co-authored with
@tyad67-netizen (#1168), @hnshah (#1124, #1120), whose patches
informed the shape; re-implemented against current master per
the wave's "re-implement, credit, close" workflow.

3 new test cases pin: default-without-flag, --source-id-routes-correctly,
flag-value-not-treated-as-dirArg.

Co-Authored-By: tyad67-netizen <noreply@github.com>
Co-Authored-By: hnshah <noreply@github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(autopilot): reconnect classifier + launchd ThrottleInterval (#1162)

Pre-fix: when database_url was unset/malformed, the DB-health-check
reconnect loop logged `config.database_url undefined` forever
because the catch swallowed every error type uniformly. launchd's
KeepAlive=true respawned immediately on any exit, so even when
the process did exit, it came right back into the same bad state.
@colin477 reported the daemon-thrash pattern.

Two-part fix:

1. In-process error classifier — `classifyReconnectError(err)`:
   - `unrecoverable` (database_url missing/empty/malformed, auth
     failure, no-brain-configured): exit immediately with a clear
     stderr line. Pattern-matched against postgres / config-loader
     error shapes. Tests pin the matcher against the #1162
     fingerprint exactly.
   - `recoverable` (network blip, pool saturated, connection refused
     on a port coming up, Supabase 503): retry. Up to
     GBRAIN_AUTOPILOT_MAX_RECONNECT_FAILS (default 30 = ~5min) before
     finally giving up with `max_reconnect_fails_exceeded`.
   - Counter resets on every successful health probe or reconnect.

2. launchd plist gains `ThrottleInterval=60`. Combined with the
   in-process exit, launchd waits 60s before relaunching instead
   of immediate respawn. Pure-function `generateLaunchdPlist()`
   exported for tests.
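
For illustration, the classifier-plus-counter behavior in part 1 might look like the sketch below. The regex list, `classify`, and `nextAction` are hypothetical and far smaller than the real matcher; whether the process gives up on the 30th failure or after it is also an assumption.

```typescript
type ReconnectClass = 'unrecoverable' | 'recoverable';

// Hypothetical patterns; the real matcher is pinned by the project's tests.
const UNRECOVERABLE: RegExp[] = [
  /database_url/i,
  /password authentication failed/i,
  /role ".*" does not exist/i,
];

function classify(err: unknown): ReconnectClass {
  const msg = err instanceof Error ? err.message : String(err);
  return UNRECOVERABLE.some((re) => re.test(msg)) ? 'unrecoverable' : 'recoverable';
}

const MAX_FAILS = 30; // default from the commit message

type Action = { action: 'exit' | 'retry'; reason?: string; fails: number };

// `fails` is the count of consecutive failures so far; callers reset it to 0
// after any successful health probe or reconnect.
function nextAction(err: unknown, fails: number): Action {
  if (classify(err) === 'unrecoverable') return { action: 'exit', reason: 'unrecoverable', fails };
  const n = fails + 1;
  if (n >= MAX_FAILS) return { action: 'exit', reason: 'max_reconnect_fails_exceeded', fails: n };
  return { action: 'retry', fails: n };
}
```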

16 new test cases:
- 11 classifier cases (database_url shapes, malformed URL, auth,
  role-does-not-exist with quoted name, network blip, pool
  saturated, 503, non-Error inputs, case-insensitivity)
- 5 plist generator cases (ThrottleInterval=60, KeepAlive
  preserved, wrapper path, XML escaping, StandardErrorPath).

Pre-existing autopilot-lock-path tests unchanged — both fixes
land cleanly side-by-side.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(oauth): confidential clients via custom /token middleware (#1166)

v0.34.1.0 (#909) fixed PUBLIC PKCE clients (client_secret=undefined)
by normalizing NULL → undefined in getClient. Confidential clients
regressed: the MCP SDK's clientAuth middleware does plaintext
`client.client_secret !== presented_secret` compare, but gbrain
stores SHA-256 hashes, so the SDK's compare always failed for
authorization_code and refresh_token grants on confidential clients.
Result: /token returned `invalid_client` for every confidential
exchange.

Fix shape per locked-decision-5: custom /token middleware BEFORE the
SDK's authRouter, similar to the pre-existing client_credentials
handler. The middleware:

1. Detects confidential auth via `client_secret` in body
   (client_secret_post) OR `Authorization: Basic` header
   (client_secret_basic per RFC 6749 §2.3.1).
2. Falls through to the SDK when neither is present (public PKCE
   path stays canonical, preserves v0.34.1.0 behavior).
3. Calls new `verifyConfidentialClientSecret(clientId, presented)`
   on the provider which does SHA-256 hash compare ourselves
   (same shape as exchangeClientCredentials' existing hash check).
4. On verification success, calls existing
   `exchangeAuthorizationCode` / `exchangeRefreshToken` directly
   with the validated client.
5. RFC 6749 §5.2 error semantics: 401 invalid_client for auth
   failures, 400 invalid_grant for code/token problems.
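
Steps 1 to 3 can be sketched as below. This is an illustrative sketch, not the middleware itself: `sha256Hex`, `secretMatches`, and `presentedCredentials` are hypothetical names, and storing the hash as a hex string is an assumption.

```typescript
import { createHash, timingSafeEqual } from 'node:crypto';

// Assumption: the stored value is a hex-encoded SHA-256 digest of the secret.
function sha256Hex(value: string): string {
  return createHash('sha256').update(value, 'utf8').digest('hex');
}

// Hash the presented secret ourselves, then compare digests in constant time.
function secretMatches(storedHashHex: string | null, presented: string): boolean {
  if (!storedHashHex) return false; // public client: nothing to verify against
  const a = Buffer.from(sha256Hex(presented), 'hex');
  const b = Buffer.from(storedHashHex, 'hex');
  return a.length === b.length && timingSafeEqual(a, b);
}

type Creds = { clientId: string; secret: string };

// client_secret_post (body) first, then client_secret_basic (header).
// Returns null when neither is present, i.e. the public PKCE path.
function presentedCredentials(
  body: { client_id?: string; client_secret?: string },
  authorization?: string,
): Creds | null {
  if (body.client_id && body.client_secret) {
    return { clientId: body.client_id, secret: body.client_secret };
  }
  const match = /^Basic\s+(\S+)$/i.exec(authorization ?? '');
  if (!match) return null;
  const decoded = Buffer.from(match[1], 'base64').toString('utf8');
  const colon = decoded.indexOf(':');
  if (colon < 0) return null;
  // RFC 6749 §2.3.1 form-urlencodes both parts; that decoding is omitted here.
  return { clientId: decoded.slice(0, colon), secret: decoded.slice(colon + 1) };
}
```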

Per CLAUDE.md "GBRAIN:RLS_EXEMPT" annotation contract: this surface
sits in front of the SDK's clientAuth and doesn't depend on the
SDK's plaintext compare working — the SDK's middleware never
fires for confidential paths the new middleware claims.

7 new test cases pin: correct-secret-returns-client, wrong-secret
opaque rejection, non-existent client, public-client refuses
the confidential path, case-sensitivity, soft-deleted revocation,
verify-then-exchange-refresh round-trip with second-use rejection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(doctor): 3 new checks — source routing + oauth + autopilot lock (T12/T13/T14)

Three v0.37.7.0 doctor checks landing in one atomic commit (single
file, shared merge-conflict surface with garrytan/islamabad-v3 per
locked decision 1):

1. source_routing_health (T12 / #1167):
   Sample non-default sources for pages; warn when a registered
   source has zero pages (silent-collapse-to-default fingerprint).
   D5 lock: total-sample cap of 200 pages across all sources, with
   per-source cap = min(50, ceil(200/N)) so a 20-source CEO brain
   pays 200 selects, not 1000. Fix hint paste-ready to
   `gbrain sources current --json` for verification.

2. oauth_confidential_client_health (T13 / #1166):
   Probe every oauth_clients row. Confidential clients (auth_method
   != 'none') must have a non-NULL client_secret_hash; if any row
   claims confidential auth but stores NULL hash, that's the
   pre-v0.37.7.0 regression. Public clients (auth_method='none')
   correctly keep NULL hash per v0.34.1.0 #909. Fix hint:
   `gbrain auth revoke-client + register-client` OR `gbrain upgrade`.
   Pre-OAuth schemas (missing oauth_clients table) skip gracefully.

3. autopilot_lock_scope (T14 / #1226):
   Detect stale ~/.gbrain/autopilot.lock outside the current
   GBRAIN_HOME. Codex CF11: dangerous to paste-ready `rm` without
   verifying the owning PID isn't a live process. Hint reads the
   PID file and gives the user a `ps -p <pid>` check before any
   delete — matches sshd-style stale-lock recovery hints.
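
The per-source cap from check 1 works out as in this small sketch (`perSourceCap` is a hypothetical helper name; the constants are the ones quoted above):

```typescript
const TOTAL_CAP = 200;     // total-sample cap across all sources
const PER_SOURCE_MAX = 50; // upper bound for any single source

// per-source cap = min(50, ceil(200 / N))
function perSourceCap(sourceCount: number): number {
  if (sourceCount <= 0) return 0;
  return Math.min(PER_SOURCE_MAX, Math.ceil(TOTAL_CAP / sourceCount));
}

// 20 sources: ceil(200 / 20) = 10 each, so 20 * 10 = 200 selects (not 20 * 50 = 1000).
// 3 sources:  ceil(200 / 3) = 67, clamped to 50 each, so 150 selects.
// 7 sources:  ceil(200 / 7) = 29 each, so 203 selects; rounding up can
//             overshoot the 200 total slightly.
```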

9 new test cases pin the canonical paths. Pre-existing 80+ doctor
checks unchanged.

Expected to conflict with garrytan/islamabad-v3 at merge time. The
3 new check functions live in their own block far from the
islamabad skill_brain_first check; the conflict surface should be
limited to the `checks.push(...)` call site near the end of
runDoctor's DB-checks phase (~10 lines).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(test): withEnv wrapper in source-resolver-with-tier (test-isolation lint)

The new source-resolver-with-tier.test.ts from T4 mutated
process.env.GBRAIN_SOURCE directly in two cases, which violates
scripts/check-test-isolation.sh R1 (env mutations leak across
parallel-loaded test files in the same shard process).

Fix: wrap both mutation sites in withEnv() from test/helpers/with-env.ts,
which saves+restores via try/finally per the canonical pattern in
CLAUDE.md.
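
A minimal sketch of the save-and-restore pattern, assuming a synchronous callback. This is not the helper in test/helpers/with-env.ts: its real signature is not shown here, and the explicit `env` parameter is a hypothetical addition so the sketch does not touch `process.env`.

```typescript
type Env = Record<string, string | undefined>;

// Apply overrides, run fn, and always restore the previous values.
function withEnv<T>(env: Env, overrides: Env, fn: () => T): T {
  const saved: Env = {};
  for (const key of Object.keys(overrides)) {
    saved[key] = env[key];
    if (overrides[key] === undefined) delete env[key];
    else env[key] = overrides[key];
  }
  try {
    return fn();
  } finally {
    // Runs on success and on throw, so mutations never leak to other tests.
    for (const key of Object.keys(saved)) {
      if (saved[key] === undefined) delete env[key];
      else env[key] = saved[key];
    }
  }
}
```

An async variant would need to await `fn()` inside the `try` so the restore happens after the promise settles.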

Pure refactor — all 11 cases still green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: update project documentation for v0.37.7.0

CHANGELOG.md — populated the "What landed" stub with the 18-commit
brisbane wave (source-id flag threading, sources current subcommand,
graph-query foreign-edge footer, autopilot lockfile scope + reconnect
classifier + launchd ThrottleInterval, OAuth confidential client
middleware, reindex-frontmatter connect fix, subagent terminal-on-resume
fix, sync walker submodule skip, 3 new doctor checks, brain-routing.md
convention skill). Voice: ELI10 lead, capability table, paste-ready
verification, "what's safe to know" + "what we caught" sections.

CLAUDE.md — extended Key Files annotations for the v0.37.7.0 changes:
import/extract --source-id flags, sources current subcommand, graph-query
--include-foreign, resolveSourceWithTier() additive helper, autopilot
classifyReconnectError + generateLaunchdPlist exports, OAuth confidential
client middleware, pruneDir submodule detection, subagent terminal
short-circuit, 3 new doctor checks. Pinned by their test files.

llms-full.txt — regenerated via `bun run build:llms` (CI guard at
test/build-llms.test.ts will fail otherwise).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: rafaelreis-r <noreply@github.com>
@spotshare-nick


Re: the "requires ignore npm package" exclusion in #1253

Just want to flag for the record that this PR as it stands adds no npm dependencies. The diff touches only src/cli.ts, src/commands/import.ts, src/commands/sync.ts, and test/import-walker.test.ts — there's no package.json change. The gitignore set is built by shelling out to git:

execFileSync('git', ['-C', root, 'ls-files', '-o', '-i', '--exclude-standard', '--directory'], )

execFileSync is a Node/Bun builtin (child_process), and git is already an assumed runtime tool everywhere else in the sync path (the incremental walker uses git ls-files --exclude-standard today). So the "no new deps" gate that excluded this in the v0.37.7.0 wave should actually pass — there's nothing to vendor. I think the exclusion may have conflated this PR with the broader .gbrainignore / #449 work, which would want a glob library for non-git roots.

Two paths, your call:

Path A — keep the git subprocess (this PR). Zero npm deps, correct gitignore semantics for free (negations, **, .git/info/exclude, global excludes), and directory-level pruning via --directory. The one constraint is it only fires inside a git work tree — non-git roots fall back to legacy behavior, which this PR already documents as intentional.

Path B — if you'd rather not depend on the git executable at all, the same intent can be done in-repo without a glob library, by reading .gitignore and matching it ourselves. Rough sketch:

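// Parse .gitignore (+ .git/info/exclude) into ordered rules.
import { existsSync, readFileSync } from 'node:fs';
import { join } from 'node:path';

type Rule = { re: RegExp; negate: boolean; dirOnly: boolean };

// Read the root-level ignore files; a missing file contributes no lines.
// info/exclude goes first so .gitignore rules, read later, take precedence.
function readIgnoreLines(root: string): string[] {
  return [join('.git', 'info', 'exclude'), '.gitignore']
    .map((f) => join(root, f))
    .filter((p) => existsSync(p))
    .flatMap((p) => readFileSync(p, 'utf8').split(/\r?\n/));
}

function loadRules(root: string): Rule[] {
  const rules: Rule[] = [];
  for (const line of readIgnoreLines(root)) {
    let pat = line.trim();
    if (!pat || pat.startsWith('#')) continue;            // skip blanks and comments
    const negate = pat.startsWith('!'); if (negate) pat = pat.slice(1);
    const dirOnly = pat.endsWith('/'); if (dirOnly) pat = pat.replace(/\/$/, '');
    // A leading OR interior slash anchors the pattern to the root (gitignore rule).
    const anchored = pat.includes('/'); if (pat.startsWith('/')) pat = pat.slice(1);
    rules.push({ re: globToRegExp(pat, anchored), negate, dirOnly });
  }
  return rules;
}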

// gitignore glob -> RegExp:  ** = any depth, * = within a segment, ? = one char.
// Unanchored patterns match at any path depth; anchored match from root.
function globToRegExp(pat: string, anchored: boolean): RegExp {
  // Translate in ONE pass: chained .replace() calls would rewrite the '*' and '?'
  // that earlier replacements had just emitted (e.g. inside '(?:.*/)?').
  const body = pat.replace(/\*\*\/|\*\*|\*|\?|[.+^${}()|[\]\\]/g, (m) => {
    if (m === '**/') return '(?:.*/)?';    // **/  -> any number of dirs
    if (m === '**') return '.*';           // **   -> anything
    if (m === '*') return '[^/]*';         // *    -> within one segment
    if (m === '?') return '[^/]';          // ?    -> one non-slash char
    return '\\' + m;                       // escape regex metachars
  });
  const prefix = anchored ? '^' : '(?:^|/)';
  return new RegExp(`${prefix}${body}(?:/.*)?$`);        // dir matches its children
}

// Last matching rule wins (gitignore precedence); negation re-includes.
function isIgnored(relPath: string, isDir: boolean, rules: Rule[]): boolean {
  let ignored = false;
  for (const r of rules) {
    if (r.dirOnly && !isDir) continue;
    if (r.re.test(relPath)) ignored = !r.negate;
  }
  return ignored;
}

Then in the walker, prune at descent time exactly like the current PR:

const rel = relative(root, full);
if (isIgnored(rel, isDirEntry, rules)) continue;

That's ~40 lines, no dependency, and gives the same directory-pruning win for both git and non-git roots, at the cost of owning a (small) slice of gitignore semantics ourselves. It won't be 100% spec-complete the way git is (e.g. nested per-directory .gitignore files, some edge cases), but it covers the dist/, coverage/, and out/ pain from #1073.

Happy to push either shape — Path A is already in the diff and is the lower-risk merge; Path B if you want to drop the git executable assumption entirely. Either way it lands without a new npm dep.


This comment was drafted with the help of Claude Code.

garrytan pushed a commit that referenced this pull request Jun 12, 2026
collectSyncableFiles (the full-sync / dry-run enumerator) reimplemented its
own directory skip list inline (node_modules || ops), bypassing the canonical
pruneDir gate and ignoring .gitignore entirely. On a Laravel/PHP repo this
descended into vendor/ (~50k Composer files), storage/, and public/build/,
trying to import 52k dependency/build files and flooding the index with
library internals (a 35-min sync that never finished, killed by the watchdog
at 3%).

- collectSyncableFiles now enumerates via `git ls-files --cached --others
  --exclude-standard` when dir is a git work tree, so the walk honors
  .gitignore (tracked + untracked-not-ignored). Falls back to the FS walk for
  non-git dirs. EroLab: 52164 -> 1028 files.
- The FS fallback now prunes through the canonical pruneDir() instead of a
  drifted inline list, so the two skip lists can't diverge again.
- PRUNE_DIR_NAMES gains vendor/dist/build (dependency + build-output trees).
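
The git-first enumeration with FS fallback can be sketched as below. This is an illustration under assumptions: `enumerateFiles` and its injected `runGit` / `fsWalk` parameters are hypothetical. A real `runGit` would shell out to `git -C <root> ls-files --cached --others --exclude-standard` and throw when the directory is not a git work tree or git is missing.

```typescript
type Enumerated = { files: string[]; via: 'git' | 'fs' };

function enumerateFiles(
  root: string,
  runGit: (root: string) => string,   // returns newline-separated paths, or throws
  fsWalk: (root: string) => string[], // legacy walker, pruned via pruneDir()
): Enumerated {
  try {
    const stdout = runGit(root);
    // Tracked plus untracked-not-ignored paths, one per line.
    return { files: stdout.split('\n').filter((line) => line.length > 0), via: 'git' };
  } catch {
    // Non-git directory or git unavailable: fall back to the FS walk.
    return { files: fsWalk(root), via: 'fs' };
  }
}
```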

Addresses #1483 (.gbrainignore), #1159 (--respect-gitignore), and the
maintainer's #1942 vendor/dist/build prune. Walker regression suites
(sync-walker-symlink, brain-writer-walk-prune, sync, sync-walker-submodule)
green: 90 pass.
garrytan added a commit that referenced this pull request Jun 12, 2026
…unity PRs (#2128)

* fix(oauth): default omitted authorize scope to client's full grant

When a client omits `scope` on /authorize, the authorize() grant computed
`(params.scopes || []).filter(...)` → the empty set. That empty grant was
written to oauth_codes and propagated into the access AND refresh tokens, so
every request failed `insufficient_scope` even though the client was
registered with e.g. `read write`. Because refresh inherits the stored grant,
it never self-healed — reconnecting just minted another empty-scoped token.

Some MCP connectors (observed with Claude Desktop) omit `scope` on /authorize,
so they hit this on every connection.

Fix: when no scope is requested, default to the client's full registered scope
(RFC 6749 §3.3 permits a server default). This mirrors exchangeClientCredentials,
which already does `requestedScope ? ... : allowedScopes`. The result is still
clamped to the allowed set, so an explicit over-broad request cannot escalate.

Adds test/oauth-authorize-scope-default.test.ts covering: omitted/empty →
inherits full grant; explicit subset honored; clamp preserved (over-broad and
disallowed-only requests cannot escalate or trigger inheritance).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
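
The scope rule in the oauth fix above (default to the full registered grant when nothing is requested, then clamp) can be illustrated with a small pure function. This is a sketch under assumptions: `resolveGrantedScopes` is a hypothetical name, and the real authorize() works on its own parameter and storage types.

```typescript
// Hypothetical sketch of the described rule, not the project's authorize().
function resolveGrantedScopes(requested: string[] | undefined, allowed: string[]): string[] {
  // Omitted or empty request: inherit the client's full registered grant.
  const base = requested && requested.length > 0 ? requested : allowed;
  // Always clamp to the allowed set, so an over-broad request cannot escalate.
  return base.filter((scope) => allowed.includes(scope));
}

const registered = ["read", "write"];
console.log(resolveGrantedScopes(undefined, registered));         // full grant
console.log(resolveGrantedScopes(["read"], registered));          // explicit subset
console.log(resolveGrantedScopes(["read", "admin"], registered)); // clamped
console.log(resolveGrantedScopes(["admin"], registered));         // empty, no inheritance
```

Inheritance is decided on the raw request, before clamping, which is why a disallowed-only request ends up empty rather than falling back to the full grant.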

* fix(sync): skip Python venv/ in the code walker

collectSyncableFiles (first-sync walker) and the incremental PRUNE_DIR_NAMES
set skipped node_modules but not Python venv/. On a Python repo the walker
descended into venv/ (thousands of files); the resulting slug collisions
crashed putPage's INSERT ... ON CONFLICT ... RETURNING with
"undefined is not an object (evaluating 'row.deleted_at')".

Add `venv` alongside node_modules in both the import.ts inline skip and
PRUNE_DIR_NAMES. venv is the Python equivalent of node_modules.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(gateway): carry asymmetric input_type across the AI SDK to the wire body (#1400)

dimsProviderOptions() threads input_type ('query' | 'document') into
providerOptions.openaiCompatible for asymmetric models (ZE zembed-1,
Voyage v3+), but the AI SDK's openai-compatible adapter validates
providerOptions against a fixed schema and silently drops the field
before building the HTTP body. Every embedQuery() was therefore encoded
document-side: the ZE shim's hard default fired ('document'), Voyage and
local openai-compat servers got no input_type at all, and asymmetric
retrieval silently collapsed toward surface-token overlap — while the
providerOptions-level contract test stayed green.

Fix: an AsyncLocalStorage (same pattern as __budgetStore) populated in
embedSubBatch() only when providerOptions actually threads an
input_type, read at body-rewrite time by the fetch shims:
- zeroEntropyCompatFetch: recovers the threaded value; document default
  preserved for ingest paths.
- voyageCompatFetch: opt-in like the dims.ts Voyage branch — inject only
  when threaded; the field stays off the wire otherwise.
- NEW openAICompatAsymmetricFetch: fallthrough default for every other
  openai-compatible recipe (llama-server, litellm, ollama, ...) — the
  canonical local/proxy paths for asymmetric models. Strict pass-through
  when nothing was threaded, so symmetric deployments see zero wire
  change; recipes with their own compat fetch (azure) keep it via the
  compat.fetch ?? precedence.

KNOBS_HASH_VERSION bumped 10→11: cached query_cache rows were keyed on
document-side query vectors; pre-fix rows must not be served to post-fix
lookups (same convention as the v=3 embedding-provider bump). One-time
global cold-miss on upgrade; refills within cache.ttl_seconds.

Tests: test/embed-input-type-wire.test.ts runs the REAL SDK transport
with a mocked global fetch and asserts on the outbound body — the only
layer where this regression is observable. Covers ZE hosted, llama-server,
litellm, ollama (query + document sides) and pins the pass-through for
non-asymmetric models and Voyage's opt-in shape. 4 of the original 7
assertions fail on master, proving the pin. One structural pin in
test/ai/zeroentropy-compat-fetch.test.ts updated to the new line shape
(same semantic); KEY_FILES.md gateway.ts entry updated to the new truth.

Supersedes #1400 (closed unmerged) — same ALS mechanism, extended to
Voyage + all openai-compatible recipes. Credit to @billy-armstrong for
the original diagnosis.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
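
The AsyncLocalStorage mechanism described in the gateway fix above can be sketched as follows. It is a simplified illustration only: `inputTypeStore`, `withInputType`, and `rewriteBody` are hypothetical names, and the real fetch shims rewrite an HTTP request body rather than a plain object.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Hypothetical sketch: carry input_type around a layer that drops it.
type InputType = "query" | "document";
const inputTypeStore = new AsyncLocalStorage<InputType>();

// Populate the store only when an input_type was actually threaded.
function withInputType<T>(inputType: InputType | undefined, fn: () => T): T {
  return inputType === undefined ? fn() : inputTypeStore.run(inputType, fn);
}

// Stand-in for a fetch shim's body rewrite: inject only when threaded,
// strict pass-through otherwise.
function rewriteBody(body: Record<string, unknown>): Record<string, unknown> {
  const threaded = inputTypeStore.getStore();
  return threaded === undefined ? body : { ...body, input_type: threaded };
}

console.log(withInputType("query", () => rewriteBody({ model: "m", input: ["hi"] })));
console.log(withInputType(undefined, () => rewriteBody({ model: "m", input: ["hi"] })));
```

Because the store is scoped to the callback passed to `run`, nothing leaks into calls made outside it, which matches the "zero wire change" property claimed for symmetric deployments.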

* fix(sync): honor .gitignore in code walk; prune vendor/dist/build

collectSyncableFiles (the full-sync / dry-run enumerator) reimplemented its
own directory skip list inline (node_modules || ops), bypassing the canonical
pruneDir gate and ignoring .gitignore entirely. On a Laravel/PHP repo this
descended into vendor/ (~50k Composer files), storage/, and public/build/,
trying to import 52k dependency/build files and flooding the index with
library internals (a 35-min sync that never finished, killed by the watchdog
at 3%).

- collectSyncableFiles now enumerates via `git ls-files --cached --others
  --exclude-standard` when dir is a git work tree, so the walk honors
  .gitignore (tracked + untracked-not-ignored). Falls back to the FS walk for
  non-git dirs. EroLab: 52164 -> 1028 files.
- The FS fallback now prunes through the canonical pruneDir() instead of a
  drifted inline list, so the two skip lists can't diverge again.
- PRUNE_DIR_NAMES gains vendor/dist/build (dependency + build-output trees).

Addresses #1483 (.gbrainignore), #1159 (--respect-gitignore), and the
maintainer's #1942 vendor/dist/build prune. Walker regression suites
(sync-walker-symlink, brain-writer-walk-prune, sync, sync-walker-submodule)
green: 90 pass.

* fix(config): ignore DATABASE_URL auto-loaded from cwd .env (#427)

Bun merges .env files from the process cwd into process.env before any
user code runs. loadConfig() prefers env DATABASE_URL over
~/.gbrain/config.json, so any gbrain invocation from inside a web-app
checkout silently retargets the brain at that app's database — reads go
to the wrong DB and apply-migrations can write gbrain's schema into a
production app database (#427).

effectiveEnvDatabaseUrl() re-parses the .env files Bun auto-loads from
cwd and treats a DATABASE_URL whose value matches one of them as
file-origin: ignored, with a one-time stderr notice. GBRAIN_DATABASE_URL
and genuinely exported DATABASE_URLs are honored unchanged, so the
operator escape hatch and the e2e suite's env-provided URL keep working.
Applied at loadConfig, getDbUrlSource (doctor parity), init
--non-interactive, and migrate --to.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
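
A rough sketch of the file-origin check described above, assuming a deliberately simplified .env parser (not Bun's full grammar). `parseDotenv` and `sketchEffectiveDatabaseUrl` are hypothetical names, and the precedence of GBRAIN_DATABASE_URL over DATABASE_URL is an assumption the commit message does not state.

```typescript
// Hypothetical sketch; simplified parsing, illustrative names.
function parseDotenv(text: string): Map<string, string> {
  const vars = new Map<string, string>();
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim().replace(/^export\s+/, "");
    if (line === "" || line.startsWith("#")) continue;
    const eq = line.indexOf("=");
    if (eq <= 0) continue;
    const key = line.slice(0, eq).trim();
    let value = line.slice(eq + 1).trim();
    const quote = value.charAt(0);
    // Strip one pair of matching surrounding quotes.
    if (value.length >= 2 && (quote === '"' || quote === "'") && value.endsWith(quote)) {
      value = value.slice(1, -1);
    }
    vars.set(key, value);
  }
  return vars;
}

interface EnvLike { DATABASE_URL?: string; GBRAIN_DATABASE_URL?: string }

// Returns the URL to honor, or undefined so the caller falls through to
// its config file.
function sketchEffectiveDatabaseUrl(env: EnvLike, dotenvTexts: string[]): string | undefined {
  // Assumption: the explicit escape hatch wins when both are set.
  if (env.GBRAIN_DATABASE_URL) return env.GBRAIN_DATABASE_URL;
  const url = env.DATABASE_URL;
  if (!url) return undefined;
  // A value that matches a cwd .env file is treated as file-origin and ignored.
  const fileOrigin = dotenvTexts.some((text) => parseDotenv(text).get("DATABASE_URL") === url);
  return fileOrigin ? undefined : url;
}

console.log(sketchEffectiveDatabaseUrl({ DATABASE_URL: "postgres://app" }, ["DATABASE_URL=postgres://app\n"]));
console.log(sketchEffectiveDatabaseUrl({ DATABASE_URL: "postgres://brain" }, ["DATABASE_URL=postgres://app\n"]));
```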

* fix(cli): arm the disconnect hard-deadline at teardown entry, not before the op body

The 10s force-exit timer in the shared-op dispatch was armed BEFORE the
try block, so any op whose handler ran past 10s wall-clock was killed
mid-flight with process.exit(0) and zero stdout. On a slow Postgres
pooler (6-10s per fresh connection) a healthy `gbrain search` was
force-exited every time — an empty 'success' indistinguishable from no
results. The v0.42.20.0 exitCode honor can't help: a mid-op kill fires
before any error path sets exitCode.

Move the arming into the finally (teardown entry), matching the
fall-through owner-disconnect site later in main(): the timer still
bounds a hung drain/disconnect (the C13 contract) but can no longer
kill a slow-but-progressing op. Verified on a transaction-pooler
Supabase brain: search went from 0 bytes/exit 0 at 10s to real results
at ~21s.

* fix(import): stamp source_id on extracted call-graph edges

importCodeFile built CodeEdgeInput rows without source_id, so every
edge landed NULL. getCallersOf/getCalleesOf filter
`AND source_id = <scoped>` whenever a worktree pin or --source is in
play — NULL never matches, so scoped call-graph queries silently
returned 0 rows on multi-source brains even though the edges existed
(2,122 edges, 26 targeting the probed symbol, count 0 returned).

One-line fix: carry the sourceId already in scope into the edge input.
Existing NULL rows backfill with:
  UPDATE code_edges_symbol e SET source_id = p.source_id
    FROM content_chunks c JOIN pages p ON p.id = c.page_id
   WHERE c.id = e.from_chunk_id AND e.source_id IS NULL;
(same for code_edges_chunk). Verified: code-callers returns 21 callers
where it returned 0.

* docs(migrations): NULL embeddings BEFORE the column-type alter

The Postgres recipe ordered ALTER COLUMN TYPE vector(N) before the
UPDATE that clears stale embeddings. pgvector refuses to cast existing
vectors across dimensions ('expected 1024 dimensions, not 1536'), so
the recipe as written aborts the transaction on any brain that has
embeddings — which is every brain doing this migration. Swap the steps:
NULLs cast fine.
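
The corrected step order can be pinned in code. The sketch below only illustrates the ordering; `dimMigrationRecipe` is a hypothetical helper, the `content_chunks` table and `embedding` column names are taken from elsewhere in this thread, and the real recipe in docs/embedding-migrations.md may contain further steps.

```typescript
// Hypothetical sketch: emit the two statements in wipe-then-alter order.
function dimMigrationRecipe(table: string, newDims: number): string[] {
  if (!/^[a-z_][a-z0-9_]*$/.test(table)) throw new Error(`unexpected table name: ${table}`);
  if (!Number.isInteger(newDims) || newDims <= 0) throw new Error(`invalid dimension: ${newDims}`);
  return [
    // 1. Clear stale vectors first: NULLs cast across dimensions without error.
    `UPDATE ${table} SET embedding = NULL;`,
    // 2. Only then change the column type to the new width.
    `ALTER TABLE ${table} ALTER COLUMN embedding TYPE vector(${newDims});`,
  ];
}

console.log(dimMigrationRecipe("content_chunks", 1024).join("\n"));
```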

* fix: honor legacy token source grants in oauth

* fix(cli): bound read-scope op handlers at 180s wallclock (pre-landing review)

With the hard-deadline timer correctly scoped to teardown, a genuinely
wedged read handler (hung pooler connection mid-query) would hang the
CLI forever — the #1633 zombie class the old pre-try timer accidentally
bounded at 10s. Reads now get a generous withTimeout (180s default, far
above any healthy slow-pooler run; --timeout=Ns overrides; exit 124 with
the teardown finally still draining + disconnecting). Writes/admin stay
unbounded: a long import/embed must never be killed by a default.

* fix(import): stamp unscoped edges 'default', matching the pages-table default

Review catch: 'sourceId ?? null' fixed the scoped path but left the
unscoped one (reindex --code without --source, importCodeFile callers
without opts.sourceId) stranding edges at NULL while their pages land
under the schema default (pages.source_id DEFAULT 'default') — so
getCallersOf(sym, { sourceId: 'default' }) missed them. Same bug,
other door. Fallback is now 'default'.

* fix(core): runtime dim-migration recipe NULLs embeddings before the alter

Review catch: the doc fix corrected docs/embedding-migrations.md, but
embeddingMismatchMessage still PRINTED the broken order — ALTER before
UPDATE ... SET embedding = NULL — and linked to the now-contradicting
doc. pgvector refuses to cast existing vectors across dimensions, so
the printed recipe aborted on any brain that has embeddings. Swap the
steps and say why inline.

* feat(migrate): v116 — backfill NULL edge source_id + index from_symbol_qualified

1. Backfill: edges written before the stamping fix sit at source_id=NULL
   and stay invisible to scoped call-graph queries until repaired. Derive
   each edge's source from its own from_chunk's page (pages.source_id is
   NOT NULL DEFAULT 'default'). Same SQL verified live on a 2,122-edge
   production brain.
2. Indexes: getCalleesOf filters both edge tables on from_symbol_qualified,
   which had no index — every callee lookup was a seq scan, amplified
   per-BFS-node by the recursive code walk. With NULL edges repaired,
   scoped walks actually expand, so the latent cost becomes real.
   Mirrored into src/schema.sql; schema-embedded.ts regenerated.

* docs(migrations): align the rationale list with the corrected recipe order

The 'Why we don't do this automatically' list still said alter-then-wipe;
reorder to wipe-then-alter and replace the fragile 'step 3' numeric
cross-reference with a name-based one.

* test: regression coverage for edge source_id stamping, timer placement, recipe order

- import-code-edges-source-id: scoped import stamps edges + scoped
  getCallersOf/getCalleesOf match (verified failing pre-fix), plus the
  unscoped-import case asserting 'default' stamping.
- cli-force-exit-teardown-arming: structural pin — the hard-deadline
  timer arms inside the finally (teardown entry), never before the op
  body; daemon guard, unref, clearTimeout intact.
- embedding-dim-check: recipe order pinned — UPDATE precedes ALTER so
  the printed SQL can't drift from docs/embedding-migrations.md again.

* fix(cli): hard-exit after teardown on wallclock timeout; bound makeContext too

Adversarial review, two findings on the new timeout path:
1. On timeout the finally drained, disconnected, then CLEARED the
   hard-deadline timer — removing the only backstop while the abandoned
   handler (withTimeout races, it does not cancel) can hold ref'd
   sockets/SDK timers that keep Bun's loop alive: 'timed out' printed,
   process immortal — the zombie class this branch exists to kill,
   resurrected through its own fix. The finally now exits explicitly
   after teardown completes on the timeout path.
2. makeContext does DB I/O (resolveSourceId) for EVERY op and sat
   outside any bound — a pooler wedge at context build hung reads,
   writes, and admin alike. It now shares the same wallclock bound.

* fix(import): normalize edge source once — closes the '' door and the unscoped chunk fan-out

Adversarial review: txOpts used truthiness while the edge stamp used
nullish — sourceId:'' put pages under 'default' but stamped edges '',
FK-violating against sources(id) and silently dropping the file's whole
call graph in the best-effort catch. The unscoped getChunks could also
fan out to same-slug chunks from another source. One normalized
edgeSourceId (sourceId || 'default') now drives both the chunk lookup
and the stamp.
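
The truthiness-versus-nullish mismatch behind this fix is easy to reproduce in isolation. A minimal sketch, with `normalizeEdgeSourceId` as a hypothetical helper name:

```typescript
// Hypothetical sketch of the normalization the commit describes.
function normalizeEdgeSourceId(sourceId: string | null | undefined): string {
  // || treats '' like null/undefined, so the empty string also maps to 'default'.
  return sourceId || "default";
}

// The nullish form keeps '' as-is, which is the mismatch the review found.
function nullishOnly(sourceId: string | null | undefined): string {
  return sourceId ?? "default";
}

console.log(JSON.stringify(normalizeEdgeSourceId(""))); // "default"
console.log(JSON.stringify(nullishOnly("")));           // ""
console.log(normalizeEdgeSourceId("docs"));             // docs
```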

* fix(engine): default edge source_id to 'default' at the insert layer (both engines)

Adversarial review: addCodeEdges still wrote e.source_id ?? null, so any
future caller that forgets the field reintroduces invisible NULL edges
the day after the v116 backfill runs. A NULL source_id is invisible to
every scoped call-graph query; default to the schema-default source the
way the pages table does. Applied to both engines (parity).

* fix(core): facts alter recipe NULLs embeddings before cross-dimension alters

Adversarial review: buildFactsAlterRecipe shipped the same defect class
this branch fixes for content_chunks 350 lines up — a cross-dimension
ALTER ... USING cast that pgvector refuses while rows hold old-width
vectors. Dimension changes now wipe first (the facts pipeline re-embeds
on next write); same-dim type swaps (halfvec <-> vector) keep the
lossless cast and PRESERVE data. Both behaviors pinned by tests.

* v0.42.39.0 chore: version bump + CHANGELOG + TODOS

Marks the v0.42.20.0 'decouple the op-dispatch force-exit timer' follow-up
complete — this branch ships exactly that decoupling.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(postgres-engine): atomic JSONB merge in updateSourceConfig — eliminate lost-update race

## Problem

`updateSourceConfig` used a read-then-write pattern: read the current
`config` row, normalize it in JavaScript, then write the merged result
back with `SET config = <normalized> || <patch>`.

Under concurrent callers (two background autopilot/cycle paths patching
different keys simultaneously), both callers can read the same stale
row. The later `SET config = ...` then clobbers the earlier patch,
silently dropping whatever keys the first caller wrote. Reproduced
at 21/25 lost-update events under real Postgres with parallel callers.

## Fix

Fold the normalization and merge into a single atomic `UPDATE … SET
config = CASE … END || patch` statement. Because the `SET` expression
evaluates against the row-locked latest version of `config`, there is
no snapshot window between the read and the write. Concurrent callers
now converge correctly (50/50 clean in reproduction test).

The `CASE` also normalizes historical bad JSONB shapes inline:
- `object` — used as-is
- `string` — double-encoded config; inner text parsed with the SQL
  `IS JSON` guard (Postgres 16+) so unparseable strings fall back to
  `{}` instead of raising `invalid input syntax for type json`
- `array` — array of patch objects aggregated into a flat object via
  `jsonb_object_agg`
- anything else — falls back to `{}`

`pglite-engine.updateSourceConfig` already used an atomic `||` merge;
this change brings postgres-engine to parity.

## Test

Added two assertions to `test/list-all-sources.test.ts`:
1. JSONB string holding non-JSON text normalizes to `{}` (no cast throw)
2. JSONB string holding double-encoded valid JSON is parsed then merged
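
The lost-update window described under "Problem" can be illustrated with an in-memory simulation. This is not the Postgres code: `readThenWrite` and `atomicMerge` are hypothetical names, the interleaving is forced deterministically, and object spread stands in for the shallow `||` merge.

```typescript
type Config = Record<string, unknown>;
interface Store { config: Config }

// Read-then-write: every caller snapshots the row before any caller
// writes, so the last write clobbers the earlier patches.
function readThenWrite(store: Store, patches: Config[]): Config {
  const snapshots = patches.map(() => ({ ...store.config }));
  patches.forEach((patch, i) => {
    store.config = { ...snapshots[i], ...patch };
  });
  return store.config;
}

// Atomic merge: each patch is applied to the latest value, which is what a
// single UPDATE ... SET config = config || patch evaluates against.
function atomicMerge(store: Store, patches: Config[]): Config {
  for (const patch of patches) {
    store.config = { ...store.config, ...patch };
  }
  return store.config;
}

console.log(readThenWrite({ config: { a: 1 } }, [{ b: 2 }, { c: 3 }])); // b is lost
console.log(atomicMerge({ config: { a: 1 } }, [{ b: 2 }, { c: 3 }]));   // all keys kept
```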

* fix(doctor): five correctness fixes — stale locks, content sanity, graph coverage, exit code, gateway guard

## 1. Stale lock break hints cover gbrain-cycle: keys

The doctor stale-lock report only recognized `gbrain-sync:` lock prefixes;
everything else fell back to `gbrain sync --break-lock`, which is wrong for
dream/autopilot cycle locks. A `gbrain-cycle:<source>` or `gbrain-cycle`
lock now suggests `gbrain dream --break-lock [--source <name>]`, and
unknown lock shapes fall back to `gbrain doctor` instead of a
misleading sync command.
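
The hint selection described in this section might look roughly like the sketch below. `breakLockHint` is a hypothetical name, and whether the real sync hint carries a source argument is not stated here, so it is omitted.

```typescript
// Hypothetical sketch of the lock-key to hint mapping described above.
function breakLockHint(lockKey: string): string {
  if (lockKey.startsWith("gbrain-sync:")) return "gbrain sync --break-lock";
  if (lockKey === "gbrain-cycle") return "gbrain dream --break-lock";
  if (lockKey.startsWith("gbrain-cycle:")) {
    const source = lockKey.slice("gbrain-cycle:".length);
    return source ? `gbrain dream --break-lock --source ${source}` : "gbrain dream --break-lock";
  }
  // Unknown lock shapes: point at doctor instead of a misleading sync command.
  return "gbrain doctor";
}

console.log(breakLockHint("gbrain-cycle:notes"));
console.log(breakLockHint("something-else"));
```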

## 2. content_sanity_audit_recent counts reject and quarantine as hard failures

v0.42 renamed the hard disposition path: rejected pages emit a `reject`
event and quarantined junk pages emit `quarantine`; `hard_block` is now
only the pre-v0.42 legacy alias. The status check only counted `hard_block`,
so fresh `reject` / `quarantine` events from the new path cleared as `ok`
whenever fewer than 10 events existed. The check now sums all three for the
hard count, and `soft_block + flag` for the soft count.
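
The counting rule above can be sketched as a small tally. `tallyDispositions` is a hypothetical helper, and the thresholds that turn these counts into a doctor status are not shown because they are not stated here.

```typescript
// Hypothetical sketch: hard = reject + quarantine + hard_block (legacy alias),
// soft = soft_block + flag.
const HARD_DISPOSITIONS = ["reject", "quarantine", "hard_block"];
const SOFT_DISPOSITIONS = ["soft_block", "flag"];

function tallyDispositions(events: string[]): { hard: number; soft: number } {
  let hard = 0;
  let soft = 0;
  for (const disposition of events) {
    if (HARD_DISPOSITIONS.includes(disposition)) hard += 1;
    else if (SOFT_DISPOSITIONS.includes(disposition)) soft += 1;
  }
  return { hard, soft };
}

console.log(tallyDispositions(["reject", "quarantine", "flag", "hard_block", "soft_block"]));
```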

## 3. graph_coverage excludes test fixture entity pages from the denominator

Brains seeded with code sources (e.g. a sync of the gbrain repo itself)
could accumulate test fixture pages typed as `entity` / `person`. Including
these in the entity-count denominator diluted coverage and produced spurious
warnings ("Entity link coverage 0%, timeline 0%") on knowledge-only brains
with no real entity pages. The check now queries a per-entity stats CTE that
excludes `tools/gbrain/test/*` slugs and the `templates/new-person` stub,
with an additional guard for the all-fixture case (`eligibleEntityCount = 0`).

## 4. process.exitCode instead of process.exit at doctor main exit point

`process.exit(hasFail ? 1 : 0)` was a hard kill that prevented cleanup
handlers (Bun unload events, open DB connections) from running. Using
`process.exitCode = hasFail ? 1 : 0` defers the actual termination until
the end of the event loop, allowing cleanup to complete.

## 5. checkSubagentCapability exported for test seams + gateway loop guard

The function was private, making it untestable in isolation. It is now
exported. Additionally, users running gbrain with a non-Anthropic chat model
via `agent.use_gateway_loop=true` no longer receive a spurious warning that
`ANTHROPIC_API_KEY` is missing — subagents route via the gateway loop in
that configuration and do not need the key directly.

## Tests

Doctor test suite: 77 pass, 0 fail (no regressions).

* fix(engine): deleteFactsForPage excludeSourcePrefixes (#1928) + reconnect() parity (#2034)

Engine-layer API for two cycle/availability fixes that share these files:
- deleteFactsForPage gains optional excludeSourcePrefixes so the fence
  reconcile can protect non-fence facts (e.g. cli: conversation facts).
- reconnect(ctx?) is now a first-class BrainEngine method on both engines
  (PostgresEngine already had it; PGLite gains config capture + reconnect)
  so callers stop using disconnect()+bare connect().

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(cycle): stop extract_facts from wiping conversation facts (#1928)

The fence reconcile delete-then-reinsert wiped cli:-origin facts (no fence to
recreate them); a failed-sync full walk turned it brain-wide (1829 rows, 0
reinserted, status ok). Now: exclude cli: rows from the wipe, do NOT inherit
the failed-sync->full-walk fallback for this destructive phase, and warn on
net-negative reconcile.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(autopilot,supervisor): reconnect() instead of disconnect()+bare connect() (#2034)

The autopilot health-probe recovery called connect() with no args after
disconnect(), losing the startup config (database_url undefined -> FATAL
restart-loop on every DB blip) and opening a null-pool window. Both call sites
now use engine.reconnect(), which restores the captured config.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(write-through): mirror to the assigned source's local_path, never the global repo (#2018)

put_page write-through resolved the disk target from the global sync.repo_path,
so a default-source page (local_path NULL) got written into an unrelated
federated source's working tree. Now it uses the assigned source's own
local_path; NULL local_path skips (no leak); the global path is used only as a
sole-source fallback.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(pglite-lock): heartbeat + steal-grace so live holders are never stolen (#2058)

A live holder's lock was force-removed after 5min age alone, letting a second
process share the single-writer data dir -> WAL corruption. The lock now
heartbeats while held; a holder is reaped only when its PID is dead OR its
heartbeat went stale past the steal grace. Pairs PID liveness with heartbeat
age to also defeat PID reuse.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
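
The reap condition in the pglite-lock fix above reduces to a two-input decision. A minimal sketch, with `LockObservation` and `canStealLock` as hypothetical names and with liveness and heartbeat age assumed to be measured by the caller:

```typescript
// Hypothetical sketch of the reap rule: dead PID OR heartbeat stale past the grace.
interface LockObservation {
  pidAlive: boolean;      // is the holder's PID still running?
  heartbeatAgeMs: number; // time since the holder last refreshed the lock
}

function canStealLock(lock: LockObservation, stealGraceMs: number): boolean {
  // Lock age alone is never enough; a live holder that keeps heartbeating is safe.
  return !lock.pidAlive || lock.heartbeatAgeMs > stealGraceMs;
}

const graceMs = 60_000; // illustrative value, not the project's default
console.log(canStealLock({ pidAlive: true, heartbeatAgeMs: 5_000 }, graceMs));   // false
console.log(canStealLock({ pidAlive: true, heartbeatAgeMs: 120_000 }, graceMs)); // true
console.log(canStealLock({ pidAlive: false, heartbeatAgeMs: 1_000 }, graceMs));  // true
```

The stale-heartbeat branch is what covers PID reuse: an unrelated process that inherits the PID looks alive but never refreshes the heartbeat.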

* fix(migrate,doctor): self-heal idx_timeline_dedup drift (#2038)

A migration renumbered during a merge (v102) could be recorded-as-applied
without its DDL running, leaving the 3-column index so every timeline write
failed the 4-column ON CONFLICT. runMigrations now always runs a shape-keyed
drift repair (dedupe-then-rebuild) even when no migration is pending, and
doctor surfaces the drift.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(timeline): un-silence the swallowed batch catch; pin Date-batch round-trip (#2057)

The meetings extractor's bare catch {} hid a brain-wide timeline-write failure
(0 entries, no error). It now counts + surfaces batch errors. Adds a Date-bearing
batch regression test proving the #1861 jsonb_to_recordset refactor already
fixed the original ::text[] cast failure.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* chore: bump version and changelog (v0.42.41.0)

Triage fix wave: 6 authored critical fixes (#1928 facts wipe, #2018
write-through leak, #2034 reconnect loop, #2058 WAL lock, #2038 timeline
migration drift, #2057 timeline silent-empty) + community PRs #2064 #2052
#2020 #2033 #2074 #2075 #2009 #2072 #2073. TODOS: deferred #1994 #1963 #2050.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix: address adversarial review findings (#1928, #2058, #2038, #2057)

Codex as-built review of the authored fixes surfaced 4 real issues:
- #2058: add a pid+acquired_at ownership token. A stale holder reaped + replaced
  past the grace must NOT let its resumed heartbeat refresh, nor releaseLock
  remove, the NEW owner's lock (re-opened the concurrent-writer hole). Heartbeat
  and release now verify the on-disk lock is still ours. + regression test.
- #1928: the destructive-full-walk guard keyed off phases.includes('sync'),
  which wrongly suppressed a legitimate full reconcile when sync was SKIPPED
  (no engine / no brainDir). Key off a syncAttempted flag set only when sync
  actually ran.
- #2038: dedupe keeps MIN(id) not MIN(ctid) — deterministic and consistent with
  the existing v-migration lower-id rule.
- #2057: the extract CLI caller now surfaces batch_errors (stderr + exit 1)
  instead of printing a clean success over failed inserts.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
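
The ownership check from the #2058 finding above can be sketched as a comparison of the on-disk token against the holder's own. `LockToken` and `ownsLock` are hypothetical names, and the representation of `acquired_at` as a string is an assumption.

```typescript
// Hypothetical sketch: heartbeat and release proceed only if the lock on
// disk still carries our pid + acquired_at token.
interface LockToken {
  pid: number;
  acquired_at: string;
}

function ownsLock(onDisk: LockToken | null, mine: LockToken): boolean {
  return onDisk !== null && onDisk.pid === mine.pid && onDisk.acquired_at === mine.acquired_at;
}

const mine: LockToken = { pid: 4242, acquired_at: "2026-06-12T10:00:00Z" };
const replaced: LockToken = { pid: 5151, acquired_at: "2026-06-12T10:07:00Z" };

console.log(ownsLock(mine, mine));     // true: safe to heartbeat or release
console.log(ownsLock(replaced, mine)); // false: a new owner holds the lock
console.log(ownsLock(null, mine));     // false: the lock file is gone
```

Comparing both fields, not just the pid, also rejects a later acquisition that happens to reuse the same pid.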

* docs(key-files): sync reference to v0.42.41.0 triage-wave behavior

Update KEY_FILES.md to current-state truth for the shipped fixes (no
release-history clauses, per the reference-doc discipline):

- write-through.ts (#2018): resolves the disk target from the assigned
  source's own local_path; sole-source falls back to sync.repo_path,
  multi-source skips with source_has_no_local_path rather than leak.
- engine.ts (#2034): reconnect() is now a REQUIRED lifecycle method on
  both engines; config-restoring, never disconnect()+bare connect().
- migrate.ts (#2073): document v116 edge source_id backfill + callee
  index, and the always-run (version-counter-blind) timeline dedup
  self-heal.
- new entry for timeline-dedup-repair.ts (#2038) + the
  timeline_dedup_index doctor check.
- new entry for pglite-lock.ts (#2058): heartbeat + steal-grace
  (GBRAIN_PGLITE_LOCK_STEAL_GRACE_SECONDS) so a live holder is never
  stolen.
- extract-facts.ts (#1928): cli:-fact protection, no failed-sync
  full-walk inheritance, net_fact_deletion warn floor.

bun run build:llms re-run (KEY_FILES is link-only so bundles unchanged);
freshness + current-state guards green.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(write-through): preserve nested multi-source layout; narrow #2018 leak guard

The first #2018 fix skipped any no-local_path source on a multi-source brain,
which broke the legitimate nested layout (a source without its own tree nests
under the host repo at .sources/<id>/ — pinned by put-page-write-through.test).
Narrow the guard: a no-local_path source nests under sync.repo_path as before;
only SKIP when sync.repo_path is literally another source's own local_path
(the actual leak — writing there pollutes that sibling's repo). Caught by the
sharded suite.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* test: satisfy test-isolation guard for the new lock/reconnect tests

CI `verify` flagged 3 intra-process isolation violations in the tests added
this wave (the parallel runner shares one process per shard):
- pglite-lock.test.ts: the GBRAIN_PGLITE_LOCK_STEAL_GRACE_SECONDS mutation now
  goes through withEnv() instead of a raw process.env write (R1).
- pglite-reconnect: renamed to *.serial.test.ts — it creates per-test engines
  to exercise the connect/reconnect lifecycle, which doesn't fit the shared
  beforeAll-engine model (R3/R4).
verify is now 30/30; both files green.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(pglite): reconnect() is a no-op for in-memory engines (#2034)

CI serial-tests + test(5) caught two in-branch regressions from the #2034
PGLite reconnect():
- worker/queue claim-error recovery + their renewLock e2e test assume PGLite
  reconnect is absent/no-op (queue.ts documents it). Making it a real
  disconnect+reopen wiped an in-memory engine's state mid-job. reconnect() now
  no-ops for in-memory (no database_path) — file-backed still re-opens the dir
  (state persists on disk). Restores the documented worker assumption.
- connection-resilience 'Supervisor still has the 3-strikes-then-reconnect
  path' pinned the removed unsafe-cast text; updated to assert the direct
  this.engine.reconnect() call.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* test: quarantine embed-input-type-wire to serial lane (CI test(5) leak)

#2033's embed-input-type-wire.test.ts configures a 1280-dim embedding gateway;
the active dimension survived into engine-find-trajectory when CI's 10-way
hash-disjoint sharding co-located them (this branch's added files reshuffled the
assignment), failing 7 trajectory tests with 'expected 1280 dimensions, not
1536'. resetGateway() in afterEach clears the gateway but the dimension still
leaked. It mutates global gateway/embedding state, so it belongs in the serial
lane (own bun process, true isolation) by the repo's own definition. Root-caused
by reproducing the exact failing pair locally.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: Austin Arnett <austin@sdsconsultinggroup.org>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Dave MacDonald <djmacdonald@ucdavis.edu>
Co-authored-by: pabloglzg <186649799+pabloglzg@users.noreply.github.com>
Co-authored-by: Alex P. <12667893+aphaiboon@users.noreply.github.com>
Co-authored-by: Garry Tan <bo.m.liu@gmail.com>
Co-authored-by: jbarol <barol.j@gmail.com>
Co-authored-by: maxpetrusenkoagent <max.petrusenko.agent@gmail.com>
Co-authored-by: PAI <pai@scaffolde.ai>
Development

Successfully merging this pull request may close these issues.

sync --strategy code enhancement: honor .gitignore (opt-in flag)