fix(recipes/openai): add max_batch_tokens to embedding touchpoint #924

Closed
mgunnin wants to merge 1 commit into garrytan:master from mgunnin:fix/openai-max-batch-tokens

@mgunnin (Contributor) commented May 12, 2026

PR: Add max_batch_tokens to OpenAI recipe

Problem

src/core/ai/recipes/openai.ts has no max_batch_tokens cap on its embedding
touchpoint. Most other providers declare one (voyage=120K, azure-openai=8K,
dashscope=8K, zhipu=8K, minimax=4K); the google recipe is the other exception
(see Related).

Without max_batch_tokens, gbrain's recursive-halving safety net never engages.
The batcher dispatches whatever fits in its char-based estimator window,
typically ~150K-200K tokens per request, so any page with token-dense content
(Discord exports, JSON dumps, code-heavy pages) is enough for a single request
to tip usage past OpenAI's 1M-token TPM ceiling. A retry storm follows: the
first failed request blocks the queue head, and subsequent passes hit the same
fat page first and never progress.

Reproduction:

  1. Init a brain with the Postgres engine.
  2. Sync a corpus containing a few discord-full chat exports (~800K chars each).
  3. Run gbrain embed --stale.
  4. Observe "Rate limit reached ... TPM: Limit 1000000, Used 1000000, Requested 150000"
    on the same 4-5 pages every pass.

Fix

Add max_batch_tokens: 100_000 to the OpenAI embedding touchpoint.

Why 100K and not 200K-250K (closer to OpenAI's 300K per-request cap):

  • gbrain's batcher estimates tokens as chars / 4.
  • Token-dense content (markdown+JSON, code blocks, timestamps in Discord/Slack
    dumps) tokenizes at ~chars/2.7. A 100K estimated batch can be ~150K real tokens.
  • 100K estimated = ~150K worst-case real, safely under the 300K/request hard cap.
  • Also gives recursive-halving room to work on outlier pages without thrashing.
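
The arithmetic behind the bullets above can be checked with a short sketch. The constant and function names are hypothetical and exist only for illustration; the ratios (chars/4 for the estimator, ~chars/2.7 for token-dense content) and the 300K per-request cap are the figures quoted in this PR.

```typescript
// Hypothetical names; the numbers are the ones quoted in the PR text.
const CHARS_PER_TOKEN_ESTIMATE = 4; // batcher's estimator: tokens ~ chars / 4
const CHARS_PER_TOKEN_DENSE = 2.7; // token-dense content: tokens ~ chars / 2.7
const OPENAI_REQUEST_CAP = 300_000; // per-request hard cap cited above

/** Worst-case real token count for a batch the estimator sizes at `estimatedTokens`. */
function worstCaseRealTokens(estimatedTokens: number): number {
  const chars = estimatedTokens * CHARS_PER_TOKEN_ESTIMATE;
  return Math.round(chars / CHARS_PER_TOKEN_DENSE);
}

/** Fraction of the per-request cap left unused in the worst case. */
function headroom(estimatedTokens: number): number {
  return 1 - worstCaseRealTokens(estimatedTokens) / OPENAI_REQUEST_CAP;
}
```

Under these assumptions a 100K cap leaves roughly half of the per-request limit unused in the worst case, while a 200K cap would land at about 296K real tokens, with almost no margin, and 250K would exceed the limit outright.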

Diff

     embedding: {
       models: ['text-embedding-3-large', 'text-embedding-3-small'],
       default_dims: 1536,
       dims_options: [256, 512, 768, 1024, 1536, 3072],
       cost_per_1m_tokens_usd: 0.13,
       price_last_verified: '2026-04-20',
+      max_batch_tokens: 100_000,
     },

Tested

  • Local gbrain install (v0.33.0, Postgres+pgvector) cleared its embedding
    queue after the patch where 5 prior passes had been stuck.
  • No regression on small batches — recursive-halving only engages when
    estimated batch tokens exceed cap.
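
For readers unfamiliar with the mechanism, here is a minimal sketch of cap-driven recursive halving. It is not gbrain's implementation: the function names are hypothetical, the `send` callback is synchronous to keep the example small (a real embedding call would be async), and only the chars/4 estimator is taken from this PR.

```typescript
// Same estimator the PR describes: tokens ~ chars / 4.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

/**
 * Hypothetical sketch: send `texts` as one request when the estimate fits
 * under `maxBatchTokens`, otherwise split the batch in half and recurse.
 * A single oversize text is sent as-is, since it cannot be split further here.
 */
function embedWithHalving(
  texts: string[],
  maxBatchTokens: number,
  send: (batch: string[]) => number[][],
): number[][] {
  const estimated = texts.reduce((sum, t) => sum + estimateTokens(t), 0);
  if (estimated <= maxBatchTokens || texts.length <= 1) {
    return send(texts);
  }
  const mid = Math.floor(texts.length / 2);
  const left = embedWithHalving(texts.slice(0, mid), maxBatchTokens, send);
  const right = embedWithHalving(texts.slice(mid), maxBatchTokens, send);
  return left.concat(right); // preserves input order
}
```

Small batches take the first branch and go out in one request, which is why no regression is expected below the cap.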

Related

The doctor warning 'recipe "google" declares an embedding touchpoint without
max_batch_tokens; recursion is the only safety net for batch caps' describes the
exact issue this PR fixes for OpenAI. The Google recipe should get the same
treatment; happy to add that in a follow-up.



OpenAI is the only recipe in the codebase without a max_batch_tokens cap.
Every other provider declares one (voyage=120K, azure-openai=8K, dashscope=8K,
zhipu=8K, minimax=4K). Without it, gbrain's recursive-halving safety net never
engages — batches dispatched purely on the char/4 estimator window will trip
OpenAI's 1M-token TPM ceiling on token-dense pages (Discord exports, JSON
dumps, code-heavy markdown), then retry storm and block the queue head.

Setting cap to 100_000:
- gbrain's batcher estimates tokens as chars/4
- Token-dense markdown+JSON tokenizes at ~chars/2.7
- 100K estimated = ~150K real worst-case, safely under OpenAI's 300K
  per-request hard cap and the 1M/min TPM ceiling
- Leaves headroom for recursive-halving on outlier chunks
garrytan added a commit that referenced this pull request May 25, 2026
…1374)

* fix(recipes/openai): add max_batch_tokens to embedding touchpoint

OpenAI is the only recipe in the codebase without a max_batch_tokens cap.
Every other provider declares one (voyage=120K, azure-openai=8K, dashscope=8K,
zhipu=8K, minimax=4K). Without it, gbrain's recursive-halving safety net never
engages — batches dispatched purely on the char/4 estimator window will trip
OpenAI's 1M-token TPM ceiling on token-dense pages (Discord exports, JSON
dumps, code-heavy markdown), then retry storm and block the queue head.

Setting cap to 100_000:
- gbrain's batcher estimates tokens as chars/4
- Token-dense markdown+JSON tokenizes at ~chars/2.7
- 100K estimated = ~150K real worst-case, safely under OpenAI's 300K
  per-request hard cap and the 1M/min TPM ceiling
- Leaves headroom for recursive-halving on outlier chunks

(cherry picked from commit 40536aa)

* fix(ai/embed): recognize OpenAI 'maximum request size' error in isTokenLimitError

OpenAI's /v1/embeddings endpoint hard-caps a single request at 300k tokens
total across all input items. When the cap is exceeded it returns:

    Invalid 'input': maximum request size is 300000 tokens per request.

None of the three existing regexes in isTokenLimitError matched this
phrasing, so the recursive-halving safety net in embedSubBatch never
engaged for OpenAI. The same fat page (a token-dense markdown export,
e.g. a Discord transcript) would re-fail every pass, blocking forward
progress on the whole batch indefinitely.

Locally reproduced on a 31,129-chunk Postgres brain: 2,125 chunks
stuck at 'remaining' across 30+ embed --stale passes with retry
loops + sleep delays. Adding the two new patterns lets halving fire;
the same backlog cleared in one pass after the regex change (the
companion max_batch_tokens recipe fix from PR #924 caps fresh batches,
but existing oversize pages still need halving to recover).

Adds:
  - /maximum request size.*tokens/i  — OpenAI verbatim
  - /max.*tokens.*per.*request/i    — defensive against minor rewording
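
A quick way to sanity-check the two patterns listed above is to run them against the quoted OpenAI error string. The helper name below is hypothetical and covers only the two new patterns; the existing regexes in isTokenLimitError are not shown in this PR and are not reproduced here.

```typescript
// The two patterns quoted in the commit message.
const NEW_TOKEN_LIMIT_PATTERNS: RegExp[] = [
  /maximum request size.*tokens/i, // OpenAI verbatim
  /max.*tokens.*per.*request/i, // defensive against minor rewording
];

/** Hypothetical helper: true when either new pattern matches the message. */
function matchesNewTokenLimitPattern(message: string): boolean {
  return NEW_TOKEN_LIMIT_PATTERNS.some((re) => re.test(message));
}

const openAiError =
  "Invalid 'input': maximum request size is 300000 tokens per request.";
const isTokenLimit = matchesNewTokenLimitPattern(openAiError);
```

Neither pattern uses the `g` flag, so `test()` carries no state between calls.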

Tests:
  - Regression test for the exact OpenAI error string
  - Coverage for the generic 'max tokens per request' variant
  - All 25 tests in adaptive-embed-batch.test.ts pass

No behavior change for providers whose errors already matched.

(cherry picked from commit b834e84)

* fix(connection-manager): strip .<project-ref> suffix from username when deriving direct URL

`deriveDirectUrl()` correctly rewrites the host (`aws-0-us-east-1.pooler.supabase.com`
→ `db.abcxyz.supabase.co`) but preserves the full pooler-form username
(`postgres.abcxyz`). Supabase direct connections expect a bare `postgres`
username — Supavisor uses the `.<ref>` suffix for tenant routing, but it's
not a real database user. The auto-derived URL therefore fails to authenticate
even with the correct password:

    password authentication failed for user "postgres.abcxyz"

Strip the suffix to `postgres` whenever the project-ref was successfully
extracted (same condition that triggers the host rewrite). The non-pooler
username branch is unaffected — preserved as-is to keep the port-only
fallback case working.
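
The username handling described above might look roughly like the following sketch. It is not the actual deriveDirectUrl(): the function name is hypothetical, the direct port 5432 is an assumption, and the real port-only fallback branch is replaced by returning the input unchanged.

```typescript
/**
 * Hypothetical sketch of the rewrite: pooler host becomes db.<ref>.supabase.co
 * and the ".<ref>" suffix is stripped from the username.
 */
function deriveDirectUrlSketch(poolerUrl: string): string {
  const url = new URL(poolerUrl);
  // Pooler usernames have the form "<user>.<project-ref>".
  const match = /^([^.]+)\.([a-z0-9]+)$/i.exec(decodeURIComponent(url.username));
  const isPoolerHost = /\.pooler\.supabase\.com$/i.test(url.hostname);
  if (!match || !isPoolerHost) {
    // Simplification: the real code keeps a port-only fallback here.
    return poolerUrl;
  }
  const user = match[1];
  const ref = match[2];
  const auth = url.password ? `${user}:${url.password}` : user;
  // Port 5432 is an assumption for the direct connection.
  return `${url.protocol}//${auth}@db.${ref}.supabase.co:5432${url.pathname}${url.search}`;
}
```

The host rewrite and the username strip share the same condition (a parseable project-ref), which mirrors the rule stated in the commit message.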

Hit while exercising v0.30.1's dual-pool routing on a real Supabase brain;
the kill switch (`GBRAIN_DISABLE_DIRECT_POOL=1`) papered over it locally
but every Supabase user with a stock pooler URL would silently fall through
to single-pool until the user supplied a `GBRAIN_DIRECT_DATABASE_URL`
override. With this fix, dual-pool works out of the box for the canonical
Supabase shape.

Test additions:
  - 1 case asserting bare `postgres:secret@` in the derived URL when
    project-ref is parseable from the pooler URL (the new behavior)
  - extends the existing "falls back to port-only" case with an
    assertion that non-pooler usernames are preserved (unchanged behavior)

`bun run typecheck` clean. `deriveDirectUrl` test block passes 5/5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
(cherry picked from commit ddf2c6a)

* fix(init): --help should not mutate config or scan filesystem

`gbrain init --help` (and `-h`) currently fall through to the smart-detection
branch in runInit(), which scans cwd for .md files and on a directory with
1000+ files prints "Found ~1500 .md files. For a brain this size, Supabase
gives faster search..." then defaults to PGLite — calling saveConfig() and
overwriting any existing Postgres config with `engine: 'pglite' +
database_path: ~/.gbrain/brain.pglite`.

Confirmed in the wild: ran `gbrain init --help` from $HOME on a machine where
~/.gbrain/config.json pointed at a Supabase Postgres brain with 10K+ pages.
The config was silently flipped to PGLite. The Supabase data was intact, but
gbrain stopped pointing at it until the config was manually restored.

Root cause: cli.ts:62-69 only routes --help → printOpHelp() for shared-op
commands; CLI_ONLY commands (init, embed, etc.) fall through to their handler
with --help still in argv. None of them check for it.

Fix: add a --help/-h guard at the top of runInit() that prints help text and
returns. Help should never mutate state: a flag that only requests information
must have no side effects.
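
A minimal sketch of such a guard, with the side effects injected so the early return is easy to verify. The `InitDeps` interface and `runInitSketch` are hypothetical stand-ins; the real runInit() signature is not shown in this commit message.

```typescript
// Hypothetical dependencies, injected so a test can observe side effects.
interface InitDeps {
  printHelp(): void;
  saveConfig(): void;
}

function runInitSketch(argv: string[], deps: InitDeps): "help" | "initialized" {
  // Guard first: nothing below this line may run when help is requested.
  if (argv.some((arg) => arg === "--help" || arg === "-h")) {
    deps.printHelp();
    return "help";
  }
  // Stand-in for smart detection and the config write.
  deps.saveConfig();
  return "initialized";
}
```

Placing the check before any detection logic is the point of the fix: the filesystem scan and saveConfig() are unreachable on the help path.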

Help text covers all flags (engine selection, AI provider options, thin-client
mode) so users running `--help` get the canonical list rather than having to
read the source.

A wider architectural fix (adding --help routing for all CLI_ONLY commands in
cli.ts) is a plausible follow-up, but each CLI_ONLY command would still need
its own help text. This per-command pattern matches how shared ops handle it
via printOpHelp(). Init is the highest-stakes case because it's the only
CLI_ONLY command that calls saveConfig().

Smoke test: from a directory with 1500 .md files, with GBRAIN_HOME pointed at
a fresh tempdir:
  - Before fix: ~/.gbrain/config.json materialized with engine: 'pglite'
  - After fix: help text printed, no config dir created

`bun run typecheck` clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
(cherry picked from commit ed11fdd)

* test(frontmatter-install-hook): isolate hooksPath assertion from developer global config

The "installHook writes ... and sets core.hooksPath" test asserted
`git config --get core.hooksPath` returns `.githooks`, which falls
back to the global scope when local is unset. Developers who set
`core.hooksPath` globally (common with dotfiles managers pointing at
~/.config/git/hooks) saw a deterministic FAIL because installHook
intentionally respects an existing global value and skips writing
the local one — exactly the documented contract.

Fix: read via `git config --local --get core.hooksPath` (scope-locked)
and branch the assertion on whether a global is already set. Both
clean-CI (local should be '.githooks') and developer-with-global
(local should be empty; installHook correctly didn't clobber) now
pass deterministically.

No API change. installHook behavior is unchanged.

Verified locally with the affected test passing under
`GIT_CONFIG_GLOBAL=~/.gitconfig` carrying `core.hooksPath=...`.

(cherry picked from commit 0e4da2c)

* fix: guard against missing 'intent' field in routing-eval fixtures

Two defensive fixes:

1. normalizeText(): return empty string on null/undefined input instead
   of crashing with 'undefined is not an object (evaluating s.toLowerCase)'

2. loadRoutingFixtures(): validate that parsed fixture has 'intent' as a
   string before adding to fixtures array. Fixtures with wrong field
   names (e.g. 'input' instead of 'intent') are now reported as
   malformed with a helpful error message listing the actual keys found.

Root cause: a skill's routing-eval.jsonl used {"input": ...} instead
of {"intent": ...}. The JSON parsed fine but the cast to
RoutingFixture was unchecked, so fixture.intent was undefined.
normalizeText(undefined) then crashed. This made 'gbrain doctor'
completely unusable.
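
The two guards could be sketched as follows. Only the `intent` field and the function name normalizeText come from the commit message; the body of normalizeText, the `parseFixtureLine` helper, and the result shape are assumptions made for illustration.

```typescript
interface RoutingFixture {
  intent: string;
}

type FixtureResult =
  | { ok: true; fixture: RoutingFixture }
  | { ok: false; error: string };

/** Null-safe normalization (the lowercase-and-trim body is an assumption). */
function normalizeText(s: string | null | undefined): string {
  return (s ?? "").toLowerCase().trim();
}

/** Hypothetical per-line validator: checks the shape instead of casting. */
function parseFixtureLine(line: string): FixtureResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(line);
  } catch {
    return { ok: false, error: "invalid JSON" };
  }
  if (typeof parsed !== "object" || parsed === null) {
    return { ok: false, error: "fixture is not an object" };
  }
  const record = parsed as Record<string, unknown>;
  if (typeof record.intent !== "string") {
    const keys = Object.keys(record).join(", ");
    return { ok: false, error: `missing string 'intent'; found keys: ${keys}` };
  }
  return { ok: true, fixture: { intent: record.intent } };
}
```

Listing the keys that were actually found makes a misnamed field such as `input` obvious from the error message alone.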

(cherry picked from commit b142bbd)

* fix(test): isolate HOME in run-e2e.sh to stop config corruption

Replaces #517 (re-ported fresh against current scripts/run-e2e.sh after
v0.23.1 rewrote the script — original cherry-pick would not apply).

E2E tests call setupDB which writes $HOME/.gbrain/config.json pointing at
the docker test container. When the container tears down, the user's real
autopilot daemon wedges trying to connect to a vanished postgres. Three
operators hit this within 16 days before the original PR was filed.

Fix: wrapper exports HOME + GBRAIN_HOME to a mktemp tmpdir BEFORE bun
starts so config writes land in the tmpdir, with a post-run breach
detector that compares md5 of the user's real config against pre-run.
Both env vars required: loadConfig/saveConfig resolve via HOME while
configPath honors GBRAIN_HOME. HOME set before bun starts because
os.homedir() caches at first call.

Test seam: test/gbrain-home-isolation.test.ts updated to assert against
homedir() === configDir() when GBRAIN_HOME unset (correct under the
safety wrapper itself) instead of the prior "not /tmp/" sentinel.

Revert path: git revert <this-sha> if test:e2e regresses on master.

Co-Authored-By: orendi84 <orendi84@users.noreply.github.com>

* test(dream-cycle): add schema-suggest to EXPECTED_PHASES

v0.40.7.0 Schema Cathedral v3 added the 'schema-suggest' phase between
'orphans' and 'purge' in ALL_PHASES, but the E2E phase-order test was
not updated to match. ALL_PHASES vs EXPECTED_PHASES diverged and the
shape-pin test failed every run on master.

Surfaced during fix-wave: warm-narwhal E2E gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(autopilot-fanout): use relative timestamp inside freshness window

The 'end-to-end: updateSourceConfig persists timestamp visible to next
listAllSources' test pinned last_full_cycle_at to a hardcoded
'2026-05-22T15:00:00.000Z'. The 60-minute freshness window passed
within ~1 hour of write — every run after the deadline classified the
source as stale and dispatched it, breaking the test's
.skippedFresh expectation.

Switch to Date.now() - 30min relative timestamp (mirrors the prior
'source with last_full_cycle_at < 60min ago is skipped by gate' test).
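
The failure mode is easy to reproduce in isolation. The sketch below assumes a simple "younger than 60 minutes" check; the `isFresh` helper and the constant name are hypothetical, and only the 60-minute window and the pinned timestamp come from the commit message.

```typescript
const FRESHNESS_WINDOW_MS = 60 * 60 * 1000; // 60-minute window from the commit message

/** Hypothetical gate: a source is fresh if its last full cycle is under 60 minutes old. */
function isFresh(lastFullCycleAt: string, now: number = Date.now()): boolean {
  return now - Date.parse(lastFullCycleAt) < FRESHNESS_WINDOW_MS;
}

// Hardcoded timestamp: fresh only while the clock is inside the window.
const pinned = "2026-05-22T15:00:00.000Z";

// Relative timestamp: always 30 minutes old, so always inside the window.
const relative = new Date(Date.now() - 30 * 60 * 1000).toISOString();
```

With the relative form the test no longer depends on when it runs.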

Surfaced during fix-wave: warm-narwhal E2E gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(fresh-install-pglite): unset other provider keys in beforeEach

init.ts:455 fails loud when multiple embedding providers are env-ready
in non-TTY mode. The test sets ZEROENTROPY_API_KEY then runs init,
but developer machines commonly have OPENAI_API_KEY + VOYAGE_API_KEY +
ZEROENTROPY_API_KEY all set, so init sees 3 providers and exits 1.

Save+unset OPENAI_API_KEY + VOYAGE_API_KEY in beforeEach, restore in
afterEach. Now only ZE is env-ready, init picks it, schema sized to
zembed-1's 1280d as the test expects.
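
The save-then-restore pattern can be sketched as a small helper. `withoutKeys` is a hypothetical name, and the environment is passed in as a plain object so the example stays self-contained; in a real test the argument would presumably be process.env.

```typescript
type Env = Record<string, string | undefined>;

/**
 * Hypothetical helper: remove `keys` from `env` and return a function that
 * puts the original values back (or leaves a key absent if it was absent).
 */
function withoutKeys(env: Env, keys: string[]): () => void {
  const saved: Env = {};
  for (const key of keys) {
    saved[key] = env[key];
    delete env[key];
  }
  return () => {
    for (const key of keys) {
      const original = saved[key];
      if (original === undefined) {
        delete env[key];
      } else {
        env[key] = original;
      }
    }
  };
}
```

A test would call the helper in beforeEach, keep the returned function, and invoke it in afterEach so the developer's own keys survive the run.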

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(voyage-multimodal): switch fixture from AVIF to PNG

Voyage's /multimodalembeddings endpoint rejects AVIF as of 2026-05
with 'Please provide a valid base64-encoded image'. The prior comment
('AVIF is fine for an embed call') held at v0.27.x and regressed
silently on the provider side.

Add test/fixtures/images/tiny.png (16x16 RGB PNG, 1307 bytes generated
via sips from the macOS default wallpaper). PNG is universally
accepted by Voyage and other multimodal providers.

Surfaced during fix-wave: warm-narwhal E2E gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cycle/synthesize): prefix bare anthropic model ids before queue.add

queue.add's subagent capability validator (classifyCapabilities →
resolveRecipe) requires provider:model format and rejects bare ids
with 'unknown provider'. resolveModel returns the bare id from
TIER_DEFAULTS / DEFAULT_ALIASES (e.g. 'claude-sonnet-4-6'), which the
validator then rejects, dropping the synthesize phase to status:fail
with SYNTH_PHASE_FAIL.

Narrow fix at the call site: if config.model has no colon AND starts
with 'claude-', prefix 'anthropic:'. Other providers must already
declare a colon. Avoids changing TIER_DEFAULTS / DEFAULT_ALIASES
constant shapes, which would ripple across every resolveModel caller.
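
The rule in the paragraph above fits in a few lines. The function name is hypothetical; the conditions (no colon, starts with 'claude-') and the 'anthropic:' prefix are the ones stated in the commit message.

```typescript
/** Hypothetical helper: add 'anthropic:' to bare claude-* ids, leave everything else alone. */
function withProviderPrefix(model: string): string {
  const hasProvider = model.indexOf(":") !== -1;
  if (!hasProvider && /^claude-/.test(model)) {
    return `anthropic:${model}`;
  }
  return model;
}

const prefixed = withProviderPrefix("claude-sonnet-4-6");
```

Ids that already carry a provider pass through unchanged, so applying the helper twice is harmless.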

Surfaced by dream-synthesize-chunking E2E during fix-wave: warm-narwhal.
Affected tests: 'single-chunk transcript uses legacy idempotency key'
and 'multi-chunk transcript spawns N children with chunk-suffixed
idempotency keys' — both relied on result.details.children_submitted
which only the ok() path sets; the failed() path returns details: {}.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(mechanical): pin doctor init embedding model + clean non-default sources

Two fixes in the E2E Doctor Command describe block, both surfaced by
cross-file state pollution under the full sequential E2E run:

1. Pass --embedding-model openai:text-embedding-3-large to the init
   subprocess. Without the explicit flag, doctor inherits whatever the
   resolver picks from env keys (ZE if ZEROENTROPY_API_KEY is set,
   defaulting to zembed-1 at 1280d). The test's setupDB initialized
   schema at 1536d, so the dim mismatch fires
   embedding_width_consistency WARN, exiting doctor 1.

2. DELETE FROM sources WHERE id != 'default' in beforeAll. Prior E2E
   files leave non-default source rows (e.g. 'delta' from autopilot /
   sources tests). sync_freshness + cycle_freshness then FAIL on those
   orphans because they were never synced/cycled, exiting doctor 1.
   setupDB TRUNCATEs sources but schema.sql re-seeds 'default' via
   initSchema; this leaves only the canonical single-source brain
   the test expects.

Surfaced during fix-wave: warm-narwhal E2E gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(run-e2e): per-file connection flush + 180s outer timeout

Two cross-file isolation hardenings for the sequential E2E runner:

1. Terminate stale Postgres connections before each file. Without this,
   idle connections from the prior bun process's pool race with the
   next file's setupDB() TRUNCATE CASCADE, producing 'fixture pages
   disappear mid-test' failures. The terminate call is idempotent +
   ~50ms; first iteration is a no-op.

2. Hard outer timeout (180s per file) via gtimeout / timeout. bun's
   --timeout=60000 is per-test; if a PGLite WASM call hangs in
   beforeAll/afterAll (e.g. ingestion-roundtrip.test.ts wedging
   30+ minutes on macOS), --timeout never fires and the entire suite
   wedges. Outer SIGKILL lets the suite advance and the file is
   recorded as failed for triage. Falls through to bare bun if neither
   gtimeout nor timeout is on PATH.

Surfaced during fix-wave: warm-narwhal — 3 of 5 cross-file flakes
caught by the connection flush; ingestion-roundtrip 30-min wedge
caught by the outer timeout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.41.3.0)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs: annotate synthesize.ts narrow prefix fix (v0.41.3.0)

CLAUDE.md gains the v0.41.3.0 note on src/core/cycle/synthesize.ts (narrow
anthropic: prefix at the queue.add boundary so resolveModel's bare ids
satisfy the subagent validator). llms-full.txt regenerated to match.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore: rebump v0.41.3.0 → v0.41.5.0 (queue drift; PR #1377 claimed .4.0)

Sibling fix-wave PR #1377 (garrytan/community-pr-wave) claimed v0.41.4.0
between my queue check (.3.0 was available) and PR creation. Re-bump to
the next available slot per workspace-aware allocator.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(cycle/synthesize): refuse empty brainDir + resolve relative paths

Pre-fix, runPhaseSynthesize accepted any brainDir string and passed it
to writeReversePages which does join(brainDir, '<slug>.md'). When
brainDir is '' or relative ('.' / './brain' / etc), join() produces a
relative path that writeFileSync resolves against cwd. Result: every
synthesize reverse-write spills into <cwd>/companies/<slug>.md,
<cwd>/people/<slug>.md, etc. instead of the intended brainDir tempdir.

Surfaced by the warm-narwhal wave when E2E test cleanup found orphan
synthesize pages (companies/novamind.md, people/sarah-chen.md,
meetings/2025-04-01-novamind-board-update.md) at the gbrain repo root
from a runCycle({brainDir: '.'}) chain that ran during morning E2E
execution.

Fix at the function entry, single location, all callers protected:
  1. Empty/whitespace brainDir → return failed(BRAINDIR_EMPTY) loud
     instead of silently resolving against cwd
  2. Relative brainDir → resolve(opts.brainDir) before any read/write
     can use it. opts.brainDir mutated so writeReversePages,
     writeSummaryPage, and every join() downstream see the absolute path

Regression test pins all 4 contracts:
  - empty string → fail(BRAINDIR_EMPTY)
  - whitespace-only → fail(BRAINDIR_EMPTY)
  - '.' → mutated to absolute on entry
  - already-absolute → unchanged
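
The four contracts above can be expressed as one small function. This is a sketch, not the code in runPhaseSynthesize: `normalizeBrainDir` and its result type are hypothetical, and only the BRAINDIR_EMPTY code and the resolve-on-entry behavior come from the commit message.

```typescript
import { isAbsolute, resolve } from "node:path";

type BrainDirResult =
  | { ok: true; brainDir: string }
  | { ok: false; code: "BRAINDIR_EMPTY" };

/** Hypothetical entry check: reject empty input, make relative paths absolute. */
function normalizeBrainDir(brainDir: string): BrainDirResult {
  if (brainDir.trim() === "") {
    // Fail loudly instead of letting join() resolve against cwd later.
    return { ok: false, code: "BRAINDIR_EMPTY" };
  }
  // Absolute paths pass through untouched; relative ones resolve against cwd once, here.
  return { ok: true, brainDir: isAbsolute(brainDir) ? brainDir : resolve(brainDir) };
}
```

Resolving once at the entry point means every downstream join() sees the same absolute base, regardless of later changes to the working directory.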

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(dream): resolve brainDir to absolute at CLI surface

Defense-in-depth for the synthesize-braindir spillage bug class. The
core fix lives in runPhaseSynthesize (commit 98222a0); this resolves
brainDir one layer earlier so the entire 9-phase runCycle gets the
absolute path, not just synthesize.

Two paths in resolveBrainDir get path.resolve():
  - explicit --dir argument (e.g., `gbrain dream --dir .`)
  - sync.repo_path config (in case it was ever stored relative)

resolveBrainDir already checked existsSync; resolve() just canonicalizes
before return. No behavior change for paths already absolute.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Matt Gunnin <mgunnin@esports.one>
Co-authored-by: Brandon Lipman <brandon@offdeck.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Jeremy Knows <jeremy@veefriends.com>
Co-authored-by: root <root@localhost>
Co-authored-by: orendi84 <orendigergo@gmail.com>
Co-authored-by: orendi84 <orendi84@users.noreply.github.com>
Co-authored-by: Garry Tan <garry@ycombinator.com>
@garrytan (Owner) commented Jun 8, 2026

Thanks for this contribution — and apologies for the slow triage. We did a full pass over the entire PR backlog. gbrain has moved fast, and the maintainer's larger "cathedral" rewrites have superseded a big share of community PRs: the AI gateway + recipes + user_provided_models system replaced almost all individual provider PRs; #1805 fixed the whole Postgres module-singleton class; #1542 unified the type taxonomy; #1657 the retrieval path; #1802 the doctor; and so on.

We're closing this one in that cleanup — either the fix already landed on master, it duplicates another PR or merged change, or it's outside the current merge bar. Where a closed PR carried a genuinely valuable idea, we've recorded it in docs/designs/COMMUNITY_IDEAS.md so nothing good is lost (a few may graduate into TODOs).

Please don't read the close as a judgment of the work — thank you for contributing. If you believe the underlying issue is still live on the latest master, reopen with a quick note and we'll take another look. 🙏

@garrytan garrytan closed this Jun 8, 2026
garrytan-agents pushed a commit to garrytan-agents/gbrain that referenced this pull request Jun 13, 2026
…arrytan#1374)

* fix(recipes/openai): add max_batch_tokens to embedding touchpoint

OpenAI is the only recipe in the codebase without a max_batch_tokens cap.
Every other provider declares one (voyage=120K, azure-openai=8K, dashscope=8K,
zhipu=8K, minimax=4K). Without it, gbrain's recursive-halving safety net never
engages — batches dispatched purely on the char/4 estimator window will trip
OpenAI's 1M-token TPM ceiling on token-dense pages (Discord exports, JSON
dumps, code-heavy markdown), then retry storm and block the queue head.

Setting cap to 100_000:
- gbrain's batcher estimates tokens as chars/4
- Token-dense markdown+JSON tokenizes at ~chars/2.7
- 100K estimated = ~150K real worst-case, safely under OpenAI's 300K
  per-request hard cap and the 1M/min TPM ceiling
- Leaves headroom for recursive-halving on outlier chunks

(cherry picked from commit 40536aa)

* fix(ai/embed): recognize OpenAI 'maximum request size' error in isTokenLimitError

OpenAI's /v1/embeddings endpoint hard-caps a single request at 300k tokens
total across all input items. When the cap is exceeded it returns:

    Invalid 'input': maximum request size is 300000 tokens per request.

None of the three existing regexes in isTokenLimitError matched this
phrasing, so the recursive-halving safety net in embedSubBatch never
engaged for OpenAI. The same fat page (a token-dense markdown export,
e.g. a Discord transcript) would re-fail every pass, blocking forward
progress on the whole batch indefinitely.

Locally reproduced on a 31,129-chunk Postgres brain: 2,125 chunks
stuck at 'remaining' across 30+ embed --stale passes with retry
loops + sleep delays. Adding the two new patterns lets halving fire;
the same backlog cleared in one pass after the regex change (the
companion max_batch_tokens recipe fix from PR garrytan#924 caps fresh batches,
but existing oversize pages still need halving to recover).

Adds:
  - /maximum request size.*tokens/i  — OpenAI verbatim
  - /max.*tokens.*per.*request/i    — defensive against minor rewording

Tests:
  - Regression test for the exact OpenAI error string
  - Coverage for the generic 'max tokens per request' variant
  - All 25 tests in adaptive-embed-batch.test.ts pass

No behavior change for providers whose errors already matched.

(cherry picked from commit b834e84)

* fix(connection-manager): strip .<project-ref> suffix from username when deriving direct URL

`deriveDirectUrl()` correctly rewrites the host (`aws-0-us-east-1.pooler.supabase.com`
→ `db.abcxyz.supabase.co`) but preserves the full pooler-form username
(`postgres.abcxyz`). Supabase direct connections expect a bare `postgres`
username — Supavisor uses the `.<ref>` suffix for tenant routing, but it's
not a real database user. The auto-derived URL therefore fails to authenticate
even with the correct password:

    password authentication failed for user "postgres.abcxyz"

Strip the suffix to `postgres` whenever the project-ref was successfully
extracted (same condition that triggers the host rewrite). The non-pooler
username branch is unaffected — preserved as-is to keep the port-only
fallback case working.

Hit while exercising v0.30.1's dual-pool routing on a real Supabase brain;
the kill switch (`GBRAIN_DISABLE_DIRECT_POOL=1`) papered over it locally
but every Supabase user with a stock pooler URL would silently fall through
to single-pool until the user-supplied a `GBRAIN_DIRECT_DATABASE_URL`
override. With this fix, dual-pool works out of the box for the canonical
Supabase shape.

Test additions:
  - 1 case asserting bare `postgres:secret@` in the derived URL when
    project-ref is parseable from the pooler URL (the new behavior)
  - extends the existing "falls back to port-only" case with an
    assertion that non-pooler usernames are preserved (unchanged behavior)

`bun run typecheck` clean. `deriveDirectUrl` test block passes 5/5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
(cherry picked from commit ddf2c6a)

* fix(init): --help should not mutate config or scan filesystem

`gbrain init --help` (and `-h`) currently fall through to the smart-detection
branch in runInit(), which scans cwd for .md files and on a directory with
1000+ files prints "Found ~1500 .md files. For a brain this size, Supabase
gives faster search..." then defaults to PGLite — calling saveConfig() and
overwriting any existing Postgres config with `engine: 'pglite' +
database_path: ~/.gbrain/brain.pglite`.

Confirmed in the wild: ran `gbrain init --help` from $HOME on a machine where
~/.gbrain/config.json pointed at a Supabase Postgres brain with 10K+ pages.
The config was silently flipped to PGLite. The Supabase data was intact, but
gbrain stopped pointing at it until the config was manually restored.

Root cause: cli.ts:62-69 only routes --help → printOpHelp() for shared-op
commands; CLI_ONLY commands (init, embed, etc.) fall through to their handler
with --help still in argv. None of them check for it.

Fix: add a --help/-h guard at the top of runInit() that prints help text and
returns. Help should never mutate state — Postel's robustness principle for
CLI tools.

Help text covers all flags (engine selection, AI provider options, thin-client
mode) so users running `--help` get the canonical list rather than having to
read the source.

A wider architectural fix — adding --help routing for all CLI_ONLY commands in
cli.ts — is plausible follow-up, but each CLI_ONLY command would still need
its own help text. This per-command pattern matches how shared ops handle it
via printOpHelp(). Init is the highest-stakes case because it's the only
CLI_ONLY command that calls saveConfig().

Smoke test: from a directory with 1500 .md files, with GBRAIN_HOME pointed at
a fresh tempdir:
  - Before fix: ~/.gbrain/config.json materialized with engine: 'pglite'
  - After fix: help text printed, no config dir created

`bun run typecheck` clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
(cherry picked from commit ed11fdd)

* test(frontmatter-install-hook): isolate hooksPath assertion from developer global config

The "installHook writes ... and sets core.hooksPath" test asserted
`git config --get core.hooksPath` returns `.githooks`, which falls
back to the global scope when local is unset. Developers who set
`core.hooksPath` globally (common with dotfiles managers pointing at
~/.config/git/hooks) saw a deterministic FAIL because installHook
intentionally respects an existing global value and skips writing
the local one — exactly the documented contract.

Fix: read via `git config --local --get core.hooksPath` (scope-locked)
and branch the assertion on whether a global is already set. Both
clean-CI (local should be '.githooks') and developer-with-global
(local should be empty; installHook correctly didn't clobber) now
pass deterministically.

No API change. installHook behavior is unchanged.

Verified locally with the affected test passing under
`GIT_CONFIG_GLOBAL=~/.gitconfig` carrying `core.hooksPath=...`.

(cherry picked from commit 0e4da2c)

* fix: guard against missing 'intent' field in routing-eval fixtures

Two defensive fixes:

1. normalizeText(): return empty string on null/undefined input instead
   of crashing with 'undefined is not an object (evaluating s.toLowerCase)'

2. loadRoutingFixtures(): validate that parsed fixture has 'intent' as a
   string before adding to fixtures array. Fixtures with wrong field
   names (e.g. 'input' instead of 'intent') are now reported as
   malformed with a helpful error message listing the actual keys found.

Root cause: a skill's routing-eval.jsonl used {"input": ...} instead
of {"intent": ...}. The JSON parsed fine but the cast to
RoutingFixture was unchecked, so fixture.intent was undefined.
normalizeText(undefined) then crashed. This made 'gbrain doctor'
completely unusable.

(cherry picked from commit b142bbd)

* fix(test): isolate HOME in run-e2e.sh to stop config corruption

Replaces garrytan#517 (re-ported fresh against current scripts/run-e2e.sh after
v0.23.1 rewrote the script — original cherry-pick would not apply).

E2E tests call setupDB which writes $HOME/.gbrain/config.json pointing at
the docker test container. When the container tears down, the user's real
autopilot daemon wedges trying to connect to a vanished postgres. Three
operators hit this within 16 days before the original PR filed.

Fix: wrapper exports HOME + GBRAIN_HOME to a mktemp tmpdir BEFORE bun
starts so config writes land in the tmpdir, with a post-run breach
detector that compares md5 of the user's real config against pre-run.
Both env vars required: loadConfig/saveConfig resolve via HOME while
configPath honors GBRAIN_HOME. HOME set before bun starts because
os.homedir() caches at first call.

Test seam: test/gbrain-home-isolation.test.ts updated to assert against
homedir() === configDir() when GBRAIN_HOME unset (correct under the
safety wrapper itself) instead of the prior "not /tmp/" sentinel.

Revert path: git revert <this-sha> if test:e2e regresses on master.

Co-Authored-By: orendi84 <orendi84@users.noreply.github.com>

* test(dream-cycle): add schema-suggest to EXPECTED_PHASES

v0.40.7.0 Schema Cathedral v3 added the 'schema-suggest' phase between
'orphans' and 'purge' in ALL_PHASES, but the E2E phase-order test was
not updated to match. ALL_PHASES vs EXPECTED_PHASES diverged and the
shape-pin test failed every run on master.

Surfaced during fix-wave: warm-narwhal E2E gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(autopilot-fanout): use relative timestamp inside freshness window

The 'end-to-end: updateSourceConfig persists timestamp visible to next
listAllSources' test pinned last_full_cycle_at to a hardcoded
'2026-05-22T15:00:00.000Z'. The 60-minute freshness window passed
within ~1 hour of write — every run after the deadline classified the
source as stale and dispatched it, breaking the test's
.skippedFresh expectation.

Switch to Date.now() - 30min relative timestamp (mirrors the prior
'source with last_full_cycle_at < 60min ago is skipped by gate' test).
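
A minimal sketch of why the hardcoded timestamp went stale while a relative one cannot. The helper names `isFresh` and `minutesAgoIso` are hypothetical, and the gate logic is an assumption based only on the 60-minute window described above.

```typescript
// 60-minute freshness window, as described in the commit message.
const FRESHNESS_WINDOW_MS = 60 * 60 * 1000;

// Hypothetical gate: a source is fresh if its last full cycle finished
// less than FRESHNESS_WINDOW_MS before `now`.
function isFresh(lastFullCycleAt: string, now: number = Date.now()): boolean {
  return now - Date.parse(lastFullCycleAt) < FRESHNESS_WINDOW_MS;
}

// Hypothetical helper: build an ISO timestamp relative to `now`, so the
// fixture stays inside the window no matter when the test runs.
function minutesAgoIso(minutes: number, now: number = Date.now()): string {
  return new Date(now - minutes * 60 * 1000).toISOString();
}
```

A fixture built with `minutesAgoIso(30)` stays inside the window on every run, whereas the pinned `'2026-05-22T15:00:00.000Z'` is only fresh for one hour after that instant.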

Surfaced during fix-wave: warm-narwhal E2E gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(fresh-install-pglite): unset other provider keys in beforeEach

init.ts:455 fails loud when multiple embedding providers are env-ready
in non-TTY mode. The test sets ZEROENTROPY_API_KEY then runs init,
but developer machines commonly have OPENAI_API_KEY + VOYAGE_API_KEY +
ZEROENTROPY_API_KEY all set, so init sees 3 providers and exits 1.

Save+unset OPENAI_API_KEY + VOYAGE_API_KEY in beforeEach, restore in
afterEach. Now only ZE is env-ready, init picks it, schema sized to
zembed-1's 1280d as the test expects.
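
The save, unset, and restore pattern could look roughly like this. The helper names `unsetKeys` and `restoreKeys` are hypothetical, and the helpers take the environment as a parameter so the sketch stays self-contained; in the test they would presumably be called with `process.env` from `beforeEach` and `afterEach`.

```typescript
type Env = Record<string, string | undefined>;

// Remember the current value of each key, then remove it from the env.
// Keys that were never set are remembered as undefined.
function unsetKeys(env: Env, keys: readonly string[]): Map<string, string | undefined> {
  const saved = new Map<string, string | undefined>();
  keys.forEach((key) => {
    saved.set(key, env[key]);
    delete env[key];
  });
  return saved;
}

// Put saved values back; keys that were originally unset stay unset.
function restoreKeys(env: Env, saved: Map<string, string | undefined>): void {
  saved.forEach((value, key) => {
    if (value === undefined) {
      delete env[key];
    } else {
      env[key] = value;
    }
  });
}
```

Deleting the key, rather than assigning an empty string, matters if the provider check only tests for presence; that detail is an assumption about how init detects env-ready providers.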

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(voyage-multimodal): switch fixture from AVIF to PNG

Voyage's /multimodalembeddings endpoint rejects AVIF as of 2026-05
with 'Please provide a valid base64-encoded image'. The prior comment
('AVIF is fine for an embed call') held at v0.27.x and regressed
silently on the provider side.

Add test/fixtures/images/tiny.png (16x16 RGB PNG, 1307 bytes generated
via sips from the macOS default wallpaper). PNG is universally
accepted by Voyage and other multimodal providers.

Surfaced during fix-wave: warm-narwhal E2E gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(cycle/synthesize): prefix bare anthropic model ids before queue.add

queue.add's subagent capability validator (classifyCapabilities →
resolveRecipe) requires provider:model format and rejects bare ids
with 'unknown provider'. resolveModel returns the bare id from
TIER_DEFAULTS / DEFAULT_ALIASES (e.g. 'claude-sonnet-4-6'), which the
validator then rejects, dropping the synthesize phase to status:fail
with SYNTH_PHASE_FAIL.

Narrow fix at the call site: if config.model has no colon AND starts
with 'claude-', prefix 'anthropic:'. Other providers must already
declare a colon. Avoids changing TIER_DEFAULTS / DEFAULT_ALIASES
constant shapes, which would ripple across every resolveModel caller.
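
The call-site rule described above amounts to something like the following sketch. The helper name `withAnthropicPrefix` is hypothetical; the actual change may inline the check at the queue.add call rather than use a function.

```typescript
// Prefix 'anthropic:' only when the id has no provider (no colon) and
// looks like a Claude model. Everything else passes through unchanged.
function withAnthropicPrefix(model: string): string {
  if (!model.includes(":") && model.startsWith("claude-")) {
    return `anthropic:${model}`;
  }
  return model;
}
```

Ids that already carry a provider are left alone, so applying the helper twice is harmless.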

Surfaced by dream-synthesize-chunking E2E during fix-wave: warm-narwhal.
Affected tests: 'single-chunk transcript uses legacy idempotency key'
and 'multi-chunk transcript spawns N children with chunk-suffixed
idempotency keys' — both relied on result.details.children_submitted
which only the ok() path sets; the failed() path returns details: {}.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(mechanical): pin doctor init embedding model + clean non-default sources

Two fixes in the E2E Doctor Command describe block, both surfaced by
cross-file state pollution under the full sequential E2E run:

1. Pass --embedding-model openai:text-embedding-3-large to the init
   subprocess. Without the explicit flag, doctor inherits whatever the
   resolver picks from env keys (ZE if ZEROENTROPY_API_KEY is set,
   defaulting to zembed-1 at 1280d). The test's setupDB initialized
   schema at 1536d, so the dim mismatch fires
   embedding_width_consistency WARN, exiting doctor 1.

2. DELETE FROM sources WHERE id != 'default' in beforeAll. Prior E2E
   files leave non-default source rows (e.g. 'delta' from autopilot /
   sources tests). sync_freshness + cycle_freshness then FAIL on those
   orphans because they were never synced/cycled, exiting doctor 1.
   setupDB TRUNCATEs sources but schema.sql re-seeds 'default' via
   initSchema; this leaves only the canonical single-source brain
   the test expects.

Surfaced during fix-wave: warm-narwhal E2E gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(run-e2e): per-file connection flush + 180s outer timeout

Two cross-file isolation hardenings for the sequential E2E runner:

1. Terminate stale Postgres connections before each file. Without this,
   idle connections from the prior bun process's pool race with the
   next file's setupDB() TRUNCATE CASCADE, producing 'fixture pages
   disappear mid-test' failures. The terminate call is idempotent +
   ~50ms; first iteration is a no-op.

2. Hard outer timeout (180s per file) via gtimeout / timeout. bun's
   --timeout=60000 is per-test; if a PGLite WASM call hangs in
   beforeAll/afterAll (e.g. ingestion-roundtrip.test.ts wedging
   30+ minutes on macOS), --timeout never fires and the entire suite
   wedges. Outer SIGKILL lets the suite advance and the file is
   recorded as failed for triage. Falls through to bare bun if neither
   gtimeout nor timeout is on PATH.

Surfaced during fix-wave: warm-narwhal — 3 of 5 cross-file flakes
caught by the connection flush; ingestion-roundtrip 30-min wedge
caught by the outer timeout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump version and changelog (v0.41.3.0)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs: annotate synthesize.ts narrow prefix fix (v0.41.3.0)

CLAUDE.md gains the v0.41.3.0 note on src/core/cycle/synthesize.ts (narrow
anthropic: prefix at the queue.add boundary so resolveModel's bare ids
satisfy the subagent validator). llms-full.txt regenerated to match.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore: rebump v0.41.3.0 → v0.41.5.0 (queue drift; PR garrytan#1377 claimed .4.0)

Sibling fix-wave PR garrytan#1377 (garrytan/community-pr-wave) claimed v0.41.4.0
between my queue check (.3.0 was available) and PR creation. Re-bump to
the next available slot per workspace-aware allocator.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(cycle/synthesize): refuse empty brainDir + resolve relative paths

Pre-fix, runPhaseSynthesize accepted any brainDir string and passed it
to writeReversePages which does join(brainDir, '<slug>.md'). When
brainDir is '' or relative ('.' / './brain' / etc), join() produces a
relative path that writeFileSync resolves against cwd. Result: every
synthesize reverse-write spills into <cwd>/companies/<slug>.md,
<cwd>/people/<slug>.md, etc. instead of the intended brainDir tempdir.

Surfaced by the warm-narwhal wave when E2E test cleanup found orphan
synthesize pages (companies/novamind.md, people/sarah-chen.md,
meetings/2025-04-01-novamind-board-update.md) at the gbrain repo root
from a runCycle({brainDir: '.'}) chain that ran during morning E2E
execution.

Fix at the function entry, single location, all callers protected:
  1. Empty/whitespace brainDir → return failed(BRAINDIR_EMPTY) loud
     instead of silently resolving against cwd
  2. Relative brainDir → resolve(opts.brainDir) before any read/write
     can use it. opts.brainDir mutated so writeReversePages,
     writeSummaryPage, and every join() downstream see the absolute path

Regression test pins all 4 contracts:
  - empty string → fail(BRAINDIR_EMPTY)
  - whitespace-only → fail(BRAINDIR_EMPTY)
  - '.' → mutated to absolute on entry
  - already-absolute → unchanged
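
The four contracts can be illustrated with a small standalone sketch. The function name `normalizeBrainDir` and the result type are hypothetical; the commit itself mutates `opts.brainDir` inside runPhaseSynthesize and returns `failed(BRAINDIR_EMPTY)`, which is only approximated here.

```typescript
import { isAbsolute, resolve } from "node:path";

// Assumed result shape standing in for the phase's ok()/failed() helpers.
type BrainDirResult =
  | { ok: true; brainDir: string }
  | { ok: false; code: "BRAINDIR_EMPTY" };

function normalizeBrainDir(brainDir: string): BrainDirResult {
  // Empty or whitespace-only: fail loudly instead of resolving against cwd.
  if (brainDir.trim() === "") {
    return { ok: false, code: "BRAINDIR_EMPTY" };
  }
  // Already absolute: return as-is. Relative: anchor to the current cwd once,
  // so every later join() sees an absolute path.
  return { ok: true, brainDir: isAbsolute(brainDir) ? brainDir : resolve(brainDir) };
}
```

Resolving once at entry means later `join(brainDir, ...)` calls no longer depend on the process cwd at write time.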

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(dream): resolve brainDir to absolute at CLI surface

Defense-in-depth for the synthesize-braindir spillage bug class. The
core fix lives in runPhaseSynthesize (commit 98222a0); this resolves
brainDir one layer earlier so the entire 9-phase runCycle gets the
absolute path, not just synthesize.

Two paths in resolveBrainDir get path.resolve():
  - explicit --dir argument (e.g., `gbrain dream --dir .`)
  - sync.repo_path config (in case it was ever stored relative)

resolveBrainDir already checked existsSync; resolve() just canonicalizes
before return. No behavior change for paths already absolute.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Matt Gunnin <mgunnin@esports.one>
Co-authored-by: Brandon Lipman <brandon@offdeck.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Jeremy Knows <jeremy@veefriends.com>
Co-authored-by: root <root@localhost>
Co-authored-by: orendi84 <orendigergo@gmail.com>
Co-authored-by: orendi84 <orendi84@users.noreply.github.com>
Co-authored-by: Garry Tan <garry@ycombinator.com>