Skip to content

fix(oauth): honor token_endpoint_auth_method: "none" for DCR public clients (PKCE-only)#909

Closed
ding-modding wants to merge 1 commit into
garrytan:masterfrom
ding-modding:ding/oauth-public-client-fix
Closed

fix(oauth): honor token_endpoint_auth_method: "none" for DCR public clients (PKCE-only)#909
ding-modding wants to merge 1 commit into
garrytan:masterfrom
ding-modding:ding/oauth-public-client-fix

Conversation

@ding-modding

@ding-modding ding-modding commented May 12, 2026

Copy link
Copy Markdown

Summary

registerClient (DCR) unconditionally generates a client_secret even when the client requests token_endpoint_auth_method: "none". The MCP SDK's clientAuth middleware then requires a secret on every /token exchange because client.client_secret is truthy from the stored row — which rejects every valid PKCE-only public-client flow (Claude Code, Cursor, MCP Inspector, …).

This PR makes registerClient honor "none" per RFC 7591 §2: no secret generated, NULL stored, no secret in the DCR response.

Repro

gbrain serve --http --enable-dcr --public-url https://<your-domain>

From a public-client MCP host (Claude Code shown — same behavior on Cursor):

claude mcp add --scope user --transport http gbrain "https://<your-domain>/mcp"
# trigger auth → browser opens → user clicks Allow → callback returns code

Before this PR: the /token exchange fails. The MCP SDK then surfaces:

Existing OAuth client information is required when exchanging an authorization code

After this PR: the same flow lands on a green ✓ Connected.

Root cause

src/core/oauth-provider.ts:172-194 always emits a fresh client_secret:

const clientSecret = generateToken('gbrain_cs_');
const secretHash = hashToken(clientSecret);
// ...
INSERT INTO oauth_clients (..., client_secret_hash, ...) VALUES (..., ${secretHash}, ...)

getClient reads client_secret_hash back as client.client_secret. The MCP SDK's middleware (@modelcontextprotocol/sdk/server/auth/middleware/clientAuth.js:19) then gates:

if (client.client_secret) {
  if (!client_secret) throw new InvalidClientError('Client secret is required');
  // ...
}

Public clients (per RFC 7591) cannot safely store a secret, so they send client_id + code_verifier only — and get rejected.

Fix

Gate secret generation on token_endpoint_auth_method:

const isPublicClient = client.token_endpoint_auth_method === 'none';
const clientSecret = isPublicClient ? undefined : generateToken('gbrain_cs_');
const secretHash = clientSecret ? hashToken(clientSecret) : null;

And omit client_secret from the DCR response for public clients. The oauth_clients.client_secret_hash column was already nullable — no schema migration needed. Confidential-client behavior is unchanged.

Tests

Two new cases in test/oauth.test.ts under client registration:

  • DCR with token_endpoint_auth_method "none" issues no client_secret — response has no secret, stored row's client_secret is falsy
  • DCR without token_endpoint_auth_method defaults to confidential client — regression guard so the default path still mints + stores a secret
$ bun test test/oauth.test.ts
 66 pass
 0 fail
 257 expect() calls

Spec references

  • RFC 7591 §2 — Client Metadata, token_endpoint_auth_method values including "none"
  • RFC 6749 §2.1, §10.1 — public vs confidential client classification
  • RFC 7636 — PKCE (the auth method public clients use instead of a secret)

Affected clients

Any MCP host that uses DCR with PKCE-only auth — i.e., any public OAuth 2.1 client. Confirmed in the wild for Claude Code and Cursor; the same break-mode applies to MCP Inspector and the reference clients in @modelcontextprotocol/sdk.


View in Codesmith
Need help on this PR? Tag @codesmith with what you need.

  • Let Codesmith autofix CI failures and bot reviews

…ients

RFC 7591 §2 specifies that clients registering via Dynamic Client
Registration with token_endpoint_auth_method "none" are public clients
and MUST NOT receive a client_secret. PKCE alone authenticates the
authorization code exchange.

Before this fix, registerClient unconditionally generated a client_secret
and stored its hash in oauth_clients.client_secret_hash. The MCP SDK's
clientAuth middleware (server/auth/middleware/clientAuth.js) inspects
the stored client and, when client.client_secret is truthy, requires
client_secret_post auth at /token. PKCE-only clients like Claude Code
and Cursor (both register with token_endpoint_auth_method: "none") sent
PKCE code_verifier with no secret, hit "Client secret is required", and
the SDK's client-side OAuth flow then surfaced "Existing OAuth client
information is required when exchanging an authorization code".

Fix: skip client_secret generation when token_endpoint_auth_method ===
"none", and omit client_secret from the DCR response in that case. The
oauth_clients.client_secret_hash column was already nullable (schema
unchanged). Confidential clients (default, or any non-"none" auth
method) continue to receive a secret exactly as before.

Tests: two new cases in client registration —
  - DCR with "none" issues no client_secret; stored secret is falsy
  - DCR without explicit auth method defaults to confidential (secret
    issued and stored)

Repro of the original bug:
  1. Start: gbrain serve --http --enable-dcr --public-url https://...
  2. From Claude Code: claude mcp add --transport http gbrain <url>
  3. Trigger auth → browser flow completes → token exchange fails with
     "Existing OAuth client information is required"
After fix, the same flow lands on a healthy token + green Connected.
garrytan added a commit that referenced this pull request May 15, 2026
…ederated_read + 3 more (#996)

* fix(mcp): skip stdin EOF handlers when MCP_STDIO=1

OpenClaw's bundle-mcp gateway and similar wrappers pipe the JSON-RPC
handshake on stdin then close their stdin half. Pre-fix, both stdin
'end' and 'close' listeners (server.ts:65-66 and serve.ts:204-206)
treated this as a permanent disconnect and shut the server down before
the first tool call arrived.

Guard both sites with `process.env.MCP_STDIO !== '1'`. Signal handlers
(SIGTERM/SIGINT/SIGHUP), transport.onclose, and the parent-process
watchdog still cover legitimate shutdown paths. The serve.ts site
threads the env read through an injectable `mcpStdio?: boolean` on
ServeOptions so tests stay isolated (no process.env mutation per
scripts/check-test-isolation.sh R1).

Tests: 3 new cases in test/serve-stdio-lifecycle.test.ts pin the
guard's invariants — mcpStdio=true must NOT trigger shutdown on stdin
EOF, signals must still drive shutdown with mcpStdio=true, and
mcpStdio=false (default) preserves existing CLI behavior. 25/25 pass.

Origin: PR #870.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(oauth): honor token_endpoint_auth_method=none for PKCE public clients

RFC 7591 §3.2.1: when a DCR client declares
token_endpoint_auth_method="none" (PKCE-only public clients like Claude
Code, Cursor), the authorization server MUST NOT issue a client_secret.
Pre-fix, registerClient unconditionally minted a secret, and the MCP
SDK's clientAuth middleware then rejected valid public-client flows on
/token because it expected client.client_secret to match.

Three changes to src/core/oauth-provider.ts:registerClient:

  - Gate clientSecret generation on isPublicClient = (auth_method === 'none').
    Public clients store client_secret_hash = NULL.
  - Omit client_secret from the response payload for public clients.
    Confidential clients (default client_secret_post and explicit
    client_secret_basic) keep their existing one-time-reveal shape.
  - Normalize NULL secret_hash to JS undefined in getClient so SDK
    middleware (which checks client.client_secret === undefined, not
    === null) correctly identifies public clients and skips the
    secret-comparison branch on /token.

Schema is already permissive (client_secret_hash TEXT, no NOT NULL on
both src/schema.sql and src/core/pglite-schema.ts) — no migration
needed.

Tests: 5 new cases in test/oauth.test.ts pin:
  - public client → no client_secret in response (#11 from plan)
  - default auth_method → secret unchanged (regression guard)
  - explicit client_secret_post → secret unchanged
  - getClient NULL→undefined normalization
  - PKCE full /authorize → /token end-to-end with no secret (#15 from plan)

69/69 oauth.test.ts cases pass. typecheck clean.

Origin: PR #909.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(serve-http): --bind HOST, default to loopback (127.0.0.1)

Adds `gbrain serve --http --bind <interface>` to control which network
interface the HTTP MCP server listens on. Default flipped from
`0.0.0.0` (pre-v0.34) to `127.0.0.1` (v0.34.0+).

Why the flip: gbrain's primary use case is a personal-knowledge brain on
a laptop. The previous default exposed brains on every interface — one
accidental `--http` invocation away from publishing the brain to a LAN.
Server operators who need remote access pass `--bind 0.0.0.0` (or a
specific interface). Codex's outside-voice on the original PR #864
correctly flagged that the additive flag wasn't actually the fix; the
default needed to change for the safety claim to hold.

If `--public-url` is set but `--bind` is unset, runServeHttp prints a
loud stderr WARN at startup recommending `--bind 0.0.0.0`. Declaring a
public URL while quietly binding loopback is almost always a
misconfiguration; we want the operator to see it on first start, not
silently fail remote requests.

Startup banner now includes a `Bind:` row so the listening interface is
visible alongside Port / Engine / Issuer.

Origin: PR #864, extended with D11 (default flip) per /plan-eng-review
codex outside-voice review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(mcp): seal source-isolation leak on read path (P0)

Pre-fix, an authenticated OAuth MCP client scoped to source-A could
enumerate source-B pages via six read-side ops: search, query (text
AND image paths), list_pages, traverse_graph, and find_experts. The
v0.31.8 source-scoping pattern shipped through dispatch.ts but the op
handlers never threaded ctx.sourceId into their engine calls, and
hybridSearch.ts:223's explicit SearchOpts rebuild dropped sourceId
even when callers passed it.

Sealing the leak:

  - src/core/operations.ts adds sourceScopeOpts(ctx), the canonical
    precedence ladder: ctx.auth.allowedSources (federated) wins over
    ctx.sourceId (scalar) wins over nothing. Threaded into all 5
    read-side op handlers + the query-image-path searchVector call
    (the 6th leak surface codex caught in plan review).

  - src/core/search/hybrid.ts:223 now threads sourceId + sourceIds
    fields through the inner SearchOpts rebuild. The explicit pick
    shape is preserved (HNSW inner-CTE ordering depends on it) but
    extended.

  - src/core/types.ts adds sourceIds?: string[] to SearchOpts +
    PageFilters (D9: federated read needs array-shaped engine filter
    or fan-out; array wins for hot retrieval).

  - src/core/operations.ts AuthInfo gains sourceId + allowedSources
    (D2: identity surface symmetric with the federated_read column
    #876 will add).

  - Both engines now apply WHERE source_id = $N (scalar) or = ANY($N::text[])
    (array) at the SQL layer for searchKeyword, searchKeywordChunks,
    searchVector, listPages, traverseGraph, traversePaths. Array form
    wins when both are set. The searchVector filter pushes into the
    inner HNSW CTE (codex flagged this placement during plan review).

  - traverseGraph + traversePaths signatures gain opts.sourceId +
    opts.sourceIds; engine.ts interface updated.

  - findExperts (the whoknows op, D3 5th leak surface) accepts
    sourceId + sourceIds and threads them into its internal
    hybridSearch call. PR #861 was authored before v0.33 shipped so
    this op wasn't covered in the original PR.

Auth wiring:

  - GBrainOAuthProvider.verifyAccessToken populates AuthInfo.sourceId
    from oauth_clients.source_id. JOIN guarded by isUndefinedColumnError
    so pre-v55 brains degrade to legacy projection rather than refusing
    every token verification.

  - GBrainOAuthProvider.registerClientManual gains a sourceId
    parameter (defaults to 'default'). DCR registerClient also sets
    source_id='default' on the inserted row.

  - serve-http.ts:929 cleanup: AuthInfo.sourceId is now a real typed
    field. The cast + GBRAIN_SOURCE env fallback chain is gone (D13).
    Legacy bearer tokens default to 'default' source in
    verifyAccessToken.

  - http-transport.ts (legacy access_tokens path) threads
    sourceId='default' through DispatchOpts so v0.22.7 callers stay
    source-scoped.

  - auth.ts CLI adds --source flag to gbrain auth register-client.

Migration v55 (D10 + D13):

  - ALTER TABLE oauth_clients ADD COLUMN source_id TEXT (nullable).
  - Backfill UPDATE source_id = 'default' WHERE source_id IS NULL —
    preserves v0.33 effective behavior verbatim for legacy clients.
  - ADD CONSTRAINT FK ... REFERENCES sources(id) ON DELETE SET NULL,
    wrapped in DO block so re-runs against fresh-install brains (where
    the FK already lives inline in SCHEMA_SQL) no-op cleanly.
  - CREATE INDEX idx_oauth_clients_source_id WHERE source_id IS NOT NULL
    for the verifyAccessToken JOIN.
  - GBRAIN_ACCEPT_SILENT_WIDEN env-flag wired through the runner via
    SET LOCAL gbrain.accept_silent_widen — reserved for future migrations
    that hit the silent-widen footgun codex flagged. This migration
    doesn't need it (column is brand new; no pre-existing stale values
    possible by definition).
  - src/core/pglite-schema.ts + src/schema.sql include the column +
    FK + index inline for fresh installs.

Tests: new test/e2e/source-isolation-pglite.test.ts with 13 regression
cases — one per leak surface (search/list_pages/traverse/etc.) plus
explicit AuthInfo.sourceId and AuthInfo.allowedSources op-handler
threading checks. Full unit suite: 6034 pass / 0 fail. PGLite
initSchema time dropped from 2.4s to 850ms after consolidating v55's
DO blocks (multiple DO blocks were slow on PGLite; one DO block for
the FK install only is fine).

Origin: PR #861 + plan-eng-review decisions D2/D3/D4/D9/D10/D13 + F2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(gateway): multimodal embedding for openai-compatible providers

Pre-fix, embedMultimodal hardcoded a recipe.id === 'voyage' branch and
threw AIConfigError for every other recipe. Multimodal-capable providers
fronted by LiteLLM (or any openai-compatible proxy) were unreachable
even when the operator had wired up the model.

The fix:

  - src/core/ai/gateway.ts adds embedMultimodalOpenAICompat() that
    POSTs to the standard /embeddings endpoint with content arrays
    carrying image_url entries. Routing comes from the existing
    recipe.implementation switch — Voyage stays on its own
    /multimodalembeddings path; every other openai-compatible recipe
    flows through the new helper.

  - src/core/ai/recipes/litellm-proxy.ts declares
    supports_multimodal: true so embedMultimodal accepts the recipe.
    No multimodal_models allow-list: LiteLLM is a passthrough proxy
    and the user owns model-id selection; provider rejection (400 from
    upstream) is the right enforcement layer there. Voyage's static
    allow-list shape stays unchanged (its 12 models share
    supports_multimodal but only one is multimodal-capable).

  - D12 runtime dimension validation: the new helper checks the
    returned vector length against the recipe's declared default_dims
    (preferred) or the brain's embedding_dimensions config. Mismatch
    throws AIConfigError with model id + observed + expected so the
    operator can swap models or rebuild the column. Pre-fix, a
    wrong-dim response would surface as a cryptic pgvector
    "vector dimension mismatch" at INSERT time.

  - Auth resolution routes through the existing defaultResolveAuth
    helper so optional-auth recipes (LiteLLM proxy with no
    LITELLM_API_KEY) and required-auth recipes both share one code
    path. Optional-auth sends "Authorization: Bearer unauthenticated"
    which servers like Ollama / llama-server ignore but the SDK
    contract requires.

Tests: 11 new cases in test/openai-compat-multimodal.test.ts cover
happy-path, multi-input batching, unauthenticated proxy, D12 dim
mismatch + default-dim fallback, 401 / 400 / malformed-JSON / non-array
error paths, and an explicit Voyage-regression test pinning that the
new openai-compat route doesn't accidentally hijack the Voyage path.
All 41 multimodal-related tests pass (existing voyage suite + new).
typecheck clean.

Origin: PR #875 + plan-eng-review D12 (runtime dim validation).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(oauth): federated_read read scope (#876)

Pre-fix, OAuth clients had a single source-scope axis (source_id, added
in v55). A client could either write+read one source OR be a super-reader
across all sources (via NULL source_id). There was no middle ground —
WeCare-style L3 dept clients that need to write to dept-x but read
dept-x + parent canon + shared canon had no expression.

#876 adds federated_read TEXT[] as an orthogonal read-scope axis. source_id
is the WRITE authority; federated_read is the READ authority. They default
to matching values (read scope == write scope, the pre-v0.34 default)
when a client is registered without an explicit federated read list.

Migrations v56-v60 (six new migrations on top of v55):

  - v56: ALTER TABLE ... ADD COLUMN federated_read TEXT[] NOT NULL DEFAULT '{}'.
  - v57 (F5): explicit CASE backfill so source_id IS NULL → '{}' (not an
    array containing NULL — codex caught this ambiguity during plan review).
  - v58: post-backfill validation. Fails loud if any row's source_id isn't
    in its federated_read array, pointing at a logic bug in v57 if fired.
  - v59: flip the source_id FK from ON DELETE SET NULL to ON DELETE
    RESTRICT now that federated_read provides the alternative scope-loss
    path. Pre-flip, deleting a source could silently widen any oauth_client
    to super-reader; post-flip, source delete is refused if any client
    references it (operator must revoke/re-scope first).
  - v60: GIN index on federated_read for array-containment queries.

Auth wiring:

  - GBrainOAuthProvider.verifyAccessToken JOINs c.federated_read and
    populates AuthInfo.allowedSources. Pre-v56 / pre-v55 brains degrade
    via the existing isUndefinedColumnError fallback chain.
  - registerClientManual gains a federatedRead?: string[] parameter
    (defaults to [sourceId]).
  - DCR registerClient sets source_id='default' + federated_read=['default']
    on the inserted row.
  - auth.ts CLI adds --federated-read SRC1,SRC2,... flag. The
    register-client output now prints "Federated reads:" so operators
    confirm the scope they set.

Engines consume the federated array through the SearchOpts.sourceIds /
PageFilters.sourceIds field that #861 added (no engine changes here — the
plumbing was D9). sourceScopeOpts in operations.ts already prefers the
auth.allowedSources array over scalar ctx.sourceId when set.

Test seam:
  - test/book-mirror.test.ts now spawns the CLI with GBRAIN_HOME pointed
    at a tempdir so the test isn't sensitive to the developer's local
    ~/.gbrain/config.json. Pre-fix the test could silently inherit a real
    Postgres connection and hang past the default 5s test timeout. Fresh
    GBRAIN_HOME → "No brain configured" → exit 1 in <1s.
  - test/e2e/source-isolation-pglite.test.ts gains one more regression
    case: AuthInfo.allowedSources = [] (explicit empty) MUST NOT widen
    scope to "all sources" — the silent-widen footgun precedence ladder.
  - test/openai-compat-multimodal.test.ts is part of the wave's commits
    via the migrate.ts changes that bump the schema chain. typecheck-only
    fix on a captured-auth type was already in #875's tree.

6045 unit tests pass / 0 fail. typecheck clean. PGLite initSchema runs
v55-v60 in ~786ms total (within the test-harness budget for tests using
the canonical beforeAll engine pattern).

Origin: PR #876 + plan-eng-review F5 (CASE backfill).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* v0.34.0.0: MCP fix wave (#870 #909 #864 #861 #875 #876)

VERSION + package.json + CHANGELOG bump for the six-PR MCP fix wave.
Schema chain extends from v54 → v60; oauth_clients gains source_id +
federated_read columns; auth'd MCP clients now stay inside their scope
across all read-side ops; PKCE-only DCR works; --bind defaults to
loopback; LiteLLM multimodal embedding ships.

Contributed by @Hansen1018 (#870), @ding-modding (#909), @DukeDawg
(#864), @toilalesondev (#861 + #876), @yoelgal (#875).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: update project documentation for v0.34.0.0

Sync README, CLAUDE.md, SECURITY.md, docs/architecture/topologies.md,
and docs/mcp/DEPLOY.md to reflect the v0.34.0.0 MCP fix wave:

- README: document --bind HOST default (loopback), --source +
  --federated-read register-client flags, PKCE public-client gate
- SECURITY.md: note loopback-by-default for serve --http, update the
  trust-proxy contract to point at the new default
- CLAUDE.md: annotate operations.ts (sourceScopeOpts helper),
  oauth-provider.ts (verifyAccessToken JOIN + PKCE public clients),
  serve-http.ts (--bind flag), gateway.ts (openai-compat multimodal +
  dim validation), mcp/server.ts (MCP_STDIO guard), auth.ts (--source
  + --federated-read), migrate.ts (v58-v63 chain), engine.ts
  (sourceIds field). Add 4 new test-file entries for
  source-isolation-pglite, openai-compat-multimodal,
  serve-stdio-lifecycle, oauth.test.ts PKCE cases
- docs/architecture/topologies.md: source-scoped register-client
  example, --bind 0.0.0.0 for thin-client host setup
- docs/mcp/DEPLOY.md: --bind explanation in the ngrok section,
  source-scoped client recipe
- llms-full.txt: regenerated per the CLAUDE.md-edit chaser rule

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump v0.34.0.0 → v0.34.1.0

Renumbering the MCP fix wave from v0.34.0.0 to v0.34.1.0 so the
release slot lands between master's v0.33.2.1 and the next minor.

Touches every release-artifact mention:
- VERSION: 0.34.0.0 → 0.34.1.0
- package.json: same
- CHANGELOG.md header + "To take advantage" block
- CLAUDE.md key-files annotations (8 entries that document this wave)
- llms-full.txt (regen from CLAUDE.md)
- README.md / SECURITY.md / docs/architecture/topologies.md / docs/mcp/DEPLOY.md
- Wave code-comment markers ("// v0.34.0 (#NNN):" → "// v0.34.1 (#NNN):")

Test files renamed alongside since they were committed with the wave.

Commit subjects on the original 6 PR commits + the v0.34.0.0 bump
commit (4f533c76b47db7) intentionally NOT rewritten — those are
history. `git log` finds the implementation by message subject, not by
version tag.

6275 unit tests pass, typecheck clean, migration chain v58-v63 unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ding-modding

Copy link
Copy Markdown
Author

Landed via #996 (v0.34.1.0) as part of the MCP fix wave — closing this. Thanks!

garrytan added a commit that referenced this pull request May 21, 2026
…dential clients (#1253)

* fix(reindex-frontmatter): connect engine before query (#1225)

`createEngine()` from src/core/engine-factory.ts only constructs the
engine; callers MUST call connect() before any executeRaw. The
reindex-frontmatter CLI was constructing the engine and going
straight to countAffected, which crashed on PGLite with "PGLite not
connected. Call connect() first." even on --dry-run.

Fix follows the existing-command pattern (src/commands/auth.ts,
src/commands/backfill.ts, src/commands/integrity.ts all do the
same): pass toEngineConfig(cfg) into both createEngine() AND
engine.connect(), then engine.initSchema() (idempotent on a current
schema, ~1ms cost).

Pre-fix verification: codex outside-voice CF5 flagged the related
"can't import connectEngine from cli.ts" misdirection in the
original fix plan. This implementation uses the canonical sibling
pattern instead.

Regression test pinned at test/reindex-frontmatter-connect.test.ts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: bump VERSION to 0.37.7.0 + stub CHANGELOG

v0.37.5.0 claimed by #1229 (warsaw-v4); v0.37.6.0 by #1246
(OpenRouter recipe). v0.37.7.0 is the next free slot for this
fix wave.

CHANGELOG entry stubbed in user-facing voice per CLAUDE.md
"CHANGELOG voice + release-summary format" — ELI10 lead-first,
real fix details below. The "## To take advantage of v0.37.7.0"
block follows the v0.13+ self-repair pattern from CLAUDE.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(subagent): short-circuit terminal-on-resume (#1151)

Bug: when the worker resumed a subagent job whose persisted last
message was an assistant turn with text-only content (no tool_use
blocks), the replay reconciler at subagent.ts:241-247 had no
branch for that case. The main loop then called messages.create
against a conversation ending in assistant role, which Sonnet 4.6+
rejects with HTTP 400 "This model does not support assistant
message prefill." 3 retries later → dead-letter, despite all the
job's work having committed in earlier turns.

@zscgeek's bug report pinned this exactly: dream-cycle Otter
corpus runs hit ~7% dead-letter rate, every dead job's last
subagent_messages row was a text-only synthesis summary listing
slugs that already existed in `pages`. Their proposed fix mirrors
this implementation.

Fix: add an else branch to the assistant-tail check that mirrors
the live-loop terminal logic at subagent.ts:440-447 — reconstruct
finalText from the persisted text blocks, return
stop_reason='end_turn' immediately. No LLM call, no schema change.

Two new regression cases:
  - text-only terminal on resume returns immediately with zero
    messages.create calls
  - tool-use replay path unchanged (existing behavior preserved)

Codex outside-voice (CF13) initially flagged this fix as
mis-targeted, claiming subagent.ts already handled the case.
/investigate run revealed the live-loop terminal at :440-447 was
covered but the REPLAY-path terminal at :241-247 was missing —
both branches need symmetric handling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(autopilot): scope lockfile to GBRAIN_HOME (#1226)

The autopilot lockfile was hardcoded at `~/.gbrain/autopilot.lock`
(via `process.env.HOME`), bypassing GBRAIN_HOME. Two brains pointed
at different GBRAIN_HOME directories still wrote to the same global
lockfile; one would silently take over the other on each restart.

Fix: route through `gbrainPath('autopilot.lock')` from
src/core/config.ts (imported aliased as gbrainHomePath since the
local `gbrainPath` var in installAutopilot references the CLI
binary path). The mkdirSync(`~/.gbrain`) call also routes through
the helper so the directory is created in the right place too.

Co-authored with @rafaelreis-r — same fix shape as PR #1227,
re-implemented against current master per the wave's
"re-implement, credit, close" workflow.

Tests cover: one GBRAIN_HOME → one canonical lock; two
GBRAIN_HOME values → two distinct locks; default fall-through
still works.

Co-Authored-By: rafaelreis-r <noreply@github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(graph-query): foreign-edge footer + --include-foreign (#1153)

The graph-query CLI silently dropped edges to pages in other sources
on federated brains. Users had no signal those edges existed unless
they read the source code.

Fix:
- New --include-foreign flag (off by default, preserves the existing
  scoping contract; on = explicit cross-source traversal).
- After every traversal, count edges from rootSlug whose target page
  lives in a different source. When count > 0 AND user didn't opt in,
  emit a stderr footer:
    `(N edge(s) to foreign-source pages hidden; pass --include-foreign
     to include them)`
- The "no edges found" path also runs the count + footer so users
  discover foreign edges even when scoped traversal returned nothing.
- Thin-client path skips the count (engine query not available);
  future T1 work threads source resolution through MCP for that path.
- Single quotation correctness in count SQL: page_links table is
  `links` (not `page_links`); JOIN both endpoints to pages and compare
  source_id, NULL-safe via `IS NOT NULL` guards on both sides.
- Fail-open on missing source_id column for pre-v0.18 brains: return 0
  (no foreign edges to report) instead of throwing.

4 new test cases: footer fires on scoped query with foreign edge,
--include-foreign suppresses footer, zero-foreign no-footer case,
pluralization regression guard.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(sources): `gbrain sources current` + tier attribution (#1222)

Federated-brain users running destructive ops (extract, import,
purge) need a way to verify which source they're targeting BEFORE
the op runs. Pre-fix, the only way was to grep config files or run
the op with --dry-run and inspect output.

New command:
  gbrain sources current             # human output
  gbrain sources current --json      # machine-readable
  gbrain sources current --source X  # show what an explicit --source
                                     # X would resolve to (validates
                                     # X exists in the sources table)

Output names BOTH the resolved source id AND which tier of the 6-tier
resolution chain won (flag / env / dotfile / local_path /
brain_default / seed_default), plus a `detail` line naming the
winning signal (e.g. "GBRAIN_SOURCE=dept-x" or ".gbrain-source" or
"/work/gstack/src").

Implementation:
- New `resolveSourceWithTier()` in source-resolver.ts as an additive
  variant of `resolveSourceId()`. Walks the same 6 steps in the same
  order; just returns `{ source_id, tier, detail? }` instead of bare
  string. Existing `resolveSourceId()` unchanged — all callers
  continue working.
- New `SOURCE_TIER_NAMES` const + `SourceTier` type export so the
  CLI, doctor (Tier 5 follow-up), and future MCP consumers share one
  vocabulary instead of inlining strings.
- Help text updated; `current` subcommand registered in dispatcher.

11 new tests pin the 6-tier ladder + priority semantics. Existing
19 source-resolver tests still pass (regression preserved).

Per codex CF3 (the existing src/core/source-resolver.ts was missed
in the original plan). Re-uses the existing helper instead of
inventing a duplicate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(extract): --source-id scopes extraction to one brain source (#1204)

Federated brain users running `gbrain extract` had no way to scope
extraction to one source. The DB path walks all sources together via
listAllPageRefs(), which is correct for cross-source resolution but
sometimes the user wants to extract per-source explicitly (e.g.
re-running extract on a specific source after a manual import).

The pre-existing `--source` flag is the data-source axis (fs|db) and
can't be repurposed. New flag `--source-id <id>` joins it on the
brain-source-id axis:

  gbrain extract all --source db --source-id alpha
    -> walks only alpha-source pages; extracts links + timeline
       from those, into the alpha source

Important: the resolver maps (allSlugs + slugToSources) stay built
from the FULL listAllPageRefs result, not the scoped subset. This
ensures qualified cross-source wikilinks like `[[other-src:slug]]`
still resolve correctly even when the extract walk is scoped — the
filter is on which pages we extract FROM, not what we can resolve TO.

Threaded through both `extractLinksFromDB` and `extractTimelineFromDB`
with backward-compat: callers passing no opts get the old behavior.

4 new test cases pin: walks-all-without-flag baseline,
alpha-only-when-scoped-to-alpha, beta-only-when-scoped-to-beta,
empty-set-on-unknown-source.

Note: #1204's wider "silent 0 links" report on federated brains has
additional facets beyond this flag (resolver path edge cases on
overlapping slugs). The scoped-walk fix gives users an explicit
workaround AND closes the per-source extraction gap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(todos): file v0.37.7.0 follow-ups (#1173, #1204, T5N)

Three items deferred from v0.37.7.0:

1. #1173 .sql indexing — verify-first gate found
   tree-sitter-sql.wasm missing from src/assets/wasm/grammars/.
   Dedicated wave needed: vendor the wasm, add .sql to walker
   filter, address slug-shape collision with #1172.

2. #1204 deeper investigation — wave added --source-id flag as
   workaround. Underlying silent-zero-links bug on unscoped
   federated extracts needs its own /investigate pass against
   a cross-source-duplicate-slug fixture.

3. Tier 5N doctor sweep for dead-lettered subagent jobs matching
   the #1151 fingerprint. Deferred to v0.37.8+ behind the islamabad
   doctor.ts conflict resolution.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(sync): walker skips git submodule directories (#1169)

Sync walker descended into git submodules and indexed their markdown
content as if it belonged to the parent brain. Users with submodules
in their brain repo saw foreign content in their pages table.

Fix: pruneDir gains an optional `parentDir` arg. When set, the helper
stats `<parentDir>/<name>/.git` and skips the directory if `.git`
exists as a FILE (gitfile pointer — the canonical submodule shape).
Directories containing `.git` as a DIRECTORY (a real nested repo,
not a submodule) are descended into; the inner `.git` dir itself is
then dot-prefix-excluded.

Callers updated to pass parentDir:
- src/commands/extract.ts walkMarkdownFiles
- src/core/cycle/transcript-discovery.ts walker

Back-compat preserved: existing pruneDir(name) callers without
parentDir get the pre-v0.37.7.0 behavior unchanged.

Companion `.gitignore`-respect feature from PR #1159 (@jetsetterfl)
NOT in this wave — it would require adding the `ignore` npm package
as a dep, which the plan's "no new deps in this PR" gate excludes.
Filed as follow-up TODO for a dedicated wave.

5 new test cases pin the submodule shape + back-compat + nested-repo
ambiguity. Existing extract-fs / extract-db tests unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(brain-routing): document 6-tier source resolution chain (#1222)

The convention skill didn't have a tier-by-tier reference for how
gbrain resolves the active source. Users running federated brains
had to read the source code to know which signal wins.

Added:
- Canonical 6-tier table (flag → env → dotfile → local_path →
  brain_default → seed_default) matching src/core/source-resolver.ts.
- Pointer to `gbrain sources current` (new in v0.37.7.0) as the
  verification command.
- The CLI-layer trust boundary note: operations.ts handlers don't
  read env/dotfile (preserves v0.34.1.0 source-isolation work for
  MCP callers).
- Per-command flag map: --source, --source-id (extract), and
  --include-foreign (graph-query).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(import): --source-id flag routes pages to a brain source (#1167)

`gbrain import --source dept-x ./pages` silently fell back to the
default source because the CLI parser never consumed --source. PR
#707's design intent excluded the flag explicitly; users had no
signal their pages were going to the wrong place. #1167 + #1222
filed the regression.

Fix: parse `--source-id <id>` (matching v0.37.7.0 extract.ts T2's
naming convention — --source-id stays out of conflict with future
axes that may want --source). When set, the flag value wins over
any programmatic opts.sourceId; back-compat preserved for callers
that pass sourceId via opts only.

Also threaded into the positional-dir arg parser's flagValues set
so `--source-id <value> <dir>` doesn't treat <value> as the dir.

Note on related surfaces:
- `gbrain query "X" --source_id dept-x` already routed correctly
  via the operations.ts query op (added in v0.34) — no fix needed.
- `gbrain extract --source-id <id>` shipped in T2.
- `gbrain sync --source <id>` already worked (pre-existing).
- `gbrain sources current` (shipped in T4) is the verification
  tool — run it before destructive ops to confirm routing.

Closes the silent-fallback for the import path. Co-authored with
@tyad67-netizen (#1168), @hnshah (#1124, #1120), whose patches
informed the shape; re-implemented against current master per
the wave's "re-implement, credit, close" workflow.

3 new test cases pin: default-without-flag, --source-id-routes-correctly,
flag-value-not-treated-as-dirArg.

Co-Authored-By: tyad67-netizen <noreply@github.com>
Co-Authored-By: hnshah <noreply@github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(autopilot): reconnect classifier + launchd ThrottleInterval (#1162)

Pre-fix: when database_url was unset/malformed, the DB-health-check
reconnect loop logged `config.database_url undefined` forever
because the catch swallowed every error type uniformly. launchd's
KeepAlive=true respawned immediately on any exit, so even when
the process did exit, it came right back into the same bad state.
@colin477 reported the daemon-thrash pattern.

Two-part fix:

1. In-process error classifier — `classifyReconnectError(err)`:
   - `unrecoverable` (database_url missing/empty/malformed, auth
     failure, no-brain-configured): exit immediately with a clear
     stderr line. Pattern-matched against postgres / config-loader
     error shapes. Tests pin the matcher against the #1162
     fingerprint exactly.
   - `recoverable` (network blip, pool saturated, connection refused
     on a port coming up, Supabase 503): retry. Up to
     GBRAIN_AUTOPILOT_MAX_RECONNECT_FAILS (default 30 = ~5min) before
     finally giving up with `max_reconnect_fails_exceeded`.
   - Counter resets on every successful health probe or reconnect.

2. launchd plist gains `ThrottleInterval=60`. Combined with the
   in-process exit, launchd waits 60s before relaunching instead
   of immediate respawn. Pure-function `generateLaunchdPlist()`
   exported for tests.

16 new test cases:
- 11 classifier cases (database_url shapes, malformed URL, auth,
  role-does-not-exist with quoted name, network blip, pool
  saturated, 503, non-Error inputs, case-insensitivity)
- 5 plist generator cases (ThrottleInterval=60, KeepAlive
  preserved, wrapper path, XML escaping, StandardErrorPath).

Pre-existing autopilot-lock-path tests unchanged — both fixes
land cleanly side-by-side.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(oauth): confidential clients via custom /token middleware (#1166)

v0.34.1.0 (#909) fixed PUBLIC PKCE clients (client_secret=undefined)
by normalizing NULL → undefined in getClient. Confidential clients
regressed: the MCP SDK's clientAuth middleware does plaintext
`client.client_secret !== presented_secret` compare, but gbrain
stores SHA-256 hashes, so the SDK's compare always failed for
authorization_code and refresh_token grants on confidential clients.
Result: /token returned `invalid_client` for every confidential
exchange.

Fix shape per locked-decision-5: custom /token middleware BEFORE the
SDK's authRouter, similar to the pre-existing client_credentials
handler. The middleware:

1. Detects confidential auth via `client_secret` in body
   (client_secret_post) OR `Authorization: Basic` header
   (client_secret_basic per RFC 6749 §2.3.1).
2. Falls through to the SDK when neither is present (public PKCE
   path stays canonical, preserves v0.34.1.0 behavior).
3. Calls new `verifyConfidentialClientSecret(clientId, presented)`
   on the provider which does SHA-256 hash compare ourselves
   (same shape as exchangeClientCredentials' existing hash check).
4. On verification success, calls existing
   `exchangeAuthorizationCode` / `exchangeRefreshToken` directly
   with the validated client.
5. RFC 6749 §5.2 error semantics: 401 invalid_client for auth
   failures, 400 invalid_grant for code/token problems.

Per CLAUDE.md "GBRAIN:RLS_EXEMPT" annotation contract: this surface
sits in front of the SDK's clientAuth and doesn't depend on the
SDK's plaintext compare working — the SDK's middleware never
fires for confidential paths the new middleware claims.

7 new test cases pin: correct-secret-returns-client, wrong-secret
opaque rejection, non-existent client, public-client refuses
the confidential path, case-sensitivity, soft-deleted revocation,
verify-then-exchange-refresh round-trip with second-use rejection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(doctor): 3 new checks — source routing + oauth + autopilot lock (T12/T13/T14)

Three v0.37.7.0 doctor checks landing in one atomic commit (single
file, shared merge-conflict surface with garrytan/islamabad-v3 per
locked decision 1):

1. source_routing_health (T12 / #1167):
   Sample non-default sources for pages; warn when a registered
   source has zero pages (silent-collapse-to-default fingerprint).
   D5 lock: total-sample cap of 200 pages across all sources, with
   per-source cap = min(50, ceil(200/N)) so a 20-source CEO brain
   pays 200 selects, not 1000. Fix hint paste-ready to
   `gbrain sources current --json` for verification.

2. oauth_confidential_client_health (T13 / #1166):
   Probe every oauth_clients row. Confidential clients (auth_method
   != 'none') must have a non-NULL client_secret_hash; if any row
   claims confidential auth but stores NULL hash, that's the
   pre-v0.37.7.0 regression. Public clients (auth_method='none')
   correctly keep NULL hash per v0.34.1.0 #909. Fix hint:
   `gbrain auth revoke-client + register-client` OR `gbrain upgrade`.
   Pre-OAuth schemas (missing oauth_clients table) skip gracefully.

3. autopilot_lock_scope (T14 / #1226):
   Detect stale ~/.gbrain/autopilot.lock outside the current
   GBRAIN_HOME. Codex CF11: dangerous to paste-ready `rm` without
   verifying the owning PID isn't a live process. Hint reads the
   PID file and gives the user a `ps -p <pid>` check before any
   delete — matches sshd-style stale-lock recovery hints.

9 new test cases pin the canonical paths. Pre-existing 80+ doctor
checks unchanged.

Expected to conflict with garrytan/islamabad-v3 at merge time. The
3 new check functions live in their own block far from the
islamabad skill_brain_first check; the conflict surface should be
limited to the `checks.push(...)` call site near the end of
runDoctor's DB-checks phase (~10 lines).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(test): withEnv wrapper in source-resolver-with-tier (test-isolation lint)

The new source-resolver-with-tier.test.ts from T4 mutated
process.env.GBRAIN_SOURCE directly in two cases, which violates
scripts/check-test-isolation.sh R1 (env mutations leak across
parallel-loaded test files in the same shard process).

Fix: wrap both mutation sites in withEnv() from test/helpers/with-env.ts,
which saves+restores via try/finally per the canonical pattern in
CLAUDE.md.

Pure refactor — all 11 cases still green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: update project documentation for v0.37.7.0

CHANGELOG.md — populated the "What landed" stub with the 18-commit
brisbane wave (source-id flag threading, sources current subcommand,
graph-query foreign-edge footer, autopilot lockfile scope + reconnect
classifier + launchd ThrottleInterval, OAuth confidential client
middleware, reindex-frontmatter connect fix, subagent terminal-on-resume
fix, sync walker submodule skip, 3 new doctor checks, brain-routing.md
convention skill). Voice: ELI10 lead, capability table, paste-ready
verification, "what's safe to know" + "what we caught" sections.

CLAUDE.md — extended Key Files annotations for the v0.37.7.0 changes:
import/extract --source-id flags, sources current subcommand, graph-query
--include-foreign, resolveSourceWithTier() additive helper, autopilot
classifyReconnectError + generateLaunchdPlist exports, OAuth confidential
client middleware, pruneDir submodule detection, subagent terminal
short-circuit, 3 new doctor checks. Pinned by their test files.

llms-full.txt — regenerated via `bun run build:llms` (CI guard at
test/build-llms.test.ts will fail otherwise).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: rafaelreis-r <noreply@github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant