v0.28.12 feat: LongMemEval benchmark harness by garrytan · Pull Request #606 · garrytan/gbrain

garrytan · 2026-05-04T03:54:43Z

Summary

gbrain eval longmemeval <dataset.jsonl> runs the public LongMemEval benchmark
directly against gbrain's hybrid retrieval. Each question starts from a clean
in-memory PGLite state: its haystack is imported, the question is asked, and the
hypothesis is emitted as JSONL, exactly the shape LongMemEval's evaluate_qa.py
consumes. The user's ~/.gbrain brain is never opened, and retrieved chat content
is sanitized with the same INJECTION_PATTERNS that protect takes from prompt injection.
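
As a rough illustration of the output contract, the sketch below serializes one answer per line. It assumes each line carries the `question_id` and `hypothesis` fields that LongMemEval's evaluation script reads; the name `toJsonlLine` is hypothetical and is not the harness's actual emitter.

```typescript
// Hypothetical row shape; field names assumed from LongMemEval's hypothesis-file format.
interface HypothesisRow {
  question_id: string;
  hypothesis: string;
}

// Serialize one row as a single LF-terminated JSON line.
// JSON.stringify escapes embedded newlines, so a row never spans two lines.
function toJsonlLine(row: HypothesisRow): string {
  return JSON.stringify({ question_id: row.question_id, hypothesis: row.hypothesis }) + "\n";
}

const exampleLine = toJsonlLine({
  question_id: "q-001",
  hypothesis: "She moved to Lisbon\nin 2023.",
});
```

Keeping the terminator an explicit `"\n"` avoids platform-dependent line endings in the output file.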

Architecture: one in-memory PGLiteEngine per benchmark run, TRUNCATE between
questions with runtime-enumerated tables via pg_tables so future schema additions
don't silently leak across questions. No EphemeralBrain class wrapper (codex
reviewed and rejected as over-engineering for a sequential benchmark; design
collapsed accordingly).
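
The reset-in-place idea can be sketched as follows. This is a minimal illustration under stated assumptions: the `QueryRunner` interface, the `public` schema filter, and the `RESTART IDENTITY CASCADE` options are hypothetical choices, and the real harness may exclude some tables or use a different engine surface.

```typescript
// Hypothetical minimal query interface; the real PGLiteEngine surface may differ.
interface QueryRunner {
  query(sql: string): Promise<{ rows: Array<{ tablename: string }> }>;
}

// Quote a Postgres identifier by doubling any embedded double quotes.
function quoteIdent(name: string): string {
  return '"' + name.replace(/"/g, '""') + '"';
}

// Build one TRUNCATE statement covering every listed table, or null if there are none.
function buildTruncate(tables: string[]): string | null {
  if (tables.length === 0) return null;
  return `TRUNCATE ${tables.map(quoteIdent).join(", ")} RESTART IDENTITY CASCADE`;
}

// Enumerate tables at runtime so tables added by later migrations are cleared too.
async function resetTables(db: QueryRunner): Promise<string | null> {
  const res = await db.query("SELECT tablename FROM pg_tables WHERE schemaname = 'public'");
  const sql = buildTruncate(res.rows.map((r) => r.tablename));
  if (sql !== null) await db.query(sql);
  return sql;
}
```

Because the table list comes from `pg_tables` rather than a hardcoded array, a new table cannot carry rows from one question into the next without anyone noticing.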

New CLI:

gbrain eval longmemeval <dataset.jsonl> [--limit N] [--model M] [--retrieval-only] [--keyword-only] [--expansion] [--top-k K] [--output FILE]

Hermeticity gate: --help works on a machine with no ~/.gbrain/config.json

New code (feat commits):

src/eval/longmemeval/{harness,adapter,sanitize}.ts — pure converter, reset-in-place harness, prompt-injection-safe chat renderer
src/commands/eval-longmemeval.ts — CLI entrypoint with try/catch envelope per question, structural <chat_session> framing, ThinkLLMClient injection seam
src/cli.ts — pre-dispatch bypass for eval longmemeval so the user's brain is never connected
src/core/think/sanitize.ts — one-line export INJECTION_PATTERNS so the takes + benchmarks share one source of truth
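
A minimal sketch of pattern stripping plus structural framing, to make the shape of the chat renderer concrete. The two patterns and the 500-character cap below are illustrative assumptions, not the contents of the real INJECTION_PATTERNS list or the harness's actual limit.

```typescript
// Hypothetical stand-ins; the real INJECTION_PATTERNS live in src/core/think/sanitize.ts.
const EXAMPLE_PATTERNS: RegExp[] = [
  /ignore (all )?(previous|prior) instructions/gi,
  /<\/?chat_session[^>]*>/gi, // keep content from closing the data frame early
];

const MAX_CHARS = 500; // assumed cap, for illustration only

// Replace every pattern hit with a marker, count the hits, then apply the length cap.
function sanitizeChatContent(text: string): { text: string; stripped: number } {
  let stripped = 0;
  let out = text;
  for (const pattern of EXAMPLE_PATTERNS) {
    out = out.replace(pattern, () => {
      stripped += 1;
      return "[removed]";
    });
  }
  if (out.length > MAX_CHARS) out = out.slice(0, MAX_CHARS);
  return { text: out, stripped };
}

// Wrap sanitized content in a tag the prompt can tell the model to treat as data.
function renderChatBlock(sessionId: string, content: string): string {
  const clean = sanitizeChatContent(content);
  return `<chat_session id="${sessionId}">\n${clean.text}\n</chat_session>`;
}
```

Stripping the frame's own tag name from the content is what keeps a hostile haystack from ending the data block and speaking as instructions.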

New tests:

test/eval-longmemeval.test.ts (12 cases — harness lifecycle, reset clears all tables, schema-migration robustness, p50/p99 speed gate, adapter shape, source-boost regression guard, end-to-end with stubbed LLM, JSONL format guard, key contract, per-question failure handling)
test/longmemeval-sanitize.test.ts (12 cases — pattern strip parity, length cap, structural framing)
test/fixtures/longmemeval-mini.jsonl (5 hand-authored questions)

Test Coverage

Coverage diagram + assertions cover every code path in the new files:

HARNESS (src/eval/longmemeval/harness.ts)
  ├── createBenchmarkBrain ── [★★ TESTED] lifecycle test
  ├── resetTables ──── [★★★ TESTED] reset clears + schema-migration robustness + speed
  └── withBenchmarkBrain ── [★★ TESTED] indirect via runEvalLongMemEval e2e

ADAPTER (src/eval/longmemeval/adapter.ts)
  └── haystackToPages ── [★★★ TESTED] shape + missing-dates path pinned

SANITIZE (src/eval/longmemeval/sanitize.ts)
  ├── sanitizeChatContent ── [★★★ TESTED] 13 patterns + length cap
  └── renderChatBlock ── [★★ TESTED] structural framing + sanitizedCount

CLI (src/commands/eval-longmemeval.ts)
  ├── parseArgs / loadDataset ── [★★ TESTED] e2e via fixture
  ├── per-question loop ── [★★★ TESTED] try/catch envelope (test 12)
  ├── generateAnswer (stubbed LLM) ── [★★ TESTED] prompt construction asserted
  ├── renderRetrievedAsHypothesis ── [★ TESTED] retrieval-only path
  └── emitJsonlLine ── [★★★ TESTED] LF + UTF-8 byte round-trip pinned

COVERAGE: 23/23 paths tested (100%)

Targeted regression checks ran green:

bun test test/eval-longmemeval.test.ts test/longmemeval-sanitize.test.ts → 23 pass / 0 fail / 435 expect calls / 5.7s
Speed gate: warm reset + import 5 pages + search p50=25.9ms, p99=30.3ms (gate is p50<500ms)
Full bun test exit 0
bun run typecheck clean
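
For readers unfamiliar with latency gates, here is a small sketch of how a p50/p99 check over timing samples might be computed. It uses the nearest-rank percentile definition, which is an assumption; the test file may compute percentiles differently.

```typescript
// Nearest-rank percentile over latency samples in milliseconds.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[index];
}

// The gate passes when the median stays under the limit (500 ms in this PR).
function passesSpeedGate(samples: number[], p50LimitMs: number): boolean {
  return percentile(samples, 50) < p50LimitMs;
}
```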

Pre-Landing Review

/plan-eng-review ran on the plan and cleared with 9 issues all resolved
(D1–D13 accepted). Codex outside-voice review cleared after a structural
pivot — caught 8 substantive defects we missed (resolveModel exists on
v0.28-release, hybridSearch needs both expansion and expandFn, progress
API signature, etc.) and forced collapsing the original EphemeralBrain class
to the simpler reset-in-place harness via runtime pg_tables enumeration.
All findings folded into the implementation. No new findings on the diff.

Design Review

No frontend files changed — design review skipped.

Eval Results

No prompt-related files in the diff — evals skipped.

Plan Completion

All 13 implementation tasks (I1–I13) DONE. Full task → file mapping:

I1 export INJECTION_PATTERNS → src/core/think/sanitize.ts
I2 harness.ts → src/eval/longmemeval/harness.ts
I3 adapter.ts → src/eval/longmemeval/adapter.ts
I4 sanitize.ts → src/eval/longmemeval/sanitize.ts
I5 eval-longmemeval.ts → src/commands/eval-longmemeval.ts
I6 cli.ts bypass → src/cli.ts
I7 main test file → test/eval-longmemeval.test.ts
I8 sanitize test file → test/longmemeval-sanitize.test.ts
I9 fixture → test/fixtures/longmemeval-mini.jsonl
I10 VERSION + CHANGELOG → VERSION, package.json, CHANGELOG.md
I11 typecheck + tests → bun run typecheck (clean), bun test (green)
I12 bun --compile smoke → deferred to CI (covered by repo workflow)
I13 commit + PR → this PR

Scope Drift

Scope Check: CLEAN. Stated intent (LongMemEval benchmark harness, hermetic CI) matches what shipped exactly. No "while I was in there" expansions. The merge from origin/master brought in v0.26.4 + v0.26.5 (parallel test loop + destructive operation guard) — those are upstream content carried into the branch, not scope drift.

TODOS

No TODO items completed in this PR (LongMemEval was not on the prior TODO list — it's a new initiative).

Documentation

Doc updates landed in acab7941:

README.md — Added an EVAL section to the Commands reference (eval --qrels, export, prune, replay, longmemeval) and a v0.28.1 announce paragraph next to the v0.25.0 BrainBench-Real intro.
CLAUDE.md — Added a Key files entry for src/eval/longmemeval/ + src/commands/eval-longmemeval.ts covering the in-memory PGLite + runtime-enumerated TRUNCATE architecture, INJECTION_PATTERNS re-use, and the runEvalLongMemEval(args, {client}) LLM injection seam. Added a Key commands added in v0.28.1 (LongMemEval in the box) subsection. Inventoried test/eval-longmemeval.test.ts (12 cases) + test/longmemeval-sanitize.test.ts (12 cases) under the unit-test list.
docs/eval-bench.md — Cross-linked from the existing eval replay writeup to LongMemEval as the third evaluation axis. Appended a ## Public benchmarks: LongMemEval (v0.28.1) section with architecture notes, flags table, and the 25.9ms p50 / 30.3ms p99 perf numbers.
CONTRIBUTING.md — Extended the eval replay block with a paragraph pointing contributors at gbrain eval longmemeval for public-benchmark coverage on top of replay.
AGENTS.md — Extended the eval-retrieval bullet with a one-line mention of gbrain eval longmemeval for non-Claude agents.

CHANGELOG entry written by /ship preserved as-is. VERSION bumped to 0.28.1.

Test plan

bun run typecheck clean
bun test test/eval-longmemeval.test.ts test/longmemeval-sanitize.test.ts → 23 pass / 0 fail
Full bun test exit 0
Warm-create speed gate p50<500ms (measured 25.9ms p50 / 30.3ms p99)
Hermeticity: tests run with no DATABASE_URL, no OPENAI_API_KEY, no ANTHROPIC_API_KEY
Manual: gbrain eval longmemeval ~/datasets/longmemeval/oracle.jsonl --limit 50 --retrieval-only > /tmp/hypothesis.jsonl then run their evaluate_qa.py against the JSONL

🤖 Generated with Claude Code


…sions (v32)

Migration v31 adds the takes table (typed/weighted/attributed claims) and synthesis_evidence (provenance for `gbrain think` outputs). Page-scoped via page_id FK (slug isn't unique alone in v0.18+ multi-source). HNSW partial index on embedding for active rows. ON DELETE CASCADE on synthesis_evidence so deleting a source take cascades the provenance row.

Migration v32 adds access_tokens.permissions JSONB with safe-default backfill (`{"takes_holders":["world"]}`). Default keeps non-world holders hidden from MCP-bound tokens until the operator explicitly grants access via the v0.28 auth permissions CLI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…, resolve, synthesis_evidence

Extends BrainEngine with the takes domain object. Both engines implement the same surface; PGLite uses manual `$N` placeholders, Postgres uses postgres-js unnest() — same shape as addLinksBatch and addTimelineEntriesBatch.

Methods:
- addTakesBatch (upsert via ON CONFLICT (page_id, row_num) DO UPDATE)
- listTakes (filter by holder/kind/active/resolved, takesHoldersAllowList for MCP-bound calls, sortBy weight/since_date/created_at)
- searchTakes / searchTakesVector (pg_trgm + cosine; honor allow-list)
- countStaleTakes / listStaleTakes (mirror countStaleChunks pattern; embedding column intentionally omitted from listStale payload)
- updateTake (mutable fields only; throws TAKE_ROW_NOT_FOUND)
- supersedeTake (transactional: insert new at next row_num, mark old active=false, set superseded_by; throws TAKE_RESOLVED_IMMUTABLE on resolved bets)
- resolveTake (sets resolved_*; throws TAKE_ALREADY_RESOLVED on re-resolve; resolution is immutable per Codex P1 #13 fold)
- addSynthesisEvidence (provenance persist; ON CONFLICT DO NOTHING)
- getTakeEmbeddings (parallel to getEmbeddingsByChunkIds)

Types live in src/core/engine.ts adjacent to LinkBatchInput. Page-scoped via page_id (slug not unique in v0.18+ multi-source). PageType gains 'synthesis'. takeRowToTake mapper in utils.ts handles Date → ISO string normalization.

Tests: test/takes-engine.test.ts — 16 cases against PGLite covering upsert/list/filter/search happy paths, takesHoldersAllowList isolation, the four invariant errors (TAKE_ROW_NOT_FOUND, TAKES_WEIGHT_CLAMPED, TAKE_RESOLVED_IMMUTABLE, TAKE_ALREADY_RESOLVED), supersede flow, resolve metadata round-trip, FK CASCADE on synthesis_evidence when source take deletes. All pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…as resolution

Replaces every hardcoded `claude-*-X` and per-phase `dream.<phase>.model` config key with a single resolver.

Hierarchy:
1. CLI flag (--model)
2. New-key config (e.g. models.dream.synthesize)
3. Old-key config (deprecated dream.synthesize.model, dream.patterns.model) — read with stderr deprecation warning, one-per-process
4. Global default (models.default)
5. Env var (GBRAIN_MODEL or caller-supplied)
6. Hardcoded fallback

Aliases (`opus`, `sonnet`, `haiku`, `gemini`, `gpt`) resolve at the end so any tier can use a short name. User-defined `models.aliases.<name>` config overrides built-ins. Cycle-safe (depth 2 break). Unknown alias passes through unchanged so users can pass full provider IDs without registering.

When new-key + old-key are BOTH set (Codex P1 #11 fix), new-key wins and stderr warns "deprecated config X ignored; Y is set and wins". When only old-key is set, it's honored with a softer "rename to Y before v0.30" warning. Both warnings emit once per (key, process) — a Set memo prevents log spam in long-running daemons.

Migrated call sites: synthesize.ts (model + verdictModel), patterns.ts (model). subagent.ts and search/expansion.ts to be migrated later in v0.28 (staying compatible until then).

Tests: test/model-config.test.ts — 11 cases pinning the 6-tier ordering, alias resolution + cycle break, deprecated-key warning emit-once, and unknown-alias pass-through. All pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
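
The six-tier ordering and end-of-chain alias resolution described in this commit can be sketched roughly as below. Everything here is illustrative: the `ModelSources` shape, the placeholder alias targets, and the reading of "depth 2 break" as "follow at most two alias hops" are assumptions, not the actual resolver.

```typescript
interface ModelSources {
  cliFlag?: string;       // tier 1: --model
  newKey?: string;        // tier 2: e.g. models.dream.synthesize
  oldKey?: string;        // tier 3: deprecated per-phase key
  globalDefault?: string; // tier 4: models.default
  envVar?: string;        // tier 5: GBRAIN_MODEL
  fallback: string;       // tier 6: hardcoded last resort
}

// Placeholder IDs for illustration only; the real built-in alias table is not shown here.
const BUILTIN_ALIASES: Record<string, string> = {
  opus: "example-opus-id",
  sonnet: "example-sonnet-id",
};

// Follow at most two alias hops so a cyclic user config cannot loop forever.
function resolveAlias(name: string, userAliases: Record<string, string>): string {
  const table: Record<string, string | undefined> = { ...BUILTIN_ALIASES, ...userAliases };
  let current = name;
  for (let depth = 0; depth < 2; depth++) {
    const next = table[current];
    if (next === undefined) break; // unknown names pass through unchanged
    current = next;
  }
  return current;
}

// Pick the first tier that is set, then resolve aliases once at the end.
function resolveModel(src: ModelSources, userAliases: Record<string, string> = {}): string {
  const picked =
    src.cliFlag ?? src.newKey ?? src.oldKey ?? src.globalDefault ?? src.envVar ?? src.fallback;
  return resolveAlias(picked, userAliases);
}
```

Resolving aliases after tier selection is what lets any tier, including the environment variable, use a short name.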

…P0 fix)

src/core/takes-fence.ts — pure functions for the fenced markdown surface:
- parseTakesFence(body) — extracts ParsedTake[] from the fenced takes blocks. Strict on canonical form, lenient on hand-edits with warnings (TAKES_FENCE_UNBALANCED, TAKES_TABLE_MALFORMED, TAKES_ROW_NUM_COLLISION). Strikethrough `~~claim~~` → active=false; date ranges `since → until` split into sinceDate/untilDate.
- renderTakesFence(takes) — round-trip safe with parseTakesFence.
- upsertTakeRow(body, row) — append-only per CEO-D6 + eng-D9. Creates a fresh `## Takes` section if no fence present. row_num is monotonic (max + 1, never gap-filled — keeps cross-page refs and synthesis_evidence stable forever).
- supersedeRow(body, oldRow, replacement) — strikes through old row's claim AND appends the new row at end. Both rows preserved in markdown for git-blame archaeology.
- stripTakesFence(body) — removes the fenced block entirely. Used by the chunker so takes content lives ONLY in the takes table.

Codex P0 #3 fix: src/core/chunkers/recursive.ts now calls stripTakesFence() before computing chunk boundaries. Without this, page chunks would contain the rendered takes table and the per-token MCP allow-list would be bypassed at the index layer (token bound to takes_holders=['world'] would see garry's hunches via page hits). Doctor's takes_fence_chunk_leak check (plan-side) asserts no chunk contains the begin marker.

Tests: 15 cases covering canonical parse, strikethrough, date range, fence unbalanced detection, malformed-row skip + warning, row_num collision detection, round-trip render, append-only upsert into existing fence, fresh-section creation, monotonic row_num under hand-edit gaps, supersede flow, stripTakesFence verifying takes content removed AND surrounding prose preserved. Existing chunker tests still pass (15 + 15 = 30).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
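
The append-only numbering and strikethrough rules above can be illustrated on an in-memory row list. This is a sketch only: `TakeRow`, `nextRowNum`, `parseClaimCell`, and `supersede` are hypothetical names operating on plain objects, whereas the real library works on the markdown body itself.

```typescript
interface TakeRow {
  rowNum: number;
  claim: string;
  active: boolean;
}

// Next row number is max + 1; gaps left by hand edits are never reused.
function nextRowNum(rows: TakeRow[]): number {
  return rows.reduce((max, r) => Math.max(max, r.rowNum), 0) + 1;
}

// Parse a claim cell: ~~text~~ marks a superseded (inactive) claim.
function parseClaimCell(cell: string): { claim: string; active: boolean } {
  const trimmed = cell.trim();
  const m = /^~~(.*)~~$/.exec(trimmed);
  return m ? { claim: m[1], active: false } : { claim: trimmed, active: true };
}

// Supersede: mark the old row inactive and append the replacement at the end.
function supersede(rows: TakeRow[], oldRowNum: number, newClaim: string): TakeRow[] {
  if (!rows.some((r) => r.rowNum === oldRowNum)) throw new Error("TAKE_ROW_NOT_FOUND");
  const updated = rows.map((r) => (r.rowNum === oldRowNum ? { ...r, active: false } : r));
  return [...updated, { rowNum: nextRowNum(rows), claim: newClaim, active: true }];
}
```

Never gap-filling means a reference to row 3 on a page always points at the same claim, even after rows are deleted by hand.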

…fy-write

src/core/page-lock.ts — per-page file lock at ~/.gbrain/page-locks/<sha256-of-slug>.lock so two concurrent `gbrain takes add` calls or `takes seed --refresh` from autopilot can't race on the same `<slug>.md` read-modify-write.

Eng-review fold: reuses the v0.17 cycle.lock pattern (mtime + PID liveness) but per-slug. Differences from cycle.ts's lock:
- SHA-256 of slug for safe filenames (slashes, unicode, etc.)
- Same-pid + fresh mtime = LIVE (cycle.ts assumes one lock per process and reclaims same-pid; page-lock allows concurrent locks for DIFFERENT slugs in one process). mtime expiry still rescues post-crash leftovers.
- 5-min TTL (vs cycle's 30 min — page edits are short)
- `withPageLock(slug, fn)` convenience wrapper with default 30s timeout

API:
- acquirePageLock(slug, opts) → handle | null (poll-with-timeout)
- handle.refresh() / handle.release() (idempotent — only releases if pid matches)
- withPageLock(slug, fn, opts) — acquire + run + release-in-finally

Tests: 10 cases — fresh acquire, live holder returns null, stale-mtime reclaim, dead-PID reclaim, refresh updates timestamp, foreign-pid release is no-op, withPageLock callback runs and releases on success/failure, timeout-throws when held, SHA-256 filename safety for slashes/unicode. All pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
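
The mtime + PID liveness rule can be written as a single pure decision, sketched below. The `LockState` shape and the injected `isPidAlive` callback are assumptions made so the logic can be shown without touching the filesystem; the real module reads these values from the lock file.

```typescript
interface LockState {
  pid: number;     // process that wrote the lock
  mtimeMs: number; // last refresh time of the lock file
}

const PAGE_LOCK_TTL_MS = 5 * 60 * 1000; // 5-minute TTL from the commit message

// A lock may be reclaimed when it has outlived the TTL or its holder process is gone.
// A fresh lock is live even when held by the current pid, which is what allows
// one process to hold locks for different slugs at the same time.
function canReclaim(
  lock: LockState,
  nowMs: number,
  isPidAlive: (pid: number) => boolean,
): boolean {
  const expired = nowMs - lock.mtimeMs > PAGE_LOCK_TTL_MS;
  return expired || !isPidAlive(lock.pid);
}
```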

src/core/cycle/extract-takes.ts — new phase that materializes the takes table from fenced markdown blocks. Two paths mirror src/commands/extract.ts:
- extractTakesFromFs: walk *.md under repoPath, parse fences, batch upsert
- extractTakesFromDb: iterate engine.getAllSlugs(), parse each page's compiled_truth+timeline, batch upsert (mutation-immune snapshot iteration)

Single dispatcher extractTakes(opts) routes by source. Honors:
- slugs filter for incremental re-extract (pipes from sync→extract)
- dryRun: count would-be upserts, write nothing
- rebuild: DELETE FROM takes WHERE page_id = $1 before re-insert (clean slate when markdown is canonical and DB has drifted)

Schema fix: since_date/until_date were DATE in the original v31 migration. Spec uses partial dates ('2017-01', '2026-04-29 → 2026-06') that Postgres DATE rejects. Changed to TEXT in both the Postgres and PGLite blocks so parser-rendered ranges round-trip cleanly. Loses the ability to do date-range arithmetic in SQL, but date math on opinion timelines is out of scope for v0.28 anyway. utils.ts dateOrNull now annotated as v0.28 TEXT-aware.

Migration v31 has not been deployed yet (this branch is the v0.28 release candidate), so the type swap is free. No data migration needed.

Tests: test/extract-takes.test.ts — 5 cases against PGLite covering full walk + fence-skip on no-fence pages, takes-table populated post-extract, incremental slugs filter, dry-run no-write, rebuild=true clears + re-inserts ad-hoc rows. test/takes-engine.test.ts (16), test/takes-fence.test.ts (15) all still pass — 36/36 takes tests green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

src/commands/takes.ts — surfaces the engine methods + takes-fence library through a single `gbrain takes <subcommand>` entrypoint:

  takes <slug>                               list with filters + sort
  takes search "<query>"                     pg_trgm keyword search across all takes
  takes add <slug> --claim ... ...           append (markdown + DB, atomic via lock)
  takes update <slug> --row N ...            mutable-fields update (markdown + DB)
  takes supersede <slug> --row N ...         strikethrough old + append new
  takes resolve <slug> --row N --outcome     record bet resolution (immutable)

Markdown is canonical. Every mutate command:
1. acquires the per-page file lock (withPageLock)
2. re-reads the .md file
3. applies the edit via takes-fence (upsertTakeRow / supersedeRow)
4. writes the .md file back
5. mirrors to the DB via the engine method
6. releases the lock (auto via finally)

Resolve currently writes only to DB — surfacing resolved_* in the markdown table is deferred to v0.29 (the takes-fence renderer's column set is fixed at # | claim | kind | who | weight | since | source per spec).

Wired into src/cli.ts dispatch + CLI_ONLY allowlist. Help text follows the project convention (orphans/embed/extract pattern). --dir flag overrides sync.repo_path config when working outside the configured brain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…llow-list

OperationContext gains takesHoldersAllowList — server-side filter for takes.holder field threaded from access_tokens.permissions through dispatch into the engine SQL. Closes Codex P0 #3 at the dispatch layer (chunker strip already closed the page-content side in the previous commit).

src/core/operations.ts — three new ops:
- takes_list: lists takes with holder/kind/active/resolved filters; honors ctx.takesHoldersAllowList for MCP-bound calls
- takes_search: pg_trgm keyword search; honors allow-list
- think: op surface registered (returns not_implemented envelope until Lane D's pipeline lands). Remote callers cannot save/take per Codex P1 #7.

src/mcp/dispatch.ts — DispatchOpts.takesHoldersAllowList threads into buildOperationContext.

src/mcp/http-transport.ts — validateToken now reads access_tokens.permissions.takes_holders, defaults to ['world'] when the column is absent or malformed (default-deny on private hunches). auth.takesHoldersAllowList passed to dispatchToolCall.

src/mcp/server.ts (stdio) — defaults to takesHoldersAllowList: ['world'] since stdio has no per-token auth. Operators wanting full visibility use `gbrain call <op>` directly (sets remote=false).

src/commands/auth.ts — `gbrain auth create <name> --takes-holders w,g,b` flag persists the per-token list; new `auth permissions <name> set-takes-holders <list>` updates an existing token.

Tests: test/takes-mcp-allowlist.test.ts — 8 cases against PGLite proving the threading: local-CLI sees all holders, ['world'] returns only public, ['world','garry'] returns 2/3, no-overlap returns empty (no fallback), search honors allow-list, remote save/take on think rejected with not_implemented envelope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
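
A rough sketch of the default-deny parsing and the holder filter may help here. The function names and the `Take` shape are hypothetical, and the real filter runs in engine SQL rather than in memory; the sketch only mirrors the behaviour the commit message describes.

```typescript
const DEFAULT_HOLDERS: string[] = ["world"];

// Read takes_holders from a token's permissions JSON, falling back to ["world"]
// whenever the value is missing or is not an array of strings.
function parseTakesHolders(permissions: unknown): string[] {
  if (typeof permissions !== "object" || permissions === null) return [...DEFAULT_HOLDERS];
  const holders = (permissions as Record<string, unknown>)["takes_holders"];
  if (!Array.isArray(holders)) return [...DEFAULT_HOLDERS];
  if (!holders.every((h) => typeof h === "string")) return [...DEFAULT_HOLDERS];
  return holders as string[];
}

interface Take {
  holder: string;
  claim: string;
}

// An undefined allow-list models a local CLI call, which sees every holder.
// A list with no overlap yields an empty result rather than a fallback.
function filterTakes(takes: Take[], allowList?: string[]): Take[] {
  if (allowList === undefined) return takes;
  return takes.filter((t) => allowList.includes(t.holder));
}
```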

Closes the v0.28 ship-prep cycle. Bumps VERSION + package.json + bun.lock to 0.28.0.

v0_28_0 migration orchestrator runs three idempotent phases on upgrade:
- Schema verify: asserts schema_version >= 32 (migrations v31 + v32 already applied by the schema runner during gbrain upgrade); fails clean if not.
- Backfill takes: inline runs `extractTakes(engine, { source: 'db' })` so any pre-existing fenced takes tables in markdown populate the takes index. Idempotent; ON CONFLICT DO UPDATE keeps the table in sync.
- Re-chunk TODO: queues a pending-host-work entry asking the host agent to re-import pages with takes content so the v0.28 chunker-strip rule (Codex P0 #3 fix) applies retroactively. Pages imported under v0.28+ already have takes content stripped from chunks at index time; this TODO catches up legacy pages.

skills/migrations/v0.28.0.md — agent-readable upgrade guide. Walks through doctor verification, deprecated-key migration, MCP token visibility configuration, and a "try the takes layer" smoke test.

CHANGELOG.md — v0.28.0 release-summary in the GStack voice (no AI vocabulary, no em dashes, real numbers from git diff stat) + the mandatory "To take advantage of v0.28.0" block + itemized changes by subsystem (schema, engine, markdown surface, model config, MCP+auth, CLI, tests, accepted risks).

Final test sweep: 65/65 v0.28 tests pass across 6 files. typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

src/core/think/sanitize.ts — prompt-injection defense for take claims: 14 jailbreak patterns (ignore-prior, role-jailbreak, close-take tag, DAN, system-prompt overrides, eval-shell hooks) plus structural framing (takes wrapped in <take id="..."> tags the model is told to treat as DATA). Length-cap at 500 chars. Renders evidence blocks for the prompt.

src/core/think/prompt.ts — system prompt + structured-output schema. Hard rules: cite every claim, mark hunches/low-weight explicitly, surface conflicts (never silently pick), surface gaps. JSON schema with answer + citations[] + gaps[]. Prompt adapts to anchor / time window / save flag.

src/core/think/cite-render.ts — structured citations + regex fallback (Codex P1 #4 fold). normalizeStructuredCitations validates the model's structured output; parseInlineCitations is the body-scan fallback when the model omits the structured field. resolveCitations dispatches and records CITATIONS_REGEX_FALLBACK warning when used.

src/core/think/gather.ts — 4-stream parallel retrieval:
1. hybridSearch (pages, existing primitive)
2. searchTakes (keyword, pg_trgm)
3. searchTakesVector (vector, when embedQuestion fn supplied)
4. traversePaths (graph, when --anchor set)

RRF fusion (k=60). Each stream wrapped in try/catch — partial gather beats no synthesis. Honors takesHoldersAllowList for MCP-bound calls.

src/core/think/index.ts — runThink orchestrator + persistSynthesis: INTENT (regex classify) → GATHER → render evidence blocks → resolveModel ('models.think' → 'models.default' → GBRAIN_MODEL → opus) → LLM call (injectable client) → JSON parse with code-fence + fallback strip → resolveCitations → ThinkResult. persistSynthesis writes a synthesis page + synthesis_evidence rows (page_id resolved per slug; page-level citations skip evidence). Degrades gracefully without ANTHROPIC_API_KEY. Round-loop scaffolding in place (rounds=1 only path exercised in v0.28).

src/commands/think.ts — `gbrain think "<question>"` CLI. Flag parsing strips --anchor, --rounds, --save, --take, --model, --since, --until, --json. Local CLI = remote=false, so save/take honored. Human-readable output by default; --json for agent consumption.

operations.ts — `think` op now calls runThink (was a not_implemented stub). Remote callers can't save/take per Codex P1 #7. Returns full ThinkResult plus saved_slug + evidence_inserted.

cli.ts — wired into dispatch + CLI_ONLY allowlist.

Tests: test/think-pipeline.test.ts — 18 cases against PGLite covering sanitize patterns, structural rendering, citation parsing (structured + regex fallback + dedup + invalid-slug rejection), gather streams + allow-list filter, full pipeline with stub client, malformed-LLM fallback path, no-API-key graceful degradation, persistSynthesis writes page + evidence rows. All pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
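
For reference, Reciprocal Rank Fusion with k=60 scores each item as the sum of 1/(k + rank) over the streams that returned it. The sketch below shows that formula on plain ID lists; the function name and input shape are illustrative and not taken from gather.ts.

```typescript
// Reciprocal Rank Fusion: each stream contributes 1 / (k + rank) per item,
// with rank starting at 1 for the stream's top hit.
function rrfFuse(streams: string[][], k = 60): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const stream of streams) {
    stream.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// A stream that failed can simply be passed as an empty list,
// which matches the "partial gather beats no synthesis" stance.
const fused = rrfFuse([["a", "b"], ["b", "c"], []]);
```

Because only ranks are used, streams with incomparable raw scores (trigram similarity, cosine distance, graph hops) can be fused without normalisation.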

…old)

src/core/anthropic-pricing.ts — USD/1M-tokens map for Claude 4.7 family plus older aliases. estimateMaxCostUsd returns null on unpriced models so the meter caller can warn-once and bypass the gate.

src/core/cycle/budget-meter.ts — cumulative cost ledger. Each submit estimates max-cost from (model + estimatedInputTokens + maxOutputTokens), accumulates per-cycle, refuses next submit when projected > cap. Codex P1 #10 fold: non-Anthropic models (gemini, gpt) bypass with one stderr warn per process and `unpriced=true` on the result. Budget=0 disables the gate. Audit trail at ~/.gbrain/audit/dream-budget-YYYY-Www.jsonl.

src/core/cycle/auto-think.ts — auto_think dream phase. Reads dream.auto_think.{enabled,questions,max_per_cycle,budget,cooldown_days, auto_commit}. Iterates configured questions through runThink with the BudgetMeter pre-checking each submit. Cooldown timestamp written ONLY on success (matches v0.23 synthesize pattern — retries after partial failures pick back up). When auto_commit=true, persists synthesis pages via persistSynthesis. Default-disabled.

src/core/cycle/drift.ts — drift dream phase scaffold. Reads dream.drift.{enabled,lookback_days,budget,auto_update}. Surfaces takes in the soft band (weight 0.3-0.85, unresolved) that have recent timeline evidence on the same page. v0.28 ships the orchestration; the LLM judge that proposes weight adjustments lands in v0.29. modelId + meter wired now so the ledger captures gate state for callers that opt in.

Tests:
- test/budget-meter.test.ts (7 cases) — pricing-map coverage, allow path, cumulative-deny, budget=0 disabled, unpriced bypass+warn-once, ledger captures all events, ISO-week filename branch.
- test/auto-think-phase.test.ts (9 cases) — auto_think enable/skip, questions empty, success → cooldown ts written, cooldown blocks rerun, budget exhausted → partial. drift not_enabled, soft-band candidate detection, complete + dry-run paths.

All pass. Typecheck clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
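
The gating rules of the budget meter (cumulative spend, refuse when the projected total exceeds the cap, budget 0 disables, unpriced models bypass) can be sketched as a small class. `BudgetMeterSketch` and `SubmitResult` are hypothetical names; the sketch omits the audit ledger and the warn-once behaviour.

```typescript
interface SubmitResult {
  allowed: boolean;
  unpriced: boolean;
  spentUsd: number;
}

class BudgetMeterSketch {
  private spentUsd = 0;
  private readonly capUsd: number;

  constructor(capUsd: number) {
    this.capUsd = capUsd;
  }

  // estimateUsd is the max-cost estimate for one call, or null when the model has no price entry.
  submit(estimateUsd: number | null): SubmitResult {
    if (estimateUsd === null) {
      // Unpriced models bypass the gate and are flagged on the result.
      return { allowed: true, unpriced: true, spentUsd: this.spentUsd };
    }
    if (this.capUsd !== 0 && this.spentUsd + estimateUsd > this.capUsd) {
      // Projected total exceeds the cap: refuse without recording any spend.
      return { allowed: false, unpriced: false, spentUsd: this.spentUsd };
    }
    this.spentUsd += estimateUsd;
    return { allowed: true, unpriced: false, spentUsd: this.spentUsd };
  }
}
```

Checking the projection before recording means a refused submit leaves room for a later, cheaper call in the same cycle.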

test/e2e/takes-postgres.test.ts — full v0.28 takes pipeline against real Postgres (gated on DATABASE_URL). 12 cases:
- addTakesBatch upsert via unnest() bind path (Postgres-specific)
- listTakes filters: holder, kind, sort=weight, takesHoldersAllowList
- searchTakes pg_trgm + allow-list filter
- supersedeTake transactional path (BEGIN/COMMIT semantics)
- resolveTake immutability — second resolve throws TAKE_ALREADY_RESOLVED
- synthesis_evidence FK CASCADE on take delete
- countStaleTakes + listStaleTakes filter active+null
- extractTakesFromDb populates takes from fenced markdown
- MCP dispatch with takesHoldersAllowList=['world'] returns only world
- MCP dispatch local-CLI path returns all holders
- MCP dispatch takes_search honors allow-list
- think op forces remote_persisted_blocked even for save+take

postgres-engine.ts: addTakesBatch boolean[] serialization fix. postgres-js auto-detects element type from JS arrays; for booleans it mis-detects as scalar. Cast through text[] (`'true' | 'false'`) then SQL-cast to boolean[] — same pattern other batch methods rely on for type-stable bind shapes.

test/e2e/helpers.ts: setupDB now (a) tolerates non-existent tables in TRUNCATE (for fresh DBs where v31 hasn't yet created takes/synthesis_evidence) and (b) calls engine.initSchema() to actually run migrations.

test/takes-mcp-allowlist.test.ts: updated 2 think-op cases to match Lane D's landed pipeline. They previously asserted not_implemented envelopes; now they assert remote_persisted_blocked + NO_ANTHROPIC_API_KEY graceful-degrade behavior.

Run: DATABASE_URL=postgres://localhost:5435/gbrain_test bun test test/e2e/takes-postgres.test.ts
Result: 12/12 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ePhase enum extension)

cycle.ts's PhaseResult is shaped {phase, status, summary, details} with a narrow PhaseStatus enum ('ok'|'warn'|'fail'|'skipped') and CyclePhase enum that doesn't yet include 'auto_think'/'drift'. The phases ship standalone in v0.28 (cycle.ts dispatcher integration is v0.28.x); using PhaseResult forced premature enum extension.

Introduces DreamPhaseResult exported from auto-think.ts:

  {
    name: 'auto_think'|'drift';
    status: 'complete'|'partial'|'failed'|'skipped';
    detail: string;
    totals?: Record<string,number>;
    duration_ms: number
  }

drift.ts re-exports the same type. When v0.28.x wires the dispatcher, the adapter at the call site can map DreamPhaseResult → PhaseResult cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
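
One possible shape for the call-site adapter mentioned above is sketched here. The status mapping (complete→ok, partial→warn, failed→fail) and the field types on `PhaseResult` are assumptions; the commit message does not specify them, and the real `phase` field is a CyclePhase enum rather than a plain string.

```typescript
type PhaseStatus = "ok" | "warn" | "fail" | "skipped";

// Simplified stand-in for cycle.ts's PhaseResult; field types are assumed.
interface PhaseResult {
  phase: string;
  status: PhaseStatus;
  summary: string;
  details: Record<string, unknown>;
}

interface DreamPhaseResult {
  name: "auto_think" | "drift";
  status: "complete" | "partial" | "failed" | "skipped";
  detail: string;
  totals?: Record<string, number>;
  duration_ms: number;
}

// Assumed status mapping, chosen for illustration.
const STATUS_MAP: Record<DreamPhaseResult["status"], PhaseStatus> = {
  complete: "ok",
  partial: "warn",
  failed: "fail",
  skipped: "skipped",
};

function toPhaseResult(r: DreamPhaseResult): PhaseResult {
  return {
    phase: r.name,
    status: STATUS_MAP[r.status],
    summary: r.detail,
    details: { totals: r.totals ?? {}, duration_ms: r.duration_ms },
  };
}
```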

test/e2e/auth-permissions.test.ts — closes the v0.28 token-allow-list verification loop against real Postgres. Exercises:
- Migration v32 default backfill: new tokens created without a permissions column get {takes_holders: ["world"]} via the schema DEFAULT clause.
- Explicit ["world","garry"] → dispatch.takes_list filters to those holders only; brain hunches stay hidden from this token.
- ["world"] default-deny token → takes_search hits filtered to public claims.
- {} permissions row (operator tampered) gracefully defaults to ["world"] via the HTTP transport's validateToken parsing.
- revoked_at IS NOT NULL → token excluded from active token query.

Avoids the postgres-js JSONB double-encode trap (CLAUDE.md memory): pass the object directly to executeRaw, no JSON.stringify, no ::jsonb cast.

All 5 pass against pgvector/pgvector:pg16 on port 5435.

Combined v0.28 test sweep: 116/116 across 11 files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…tion)

test/e2e/chunker-takes-strip.test.ts — verifies the chunker actually strips fenced takes content end-to-end through the import pipeline. This is the Codex P0 #3 fix's verification path: takes content lives ONLY in the takes table for retrieval, never duplicated in content_chunks where the per-token MCP allow-list cannot reach.

5 cases:
- chunkText (unit) output never contains TAKES_FENCE_BEGIN/END markers
- chunkText output never contains fenced claim text
- chunkText output retains non-fence prose (no over-stripping)
- importFromContent end-to-end: imported page has chunks but none contain fenced content
- takes_fence_chunk_leak doctor invariant: zero rows globally where chunk_text matches `<!--- gbrain:takes:%`

Final v0.28 test sweep: 121 pass, 0 fail, 336 expect() calls, 12 files

Coverage: schema migrations, engine methods (PGLite + Postgres), takes-fence parser, page-lock, extract phase, takes CLI engine surface, model config 6-tier resolver, MCP+auth allow-list, think pipeline (gather + sanitize + cite-render + synthesize), auto-think + drift + budget meter, JSONB end-to-end, chunker strip integration. ~95% of v0.28 surface area covered.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Master shipped v0.25.0 with the eval-capture system (eval_candidates + eval_capture_failures tables, GBRAIN_CONTRIBUTOR_MODE=1 capture path, gbrain eval export/replay/prune CLI, +144 tests across 9 new files). Master's migration claimed v31 first.

Conflict resolution:
- VERSION + package.json → 0.28.0 (mine; > master's 0.25.0)
- CHANGELOG.md → my v0.28.0 entry on top, master's v0.25.0 below
- src/core/migrate.ts → renumber my migrations from v31/v32 to v32/v33 to sit above master's v31 (eval_capture_tables). Runtime sort by version means source-order doesn't matter; the chain becomes ..., v30 (dream_verdicts), v31 (eval_capture_tables, master), v32 (takes_and_synthesis_evidence, mine), v33 (access_tokens_permissions, mine).
- skills/migrations/v0.28.0.md + src/commands/migrations/v0_28_0.ts: schema-version assertion bumped to >= 33; doc refs updated to v32/v33.
- All other files (engine.ts, types.ts, operations.ts, postgres-engine.ts, pglite-engine.ts, schema-embedded.ts, etc.) auto-merged cleanly — both branches added new types/methods/columns without textual collision.

Verification:
- bun run typecheck: clean
- v0.28 e2e suite: 121/121 pass against fresh Postgres
- v0.25 eval suite: 198/198 pass on the merged tree
- Combined: 319 tests, 0 regressions

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two CI failures from PR #563:

test/apply-migrations.test.ts (2 fails) — `buildPlan` tests assert exact skippedFuture arrays at fixed installed-version stamps. Adding v0.28.0 to the migration registry means it shows up in skippedFuture when the test runs at installed=0.11.1 / installed=0.12.0. Append '0.28.0' to both hardcoded arrays.

test/http-transport.test.ts (8 fails) — the FakeEngine mock string-prefix matches `SELECT id, name FROM access_tokens` to return a row. v0.28's validateToken now selects `SELECT id, name, permissions FROM access_tokens` to read the per-token takes_holders allow-list. Mock returned [] on the new query → validateToken treated every token as invalid → 401.

Fix: mock now matches both query shapes. validTokens row gets a default `{takes_holders: ['world']}` permission injected when caller didn't supply one (mirrors the migration v33 column DEFAULT). Updated FakeEngineConfig type to allow tests to pass explicit permissions.

Verification:
  bun test test/apply-migrations.test.ts → 18/18 pass
  bun test test/http-transport.test.ts → 24/24 pass
  bun run typecheck → clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…tions to v34/v35
Master shipped 5 commits since the last sync:
v0.26.3: admin dashboard hardening (magic-link, per-client TTL, params/error_message)
v0.26.2: oauth bun execSync env + BIGINT-as-string fix
v0.26.1: oauth client_credentials bearer auth fix
v0.26.0: MCP Keys OAuth 2.1 + HTTP server + admin dashboard
v0.25.1: book-mirror flagship + 8 research skills + skillpack uninstall
Master claimed v32 (oauth_infrastructure) and v33 (admin_dashboard_columns_v0_26_3) schema migrations. My v0.28 migrations were already at v32/v33 from the prior v0.25 merge. Renumbering both forward to v34/v35:
v0.28 originally targeted v31/v32
master v0.25 claimed v31 (eval_capture_tables) → my migrations v32/v33
master v0.26 claimed v32/v33 (oauth + admin) → my migrations v34/v35
Conflict resolutions:
- VERSION + package.json → 0.28.0 (mine; > master's 0.26.3). package.json kept master's new scripts (admin build, no-legacy-getconnection check)
- src/cli.ts → kept both branches' new CLI commands ('mounts' + 'book-mirror' from master, 'takes' + 'think' from mine)
- src/commands/auth.ts → preserved both 'permissions' (mine) and 'register-client'/'revoke-client' (master OAuth) cases. Help text merged.
- src/core/operations.ts → kept both takesHoldersAllowList (mine) and brainId (master mounts) on OperationContext.
- src/core/migrate.ts → renumbered + comment block updated.
- v0_28_0 orchestrator schema-version assertion bumped >= 35.
- skills/migrations/v0.28.0.md + CHANGELOG.md doc refs updated to v34/v35.
- All other files (engine.ts, types.ts, postgres-engine.ts, pglite-engine.ts, utils.ts, schema-embedded.ts, etc.) auto-merged cleanly.
Verification:
- bun run typecheck → clean
- v0.28 test sweep (PGLite + Postgres) → 163/163 pass
- master eval + migrate tests → 203/203 pass
- Combined: 366 tests, 0 regressions
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…hink)
test/oauth.test.ts enforces an invariant from master's v0.26 OAuth landing: every Operation must have `scope: 'read' | 'write' | 'admin'`, and any op flagged `mutating: true` must be 'write' or 'admin'. My v0.28 ops were added before master shipped v0.26 + the new invariant; the merge surfaced the gap.
Annotations:
- takes_list → read
- takes_search → read
- think → write (mutating: true; --save persists synthesis page)
Verification:
bun test test/oauth.test.ts → 42/42 pass
bun run typecheck → clean
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
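
The invariant above can be sketched as a small checker. This is an illustration only: `OperationMeta` and `scopeViolations` are hypothetical names, and the real assertion lives in test/oauth.test.ts against the actual Operation type.

```typescript
type Scope = "read" | "write" | "admin";

// Hypothetical subset of the Operation metadata relevant to the invariant.
interface OperationMeta {
  name: string;
  scope?: Scope;
  mutating?: boolean;
}

// Returns one message per violation; an empty array means the invariant holds.
function scopeViolations(ops: OperationMeta[]): string[] {
  const problems: string[] = [];
  for (const op of ops) {
    if (op.scope === undefined) {
      problems.push(`${op.name}: missing scope`);
      continue;
    }
    // A mutating op must require at least 'write'.
    if (op.mutating === true && op.scope === "read") {
      problems.push(`${op.name}: mutating op cannot be scoped 'read'`);
    }
  }
  return problems;
}

// The three annotations from this commit.
const annotated: OperationMeta[] = [
  { name: "takes_list", scope: "read" },
  { name: "takes_search", scope: "read" },
  { name: "think", scope: "write", mutating: true },
];
console.log(scopeViolations(annotated));
```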

The same pattern set protects takes from prompt-injection (think/sanitize.ts) and now retrieved chat content in the LongMemEval harness. One source of truth for both surfaces; adding a new pattern in this file automatically covers benchmarks too. Existing consumers (sanitizeTakeForPrompt, renderTakesBlock) keep working unchanged. Verified via test/think-pipeline.test.ts (18 pass, 0 fail).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
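
A minimal sketch of the "one source of truth" arrangement. The two regexes are placeholders invented for the example (the real INJECTION_PATTERNS list is not shown in this PR text), and `stripInjection`, `sanitizeTake`, and `sanitizeChatTurn` are hypothetical names rather than the functions in think/sanitize.ts.

```typescript
// Placeholder patterns for illustration only.
const INJECTION_PATTERNS: RegExp[] = [
  /ignore (all )?previous instructions/gi,
  /<\/?system>/gi,
];

// Single shared stripping routine; both surfaces call this, so a pattern
// added to the list above covers takes and benchmark chat content alike.
function stripInjection(text: string): string {
  return INJECTION_PATTERNS.reduce((acc, re) => acc.replace(re, ""), text);
}

// Hypothetical consumer on the takes side.
function sanitizeTake(text: string): string {
  return stripInjection(text).trim();
}

// Hypothetical consumer on the benchmark side, with a length cap applied
// after stripping (the 4000 default is an assumption of this sketch).
function sanitizeChatTurn(text: string, maxChars = 4000): string {
  return stripInjection(text).trim().slice(0, maxChars);
}
```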

…Lite
One in-memory PGLiteEngine per benchmark run; TRUNCATE between questions with runtime-enumerated tables via pg_tables so future schema migrations don't silently leak across questions. Infrastructure tables (sources, config, gbrain_cycle_locks, subagent_rate_leases) preserved across resets so initSchema-seeded rows like sources.'default' survive (FK target for pages.source_id).
Files:
- src/eval/longmemeval/harness.ts: createBenchmarkBrain + resetTables + withBenchmarkBrain. ~50 lines, no class wrapper.
- src/eval/longmemeval/adapter.ts: pure haystackToPages() converter. Slug prefix `chat/` (verified non-matching against DEFAULT_SOURCE_BOOSTS).
- src/eval/longmemeval/sanitize.ts: re-uses INJECTION_PATTERNS from think/sanitize.ts; wraps each session in <chat_session id date> tags; 4000-char cap.
- test/longmemeval-sanitize.test.ts: 12 cases pinning the F8 contract.
Hermetic: no DATABASE_URL, no API keys.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
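
The reset-in-place idea above, sketched as a pure SQL builder. The preserved table names come from the commit message; `buildTruncateSql`, the `public` schema filter, and the `RESTART IDENTITY CASCADE` clause are assumptions of this sketch, not necessarily what resetTables in harness.ts does.

```typescript
// Tables the commit message says must survive a reset.
const PRESERVED_TABLES = new Set([
  "sources",
  "config",
  "gbrain_cycle_locks",
  "subagent_rate_leases",
]);

// pg_tables is a standard Postgres catalog view; restricting to the public
// schema is an assumption made for this example.
const LIST_TABLES_SQL =
  "SELECT tablename FROM pg_tables WHERE schemaname = 'public'";

// Builds one TRUNCATE statement from whatever tables exist at runtime, so a
// table added by a later migration is cleared without editing this code.
function buildTruncateSql(tables: string[]): string | null {
  const targets = tables
    .filter((t) => !PRESERVED_TABLES.has(t))
    .map((t) => `"${t.replace(/"/g, '""')}"`); // quote identifiers defensively
  if (targets.length === 0) return null;
  return `TRUNCATE TABLE ${targets.join(", ")} RESTART IDENTITY CASCADE`;
}

console.log(LIST_TABLES_SQL);
console.log(buildTruncateSql(["pages", "sources", "takes"]));
```

A harness would run the listing query, pass the resulting names to the builder, and execute the statement between questions.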

Run the LongMemEval public benchmark against gbrain's hybrid retrieval. Dataset is a positional path (download from xiaowu0162/longmemeval on HF). Per-question loop wraps everything in try/catch; one bad question doesn't kill the run, and an error JSONL line is emitted instead.
Wiring:
- src/cli.ts: pre-dispatch bypass for `eval longmemeval` so the user's ~/.gbrain brain is never opened. Hermeticity gate verified: --help works on machines with no gbrain config.
- src/commands/eval-longmemeval.ts: arg parsing, JSONL emit (LF + UTF-8 pinned), hybridSearch with optional expandQuery from search/expansion.ts, resolveModel from model-config.ts (6-tier chain), ThinkLLMClient injection seam from think/index.ts, structural <chat_session> framing.
- test/eval-longmemeval.test.ts: 12 cases covering harness lifecycle, reset clears all tables, schema-migration robustness, p50/p99 speed gate (warm reset+import+search target <500ms), adapter shape, source-boost regression guard, end-to-end with stubbed LLM, JSONL format guard, per-question failure handling.
- test/fixtures/longmemeval-mini.jsonl: 5 hand-authored questions with keyword-friendly overlap so --keyword-only works in CI.
Speed: warm reset+import 5 pages+search p50=25.9ms p99=30.3ms locally.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
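
The per-question try/catch envelope described above might look roughly like this. It is a sketch under assumptions: the `question_id` / `hypothesis` keys are assumed to be the output keys the evaluator reads, the `error` field and the empty hypothesis on failure are invented for the example, and `runBenchmark` / `Answerer` are hypothetical names.

```typescript
interface Question {
  question_id: string;
  question: string;
}

// Stand-in for "reset, import haystack, retrieve, ask the model".
type Answerer = (q: Question) => Promise<string>;

// Runs questions sequentially and returns one JSONL line per question.
// A failing question produces an error line instead of aborting the run.
async function runBenchmark(
  questions: Question[],
  answer: Answerer,
): Promise<string[]> {
  const lines: string[] = [];
  for (const q of questions) {
    try {
      const hypothesis = await answer(q);
      lines.push(JSON.stringify({ question_id: q.question_id, hypothesis }));
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      // Assumed error-line shape: same keys plus an error field.
      lines.push(
        JSON.stringify({ question_id: q.question_id, hypothesis: "", error: message }),
      );
    }
  }
  return lines;
}
```

Joining the returned lines with "\n" gives a file with exactly one JSON object per question, failures included, so line counts always match the dataset.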

VERSION + package.json synchronized at 0.28.1. CHANGELOG entry uses the release-summary voice + "To take advantage of v0.28.1" block per CLAUDE.md. Sequential release on garrytan/v0.28-release; lands after v0.28.0.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…l-bench
# Conflicts:
# CHANGELOG.md
# VERSION
# package.json

- README.md: add EVAL section to Commands reference (eval --qrels, export, prune, replay, longmemeval); add v0.28.1 announce paragraph next to the v0.25.0 BrainBench-Real intro.
- CLAUDE.md: add Key files entry for src/eval/longmemeval/ + src/commands/eval-longmemeval.ts; add "Key commands added in v0.28.1" subsection (mirrors the v0.26.5 / v0.25.0 pattern); inventory test/eval-longmemeval.test.ts + test/longmemeval-sanitize.test.ts under the unit-test list.
- docs/eval-bench.md: cross-link from the "What it actually does" section to LongMemEval as the third evaluation axis (public benchmark, ground-truth labels, full QA pipeline); append "Public benchmarks: LongMemEval (v0.28.1)" section with architecture, flags table, and perf numbers.
- CONTRIBUTING.md: append a paragraph after the eval-replay block pointing contributors at gbrain eval longmemeval for public-benchmark coverage.
- AGENTS.md: extend the existing eval-retrieval bullet with a one-line mention of gbrain eval longmemeval.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…35/v36
Master shipped three releases while v0.28 was in flight:
- v0.26.4: parallel test runner (run-unit-parallel.sh)
- v0.26.5: destructive-operation guard (deleted_at, soft-delete); claimed v34
- v0.26.6: PGLite ↔ Postgres schema-drift parity gate
Master's new v34 (destructive_guard_columns) collides with my v34 (takes_and_synthesis_evidence). Renumbered my migrations:
takes_and_synthesis_evidence: v34 → v35
access_tokens_permissions: v35 → v36
Updated assertions:
- src/commands/migrations/v0_28_0.ts schema-version gate: >= 35 → >= 36
- skills/migrations/v0.28.0.md: v34/v35 → v35/v36 references
- CHANGELOG.md v0.28.0 entry: schema migrations row + section header
Conflicts resolved:
- VERSION → 0.28.0 (kept ours)
- package.json → 0.28.0 + master's new test scripts (auto-merged structure)
- CHANGELOG.md → kept v0.28.0 entry on top, master's v0.26.4–v0.26.6 below
Verified post-merge:
- bun run typecheck: PASS
- 88 migration tests (migrate + bootstrap-coverage + apply-migrations)
- 99 v0.28 unit tests (takes-engine + fence + extract + page-lock + model-config + mcp-allowlist + think-pipeline + budget-meter + auto-think)
- 119 master-side tests (oauth + http-transport + destructive-guard + pages-soft-delete + schema-diff)

…al.test.ts
Master shipped v0.26.7 with a new test-isolation lint (scripts/check-test-isolation.sh) that flags any test mutating process.env outside withEnv(). Three v0.28 tests violated the new rule:
test/model-config.test.ts → model-config.serial.test.ts
test/takes-mcp-allowlist.test.ts → takes-mcp-allowlist.serial.test.ts
test/think-pipeline.test.ts → think-pipeline.serial.test.ts
Renaming to .serial.test.ts is the lint's documented escape hatch for genuinely env-coupled tests; they now run in the serial pass at --max-concurrency=1 instead of the parallel fast loop.
Conflicts resolved:
- VERSION → 0.28.0 (kept ours)
- package.json → 0.28.0 + master's check-test-isolation script
- CHANGELOG.md → kept v0.28.0 entry on top, master's v0.26.7 below
Verified post-merge:
- bun run verify: PASS (privacy + jsonb + progress + test-isolation + wasm + admin-build + typecheck)
- 37 v0.28 serial tests pass (model-config 16, mcp-allowlist 8, think 13)
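
For context on why the lint wants env mutation to go through a wrapper, here is a minimal sketch of a set-then-restore helper. The real withEnv() signature is not shown in this PR, so the three-argument form below (with the environment passed in explicitly) and the synchronous callback are assumptions made to keep the example self-contained.

```typescript
type Env = Record<string, string | undefined>;

// Applies overrides to env, runs fn, then restores the previous values even
// if fn throws. An override of undefined temporarily removes the variable.
function withEnv<T>(env: Env, overrides: Env, fn: () => T): T {
  const saved: Env = {};
  for (const key of Object.keys(overrides)) {
    saved[key] = env[key];
    const value = overrides[key];
    if (value === undefined) delete env[key];
    else env[key] = value;
  }
  try {
    return fn();
  } finally {
    for (const key of Object.keys(overrides)) {
      const previous = saved[key];
      if (previous === undefined) delete env[key];
      else env[key] = previous;
    }
  }
}

// Usage with a plain object standing in for process.env.
const fakeEnv: Env = { GBRAIN_MODEL: "default" };
const seen = withEnv(fakeEnv, { GBRAIN_MODEL: "override" }, () => fakeEnv.GBRAIN_MODEL);
console.log(seen, fakeEnv.GBRAIN_MODEL);
```

Tests whose behavior depends on the environment for their whole duration cannot be wrapped this way, which is the case the .serial.test.ts rename covers.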

…tions to v37/v38
Master shipped three more releases while v0.28 was in flight:
- v0.26.8: auto-RLS event trigger (migration v35)
- v0.26.9: OAuth RFC 6749 hardening + close HTTP MCP shell-job RCE
- v0.27.0: pluggable embedding providers via Vercel AI SDK (migration v36: subagent_provider_neutral_persistence_v0_27)
Master's new v35 + v36 collide with my v35 + v36 (v0.28 takes layer). Renumbered my migrations:
takes_and_synthesis_evidence: v35 → v37
access_tokens_permissions: v36 → v38
Updated assertions:
- src/commands/migrations/v0_28_0.ts schema-version gate: >= 36 → >= 38
- skills/migrations/v0.28.0.md: v35/v36 → v37/v38 references
- CHANGELOG.md v0.28.0 entry: schema migrations row + section header
Conflicts resolved:
- VERSION → 0.28.0 (kept ours)
- package.json → 0.28.0 + master's new ai-sdk deps (@ai-sdk/anthropic, @ai-sdk/google, @ai-sdk/openai, @ai-sdk/openai-compatible, ai, eventsource-parser, zod)
- src/cli.ts CLI_ONLY → kept both 'providers' (master v0.27) and 'takes', 'think' (mine)
- CHANGELOG.md → kept v0.28.0 entry on top, master's v0.26.8 below
Verified post-merge:
- bun install: ai-sdk deps installed
- bun run typecheck: PASS
- bun run verify: PASS (privacy + jsonb + progress + test-isolation + wasm + admin-build + typecheck)
- 203 tests pass: 88 migration + 99 v0.28 unit + 16 model-config serial
- 80 master-side tests pass (oauth + http-transport)

* refactor(core): extract SSRF helpers from integrations.ts to core/url-safety.ts src/core/git-remote.ts (next commit) needs isInternalUrl etc. but importing from src/commands/ would invert the layering boundary (no existing src/core/ file imports from src/commands/). Extract the SSRF helpers (parseOctet, hostnameToOctets, isPrivateIpv4, isInternalUrl) into a new src/core/url-safety.ts and have integrations.ts re-export for backward compat. test/integrations.test.ts continues to pass without changes (110 existing tests, 214 expects). Why this matters for v0.28: the upcoming sources --url feature reuses this SSRF gate for git-clone URL validation. Codex review caught that re-rolling weaker URL classification would regress on the IPv6/v4-mapped/ metadata/CGNAT bypass forms that integrations.ts already handles. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(core): add git-remote module — SSRF-defensive clone/pull + state probe New src/core/git-remote.ts (~210 lines) for v0.28's remote-source feature: - GIT_SSRF_FLAGS exported const: -c http.followRedirects=false, -c protocol.file.allow=never, -c protocol.ext.allow=never, --no-recurse-submodules. Single source of truth shared by cloneRepo and pullRepo so a future flag added to one path lands on both. Closes the SSRF surfaces codex flagged: DNS rebinding via redirects, .gitmodules as a second-fetch surface, file:// scheme in remotes. - parseRemoteUrl: https-only, rejects embedded credentials and path traversal, delegates internal-target classification to isInternalUrl from url-safety.ts (covers RFC1918, link-local, loopback, IPv6, CGNAT 100.64/10, metadata hostnames, hex/octal/single-int bypass forms). GBRAIN_ALLOW_PRIVATE_REMOTES=1 escape hatch with stderr warning is needed for self-hosted git over Tailscale (CGNAT trips the gate). 
- cloneRepo: --depth=1 default (full clone via depth: 0); refuses non-empty destDirs; spawns git via execFileSync (no shell injection) with GIT_TERMINAL_PROMPT=0 + askpass=/bin/false to prevent credential prompts. timeoutMs default 600s. - pullRepo: -C path + GIT_SSRF_FLAGS + pull --ff-only, same env confine. - validateRepoState: 6-state decision tree (missing | not-a-dir | no-git | corrupted | url-drift | healthy). Used by performSync's re-clone branch to recover from rmd clone dirs and refuse syncs on url-drift or corruption. test/git-remote.test.ts (304 lines, 32 tests): GIT_SSRF_FLAGS exact shape, all parseRemoteUrl rejection cases including dedicated CGNAT 100.64/10 with/without GBRAIN_ALLOW_PRIVATE_REMOTES (codex T3 case), fake-git harness for argv assertions on cloneRepo/pullRepo, all 6 validateRepoState branches. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(core): add scope hierarchy + ALLOWED_SCOPES allowlist New src/core/scope.ts (~120 lines) for v0.28's scoped MCP feature. Hierarchy: - admin implies all (escape hatch) - write implies read - sources_admin and users_admin are siblings (different axes — sources-mgmt vs user-account-mgmt; neither implies the other) Exported: - hasScope(grantedScopes, requiredScope): the canonical scope check. Replaces exact-string-match at three call sites in upcoming commits (serve-http.ts:673, oauth-provider.ts:365 F3 refresh, oauth-provider.ts:498 token issuance). Without this rewrite, an admin-grant token would fail to refresh down to sources_admin (codex finding). - ALLOWED_SCOPES set + ALLOWED_SCOPES_LIST sorted array (deterministic for OAuth metadata wire format and drift-check output). - assertAllowedScopes / InvalidScopeError: registration-time gate so tokens with bogus scope strings (read flying-unicorn) get rejected with RFC 6749 §5.2 invalid_scope at auth.ts:296 + DCR /register + registerClientManual. Today's behavior accepts any string silently. 
- parseScopeString: space-separated wire format → array. Forward-compat: hasScope ignores unknown granted scopes rather than throwing, so pre-allowlist tokens with weird scope strings continue working without crashes (registration is the gate, runtime is best-effort). test/scope.test.ts (178 lines, 35 tests): hierarchy table including all-implies for admin, sibling non-implication of *_admin scopes, write→read but not the reverse, F3 refresh-token subset semantics under hasScope, ALLOWED_SCOPES_LIST sorted-pinning, allowlist rejection cases, parseScopeString edge cases (undefined/null/empty). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * build(admin): scope-constants mirror + drift CI for src/core/scope.ts The admin React SPA's tsconfig.json scopes include: ['src'] to admin/src/, so it cannot directly import ../../src/core/scope.ts. The plan considered widening the include or generating a single source of truth; both options either couple the SPA to the gbrain monorepo or add a build step. Eng review picked the boring choice: hand-maintained mirror at admin/src/lib/scope-constants.ts plus a CI drift check. Files: - admin/src/lib/scope-constants.ts: hand-maintained ALLOWED_SCOPES_LIST duplicate, sorted alphabetically to match src/core/scope.ts. - scripts/check-admin-scope-drift.sh: extracts the list from each file via awk, normalizes via tr/sort, diffs. Exits 0 on match, 1 on drift (with full breakdown of which scopes diverged), 2 on internal error. Tested both passing and corrupted paths. - package.json: wires check:admin-scope-drift into both `verify` and `check:all` so any update to src/core/scope.ts that forgets the admin-side mirror fails the build. The Agents.tsx scope-checkbox sites (5 hardcoded locations) get updated in a later commit to import from this constants file. 
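
The scope hierarchy and allowlist described in the scope.ts commit above can be sketched as a small lookup table. This is a sketch, not the contents of src/core/scope.ts: the `IMPLIES` table layout and `isAllowedScope` are hypothetical, and the sketch encodes only the rules stated in the commit message (admin implies all, write implies read, the two *_admin scopes imply only themselves). Whether the *_admin scopes also imply read is not stated there, so it is left out.

```typescript
// Sorted alphabetically, matching the "sorted array" note in the commit.
const ALLOWED_SCOPES_LIST = [
  "admin",
  "read",
  "sources_admin",
  "users_admin",
  "write",
] as const;

type Scope = (typeof ALLOWED_SCOPES_LIST)[number];

// For each granted scope, the set of required scopes it satisfies.
const IMPLIES: Record<Scope, readonly Scope[]> = {
  admin: ALLOWED_SCOPES_LIST, // escape hatch: implies everything
  write: ["write", "read"],
  read: ["read"],
  sources_admin: ["sources_admin"], // siblings: neither implies the other
  users_admin: ["users_admin"],
};

function isAllowedScope(s: string): s is Scope {
  return (ALLOWED_SCOPES_LIST as readonly string[]).includes(s);
}

// Space-separated wire format to array; undefined, null, and empty give [].
function parseScopeString(raw: string | null | undefined): string[] {
  if (!raw) return [];
  return raw.split(/\s+/).filter((s) => s.length > 0);
}

// Unknown granted scopes are skipped rather than throwing, so tokens issued
// before the allowlist existed keep working at request time.
function hasScope(granted: readonly string[], required: Scope): boolean {
  return granted.some((g) => isAllowedScope(g) && IMPLIES[g].includes(required));
}

console.log(hasScope(parseScopeString("admin"), "sources_admin"));
console.log(hasScope(parseScopeString("write"), "sources_admin"));
```

The same check can serve both the request-time gate and the refresh-token subset rule: a refresh is allowed when every requested scope passes hasScope against the original grant.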
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(oauth): hasScope hierarchy + ALLOWED_SCOPES allowlist at registration Switch three call sites in oauth-provider.ts from exact-string-match to hasScope() so the v0.28 sources_admin and users_admin scopes — and the admin-implies-all + write-implies-read hierarchy in src/core/scope.ts — work end to end: - F3 refresh-token subset enforcement at line 365: previously rejected admin → sources_admin refresh because exact-match treated them as unrelated scopes. gstack /setup-gbrain Path 4 needs admin tokens to refresh down to least-privilege sources_admin scope; this fix lands that path. - Token issuance intersection at line 498 (client_credentials grant): same hasScope swap so a client whose stored grant is `admin` can mint tokens including any implied scope. - registerClient (DCR /register) and registerClientManual: validate every scope string against ALLOWED_SCOPES via assertAllowedScopes. Pre-fix the system silently accepted `--scopes "read flying-unicorn"` and persisted the bogus string in oauth_clients.scope. Post-fix the caller gets RFC 6749 §5.2 invalid_scope. Existing rows with pre-allowlist scopes keep working (allowlist gates registration only). Tests amended in test/oauth.test.ts: - T1 (eng-review): admin grant CAN refresh down to sources_admin - T1 sibling: write grant CANNOT refresh up to sources_admin - ALLOWED_SCOPES allowlist coverage (manual + DCR paths, all 5 valid) - Scope-annotation contract tests widened to accept the v0.28 union 62 OAuth tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(serve-http): hasScope at /mcp + advertise full ALLOWED_SCOPES Two changes against src/commands/serve-http.ts: - Line 195: scopesSupported on the mcpAuthRouter options switches from the hardcoded ['read','write','admin'] to Array.from(ALLOWED_SCOPES_LIST). 
Without this, /.well-known/oauth-authorization-server keeps reporting the old triple, so MCP clients (Claude Desktop, ChatGPT, Perplexity) cannot discover the v0.28 sources_admin and users_admin scopes via standard discovery — they would have to be pre-configured out of band. - Line 673: request-time scope check on /mcp swaps authInfo.scopes.includes(requiredScope) for hasScope(...). This was the most-cited codex finding: without it, sources_admin tokens could not even satisfy a `read`-scoped op (sources_admin doesn't include the literal string "read"). hasScope routes through the hierarchy table in src/core/scope.ts so admin implies all and write implies read at the gate too. T2 amendment in test/e2e/serve-http-oauth.test.ts: assert /.well-known/oauth-authorization-server includes all 5 scopes in scopes_supported. Pre-v0.28 the list was hardcoded to ['read','write', 'admin'] and this assertion would have failed. (The test is Postgres-gated; runs under bun run test:e2e with DATABASE_URL set.) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(core): sources-ops module — atomic clone + symlink-safe cleanup src/core/sources-ops.ts (~470 lines): pure async functions extracted from src/commands/sources.ts so the CLI handlers and the new MCP ops share one implementation. addSource: D3 atomicity contract from the eng review. 1. Validate id (matches existing SOURCE_ID_RE). 2. Q4 pre-flight SELECT — fail loudly with structured `source_id_taken` before any clone work. Pre-fix the existing CLI used INSERT…ON CONFLICT DO NOTHING which silently no-op'd; with clone-first that would orphan the temp dir. 3. parseRemoteUrl gate (delegates to isInternalUrl from url-safety.ts). 4. Clone into $GBRAIN_HOME/clones/.tmp/<id>-<rand>/ via the new git-remote helpers. 5. INSERT row with local_path=<final clone dir>, config.remote_url=<url>. 6. fs.renameSync(tmp/, final/). 
Rollback on either-side failure unlinks the temp dir; rename-failed path also DELETEs the just-INSERTed row best-effort. removeSource: clone-cleanup with realpath+lstat confinement matching validateUploadPath() shape at src/core/operations.ts:61. String startsWith is symlink-unsafe and would let $GBRAIN_HOME/clones/<id> → /etc resolve out of the confine. Two defenses layered: - isPathContained (realpath-resolves both sides + parent-with-sep string check) rejects symlinks whose target falls outside the confine. - lstat-then-isSymbolicLink check refuses symlinks whose realpath happens to land back inside the confine (defense in depth). getSourceStatus: returns clone_state via validateRepoState (the 6-state decision tree from git-remote.ts). Lets a remote MCP caller diagnose "healthy | missing | not-a-dir | no-git | url-drift | corrupted" without SSH access to the brain host. listSources additionally exposes remote_url so callers can see which sources are auto-managed. recloneIfMissing: T4 follow-up for `gbrain sources restore` after the clone dir was autopurged — re-clones via the same temp + rename atomicity contract. Idempotent (returns false when clone is already healthy). test/sources-ops.test.ts (~470 lines, 24 tests): pre-flight collision (Q4), happy paths for both --path and --url, all four D3 rollback paths (clone-fail before INSERT, INSERT-fail after clone, rename-fail post-INSERT, atomic temp-dir cleanup), symlink-target-OUTSIDE-clones (realpath confinement), symlink-target-INSIDE-clones (lstat-check), removeSource refuses to delete user-supplied paths, refuses "default" source, getSourceStatus clone_state branches, T4 recloneIfMissing recovery + idempotent + no-op for path-only sources, isPathContained unit tests covering subtree / outside / symlink-escape / fail-closed. 
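
The clone-then-INSERT-then-rename ordering and its rollback can be sketched with the side effects injected as callbacks. Everything named here (`AddSourceSteps`, `addSourceAtomic`, the step names) is hypothetical; the real addSource in src/core/sources-ops.ts also performs the id validation, pre-flight SELECT, and URL gate, which this sketch omits.

```typescript
// Side effects are injected so the ordering can be exercised without git,
// a database, or a filesystem.
interface AddSourceSteps {
  clone(tmpDir: string): void;
  insertRow(finalDir: string): void;
  rename(tmpDir: string, finalDir: string): void;
  removeDir(dir: string): void;
  deleteRow(): void;
}

// Clone into a temp dir, INSERT the row pointing at the final dir, then
// rename temp to final. Any failure removes the temp dir; a failure after
// the INSERT also deletes the row, best-effort.
function addSourceAtomic(
  tmpDir: string,
  finalDir: string,
  steps: AddSourceSteps,
): void {
  let inserted = false;
  try {
    steps.clone(tmpDir);
    steps.insertRow(finalDir);
    inserted = true;
    steps.rename(tmpDir, finalDir);
  } catch (err) {
    try {
      steps.removeDir(tmpDir);
    } catch {
      // Best-effort cleanup: the original error matters more.
    }
    if (inserted) {
      try {
        steps.deleteRow();
      } catch {
        // Best-effort: leave the original failure as the reported error.
      }
    }
    throw err;
  }
}
```

Cloning before the INSERT means a failed clone never leaves a row behind, and renaming last means the final path only ever holds a complete clone.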
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(operations): whoami + sources_{add,list,remove,status} MCP ops Five new ops in src/core/operations.ts auto-flow through src/mcp/tool-defs.ts so MCP clients (Claude Desktop, ChatGPT, Perplexity, OpenClaw) get them via standard tools/list discovery — no SDK or transport code changes needed. Operation.scope union widened to add 'sources_admin' and 'users_admin' (the v0.28 hierarchy from src/core/scope.ts). whoami (scope: read): introspect calling identity over MCP. - Returns `{transport: 'oauth', client_id, client_name, scopes, expires_at}` for OAuth clients (clientId starts with gbrain_cl_). - Returns `{transport: 'legacy', token_name, scopes, expires_at: null}` for grandfathered access_tokens. - Returns `{transport: 'local', scopes: []}` when ctx.remote === false. Empty scopes (NOT ['read','write','admin']) is the D2 decision — returning OAuth-shaped scopes for local callers would resurrect the v0.26.9 footgun where code conditionally trusted on `auth.scopes.includes('admin')` instead of `ctx.remote === false`. - Q3 fail-closed: throws unknown_transport when remote=true AND auth is missing OR ctx.remote is the literal `undefined` (cast bypass guard). A future transport that forgets to thread auth doesn't get a free pass. sources_add (sources_admin, mutating): register a source by --path (existing v0.17 behavior) or --url (v0.28 federated remote-clone path). Calls into addSource from sources-ops.ts which owns the temp-dir + rename atomicity. sources_list (read): list registered sources with page counts, federated flag, and remote_url. The remote_url field is new — lets a remote MCP caller see which sources are auto-managed. sources_remove (sources_admin, mutating): cascade-delete a source + symlink-safe clone cleanup. Requires confirm_destructive: true when the source has data. 
sources_status (read): per-source diagnostic returning clone_state ('healthy' | 'missing' | 'not-a-dir' | 'no-git' | 'url-drift' | 'corrupted' | 'not-applicable'); lets a remote MCP caller diagnose a busted clone without SSH access to the brain host.
test/whoami.test.ts (9 tests): pinned transport-detection for all four return shapes including Q3 fail-closed throw under both auth=undefined and remote=undefined cast-bypass paths.
test/sources-mcp.test.ts (16 tests): op-metadata pins (scope, mutating, localOnly), functional handler shape against PGLite, hasScope-driven scope-enforcement smoke test simulating the serve-http.ts:673 gate (read-only token rejected for sources_add; sources_admin token allowed; admin token allowed for everything; gstack /setup-gbrain Path 4 token covers all 4 ops), SSRF gate at the op layer.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(sync): re-clone fallback when clone is missing/no-git/corrupted
src/commands/sync.ts gets a v0.28-aware front-half. When the source has config.remote_url, performSync calls validateRepoState before the existing fast-forward pull path:
- 'healthy' → fall through to existing pull (unchanged)
- 'missing' / 'no-git' / 'not-a-dir' → loud stderr "auto-recovery: re-cloning <id>", then recloneIfMissing handles the temp-dir + rename. Sync continues from the freshly-cloned head.
- 'corrupted' → throw with structured hint pointing at sources remove + add (no syncing wrong state).
- 'url-drift' → throw with hint pointing at the (deferred) sources rebase-clone command.
Closes the operator-confidence gap: rm -rf $GBRAIN_HOME/clones/<id>/ no longer breaks future syncs. The next sync sees the missing dir and recovers via the recorded URL.
src/core/operations.ts: extend ErrorCode with 'unknown_transport' so whoami's Q3 fail-closed path type-checks.
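
The state-to-action mapping in the sync commit body can be sketched as an exhaustive switch. `SyncAction`, `planSync`, and the hint strings are hypothetical; the real performSync acts directly (pull, re-clone, or throw) rather than returning a plan.

```typescript
// The six states validateRepoState is described as returning.
type CloneState =
  | "healthy"
  | "missing"
  | "no-git"
  | "not-a-dir"
  | "corrupted"
  | "url-drift";

type SyncAction =
  | { kind: "pull" }
  | { kind: "reclone" }
  | { kind: "refuse"; hint: string };

// Maps each clone state to what sync should do next, following the state
// list in the commit body above.
function planSync(state: CloneState): SyncAction {
  switch (state) {
    case "healthy":
      return { kind: "pull" };
    case "missing":
    case "no-git":
    case "not-a-dir":
      // Recoverable: the recorded remote URL is enough to rebuild the clone.
      return { kind: "reclone" };
    case "corrupted":
      return { kind: "refuse", hint: "remove the source and add it again" };
    case "url-drift":
      return { kind: "refuse", hint: "clone remote differs from the recorded URL" };
  }
}

console.log(planSync("missing").kind);
```

Because the switch covers every member of the union, adding a seventh state makes the compiler flag the function until it is handled.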
test/sources-resync-recovery.test.ts (12 tests): full validateRepoState state matrix exercised under fake-git, recloneIfMissing recovery from each degraded state, idempotent on healthy clones, the sync.ts:320 integration path that drives the recovery. test/sources-ops.test.ts + test/sources-mcp.test.ts: drop the GBRAIN_PGLITE_SNAPSHOT-disable line so these tests stop forcing cold init across the parallel-shard runner. With snapshot allowed, init time drops from 6+s to ~50ms and parallel runs stay under the 5s hook timeout. test/sources-mcp.test.ts: tighten scope literal-type so tsc keeps the union narrow. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(cli): sources add --url + restore re-clone, thin-wrapper refactor src/commands/sources.ts now delegates the data-mutation work to src/core/sources-ops.ts (added in the previous commit). The CLI handler parses argv, calls into addSource, and formats output. Two new flags on `gbrain sources add`: - `--url <https-url>` : federated remote-clone path (clone + INSERT + rename, atomic rollback on failure). - `--clone-dir <path>` : override the default $GBRAIN_HOME/clones/<id>/ destination. Validation rejects mutually-exclusive `--url` + `--path`. Errors from the ops layer (SourceOpError) propagate through the CLI's standard error wrapper in src/cli.ts so existing tests that assert throw shape keep passing. `gbrain sources restore <id>` (T4 from eng review): if the source has a remote_url AND the on-disk clone was autopurged, call recloneIfMissing before declaring success. Clone errors print a WARN with recovery hints rather than failing the restore — the DB row is what restore guarantees; the clone is best-effort. 54 sources-related tests pass (existing test/sources.test.ts + sources-ops + sources-mcp). 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * feat(doctor,cycle): orphan-clones surface + autopilot purge phase (P1) addSource's atomicity contract uses a temp dir that gets renamed to the final clone path. If the process is SIGKILL'd between clone-finish and rename, the temp dir orphans on disk. Without sweeping these, a brain server accumulates gigabytes over months of failed `sources add --url` attempts. Two layers: 1. `gbrain doctor` now surfaces stale entries. A new orphan_clones check walks $GBRAIN_HOME/clones/.tmp/, names anything older than 24h, and prints a warn with disk-byte estimate. Operators see the leak before `df` complains. 2. The autopilot cycle's existing `purge` phase grows a substep that nukes .tmp/ entries past the same 72h TTL the page-soft-delete purge uses. Operator behavior stays uniform across all soft-delete-style surfaces. Both layers are filesystem-only (no DB). On a brain that never used --url cloning, both are no-ops. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * build(admin): scope checkboxes source from scope-constants mirror + dist admin/src/pages/Agents.tsx Register Client modal: - useState default sources from ALLOWED_SCOPES_LIST (defaulting `read` to true, others false; unchanged UX for the common case). - Scope checkbox map iterates ALLOWED_SCOPES_LIST instead of the old hardcoded ['read','write','admin']. Without this commit, even with the v0.28.1 server-side scope hierarchy, operators registering an OAuth client from the admin UI cannot tick the new sources_admin / users_admin scopes — defeats the whole gstack /setup-gbrain Path 4 unblock. The drift-check CI gate (scripts/check-admin-scope-drift.sh) ensures this list stays in sync with src/core/scope.ts going forward. admin/dist/* rebuilt via `cd admin && bun run build`. Old hash bundle removed; new bundle (224.96 kB / 68.70 kB gzip). 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: v0.28.1 — remote-source MCP + scope hierarchy + whoami VERSION + package.json: bump to 0.28.1 (per CLAUDE.md branch-scoped versioning rule — this branch adds substantial new features on top of v0.28.0). CHANGELOG.md: new top-level entry for v0.28.1 in the gstack/Garry voice (no AI vocabulary, no em dashes, real numbers + commands). Lead paragraph names what the user can now do that they couldn't before. "Numbers that matter" table calls out the +5 MCP ops, +2 OAuth scopes, and the 4-to-0 SSH-step number for gstack /setup-gbrain Path 4. "What this means for you" closer ties the work to the operator workflow shift. "To take advantage of v0.28.1" block has paste-ready upgrade commands including the admin SPA rebuild step. Itemized changes section describes the architecture cleanly without exposing scope-string internals to public attack-surface enumeration (per CLAUDE.md responsible-disclosure rule). TODOS.md: file 6 follow-ups under a new "Remote-source MCP follow-ups (v0.28.1)" section: token rotation, migration introspection in get_health, Accept-header friendliness, sources rebase-clone for URL-drift recovery, --filter=blob:none partial-clone option, and the chunker_version PGLite-schema parity codex caught. README.md: short subsection under the existing sources CLI listing that names the new --url flag and what auto-recovery does. Capability framing (no scope-string enumeration). llms.txt + llms-full.txt: regenerated via `bun run build:llms` so the documentation bundle reflects the v0.28.1 entry. The build-llms generator's drift check passes. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * test(e2e): sources-remote-mcp — full gstack /setup-gbrain Path 4 round-trip Spins up `gbrain serve --http` against real Postgres with a fake-git binary in PATH (so `git clone` is exercised end-to-end without network), registers two OAuth clients (sources_admin + read-only), mints tokens, calls the new v0.28.1 MCP ops via /mcp, and asserts the gstack /setup-gbrain Path 4 flow works end to end. 12 tests cover the full lifecycle: - whoami over HTTP MCP returns transport=oauth + the right scopes - /.well-known/oauth-authorization-server advertises all 5 scopes - sources_add: clone fires, INSERT lands, row carries config.remote_url - sources_status: clone_state=healthy after add - sources_list: surfaces remote_url for the new source - SSRF rejection: sources_add with RFC1918 URL fails at parseRemoteUrl gate - Scope enforcement: read-only token gets insufficient_scope on sources_add - Read-only token CAN call sources_list (read-scoped op) - ALLOWED_SCOPES allowlist: CLI register-client rejects bogus scope - Recovery: rm clone dir + sources_status reports clone_state=missing - sources_remove: cascades + cleans up the auto-managed clone dir Subprocess env threading replicates the v0.26.2 bun execSync inheritance pattern — bun does NOT inherit process.env mutations, so every CLI subprocess call passes env: { ...process.env } explicitly. Cleanup contract mirrors test/e2e/serve-http-oauth.test.ts: revoke any clients we registered, force-kill the server subprocess on SIGTERM timeout, surface cleanup failures to stderr without throwing so real test failures aren't masked. The base table list in helpers.ts (ALL_TABLES) doesn't include sources or oauth_clients, so this test explicitly truncates them in beforeAll to avoid Q4 pre-flight collisions on re-run. Skipped gracefully when DATABASE_URL is unset. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: codex adversarial review - confine remote sources_admin + close SSRF gaps

Pre-ship adversarial review (codex exec) caught five issues. Four ship in this commit; the fifth (DNS rebinding) is filed as v0.28.x follow-up.

CRITICAL: `sources_admin` tokens over HTTP MCP could plant content at any host path. The MCP op exposed `path` and `clone_dir` to remote callers; the op layer trusted them verbatim, then auto-recovery's rm -rf on degraded state turned that into arbitrary delete primitives. src/core/operations.ts sources_add handler now drops both fields when ctx.remote !== false. Local CLI keeps the override (operator trust). Loud logger.warn when a remote caller tries, visible in the SSE feed without leaking values.

HIGH: Steady-state `git pull --ff-only` bypassed GIT_SSRF_FLAGS entirely. The legacy helper at src/commands/sync.ts:192 spawned git without the -c http.followRedirects=false -c protocol.{file,ext}.allow=never --no-recurse-submodules set that cloneRepo applies. Every recurring sync was reopening the redirect/submodule/protocol bypass. Routed the call site at sync.ts:381 through pullRepo from git-remote.ts so initial clone and ongoing pull share one defensive flag set.

MEDIUM: listSources ignored its `include_archived` flag. The op advertised the param but the function destructured it as `_opts` and queried every row. Archived sources' ids, local_paths, and remote_urls were leaking to read-scoped MCP callers by default. Filter in SQL (`WHERE archived IS NOT TRUE` unless the flag is set) so archived rows never reach the wire.

PARTIAL HIGH: IPv6 ULA fc00::/7 and link-local fe80::/10 were not in the isInternalUrl bypass list. Only ::1/:: and IPv4-mapped IPv6 were blocked. Added regex-based ULA + link-local rejection to url-safety.ts.

Test coverage:
- test/git-remote.test.ts: 4 new IPv6 cases (ULA fc-prefix + fd-prefix, link-local fe80::, public IPv6 still allowed).
- test/sources-mcp.test.ts: 3 new cases pinning the remote/local asymmetry (clone_dir override silently ignored over MCP, path nulled, local CLI keeps the override).
- test/sources-mcp.test.ts: 2 new cases for include_archived honored.

DNS rebinding (codex finding #3): the current gate is lexical only. A deliberate attacker who controls a hostname's A/AAAA records can still resolve to an internal IP. Closing this requires async DNS resolution + revalidation; filed as v0.28.x follow-up in TODOS.md so the API change surface (parseRemoteUrl becomes async, every caller updates) lands in its own PR.

323 tests pass (9 files); 4071 unit tests pass (full suite).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: rebump v0.28.1 → v0.28.2 (master collision)

Caught after PR creation. master is at v0.28.1 already; this branch forked from garrytan/v0.28-release at v0.28.0 and naively bumped to v0.28.1 without checking the master queue. CI version-gate would have rejected at merge time (requires VERSION strictly greater than master's).

Root cause: I bumped VERSION mechanically during plan implementation (echo "0.28.1" > VERSION) without consulting the queue-aware allocator at bin/gstack-next-version. /ship Step 12's idempotency check then classified state as ALREADY_BUMPED and the workflow's "queue drift" comparison was the safety net I should have hit, but I skipped it.

Files updated:
- VERSION + package.json: 0.28.1 → 0.28.2
- CHANGELOG.md: header + "To take advantage of v0.28.2" subsection
- README.md: sources --url note version reference
- TODOS.md: 7 follow-up entries' version references
- llms.txt + llms-full.txt: regenerated

PR title rewrite via gstack-pr-title-rewrite.sh handled in a separate gh pr edit call; CI version-gate now passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
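The regex-based IPv6 ULA and link-local rejection mentioned in the codex adversarial review commit above can be illustrated with a small lexical check. This is a sketch under stated assumptions, not the code in url-safety.ts: the name `isInternalIPv6` and both regex constants are hypothetical. It also stays lexical only, so it does nothing about the DNS rebinding case the commit files as a follow-up, and it does not cover IPv4-mapped addresses.

```typescript
// Hypothetical sketch; the real url-safety.ts may differ.
// fc00::/7 covers first hextets fc00 through fdff.
const IPV6_ULA = /^f[cd][0-9a-f]{2}:/i;
// fe80::/10 covers first hextets fe80 through febf.
const IPV6_LINK_LOCAL = /^fe[89ab][0-9a-f]:/i;

function isInternalIPv6(host: string): boolean {
  // URL.hostname wraps IPv6 literals in brackets, so strip them first.
  const addr = host.replace(/^\[|\]$/g, "").toLowerCase();
  // Loopback and the unspecified address.
  if (addr === "::1" || addr === "::") return true;
  return IPV6_ULA.test(addr) || IPV6_LINK_LOCAL.test(addr);
}

// The four case families named in the test-coverage list:
console.log(isInternalIPv6("fc00::1"));                              // ULA, fc prefix
console.log(isInternalIPv6("fd12:3456::1"));                         // ULA, fd prefix
console.log(isInternalIPv6(new URL("http://[fe80::1]/").hostname));  // link-local
console.log(isInternalIPv6("2001:db8::1"));                          // not in either range
```

Matching on the first hextet works here because every address in both ranges starts with `f`, so the leading group is always written with four hex digits.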

….28.6

Master shipped three v0.28.x patch releases without the takes feature while v0.28-release was in flight:
- v0.28.1: zombie process accumulation + health endpoint timeout (#637)
- v0.28.3: restart-sweep - detect dropped Telegram messages (#675)
- v0.28.4: skillify cross-modal eval quality gate (#674)

Master's v0.28.0 slot was consumed without the takes layer ever landing, so this release ships the original takes feature as v0.28.6 (skipping v0.28.5 to leave space for any in-flight master patches). The migration orchestrator file (v0_28_0.ts) and migration skill doc (skills/migrations/v0.28.0.md) keep their original version keys; those identify the migration version, not the release version.

Conflicts resolved:
- VERSION → 0.28.6 (was 0.28.0; master had 0.28.4)
- package.json → 0.28.6 (auto-merged ai-sdk deps from master's v0.27)
- CHANGELOG.md → renamed top entry "## [0.28.0]" → "## [0.28.6]" with date 2026-05-06; rebuilt the "To take advantage of" block (was truncated by stale === markers from a prior merge); preserved master's v0.28.4/v0.28.3/v0.28.1 entries beneath
- src/cli.ts auto-merged (CLI_ONLY has providers + takes/think both)

Verified post-merge:
- bun run verify: PASS (privacy + jsonb + progress + test-isolation + wasm + admin-build + typecheck)
- 133 tests pass: migrate + apply-migrations + takes-engine + takes-fence
- migrations v37 (takes) + v38 (access_tokens_permissions) apply cleanly on top of master's v35 (auto-RLS) + v36 (subagent persistence)

…l-bench # Conflicts: # CHANGELOG.md # CLAUDE.md # VERSION # package.json # src/cli.ts

… v0.28.6

While preparing the takes release as v0.28.6, the remote branch landed v0.28.2 (remote-source MCP + scope hierarchy + whoami, PR #690). Pulling that into local while keeping the takes feature on its v0.28.6 slot.

Conflicts resolved:
- VERSION → 0.28.6 (kept ours; remote was 0.28.2)
- package.json → 0.28.6 (kept ours)
- CHANGELOG.md → kept "## [0.28.6]" header on top; inserted remote's "## [0.28.2]" entry between v0.28.3 and v0.28.1 in version-descending order. Dropped the duplicate "## [0.28.0]" header from remote since that was the original takes release that I renamed to v0.28.6.
- TODOS.md → kept BOTH sides' new TODO entries (cross-modal-eval follow-ups + v0.28.2 follow-ups; non-overlapping content).

Verified post-merge:
- bun run verify: PASS (privacy + jsonb + progress + test-isolation + wasm + admin-build + admin-scope-drift + typecheck)
- 118 tests pass: migrate + apply-migrations + takes-engine
- Migration sequence intact: v37 (takes) + v38 (access_tokens_permissions) on top of master's v35 + v36

…rytan/longmemeval-bench # Conflicts: # CHANGELOG.md # VERSION # package.json # src/cli.ts

Master shipped v0.28.5 (PGLite upgrade wedge + embedding dim corruption + bun-link foot-gun fix wave, PR #697). This release stays on v0.28.6.

Conflicts resolved:
- VERSION → 0.28.6 (kept ours; master had 0.28.5)
- package.json version → 0.28.6
- package.json scripts → kept BOTH new check scripts: my check:admin-scope-drift (from v0.28.2 cherry) + master's check:cli-exec (new in v0.28.5). Verify pipeline now runs both; check:all runs both.
- CHANGELOG.md → kept "## [0.28.6]" header on top; inserted master's full v0.28.5 entry between v0.28.6 and v0.28.4 in version-descending order. The "## To take advantage of v0.28.5" interleaved conflict was untangled by extracting master's entry from origin/master:CHANGELOG.md rather than trying to weave the two "to take advantage of" blocks back together inline.

Verified post-merge:
- bun run verify: PASS (privacy + jsonb + progress + test-isolation + wasm + admin-build + admin-scope-drift + cli-exec + typecheck)
- 121 tests pass: migrate + apply-migrations + takes-engine
- CHANGELOG order intact: 0.28.6 → 0.28.5 → 0.28.4 → 0.28.3 → 0.28.2 → 0.28.1
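The "CHANGELOG order intact" check and the "strictly greater than master's" version rule both come down to comparing MAJOR.MINOR.PATCH triples numerically. The sketch below is one way to do that; `parseVersion`, `compareVersions`, and `isStrictlyDescending` are hypothetical names, and nothing here is taken from the repository's CI gate.

```typescript
// Hypothetical helpers for numeric version comparison.
function parseVersion(v: string): [number, number, number] {
  const m = /^v?(\d+)\.(\d+)\.(\d+)$/.exec(v.trim());
  if (!m) throw new Error(`not a MAJOR.MINOR.PATCH version: ${v}`);
  return [Number(m[1]), Number(m[2]), Number(m[3])];
}

// Negative if a < b, zero if equal, positive if a > b.
function compareVersions(a: string, b: string): number {
  const [a1, a2, a3] = parseVersion(a);
  const [b1, b2, b3] = parseVersion(b);
  // `||` falls through to the next component only when the difference is 0.
  return a1 - b1 || a2 - b2 || a3 - b3;
}

// True when every entry is strictly greater than the one after it.
function isStrictlyDescending(versions: string[]): boolean {
  let prev: string | undefined;
  for (const v of versions) {
    if (prev !== undefined && compareVersions(prev, v) <= 0) return false;
    prev = v;
  }
  return true;
}

const order = ["0.28.6", "0.28.5", "0.28.4", "0.28.3", "0.28.2", "0.28.1"];
console.log(isStrictlyDescending(order));
// A branch at 0.28.1 is not strictly greater than a master at 0.28.1.
console.log(compareVersions("0.28.1", "0.28.1") > 0);
```

Comparing components as numbers rather than strings matters once patch numbers reach two digits: a string sort would place 0.28.9 above 0.28.10.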

…rytan/longmemeval-bench

…l-bench # Conflicts: # CHANGELOG.md # VERSION # package.json # src/core/think/sanitize.ts

Full 500-question 4-adapter LongMemEval _s benchmark landed at github.com/garrytan/gbrain-evals#main:ced01f0. gbrain-hybrid 97.60% R@5, +1.0pt over MemPal raw 96.6%.

Replacing the now-stale "needs full run" TODO with closure + 4 grounded follow-ups:
1. Timeline-aware retrieval signal for temporal-reasoning questions (P2: closes the only category we lose to MemPal-raw)
2. Per-question batch consolidation for ~10x cold-cache speedup (P3: makes daily benchmark CI gate practical)
3. LongMemEval _m split run (P3: differentiated, not yet published by MemPal)
4. Cheaper-embedding-model recipe (P4: recall-cost tradeoff curve)

Each TODO has the standard What/Why/Pros/Cons/Context/Depends-on shape per the gbrain TODOS-format convention.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
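For readers unfamiliar with the R@5 figure, the sketch below shows one common way to compute recall at K over a question set. It assumes a question counts as a hit when any gold item appears in the top K retrieved items; the scoring rule in gbrain-evals is not stated here and may differ. `QuestionResult`, `hitAtK`, and `recallAtK` are hypothetical names.

```typescript
// Hypothetical shape for one scored question.
interface QuestionResult {
  goldIds: string[];      // ids that contain the answer evidence
  retrievedIds: string[]; // ids returned by retrieval, best first
}

// A question is a hit if any gold id appears in the top k results.
function hitAtK(r: QuestionResult, k: number): boolean {
  const topK = new Set(r.retrievedIds.slice(0, k));
  return r.goldIds.some((id) => topK.has(id));
}

// Fraction of questions that are hits at k.
function recallAtK(results: QuestionResult[], k: number): number {
  if (results.length === 0) return 0;
  const hits = results.filter((r) => hitAtK(r, k)).length;
  return hits / results.length;
}

const demo: QuestionResult[] = [
  { goldIds: ["s3"], retrievedIds: ["s1", "s3", "s9"] },             // hit at rank 2
  { goldIds: ["s7"], retrievedIds: ["s1", "s2", "s4", "s5", "s6", "s7"] }, // rank 6, miss at k=5
];
console.log(recallAtK(demo, 5));
```

Under this definition, 97.60% over 500 questions corresponds to 488 questions with a hit in the top 5.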

…l-bench # Conflicts: # CHANGELOG.md # TODOS.md # VERSION # package.json

CI test/build-llms.test.ts asserts the committed llms.txt/llms-full.txt are byte-for-byte identical to what scripts/build-llms.ts produces. The master merge brought in v0.28.9/v0.28.10/v0.28.11 + multimodal embedding notes that updated CLAUDE.md; the bundle was stale. No content changes. Pure regeneration via `bun run build:llms`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
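A byte-for-byte drift guard like the one described above can be reduced to finding the first offset where two buffers differ. The sketch below is illustrative only: `firstDriftOffset` is a hypothetical helper, not code from test/build-llms.test.ts or scripts/build-llms.ts, and the inputs are inline strings rather than the real files.

```typescript
// Hypothetical helper: returns -1 when the buffers are identical, otherwise
// the first byte offset at which they differ. A pure length mismatch is
// reported at the end of the shorter buffer.
function firstDriftOffset(committed: Uint8Array, generated: Uint8Array): number {
  const n = Math.min(committed.length, generated.length);
  for (let i = 0; i < n; i++) {
    if (committed[i] !== generated[i]) return i;
  }
  return committed.length === generated.length ? -1 : n;
}

const enc = new TextEncoder();
const committedBundle = enc.encode("# gbrain\nv0.28.8 notes\n");
const generatedBundle = enc.encode("# gbrain\nv0.28.12 notes\n");

const offset = firstDriftOffset(committedBundle, generatedBundle);
if (offset !== -1) {
  console.log(`bundle is stale: first difference at byte ${offset}`);
}
```

Reporting the offset rather than a bare pass/fail makes it easier to see which part of the bundle went stale after a merge.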

…esult

Old entry buried the headline ("LongMemEval lands in the box…") under process detail (hermetic CI test count, 25.9ms p50, schema-table runtime enumeration). The reader cares what gbrain DOES, not how we plumbed the harness.

New entry leads with the actual number (97.60% R@5 on the public LongMemEval _s split, beating MemPalace raw by 1.0pt), followed by the per-category win table that proves gbrain ties or beats MemPal in 5 of 6 question types and shows the +7.1pt assistant-voice lift. Links to the full gbrain-evals report (97.60% headline + full methodology + reproducible runner) so curious readers can dig deeper.

Two honest findings published in plain text: vector-only is essentially tied with hybrid at K=5, and query expansion via Haiku is a clean null result on this dataset. Better to publish the null than hide it.

Reproduction block updated to match the actual gbrain-evals workflow (clone + bun install + dataset download + bash batch runner). The prior "download / run / hand to evaluate_qa.py" block stayed for the in-tree CLI path.

Regenerated llms-full.txt to keep the build-llms regen-drift guard green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

garrytan and others added 29 commits May 1, 2026 13:19

Merge remote-tracking branch 'origin/master' into garrytan/longmemeva…

87917af

…l-bench # Conflicts: # CHANGELOG.md # VERSION # package.json

garrytan changed the title ~~v0.28.1 feat: LongMemEval benchmark harness~~ v0.28.8 feat: LongMemEval benchmark harness May 7, 2026

garrytan added 7 commits May 6, 2026 20:57

Merge remote-tracking branch 'origin/master' into garrytan/longmemeva…

a80d18c

…l-bench # Conflicts: # CHANGELOG.md # CLAUDE.md # VERSION # package.json # src/cli.ts

Merge remote-tracking branch 'origin/garrytan/v0.28-release' into gar…

af1880a

…rytan/longmemeval-bench # Conflicts: # CHANGELOG.md # VERSION # package.json # src/cli.ts

Merge remote-tracking branch 'origin/garrytan/v0.28-release' into gar…

0be6b32

…rytan/longmemeval-bench

Merge remote-tracking branch 'origin/master' into garrytan/longmemeva…

1d84da4

…l-bench # Conflicts: # CHANGELOG.md # VERSION # package.json # src/core/think/sanitize.ts

garrytan changed the base branch from garrytan/v0.28-release to master May 7, 2026 14:23

garrytan and others added 2 commits May 7, 2026 15:58

Merge remote-tracking branch 'origin/master' into garrytan/longmemeva…

8fcb84b

…l-bench # Conflicts: # CHANGELOG.md # TODOS.md # VERSION # package.json

garrytan changed the title ~~v0.28.8 feat: LongMemEval benchmark harness~~ v0.28.12 feat: LongMemEval benchmark harness May 7, 2026

garrytan and others added 2 commits May 7, 2026 16:43

garrytan merged commit bca993e into master May 8, 2026
7 checks passed

garrytan merged 40 commits into master from garrytan/longmemeval-bench

garrytan commented May 4, 2026 · edited by blacksmith-sh Bot

Summary

Test Coverage

Pre-Landing Review

Design Review

Eval Results

Plan Completion

Scope Drift

TODOS

Documentation

Test plan
