v0.33.2.0 feat(search-lite): token budget + semantic query cache + intent weighting#897
Merged
garrytan merged 22 commits intoMay 13, 2026
Merged
Conversation
…ting
Adds three additive features to the hybrid search pipeline. All
backward-compatible: existing callers see identical behavior unless they
opt in to the new options.
## 1. Token Budget Enforcement (src/core/search/token-budget.ts)
Cap the cumulative token cost of returned results so search payloads
fit downstream context windows. Greedy top-down walk; preserves caller
ordering; no re-rank. char/4 heuristic for token counting (no
tokenizer dependency \u2014 keeps the bun --compile bundle small).
SearchOpts.tokenBudget \u2014 numeric cap. Default undefined = no-op.
HybridSearchMeta.token_budget = { budget, used, kept, dropped }
HTTP query op: pass `token_budget` param.
## 2. Semantic Query Cache (src/core/search/query-cache.ts + migration v52)
Cache search results keyed by query embedding similarity. HNSW lookup:
`embedding <=> $1 < 0.08` (cosine similarity >= 0.92). Per-source
isolation so multi-source brains don\u2019t bleed. Per-row TTL (default 3600s).
Best-effort writes; all errors swallowed so the cache never breaks the
search hot path.
Migration v52 creates query_cache table with HALFVEC where pgvector >= 0.7;
falls back to VECTOR with the resolved config.embedding_dimensions dim.
New `gbrain cache` CLI: stats / clear --yes / prune.
Config keys: search.cache.enabled / similarity_threshold / ttl_seconds.
HybridSearchMeta.cache = { status, similarity?, age_seconds? }
Routed through new `hybridSearchCached(engine, query, opts)` wrapper;
the operations.ts query op now uses this wrapper so MCP/CLI calls
benefit automatically. Skipped for two-pass walks + non-default
embedding columns where cache semantics don\u2019t hold.
## 3. Zero-LLM Intent Weighting (src/core/search/intent-weights.ts)
Builds on the existing query-intent classifier (4 intents: entity /
temporal / event / general). New weight-adjustment layer applies subtle
per-intent nudges:
entity \u2192 boost keyword RRF + exact slug/title match
temporal \u2192 default recency=on when caller left it unset
event \u2192 boost keyword RRF (rare named entities) + soft recency
general \u2192 no-op (1.0 multipliers everywhere)
All adjustments are SUBTLE (max 1.25x). Caller-explicit options ALWAYS
win \u2014 intent weighting never silently overrides recency / salience.
Default ON; opt out via `opts.intentWeighting = false`. LLM query
expansion (expansion.ts) is still available and opt-in via
`opts.expansion = true` \u2014 it just isn\u2019t the default anymore.
HybridSearchMeta.intent now surfaces classifier output for debugging.
## Tests
test/token-budget.test.ts (10 tests, pure module)
test/intent-weights.test.ts (13 tests, pure module)
test/query-cache.test.ts (12 tests, PGLite)
test/hybrid-search-lite.serial.test.ts (9 tests, PGLite e2e)
Plus 105 pre-existing search tests still pass. `bun run verify` clean.
Co-authored-by: Wintermute <agents@garrytan.com>
Resolved single conflict in src/core/migrate.ts: master claimed v52/53/54 (eval_contradictions_cache + eval_contradictions_runs + cjk_wave). PR garrytan#897's v52 query_cache migration renumbered to v55 to sit after. Typecheck clean.
…ybridSearch Three named modes (conservative / balanced / tokenmax) that bundle the search-lite knobs from PR garrytan#897 into a single config key. Mode resolution lives in bare hybridSearch (NOT just the cached wrapper) so eval-replay and eval-longmemeval — which call bare hybridSearch — test the same mode-affected behavior as production. See [CDX-5+6] in the plan. The mode bundle supplies DEFAULTS for intentWeighting, tokenBudget, expansion, and searchLimit when the caller leaves those undefined. Per-call SearchOpts and per-key config overrides still win (matches the v0.31.12 model-tier resolution chain at model-config.ts:resolveModel). knobsHash() exposes a stable SHA-256 of the resolved knob set; the cache contamination hotfix (next commit) consumes it to prevent a tokenmax write from being served to a conservative read. Three new fields on HybridSearchMeta: - mode (resolved mode name) - existing token_budget meta now fires from bare hybridSearch too Bare hybridSearch now applies tokenBudget at all three return paths (no-embedding-provider, keyword-only-fallback, main). Previously only hybridSearchCached enforced budget; eval commands missed it. Tests: 37 unit cases pin the 3x7 bundle table cell-by-cell, the resolution chain semantics, knobs hash determinism + cross-mode separation, and the config-table parser. All 72 search-lite tests pass. Bisect-friendly: this commit ONLY adds mode resolution. The cache-key contamination hotfix [CDX-4] is a separate atomic commit (next).
PR garrytan#897's query_cache keyed rows on sha256(source_id::query_text) only. A tokenmax search (expansion=on, limit=50) populated a row that a subsequent conservative call (no expansion, limit=10) read back, serving the wrong-shape results. This is a real bug in PR garrytan#897 today, regardless of the v0.32.3 mode picker work — Codex caught it in plan review. Fix: - Migration v56 adds query_cache.knobs_hash TEXT column + composite (source_id, knobs_hash, created_at) index. Existing rows have NULL knobs_hash and are excluded from lookups (silently re-populated with the right hash on first hit — no orphan data, no destructive migration). - cacheRowId(query, source, knobsHash) — knobsHash now part of the PK so a tokenmax write and a conservative write for the same (query, source) land in distinct rows. - SemanticQueryCache.lookup({knobsHash}) filters WHERE knobs_hash = $. - SemanticQueryCache.store({knobsHash}) writes the resolved hash. - hybridSearchCached threads knobsHash from resolveSearchMode through every cache call. Cache config (enabled/threshold/TTL) now reads from the resolved mode bundle, not directly from the config table. Tests (test/query-cache-knobs-hash.test.ts, 11 cases): - cacheRowId bifurcates by knobsHash - Tokenmax write does NOT contaminate conservative lookup - Three modes coexist as distinct rows for same query - Legacy NULL-knobs_hash rows are excluded from lookup - Same-mode write updates in place (no duplicate rows) All 58 cache + mode tests pass. Migration v56 applies cleanly on a fresh PGLite brain. Bisect-friendly: this commit is the cache-key hotfix alone. Mode resolution wiring lives in the previous commit.
…able
Migration v57 creates search_telemetry (date, mode, intent, count,
sum_results, sum_tokens, sum_budget_dropped, cache_hit, cache_miss,
first_seen, last_seen). PK (date, mode, intent) caps growth at ~4380
rows/year. Sums + counts only — averages derive at read time so
concurrent ON CONFLICT writes from multiple gbrain processes accumulate
correctly [CDX-17].
In-memory bucket flushed periodically (60s OR 100 calls) + on process
beforeExit/SIGINT/SIGTERM with a 2-second cap. The search hot path NEVER
waits on this write [D2, CDX-19].
Date-bucketed cache_hit / cache_miss columns make hit rate over --days N
derivable [CDX-18]. query_cache.hit_count is a lifetime counter and
can't be sliced by window.
Wired into bare hybridSearch via emitMeta: every search call sync-bumps
a bucket. flush() drains atomically by swapping the map before SQL writes
so a record() during flush lands in the new map.
readSearchStats(engine, {days}) returns the StatsWindow shape that
gbrain search stats consumes (next commit).
Tests: 16 unit cases pin record/flush/read semantics including
ON-CONFLICT-adds-raw-values, concurrent-flush coalescing, cache hit-rate
math, missing-table graceful degradation, and window clamping.
53 migrations apply on a fresh PGLite brain.
…+8+9]
CDX-8: gbrain config has no unset path today. Required before
`gbrain search modes --reset` can clear search.* overrides.
- BrainEngine.unsetConfig(key) → returns rows deleted (0|1)
- BrainEngine.listConfigKeys(prefix) → exact-literal prefix match
with LIKE-escape on user-supplied % / _ / \ characters
- PGLiteEngine + PostgresEngine implementations
- `gbrain config unset <key>` and `gbrain config unset --pattern <prefix>`
sub-subcommands
CDX-9: readLine has no EOF detection or timeout. Mode-picker plan calls
out "TTY closes mid-prompt → defaults to balanced" but the raw helper
hangs forever. New readLineSafe(prompt, defaultValue, timeoutMs=60s):
- Returns defaultValue on stdin 'end' event
- Returns defaultValue on timeout
- Returns defaultValue on empty Enter
- Non-TTY stdin returns defaultValue immediately (e2e safe)
- Returns trimmed user input otherwise
Exported so install picker (next task) can use it.
Tests: 9 cases pin unset semantics + prefix matcher edge cases
(glob-wildcard escape, sort order, idempotent loop, search.* sweep).
All 53 migrations apply on a fresh PGLite brain.
Install picker (src/commands/init-mode-picker.ts):
- Runs as a phase inside `gbrain init` AFTER engine.initSchema() so DB
config writes work [CDX-7].
- Idempotent: skipped on re-init if search.mode is already set.
- Smart auto-suggestion via recommendModeFor() reads
models.tier.subagent / models.default / OPENAI_API_KEY:
* Opus default/subagent → tokenmax (quality ceiling)
* Haiku subagent → conservative (4K budget keeps cost down)
* No OpenAI key → conservative (no LLM expansion possible)
* Sonnet / unknown → balanced (safe default)
- TTY shows menu via readLineSafe (60s timeout, defaults on EOF/empty).
- Non-TTY auto-selects + emits operator hint:
[gbrain] search mode: X (auto-selected — reason)
[gbrain] To change: gbrain config set search.mode <...>
- --json mode emits structured `{phase: 'search_mode_picker', ...}` event.
- Wired into both initPGLite and initPostgres flows.
Upgrade banner (src/commands/upgrade.ts):
- One-shot stderr banner in runPostUpgrade.
- State persisted via config key `search.mode_upgrade_notice_shown=true`
— fires at most once per install.
- Copy corrected per [CDX-1+2+3]: production query op STILL defaults
expand=true and limit=20. The banner reframes from "behavior is
regressing" to "named modes available + here's how to preserve
exact current shape."
Tests (test/init-mode-picker.test.ts, 16 cases):
- recommendModeFor heuristic for all 4 input shapes
- parseModeInput accepts numeric/named/case-insensitive, rejects garbage
- runModePicker non-TTY auto-selects + writes config
- Idempotent + --force re-prompt + JSON output
- Opus → tokenmax, Haiku → conservative real wiring through engine
Three sub-subcommands mirroring the gbrain models (v0.31.12) shape:
gbrain search modes [--json]
Read-only routing dashboard. Shows the three mode bundles, the active
mode, and the source of every resolved knob:
cache_enabled = true [override: search.cache.enabled]
tokenBudget = 4000 [mode: conservative]
Plus knob descriptions for legibility.
gbrain search modes --reset [--source <mode>]
Clears every search.* override (NOT search.mode itself). Preserves
the upgrade-notice state key. --source <mode> is a dry-run that
lists what --reset would change without writing — the paved path
[CDX-8] flagged as missing.
gbrain search stats [--days N] [--json]
Observability. Reads the search_telemetry rollup over the window
(clamps to [1, 365]). Prints cache hit rate, mode mix, intent mix,
budget drops, avg results/tokens. JSON output includes
_meta.metric_glossary block per [CDX-25].
gbrain search tune [--apply] [--json]
Recommendation engine. 5 rules cover the bug class:
- Insufficient data → "no_recommendations" status
- Conservative + high budget-drop rate → suggest balanced
- High cache hit rate (>85%) → suggest similarity threshold bump
- Tokenmax + Haiku subagent → suggest balanced (cost mismatch)
- Cache disabled but stats show usage → suggest re-enabling
--apply mutates config via setConfig / unsetConfig with a paste-ready
revert command printed at the end.
Registered in src/cli.ts dispatch table. 17 unit cases pin:
- Dashboard report shape + per-knob source attribution
- --reset preserves search.mode + notice key
- --source dry-run never writes
- stats reads telemetry rollup; --days clamps
- tune recommendation rules fire on real telemetry data
- --apply mutates config
- --help + unknown subcommand exit codes
… guard
Single source of truth at src/core/eval/metric-glossary.ts. Every entry
carries 3 fields:
- industry_term (canonical IR/NLP literature name, preserved verbatim)
- eli10 (plain-English a 16-year-old can follow)
- range (numeric range + interpretation)
Covers 4 metric families:
- Retrieval: P@k, R@k, MRR, nDCG@k
- Stability: Jaccard@k, top-1 stability
- Statistical: p-value (paired bootstrap + Bonferroni), 95% CI
- Operational: cache hit rate, avg results/tokens, cost per query, p99 latency
Public surface:
- getMetricGloss(metric) → full entry or null
- eli10For(metric) → plain-English string or null
- buildMetricGlossaryMeta(metrics[]) → {metric → eli10} record for
JSON `_meta.metric_glossary` blocks per [CDX-25]. ONE block per
response, NOT sibling `_gloss` fields on every metric.
- renderMetricGlossaryMarkdown() → deterministic Markdown for the doc
Auto-generation:
scripts/generate-metric-glossary.ts emits docs/eval/METRIC_GLOSSARY.md.
Deterministic (same input → same bytes) so the CI guard can diff.
CI guard:
scripts/check-eval-glossary-fresh.sh regenerates into a temp file and
diffs against the committed doc. Out-of-date doc fails the build.
Wired into `bun run verify` (and therefore `bun run test:full`).
Tests (test/metric-glossary.test.ts, 18 cases):
- Every documented metric is present
- Every entry has all 3 required fields
- Accessors return null on unknown metrics (no throw)
- buildMetricGlossaryMeta silently drops unknown metrics
- renderer output is deterministic across calls
- Renderer groups metrics into 4 sections
docs/eval/METRIC_GLOSSARY.md: 5491 bytes, 124 lines, fresh.
src/core/eval/drift-watch.ts — curated retrieval watch-list [CDX-6].
Five patterns covering the surface that actually affects retrieval quality:
- src/core/search/ (search pipeline)
- src/core/embedding.ts (embedding shape)
- src/core/chunkers/ (chunk granularity)
- src/core/ai/recipes/anthropic.ts + openai.ts (expansion + embed routing)
- src/core/operations.ts (the query op definition)
Adding to the list is a deliberate act — requires a CHANGELOG line so
coverage grows on purpose, not by accident. Pure functions:
- matchesWatchPattern(path) — trailing-slash = prefix, bare = equality
- filesDriftedSince(repoRoot, sha?) — git diff --name-only wrapper
- watchedFilesDrifted(repoRoot, sha?) — composite
src/commands/doctor.ts — two new checks.
checkSearchMode [CDX-20]: status stays 'ok' (never warns, never docks
health score). Hint in message field. Three branches:
- unset → "search.mode is unset (using balanced fallback). Run
`gbrain search modes` to see what is running and pick a mode."
- mode + no overrides → "Mode: X (no per-key overrides — mode bundle
is canonical)."
- mode + overrides → "Mode: X with N per-key override(s) (k1, k2, …).
To consolidate to the pure mode bundle: gbrain search modes --reset"
Upgrade-notice state key (search.mode_upgrade_notice_shown) is excluded
from the override roster — it's not a knob.
checkEvalDrift [CDX-6]: surfaces uncommitted changes to retrieval-watched
files. Always 'ok'; operator-facing reminder. Names up to 3 drifted files
in the message + paste-ready re-eval command.
Both helpers exported (was: file-private) so tests can pin behavior
without walking the full runDoctor pipeline.
Tests: 12 drift-watch cases + 7 doctor-check cases. Pin watch-list shape,
prefix-vs-equality matcher semantics, missing-repo graceful failure, and
all three search_mode branches.
Per-mode --mode flag plumbed into:
- gbrain eval longmemeval --mode <conservative|balanced|tokenmax>
Sets search.mode in the benchmark brain's config table; config is
in PRESERVE_TABLES so resetTables doesn't wipe it between questions.
Mode surfaces in the per-question NDJSON row.
- gbrain eval replay --mode <m> + --compare-limit N
--compare-limit forces a constant K across modes [CDX-13]; without
it, Jaccard@k against the captured baseline measures K-drift, not
quality. Mode is set once before the replay loop.
- NOT cross-modal per [CDX-11]: cross-modal scores OUTPUT against
TASK; it doesn't retrieve. Adding --mode there is theater.
New: gbrain eval run-all orchestrator (src/commands/eval-run-all.ts):
- Sweeps every requested mode × suite combination
- Sequential default per D9; --parallel N opt-in (clamped to mode count)
- Cost guard with split caps [CDX-15+16]:
--budget-usd-retrieval N (default $5)
--budget-usd-answer N (default $20)
Non-TTY refuses with exit 2 unless --yes AND explicit --budget-usd-*
flags pass. TTY refuses without --yes (defense against agent loops).
- estimateRunCost computes per-(suite,mode) breakdown including the
expansion-Haiku surcharge for tokenmax.
- Audit trail: appends to <repo>/.gbrain-evals/eval-results.jsonl
[CDX-23]. Personal brain (~/.gbrain) NEVER touched.
- v0.32.3 ships orchestrator + argv + guard + persist hook.
In-process per-suite invocation is a v0.32.4 follow-up (operator
runs the per-suite CLIs with the documented --mode flag for now;
each completion calls persistRunRecord to log).
New: gbrain eval compare report (src/commands/eval-compare.ts):
- Reads eval-results.jsonl, groups by (suite, mode), renders MD or JSON
- Most-recent (suite, mode, commit) wins when duplicates exist
- JSON output has schema_version=2 + _meta.metric_glossary block per
[CDX-25] (ONE block per response, not sibling _gloss fields)
- _meta.methodology field names the paired-bootstrap + Bonferroni
discipline per [CDX-14] so haters can reproduce
- Missing file → friendly hint pointing at `gbrain eval run-all`
Wired into eval dispatch table in src/commands/eval.ts.
Metric glossary fuzzy fallback: `recall@10` → `recall@k` lookup
(the glossary documents the family; report rows carry specific K
values). Routes through getMetricGloss for every call site.
Tests (42 cases total — all green):
- eval-run-all.test.ts (19): argv parser, cost estimate, guard
semantics for all 4 (over/under × tty/non-tty) shapes, persist hook
NDJSON shape.
- eval-compare.test.ts (5): JSON + MD output shapes, glossary
integration, missing-file graceful, mode filter, most-recent-wins.
- metric-glossary.test.ts (18): unchanged but updated assertions to
cover the fuzzy `@N` → `@k` fallback.
Pre-existing eval-replay / eval-longmemeval / eval-export / eval-prune
tests (42 cases) still pass — --mode + --compare-limit are additive.
docs/eval/SEARCH_MODE_METHODOLOGY.md — haters-immune 8-section template. Documents what the eval measures + does NOT measure, datasets + sizes (LongMemEval n=500, Replay n=200, BrainBench n=1240 docs / 350 qrels), random seed 42, run procedure verbatim, threats to validity (LongMemEval English+technical skew, char/4 heuristic ~5-10% off, expansion ~97.6% relative lift on this corpus), per-question raw outputs, pre-registered expectations (tokenmax wins R@10 by 5-15pp, conservative wins cost by 5-15x, balanced lands within 3pp), re-run cadence anchored to the src/core/eval/drift-watch.ts watch-list. Statistical-significance section pins paired bootstrap with 10,000 resamples + Bonferroni correction across 3 modes × 4 metrics [CDX-14]. CLAUDE.md gets two new sections: ## Search Mode (3-mode table + resolution chain + [CDX-4] cache contamination fix note + CLI commands) and ## Eval discipline (single-source-of-truth glossary, methodology doc, eval_results in repo NOT personal brain per [CDX-23]). README.md Quick Start gets a paragraph naming the install picker, mode heuristic, and the methodology link. skills/conventions/search-modes.md NEW — convention file consumed by brain-ops + query + signal-detector skills via the existing `> **Convention:**` callout pattern. Routes "what mode" / "tune retrieval" / "compare modes" queries to the right CLI surface. skills/RESOLVER.md gets two new trigger rows pointing at gbrain search * and gbrain eval compare.
bun run build:llms — picks up the new CLAUDE.md sections (Search Mode + Eval discipline) and the docs/eval/SEARCH_MODE_METHODOLOGY.md addition. build-llms.test.ts gate now passes.
… flow The v0.32.3 search_mode + eval_drift helpers were inserted into the DB-checks sub-helper at runDbChecks (line 345-355), but runDoctor itself maintains its own check list and only calls the helpers' subset. Push the two checks into the main runDoctor path (after the existing sync_freshness check at line 2347) so they actually appear in `gbrain doctor --json` output. Both checks gated on engine !== null. Progress reporter heartbeat fires for each. Both still return status 'ok' per [CDX-20] so health score is preserved. Verified end-to-end on a real Postgres brain: gbrain doctor --json now includes 'search_mode' and 'eval_drift' in the checks array.
Two root causes for the hang, both fixed.
1. DATABASE_URL leak in claw-test scripted harness
The harness inherits the parent process's env via `...process.env`
for every phase child (init / import / query / extract / doctor).
When the e2e runner sets DATABASE_URL (for OTHER e2e tests), it
leaks into claw-test's children. `loadConfig` at src/core/config.ts:143
then flips inferredEngine to 'postgres' for every subsequent phase,
breaking the hermetic-PGLite-tempdir contract: phases race against
each other on a shared test Postgres while pointing at different
brain states.
Fix: strip DATABASE_URL + GBRAIN_DATABASE_URL from the child env
before forwarding. Re-apply GBRAIN_HOME / GBRAIN_FRICTION_RUN_ID
after the merge so a parent's override can't win. The harness is
PGLite-only by design.
2. Telemetry beforeExit deadlock
v0.32.3's recordSearchTelemetry installed a `process.on('beforeExit',
drainOnExit)` hook that wrapped the flush in `Promise.race([flush(),
setTimeout(2000)])`. beforeExit fires when the event loop empties,
but the hook enqueued NEW async work (the race's setTimeout +
pending flush), so the event loop never re-emptied. Short-lived
CLI invocations (`gbrain query "the"` finishing in ~100ms) ended
up waiting on the DB write indefinitely.
The claw-test harness spawns several short-lived gbrain queries.
Each one hung after its real work finished. The harness then waited
forever on its child subprocess's exit code.
Fix: drop the beforeExit + SIGINT + SIGTERM hooks. Per [CDX-19]'s
"stats are directional, not exact" contract, losing one unflushed
bucket on process exit is acceptable. The unref'd setInterval
handles long-running processes (HTTP MCP, autopilot, jobs work).
Short-lived CLI invocations exit immediately.
Verified:
- `gbrain query "the"` on a fresh PGLite brain exits in <1s (was
hanging forever).
- `bun test test/e2e/claw-test.test.ts` → 3 pass / 0 fail / 3.86s
(was hanging at the banner indefinitely).
- 85/85 e2e files / 574/574 tests pass including claw-test, with
DATABASE_URL set (the configuration that originally repro'd the
hang).
- 6235/6235 unit tests pass.
- Typecheck clean.
The two bugs interacted: the DATABASE_URL leak meant queries hit the
real Postgres (slow), making the beforeExit deadlock visible. Fixing
either alone would have masked the other. Both fixed in this commit.
…docs The install picker already asks explicitly (1/2/3 menu, default to the recommendation on Enter). What was missing: a way to reason about the cost tradeoff. Without numbers, "tokenmax" looks free and "conservative" sounds restrictive; with numbers, the operator picks intentionally. Cost anchors added everywhere the user encounters the mode choice: - Install picker MENU_TEXT (gbrain init) - Upgrade banner (gbrain upgrade post-upgrade) - CLAUDE.md ## Search Mode section - README.md Quick Start - docs/eval/SEARCH_MODE_METHODOLOGY.md (with the math) Anchors at Sonnet 4.6 downstream ($3/M input): conservative ~$0.012/query ~$12/mo @ 1K ~$1,200/mo @ 100K balanced ~$0.030/query ~$30/mo @ 1K ~$3,000/mo @ 100K tokenmax ~$0.060/query ~$60/mo @ 1K ~$6,000/mo @ 100K Plus tokenmax's Haiku expansion overhead: ~$1.50 per 1K queries on top. Cache hits roughly halve these on a brain with repeat-query traffic. The math is documented in SEARCH_MODE_METHODOLOGY.md so a reviewer can audit each variable (T = ~400 tokens/chunk from the recursive chunker's 300-word target; N = `searchLimit` cap; R = downstream model rate from src/core/anthropic-pricing.ts). Drift away from these numbers requires updating CLAUDE.md + the picker + the methodology doc in lockstep — a regression test pins the picker's anchor strings to enforce this. The framing also names the cost rule honestly: the dominant cost isn't gbrain (semantic cache is free; Haiku expansion is rounding-error). It's the downstream agent reading retrieved chunks back into its context. Operators who don't realize this pick badly. Tests: 5 new regression cases in init-mode-picker.test.ts pin every cost string in MENU_TEXT. Total 21/21 picker tests pass; 6240/6240 unit tests pass; verify gate green.
The per-query cost framing in the picker (~$0.012/$0.030/$0.060) is honest but theoretical — it treats each search as an isolated billable event. Real agent loops amortize a lot of context across turns via Anthropic prompt caching, so the per-query 5x ratio doesn't translate 1:1 into total agent spend. Added a "Realistic-scale anchor" section to SEARCH_MODE_METHODOLOGY.md representing one heavy power-user agent loop running tokenmax: - ~860 turns/mo (~29/day, one active agent) - ~900K tokens/turn (system + tools + history + reasoning + search) - ~$0.85/turn → ~$700/mo total agent spend at tokenmax - ~88% Anthropic prompt-cache hit rate Scaling balanced + conservative DOWN from that anchor: - tokenmax → ~$700/mo, search ~22% of total spend - balanced → ~$620/mo, search ~12% (saves ~$78/mo vs tokenmax) - conservative → ~$575/mo, search ~5% (saves ~$124/mo vs tokenmax) Honest takeaway: at realistic agent-loop scale WITH disciplined prompt caching, mode choice saves 10-20% of total agent spend, not 5x. The per-query math kicks back in for setups WITHOUT cache discipline (churn the prompt prefix every turn → search payload becomes a larger fraction). Both framings live in the doc. CLAUDE.md ## Search Mode gets a forward-pointer paragraph naming the "per-query math vs real-world spend" delta so agents reading the section find the methodology footnote. Numbers in the doc are anonymized + scaled away from any specific deployment. No model names, no specific dollar figures from a real production setup — just the per-turn / cache-hit-rate / search-count shape ratios that a thoughtful operator can validate against their own billing dashboard.
Previous version showed mode costs assuming Sonnet-only downstream.
That muted the spread to 5x and made mode choice look minor. Reality:
the downstream model tier is the BIGGER cost lever — pairing mode with
model is where the 25x spread lives.
New 3×3 matrix in the install picker, CLAUDE.md, methodology doc, README:
Haiku 4.5 Sonnet 4.6 Opus 4.7
($1/M input) ($3/M input) ($5/M input)
conservative $400/mo $1,200/mo $2,000/mo
balanced $1,000/mo $3,000/mo $5,000/mo
tokenmax $2,000/mo $6,000/mo $10,000/mo
(per-query cost @ 100K queries/mo, full search payload, no cache savings)
The methodology doc gets a new "Mode × Model matrix" section above the
realistic-scale anchor with concrete right-sizing guidance:
- tokenmax + Haiku: wrong direction. Haiku can't filter 50 chunks → noise
not signal. Pay Haiku rates, get sub-Haiku quality.
- conservative + Opus: wasted Opus. 200K context window starved on
retrieval depth. Pay Opus rates, get conservative-shape retrieval.
- Natural pairings span ~4x; the matrix corners span 25x. The natural
diagonal is where most users should land.
Realistic-scale anchor refreshed:
- tokenmax + Opus: ~$700/mo at 860 turns
- balanced + Sonnet: ~$430/mo
- conservative + Haiku: ~$170/mo
Plus a "mismatched pairings" section showing the math for tokenmax+Haiku
and conservative+Opus — both burn budget for no improvement.
Regression test updated: pins the 25x framing + the four anchor cells
(two corners + two diagonal mids) + the three downstream model rates.
22/22 picker tests pass. 6241/6241 unit tests pass. CI guards green.
… single user)
Most users running gbrain are single-user installs at ~10K queries/month,
not the 100K fleet-scale used in the original matrix. The picker numbers
($400 to $10,000/mo) looked alien to the actual audience. Rescaled to
10K with an explicit linear-scaling callout.
New matrix in picker, CLAUDE.md, README, methodology doc:
Haiku 4.5 Sonnet 4.6 Opus 4.7
($1/M) ($3/M) ($5/M)
conservative $40/mo $120/mo $200/mo
balanced $100/mo $300/mo $500/mo
tokenmax $200/mo $600/mo $1,000/mo
Still 25x corner-to-corner. Still 4x natural-diagonal spread. But now in
numbers a single user picks up and reasons about: "balanced + Sonnet at
$300/mo, that's fine" or "tokenmax + Opus at $1,000/mo, that's a
deliberate choice for max-quality high-stakes work."
Every surface updated:
- Install picker MENU_TEXT (with "scales linearly — multiply by 10
for 100K/mo" footnote so heavier users still see their number)
- CLAUDE.md ## Search Mode table + scaling prose
- README Quick Start
- methodology doc Mode × Model matrix section
- upgrade banner (post-upgrade notice)
Regression test updated: pins the 3 new anchor cells ($40, $300, $1,000)
+ the 10K/mo volume frame + the linear-scaling callout. 23/23 picker
tests pass, 6241/6241 unit tests pass, verify gate green.
Methodology doc's existing 1K/10K/100K Monthly cost breakdown tables
left intact (they already show the linear scaling explicitly).
…NT] directive
DX gap: an agent installing gbrain (OpenClaw, Hermes, Codex, Cursor) ran
gbrain init non-TTY, saw 2 stderr lines flash by, and silently auto-applied
a default search mode. The operator never saw the cost matrix or the choice.
At 25x corner-to-corner cost spread, that's surprise-spend territory.
Five surfaces fixed:
1. **Auto-suggest default flipped balanced → tokenmax.** The Sonnet/unknown
fallback now recommends tokenmax (preserves v0.31.x retrieval shape:
expand=on, generous result set). Haiku subagent → conservative still
wins (cost-sensitive signal). No-OpenAI-key → conservative still wins
(vector search not possible). Heuristic reordered: Haiku check now
fires BEFORE the Opus check, because a Haiku subagent loop signalling
cost sensitivity should win over a default-model heuristic.
2. **gbrain init non-TTY output rebuilt.** Previously: 2 stderr lines.
Now: the full 3×3 cost matrix + an explicit [AGENT] directive block
telling the agent to relay the matrix to its operator before
continuing. Includes a pointer to INSTALL_FOR_AGENTS.md Step 3.5 for
the full protocol.
3. **gbrain upgrade banner same treatment.** Existing v0.32.3 banner now
includes [AGENT] directive at the top so upgrading agents relay the
matrix to their operator instead of silently accepting v0.31.x →
v0.32.x default-applied behavior.
4. **INSTALL_FOR_AGENTS.md Step 3.5 NEW** with the matrix verbatim, the
exact paraphrasable ask-the-user wording, and the gbrain config set
commands to run after the operator picks. Plus a paragraph in the
Upgrade section pointing back at Step 3.5.
5. **AGENTS.md install checklist** gets a new Step 4 ("STOP — ask the
user about search mode") between init and the rest of the flow. The
agent's job description now explicitly says: silent acceptance is
the wrong default.
Tests (24/24 pass):
- Updated recommendModeFor heuristic order (Haiku floor > Opus default)
- New regression test: non-TTY output contains the matrix corners +
[AGENT] directive + INSTALL_FOR_AGENTS.md pointer
- withEnv() helper used for OPENAI_API_KEY mutation (test-isolation lint)
- Default-recommendation tests updated: Sonnet / unknown → tokenmax
Privacy + test-isolation gates clean. 6256/6256 unit tests pass.
brandonlipman
added a commit
to brandonlipman/gbrain
that referenced
this pull request
May 29, 2026
* upstream/master: v0.35.1.0: embedder shootout prereqs (pricing + gateway export + --resume-from) (garrytan#1055) v0.35.0.0 feat: ZeroEntropy zembed-1 + zerank-2 reranker (garrytan#1008) v0.34.4.0 fix(embed): cursor-paginated --stale hardening wave (D2/D3/D4/D6/D7/D8 + regression test) (garrytan#991) v0.34.3.0 fix: supervisor treats code=0 watchdog exits as crashes (garrytan#1003) v0.34.2.0 fix(import): path-based checkpoint resume — kills parallel-drop + failed-file-skip + sort-flip bugs (garrytan#988) v0.34.1.0 fix(mcp): MCP fix wave — source-isolation P0 + PKCE DCR + federated_read + 3 more (garrytan#996) v0.34.0.0 feat: Cathedral III — recursive code intelligence + Leiden clusters + eval gate (garrytan#994) v0.33.3.0 feat(v0.33.3): code intelligence MCP foundation (v0.34 W0a-c + W3) (garrytan#934) v0.33.2.1 docs: fork-PR workflow for garrytan-agents (garrytan#992) fix(sync): raise maxBuffer to 100 MiB to prevent silent ENOBUFS crash (garrytan#982) v0.33.2.0 feat(search-lite): token budget + semantic query cache + intent weighting (garrytan#897) v0.33.1.1 fix: Voyage output_dimension + flexible-dim guard + OOM-cap rethrow (garrytan#962)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Three additive search features inspired by brain-kit. All backward-compatible: existing callers see identical behavior unless they opt in to the new options.
1. Token Budget Enforcement (
src/core/search/token-budget.ts)Cap the cumulative token cost of returned results so search payloads fit downstream context windows. Greedy top-down walk; preserves caller ordering; no re-rank. char/4 heuristic for token counting (no tokenizer dependency — keeps the bun
--compilebundle small).SearchOpts.tokenBudget— numeric cap. Defaultundefined= no-op.HybridSearchMeta.token_budget = { budget, used, kept, dropped }token_budgetparam.2. Semantic Query Cache (
src/core/search/query-cache.ts+ migration v52)Cache search results keyed by query embedding similarity. HNSW lookup:
embedding <=> $1 < 0.08(cosine similarity ≥ 0.92). Per-source isolation. Per-row TTL (default 3600s). Best-effort writes; all errors swallowed so the cache never breaks the search hot path.query_cachetable. HALFVEC where pgvector ≥ 0.7; VECTOR otherwise. Embedding dim resolved fromconfig.embedding_dimensionsso non-OpenAI brains work.gbrain cacheCLI:stats/clear --yes/prune.search.cache.enabled/search.cache.similarity_threshold/search.cache.ttl_seconds.HybridSearchMeta.cache = { status, similarity?, age_seconds? }hybridSearchCached(engine, query, opts)wrapper. The query op inoperations.tsnow uses this wrapper so MCP/CLI calls benefit automatically. Skipped for two-pass walks + non-default embedding columns where cache semantics don't hold.3. Zero-LLM Intent Weighting (
src/core/search/intent-weights.ts)Builds on the existing 4-intent classifier (
src/core/search/query-intent.ts). New weight-adjustment layer applies subtle per-intent nudges:recency='on'when caller left it unsetAll adjustments are subtle (max 1.25×). Caller-explicit options always win — intent weighting never silently overrides explicit
recency/salience. Default ON; opt out viaopts.intentWeighting = false. The LLM query expansion path is still available and opt-in viaopts.expansion = true— it just isn't the default anymore.HybridSearchMeta.intentnow surfaces classifier output for debugging.Files changed (13)
Test results
test/token-budget.test.ts— 10 tests (pure module)test/intent-weights.test.ts— 13 tests (pure module)test/query-cache.test.ts— 12 tests (PGLite, including migration v52 schema verification + TTL expiration + source isolation)test/hybrid-search-lite.serial.test.ts— 9 tests (PGLite e2e throughhybridSearchCached)test/search.test.ts,test/query-intent*.test.ts,test/hybrid-meta.test.ts,test/search-limit.test.ts,test/search-lang-symbol-kind.test.ts,test/search-image-column.test.ts)bun run verifyclean (privacy + jsonb + progress + test-isolation + wasm + admin-build + admin-scope-drift + cli-exec + system-of-record + tsc --noEmit)bun run src/cli.ts doctor --fastruns successfully — pre-existingresolver_health/minions_migrationfailures on master are unrelated to this PR.Design notes
bun --compilebundle). Migration uses the same HALFVEC/VECTOR pgvector-version probe pattern as v45 (facts) and v40, so it works on both Postgres and PGLite without extra config.query_cachetable is a derived cache, not authoritative state.gbrain cache clearis the documented invalidation surface; the system-of-record check passes.Need help on this PR? Tag
@codesmithwith what you need.