v0.40.7.1 feat: llama-server reranker — local Qwen3 / self-hosted ZE via llama.cpp#1326
kohai-ut wants to merge 4 commits into
…via llama.cpp

Adds local reranker support so users can point gbrain's reranker call at their own llama.cpp server instead of ZeroEntropy's hosted API. One new recipe (`llama-server-reranker`), a `path?: string` + `default_timeout_ms?: number` extension on `RerankerTouchpoint`, env passthrough wiring, budget-tracker `FREE_LOCAL_RERANK_PROVIDERS` set so `--max-cost` callers don't TX2 hard-fail on local rerank, and a doctor-probe divergence fix (probe and live search now read the same `search.reranker.model` path via `loadSearchModeConfig` + `resolveSearchMode`).

ZE-hosted users are unchanged. Voyage / Cohere / vLLM rerankers stay out of scope: different wire shapes need adapter hooks designed against their actual shapes in a follow-up plan.

Verification:
- `bun run verify` (typecheck + 13 pre-checks): clean
- `bun run check:all` (15 historical checks): clean
- 107/107 expect() calls pass across 5 affected test files
- /codex review against the full diff: GATE PASS (caught one [P2] /v1 path doubling bug pre-merge; fixed by changing recipe path to leaf `/rerank`)
- Claude adversarial subagent: 7 net-new findings filed as v0.40.7+ TODOs (none currently exploitable; hardening for future contributor traps)

Test surface (107 cases, 5 files):
- test/ai/rerank.test.ts: path override (exact URL match), default_timeout_ms honored, empty models[] accepts any id, ZE regression
- test/ai/recipe-llama-server-reranker.test.ts: recipe shape regression guard + base_url + path concat assertion (codex-caught /v1/v1/ regression)
- test/search-mode.test.ts: timeout precedence chain (per-call > config > recipe > bundle), ZE no-recipe-default regression, unknown provider fallthrough
- test/models-doctor-reranker.test.ts: divergence-fix helper across DB-plane read, mode default, disabled, override, DB-error graceful fallback
- test/core/budget/budget-tracker.test.ts: free-local rerank pricing + arbitrary model id + chat-kind TX2 hard-fail preserved

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…r-reranker)

The hand-curated llms-config.ts doc map never included docs/ai-providers/, so both zeroentropy.md (since v0.35.0.0) and the new llama-server-reranker.md were invisible to the AI-facing llms.txt / llms-full.txt index. Adds an "AI providers" section with both. Marked includeInFull: false (setup walkthroughs belong in the index but would push the single-fetch bundle past FULL_SIZE_BUDGET), the same treatment CHANGELOG.md gets.

Caught by the /ship document-release subagent.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…eranker

# Conflicts:
# CHANGELOG.md
# TODOS.md
# VERSION
# package.json
# src/core/search/mode.ts
Thanks for this contribution, and apologies for the slow triage.

We did a full pass over the entire PR backlog. gbrain has moved fast, and the maintainer's larger "cathedral" rewrites have superseded a big share of community PRs: the AI gateway + recipes + user_provided_models system replaced almost all individual provider PRs; #1805 fixed the whole Postgres module-singleton class; #1542 unified the type taxonomy; #1657 the retrieval path; #1802 the doctor; and so on.

We're closing this one in that cleanup: either the fix already landed on master, it duplicates another PR or merged change, or it's outside the current merge bar. Where a closed PR carried a genuinely valuable idea, we've recorded it in docs/designs/COMMUNITY_IDEAS.md so nothing good is lost (a few may graduate into TODOs).

Please don't read the close as a judgment of the work. Thank you for contributing. If you believe the underlying issue is still live on the latest master, reopen with a quick note and we'll take another look. 🙏
Summary
Adds local llama.cpp reranker support so users can route gbrain's reranker call to their own llama.cpp server instead of ZeroEntropy's hosted API. ZE-hosted users are unchanged. The change set is one new recipe, a path/timeout extension on `RerankerTouchpoint`, and plumbing around the edges so that cold-start timeouts and `--max-cost` budget caps work for local rerank.

This is for you if you already run llama.cpp for embeddings (the existing `llama-server` recipe), want to keep retrieval data inside your LAN, or want to use Qwen3-Reranker without paying a per-rerank fee. Self-hosting ZeroEntropy's open-weight `zerank-2` via llama.cpp uses the same recipe.

Setup is in `docs/ai-providers/llama-server-reranker.md`. The README + integrations doc surface it on the canonical provider lists.

Voyage / Cohere / vLLM rerankers stay out of scope: they have different wire shapes (different field names, different response shape) and need adapter hooks designed against their actual shapes in a follow-up plan.
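The path/timeout extension on `RerankerTouchpoint` can be pictured with a small TypeScript sketch. Only the field names `path` and `default_timeout_ms`, the recipe's `base_url_default` value, and the leaf path `/rerank` come from this PR; the interface name, the `joinRerankUrl` helper, and the 30000 ms timeout are hypothetical stand-ins, not gbrain's actual code.

```typescript
// Hypothetical shape: `path` and `default_timeout_ms` are the two optional
// fields this PR describes; the interface name itself is illustrative.
interface RerankerTouchpointSketch {
  models: string[];            // empty list: any model id is accepted
  path?: string;               // endpoint path appended to the base URL
  default_timeout_ms?: number; // recipe-level timeout default
}

// Hypothetical helper: join base URL and path with exactly one slash.
// It does not deduplicate segments, so a path that repeats the base URL's
// trailing segment still doubles it.
function joinRerankUrl(baseUrl: string, path: string): string {
  const base = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  const leaf = path.replace(/^\/+/, "");    // drop leading slashes
  return `${base}/${leaf}`;
}

const baseUrlDefault = "http://localhost:8081/v1"; // value quoted in the PR
const leafPath = "/rerank";                        // leaf path after the fix

const touchpoint: RerankerTouchpointSketch = {
  models: [],
  path: leafPath,
  default_timeout_ms: 30000, // illustrative value only, not from the PR
};

// Leaf path: the base URL already carries /v1, so it appears once.
const fixedUrl = joinRerankUrl(baseUrlDefault, leafPath);

// A path that also starts with /v1 doubles the segment.
const doubledUrl = joinRerankUrl(baseUrlDefault, "/v1/rerank");

console.log(touchpoint.path, fixedUrl, doubledUrl);
```

Keeping the version prefix in the base URL and only the leaf in the recipe path means a user who overrides the base URL does not also have to know which prefix the recipe expects.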
Key changes
- `llama-server-reranker` at `src/core/ai/recipes/llama-server-reranker.ts`. Distinct from the existing `llama-server` embedding recipe because llama.cpp's `--reranking` and `--embeddings` flags are mutually exclusive at server-launch time: one process per mode, two recipes for two base URLs (8081 vs 8080).
- `RerankerTouchpoint.path?: string` + `default_timeout_ms?: number` in `src/core/ai/types.ts`. Optional fields; absent on ZE's recipe so behavior there is unchanged.
- `src/core/search/mode.ts` reranker timeout precedence: per-call > `search.reranker.timeout_ms` config > recipe touchpoint default > mode-bundle default (5000). Closes the dead-default-timeout bug class where recipe-level timeouts never fired because hybridSearch always passed the bundle's value.
- `src/core/budget/budget-tracker.ts` `FREE_LOCAL_RERANK_PROVIDERS` set zero-prices `llama-server-reranker:*` for the rerank kind, so `--max-cost`-bounded callers no longer TX2 hard-fail when configured for local rerank. Chat kind on the same provider still triggers `no_pricing` (defense in depth).
- `src/cli.ts` `buildGatewayConfig` maps `LLAMA_SERVER_RERANKER_BASE_URL` to `provider_base_urls.llama-server-reranker`. Sibling of the existing `LLAMA_SERVER_BASE_URL` mapping.
- `src/commands/models.ts` reranker probes now resolve the model the same way live search does (via `loadSearchModeConfig` + `resolveSearchMode`). Closes a file-plane / DB-plane divergence where doctor said "not configured" while live search was actively reranking.

Test Coverage
107/107 expect() calls pass across 5 affected test files (126ms wall):
Coverage gate: 94% (Step 7 subagent verdict). 3 low-risk gaps documented and deferred (env→config glue with benign failure modes; covered by the recipe-id assertions).
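The timeout precedence chain that the search-mode tests exercise (per-call > `search.reranker.timeout_ms` config > recipe touchpoint default > mode-bundle default) can be sketched as a single nullish-coalescing chain. This is a minimal sketch under stated assumptions: the `TimeoutSources` type, its field names, and `resolveRerankTimeoutMs` are hypothetical, and only the ordering and the 5000 ms bundle default come from this PR.

```typescript
// Hypothetical container for the four timeout sources named in the PR.
interface TimeoutSources {
  perCallMs?: number;       // explicit value passed by the caller
  configMs?: number;        // search.reranker.timeout_ms
  recipeDefaultMs?: number; // touchpoint default_timeout_ms
  bundleDefaultMs: number;  // mode-bundle default (5000 in the PR)
}

// The first defined source wins, in precedence order. `??` skips only
// null/undefined, so an explicit 0 would still count as "set".
function resolveRerankTimeoutMs(s: TimeoutSources): number {
  return s.perCallMs ?? s.configMs ?? s.recipeDefaultMs ?? s.bundleDefaultMs;
}

const bundleOnly = resolveRerankTimeoutMs({ bundleDefaultMs: 5000 });

// Illustrative recipe default of 30000 ms (not a value from the PR).
const recipeWins = resolveRerankTimeoutMs({
  recipeDefaultMs: 30000,
  bundleDefaultMs: 5000,
});

const configWins = resolveRerankTimeoutMs({
  configMs: 10000,
  recipeDefaultMs: 30000,
  bundleDefaultMs: 5000,
});

// The bug class described in the PR: a caller that always forwards the
// bundle value as the per-call timeout shadows the recipe default.
const shadowed = resolveRerankTimeoutMs({
  perCallMs: 5000,
  recipeDefaultMs: 30000,
  bundleDefaultMs: 5000,
});

console.log(bundleOnly, recipeWins, configWins, shadowed);
```

In this sketch the recipe default can only take effect when the caller leaves the per-call value undefined unless it has a real override, which is the behavior the precedence fix is meant to restore.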
Pre-Landing Review
`/codex review` against the diff caught one real [P2] bug that 5 of my own tests AND plan-eng-review missed: `path: '/v1/rerank'` on the recipe + `base_url_default: 'http://localhost:8081/v1'` were concatenating to `…/v1/v1/rerank`. Headline feature would have shipped broken. Fixed in same wave: recipe path is now leaf `/rerank`, and the loose `endsWith('/v1/rerank')` test assertion is tightened to exact-URL `toBe('http://localhost:8081/v1/rerank')` so the regression can't recur.

Cross-model: Claude adversarial subagent (independent fresh-context pass after codex) found 7 net-new findings codex missed. None are currently exploitable; they harden the new surface against future contributor traps. All 7 filed as `v0.40.7+ llama-server-reranker follow-ups` in TODOS.md:

- … `_BASE_URL` env vars (pre-existing pattern; this PR adds one more env var to the gap; should be its own focused PR using the existing `src/core/ssrf-validate.ts` helpers).
- … `FREE_LOCAL_RERANK_PROVIDERS` invariant (theoretical bypass requires a future caller skipping `gateway.rerank()`'s `assertTouchpoint` check).
- … `/v1/v1/` mistake; gate at configure time instead of relying on test assertions).
- … `search.reranker.model`; narrow the `resolveLiveRerankerModel` catch-all; validate `modelStr` shape before allocating probe timeout.

Plan Completion
20/23 plan items DONE, 3 CHANGED (test files landed at slightly different paths than the plan specified: `test/search-mode.test.ts` vs plan's `test/search/mode.test.ts` etc., because that matches gbrain's existing test layout), 0 NOT DONE.

Full plan + decision audit: `~/.claude/plans/write-up-a-plan-silly-fairy.md`.

TODOS
`v0.40.7.1 llama-server-reranker follow-ups (v0.40.7+)` in TODOS.md (the adversarial findings above).

Documentation
- `docs/ai-providers/llama-server-reranker.md` (161 lines): build llama.cpp, launch with `--alias` + `--reranking`, gbrain config commands with correct keys (`provider_base_urls.llama-server-reranker`, `search.reranker.model`), verification via `gbrain models doctor`, cold-start timeout note, budget-cap interaction.
- `README.md`: added `Rerankers` bullet naming hosted ZE default + new `llama-server-reranker` recipe.
- `docs/integrations/embedding-providers.md`: expanded the Voyage-only `Reranking pair` decision-tree bullet, added `Local reranking (no API spend)` bullet.
- `CLAUDE.md`: new recipe annotation.
- `llms-full.txt` regenerated.
- `CHANGELOG.md` v0.40.7.1 entry. (Pushed as separate `docs: post-ship documentation sync` commit.)

Verification
Done locally (16GB LXC — can't run the full 600-file parallel suite due to PGLite WASM heap accumulation; the CI's beefier hardware will do that):
- `bun run verify` (typecheck + 13 pre-checks): clean
- `bun run check:all` (15 historical checks: privacy, jsonb, source-id-projection, progress-to-stdout, no-legacy-getconnection, test-isolation across 600 files, trailing-newline across 1009 files, wasm-embedded, exports-count, admin-build, admin-scope-drift, cli-executable, skill-brain-first): clean
- `bun test test/ai/rerank.test.ts test/ai/recipe-llama-server-reranker.test.ts test/search-mode.test.ts test/models-doctor-reranker.test.ts test/core/budget/budget-tracker.test.ts`: 107/107 pass
- `/codex review --base master`: GATE PASS, 1 [P2] caught + fixed in-wave
- `/plan-eng-review` against the plan: CLEAR; outside-voice codex caught 10 plan-stage misses, all absorbed before code was written

Real-world end-to-end (recommended for reviewer)
Test plan
`base_url + path === '…/v1/rerank'` (not `/v1/v1/rerank`)

🤖 Generated with Claude Code