v0.40.7.1 feat: llama-server reranker — local Qwen3 / self-hosted ZE via llama.cpp by kohai-ut · Pull Request #1326 · garrytan/gbrain

kohai-ut · 2026-05-23T17:51:20Z

Summary

Adds local llama.cpp reranker support so users can route gbrain's reranker call to their own llama.cpp server instead of ZeroEntropy's hosted API. ZE-hosted users are unchanged. The change set is one new recipe, a path/timeout extension on RerankerTouchpoint, and supporting plumbing so that cold-start timeouts and --max-cost budget caps work for local rerank.

This is for you if you already run llama.cpp for embeddings (the existing llama-server recipe), want to keep retrieval data inside your LAN, or want to use Qwen3-Reranker without paying a per-rerank fee. Self-hosting ZeroEntropy's open-weight zerank-2 via llama.cpp uses the same recipe.

Setup is in docs/ai-providers/llama-server-reranker.md. The README + integrations doc surface it on the canonical provider lists.

Voyage / Cohere / vLLM rerankers stay out of scope — they have different wire shapes (different field names, different response shape) and need adapter hooks designed against their actual shapes in a follow-up plan.

Key changes

New recipe llama-server-reranker at src/core/ai/recipes/llama-server-reranker.ts. Distinct from the existing llama-server embedding recipe because llama.cpp's --reranking and --embeddings flags are mutually exclusive at server-launch time — one process per mode, two recipes for two base URLs (8081 vs 8080).
RerankerTouchpoint.path?: string + default_timeout_ms?: number in src/core/ai/types.ts. Optional fields; absent on ZE's recipe so behavior there is unchanged.
src/core/search/mode.ts reranker timeout precedence: per-call > search.reranker.timeout_ms config > recipe touchpoint default > mode-bundle default (5000). Closes the dead-default-timeout bug class where recipe-level timeouts never fired because hybridSearch always passed the bundle's value.
src/core/budget/budget-tracker.ts FREE_LOCAL_RERANK_PROVIDERS set zero-prices llama-server-reranker:* for the rerank kind, so --max-cost-bounded callers no longer TX2 hard-fail when configured for local rerank. Chat kind on the same provider still triggers no_pricing (defense in depth).
src/cli.ts buildGatewayConfig maps LLAMA_SERVER_RERANKER_BASE_URL to provider_base_urls.llama-server-reranker. Sibling of the existing LLAMA_SERVER_BASE_URL mapping.
src/commands/models.ts reranker probes now resolve the model the same way live search does (via loadSearchModeConfig + resolveSearchMode). Closes a file-plane / DB-plane divergence where doctor said "not configured" while live search was actively reranking.
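
The timeout precedence described for src/core/search/mode.ts can be sketched as a first-defined-wins chain. This is a minimal illustration, not the PR's code: the `TimeoutSources` shape, the `resolveRerankerTimeoutMs` name, and the 30000 ms recipe value are assumptions; only the ordering and the 5000 ms bundle default come from the description above.

```typescript
// Hypothetical shape for illustration; the real types live in
// src/core/ai/types.ts and src/core/search/mode.ts and may differ.
interface TimeoutSources {
  perCallMs?: number;       // explicit per-call override
  configMs?: number;        // search.reranker.timeout_ms
  recipeDefaultMs?: number; // touchpoint default_timeout_ms (absent on ZE's recipe)
}

// Mode-bundle default named in the PR description.
const MODE_BUNDLE_DEFAULT_MS = 5000;

// Returns the first defined value in precedence order:
// per-call > config > recipe touchpoint default > mode-bundle default.
function resolveRerankerTimeoutMs(src: TimeoutSources): number {
  return (
    src.perCallMs ??
    src.configMs ??
    src.recipeDefaultMs ??
    MODE_BUNDLE_DEFAULT_MS
  );
}

// 30000 is an assumed cold-start-friendly recipe default, not a value from the PR.
console.log(resolveRerankerTimeoutMs({ recipeDefaultMs: 30000 }));                 // 30000
console.log(resolveRerankerTimeoutMs({ configMs: 8000, recipeDefaultMs: 30000 })); // 8000
console.log(resolveRerankerTimeoutMs({}));                                         // 5000
```

Using `??` rather than `||` means an explicit `0` at a higher level still wins over lower levels; whether gbrain treats `0` that way is not stated in the PR.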

Test Coverage

107/107 expect() calls pass across 5 affected test files (126ms wall):

USER FLOW                       CODE PATH                       TEST FILE
setup llama-server locally  -> recipe registry lookup        -> recipe-llama-server-reranker
                               base_url + auth_env shape        (8 cases incl. URL concat regression)

gbrain search (rerank call) -> resolveSearchMode              -> search-mode.test
                               timeout precedence chain         (5 cases: per-call > config >
                                                                 recipe > bundle)
                            -> gateway.rerank()               -> rerank.test
                               tp.path concat (/v1/rerank)      (5 new + 21 existing; tightened
                               ZE fallthrough /models/rerank    endsWith → exact-URL assertion
                               empty models[] allowlist         after codex-found /v1/v1/ bug)

--max-cost with local       -> BudgetTracker.reserve          -> budget-tracker.test
reranker                      FREE_LOCAL_RERANK_PROVIDERS       (3 new: zero-pricing rerank kind,
                              zero pricing on rerank kind       arbitrary model id, chat-kind
                                                                still TX2-fails)

gbrain models doctor        -> resolveLiveRerankerModel       -> models-doctor-reranker.test
                              (file/DB plane divergence fix)    (5 cases incl. DB-error fallback)
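
The --max-cost row above (zero pricing for the rerank kind, chat kind still unpriced) might look roughly like the sketch below. The `Kind` union, the `lookupPrice` function, and its return convention are hypothetical; only the `FREE_LOCAL_RERANK_PROVIDERS` name and the `llama-server-reranker:*` model id pattern come from the PR.

```typescript
// Hypothetical kinds; gbrain's actual budget tracker may use different names.
type Kind = "rerank" | "chat" | "embed";

// Assumed mirror of the set in src/core/budget/budget-tracker.ts.
const FREE_LOCAL_RERANK_PROVIDERS = new Set<string>(["llama-server-reranker"]);

// Returns a price for the call, or undefined when no pricing is known.
// A caller enforcing --max-cost would treat undefined as a hard failure.
function lookupPrice(modelId: string, kind: Kind): number | undefined {
  // Model ids look like "provider:model"; the provider is the part before ":".
  const [provider = ""] = modelId.split(":");
  if (kind === "rerank" && FREE_LOCAL_RERANK_PROVIDERS.has(provider)) {
    return 0; // local rerank costs nothing, for any model id on this provider
  }
  return undefined; // a real tracker would consult its pricing table here
}

console.log(lookupPrice("llama-server-reranker:qwen3-reranker-4b", "rerank")); // 0
console.log(lookupPrice("llama-server-reranker:anything", "rerank"));          // 0
console.log(lookupPrice("llama-server-reranker:qwen3-reranker-4b", "chat"));   // undefined
```

Scoping the zero price to the rerank kind keeps an unpriced chat call on the same provider failing loudly instead of silently passing a budget cap.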

Coverage gate: 94% (Step 7 subagent verdict). 3 low-risk gaps documented and deferred (env→config glue with benign failure modes; covered by the recipe-id assertions).

Pre-Landing Review

/codex review against the diff caught one real [P2] bug that 5 of my own tests AND plan-eng-review missed: path: '/v1/rerank' on the recipe + base_url_default: 'http://localhost:8081/v1' were concatenating to …/v1/v1/rerank. Headline feature would have shipped broken. Fixed in same wave — recipe path is now leaf /rerank, and the loose endsWith('/v1/rerank') test assertion is tightened to exact-URL toBe('http://localhost:8081/v1/rerank') so the regression can't recur.
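
To make the failure mode concrete, here is a small sketch of how the doubled segment arises and why the `endsWith` assertion could not catch it. `joinUrl` is a hypothetical helper written for this illustration, not gbrain's gateway code; the URLs and paths are the ones quoted above.

```typescript
// Hypothetical join: strip trailing slashes from the base, ensure one leading
// slash on the path, then concatenate.
function joinUrl(baseUrl: string, path: string): string {
  const base = baseUrl.replace(/\/+$/, "");
  const leaf = path.startsWith("/") ? path : `/${path}`;
  return base + leaf;
}

const baseUrlDefault = "http://localhost:8081/v1";
const brokenUrl = joinUrl(baseUrlDefault, "/v1/rerank"); // original recipe path
const fixedUrl = joinUrl(baseUrlDefault, "/rerank");     // leaf path after the fix

console.log(brokenUrl); // http://localhost:8081/v1/v1/rerank
console.log(fixedUrl);  // http://localhost:8081/v1/rerank

// The loose assertion passes for both URLs, so it cannot detect the bug.
console.log(brokenUrl.endsWith("/v1/rerank")); // true
console.log(fixedUrl.endsWith("/v1/rerank"));  // true

// Only the exact-URL comparison separates them.
console.log(brokenUrl === "http://localhost:8081/v1/rerank"); // false
console.log(fixedUrl === "http://localhost:8081/v1/rerank");  // true
```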

Cross-model: Claude adversarial subagent (independent fresh-context pass after codex) found 7 net-new findings codex missed. None are currently exploitable; they harden the new surface against future contributor traps. All 7 filed as v0.40.7+ llama-server-reranker follow-ups in TODOS.md:

P1: SSRF scheme validation sweep for all 6 openai-compat _BASE_URL env vars (pre-existing pattern; this PR adds one more env var to the gap; should be its own focused PR using the existing src/core/ssrf-validate.ts helpers).
P2: Document FREE_LOCAL_RERANK_PROVIDERS invariant (theoretical bypass requires a future caller skipping gateway.rerank()'s assertTouchpoint check).
P2: Recipe path-concat sanity check at gateway-init (catches future contributors making the same /v1/v1/ mistake — gate at configure time instead of relying on test assertions).
P3 x3: debug-log on malformed search.reranker.model; narrow the resolveLiveRerankerModel catch-all; validate modelStr shape before allocating probe timeout.
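
As one possible shape for the P2 path-concat follow-up, a configure-time check could reject a recipe path whose first segment repeats the last segment of the base URL. This is a sketch under assumptions: `assertRecipePathSane` is a hypothetical name, the heuristic is one of several that would work, and nothing here is taken from TODOS.md.

```typescript
// Hypothetical gateway-init check: throw when the recipe path starts with the
// same segment the base URL ends with (e.g. base ".../v1" + path "/v1/rerank").
function assertRecipePathSane(baseUrl: string, path: string): void {
  const baseSegments = new URL(baseUrl).pathname.split("/").filter(Boolean);
  const pathSegments = path.split("/").filter(Boolean);
  const lastBaseSegment = baseSegments[baseSegments.length - 1];
  if (lastBaseSegment !== undefined && pathSegments[0] === lastBaseSegment) {
    throw new Error(
      `recipe path "${path}" repeats "/${lastBaseSegment}" from base URL "${baseUrl}"`
    );
  }
}

// The leaf path from the fix passes silently.
assertRecipePathSane("http://localhost:8081/v1", "/rerank");

// The original recipe path is rejected before any request is made.
try {
  assertRecipePathSane("http://localhost:8081/v1", "/v1/rerank");
} catch (err) {
  console.log((err as Error).message);
}
```

A check like this fails at configuration time, so the mistake surfaces even when no test pins the exact URL.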

Plan Completion

20/23 plan items DONE, 3 CHANGED (test files landed at slightly different paths than the plan specified — test/search-mode.test.ts vs plan's test/search/mode.test.ts etc., because that matches gbrain's existing test layout), 0 NOT DONE.

Full plan + decision audit: ~/.claude/plans/write-up-a-plan-silly-fairy.md.

TODOS

Filed: 6 new entries under v0.40.7.1 llama-server-reranker follow-ups (v0.40.7+) in TODOS.md (the adversarial findings above).
No items marked complete from prior TODOs (this is a net-new feature wave).

Documentation

NEW: docs/ai-providers/llama-server-reranker.md (161 lines) — build llama.cpp, launch with --alias + --reranking, gbrain config commands with correct keys (provider_base_urls.llama-server-reranker, search.reranker.model), verification via gbrain models doctor, cold-start timeout note, budget-cap interaction.
Updated: README.md — added Rerankers bullet naming hosted ZE default + new llama-server-reranker recipe. docs/integrations/embedding-providers.md — expanded the Voyage-only Reranking pair decision-tree bullet, added Local reranking (no API spend) bullet. CLAUDE.md — new recipe annotation. llms-full.txt regenerated. CHANGELOG.md v0.40.7.1 entry. (Pushed as separate docs: post-ship documentation sync commit.)

Verification

Done locally (16GB LXC — can't run the full 600-file parallel suite due to PGLite WASM heap accumulation; the CI's beefier hardware will do that):

bun run verify — typecheck + 13 pre-checks: clean
bun run check:all — 15 historical checks (privacy, jsonb, source-id-projection, progress-to-stdout, no-legacy-getconnection, test-isolation across 600 files, trailing-newline across 1009 files, wasm-embedded, exports-count, admin-build, admin-scope-drift, cli-executable, skill-brain-first): clean
bun test test/ai/rerank.test.ts test/ai/recipe-llama-server-reranker.test.ts test/search-mode.test.ts test/models-doctor-reranker.test.ts test/core/budget/budget-tracker.test.ts — 107/107 pass
/codex review --base master — GATE PASS, 1 [P2] caught + fixed in-wave
/plan-eng-review against the plan — CLEAR; outside-voice codex caught 10 plan-stage misses, all absorbed before code was written

Real-world end-to-end (recommended for reviewer)

# Spin up llama.cpp with --reranking + --alias (any open-weight cross-encoder GGUF):
llama-server --model qwen3-reranker-4b-q4_k_m.gguf --alias qwen3-reranker-4b --reranking --port 8081

# Point gbrain at it:
gbrain config set provider_base_urls.llama-server-reranker http://localhost:8081/v1
gbrain config set search.reranker.model llama-server-reranker:qwen3-reranker-4b
gbrain config set search.reranker.enabled true
gbrain models doctor    # confirms reachability + that --reranking mode is on
gbrain search "test query" --json | jq '.[].rerank_score'   # rerank_score lands on every row
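
# For a reviewer who wants to hit the server without gbrain in the loop, the
# TypeScript sketch below posts directly to the rerank endpoint. It assumes
# llama-server returns a Jina-style body of the form
# { results: [{ index, relevance_score }] }; verify that against your llama.cpp
# build. The function names are hypothetical and not part of gbrain.

```typescript
// Assumed response shape; check your llama.cpp version before relying on it.
interface RerankResult {
  index: number;           // position of the document in the request
  relevance_score: number; // higher means more relevant
}
interface RerankResponse {
  results: RerankResult[];
}

// Reorders the input documents by descending relevance score,
// dropping any result whose index is out of range.
function orderDocuments(documents: string[], response: RerankResponse): string[] {
  return [...response.results]
    .sort((a, b) => b.relevance_score - a.relevance_score)
    .map((r) => documents[r.index])
    .filter((d): d is string => d !== undefined);
}

// Sends one rerank request; baseUrl is expected to end in /v1 as configured above.
async function rerankDirect(
  baseUrl: string,
  model: string,
  query: string,
  documents: string[]
): Promise<string[]> {
  const res = await fetch(`${baseUrl}/rerank`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, query, documents }),
  });
  if (!res.ok) throw new Error(`rerank failed: HTTP ${res.status}`);
  return orderDocuments(documents, (await res.json()) as RerankResponse);
}

// Example call (requires the llama-server process from the first command):
// rerankDirect("http://localhost:8081/v1", "qwen3-reranker-4b",
//              "test query", ["first doc", "second doc"]).then(console.log);
```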

Test plan

Type-check + pre-check sweeps clean
107/107 unit tests pass on affected files
Codex pre-landing review (gate PASS)
Codex caught + we fixed in-wave a [P2] ship-broken URL bug
Claude adversarial review (7 findings filed as TODOs)
Recipe shape regression test pins base_url + path === '…/v1/rerank' (not /v1/v1/rerank)
Reviewer manual test against a real llama.cpp host (commands above)
Full suite gate via CI (couldn't run locally on 16GB LXC; LXC OOMs on PGLite WASM accumulation across shards)

🤖 Generated with Claude Code

…via llama.cpp

Adds local reranker support so users can point gbrain's reranker call at their own llama.cpp server instead of ZeroEntropy's hosted API. One new recipe (`llama-server-reranker`), a `path?: string` + `default_timeout_ms?: number` extension on `RerankerTouchpoint`, env passthrough wiring, budget-tracker `FREE_LOCAL_RERANK_PROVIDERS` set so `--max-cost` callers don't TX2 hard-fail on local rerank, and a doctor-probe divergence fix (probe and live search now read the same `search.reranker.model` path via `loadSearchModeConfig` + `resolveSearchMode`).

ZE-hosted users are unchanged. Voyage / Cohere / vLLM rerankers stay out of scope: different wire shapes need adapter hooks designed against their actual shapes in a follow-up plan.

Verification:
- `bun run verify` (typecheck + 13 pre-checks): clean
- `bun run check:all` (15 historical checks): clean
- 107/107 expect() calls pass across 5 affected test files
- /codex review against the full diff: GATE PASS (caught one [P2] /v1 path doubling bug pre-merge; fixed by changing recipe path to leaf `/rerank`)
- Claude adversarial subagent: 7 net-new findings filed as v0.40.7+ TODOs (none currently exploitable; hardening for future contributor traps)

Test surface (107 cases, 5 files):
- test/ai/rerank.test.ts: path override (exact URL match), default_timeout_ms honored, empty models[] accepts any id, ZE regression
- test/ai/recipe-llama-server-reranker.test.ts: recipe shape regression guard + base_url + path concat assertion (codex-caught /v1/v1/ regression)
- test/search-mode.test.ts: timeout precedence chain (per-call > config > recipe > bundle), ZE no-recipe-default regression, unknown provider fallthrough
- test/models-doctor-reranker.test.ts: divergence-fix helper across DB-plane read, mode default, disabled, override, DB-error graceful fallback
- test/core/budget/budget-tracker.test.ts: free-local rerank pricing + arbitrary model id + chat-kind TX2 hard-fail preserved

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…r-reranker)

The hand-curated llms-config.ts doc map never included docs/ai-providers/, so both zeroentropy.md (since v0.35.0.0) and the new llama-server-reranker.md were invisible to the AI-facing llms.txt / llms-full.txt index. Adds an "AI providers" section with both. Marked includeInFull: false (setup walkthroughs belong in the index but would push the single-fetch bundle past FULL_SIZE_BUDGET), the same treatment CHANGELOG.md gets.

Caught by the /ship document-release subagent.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…eranker

# Conflicts:
#   CHANGELOG.md
#   TODOS.md
#   VERSION
#   package.json
#   src/core/search/mode.ts

garrytan · 2026-06-08T03:01:53Z

Thanks for this contribution — and apologies for the slow triage. We did a full pass over the entire PR backlog. gbrain has moved fast, and the maintainer's larger "cathedral" rewrites have superseded a big share of community PRs: the AI gateway + recipes + user_provided_models system replaced almost all individual provider PRs; #1805 fixed the whole Postgres module-singleton class; #1542 unified the type taxonomy; #1657 the retrieval path; #1802 the doctor; and so on.

We're closing this one in that cleanup — either the fix already landed on master, it duplicates another PR or merged change, or it's outside the current merge bar. Where a closed PR carried a genuinely valuable idea, we've recorded it in docs/designs/COMMUNITY_IDEAS.md so nothing good is lost (a few may graduate into TODOs).

Please don't read the close as a judgment of the work — thank you for contributing. If you believe the underlying issue is still live on the latest master, reopen with a quick note and we'll take another look. 🙏

kohai-ut and others added 4 commits May 23, 2026 11:47

docs: post-ship documentation sync

6c77ab3

Merge remote-tracking branch 'origin/master' into feat/llama-server-r…

5b7aba5


kohai-ut changed the title ~~v0.40.6.1 feat: llama-server reranker — local Qwen3 / self-hosted ZE via llama.cpp~~ v0.40.7.1 feat: llama-server reranker — local Qwen3 / self-hosted ZE via llama.cpp May 23, 2026

kohai-ut mentioned this pull request May 23, 2026

v0.40.8.0 fix: local embeddings as a first-class provider #1329

Closed

4 tasks

This was referenced May 24, 2026

v0.41.4.0 wave: local providers + cross-platform stdin + gateway-routed dream judge (6 community PRs) #1377

Merged

fix(cli): use fd 0 instead of '/dev/stdin' for cross-platform stdin reads #1325

Closed

garrytan closed this Jun 8, 2026

kohai-ut wants to merge 4 commits into garrytan:master from kohai-ut:feat/llama-server-reranker
