
v0.40.7.1 feat: llama-server reranker — local Qwen3 / self-hosted ZE via llama.cpp #1326

Closed
kohai-ut wants to merge 4 commits into
garrytan:master from
kohai-ut:feat/llama-server-reranker

Conversation

@kohai-ut

@kohai-ut kohai-ut commented May 23, 2026

Contributor

Summary

Adds local llama.cpp reranker support so users can point gbrain's reranker call at their own llama.cpp server instead of ZeroEntropy's hosted API. ZE-hosted users are unchanged. The change set is one new recipe, a path/timeout extension on RerankerTouchpoint, and plumbing around the edges so that cold-start timeouts and --max-cost budget caps work for local rerank.

This is for you if you already run llama.cpp for embeddings (the existing llama-server recipe), want to keep retrieval data inside your LAN, or want to use Qwen3-Reranker without paying a per-rerank fee. Self-hosting ZeroEntropy's open-weight zerank-2 via llama.cpp uses the same recipe.

Setup is in docs/ai-providers/llama-server-reranker.md. The README + integrations doc surface it on the canonical provider lists.

Voyage / Cohere / vLLM rerankers stay out of scope — they have different wire shapes (different field names, different response shape) and need adapter hooks designed against their actual shapes in a follow-up plan.

Key changes

  • New recipe llama-server-reranker at src/core/ai/recipes/llama-server-reranker.ts. Distinct from the existing llama-server embedding recipe because llama.cpp's --reranking and --embeddings flags are mutually exclusive at server-launch time — one process per mode, two recipes for two base URLs (8081 vs 8080).
  • RerankerTouchpoint.path?: string + default_timeout_ms?: number in src/core/ai/types.ts. Optional fields; absent on ZE's recipe so behavior there is unchanged.
  • src/core/search/mode.ts reranker timeout precedence: per-call > search.reranker.timeout_ms config > recipe touchpoint default > mode-bundle default (5000). Closes the dead-default-timeout bug class where recipe-level timeouts never fired because hybridSearch always passed the bundle's value.
  • src/core/budget/budget-tracker.ts FREE_LOCAL_RERANK_PROVIDERS set zero-prices llama-server-reranker:* for the rerank kind, so --max-cost-bounded callers no longer TX2 hard-fail when configured for local rerank. Chat kind on the same provider still triggers no_pricing (defense in depth).
  • src/cli.ts buildGatewayConfig maps LLAMA_SERVER_RERANKER_BASE_URL to provider_base_urls.llama-server-reranker. Sibling of the existing LLAMA_SERVER_BASE_URL mapping.
  • src/commands/models.ts reranker probes now resolve the model the same way live search does (via loadSearchModeConfig + resolveSearchMode). Closes a file-plane / DB-plane divergence where doctor said "not configured" while live search was actively reranking.
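
The timeout precedence described for src/core/search/mode.ts can be sketched as a first-defined-wins chain. This is a minimal illustration, not the code in the PR: `TimeoutSources`, `resolveRerankTimeoutMs`, and the 30000 ms recipe default are hypothetical names and values chosen for the example; only the ordering and the 5000 ms bundle default come from the description above.

```typescript
// Hypothetical shape: field names are illustrative, not gbrain's actual types.
interface TimeoutSources {
  perCallMs?: number;       // explicit timeout passed by the caller
  configMs?: number;        // search.reranker.timeout_ms from config
  recipeDefaultMs?: number; // RerankerTouchpoint.default_timeout_ms
}

// Mode-bundle default named in the PR description.
const MODE_BUNDLE_DEFAULT_MS = 5000;

// Returns the first defined value in precedence order:
// per-call > config > recipe touchpoint default > mode-bundle default.
function resolveRerankTimeoutMs(src: TimeoutSources): number {
  return (
    src.perCallMs ??
    src.configMs ??
    src.recipeDefaultMs ??
    MODE_BUNDLE_DEFAULT_MS
  );
}

// A recipe-level default (assumed 30000 here) beats the bundle default...
console.log(resolveRerankTimeoutMs({ recipeDefaultMs: 30000 })); // 30000
// ...an explicit config value overrides the recipe...
console.log(resolveRerankTimeoutMs({ configMs: 8000, recipeDefaultMs: 30000 })); // 8000
// ...and a recipe with no default (the ZE case) falls through to the bundle.
console.log(resolveRerankTimeoutMs({})); // 5000
```

The "dead default" bug class corresponds to always supplying the bundle value in the per-call slot, which would make the recipe default unreachable in a chain like this.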

Test Coverage

107/107 expect() calls pass across 5 affected test files (126ms wall):

USER FLOW                      CODE PATH                         TEST FILE
setup llama-server locally  -> recipe registry lookup         -> recipe-llama-server-reranker
                               base_url + auth_env shape         (8 cases incl. URL concat regression)

gbrain search (rerank call) -> resolveSearchMode              -> search-mode.test
                               timeout precedence chain          (5 cases: per-call > config >
                                                                  recipe > bundle)
                            -> gateway.rerank()               -> rerank.test
                               tp.path concat (/v1/rerank)       (5 new + 21 existing; tightened
                               ZE fallthrough /models/rerank      endsWith → exact-URL assertion
                               empty models[] allowlist           after codex-found /v1/v1/ bug)

--max-cost with local       -> BudgetTracker.reserve          -> budget-tracker.test
reranker                       FREE_LOCAL_RERANK_PROVIDERS       (3 new: zero-pricing rerank kind,
                               zero pricing on rerank kind        arbitrary model id, chat-kind
                                                                  still TX2-fails)

gbrain models doctor        -> resolveLiveRerankerModel       -> models-doctor-reranker.test
                               (file/DB plane divergence fix)    (5 cases incl. DB-error fallback)

Coverage gate: 94% (Step 7 subagent verdict). 3 low-risk gaps documented and deferred (env→config glue with benign failure modes; covered by the recipe-id assertions).
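
The budget-tracker rows in the table above describe three behaviors: zero pricing for the rerank kind, acceptance of an arbitrary model id, and a preserved no_pricing failure for chat. A rough sketch of that rule follows. The set name and the `no_pricing` reason come from this PR; `lookupPrice`, `PriceLookup`, and the price-table layout are hypothetical and stand in for whatever BudgetTracker actually does.

```typescript
type Kind = "rerank" | "chat";

// Provider ids that are priced at zero, for the rerank kind only.
const FREE_LOCAL_RERANK_PROVIDERS = new Set<string>(["llama-server-reranker"]);

// Hypothetical result shape for a price lookup.
type PriceLookup =
  | { ok: true; usdPerCall: number }
  | { ok: false; reason: "no_pricing" };

// Model ids are "provider:model"; the provider is everything before the first colon.
function providerOf(modelId: string): string {
  const i = modelId.indexOf(":");
  return i === -1 ? modelId : modelId.slice(0, i);
}

// `table` is an assumed price table keyed by "kind:provider:model".
function lookupPrice(modelId: string, kind: Kind, table: Map<string, number>): PriceLookup {
  // Local rerank is free regardless of which model the server is hosting.
  if (kind === "rerank" && FREE_LOCAL_RERANK_PROVIDERS.has(providerOf(modelId))) {
    return { ok: true, usdPerCall: 0 };
  }
  const price = table.get(`${kind}:${modelId}`);
  return price === undefined
    ? { ok: false, reason: "no_pricing" } // a --max-cost caller would hard-fail here
    : { ok: true, usdPerCall: price };
}

const prices = new Map<string, number>(); // empty: nothing is explicitly priced
console.log(lookupPrice("llama-server-reranker:qwen3-reranker-4b", "rerank", prices));
console.log(lookupPrice("llama-server-reranker:qwen3-reranker-4b", "chat", prices));
```

Scoping the free pass to the rerank kind is what keeps the chat path failing closed, which matches the "defense in depth" note in Key changes.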

Pre-Landing Review

/codex review against the diff caught one real [P2] bug that 5 of my own tests AND plan-eng-review missed: path: '/v1/rerank' on the recipe + base_url_default: 'http://localhost:8081/v1' were concatenating to …/v1/v1/rerank. Headline feature would have shipped broken. Fixed in same wave — recipe path is now leaf /rerank, and the loose endsWith('/v1/rerank') test assertion is tightened to exact-URL toBe('http://localhost:8081/v1/rerank') so the regression can't recur.
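
The doubling is easy to reproduce with a plain string join. The sketch below assumes the gateway concatenates the base URL and the touchpoint path roughly like `joinUrl`, a hypothetical helper rather than gbrain's actual code, and shows why the loose `endsWith` assertion could not catch the bug.

```typescript
// Hypothetical join: strip trailing slashes from the base, then append the path.
function joinUrl(baseUrl: string, path: string): string {
  return baseUrl.replace(/\/+$/, "") + path;
}

const base = "http://localhost:8081/v1"; // base_url_default from the recipe

const broken = joinUrl(base, "/v1/rerank"); // original recipe path
const fixed = joinUrl(base, "/rerank");     // leaf path after the fix

console.log(broken); // http://localhost:8081/v1/v1/rerank
console.log(fixed);  // http://localhost:8081/v1/rerank

// The loose assertion passes on the broken URL, so it could not catch the bug:
console.log(broken.endsWith("/v1/rerank")); // true
// The exact-URL assertion distinguishes the two:
console.log(broken === "http://localhost:8081/v1/rerank"); // false
console.log(fixed === "http://localhost:8081/v1/rerank");  // true
```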

Cross-model: Claude adversarial subagent (independent fresh-context pass after codex) surfaced 7 net-new findings that codex missed. None are currently exploitable; they harden the new surface against future contributor traps. All 7 filed as v0.40.7+ llama-server-reranker follow-ups in TODOS.md:

  • P1: SSRF scheme validation sweep for all 6 openai-compat _BASE_URL env vars (pre-existing pattern; this PR adds one more env var to the gap; should be its own focused PR using the existing src/core/ssrf-validate.ts helpers).
  • P2: Document FREE_LOCAL_RERANK_PROVIDERS invariant (theoretical bypass requires a future caller skipping gateway.rerank()'s assertTouchpoint check).
  • P2: Recipe path-concat sanity check at gateway-init (catches future contributors making the same /v1/v1/ mistake — gate at configure time instead of relying on test assertions).
  • P3 x3: debug-log on malformed search.reranker.model; narrow the resolveLiveRerankerModel catch-all; validate modelStr shape before allocating probe timeout.
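
For the P1 item, a scheme check on a `_BASE_URL` value could look roughly like the sketch below. It does not use or reproduce the helpers in src/core/ssrf-validate.ts, whose signatures are not shown in this PR; `parseBaseUrl` and `ALLOWED_SCHEMES` are hypothetical, and a scheme allowlist alone is only one part of SSRF hardening.

```typescript
// Only plain HTTP(S) endpoints are accepted as provider base URLs.
const ALLOWED_SCHEMES = new Set<string>(["http:", "https:"]);

// Parses an env-supplied base URL and rejects anything that is not http(s).
// `envName` is used only to make the error message actionable.
function parseBaseUrl(envName: string, raw: string): URL {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    throw new Error(`${envName}: not a valid absolute URL`);
  }
  if (!ALLOWED_SCHEMES.has(url.protocol)) {
    throw new Error(`${envName}: scheme "${url.protocol}" is not allowed`);
  }
  return url;
}

const ok = parseBaseUrl("LLAMA_SERVER_RERANKER_BASE_URL", "http://localhost:8081/v1");
console.log(ok.host); // localhost:8081

try {
  parseBaseUrl("LLAMA_SERVER_RERANKER_BASE_URL", "file:///etc/passwd");
} catch (e) {
  console.log((e as Error).message);
}
```

A value without a scheme, such as `localhost:8081/v1`, is also rejected here: the URL parser reads `localhost:` as the scheme, which is not in the allowlist.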

Plan Completion

20/23 plan items DONE, 3 CHANGED (test files landed at slightly different paths than the plan specified — test/search-mode.test.ts vs plan's test/search/mode.test.ts etc., because that matches gbrain's existing test layout), 0 NOT DONE.

Full plan + decision audit: ~/.claude/plans/write-up-a-plan-silly-fairy.md.

TODOS

  • Filed: 6 new entries under v0.40.7.1 llama-server-reranker follow-ups (v0.40.7+) in TODOS.md (the adversarial findings above).
  • No items marked complete from prior TODOs (this is a net-new feature wave).

Documentation

  • NEW: docs/ai-providers/llama-server-reranker.md (161 lines) — build llama.cpp, launch with --alias + --reranking, gbrain config commands with correct keys (provider_base_urls.llama-server-reranker, search.reranker.model), verification via gbrain models doctor, cold-start timeout note, budget-cap interaction.
  • Updated: README.md — added Rerankers bullet naming hosted ZE default + new llama-server-reranker recipe. docs/integrations/embedding-providers.md — expanded the Voyage-only Reranking pair decision-tree bullet, added Local reranking (no API spend) bullet. CLAUDE.md — new recipe annotation. llms-full.txt regenerated. CHANGELOG.md v0.40.7.1 entry. (Pushed as separate docs: post-ship documentation sync commit.)

Verification

Done locally (16GB LXC — can't run the full 600-file parallel suite due to PGLite WASM heap accumulation; the CI's beefier hardware will do that):

  • bun run verify — typecheck + 13 pre-checks: clean
  • bun run check:all — 15 historical checks (privacy, jsonb, source-id-projection, progress-to-stdout, no-legacy-getconnection, test-isolation across 600 files, trailing-newline across 1009 files, wasm-embedded, exports-count, admin-build, admin-scope-drift, cli-executable, skill-brain-first): clean
  • bun test test/ai/rerank.test.ts test/ai/recipe-llama-server-reranker.test.ts test/search-mode.test.ts test/models-doctor-reranker.test.ts test/core/budget/budget-tracker.test.ts: 107/107 pass
  • /codex review --base master — GATE PASS, 1 [P2] caught + fixed in-wave
  • /plan-eng-review against the plan — CLEAR; outside-voice codex caught 10 plan-stage misses, all absorbed before code was written

Real-world end-to-end (recommended for reviewer)

# Spin up llama.cpp with --reranking + --alias (any open-weight cross-encoder GGUF):
llama-server --model qwen3-reranker-4b-q4_k_m.gguf --alias qwen3-reranker-4b --reranking --port 8081

# Point gbrain at it:
gbrain config set provider_base_urls.llama-server-reranker http://localhost:8081/v1
gbrain config set search.reranker.model llama-server-reranker:qwen3-reranker-4b
gbrain config set search.reranker.enabled true
gbrain models doctor    # confirms reachability + that --reranking mode is on
gbrain search "test query" --json | jq '.[].rerank_score'   # rerank_score lands on every row
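
To sanity-check the server independently of gbrain, a reviewer could build the rerank request by hand. The sketch below assumes llama.cpp's rerank endpoint accepts `{ model, query, documents }` and returns `results` entries carrying `index` and `relevance_score`; confirm that shape against the llama.cpp build in use. `buildRerankRequest` and `sortByRelevance` are hypothetical helpers, not part of this PR.

```typescript
// Assumed response shape; verify against your llama.cpp version.
interface RerankResult { index: number; relevance_score: number }
interface RerankResponse { results: RerankResult[] }

// Builds the URL and fetch init for a rerank call against a base URL ending in /v1.
function buildRerankRequest(baseUrl: string, model: string, query: string, documents: string[]) {
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/rerank`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model, query, documents }),
    },
  };
}

// Returns a copy of the results ordered from most to least relevant.
function sortByRelevance(response: RerankResponse): RerankResult[] {
  return [...response.results].sort((a, b) => b.relevance_score - a.relevance_score);
}

const req = buildRerankRequest(
  "http://localhost:8081/v1",
  "qwen3-reranker-4b", // must match the --alias passed to llama-server
  "test query",
  ["first candidate passage", "second candidate passage"],
);
console.log(req.url); // http://localhost:8081/v1/rerank

// Against a live server (not executed here):
//   const res = await fetch(req.url, req.init);
//   const ranked = sortByRelevance((await res.json()) as RerankResponse);
```

Each `index` in the sorted output refers back to a position in the `documents` array that was sent.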

Test plan

  • Type-check + pre-check sweeps clean
  • 107/107 unit tests pass on affected files
  • Codex pre-landing review (gate PASS)
  • Codex caught + we fixed in-wave a [P2] ship-broken URL bug
  • Claude adversarial review (7 findings filed as TODOs)
  • Recipe shape regression test pins base_url + path === '…/v1/rerank' (not /v1/v1/rerank)
  • Reviewer manual test against a real llama.cpp host (commands above)
  • Full suite gate via CI (couldn't run locally on 16GB LXC; LXC OOMs on PGLite WASM accumulation across shards)

🤖 Generated with Claude Code

kohai-ut and others added 4 commits May 23, 2026 11:47
…via llama.cpp

Adds local reranker support so users can point gbrain's reranker call at their
own llama.cpp server instead of ZeroEntropy's hosted API. One new recipe
(`llama-server-reranker`), a `path?: string` + `default_timeout_ms?: number`
extension on `RerankerTouchpoint`, env passthrough wiring, budget-tracker
`FREE_LOCAL_RERANK_PROVIDERS` set so `--max-cost` callers don't TX2 hard-fail on
local rerank, and a doctor-probe divergence fix (probe and live search now read
the same `search.reranker.model` path via `loadSearchModeConfig` + `resolveSearchMode`).

ZE-hosted users are unchanged. Voyage / Cohere / vLLM rerankers stay out of
scope — different wire shapes need adapter hooks designed against their actual
shapes in a follow-up plan.

Verification:
- `bun run verify` (typecheck + 13 pre-checks): clean
- `bun run check:all` (15 historical checks): clean
- 107/107 expect() calls pass across 5 affected test files
- /codex review against the full diff: GATE PASS (caught one [P2] /v1 path
  doubling bug pre-merge; fixed by changing recipe path to leaf `/rerank`)
- Claude adversarial subagent: 7 net-new findings filed as v0.40.7+ TODOs
  (none currently exploitable; hardening for future contributor traps)

Test surface (107 cases, 5 files):
- test/ai/rerank.test.ts: path override (exact URL match), default_timeout_ms
  honored, empty models[] accepts any id, ZE regression
- test/ai/recipe-llama-server-reranker.test.ts: recipe shape regression guard
  + base_url + path concat assertion (codex-caught /v1/v1/ regression)
- test/search-mode.test.ts: timeout precedence chain (per-call > config >
  recipe > bundle), ZE no-recipe-default regression, unknown provider fallthrough
- test/models-doctor-reranker.test.ts: divergence-fix helper across DB-plane
  read, mode default, disabled, override, DB-error graceful fallback
- test/core/budget/budget-tracker.test.ts: free-local rerank pricing + arbitrary
  model id + chat-kind TX2 hard-fail preserved

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…r-reranker)

The hand-curated llms-config.ts doc map never included docs/ai-providers/, so
both zeroentropy.md (since v0.35.0.0) and the new llama-server-reranker.md were
invisible to the AI-facing llms.txt / llms-full.txt index. Adds an "AI providers"
section with both. Marked includeInFull: false (setup walkthroughs belong in the
index but would push the single-fetch bundle past FULL_SIZE_BUDGET) — same
treatment CHANGELOG.md gets.

Caught by the /ship document-release subagent.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…eranker

# Conflicts:
#	CHANGELOG.md
#	TODOS.md
#	VERSION
#	package.json
#	src/core/search/mode.ts
@kohai-ut kohai-ut changed the title from "v0.40.6.1 feat: llama-server reranker — local Qwen3 / self-hosted ZE via llama.cpp" to "v0.40.7.1 feat: llama-server reranker — local Qwen3 / self-hosted ZE via llama.cpp" May 23, 2026
@garrytan

garrytan commented Jun 8, 2026

Owner

Thanks for this contribution — and apologies for the slow triage. We did a full pass over the entire PR backlog. gbrain has moved fast, and the maintainer's larger "cathedral" rewrites have superseded a big share of community PRs: the AI gateway + recipes + user_provided_models system replaced almost all individual provider PRs; #1805 fixed the whole Postgres module-singleton class; #1542 unified the type taxonomy; #1657 the retrieval path; #1802 the doctor; and so on.

We're closing this one in that cleanup — either the fix already landed on master, it duplicates another PR or merged change, or it's outside the current merge bar. Where a closed PR carried a genuinely valuable idea, we've recorded it in docs/designs/COMMUNITY_IDEAS.md so nothing good is lost (a few may graduate into TODOs).

Please don't read the close as a judgment of the work — thank you for contributing. If you believe the underlying issue is still live on the latest master, reopen with a quick note and we'll take another look. 🙏

@garrytan garrytan closed this Jun 8, 2026
2 participants