TL;DR
gbrain's default retrieval (uncapped hybrid → ~20 results, recall ~0.99, precision 0.076–0.14 on PrecisionMemBench) is the worst-scoring default in its peer set on the metric that matters: active retrieval (did the right thing return, and only the right thing). The benchmark's own words: "Returning everything is not retrieval."
--adaptive-return (v0.41.33) tightens the set, but tight only helps if #1 is right — and on noisy/migrated cases it makes a wrong answer more confident by hiding the alternatives. We have three modes (hybrid / adaptive-tight / think) and none wins, because all three tune the same flat-ranked-list contract.
This issue proposes the floor/ceiling redesign, grounded in how the 11 systems in gbrain-evals/docs/comparison-systems.md actually solve it.
The design question (from review)
How do we make the out-of-box default the smartest possible (lowest bar, simplest, best for everyone) AND give a high ceiling (agent pulls every lever for the best possible outcome on a hard task)?
The answer from the literature: the default does the cheap thing; a confidence signal — not the user — decides when to climb. The ceiling is automatic by default, manual when forced.
What each peer system does
| System |
Default |
Confidence signal |
Auto-escalation |
Precision stance |
| tenure (1.00 precision) |
tiered belief-state lookup (relevant/pinned/open-questions) |
structural — no threshold needed |
none needed |
precision is architecture, not tuning |
| MemPalace |
raw semantic (96.6% R@5, no LLM) |
distance |
opt-in LLM-reader rerank → ≥99%, model-agnostic |
recall-headline, honest about the split |
| Mastra OM |
no per-turn retrieval — curated observation log replaces history |
emoji priority tags 🔴🟡🟢 |
token-budget triggers |
sidesteps tradeoff — don't retrieve noise |
| mem0 |
top-k, permissive |
threshold param (uncalibrated → 0.06 precision) |
tiered cutoffs (0.8/0.6/0.4 by stakes) |
precision = developer's knob |
| Zep |
hybrid + RRF |
fact-extraction confidence |
swappable rerankers (RRF/MMR/cross-encoder/graph) |
structural supersession, broad default |
| LlamaIndex |
top-k (no cutoff) |
per-node score |
SimilarityPostprocessor, RouterQueryEngine, auto-retriever |
permissive default + composable knobs |
| Elasticsearch |
top-size |
raw score (not portable w/o normalization) |
min_score after normalization; LTR |
measure, don't guess the cutoff |
| Weaviate |
top-10 |
distance |
autocut (cut at score discontinuity) + rerank |
adaptive sizing, no magic number |
| Self-RAG |
model decides whether to retrieve |
reflection tokens (ISREL/ISSUP/ISUSE) |
self-grade → re-retrieve |
adaptive by default |
| CRAG |
single retrieve |
evaluator → Correct/Ambiguous/Incorrect |
low confidence → query-rewrite + web fallback |
the canonical confidence→fallback loop |
| Adaptive-RAG |
complexity classifier |
query difficulty |
route to 0 / 1 / N-hop retrieval |
route by difficulty, not score |
Four design archetypes that recur
- Adaptive escalation (CRAG, Adaptive-RAG, Self-RAG) — default = cheapest move; a confidence/complexity signal decides when to climb. The user never sets a threshold for the common case.
- Adaptive result-set sizing (Weaviate autocut) — cut where the score curve breaks, not at a fixed k. Returns 1 when the answer is obvious, the cluster when it's genuinely several, never 20 because k=20. Direct fix for our "20 vs 1" problem, no LLM call.
- Structural precision (tenure) — you can't tune your way to 1.00 on a flat ranked list; you change the contract. Tiered results + alias resolution + supersession + "what must NOT return." Correct by construction, then fast because it's not over-retrieving.
- Composable knob menu (LlamaIndex/Zep/mem0) — permissive default + small orthogonal levers. The systems that only do this (mem0/Zep/vector DBs) leave precision as homework → 0.06–0.09 default precision.
Winners combine 1+2+3. Losers lean on 4 alone.
Cross-cutting truth: raw cosine is the weakest confidence signal (index-dependent, uncalibrated, smooth). gbrain currently uses exactly this. Every system that gets precision right replaces it — cross-encoder rerank score, score discontinuity, self-grade, or a structured lookup contract.
Recommendation: self-grading, structurally-precise default
The default pipeline (smart out of box, zero config)
query
→ route by shape (factual-lookup | open-analytical) [Adaptive-RAG router]
→ factual: pinned/alias tier, scope+supersession filters [tenure-style contract] → exact, ~10ms
→ open: hybrid retrieve top-20, normalize scores [score normalization]
→ cross-encoder / LLM-reader rerank [rerank-as-confidence]
→ AUTOCUT at score discontinuity [Weaviate autocut]
→ if top-confidence < band: escalate to `think` [CRAG fallback] → only on hard queries
Four borrowed mechanisms, each proven:
- Tiered belief-state contract (tenure) — gbrain already has
pinnedFacts/openQuestions tiers. Route exact answers through the pinned/alias-resolved tier; structurally exclude superseded + out-of-scope beliefs (not down-rank). Biggest single precision lever, mostly architecture not model.
- Autocut (Weaviate) — normalize hybrid scores, cut at the largest gap. Adaptive size, no magic number.
- Rerank-as-confidence (MemPalace + Cohere) — over-retrieve top-20, rerank, use the top rerank score as a real gate. High → return autocut set; low → the system knows it found nothing good. Model-agnostic (Haiku-class is enough).
- CRAG self-escalation — when top confidence is below a band, auto-escalate to
think (broad gather + narrow cite) only then, so its ~5s cost is paid only on hard queries. The agent never picks the mode; a classifier + confidence band does.
The ceiling (knobs the agent CAN pull, never has to)
Expose but default-to-auto: min_confidence (mem0-tiered 0.8/0.6/0.4 by stakes), max_results override, force-think, reranker choice (Haiku→Sonnet→cross-encoder), web_fallback toggle.
The tool descriptions should literally say "for the best outcome on X, try Y" — discoverable without leaving the tool.
Why this beats the current three modes
- Default precision jumps — exact lookups never touch the noisy semantic path (structural); open queries get autocut+rerank instead of fixed-20.
- Recall is protected — escalation is automatic. When confidence is low the system widens itself, so tight defaults don't permanently lose recall the way
adaptive-tight does (44/77).
- Latency stays low on the common case — the 5s
think cost fires only when the confidence gate trips.
- High ceiling for free — the out-of-box default already self-escalates, so the lowest-bar user gets near-ceiling behavior with zero config; a power agent can still force every lever.
The split rule (the principle)
- Floor = behavior. gbrain does the smart thing. No flag enables it — it's what happens when you type the simplest command. Never make the smart path an opt-in flag (that's the v0.41.33 mistake:
adaptiveReturn is default-OFF, so 90% of callers get the dumb path).
- Ceiling = control. Every flag overrides a smart default, documented with "use when." Flags exist to let an agent that knows better disagree with the classifier — never to turn on intelligence.
One-line synthesis: Stop tuning a flat ranked list. Make the default a self-grading, structurally-precise pipeline — exact-tier lookups for facts, autocut+rerank for open queries, CRAG-style auto-escalation to think/web only when a real confidence signal says the cheap path failed. Precision becomes structural (tenure), result-size adaptive (autocut), ceiling automatic (CRAG/Adaptive-RAG) — manual knobs available but never required.
Sources
PrecisionMemBench/tenure (tiered contract, 1.00 precision); MemPalace (model-agnostic LLM rerank); Mastra OM (no-retrieval default); mem0 (tiered thresholds); Zep/Graphiti (swappable rerankers); LlamaIndex (postprocessors/routers); Elasticsearch (min_score normalization); Weaviate (autocut); Self-RAG (arXiv 2310.11511); CRAG (arXiv 2401.15884); Adaptive-RAG (arXiv 2403.14403, NAACL 2024); Anthropic Contextual Retrieval + Cohere Rerank. Full per-system analysis with links in the research doc backing this issue.
TL;DR
gbrain's default retrieval (uncapped hybrid → ~20 results, recall ~0.99, precision 0.076–0.14 on PrecisionMemBench) is the worst-scoring default in its peer set on the metric that matters: active retrieval (did the right thing return, and only the right thing). The benchmark's own words: "Returning everything is not retrieval."
--adaptive-return(v0.41.33) tightens the set, but tight only helps if #1 is right — and on noisy/migrated cases it makes a wrong answer more confident by hiding the alternatives. We have three modes (hybrid / adaptive-tight / think) and none wins, because all three tune the same flat-ranked-list contract.This issue proposes the floor/ceiling redesign, grounded in how the 11 systems in
gbrain-evals/docs/comparison-systems.mdactually solve it.The design question (from review)
The answer from the literature: the default does the cheap thing; a confidence signal — not the user — decides when to climb. The ceiling is automatic by default, manual when forced.
What each peer system does
thresholdparam (uncalibrated → 0.06 precision)SimilarityPostprocessor,RouterQueryEngine, auto-retrieversizemin_scoreafter normalization; LTRFour design archetypes that recur
Winners combine 1+2+3. Losers lean on 4 alone.
Cross-cutting truth: raw cosine is the weakest confidence signal (index-dependent, uncalibrated, smooth). gbrain currently uses exactly this. Every system that gets precision right replaces it — cross-encoder rerank score, score discontinuity, self-grade, or a structured lookup contract.
Recommendation: self-grading, structurally-precise default
The default pipeline (smart out of box, zero config)
Four borrowed mechanisms, each proven:
pinnedFacts/openQuestionstiers. Route exact answers through the pinned/alias-resolved tier; structurally exclude superseded + out-of-scope beliefs (not down-rank). Biggest single precision lever, mostly architecture not model.think(broad gather + narrow cite) only then, so its ~5s cost is paid only on hard queries. The agent never picks the mode; a classifier + confidence band does.The ceiling (knobs the agent CAN pull, never has to)
Expose but default-to-auto:
min_confidence(mem0-tiered 0.8/0.6/0.4 by stakes),max_resultsoverride,force-think, reranker choice (Haiku→Sonnet→cross-encoder),web_fallbacktoggle.The tool descriptions should literally say "for the best outcome on X, try Y" — discoverable without leaving the tool.
Why this beats the current three modes
adaptive-tightdoes (44/77).thinkcost fires only when the confidence gate trips.The split rule (the principle)
adaptiveReturnis default-OFF, so 90% of callers get the dumb path).One-line synthesis: Stop tuning a flat ranked list. Make the default a self-grading, structurally-precise pipeline — exact-tier lookups for facts, autocut+rerank for open queries, CRAG-style auto-escalation to
think/web only when a real confidence signal says the cheap path failed. Precision becomes structural (tenure), result-size adaptive (autocut), ceiling automatic (CRAG/Adaptive-RAG) — manual knobs available but never required.Sources
PrecisionMemBench/tenure (tiered contract, 1.00 precision); MemPalace (model-agnostic LLM rerank); Mastra OM (no-retrieval default); mem0 (tiered thresholds); Zep/Graphiti (swappable rerankers); LlamaIndex (postprocessors/routers); Elasticsearch (min_score normalization); Weaviate (autocut); Self-RAG (arXiv 2310.11511); CRAG (arXiv 2401.15884); Adaptive-RAG (arXiv 2403.14403, NAACL 2024); Anthropic Contextual Retrieval + Cohere Rerank. Full per-system analysis with links in the research doc backing this issue.