Skip to content

feat: smart-default + high-ceiling retrieval — confidence-graded auto-escalation (floor/ceiling redesign) #1663

@garrytan-agents

Description

@garrytan-agents

TL;DR

gbrain's default retrieval (uncapped hybrid → ~20 results, recall ~0.99, precision 0.076–0.14 on PrecisionMemBench) is the worst-scoring default in its peer set on the metric that matters: active retrieval (did the right thing return, and only the right thing). The benchmark's own words: "Returning everything is not retrieval."

--adaptive-return (v0.41.33) tightens the set, but tight only helps if #1 is right — and on noisy/migrated cases it makes a wrong answer more confident by hiding the alternatives. We have three modes (hybrid / adaptive-tight / think) and none wins, because all three tune the same flat-ranked-list contract.

This issue proposes the floor/ceiling redesign, grounded in how the 11 systems in gbrain-evals/docs/comparison-systems.md actually solve it.

The design question (from review)

How do we make the out-of-box default the smartest possible (lowest bar, simplest, best for everyone) AND give a high ceiling (agent pulls every lever for the best possible outcome on a hard task)?

The answer from the literature: the default does the cheap thing; a confidence signal — not the user — decides when to climb. The ceiling is automatic by default, manual when forced.

What each peer system does

System Default Confidence signal Auto-escalation Precision stance
tenure (1.00 precision) tiered belief-state lookup (relevant/pinned/open-questions) structural — no threshold needed none needed precision is architecture, not tuning
MemPalace raw semantic (96.6% R@5, no LLM) distance opt-in LLM-reader rerank → ≥99%, model-agnostic recall-headline, honest about the split
Mastra OM no per-turn retrieval — curated observation log replaces history emoji priority tags 🔴🟡🟢 token-budget triggers sidesteps tradeoff — don't retrieve noise
mem0 top-k, permissive threshold param (uncalibrated → 0.06 precision) tiered cutoffs (0.8/0.6/0.4 by stakes) precision = developer's knob
Zep hybrid + RRF fact-extraction confidence swappable rerankers (RRF/MMR/cross-encoder/graph) structural supersession, broad default
LlamaIndex top-k (no cutoff) per-node score SimilarityPostprocessor, RouterQueryEngine, auto-retriever permissive default + composable knobs
Elasticsearch top-size raw score (not portable w/o normalization) min_score after normalization; LTR measure, don't guess the cutoff
Weaviate top-10 distance autocut (cut at score discontinuity) + rerank adaptive sizing, no magic number
Self-RAG model decides whether to retrieve reflection tokens (ISREL/ISSUP/ISUSE) self-grade → re-retrieve adaptive by default
CRAG single retrieve evaluator → Correct/Ambiguous/Incorrect low confidence → query-rewrite + web fallback the canonical confidence→fallback loop
Adaptive-RAG complexity classifier query difficulty route to 0 / 1 / N-hop retrieval route by difficulty, not score

Four design archetypes that recur

  1. Adaptive escalation (CRAG, Adaptive-RAG, Self-RAG) — default = cheapest move; a confidence/complexity signal decides when to climb. The user never sets a threshold for the common case.
  2. Adaptive result-set sizing (Weaviate autocut) — cut where the score curve breaks, not at a fixed k. Returns 1 when the answer is obvious, the cluster when it's genuinely several, never 20 because k=20. Direct fix for our "20 vs 1" problem, no LLM call.
  3. Structural precision (tenure) — you can't tune your way to 1.00 on a flat ranked list; you change the contract. Tiered results + alias resolution + supersession + "what must NOT return." Correct by construction, then fast because it's not over-retrieving.
  4. Composable knob menu (LlamaIndex/Zep/mem0) — permissive default + small orthogonal levers. The systems that only do this (mem0/Zep/vector DBs) leave precision as homework → 0.06–0.09 default precision.

Winners combine 1+2+3. Losers lean on 4 alone.

Cross-cutting truth: raw cosine is the weakest confidence signal (index-dependent, uncalibrated, smooth). gbrain currently uses exactly this. Every system that gets precision right replaces it — cross-encoder rerank score, score discontinuity, self-grade, or a structured lookup contract.

Recommendation: self-grading, structurally-precise default

The default pipeline (smart out of box, zero config)

query
  → route by shape (factual-lookup | open-analytical)        [Adaptive-RAG router]
  → factual: pinned/alias tier, scope+supersession filters   [tenure-style contract]  → exact, ~10ms
  → open:    hybrid retrieve top-20, normalize scores         [score normalization]
             → cross-encoder / LLM-reader rerank              [rerank-as-confidence]
             → AUTOCUT at score discontinuity                 [Weaviate autocut]
             → if top-confidence < band: escalate to `think`  [CRAG fallback]   → only on hard queries

Four borrowed mechanisms, each proven:

  1. Tiered belief-state contract (tenure) — gbrain already has pinnedFacts/openQuestions tiers. Route exact answers through the pinned/alias-resolved tier; structurally exclude superseded + out-of-scope beliefs (not down-rank). Biggest single precision lever, mostly architecture not model.
  2. Autocut (Weaviate) — normalize hybrid scores, cut at the largest gap. Adaptive size, no magic number.
  3. Rerank-as-confidence (MemPalace + Cohere) — over-retrieve top-20, rerank, use the top rerank score as a real gate. High → return autocut set; low → the system knows it found nothing good. Model-agnostic (Haiku-class is enough).
  4. CRAG self-escalation — when top confidence is below a band, auto-escalate to think (broad gather + narrow cite) only then, so its ~5s cost is paid only on hard queries. The agent never picks the mode; a classifier + confidence band does.

The ceiling (knobs the agent CAN pull, never has to)

Expose but default-to-auto: min_confidence (mem0-tiered 0.8/0.6/0.4 by stakes), max_results override, force-think, reranker choice (Haiku→Sonnet→cross-encoder), web_fallback toggle.

The tool descriptions should literally say "for the best outcome on X, try Y" — discoverable without leaving the tool.

Why this beats the current three modes

  • Default precision jumps — exact lookups never touch the noisy semantic path (structural); open queries get autocut+rerank instead of fixed-20.
  • Recall is protected — escalation is automatic. When confidence is low the system widens itself, so tight defaults don't permanently lose recall the way adaptive-tight does (44/77).
  • Latency stays low on the common case — the 5s think cost fires only when the confidence gate trips.
  • High ceiling for free — the out-of-box default already self-escalates, so the lowest-bar user gets near-ceiling behavior with zero config; a power agent can still force every lever.

The split rule (the principle)

  • Floor = behavior. gbrain does the smart thing. No flag enables it — it's what happens when you type the simplest command. Never make the smart path an opt-in flag (that's the v0.41.33 mistake: adaptiveReturn is default-OFF, so 90% of callers get the dumb path).
  • Ceiling = control. Every flag overrides a smart default, documented with "use when." Flags exist to let an agent that knows better disagree with the classifier — never to turn on intelligence.

One-line synthesis: Stop tuning a flat ranked list. Make the default a self-grading, structurally-precise pipeline — exact-tier lookups for facts, autocut+rerank for open queries, CRAG-style auto-escalation to think/web only when a real confidence signal says the cheap path failed. Precision becomes structural (tenure), result-size adaptive (autocut), ceiling automatic (CRAG/Adaptive-RAG) — manual knobs available but never required.

Sources

PrecisionMemBench/tenure (tiered contract, 1.00 precision); MemPalace (model-agnostic LLM rerank); Mastra OM (no-retrieval default); mem0 (tiered thresholds); Zep/Graphiti (swappable rerankers); LlamaIndex (postprocessors/routers); Elasticsearch (min_score normalization); Weaviate (autocut); Self-RAG (arXiv 2310.11511); CRAG (arXiv 2401.15884); Adaptive-RAG (arXiv 2403.14403, NAACL 2024); Anthropic Contextual Retrieval + Cohere Rerank. Full per-system analysis with links in the research doc backing this issue.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions