feat: smart-default + high-ceiling retrieval — confidence-graded auto-escalation (floor/ceiling redesign)

## TL;DR

gbrain's default retrieval (uncapped hybrid → ~20 results, recall ~0.99, **precision 0.076–0.14** on PrecisionMemBench) is the worst-scoring default in its peer set on the metric that matters: *active retrieval* (did the right thing return, and only the right thing). The benchmark's own words: **"Returning everything is not retrieval."**

`--adaptive-return` (v0.41.33) tightens the set, but tight only helps if #1 is right — and on noisy/migrated cases it makes a wrong answer *more* confident by hiding the alternatives. We have three modes (hybrid / adaptive-tight / think) and **none wins**, because all three tune the same flat-ranked-list contract.

This issue proposes the floor/ceiling redesign, grounded in how the 11 systems in `gbrain-evals/docs/comparison-systems.md` actually solve it.

## The design question (from review)

> How do we make the out-of-box **default** the smartest possible (lowest bar, simplest, best for everyone) AND give a high **ceiling** (agent pulls every lever for the best possible outcome on a hard task)?

The answer from the literature: **the default does the cheap thing; a confidence signal — not the user — decides when to climb.** The ceiling is automatic by default, manual when forced.

## What each peer system does

| System | Default | Confidence signal | Auto-escalation | Precision stance |
|---|---|---|---|---|
| **tenure** (1.00 precision) | tiered belief-state lookup (relevant/pinned/open-questions) | structural — no threshold needed | none needed | **precision is architecture, not tuning** |
| **MemPalace** | raw semantic (96.6% R@5, no LLM) | distance | opt-in LLM-reader rerank → ≥99%, model-agnostic | recall-headline, honest about the split |
| **Mastra OM** | *no per-turn retrieval* — curated observation log replaces history | emoji priority tags 🔴🟡🟢 | token-budget triggers | sidesteps tradeoff — don't retrieve noise |
| **mem0** | top-k, permissive | `threshold` param (uncalibrated → 0.06 precision) | tiered cutoffs (0.8/0.6/0.4 by stakes) | precision = developer's knob |
| **Zep** | hybrid + RRF | fact-extraction confidence | swappable rerankers (RRF/MMR/cross-encoder/graph) | structural supersession, broad default |
| **LlamaIndex** | top-k (no cutoff) | per-node score | `SimilarityPostprocessor`, `RouterQueryEngine`, auto-retriever | permissive default + composable knobs |
| **Elasticsearch** | top-`size` | raw score (**not portable** w/o normalization) | `min_score` after normalization; LTR | measure, don't guess the cutoff |
| **Weaviate** | top-10 | distance | **autocut** (cut at score discontinuity) + rerank | adaptive sizing, no magic number |
| **Self-RAG** | model decides *whether* to retrieve | reflection tokens (ISREL/ISSUP/ISUSE) | self-grade → re-retrieve | adaptive by default |
| **CRAG** | single retrieve | evaluator → Correct/Ambiguous/Incorrect | **low confidence → query-rewrite + web fallback** | the canonical confidence→fallback loop |
| **Adaptive-RAG** | complexity classifier | query difficulty | route to 0 / 1 / N-hop retrieval | route by difficulty, not score |

## Four design archetypes that recur

1. **Adaptive escalation** (CRAG, Adaptive-RAG, Self-RAG) — default = cheapest move; a confidence/complexity signal decides when to climb. *The user never sets a threshold for the common case.*
2. **Adaptive result-set sizing** (Weaviate autocut) — cut where the **score curve breaks**, not at a fixed k. Returns 1 when the answer is obvious, the cluster when it's genuinely several, never 20 because k=20. **Direct fix for our "20 vs 1" problem, no LLM call.**
3. **Structural precision** (tenure) — you can't tune your way to 1.00 on a flat ranked list; you change the contract. Tiered results + alias resolution + supersession + "what must NOT return." Correct by construction, then *fast* because it's not over-retrieving.
4. **Composable knob menu** (LlamaIndex/Zep/mem0) — permissive default + small orthogonal levers. The systems that *only* do this (mem0/Zep/vector DBs) leave precision as homework → 0.06–0.09 default precision.

**Winners combine 1+2+3. Losers lean on 4 alone.**

Cross-cutting truth: **raw cosine is the weakest confidence signal** (index-dependent, uncalibrated, smooth). gbrain currently uses exactly this. Every system that gets precision right replaces it — cross-encoder rerank score, score discontinuity, self-grade, or a structured lookup contract.

## Recommendation: self-grading, structurally-precise default

### The default pipeline (smart out of box, zero config)

```
query
  → route by shape (factual-lookup | open-analytical)        [Adaptive-RAG router]
  → factual: pinned/alias tier, scope+supersession filters   [tenure-style contract]  → exact, ~10ms
  → open:    hybrid retrieve top-20, normalize scores         [score normalization]
             → cross-encoder / LLM-reader rerank              [rerank-as-confidence]
             → AUTOCUT at score discontinuity                 [Weaviate autocut]
             → if top-confidence < band: escalate to `think`  [CRAG fallback]   → only on hard queries
```

Four borrowed mechanisms, each proven:

1. **Tiered belief-state contract** (tenure) — gbrain already has `pinnedFacts`/`openQuestions` tiers. Route exact answers through the pinned/alias-resolved tier; structurally exclude superseded + out-of-scope beliefs (not down-rank). **Biggest single precision lever, mostly architecture not model.**
2. **Autocut** (Weaviate) — normalize hybrid scores, cut at the largest gap. Adaptive size, no magic number.
3. **Rerank-as-confidence** (MemPalace + Cohere) — over-retrieve top-20, rerank, use the top rerank score as a real gate. High → return autocut set; low → the system *knows* it found nothing good. Model-agnostic (Haiku-class is enough).
4. **CRAG self-escalation** — when top confidence is below a band, auto-escalate to `think` (broad gather + narrow cite) *only then*, so its ~5s cost is paid only on hard queries. The agent never picks the mode; a classifier + confidence band does.

### The ceiling (knobs the agent CAN pull, never has to)

Expose but default-to-auto: `min_confidence` (mem0-tiered 0.8/0.6/0.4 by stakes), `max_results` override, `force-think`, reranker choice (Haiku→Sonnet→cross-encoder), `web_fallback` toggle.

The tool descriptions should literally say *"for the best outcome on X, try Y"* — discoverable without leaving the tool.

### Why this beats the current three modes

- **Default precision jumps** — exact lookups never touch the noisy semantic path (structural); open queries get autocut+rerank instead of fixed-20.
- **Recall is protected** — escalation is automatic. When confidence is low the system widens *itself*, so tight defaults don't permanently lose recall the way `adaptive-tight` does (44/77).
- **Latency stays low on the common case** — the 5s `think` cost fires only when the confidence gate trips.
- **High ceiling for free** — the out-of-box default already self-escalates, so the lowest-bar user gets near-ceiling behavior with zero config; a power agent can still force every lever.

### The split rule (the principle)

- **Floor = behavior.** gbrain does the smart thing. No flag enables it — it's what happens when you type the simplest command. *Never make the smart path an opt-in flag* (that's the v0.41.33 mistake: `adaptiveReturn` is default-OFF, so 90% of callers get the dumb path).
- **Ceiling = control.** Every flag *overrides* a smart default, documented with "use when." Flags exist to let an agent that knows better disagree with the classifier — never to turn on intelligence.

**One-line synthesis:** Stop tuning a flat ranked list. Make the default a self-grading, structurally-precise pipeline — exact-tier lookups for facts, autocut+rerank for open queries, CRAG-style auto-escalation to `think`/web only when a real confidence signal says the cheap path failed. Precision becomes structural (tenure), result-size adaptive (autocut), ceiling automatic (CRAG/Adaptive-RAG) — manual knobs available but never required.

## Sources

PrecisionMemBench/tenure (tiered contract, 1.00 precision); MemPalace (model-agnostic LLM rerank); Mastra OM (no-retrieval default); mem0 (tiered thresholds); Zep/Graphiti (swappable rerankers); LlamaIndex (postprocessors/routers); Elasticsearch (min_score normalization); Weaviate (autocut); Self-RAG (arXiv 2310.11511); CRAG (arXiv 2401.15884); Adaptive-RAG (arXiv 2403.14403, NAACL 2024); Anthropic Contextual Retrieval + Cohere Rerank. Full per-system analysis with links in the research doc backing this issue.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: smart-default + high-ceiling retrieval — confidence-graded auto-escalation (floor/ceiling redesign) #1663

TL;DR

The design question (from review)

What each peer system does

Four design archetypes that recur

Recommendation: self-grading, structurally-precise default

The default pipeline (smart out of box, zero config)

The ceiling (knobs the agent CAN pull, never has to)

Why this beats the current three modes

The split rule (the principle)

Sources

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

System	Default	Confidence signal	Auto-escalation	Precision stance
tenure (1.00 precision)	tiered belief-state lookup (relevant/pinned/open-questions)	structural — no threshold needed	none needed	precision is architecture, not tuning
MemPalace	raw semantic (96.6% R@5, no LLM)	distance	opt-in LLM-reader rerank → ≥99%, model-agnostic	recall-headline, honest about the split
Mastra OM	no per-turn retrieval — curated observation log replaces history	emoji priority tags 🔴🟡🟢	token-budget triggers	sidesteps tradeoff — don't retrieve noise
mem0	top-k, permissive	`threshold` param (uncalibrated → 0.06 precision)	tiered cutoffs (0.8/0.6/0.4 by stakes)	precision = developer's knob
Zep	hybrid + RRF	fact-extraction confidence	swappable rerankers (RRF/MMR/cross-encoder/graph)	structural supersession, broad default
LlamaIndex	top-k (no cutoff)	per-node score	`SimilarityPostprocessor`, `RouterQueryEngine`, auto-retriever	permissive default + composable knobs
Elasticsearch	top-`size`	raw score (not portable w/o normalization)	`min_score` after normalization; LTR	measure, don't guess the cutoff
Weaviate	top-10	distance	autocut (cut at score discontinuity) + rerank	adaptive sizing, no magic number
Self-RAG	model decides whether to retrieve	reflection tokens (ISREL/ISSUP/ISUSE)	self-grade → re-retrieve	adaptive by default
CRAG	single retrieve	evaluator → Correct/Ambiguous/Incorrect	low confidence → query-rewrite + web fallback	the canonical confidence→fallback loop
Adaptive-RAG	complexity classifier	query difficulty	route to 0 / 1 / N-hop retrieval	route by difficulty, not score

feat: smart-default + high-ceiling retrieval — confidence-graded auto-escalation (floor/ceiling redesign) #1663

Description

TL;DR

The design question (from review)

What each peer system does

Four design archetypes that recur

Recommendation: self-grading, structurally-precise default

The default pipeline (smart out of box, zero config)

The ceiling (knobs the agent CAN pull, never has to)

Why this beats the current three modes

The split rule (the principle)

Sources

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions