Every RAG system returns an answer even when it doesn't have one. This model teaches retrieval to say "I don't know."
A lightweight, community-trainable model for retrieval abstention — built for the BEAM benchmark (ICLR 2026), usable in any RAG system.
Retrieval-Augmented Generation systems always return results. Ask about something that doesn't exist in the knowledge base, and the system confidently hands you the least-bad match. The reader LLM then hallucinates an answer based on irrelevant context.
On the BEAM benchmark — the hardest long-term memory benchmark — abstention is where every system fails. Diagnostic data on BEAM shows that abstention queries receive higher retrieval scores on average (0.926) than many answerable queries, so score-based thresholds cannot separate them.
Existing NLI models (MNLI/SNLI) check logical entailment, not retrieval relevance. They weren't designed for this task.
A binary classifier specifically trained to answer: "Does this passage contain information that answers this query?"
| Property | Value |
|---|---|
| Architecture | DistilBERT fine-tuned (66M params) |
| Input | (query, passage) pair |
| Output | confidence score [0, 1] |
| Model size | 64 MB (INT8 quantized ONNX) |
| Inference | ~32 ms per pair on CPU |
| Training | MLX (Apple Silicon) or PyTorch |
v0.1 — working discriminator trained on 19k BEAM pairs.
| Metric | Score |
|---|---|
| Accuracy | 72.8% |
| Precision | 67.6% |
| Recall | 80.1% |
| F1 | 0.733 |
Score range on diverse test queries: 0.215 to 0.830 (proper discrimination, not clustered around 0.5).
Generated from BEAM splits 100K + 500K + 1M (90 conversations total):
- 19,111 pairs (9,199 relevant / 9,912 irrelevant — 48/52 balance)
- Hard negatives mined via cosine similarity to the query (semantically close but non-answering passages); see the sketch after this list
- Easy negatives added for clear contrast signal
- 1:1 ratio of hard to easy negatives per question
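A minimal sketch of the hard-negative mining step referenced above, assuming sentence-transformers embeddings. The embedding model and helper name are illustrative, not the repo's exact script:

```python
# Illustrative hard-negative mining via cosine similarity to the query.
# Embedding model and function name are assumptions, not the repo's script.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def mine_hard_negatives(query: str, answer: str, corpus: list[str], k: int = 1) -> list[str]:
    """Return the k passages most similar to the query that do NOT answer it."""
    candidates = [p for p in corpus if p != answer]
    sims = util.cos_sim(
        embedder.encode(query, convert_to_tensor=True),
        embedder.encode(candidates, convert_to_tensor=True),
    )[0]
    top = sims.argsort(descending=True)[:k]
    return [candidates[int(i)] for i in top]  # semantically close, non-answering
```

Pairing each question with one hard negative mined this way and one random easy negative gives the 1:1 ratio above.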
| Version | Pairs | Approach | F1 | Notes |
|---|---|---|---|---|
| v1 | 1,448 | random negatives | 0.836* | * biased — predicted "relevant" for everything |
| v2 | 2,198 | all hard negatives | 0.330 | swung to "irrelevant" for everything |
| v3 | 2,390 | balanced, 3 epochs | 0.679 | undertrained |
| v4 | 2,390 | balanced, 5 epochs | 0.690 | data too small |
| v5 | 19,111 | balanced, 5 epochs | 0.733 | shipped as v0.1 |
The model improves over time through community-contributed labeled data — see Contributing.
```bash
pip install cortex-beam-abstain
```

```python
from cortex_beam_abstain import AbstentionClassifier

clf = AbstentionClassifier()  # auto-downloads model from HuggingFace

# Single prediction
score = clf.predict(
    query="What recipe did they discuss?",
    passage="We talked about the new API design for authentication.",
)
# score < 0.3 → should abstain

# Batch prediction
scores = clf.predict_batch([
    ("What language does the user prefer?", "The user said they always use TypeScript"),
    ("What recipe did they discuss?", "We fixed the database migration issue"),
])

# Decision: should the system abstain entirely?
if clf.should_abstain("query", ["passage1", "passage2"], threshold=0.3):
    return []  # No relevant results — abstain
```

If no model is available, the classifier falls back to a token-overlap heuristic:

```python
clf = AbstentionClassifier(use_heuristic=True)
```

Or to the raw cosine gap heuristic from BEAM diagnostic data (Cohen's d = 1.01).
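For intuition, the token-overlap fallback can be as simple as the sketch below. This is an illustration, not the package's exact implementation:

```python
# Illustrative token-overlap fallback, not the package's exact implementation.
def overlap_score(query: str, passage: str) -> float:
    """Fraction of query tokens that also appear in the passage."""
    q_tokens = set(query.lower().split())
    p_tokens = set(passage.lower().split())
    return len(q_tokens & p_tokens) / max(len(q_tokens), 1)

overlap_score("What recipe did they discuss?", "We fixed the migration issue")  # low, so abstain
```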
```bash
# Generate seed data from BEAM
python scripts/generate_seed_data.py --output data/seed/beam.jsonl --limit 20

# Train with PyTorch
python scripts/train_torch.py \
    --data data/ \
    --output checkpoints/v1 \
    --epochs 3 \
    --lr 2e-5 \
    --batch-size 16 \
    --eval

# Export to quantized ONNX
python scripts/export_onnx.py \
    --checkpoint checkpoints/v1 \
    --output models/abstention.onnx \
    --quantize int8
```

The export script verifies that the ONNX output matches PyTorch (max diff < 1e-3) before quantizing.
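That verification amounts to comparing logits between the two runtimes. A sketch, assuming a standard HuggingFace sequence-classification checkpoint; the paths and ONNX input names here are assumptions:

```python
# Sketch of the ONNX-vs-PyTorch parity check; paths and the assumption of a
# standard HuggingFace sequence-classification checkpoint are illustrative.
import numpy as np
import onnxruntime as ort
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("checkpoints/v1")
model = AutoModelForSequenceClassification.from_pretrained("checkpoints/v1").eval()

enc = tok("What recipe did they discuss?", "We fixed the database migration issue",
          truncation=True, max_length=256, return_tensors="pt")

with torch.no_grad():
    torch_logits = model(**enc).logits.numpy()

sess = ort.InferenceSession("models/abstention.onnx")
onnx_logits = sess.run(None, {k: v.numpy() for k, v in enc.items()})[0]

assert np.abs(torch_logits - onnx_logits).max() < 1e-3  # parity check before quantizing
```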
The MLX training script is scaffolded in `scripts/train_mlx.py`, but the LoRA fine-tuning loop is still under development. Use PyTorch for now.
This model only gets better with more labeled data. Every contribution helps.
JSONL files in `data/community/`:

```jsonl
{"query": "What color was the car?", "passage": "They discussed the API rate limiting strategy.", "label": "irrelevant", "source": "user", "contributor": "your_github_handle"}
{"query": "What framework do they use?", "passage": "We decided to switch from React to Vue.", "label": "relevant", "source": "user", "contributor": "your_github_handle"}
```

- Fork this repo
- Add labeled JSONL files to `data/community/your_handle.jsonl`
- Run `python scripts/validate_data.py` to check format (a sketch of the check follows this list)
- Open a PR — CI validates automatically
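A sketch of what the format check covers; the authoritative rules live in `scripts/validate_data.py`, and the required fields below are read off the example above:

```python
# Minimal JSONL format check mirroring the example above; the real validator
# is scripts/validate_data.py, so treat this as an approximation.
import json
import sys

REQUIRED = {"query", "passage", "label", "source", "contributor"}

def validate(path: str) -> None:
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f, start=1):
            row = json.loads(line)  # each line must be one JSON object
            missing = REQUIRED - row.keys()
            if missing:
                sys.exit(f"{path}:{n}: missing fields {sorted(missing)}")
            if row["label"] not in ("relevant", "irrelevant"):
                sys.exit(f"{path}:{n}: label must be 'relevant' or 'irrelevant'")

validate("data/community/your_handle.jsonl")
```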
The most valuable contributions are hard negatives — passages that are topically similar to the query but don't actually answer it. These are exactly the cases where naive retrieval fails.
See CONTRIBUTING.md for labeling guidelines.
- Base model: DistilBERT (`distilbert-base-uncased`, 66M params)
- Fine-tuning: full fine-tune via HuggingFace Trainer (LoRA via MLX is WIP)
- Classification: binary (relevant / irrelevant)
- Input format: `[CLS] query [SEP] passage [SEP]`, max 256 tokens (see the encoding sketch after this list)
- Export: ONNX opset 14, INT8 dynamic quantization
- Verification: ONNX outputs verified against PyTorch (max diff = 0.000001)
- Fallback: raw cosine gap heuristic (Cohen's d = 1.01 on BEAM diagnostic)
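The input format in practice, shown with the HuggingFace tokenizer. The packaged classifier handles this internally; this snippet is just for illustration:

```python
# How a (query, passage) pair is encoded, shown with the HF tokenizer;
# the packaged classifier does this internally.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
enc = tok(
    "What recipe did they discuss?",        # query
    "We talked about the new API design.",  # passage
    truncation=True, max_length=256,
)
print(tok.decode(enc["input_ids"]))
# [CLS] what recipe did they discuss? [SEP] we talked about the new api design. [SEP]
```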
- cortex — part of the Cortex memory system family
- beam — trained for and evaluated against BEAM
- abstain — what it teaches RAG to do when it doesn't actually know
The repo name (`cortex-know-when-to-stop-training-model`) is the long descriptive form. The Python package (`cortex_beam_abstain`) is the terse version for imports.
- Cortex — Persistent memory for Claude Code (parent project)
- BEAM Benchmark — Long-term memory evaluation (ICLR 2026)
MIT