Local document memory with hybrid retrieval. Single SQLite file. Zero cloud dependencies for search. Beats ColBERTv2 on 5/5 BEIR datasets with the tuned bge-small-rrf-v3 model. ~50 ms p50 at 50K chunks on Apple Silicon with the optional snapvec backend (~73 ms on the default sqlite-vec for dense corpora).
```bash
pip install vstash
vstash add paper.pdf notes.md https://example.com/article
vstash search "what's the main argument?"
```

| Dataset | Docs | vstash (v3) | ColBERTv2 | BM25 | Δ vs ColBERTv2 |
|---|---|---|---|---|---|
| SciFact | 5.2K | 0.9361 | 0.693 | 0.665 | +0.243 |
| NFCorpus | 3.6K | 0.3927 | 0.344 | 0.325 | +0.049 |
| SciDocs | 25.7K | 0.3693 | 0.154 | 0.158 | +0.215 |
| FiQA | 57.6K | 0.7506 | 0.356 | 0.236 | +0.395 |
| ArguAna | 8.7K | 0.7540 | 0.463 | 0.315 | +0.291 |
Absolute NDCG@10 on BEIR via the full production retrieval pipeline (RRF hybrid + adaptive weights + MMR dedup + IDF, 2026-04-19). Tuned model: Stffens/bge-small-rrf-v3 (33M params, 384d). v3 beats ColBERTv2 on 5/5 BEIR datasets and improves macro NDCG@10 by +0.016 absolute over bge-small-rrf-v2 (0.6405 vs 0.6246). Training-time eval uses a batched path that skips MMR/IDF for speed; absolute NDCG@10 differs by a few percent vs the production numbers above, but baseline-vs-final deltas are preserved. See experiments/results/v2_v3_head_to_head.json for the full table (reproduce via python -m experiments.v2_v3_head_to_head) and the methodological note in experiments/hypotheses.md for the pipeline-shift caveat.
```
Query --> Embed --+--> Vector ANN (sqlite-vec) --+
                  |                              +--> Adaptive RRF --> MMR Dedup --> Results
                  +--> FTS5 BM25 ----------------+
```
- Hybrid search: vector + keyword, fused via Reciprocal Rank Fusion.
- Adaptive RRF: IDF-based per-query weights. Rare terms boost the keyword branch, common terms boost the vector branch (see the sketch after this list).
- MMR dedup: diverse sections from long documents, not redundant chunks from one.
- Self-tuned, gated: `vstash retrain` fine-tunes embeddings from your own disagreement signal; the eval gate refuses regressions.
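
To make the fusion and dedup stages concrete, here is a minimal sketch in plain Python. The function names, the `K = 60` constant, and the exact IDF-weighting heuristic are illustrative assumptions, not vstash's actual internals.

```python
import math

K = 60  # standard RRF damping constant

def fuse_adaptive(vec_ranked, fts_ranked, query_terms, idf, n_docs):
    """Reciprocal Rank Fusion with IDF-driven per-query branch weights."""
    # Rare query terms (high mean IDF) shift weight toward the keyword branch;
    # common terms shift it toward the vector branch.
    mean_idf = sum(idf.get(t, 0.0) for t in query_terms) / max(len(query_terms), 1)
    w_fts = min(1.0, mean_idf / math.log(max(n_docs, 2)))
    w_vec = 1.0 - w_fts
    scores = {}
    for weight, ranked in ((w_vec, vec_ranked), (w_fts, fts_ranked)):
        for rank, doc_id in enumerate(ranked):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (K + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def mmr_dedup(ranked, relevance, similarity, lam=0.7, top_k=10):
    """Maximal Marginal Relevance: trade raw relevance against redundancy."""
    selected, candidates = [], list(ranked)
    while candidates and len(selected) < top_k:
        best = max(
            candidates,
            key=lambda d: lam * relevance[d]
            - (1 - lam) * max((similarity(d, s) for s in selected), default=0.0),
        )
        selected.append(best)
        candidates.remove(best)
    return selected
```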
```bash
pip install vstash             # SDK + search
pip install 'vstash[ingest]'   # + PDF, DOCX, PPTX parsing
pip install 'vstash[serve]'    # + web UI (vstash serve)
pip install 'vstash[all]'      # everything
```

```bash
# Ingest: files, folders, URLs
vstash add report.pdf ~/notes/ https://arxiv.org/abs/2310.06825
# Search: local, no API key
vstash search "what is the proposed method?"
# Ask: needs a local LLM, auto-detects Ollama / LM Studio
vstash ask "summarize the key findings"
vstash chat # interactive
# Fine-tune on your own corpus (eval-gated, refuses regressions)
vstash retrain
vstash reindex --model ~/.vstash/models/retrained
```

```python
from vstash import Memory

mem = Memory(project="my_agent")
mem.add("docs/spec.pdf")
mem.remember("OAuth uses PKCE for public clients", title="auth-notes")
results = mem.search("deployment strategy", top_k=5)
for r in results:
    print(r.text, r.score, r.collection, r.tags, r.added_at)
answer = mem.ask("What are the system requirements?")
```

| Command | Description |
|---|---|
| `vstash add <file/dir/url>` | Add documents to memory |
| `vstash remember "<text>"` | Ingest text directly |
| `vstash search "<query>"` | Semantic search (free, local) |
| `vstash ask "<question>"` | Answer from your documents (needs LLM) |
| `vstash chat` | Interactive Q&A |
| `vstash list` | Show all documents |
| `vstash stats` | Memory statistics |
| `vstash forget <file>` | Remove a document |
| `vstash retrain` | Fine-tune embeddings on your data |
| `vstash reindex` | Re-embed with a new model |
| `vstash watch <dir>` | Auto-ingest on file changes |
| `vstash serve` | Web UI on localhost |
| `vstash check [--repair]` | Integrity check and repair |
| `vstash config` | Show configuration |
| `vstash profile <cmd>` | Manage named profiles |
| `vstash journal <cmd>` | Cross-session agent memory |
16 tools for Claude Desktop, Claude Code, Cursor, or any MCP client:
```bash
vstash-mcp   # start MCP server
```

```json
{
  "mcpServers": {
    "vstash": {
      "command": "vstash-mcp"
    }
  }
}
```

vstash can tune its own embedding model to your corpus, without any human labels.
```bash
vstash retrain                                      # generate training pairs + fine-tune
vstash reindex --model ~/.vstash/models/retrained
```

How it works, in one paragraph. When you search your corpus, the vector and keyword halves of the pipeline sometimes rank different documents at the top. Those disagreements are a free signal: a document both halves picked is probably relevant; one that only a single half picked might not be. vstash turns these disagreements into training pairs and fine-tunes the embedding model on them. The run is eval-gated: it evaluates the candidate against the base model on a held-out slice of your corpus and refuses to save a model that performs worse.
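
A minimal sketch of that pair-mining step, assuming each retrieval branch returns a ranked list of chunk IDs; `mine_disagreement_pairs`, the `top=3` cutoff, and the triple format are illustrative, not vstash's exact recipe.

```python
def mine_disagreement_pairs(query, vec_ranked, fts_ranked, top=3):
    """Turn top-rank disagreement between the two branches into training triples."""
    vec_top, fts_top = set(vec_ranked[:top]), set(fts_ranked[:top])
    positives = vec_top & fts_top   # both branches agree: likely relevant
    negatives = vec_top ^ fts_top   # only one branch picked it: weaker evidence
    return [(query, pos, neg) for pos in positives for neg in negatives]
```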
The feature is maturing fast. Each release tightens the recipe, lifts the measured numbers, and adds infrastructure that keeps the next iteration honest:
| Release | Training recipe | 5-dataset BEIR macro NDCG@10 | What landed alongside |
|---|---|---|---|
| base bge-small | no fine-tune | 0.6118 | reference |
| rrf-v2 | 76k triples, ad-hoc scripts | 0.6246 | first paper-grade result; still the NFCorpus specialist |
| rrf-v3 | 60k triples via retrain-multi CLI, temperature=0.5, eval gate | 0.6405 | H-R9 ablation picked the config empirically; H-R7 seeded RNGs make it reproducible; H-R5 reports NDCG@3 + Recall@100 so regressions are visible before they ship |
Both v2 and v3 beat ColBERTv2 on 5/5 BEIR datasets under the current pipeline. v3 improves macro by +0.016 over v2 (+2.6% relative), with the largest per-dataset gain on FiQA (+0.097 absolute). It is a trade, not a strict upgrade: v3 gives up ~0.040 NDCG@10 on NFCorpus vs v2 in exchange for the FiQA and SciFact wins. v2 remains the better pick for keyword-heavy / biomedical corpora where NFCorpus-style retrieval dominates; v3 is the recommended default for everything else. The eval gate also catches losers: hypothesis H-R3 (hard-negative margin filter) regressed macro by -2.49pp, so the candidate was refused and the branch was closed without merging. The pipeline's job is to refuse bad models, and it does.
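
A minimal sketch of that accept/refuse decision, with the metric injected as a callable; `eval_gate` and its signature are assumptions for illustration, not vstash's internal API.

```python
from typing import Callable, Sequence

def eval_gate(
    score: Callable[[object, Sequence[str]], float],  # e.g. NDCG@10 on a held-out slice
    base_model: object,
    candidate_model: object,
    holdout_queries: Sequence[str],
) -> object:
    """Return the candidate only if it beats the base model; otherwise refuse."""
    base = score(base_model, holdout_queries)
    cand = score(candidate_model, holdout_queries)
    if cand <= base:
        raise RuntimeError(f"refused: candidate {cand:.4f} <= base {base:.4f}")
    return candidate_model  # only a strictly better model is ever saved
```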
See the Retrieval Quality table, docs/retrain.md for the full recipe and per-version breakdown, and experiments/results/v2_v3_head_to_head.json for reproducible numbers.
Requires sentence-transformers, torch, and accelerate:

```bash
pip install 'sentence-transformers>=3' torch 'accelerate>=1.1.0'
```

| Component | Data leaves machine? |
|---|---|
| Embeddings (FastEmbed) | Never |
| Search (sqlite-vec + FTS5) | Never |
| Inference (Ollama/LM Studio) | Never |
| Inference (Cerebras/OpenAI) | Yes (query + context sent to API) |
Search is always private. Use a local LLM for fully private answers.
vstash: Local-First Hybrid Retrieval with Adaptive Fusion for LLM Agents
Adaptive RRF, self-supervised embedding refinement, a negative result on post-RRF scoring, and the production substrate all in one place. PDF build at paper/arxiv/vstash.pdf.
| Guide | Description |
|---|---|
| How It Works | Search pipeline, chunking, RRF |
| Configuration | Full TOML reference |
| Embedding Models | Model comparison, vstash retrain |
| MCP Server | 16 tools for LLM agents |
| Experiments | BEIR benchmarks, ablations |
| Experiment | Key Result | Command |
|---|---|---|
| BEIR Benchmark | With bge-small-rrf-v3 (current default): 5/5 BEIR datasets beat ColBERTv2. With -rrf-v2 (previous): 4/5 under this script's historical pipeline. See Retrieval Quality for the v3 numbers. | python -m experiments.beir_benchmark --no-chroma |
| Retrain (eval-gated) | Fine-tune your embedding model on your own corpus, refuses regressions | vstash retrain --help |
| Pipeline latency | On a 50K-chunk real BGE corpus (M4 Pro): sqlite-vec 73 ms, snapvec 53 ms, snapvec-ivfpq 51 ms p50; identical NDCG@10 (~0.716) | python -m experiments.vstash_pipeline_ivfpq_bench --n 50000 |
| Relevance Signal | F1=0.996 cross-domain | python -m experiments.relevance_signal_beir |
- `chat.ask_full()` returning `AskResult` (v0.36) -- new public API surfaces the reasoning trace and token usage that `ask()` previously discarded. Cerebras `gpt-oss-120b` populates `result.reasoning`; Ollama qwen3 thinking-mode uses `message.thinking`; OpenAI-compat servers (vLLM, DeepSeek, Together, xAI Grok, OpenAI o1/o3) read `reasoning_content`. `ask()` keeps its `-> str` contract via a thin wrapper -- zero call-site change for existing code. Also exposed as `Memory.ask_full()` (usage sketch after this list). Drives the Merken Phase 2 distillation pipeline.
- Centralized store construction (v0.36) -- `open_store_for_config(cfg)` is the single entry point used by CLI, MCP, web, SDK, journal, and federated search. Previously each surface duplicated the `StorageConfig -> VstashStore` wiring and silently dropped IVFPQ tuning fields on some paths (#297).
- `vec_only` long-query distance cutoff fix (v0.36) -- `retrieval_mode="vec_only"` now applies the same long-query relaxation as `hybrid`; ArguAna `vec_only` jumped from NDCG@10 = 0.0013 (1403/1406 zero) to 0.4250. Hybrid mode and all paper / model-card numbers untouched (#304).
- Bug fixes (v0.36) -- `Memory.add(collection=None)` falls back to the schema default instead of crashing on the NOT NULL constraint (#296); `vstash retrain --synthesize-queries` no longer crashes on Ollama / Cerebras backends (#294); web uploads now persist under `~/.vstash/uploads/<uuid>-<safe-name>` instead of pointing at deleted temp paths (#295).
- `bge-small-rrf-lme-v1` chat-memory specialist (v0.35) -- fine-tuned on 398 labeled LongMemEval queries through the eval-gated retrain loop. +3.79pp R@5 on the n=102 holdout vs vanilla BGE-small. Use it when your corpus is primarily chat sessions / agent memory.
- Eval-gated labeled retrain (v0.35) -- `vstash retrain --training-queries train.jsonl --eval-queries eval.jsonl` accepts user-supplied `(query, relevant_paths)` JSONL and refuses to save fine-tunes that regress NDCG@10 on the holdout. See docs/retrain.md.
- `vstash why` miss analysis (v0.33) -- diagnose why an expected document did not surface for a query. Traces the vector pool, distance cutoff, FTS match, RRF fusion, MMR, and context-expansion stages with parameter suggestions. Auto-logs misses on empty / low-relevance searches.
- `retrieval_mode` enum (v0.33) -- `Literal["hybrid", "vec_only", "fts_only"] = "hybrid"` on `Memory.search`, `Memory.ask`, `VstashStore.search`, and the MCP tools. `vec_only` is the symmetric branch to `fts_only`. The default stays `hybrid`. The legacy `fts_only=True` boolean was removed in v0.35.
- Custom encoder resolver hook (v0.34) -- `register_encoder_resolver(fn)` lets callers plug LoRA-adapted, locally fine-tuned, or otherwise unnamable encoders into the embed pipeline. See docs/embedding-models.md.
- Cosine metric in `vec_chunks` (v0.34) -- the sqlite-vec virtual table now uses cosine distance (was L2 before; v1 DBs migrate in place atomically on first open). Fixed a latent bug where non-unit-normalized models silently mis-ranked.
- Persistent embedder daemon (v0.32) -- `vstash serve --warm` pre-loads the embedding model and exposes `/api/embed` on `localhost:8585`. CLI and SDK clients auto-detect and delegate; cold start drops from ~2 s to ~5 ms.
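
A minimal usage sketch of the v0.36 `ask_full()` API from the entry above. `Memory`, `ask()`, `ask_full()`, and `result.reasoning` come straight from the notes; the query string and the behavior on non-reasoning backends are assumptions.

```python
from vstash import Memory

mem = Memory(project="my_agent")

answer: str = mem.ask("What changed in v0.36?")   # unchanged -> str contract
result = mem.ask_full("What changed in v0.36?")   # full AskResult
print(result.reasoning)  # reasoning trace, e.g. from Cerebras gpt-oss-120b
                         # (may be empty on non-reasoning backends -- assumption)
```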
See CHANGELOG for full version history.