Fast vector compression and semantic memory for AI agents.
TurboQuant-RS implements the TurboQuant family of algorithms for compressing high-dimensional embedding vectors to 1-4 bits per dimension, enabling semantic search over thousands of documents with minimal memory. Built for AI coding agents that need to recall prior context across sessions.
Open `docs/algorithm-explainer.html` in a browser for the full interactive version with randomizable demos.
```
turboquant-rs/
  crates/
    turboquant/       Core: QJL (1-bit) + TurboQuant_mse (2-4 bit) + two-stage search
    agent-memory/     Session context ranking + persistent cross-session semantic recall
    memory-mcp/       MCP server: 4 tools for AI agent integration
    codesearch-mcp/   Semantic code search (scaffolded, not yet implemented)
  scripts/
    export_onnx.py    Download & export embedding models to ONNX format
```
Compresses each vector to 1 bit per dimension (32x compression vs float32).
- Generate a random Gaussian projection matrix R (deterministic from seed)
- Project the vector: `y = R @ x`
- Keep only the sign bits: `b = sign(y)`
Vectors that are similar in the original space will share most sign bits, so Hamming distance between bit vectors approximates cosine distance. Used as a fast pre-filter in the first search stage.
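The three steps above can be sketched in a few lines of dependency-free Rust. The xorshift/Box-Muller PRNG here is an illustrative stand-in for whatever seeded generator the crate actually uses, and `qjl_compress`/`hamming` are hypothetical names, not the library's API:

```rust
fn xorshift(s: &mut u64) -> f64 {
    *s ^= *s << 13;
    *s ^= *s >> 7;
    *s ^= *s << 17;
    (*s >> 11) as f64 / (1u64 << 53) as f64 // uniform in [0, 1)
}

fn gaussian(s: &mut u64) -> f64 {
    // Box-Muller transform: two uniforms -> one standard normal sample
    let u1 = xorshift(s).max(1e-12);
    let u2 = xorshift(s);
    (-2.0 * u1.ln()).sqrt() * (2.0 * std::f64::consts::PI * u2).cos()
}

/// y = R @ x, b = sign(y): one sign bit per projected dimension.
fn qjl_compress(x: &[f64], proj_dims: usize, seed: u64) -> Vec<bool> {
    let mut state = seed;
    (0..proj_dims)
        .map(|_| {
            // Generate one row of R on the fly (deterministic from the seed).
            let y: f64 = x.iter().map(|&xi| xi * gaussian(&mut state)).sum();
            y >= 0.0
        })
        .collect()
}

fn hamming(a: &[bool], b: &[bool]) -> usize {
    a.iter().zip(b).filter(|(x, y)| x != y).count()
}

fn main() {
    let x = [1.0, 0.5, -0.3, 0.8];
    let close = [0.9, 0.55, -0.25, 0.75]; // nearly parallel to x
    let far: Vec<f64> = x.iter().map(|v| -v).collect(); // exactly opposite
    let bx = qjl_compress(&x, 64, 42);
    let bc = qjl_compress(&close, 64, 42);
    let bf = qjl_compress(&far, 64, 42);
    // Similar vectors share most sign bits; the negation flips every one.
    assert!(hamming(&bx, &bc) < hamming(&bx, &bf));
    println!("close: {} bits differ, far: {} bits differ",
             hamming(&bx, &bc), hamming(&bx, &bf));
}
```

Because the projection rows are regenerated from the seed, two parties holding only the seed produce identical bit vectors, which is why no matrix needs to be stored.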
Zandieh, A., Daliri, M., & Han, I. (2024). QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead. arXiv:2406.03482
Compresses each vector to b bits per dimension with provably near-optimal MSE distortion (within 2.7x of the information-theoretic lower bound).
- Normalize the vector and store its L2 norm separately
- Multiply by a random orthogonal matrix Q (Gram-Schmidt on Gaussian, deterministic from seed)
- After rotation, each coordinate follows ≈ N(0, 1/d) — quantize independently with a precomputed Lloyd-Max codebook
- Pack b-bit indices into bytes
Key insight for search: Since Q is orthogonal, <x, y> = <Qx, Qy>. Similarity is computed directly in the rotated domain using codebook lookups — no matrix multiply during search.
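The key insight can be verified numerically: Gram-Schmidt on Gaussian rows yields a random orthogonal Q, and `<x, y>` survives the rotation up to floating-point error. This is a self-contained sketch (the PRNG and function names are illustrative, not the crate's):

```rust
fn xorshift(s: &mut u64) -> f64 {
    *s ^= *s << 13;
    *s ^= *s >> 7;
    *s ^= *s << 17;
    (*s >> 11) as f64 / (1u64 << 53) as f64 // uniform in [0, 1)
}

fn gaussian(s: &mut u64) -> f64 {
    // Box-Muller: two uniforms -> one standard normal sample
    let u1 = xorshift(s).max(1e-12);
    let u2 = xorshift(s);
    (-2.0 * u1.ln()).sqrt() * (2.0 * std::f64::consts::PI * u2).cos()
}

fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Random orthogonal matrix: Gram-Schmidt on Gaussian rows.
fn random_orthogonal(d: usize, seed: u64) -> Vec<Vec<f64>> {
    let mut s = seed;
    let mut q: Vec<Vec<f64>> = Vec::with_capacity(d);
    for _ in 0..d {
        let mut v: Vec<f64> = (0..d).map(|_| gaussian(&mut s)).collect();
        for row in &q {
            // Remove the component of v along each previously accepted row.
            let c = dot(&v, row);
            for (vi, ri) in v.iter_mut().zip(row) {
                *vi -= c * ri;
            }
        }
        let norm = dot(&v, &v).sqrt();
        for vi in v.iter_mut() {
            *vi /= norm;
        }
        q.push(v);
    }
    q
}

fn apply(q: &[Vec<f64>], x: &[f64]) -> Vec<f64> {
    q.iter().map(|row| dot(row, x)).collect()
}

fn main() {
    let q = random_orthogonal(8, 99);
    let x = [0.3, -1.2, 0.7, 0.0, 2.1, -0.4, 0.9, 1.5];
    let y = [1.0, 0.2, -0.6, 0.8, -1.1, 0.5, 0.0, -0.7];
    let (qx, qy) = (apply(&q, &x), apply(&q, &y));
    // Inner products survive the rotation, so similarity can be computed
    // entirely in the rotated (quantized) domain.
    assert!((dot(&x, &y) - dot(&qx, &qy)).abs() < 1e-9);
}
```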
| Bits | Centroids | Storage (384-dim) | MSE bound (theoretical) |
|---|---|---|---|
| 2 | 4 | 100 bytes | ≤ 0.170 |
| 3 | 8 | 148 bytes | ≤ 0.043 |
| 4 | 16 | 196 bytes | ≤ 0.011 |
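The storage column follows from bit packing. This minimal check assumes the extra 4 bytes per vector are one f32 for the separately stored L2 norm (an inference from the normalization step, not a documented layout):

```rust
// Reproduce the "Storage (384-dim)" column: d*b/8 bytes of packed codes plus
// 4 bytes, which lines up with one f32 for the separately stored L2 norm
// (an inference from the normalization step, not a documented layout).
fn bytes_per_vec(dims: usize, bits: usize) -> usize {
    dims * bits / 8 + 4
}

fn main() {
    assert_eq!(bytes_per_vec(384, 2), 100); // 96 + 4
    assert_eq!(bytes_per_vec(384, 3), 148); // 144 + 4
    assert_eq!(bytes_per_vec(384, 4), 196); // 192 + 4
}
```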
Zandieh, A., Daliri, M., Hadian, M., & Mirrokni, V. (2025). TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate. arXiv:2504.19874
Available as an alternative compressor. Pairs adjacent dimensions into 2D polar coordinates, quantizes angles (4-bit) and radii (8-bit). Not used in the default search pipeline but kept for comparison.
Han, I., Kacham, P., Karbasi, A., Mirrokni, V., & Zandieh, A. (2025). PolarQuant. arXiv:2502.02617
```
Query vector
     |
     v
[QJL 1-bit] --> Hamming scan over all vectors --> top-K candidates (fast, approximate)
     |
     v
[TurboQuant_mse 3-bit] --> codebook similarity on K candidates --> final top-k results
```
QJL eliminates ~98% of candidates with a single XOR + popcount per vector. TurboQuant_mse re-ranks only the survivors with near-optimal accuracy.
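The stage-1 scan in miniature: with sign bits packed into u64 words, each candidate costs one XOR + popcount per word (the function name here is illustrative, not the crate's API):

```rust
// Hamming distance over packed sign bits: XOR finds the differing bits,
// count_ones (popcount) tallies them.
fn hamming(a: &[u64], b: &[u64]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}

fn main() {
    let query = [0b1011_0010u64, u64::MAX];
    let doc = [0b1011_0110u64, u64::MAX];
    assert_eq!(hamming(&query, &doc), 1); // differs in exactly one bit
    assert_eq!(hamming(&query, &query), 0);
}
```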
Core compression library with no runtime dependencies beyond ndarray.
```rust
use turboquant::{TurboIndex, SearchResult};

// Create an index (stored on disk via mmap)
// Default: QJL 1-bit pre-filter + TurboQuant_mse 3-bit re-ranking
let mut index = TurboIndex::create("./my_index", 384, /*qjl_seed=*/42, /*tqmse_seed=*/99)?;

// Insert vectors (e.g., from an embedding model)
index.insert(1, &embedding_vec)?;
index.insert_batch(&ids, &embedding_matrix)?;

// Search
let results: Vec<SearchResult> = index.search(&query_vec, 10);
for r in &results {
    println!("id={} score={:.3} hamming={}", r.id, r.score, r.distance);
}

// Maintenance
index.delete(1)?;
index.compact()?; // rebuild without deleted vectors
```

Use `TurboIndex::create_with_bits()` to choose 2-bit (smallest) or 4-bit (most accurate) re-ranking.
Storage format: Memory-mapped files with 16-byte headers. QJL vectors (*.qjl) and TurboQuant_mse vectors (*.tqsq) are stored separately for cache-friendly scanning.
Two memory systems designed for AI agent workflows:
SessionMemory - in-conversation context management:
- Tracks conversation turns with token counts
- Selects relevant context within a token budget
- Ranks by `relevance_weight * search_score + recency_weight * recency_score`
- Always includes the N most recent turns (configurable)
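The ranking rule above can be sketched as a greedy budgeted selection. All names, weights, and the selection order here are illustrative assumptions, not the crate's actual implementation:

```rust
// Hypothetical sketch: keep the most recent turns, then greedily add the
// best-scoring older turns that still fit the token budget.
struct Turn {
    text: &'static str,
    tokens: usize,
    search_score: f64,
    recency_score: f64,
}

fn select_context(turns: &[Turn], budget: usize, keep_recent: usize,
                  relevance_weight: f64, recency_weight: f64) -> Vec<usize> {
    let mut remaining = budget;
    let mut chosen: Vec<usize> = Vec::new();
    // Always include the N most recent turns first.
    for i in (turns.len().saturating_sub(keep_recent)..turns.len()).rev() {
        if turns[i].tokens <= remaining {
            remaining -= turns[i].tokens;
            chosen.push(i);
        }
    }
    // Rank the rest by the combined score and add while the budget allows.
    let mut rest: Vec<usize> = (0..turns.len().saturating_sub(keep_recent)).collect();
    rest.sort_by(|&a, &b| {
        let score = |t: &Turn| relevance_weight * t.search_score + recency_weight * t.recency_score;
        score(&turns[b]).partial_cmp(&score(&turns[a])).unwrap()
    });
    for i in rest {
        if turns[i].tokens <= remaining {
            remaining -= turns[i].tokens;
            chosen.push(i);
        }
    }
    chosen.sort();
    chosen
}

fn main() {
    let turns = [
        Turn { text: "old, highly relevant", tokens: 50, search_score: 0.9, recency_score: 0.1 },
        Turn { text: "old, irrelevant", tokens: 900, search_score: 0.1, recency_score: 0.2 },
        Turn { text: "most recent", tokens: 40, search_score: 0.2, recency_score: 1.0 },
    ];
    let picked = select_context(&turns, 100, 1, 0.7, 0.3);
    assert_eq!(picked, vec![0, 2]); // recent turn kept, relevant old turn fits
    for &i in &picked {
        println!("{}", turns[i].text);
    }
}
```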
```rust
use std::path::Path;
use std::sync::Arc;
use agent_memory::{SessionMemory, OnnxEmbedder};

let embedder = Arc::new(OnnxEmbedder::load(Path::new("models/minilm"), 384)?);
let mut session = SessionMemory::new(embedder)?;
session.add_turn("user", "How does the auth middleware work?", 12)?;
session.add_turn("assistant", "The auth middleware validates JWT tokens...", 85)?;

// Select context that fits in 4096 tokens, ranked by relevance to current message
let context = session.select_context("Now fix the token expiry bug", 4096, None)?;
```

`PersistentMemory` - cross-session knowledge base:
- Ingests documents (plans, RCAs, investigation notes) with semantic indexing
- Recalls by natural language query with type filtering
- Tracks file changes via SHA-256 for incremental sync
- Supports glob patterns for bulk ingestion
```rust
use std::path::Path;
use std::sync::Arc;
use agent_memory::{PersistentMemory, OnnxEmbedder, DocumentType};

let embedder = Arc::new(OnnxEmbedder::load(Path::new("models/minilm"), 384)?);
let mut mem = PersistentMemory::open(Path::new("./memory_store"), embedder)?;

// Ingest
mem.ingest("Auth rewrite: moved to asymmetric JWT...", DocumentType::RCA, Some("docs/rca_auth.md"))?;
mem.ingest_glob("docs/plans/*.md", DocumentType::Plan)?;

// Recall
let results = mem.recall("JWT token validation", 5)?;
let rca_only = mem.recall_typed("auth bug", 5, &[DocumentType::RCA])?;

// Sync after files change on disk
let stats = mem.sync()?;
println!("{} added, {} updated, {} removed", stats.added, stats.updated, stats.removed);
```

Document types: `Plan`, `Memory`, `RCA`, `CommitContext`, `ConversationTurn`, `ToolResult`, `Custom(String)`
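The added/updated/removed stats above come from diffing content hashes against the stored index. This sketch shows the idea, using std's `DefaultHasher` as a dependency-free stand-in for SHA-256; the struct and function names are illustrative, not the crate's:

```rust
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Stand-in for SHA-256: any stable content hash works for the demonstration.
fn content_hash(content: &str) -> u64 {
    let mut h = std::collections::hash_map::DefaultHasher::new();
    content.hash(&mut h);
    h.finish()
}

struct SyncStats { added: usize, updated: usize, removed: usize }

/// Diff the on-disk files against the stored path -> hash index.
fn sync(index: &mut HashMap<String, u64>, on_disk: &HashMap<String, String>) -> SyncStats {
    let mut stats = SyncStats { added: 0, updated: 0, removed: 0 };
    for (path, content) in on_disk {
        let h = content_hash(content);
        match index.get(path) {
            None => { index.insert(path.clone(), h); stats.added += 1; }
            Some(&old) if old != h => { index.insert(path.clone(), h); stats.updated += 1; }
            _ => {} // unchanged: skip re-embedding
        }
    }
    // Anything indexed but no longer on disk gets removed.
    let gone: Vec<String> = index.keys().filter(|p| !on_disk.contains_key(*p)).cloned().collect();
    for p in gone { index.remove(&p); stats.removed += 1; }
    stats
}

fn main() {
    let mut index = HashMap::new();
    let mut disk = HashMap::new();
    disk.insert("plan.md".to_string(), "v1".to_string());
    let s = sync(&mut index, &disk);
    assert_eq!((s.added, s.updated, s.removed), (1, 0, 0));

    disk.insert("plan.md".to_string(), "v2".to_string()); // file changed
    disk.insert("rca.md".to_string(), "new".to_string()); // file added
    let s = sync(&mut index, &disk);
    assert_eq!((s.added, s.updated, s.removed), (1, 1, 0));

    disk.remove("rca.md"); // file deleted
    let s = sync(&mut index, &disk);
    assert_eq!((s.added, s.updated, s.removed), (0, 0, 1));
}
```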
Embedder trait: Pluggable embedding backend. Ships with OnnxEmbedder for production use via ONNX Runtime (supports any HuggingFace model exported to ONNX). A MockEmbedder is also available for unit testing.
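A pluggable embedder boundary might look like the following; the real trait's method names and signatures may differ, and this `MockEmbedder` is a toy, not the one the crate ships:

```rust
use std::sync::Arc;

// Hypothetical shape of the pluggable embedding backend.
trait Embedder: Send + Sync {
    fn dim(&self) -> usize;
    fn embed(&self, text: &str) -> Vec<f32>;
}

/// Deterministic toy backend: hashes characters into a fixed-size
/// bag-of-chars vector, then L2-normalizes it.
struct MockEmbedder { dim: usize }

impl Embedder for MockEmbedder {
    fn dim(&self) -> usize { self.dim }
    fn embed(&self, text: &str) -> Vec<f32> {
        let mut v = vec![0.0f32; self.dim];
        for c in text.chars() {
            v[(c as usize) % self.dim] += 1.0;
        }
        let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt().max(1e-6);
        v.iter().map(|x| x / norm).collect()
    }
}

fn main() {
    let embedder: Arc<dyn Embedder> = Arc::new(MockEmbedder { dim: 8 });
    let a = embedder.embed("jwt auth");
    assert_eq!(a.len(), embedder.dim());
    assert_eq!(a, embedder.embed("jwt auth")); // deterministic
}
```

A trait boundary like this lets unit tests exercise the memory layers without loading ONNX Runtime.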
MCP (Model Context Protocol) server that exposes agent-memory to AI coding agents like Claude Code.
Tools:
| Tool | Description |
|---|---|
| `memory_ingest` | Ingest text, a file, or files matching a glob pattern into the index |
| `memory_recall` | Semantic search over ingested documents by natural language query |
| `memory_sync` | Re-index changed files, remove deleted files |
| `memory_stats` | Index statistics: document count, token count, type breakdown |
Model auto-detection: On startup, the server looks for an ONNX model in:
1. `MEMORY_MODEL_DIR` environment variable
2. `<binary_dir>/../models/minilm/`
3. `~/.cache/agent-memory/models/minilm/`
Falls back to a non-semantic embedder if no model is found (recall quality will be degraded — download the model for production use).
Storage: Per-project memory stored at ~/.cache/agent-memory/<project_hash>/.
Semantic code search server. Will chunk code files (.gitignore-aware), embed with BGE-code-v1, and expose index_codebase + semantic_search MCP tools. Not yet implemented.
```shell
git clone https://github.com/coderjack/turboquant-rs.git
cd turboquant-rs
cargo build --release -p memory-mcp
```

The binary is at `target/release/memory-mcp`.
The server needs an ONNX embedding model. We use all-MiniLM-L6-v2 (384 dimensions, ~80MB):
```shell
pip install transformers optimum[onnxruntime] torch
python scripts/export_onnx.py --output ~/.cache/agent-memory/models/minilm
```

Use `--model` to export a different model:

```shell
python scripts/export_onnx.py --model sentence-transformers/all-mpnet-base-v2 --output ~/.cache/agent-memory/models/mpnet
```

The server auto-detects the model on startup from these locations (in order):
1. `MEMORY_MODEL_DIR` environment variable
2. `<binary_dir>/../models/minilm/`
3. `~/.cache/agent-memory/models/minilm/`
If no model is found, the server starts with a non-semantic fallback embedder and logs a warning. Semantic recall requires the ONNX model.
Add to ~/.claude/settings.json:
{
"mcpServers": {
"memory": {
"command": "/absolute/path/to/turboquant-rs/target/release/memory-mcp"
}
}
}Or with a custom model path:
```json
{
  "mcpServers": {
    "memory": {
      "command": "/absolute/path/to/turboquant-rs/target/release/memory-mcp",
      "env": {
        "MEMORY_MODEL_DIR": "/path/to/your/model/directory"
      }
    }
  }
}
```

Restart Claude Code after editing settings: MCP servers are loaded on startup.
In a new Claude Code session, you should see 4 new tools. Test with:
```
memory_stats()
memory_ingest(glob: "~/workspace/docs/*.md", doc_type: "Plan")
memory_ingest(path: "/path/to/rca.md", doc_type: "RCA")
memory_ingest(content: "The auth service uses asymmetric JWT with RS256.", doc_type: "Memory")
memory_recall(query: "auth middleware JWT token validation")
memory_recall(query: "database migration", top_k: 10, doc_types: ["RCA"])
memory_sync()
```
Re-indexes changed files and removes deleted ones from the index.
```toml
# Cargo.toml
[dependencies]
turboquant = { git = "https://github.com/coderjack/turboquant-rs" }
```

```rust
use turboquant::{TurboIndex, SearchResult};

// Create an index with QJL 1-bit + TurboQuant_mse 3-bit (default)
let mut index = TurboIndex::create("./my_index", 384, 42, 99)?;
index.insert(1, &embedding)?;
let results = index.search(&query, 10);

// Or use compressors directly
use turboquant::{QjlCompressor, TqMseCompressor};
let qjl = QjlCompressor::new(384, 42);
let bits = qjl.compress(&vector); // 48 bytes

let tqmse = TqMseCompressor::new(384, 99, 3); // 3-bit
let compressed = tqmse.compress(&vector); // 148 bytes
let similarity = tqmse.similarity_raw(&query, &compressed);
```

```toml
[dependencies]
agent-memory = { git = "https://github.com/coderjack/turboquant-rs" }
```

```rust
use std::sync::Arc;
use agent_memory::{PersistentMemory, OnnxEmbedder, DocumentType};

let embedder = Arc::new(OnnxEmbedder::load("models/minilm".as_ref(), 384)?);
let mut mem = PersistentMemory::open("./memory_store".as_ref(), embedder)?;

// Ingest documents
mem.ingest("RCA: the auth bug was caused by...", DocumentType::RCA, Some("rca.md"))?;
mem.ingest_glob("plans/*.md", DocumentType::Plan)?;

// Recall by semantic query
let results = mem.recall("authentication bug", 5)?;
for r in &results {
    println!("[{:.2}] {:?} — {}", r.combined_score, r.document.doc_type, r.document.content_preview);
}

// Session memory with token budgets
use agent_memory::SessionMemory;
let mut session = SessionMemory::new(embedder.clone())?;
session.add_turn("user", "How does auth work?", 8)?;
session.add_turn("assistant", "The auth middleware validates JWT tokens...", 42)?;
let context = session.select_context("Fix the token expiry bug", 4096, None)?;
```

- MCP servers are loaded on startup. Restart Claude Code after editing `settings.json`.
- The `command` path must be absolute, not relative.
- Check that the binary exists: `ls /path/to/target/release/memory-mcp`
The server couldn't find an ONNX model and is running with a non-semantic fallback. Run the export script:
```shell
python scripts/export_onnx.py --output ~/.cache/agent-memory/models/minilm
```

Verify the model files exist:

```shell
ls ~/.cache/agent-memory/models/minilm/
# Should contain: model.onnx, tokenizer.json, and config files
```

Or set the path explicitly via the `MEMORY_MODEL_DIR` environment variable in your MCP server config.
Check the server logs. memory-mcp logs to stderr:
```shell
echo '{}' | RUST_LOG=debug /path/to/memory-mcp 2>&1 | head -20
```

Look for errors after "memory-mcp server starting".
The index is empty. Ingest documents first with memory_ingest. Check the index status with memory_stats().
Each project gets its own index at `~/.cache/agent-memory/<hash>/`. To clear a project's index:

```shell
# Find the cache directory
ls ~/.cache/agent-memory/
# Remove a specific project's index
rm -rf ~/.cache/agent-memory/<hash>/
```

| Method | Bits/dim | Bytes/vec (384-dim) | Compression vs f32 | Use case |
|---|---|---|---|---|
| QJL | 1 | 48 | 32x | Fast pre-filtering |
| TurboQuant_mse 2-bit | 2 | 100 | 15x | Compact re-ranking |
| TurboQuant_mse 3-bit | 3 | 148 | 10x | Default re-ranking |
| TurboQuant_mse 4-bit | 4 | 196 | 7.8x | High-accuracy re-ranking |
| PolarQuant | ~6 | 288 | 5x | Alternative (legacy) |
For 10,000 documents at 384 dimensions:
- Raw float32: 15.0 MB
- QJL + TurboQuant_mse 3-bit: 1.9 MB (7.9x total compression)
- QJL + PolarQuant: 3.3 MB
```shell
# All tests (65 total)
cargo test --workspace

# With real ONNX model (requires export_onnx.py first)
cargo test -p agent-memory test_onnx_embedder_real_model
```

This project implements algorithms from the following papers:
```bibtex
@article{zandieh2025turboquant,
  title={TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
  author={Zandieh, Amir and Daliri, Majid and Hadian, Majid and Mirrokni, Vahab},
  journal={arXiv preprint arXiv:2504.19874},
  year={2025}
}

@article{zandieh2024qjl,
  title={QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead},
  author={Zandieh, Amir and Daliri, Majid and Han, Insu},
  journal={arXiv preprint arXiv:2406.03482},
  year={2024}
}

@article{han2025polarquant,
  title={PolarQuant},
  author={Han, Insu and Kacham, Praneeth and Karbasi, Amin and Mirrokni, Vahab and Zandieh, Amir},
  journal={arXiv preprint arXiv:2502.02617},
  year={2025}
}
```

Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT License (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.