TurboQuant-RS

Fast vector compression and semantic memory for AI agents.

TurboQuant-RS implements the TurboQuant family of algorithms for compressing high-dimensional embedding vectors to 1-4 bits per dimension, enabling semantic search over thousands of documents with minimal memory. Built for AI coding agents that need to recall prior context across sessions.

How It Works

TurboQuant Algorithm Animation

Open docs/algorithm-explainer.html in a browser for the full interactive version with randomizable demos.

Architecture

turboquant-rs/
  crates/
    turboquant/       Core: QJL (1-bit) + TurboQuant_mse (2-4 bit) + two-stage search
    agent-memory/     Session context ranking + persistent cross-session semantic recall
    memory-mcp/       MCP server: 4 tools for AI agent integration
    codesearch-mcp/   Semantic code search (scaffolded, not yet implemented)
  scripts/
    export_onnx.py    Download & export embedding models to ONNX format

Compression Algorithms

QJL (Quantized Johnson-Lindenstrauss) - 1-bit

Compresses each vector to 1 bit per dimension (32x compression vs float32).

  1. Generate a random Gaussian projection matrix R (deterministic from seed)
  2. Project the vector: y = R @ x
  3. Keep only the sign bits: b = sign(y)

Vectors that are similar in the original space will share most sign bits, so Hamming distance between bit vectors approximates cosine distance. Used as a fast pre-filter in the first search stage.
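The three steps can be sketched in plain Rust. This is a toy illustration, not the crate's API: the tiny xorshift PRNG, the Irwin-Hall Gaussian approximation, and the square d×d projection are all simplifications, and `qjl_compress` / `hamming` are hypothetical names.

```rust
/// Tiny deterministic PRNG so the projection matrix is reproducible from a seed.
fn xorshift(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

/// Approximate a standard normal draw by summing 12 uniforms (Irwin-Hall).
fn gaussian(state: &mut u64) -> f32 {
    let mut s = 0.0f32;
    for _ in 0..12 {
        s += (xorshift(state) >> 40) as f32 / (1u64 << 24) as f32;
    }
    s - 6.0
}

/// Steps 1-3: generate R row by row from the seed, project y = R @ x,
/// keep only the sign bits, packed 8 per byte.
fn qjl_compress(x: &[f32], seed: u64) -> Vec<u8> {
    let d = x.len();
    let mut state = seed;
    let mut bits = vec![0u8; (d + 7) / 8];
    for i in 0..d {
        let mut y = 0.0f32;
        for &xj in x {
            y += gaussian(&mut state) * xj; // row i of R, streamed
        }
        if y >= 0.0 {
            bits[i / 8] |= 1 << (i % 8);
        }
    }
    bits
}

/// Hamming distance between packed bit vectors: XOR + popcount per byte.
fn hamming(a: &[u8], b: &[u8]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}
```

Because both vectors are compressed with the same seed, they see the same matrix R, and a small perturbation of the input flips few sign bits.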

Zandieh, A., Daliri, M., & Han, I. (2024). QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead. arXiv:2406.03482

TurboQuant_mse - Optimal Scalar Quantization (2-4 bit)

Compresses each vector to b bits per dimension with provably near-optimal MSE distortion (within 2.7x of the information-theoretic lower bound).

  1. Normalize the vector and store its L2 norm separately
  2. Multiply by a random orthogonal matrix Q (Gram-Schmidt on Gaussian, deterministic from seed)
  3. After rotation, each coordinate follows ≈ N(0, 1/d) — quantize independently with a precomputed Lloyd-Max codebook
  4. Pack b-bit indices into bytes

Key insight for search: Since Q is orthogonal, <x, y> = <Qx, Qy>. Similarity is computed directly in the rotated domain using codebook lookups — no matrix multiply during search.
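The quantize-and-look-up steps can be sketched with a 2-bit codebook. The centroids below are the standard 4-level Lloyd-Max levels for a unit-variance Gaussian; the crate's actual codebooks target N(0, 1/d) coordinates, and `quantize_2bit` / `similarity` are illustrative names, not the library's API.

```rust
// 4-level Lloyd-Max quantizer levels for N(0, 1) (illustrative codebook).
const CODEBOOK_2BIT: [f32; 4] = [-1.510, -0.453, 0.453, 1.510];

/// Map each coordinate to its nearest centroid, packing 2-bit indices 4 per byte.
fn quantize_2bit(coords: &[f32]) -> Vec<u8> {
    let mut packed = vec![0u8; (coords.len() + 3) / 4];
    for (i, &c) in coords.iter().enumerate() {
        let mut idx = 0usize; // nearest-centroid search over 4 levels
        for k in 1..CODEBOOK_2BIT.len() {
            if (CODEBOOK_2BIT[k] - c).abs() < (CODEBOOK_2BIT[idx] - c).abs() {
                idx = k;
            }
        }
        packed[i / 4] |= (idx as u8) << (2 * (i % 4));
    }
    packed
}

/// Inner product against a quantized vector via codebook lookups only —
/// no decompression pass, no matrix multiply at search time.
fn similarity(query: &[f32], packed: &[u8]) -> f32 {
    query
        .iter()
        .enumerate()
        .map(|(i, &q)| {
            let idx = ((packed[i / 4] >> (2 * (i % 4))) & 0b11) as usize;
            q * CODEBOOK_2BIT[idx]
        })
        .sum()
}
```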

Bits  Centroids  Storage (384-dim)  MSE bound (theoretical)
2     4          100 bytes          ≤ 0.170
3     8          148 bytes          ≤ 0.043
4     16         196 bytes          ≤ 0.011

Zandieh, A., Daliri, M., Hadian, M., & Mirrokni, V. (2025). TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate. arXiv:2504.19874

PolarQuant (also included)

Available as an alternative compressor. Pairs adjacent dimensions into 2D polar coordinates, quantizes angles (4-bit) and radii (8-bit). Not used in the default search pipeline but kept for comparison.
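A toy sketch of the pairing idea for a single 2D pair, with uniform bins standing in for PolarQuant's actual quantization grids (bin counts, ranges, and function names here are illustrative assumptions):

```rust
use std::f32::consts::PI;

/// Quantize one (x, y) pair: 4-bit angle (16 uniform bins over the circle)
/// and 8-bit radius (256 uniform bins up to r_max).
fn polar_quantize_pair(x: f32, y: f32, r_max: f32) -> (u8, u8) {
    let r = (x * x + y * y).sqrt();
    let theta = y.atan2(x); // in (-pi, pi]
    let angle_code = (((theta + PI) / (2.0 * PI)) * 16.0).min(15.0) as u8;
    let radius_code = ((r / r_max).min(1.0) * 255.0).round() as u8;
    (angle_code, radius_code)
}

/// Reconstruct at the bin centre (angle) and bin value (radius).
fn polar_dequantize_pair(angle_code: u8, radius_code: u8, r_max: f32) -> (f32, f32) {
    let theta = (angle_code as f32 + 0.5) / 16.0 * (2.0 * PI) - PI;
    let r = radius_code as f32 / 255.0 * r_max;
    (r * theta.cos(), r * theta.sin())
}
```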

Han, I., Kacham, P., Karbasi, A., Mirrokni, V., & Zandieh, A. (2025). PolarQuant. arXiv:2502.02617

Two-Stage Search

Query vector
    |
    v
[QJL 1-bit] --> Hamming scan over all vectors --> top-K candidates  (fast, approximate)
    |
    v
[TurboQuant_mse 3-bit] --> codebook similarity on K candidates --> final top-k results

QJL eliminates ~98% of candidates with a single XOR + popcount per vector. TurboQuant_mse re-ranks only the survivors with near-optimal accuracy.
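The two stages can be sketched as follows, with an exact dot product standing in for the TurboQuant_mse codebook similarity in stage two (all names here are illustrative, not the crate's API):

```rust
/// Stage-1 metric: XOR + popcount over packed 1-bit codes.
fn hamming(a: &[u8], b: &[u8]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}

/// Stage-2 metric stand-in for the codebook similarity.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Returns the ids of the final top_k results.
fn two_stage_search(
    query_bits: &[u8],
    query: &[f32],
    db_bits: &[Vec<u8>], // 1-bit QJL codes, one per vector
    db: &[Vec<f32>],     // re-ranking representation (stand-in for 3-bit codes)
    prefilter_k: usize,
    top_k: usize,
) -> Vec<usize> {
    // Stage 1: cheap Hamming scan over all vectors, keep prefilter_k candidates.
    let mut cands: Vec<(usize, u32)> = db_bits
        .iter()
        .enumerate()
        .map(|(id, bits)| (id, hamming(query_bits, bits)))
        .collect();
    cands.sort_by_key(|&(_, d)| d);
    cands.truncate(prefilter_k);

    // Stage 2: accurate similarity on the survivors only.
    let mut scored: Vec<(usize, f32)> = cands
        .iter()
        .map(|&(id, _)| (id, dot(query, &db[id])))
        .collect();
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.truncate(top_k);
    scored.into_iter().map(|(id, _)| id).collect()
}
```

Stage one touches every vector but does only one XOR + popcount per code word; stage two does the expensive scoring on a small, fixed-size candidate set.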

Crates

turboquant

Core compression library with no runtime dependencies beyond ndarray.

use turboquant::{TurboIndex, SearchResult};

// Create an index (stored on disk via mmap)
// Default: QJL 1-bit pre-filter + TurboQuant_mse 3-bit re-ranking
let mut index = TurboIndex::create("./my_index", 384, /*qjl_seed=*/42, /*tqmse_seed=*/99)?;

// Insert vectors (e.g., from an embedding model)
index.insert(1, &embedding_vec)?;
index.insert_batch(&ids, &embedding_matrix)?;

// Search
let results: Vec<SearchResult> = index.search(&query_vec, 10);
for r in &results {
    println!("id={} score={:.3} hamming={}", r.id, r.score, r.distance);
}

// Maintenance
index.delete(1)?;
index.compact()?;  // rebuild without deleted vectors

Use TurboIndex::create_with_bits() to choose 2-bit (smallest) or 4-bit (most accurate) re-ranking.

Storage format: Memory-mapped files with 16-byte headers. QJL vectors (*.qjl) and TurboQuant_mse vectors (*.tqsq) are stored separately for cache-friendly scanning.

agent-memory

Two memory systems designed for AI agent workflows:

SessionMemory - in-conversation context management:

  • Tracks conversation turns with token counts
  • Selects relevant context within a token budget
  • Ranks by relevance_weight * search_score + recency_weight * recency_score
  • Always includes the N most recent turns (configurable)

use std::path::Path;
use std::sync::Arc;
use agent_memory::{SessionMemory, OnnxEmbedder};

let embedder = Arc::new(OnnxEmbedder::load(Path::new("models/minilm"), 384)?);
let mut session = SessionMemory::new(embedder)?;
session.add_turn("user", "How does the auth middleware work?", 12)?;
session.add_turn("assistant", "The auth middleware validates JWT tokens...", 85)?;

// Select context that fits in 4096 tokens, ranked by relevance to current message
let context = session.select_context("Now fix the token expiry bug", 4096, None)?;

PersistentMemory - cross-session knowledge base:

  • Ingests documents (plans, RCAs, investigation notes) with semantic indexing
  • Recalls by natural language query with type filtering
  • Tracks file changes via SHA-256 for incremental sync
  • Supports glob patterns for bulk ingestion

use std::path::Path;
use std::sync::Arc;
use agent_memory::{PersistentMemory, OnnxEmbedder, DocumentType};

let embedder = Arc::new(OnnxEmbedder::load(Path::new("models/minilm"), 384)?);
let mut mem = PersistentMemory::open(Path::new("./memory_store"), embedder)?;

// Ingest
mem.ingest("Auth rewrite: moved to asymmetric JWT...", DocumentType::RCA, Some("docs/rca_auth.md"))?;
mem.ingest_glob("docs/plans/*.md", DocumentType::Plan)?;

// Recall
let results = mem.recall("JWT token validation", 5)?;
let rca_only = mem.recall_typed("auth bug", 5, &[DocumentType::RCA])?;

// Sync after files change on disk
let stats = mem.sync()?;
println!("{} added, {} updated, {} removed", stats.added, stats.updated, stats.removed);

Document types: Plan, Memory, RCA, CommitContext, ConversationTurn, ToolResult, Custom(String)

Embedder trait: Pluggable embedding backend. Ships with OnnxEmbedder for production use via ONNX Runtime (supports any HuggingFace model exported to ONNX). A MockEmbedder is also available for unit testing.

memory-mcp

MCP (Model Context Protocol) server that exposes agent-memory to AI coding agents like Claude Code.

Tools:

Tool           Description
memory_ingest  Ingest text, a file, or files matching a glob pattern into the index
memory_recall  Semantic search over ingested documents by natural language query
memory_sync    Re-index changed files, remove deleted files
memory_stats   Index statistics: document count, token count, type breakdown

Model auto-detection: On startup, the server looks for an ONNX model in:

  1. MEMORY_MODEL_DIR environment variable
  2. <binary_dir>/../models/minilm/
  3. ~/.cache/agent-memory/models/minilm/

Falls back to a non-semantic embedder if no model is found (recall quality will be degraded — download the model for production use).

Storage: Per-project memory stored at ~/.cache/agent-memory/<project_hash>/.

codesearch-mcp (scaffolded)

Semantic code search server. Will chunk code files (.gitignore-aware), embed with BGE-code-v1, and expose index_codebase + semantic_search MCP tools. Not yet implemented.

Quick Start: memory-mcp (MCP Server)

1. Build

git clone https://github.com/coderjack/turboquant-rs.git
cd turboquant-rs
cargo build --release -p memory-mcp

The binary is at target/release/memory-mcp.

2. Download the embedding model

The server needs an ONNX embedding model. We use all-MiniLM-L6-v2 (384 dimensions, ~80MB):

pip install transformers optimum[onnxruntime] torch
python scripts/export_onnx.py --output ~/.cache/agent-memory/models/minilm

Use --model to export a different model:

python scripts/export_onnx.py --model sentence-transformers/all-mpnet-base-v2 --output ~/.cache/agent-memory/models/mpnet

The server auto-detects the model on startup from these locations (in order):

  1. MEMORY_MODEL_DIR environment variable
  2. <binary_dir>/../models/minilm/
  3. ~/.cache/agent-memory/models/minilm/

If no model is found, the server starts with a non-semantic fallback embedder and logs a warning. Semantic recall requires the ONNX model.

3. Configure Claude Code

Add to ~/.claude/settings.json:

{
  "mcpServers": {
    "memory": {
      "command": "/absolute/path/to/turboquant-rs/target/release/memory-mcp"
    }
  }
}

Or with a custom model path:

{
  "mcpServers": {
    "memory": {
      "command": "/absolute/path/to/turboquant-rs/target/release/memory-mcp",
      "env": {
        "MEMORY_MODEL_DIR": "/path/to/your/model/directory"
      }
    }
  }
}

Restart Claude Code after editing settings — MCP servers are loaded on startup.

4. Verify

In a new Claude Code session, you should see 4 new tools. Test with:

memory_stats()

5. Seed the index

memory_ingest(glob: "~/workspace/docs/*.md", doc_type: "Plan")
memory_ingest(path: "/path/to/rca.md", doc_type: "RCA")
memory_ingest(content: "The auth service uses asymmetric JWT with RS256.", doc_type: "Memory")

6. Recall

memory_recall(query: "auth middleware JWT token validation")
memory_recall(query: "database migration", top_k: 10, doc_types: ["RCA"])

7. Sync after file changes

memory_sync()

Re-indexes changed files and removes deleted ones from the index.

Using as a Rust Library

turboquant — vector compression

# Cargo.toml
[dependencies]
turboquant = { git = "https://github.com/coderjack/turboquant-rs" }

use turboquant::{TurboIndex, SearchResult};

// Create an index with QJL 1-bit + TurboQuant_mse 3-bit (default)
let mut index = TurboIndex::create("./my_index", 384, 42, 99)?;
index.insert(1, &embedding)?;

let results = index.search(&query, 10);

// Or use compressors directly
use turboquant::{QjlCompressor, TqMseCompressor};

let qjl = QjlCompressor::new(384, 42);
let bits = qjl.compress(&vector);  // 48 bytes

let tqmse = TqMseCompressor::new(384, 99, 3);  // 3-bit
let compressed = tqmse.compress(&vector);        // 148 bytes
let similarity = tqmse.similarity_raw(&query, &compressed);

agent-memory — semantic memory for agents

[dependencies]
agent-memory = { git = "https://github.com/coderjack/turboquant-rs" }

use std::sync::Arc;
use agent_memory::{PersistentMemory, OnnxEmbedder, DocumentType};

let embedder = Arc::new(OnnxEmbedder::load("models/minilm".as_ref(), 384)?);
let mut mem = PersistentMemory::open("./memory_store".as_ref(), embedder)?;

// Ingest documents
mem.ingest("RCA: the auth bug was caused by...", DocumentType::RCA, Some("rca.md"))?;
mem.ingest_glob("plans/*.md", DocumentType::Plan)?;

// Recall by semantic query
let results = mem.recall("authentication bug", 5)?;
for r in &results {
    println!("[{:.2}] {:?} — {}", r.combined_score, r.document.doc_type, r.document.content_preview);
}

// Session memory with token budgets
use agent_memory::SessionMemory;

let mut session = SessionMemory::new(embedder.clone())?;
session.add_turn("user", "How does auth work?", 8)?;
session.add_turn("assistant", "The auth middleware validates JWT tokens...", 42)?;

let context = session.select_context("Fix the token expiry bug", 4096, None)?;

Troubleshooting

MCP server not showing up in Claude Code

  • MCP servers are loaded on startup. Restart Claude Code after editing settings.json.
  • The command path must be absolute, not relative.
  • Check that the binary exists: ls /path/to/target/release/memory-mcp

"ONNX model not available" warning on startup

The server couldn't find an ONNX model and is running with a non-semantic fallback. Run the export script:

python scripts/export_onnx.py --output ~/.cache/agent-memory/models/minilm

Verify the model files exist:

ls ~/.cache/agent-memory/models/minilm/
# Should contain: model.onnx, tokenizer.json, and config files

Or set the path explicitly via the MEMORY_MODEL_DIR environment variable in your MCP server config.

"Session successfully initialized" but tools don't work

Check the server logs. memory-mcp logs to stderr:

echo '{}' | RUST_LOG=debug /path/to/memory-mcp 2>&1 | head -20

Look for errors after "memory-mcp server starting".

memory_recall returns no results

The index is empty. Ingest documents first with memory_ingest. Check the index status with memory_stats().

High memory usage

Each project gets its own index at ~/.cache/agent-memory/<hash>/. To clear a project's index:

# Find the cache directory
ls ~/.cache/agent-memory/

# Remove a specific project's index
rm -rf ~/.cache/agent-memory/<hash>/

Compression Efficiency

Method                Bits/dim  Bytes/vec (384-dim)  Compression vs f32  Use case
QJL                   1         48                   32x                 Fast pre-filtering
TurboQuant_mse 2-bit  2         100                  15x                 Compact re-ranking
TurboQuant_mse 3-bit  3         148                  10x                 Default re-ranking
TurboQuant_mse 4-bit  4         196                  7.8x                High-accuracy re-ranking
PolarQuant            ~6        288                  5x                  Alternative (legacy)

For 10,000 documents at 384 dimensions:

  • Raw float32: 15.0 MB
  • QJL + TurboQuant_mse 3-bit: 1.9 MB (7.9x total compression)
  • QJL + PolarQuant: 3.3 MB
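The per-vector figures follow directly from the bit widths: b bits per dimension, packed into bytes, plus a 4-byte f32 norm for each TurboQuant_mse vector. A quick sanity check of the table, with hypothetical helper names:

```rust
/// QJL: one sign bit per dimension, packed.
fn qjl_bytes(dims: usize) -> usize {
    (dims + 7) / 8
}

/// TurboQuant_mse: b bits per dimension, packed, plus the stored L2 norm (f32).
fn tqmse_bytes(dims: usize, bits: usize) -> usize {
    (dims * bits + 7) / 8 + 4
}
```

For 384 dimensions this reproduces 48, 100, 148, and 196 bytes per vector, and (48 + 148) bytes × 10,000 documents ≈ 1.96 MB for the default pipeline.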

Running Tests

# All tests (65 total)
cargo test --workspace

# With real ONNX model (requires export_onnx.py first)
cargo test -p agent-memory test_onnx_embedder_real_model

Citations

This project implements algorithms from the following papers:

@article{zandieh2025turboquant,
  title={TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
  author={Zandieh, Amir and Daliri, Majid and Hadian, Majid and Mirrokni, Vahab},
  journal={arXiv preprint arXiv:2504.19874},
  year={2025}
}

@article{zandieh2024qjl,
  title={QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead},
  author={Zandieh, Amir and Daliri, Majid and Han, Insu},
  journal={arXiv preprint arXiv:2406.03482},
  year={2024}
}

@article{han2025polarquant,
  title={PolarQuant},
  author={Han, Insu and Kacham, Praneeth and Karbasi, Amin and Mirrokni, Vahab and Zandieh, Amir},
  journal={arXiv preprint arXiv:2502.02617},
  year={2025}
}

License

Licensed under either of

  • Apache License, Version 2.0 (LICENSE-APACHE)
  • MIT license (LICENSE-MIT)

at your option.
