Local embedding server in Rust.
Ecosystem: Memory API • Code Search • Dashboard • Local Embeddings
Drop-in replacement for OpenAI's embeddings API. Zero API cost, roughly 10ms latency per request, and data never leaves your machine.
| | OpenAI | engram-embed |
|---|---|---|
| Cost | $0.0001/1K tokens | Free |
| Latency | ~100ms (network) | ~10ms (local) |
| Rate limits | Yes | None |
| Privacy | Data sent to cloud | Data stays local |
| Offline | No | Yes |
At scale: $100+/day → $0/day
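The arithmetic behind that figure can be sketched as follows. The price comes from the table above; the daily token volume (one billion tokens) is an assumed workload chosen to show where $100/day comes from, and `daily_cost_usd` is a hypothetical helper, not part of engram-embed.

```rust
/// Daily API cost in dollars for a given token volume and per-1K-token price.
fn daily_cost_usd(tokens_per_day: u64, price_per_1k_tokens: f64) -> f64 {
    (tokens_per_day as f64 / 1_000.0) * price_per_1k_tokens
}

fn main() {
    // Price from the comparison table; the volume is an assumed workload.
    let price_per_1k = 0.0001;
    let tokens_per_day: u64 = 1_000_000_000;

    println!(
        "hosted API: ${:.2}/day",
        daily_cost_usd(tokens_per_day, price_per_1k)
    );
    println!("local:      $0.00/day in API fees");
}
```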
# Install Rust (if needed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Clone and build
git clone https://github.com/heybeaux/engram-embed
cd engram-embed
cargo build --release
# Run (models download on first request)
cargo run --release
# Test it
curl -X POST http://127.0.0.1:8080/v1/embeddings \
-H "Content-Type: application/json" \
-d '{"input": "Hello, world!"}'
┌─────────────────────────────────────────────────────────────┐
│ engram-embed │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Axum Server │ │
│ │ POST /v1/embeddings │ │
│ │ GET /v1/models │ │
│ │ GET /health │ │
│ └───────────────────────┬──────────────────────────────┘ │
│ │ │
│ ┌───────────────────────▼──────────────────────────────┐ │
│ │ ModelRegistry (lazy loading) │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ bge-base │ │ minilm │ │ gte-base │ │ │
│ │ │ 768-dim │ │ 384-dim │ │ 768-dim │ │ │
│ │ │ 512 tok │ │ 256 tok │ │ 512 tok │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ │ │ │
│ │ ┌─────────────┐ │ │
│ │ │ nomic │ │ │
│ │ │ 768-dim │ │ │
│ │ │ 8192 tok │ │ │
│ │ └─────────────┘ │ │
│ └───────────────────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────────────▼──────────────────────────────┐ │
│ │ Candle Runtime │ │
│ │ HuggingFace's Rust ML Framework │ │
│ │ (CPU / Metal acceleration) │ │
│ └───────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
| Model | Dimensions | Max Tokens | Best For | Memory |
|---|---|---|---|---|
| bge-base | 768 | 512 | General purpose, best quality | ~450MB |
| minilm | 384 | 256 | Fast, short text | ~90MB |
| gte-base | 768 | 512 | Alternative semantic space | ~450MB |
| nomic | 768 | 8192 | Long documents, code | ~550MB |
| kalm-v2 | 896 | 512 | High-quality multilingual (opt-in) | ~1GB |
Default: bge-base — top-tier open-source embeddings, excellent quality/speed tradeoff.
KaLM-V2 (HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v2) is a 0.5B Qwen2-based embedding model that rivals models 3-26× larger on MTEB benchmarks. It is opt-in only: set EMBED_MODELS=kalm-v2 or EMBED_MODELS=bge-base,kalm-v2. It uses instruction prefixes for queries and no prefix for documents. Apache 2.0 licensed.
# Single model (default)
EMBED_MODELS=bge-base cargo run --release
# Multiple models for ensemble
EMBED_MODELS=bge-base,minilm,nomic cargo run --release
# All available models
EMBED_MODELS=all cargo run --release
Models are loaded lazily on first request to save memory. Up to 3 models kept loaded with LRU eviction.
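To make the eviction behavior concrete, here is a minimal sketch of LRU bookkeeping with a capacity of 3. It only tracks model names; `LruModels` and `touch` are hypothetical names for illustration and are not taken from the engram-embed source, which may implement this differently.

```rust
use std::collections::VecDeque;

/// Tracks which models are loaded, most recently used at the back.
struct LruModels {
    capacity: usize,
    loaded: VecDeque<String>,
}

impl LruModels {
    fn new(capacity: usize) -> Self {
        Self { capacity, loaded: VecDeque::new() }
    }

    /// Marks `model` as used, loading it if needed.
    /// Returns the name of the model evicted to make room, if any.
    fn touch(&mut self, model: &str) -> Option<String> {
        if let Some(pos) = self.loaded.iter().position(|m| m == model) {
            // Already loaded: move it to the most-recently-used end.
            let name = self.loaded.remove(pos).unwrap();
            self.loaded.push_back(name);
            return None;
        }
        // Not loaded: evict the least recently used model if at capacity.
        let evicted = if self.loaded.len() >= self.capacity {
            self.loaded.pop_front()
        } else {
            None
        };
        self.loaded.push_back(model.to_string());
        evicted
    }
}

fn main() {
    let mut registry = LruModels::new(3);
    for name in ["bge-base", "minilm", "nomic", "bge-base", "gte-base"] {
        match registry.touch(name) {
            Some(old) => println!("request {name}: loaded, evicted {old}"),
            None => println!("request {name}: ok"),
        }
    }
}
```

In this run the fifth request evicts minilm rather than bge-base, because bge-base was used again in the fourth request.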
POST /v1/embeddings
Request:
{
"input": "text to embed", // string or array of strings
"model": "bge-base" // optional, defaults to bge-base
}
Response:
{
"object": "list",
"data": [
{
"object": "embedding",
"embedding": [0.123, -0.456, ...],
"index": 0
}
],
"model": "bge-base",
"usage": {
"prompt_tokens": 3,
"total_tokens": 3
}
}
Use model: "*" or model: "all" to embed with all enabled models at once:
curl -X POST http://127.0.0.1:8080/v1/embeddings \
-H "Content-Type: application/json" \
-d '{"input": "Hello, world!", "model": "*"}'
Response:
{
"object": "list",
"embeddings": [
{
"model": "bge-base",
"dimensions": 768,
"data": [{ "embedding": [...], "index": 0 }]
},
{
"model": "minilm",
"dimensions": 384,
"data": [{ "embedding": [...], "index": 0 }]
}
],
"timing": {
"total_ms": 25,
"per_model": { "bge-base": 12, "minilm": 8 }
}
}
GET /v1/models
Response:
{
"object": "list",
"data": [
{
"id": "bge-base",
"dimensions": 768,
"max_tokens": 512,
"loaded": true
},
{
"id": "minilm",
"dimensions": 384,
"max_tokens": 256,
"loaded": false
}
]
}
GET /health
Response:
{
"status": "ok",
"models": [
{ "id": "bge-base", "dimensions": 768, "max_tokens": 512, "loaded": true, "default": true }
],
"loaded_count": 1,
"version": "0.1.0"
}
BERT-based models have a maximum sequence length (typically 512 tokens). Without truncation, long inputs cause a panic:
thread 'main' panicked at 'index out of bounds: position embeddings only support 512 tokens'
engram-embed handles this automatically:
// Truncation enabled on tokenizer initialization
tokenizer.with_truncation(Some(TruncationParams {
    max_length: model.max_tokens(), // 512 for bge-base
    strategy: TruncationStrategy::LongestFirst,
    direction: TruncationDirection::Right,
    stride: 0, // no overlapping windows
}));
This means:
- Long text is automatically truncated to fit the model
- No panics or errors on long inputs
- Truncation happens from the right (keeps the beginning)
- Works for all models with their respective limits
For very long content (code files, documents): Use the nomic model which supports 8192 tokens.
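As a rough illustration of what right truncation means for each model's limit, the sketch below operates on a plain slice of stand-in token IDs. It is a simplification: `truncate_right` is a hypothetical helper, and a real tokenizer also has to account for special tokens within the limit.

```rust
/// Keeps the first `max_tokens` items and drops the rest (right truncation).
fn truncate_right<T: Clone>(tokens: &[T], max_tokens: usize) -> Vec<T> {
    tokens[..tokens.len().min(max_tokens)].to_vec()
}

fn main() {
    // Stand-in token IDs; a real tokenizer would produce these from text.
    let ids: Vec<u32> = (0..600).collect();

    let for_bge = truncate_right(&ids, 512); // bge-base limit
    let for_nomic = truncate_right(&ids, 8192); // nomic limit

    println!("bge-base sees {} of {} tokens", for_bge.len(), ids.len());
    println!("nomic sees {} of {} tokens", for_nomic.len(), ids.len());
}
```

With a 600-token input, bge-base silently loses the last 88 tokens while nomic sees everything, which is why nomic is the safer choice for long files.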
# In engram/.env
EMBEDDING_PROVIDER=local
EMBEDDING_LOCAL_URL=http://127.0.0.1:8080
EMBEDDING_DIMENSIONS=768
# In engram-code/.env
ENGRAM_EMBED_URL=http://127.0.0.1:8080
Both services share the same embedding server for consistent vector representations.
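The value 768 matches the default bge-base model. If a different model is used, the configured dimension presumably has to match that model's output size. The sketch below checks this against the dimensions listed in the models table; `model_dimensions` and `check_dimensions` are hypothetical helpers, and how engram itself validates EMBEDDING_DIMENSIONS is an assumption here.

```rust
/// Output dimensions per model, copied from the models table.
fn model_dimensions(model: &str) -> Option<usize> {
    match model {
        "bge-base" | "gte-base" | "nomic" => Some(768),
        "minilm" => Some(384),
        "kalm-v2" => Some(896),
        _ => None,
    }
}

/// Checks that a configured dimension matches what the model produces.
fn check_dimensions(model: &str, configured: usize) -> Result<(), String> {
    match model_dimensions(model) {
        Some(actual) if actual == configured => Ok(()),
        Some(actual) => Err(format!(
            "{model} produces {actual}-dim vectors, but EMBEDDING_DIMENSIONS is {configured}"
        )),
        None => Err(format!("unknown model: {model}")),
    }
}

fn main() {
    println!("{:?}", check_dimensions("bge-base", 768));
    println!("{:?}", check_dimensions("minilm", 768));
}
```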
For improved search accuracy, use multiple models together:
┌─────────────────────────────────────────────────────┐
│ Query: "user authentication" │
│ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ bge-base │ │ nomic │ │
│ │ General │ │ Long ctx │ │
│ │ purpose │ │ semantic │ │
│ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │
│ └─────────┬─────────┘ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ RRF Fusion │ │
│ │ (in engram / │ │
│ │ engram-code) │ │
│ └─────────────────┘ │
│ │ │
│ Better recall than single model │
└─────────────────────────────────────────────────────┘
Why multiple models?
- Different models capture different semantic aspects
- Consensus (found by multiple models) increases confidence
- Reduces single-model blind spots
- Nomic's 8K context catches patterns bge-base might miss
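The fusion step in the diagram can be sketched with the standard weighted Reciprocal Rank Fusion formula, where each document scores the sum of `weight / (k + rank)` over the models that returned it. This is a minimal sketch of the general technique, not the code used in engram or engram-code; the file names, weights, and `rrf_fuse` itself are illustrative assumptions.

```rust
use std::collections::HashMap;

/// Weighted Reciprocal Rank Fusion.
/// `rankings` holds, per model, a weight and document IDs ordered best-first.
/// Each document scores sum(weight / (k + rank)) using 1-based ranks.
fn rrf_fuse(rankings: &[(f64, Vec<&str>)], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for (weight, docs) in rankings {
        for (i, doc) in docs.iter().enumerate() {
            let rank = (i + 1) as f64;
            *scores.entry(doc.to_string()).or_insert(0.0) += *weight / (k + rank);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // Highest score first; ties broken by ID so the order is deterministic.
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap().then_with(|| a.0.cmp(&b.0)));
    fused
}

fn main() {
    // Hypothetical result lists for the query "user authentication".
    let bge = (1.0, vec!["auth.rs", "login.rs", "session.rs"]);
    let nomic = (0.8, vec!["login.rs", "auth.rs", "tokens.rs"]);

    for (doc, score) in rrf_fuse(&[bge, nomic], 60.0) {
        println!("{doc}: {score:.5}");
    }
}
```

Documents returned by both models (auth.rs, login.rs) end up ahead of documents found by only one, which is the consensus effect described above.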
# In engram/.env
ENSEMBLE_ENABLED=true
ENSEMBLE_MODELS=bge-base,nomic
ENSEMBLE_WEIGHTS={"bge-base": 1.0, "nomic": 0.8}
ENSEMBLE_RRF_K=60
On M2 MacBook Pro (CPU):
| Operation | bge-base | minilm | nomic |
|---|---|---|---|
| Single text | ~10ms | ~5ms | ~15ms |
| Batch of 100 | ~400ms | ~200ms | ~600ms |
| First request (load) | ~3s | ~2s | ~5s |
Memory usage:
- 1 model loaded: ~500MB
- 2 models loaded: ~1GB
- 3 models loaded: ~1.5GB
Models are loaded lazily and evicted in least-recently-used (LRU) order when the memory limit is reached.
| Variable | Default | Description |
|---|---|---|
| EMBED_MODELS | bge-base | Models to enable (comma-separated or all) |
| PORT | 8080 | Server port |
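A value such as `bge-base,minilm` or `all` could be parsed along these lines. This is a sketch, not the server's actual parser: `parse_embed_models` is a hypothetical function, and the set that `all` expands to is assumed to be the four models in the architecture diagram (whether it also includes the opt-in kalm-v2 is not stated here).

```rust
/// Models assumed to be covered by `all` (the four in the architecture diagram).
const ALL_MODELS: [&str; 4] = ["bge-base", "minilm", "gte-base", "nomic"];

/// Parses an EMBED_MODELS-style value into a list of model names.
fn parse_embed_models(value: Option<&str>) -> Vec<String> {
    let raw = value.unwrap_or("bge-base").trim(); // documented default
    if raw == "all" {
        return ALL_MODELS.iter().map(|m| m.to_string()).collect();
    }
    raw.split(',')
        .map(str::trim)
        .filter(|m| !m.is_empty())
        .map(String::from)
        .collect()
}

fn main() {
    let from_env = std::env::var("EMBED_MODELS").ok();
    println!("{:?}", parse_embed_models(from_env.as_deref()));
}
```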
| Component | Technology | Why |
|---|---|---|
| Language | Rust | Performance, single binary, memory safety |
| HTTP | Axum | Async, ergonomic, Tokio-based |
| ML Runtime | Candle | HuggingFace's Rust ML, Apple Silicon support |
| Tokenizer | tokenizers | Rust-native, fast |
# Debug build (faster compile, slower runtime)
cargo build
# Release build (slower compile, optimized runtime)
cargo build --release
# Run tests
cargo test
# Run with specific models
EMBED_MODELS=bge-base,minilm cargo run --release
Create ~/Library/LaunchAgents/com.engram.embed.plist:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>com.engram.embed</string>
<key>ProgramArguments</key>
<array>
<string>/path/to/engram-embed/target/release/engram-embed</string>
</array>
<key>EnvironmentVariables</key>
<dict>
<key>EMBED_MODELS</key>
<string>bge-base,nomic</string>
</dict>
<key>RunAtLoad</key>
<true/>
<key>KeepAlive</key>
<true/>
<key>StandardOutPath</key>
<string>/tmp/engram-embed.log</string>
<key>StandardErrorPath</key>
<string>/tmp/engram-embed.err</string>
</dict>
</plist>
# Load the service
launchctl load ~/Library/LaunchAgents/com.engram.embed.plist
# Check status (PID shown means running)
launchctl list | grep engram
# View logs
tail -f /tmp/engram-embed.log
# Restart the service
launchctl unload ~/Library/LaunchAgents/com.engram.embed.plist
launchctl load ~/Library/LaunchAgents/com.engram.embed.plist
# Stop and unload the service
launchctl unload ~/Library/LaunchAgents/com.engram.embed.plist
# Remove the plist file
rm ~/Library/LaunchAgents/com.engram.embed.plist
# Optional: remove log files
rm /tmp/engram-embed.log /tmp/engram-embed.err
# Optional: remove cached model files
rm -rf ~/.cache/huggingface/hub/models--BAAI--bge-base-en-v1.5
rm -rf ~/.cache/huggingface/hub/models--nomic-ai--nomic-embed-text-v1.5
Models are downloaded from HuggingFace Hub on first request. If download fails:
# Check network connectivity
curl -I https://huggingface.co
# Pre-download model manually
huggingface-cli download BAAI/bge-base-en-v1.5
Reduce the number of loaded models:
EMBED_MODELS=bge-base cargo run --release
First request for each model triggers download + load (~3-5s). Subsequent requests are fast (~10ms).
To pre-warm models on startup:
# After starting server, hit each model once
curl -X POST http://127.0.0.1:8080/v1/embeddings \
-H "Content-Type: application/json" \
-d '{"input": "warmup", "model": "bge-base"}'
MIT
Embeddings, locally.