Deploy AI across mobile, web, and embedded applications.
On-device inference with zero cloud dependency. Or use our cloud API for the same models at scale.
SDKs, optimized runtime, model tools, and a cloud API. Everything you need to deploy AI on any device.
Reduce latency. Work offline. Keep data local and private. No cloud round-trips, no data leaves the device.
Run the same model across macOS, Linux, iOS, Android, Web, and embedded. One model, every surface.
Compatible with GGUF, SafeTensors, and ONNX models. Bring your own or use pre-optimized Zen models.
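These formats can be told apart by their on-disk headers: GGUF files open with the ASCII magic `GGUF`, and SafeTensors files open with a little-endian u64 header length followed by that many bytes of JSON metadata. A minimal sniffer, for illustration only (this is not the runtime's actual detection code):

```python
import json
import struct

def detect_format(path: str) -> str:
    """Guess a model file's format from its header bytes (illustrative)."""
    with open(path, "rb") as f:
        head = f.read(8)
        if head[:4] == b"GGUF":
            # GGUF files begin with the ASCII magic "GGUF".
            return "gguf"
        if len(head) == 8:
            # SafeTensors: little-endian u64 length, then JSON metadata.
            (n,) = struct.unpack("<Q", head)
            if n < (1 << 20):  # sanity cap before reading the header
                try:
                    json.loads(f.read(n))
                    return "safetensors"
                except (UnicodeDecodeError, json.JSONDecodeError):
                    pass
    return "unknown"
```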
Flexible SDKs, optimized runtime, hardware acceleration: Metal, CUDA, WebAssembly, and Vulkan, detected automatically.
Start with optimized Zen models or bring any GGUF model from Hugging Face. Edge runs them all.
Optimized for Edge
Pre-quantized models from 600M to 8B parameters, tuned for on-device inference. Download and run instantly.
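As a back-of-envelope check on whether a model in that range fits in device memory (a general rule of thumb, not an official sizing tool): quantized weights take roughly `params × bits_per_weight / 8` bytes, plus overhead for the KV cache and runtime buffers.

```python
def approx_weight_bytes(params: float, bits_per_weight: float = 4.0) -> int:
    # Weights only; KV cache and runtime buffers add more on top.
    return int(params * bits_per_weight / 8)

# An 8B-parameter model at 4-bit quantization is roughly 3.7 GiB of weights.
gib = approx_weight_bytes(8e9, 4.0) / 2**30
```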
Any GGUF from Hugging Face
Convert and run any GGUF model directly. Supports all major open-weight architectures out of the box.
Native SDKs for Rust, CLI, JavaScript/WASM, Python, Swift, and Kotlin. Same models, same API surface, every device.
```sh
# Run interactive chat
hanzo-edge run zenlm/zen4-mini-gguf

# Single prompt
hanzo-edge run zenlm/zen4-mini-gguf -p "Explain quantum computing"

# Serve OpenAI-compatible API
hanzo-edge serve zenlm/zen4-mini-gguf --port 8080

# Inspect model metadata
hanzo-edge inspect zenlm/zen4-mini-gguf

# Benchmark throughput
hanzo-edge bench zenlm/zen4-mini-gguf --tokens 512
```
```rust
use hanzo_edge_core::{load_model, InferenceSession, SamplingParams};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let model = load_model("zenlm/zen4-mini-gguf", None)?;
    let session = InferenceSession::new(model, SamplingParams::default());
    let output = session.generate("Explain quantum computing")?;
    println!("{output}");
    Ok(())
}
```
```javascript
import { EdgeModel } from 'hanzo-edge-wasm';

// Load model bytes (from fetch, IndexedDB, etc.)
const model = await EdgeModel.new(modelBytes, tokenizerBytes);

// Streaming generation
model.generate_stream("Hello!", (token) => {
  process.stdout.write(token);
});

// Or await full response
const response = await model.generate("Explain quantum computing");
console.log(response);
```
```python
from openai import OpenAI

# Connect to local hanzo-edge server
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="local"
)

response = client.chat.completions.create(
    model="zen4-mini",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
```
```swift
import HanzoEdge

let model = try EdgeModel(path: "zen4-mini.gguf")

// Single generation
let response = try model.generate("Hello, world!")
print(response)

// Streaming
for try await token in model.stream("Explain quantum computing") {
    print(token, terminator: "")
}
```
```kotlin
import ai.hanzo.edge.EdgeModel

val model = EdgeModel.load("zen4-mini.gguf")

// Single generation
val response = model.generate("Hello, world!")
println(response)

// Streaming with coroutines
model.stream("Explain quantum computing").collect { token ->
    print(token)
}
```
From model selection to hardware-accelerated inference in minutes.
Choose from pre-optimized Zen models or bring any GGUF from Hugging Face. Models are downloaded and cached automatically.
Run via CLI, serve a REST API, embed in your app with a native SDK, or load in the browser with WASM.
Automatic hardware detection. Metal on macOS and iOS, CUDA on Linux, CPU everywhere. Zero configuration needed.
```sh
# Install Hanzo Edge
curl -sSL https://edge.hanzo.ai/install.sh | sh

# Or install via Cargo
cargo install hanzo-edge

# Run a Zen model (downloads automatically)
hanzo-edge run zenlm/zen4-mini-gguf

# Or run any GGUF from Hugging Face
hanzo-edge run user/my-custom-model-gguf

# Start a local OpenAI-compatible server
hanzo-edge serve zenlm/zen4-mini-gguf --port 8080
```
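The automatic hardware detection described above amounts to a platform check with a CPU fallback. A sketch of that idea in Python (assumed logic for illustration, not the runtime's source):

```python
import platform

def choose_backend(system: str, machine: str, has_cuda: bool) -> str:
    """Pick an inference backend the way the quickstart describes."""
    if system == "Darwin" and machine == "arm64":
        return "metal"   # Apple Silicon: Metal
    if system == "Linux" and has_cuda:
        return "cuda"    # Linux with an NVIDIA GPU: CUDA
    return "cpu"         # safe fallback everywhere else

backend = choose_backend(platform.system(), platform.machine(), has_cuda=False)
```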
Production-ready on desktop and server. Expanding to mobile, web, and embedded.
| Platform | SDK | Backend | Status |
|---|---|---|---|
| macOS (Apple Silicon) | Rust, CLI | Metal | Production |
| macOS (Intel) | Rust, CLI | CPU / Accelerate | Production |
| Linux x86_64 | Rust, CLI | CPU / CUDA | Production |
| Linux ARM64 | Rust, CLI | CPU | Production |
| Web | JavaScript / WASM | CPU | Beta |
| iOS | Swift | Metal / CoreML | Coming Soon |
| Android | Kotlin | Vulkan / NNAPI | Coming Soon |
Use the Hanzo Cloud API for server-side inference when you need it. Same Zen models, same API format, auto-scaling infrastructure.
```sh
curl https://api.hanzo.ai/v1/chat/completions \
  -H "Authorization: Bearer $HANZO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zen4-mini",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```
Edge for on-device, Engine for your servers, Cloud API for managed infrastructure.
| | Edge | Engine | Cloud API |
|---|---|---|---|
| Where | Your device | Your server | Hanzo servers |
| Models | GGUF quantized | All formats | All formats |
| Acceleration | Metal, CPU, WASM | CUDA, Metal | NVIDIA A100/H100 |
| Latency | ~0ms network | ~0ms network | ~50ms network |
| Privacy | Full local | Your infra | Hanzo managed |
| Scale | Single user | Multi-user | Auto-scaling |
| Pricing | Free / OSS | Free / OSS | Pay per token |
Install the runtime, the CLI, or the WASM binding independently. All open source.
Core inference runtime library. Model loading, tokenization, sampling, and hardware-accelerated tensor ops.
```sh
cargo add hanzo-edge-core
```

CLI binary. Interactive sessions, model inspection, benchmarking, and OpenAI-compatible HTTP server.

```sh
cargo install hanzo-edge
```

Browser inference via WebAssembly. Load models from URLs or IndexedDB and run completions client-side.

```sh
cargo add hanzo-edge-wasm
```