Hanzo Edge

Deploy AI across mobile, web, and embedded applications.

On-device inference with zero cloud dependency. Or use our cloud API for the same models at scale.

curl -sSL https://edge.hanzo.ai/install.sh | sh

A full AI edge stack.

SDKs, optimized runtime, model tools, and a cloud API. Everything you need to deploy AI on any device.

On Device

Reduce latency. Work offline. Keep data local and private. No cloud round-trips, no data leaves the device.

Cross Platform

Run the same model across macOS, Linux, iOS, Android, Web, and embedded. One model, every surface.

Multi-Format

Compatible with GGUF, SafeTensors, and ONNX models. Bring your own or use pre-optimized Zen models.
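These formats are distinguishable by their file headers: GGUF files open with the ASCII magic `GGUF`, and SafeTensors files open with a little-endian u64 header length followed by a JSON header. A stdlib-only Python sketch of that sniffing logic (illustrative only, not Edge's actual loader):

```python
import struct

def sniff_model_format(head: bytes) -> str:
    """Guess a model file's format from its first bytes."""
    if head[:4] == b"GGUF":
        return "gguf"  # GGUF magic, followed by a version number
    if len(head) >= 9:
        # SafeTensors: u64 little-endian JSON-header length, then the JSON itself
        (header_len,) = struct.unpack("<Q", head[:8])
        if header_len > 0 and head[8:9] == b"{":
            return "safetensors"
    return "unknown"  # e.g. ONNX (a protobuf) needs a real parser

print(sniff_model_format(b"GGUF" + b"\x03\x00\x00\x00"))  # gguf
print(sniff_model_format(struct.pack("<Q", 2) + b"{}"))   # safetensors
```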

Full Stack

Flexible SDKs, optimized runtime, and hardware acceleration. Metal, CUDA, WebAssembly, and Vulkan, with automatic backend detection.

Ready-made solutions and flexible SDKs.

Start with optimized Zen models or bring any GGUF model from Hugging Face. Edge runs them all.

Zen Models

Optimized for Edge

Pre-quantized models from 600M to 8B parameters, tuned for on-device inference. Download and run instantly.

zen-nano (600M) · zen-eco (4B) · zen4-mini (8B)
Browse models on Hugging Face
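A quick way to see why these sizes suit on-device use is the rough memory arithmetic: a quantized model's weights take roughly parameters × bits-per-weight / 8 bytes, before KV-cache and runtime overhead. A back-of-envelope sketch (the ~4.5 bits/weight figure for 4-bit GGUF quantization is an approximation, not an Edge spec):

```python
def approx_weight_bytes(params: float, bits_per_weight: float) -> float:
    """Rough in-memory size of the weights alone, in bytes."""
    return params * bits_per_weight / 8

# Approximate weight footprints at ~4.5 bits/weight (typical 4-bit GGUF)
for name, params in [("zen-nano", 0.6e9), ("zen-eco", 4e9), ("zen4-mini", 8e9)]:
    gb = approx_weight_bytes(params, 4.5) / 1e9
    print(f"{name}: ~{gb:.1f} GB")
```

Even the 8B model lands around 4-5 GB at 4-bit, which is why it fits on a laptop or recent phone.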

Bring Your Own Model

Any GGUF from Hugging Face

Convert and run any GGUF model directly. Supports all major open-weight architectures out of the box.

Llama · Qwen · Mistral · Gemma · Phi · DeepSeek
Find GGUF models

One runtime. Every platform.

Native SDKs for Rust, CLI, JavaScript/WASM, Python, Swift, and Kotlin. Same models, same API surface, every device.

Terminal
# Run interactive chat
hanzo-edge run zenlm/zen4-mini-gguf

# Single prompt
hanzo-edge run zenlm/zen4-mini-gguf -p "Explain quantum computing"

# Serve OpenAI-compatible API
hanzo-edge serve zenlm/zen4-mini-gguf --port 8080

# Inspect model metadata
hanzo-edge inspect zenlm/zen4-mini-gguf

# Benchmark throughput
hanzo-edge bench zenlm/zen4-mini-gguf --tokens 512
Rust
use hanzo_edge_core::{load_model, InferenceSession, SamplingParams};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let model = load_model("zenlm/zen4-mini-gguf", None)?;
    let session = InferenceSession::new(model, SamplingParams::default());

    let output = session.generate("Explain quantum computing")?;
    println!("{output}");

    Ok(())
}
JavaScript / WASM
import { EdgeModel } from 'hanzo-edge-wasm';

// Load model bytes (from fetch, IndexedDB, etc.)
const model = await EdgeModel.new(modelBytes, tokenizerBytes);

// Streaming generation: append tokens to the page as they arrive
model.generate_stream("Hello!", (token) => {
  document.body.append(token);
});

// Or await full response
const response = await model.generate("Explain quantum computing");
console.log(response);
Python (via local REST API)
from openai import OpenAI

# Connect to local hanzo-edge server
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="local"
)

response = client.chat.completions.create(
    model="zen4-mini",
    messages=[{"role": "user", "content": "Hello!"}]
)

print(response.choices[0].message.content)
Swift (Coming Soon)
import HanzoEdge

let model = try EdgeModel(path: "zen4-mini.gguf")

// Single generation
let response = try model.generate("Hello, world!")
print(response)

// Streaming
for try await token in model.stream("Explain quantum computing") {
    print(token, terminator: "")
}
Kotlin (Coming Soon)
import ai.hanzo.edge.EdgeModel

val model = EdgeModel.load("zen4-mini.gguf")

// Single generation
val response = model.generate("Hello, world!")
println(response)

// Streaming with coroutines
model.stream("Explain quantum computing").collect { token ->
    print(token)
}

Three steps to on-device AI.

From model selection to hardware-accelerated inference in minutes.

1

Pick a model

Choose from pre-optimized Zen models or bring any GGUF from Hugging Face. Models are downloaded and cached automatically.

2

Deploy anywhere

Run via CLI, serve a REST API, embed in your app with a native SDK, or load in the browser with WASM.

3

Accelerate

Automatic hardware detection. Metal on macOS and iOS, CUDA on Linux, CPU everywhere. Zero configuration needed.

Quick Start
# Install Hanzo Edge
curl -sSL https://edge.hanzo.ai/install.sh | sh

# Or install via Cargo
cargo install hanzo-edge

# Run a Zen model (downloads automatically)
hanzo-edge run zenlm/zen4-mini-gguf

# Or run any GGUF from Hugging Face
hanzo-edge run user/my-custom-model-gguf

# Start a local OpenAI-compatible server
hanzo-edge serve zenlm/zen4-mini-gguf --port 8080
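Because `hanzo-edge serve` exposes an OpenAI-compatible endpoint, any HTTP client works; a minimal stdlib-only Python sketch (assumes a server is already running on port 8080, and the helper names here are illustrative):

```python
import json
import urllib.request

def build_chat_payload(prompt: str, model: str = "zen4-mini") -> dict:
    """OpenAI chat-completions request body."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(prompt: str, base_url: str = "http://localhost:8080/v1") -> str:
    """POST one chat request to a local hanzo-edge server; return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# chat("Hello!")  # requires `hanzo-edge serve ... --port 8080` to be running
```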

Run anywhere hardware exists.

Production-ready on desktop and server. Expanding to mobile, web, and embedded.

Platform              | SDK               | Backend          | Status
macOS (Apple Silicon) | Rust, CLI         | Metal            | Production
macOS (Intel)         | Rust, CLI         | CPU / Accelerate | Production
Linux x86_64          | Rust, CLI         | CPU / CUDA       | Production
Linux ARM64           | Rust, CLI         | CPU              | Production
Web                   | JavaScript / WASM | CPU              | Beta
iOS                   | Swift             | Metal / CoreML   | Coming Soon
Android               | Kotlin            | Vulkan / NNAPI   | Coming Soon

Same models, cloud scale.

Use the Hanzo Cloud API for server-side inference when you need it. Same Zen models, same API format, auto-scaling infrastructure.

Hanzo Cloud API
curl https://api.hanzo.ai/v1/chat/completions \
  -H "Authorization: Bearer $HANZO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zen4-mini",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
Get API Keys ↗

Choose the right deployment.

Edge for on-device, Engine for your servers, Cloud API for managed infrastructure.

        | Edge             | Engine      | Cloud API
Where   | Your device      | Your server | Hanzo servers
Models  | GGUF quantized   | All formats | All formats
GPU     | Metal, CPU, WASM | CUDA, Metal | NVIDIA A100/H100
Latency | ~0ms network     | ~0ms network | ~50ms network
Privacy | Full local       | Your infra  | Hanzo managed
Scale   | Single user      | Multi-user  | Auto-scaling
Pricing | Free / OSS       | Free / OSS  | Pay per token

Crates and packages.

Install the runtime, the CLI, or the WASM binding independently. All open source.

hanzo-edge-core

Core inference runtime library. Model loading, tokenization, sampling, and hardware-accelerated tensor ops.

cargo add hanzo-edge-core

hanzo-edge

CLI binary. Interactive sessions, model inspection, benchmarking, and OpenAI-compatible HTTP server.

cargo install hanzo-edge

hanzo-edge-wasm

Browser inference via WebAssembly. Load models from URLs or IndexedDB and run completions client-side.

cargo add hanzo-edge-wasm