Deploy AI across mobile, web, and embedded applications.
On-device inference with zero cloud dependency. Or use our cloud API for the same models at scale.
SDKs, optimized runtime, model tools, and a cloud API. Everything you need to deploy AI on any device.
Reduce latency. Work offline. Keep data local and private. No cloud round-trips, no data leaves the device.
Run the same model across macOS, Linux, iOS, Android, Web, and embedded. One model, every surface.
Compatible with GGUF, SafeTensors, and ONNX models. Bring your own or use pre-optimized Zen models.
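These formats can be told apart by their on-disk headers: GGUF files open with the ASCII magic `GGUF`, and SafeTensors files open with a little-endian u64 header length followed by that many bytes of JSON metadata. A minimal sniffer, for illustration only (this is not the runtime's actual detection code):

```python
import json
import struct

def detect_format(path: str) -> str:
    """Guess a model file's format from its header bytes (illustrative)."""
    with open(path, "rb") as f:
        head = f.read(8)
        if head[:4] == b"GGUF":
            # GGUF files begin with the ASCII magic "GGUF".
            return "gguf"
        if len(head) == 8:
            # SafeTensors: little-endian u64 length, then JSON metadata.
            (n,) = struct.unpack("<Q", head)
            if n < (1 << 20):  # sanity cap before reading the header
                try:
                    json.loads(f.read(n))
                    return "safetensors"
                except (UnicodeDecodeError, json.JSONDecodeError):
                    pass
    return "unknown"
```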
Flexible SDKs, optimized runtime, hardware acceleration: Metal, CUDA, WebAssembly, and Vulkan, detected automatically.
Start with optimized Zen models or bring any GGUF model from Hugging Face. Edge runs them all.
Optimized for Edge
Pre-quantized models from 600M to 8B parameters, tuned for on-device inference. Download and run instantly.
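As a back-of-envelope check on whether a model in that range fits in device memory (a general rule of thumb, not an official sizing tool): quantized weights take roughly `params × bits_per_weight / 8` bytes, plus overhead for the KV cache and runtime buffers.

```python
def approx_weight_bytes(params: float, bits_per_weight: float = 4.0) -> int:
    # Weights only; KV cache and runtime buffers add more on top.
    return int(params * bits_per_weight / 8)

# An 8B-parameter model at 4-bit quantization is roughly 3.7 GiB of weights.
gib = approx_weight_bytes(8e9, 4.0) / 2**30
```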
Any GGUF from Hugging Face
Convert and run any GGUF model directly. Supports all major open-weight architectures out of the box.
Native SDKs for Rust, CLI, JavaScript/WASM, Python, Swift, and Kotlin. Same models, same API surface, every device.
```sh
# Run interactive chat
hanzo-edge run zenlm/zen4-mini-gguf

# Single prompt
hanzo-edge run zenlm/zen4-mini-gguf -p "Explain quantum computing"

# Serve OpenAI-compatible API
hanzo-edge serve zenlm/zen4-mini-gguf --port 8080

# Inspect model metadata
hanzo-edge inspect zenlm/zen4-mini-gguf

# Benchmark throughput
hanzo-edge bench zenlm/zen4-mini-gguf --tokens 512
```
```rust
use hanzo_edge_core::{load_model, InferenceSession, SamplingParams};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let model = load_model("zenlm/zen4-mini-gguf", None)?;
    let session = InferenceSession::new(model, SamplingParams::default());
    let output = session.generate("Explain quantum computing")?;
    println!("{output}");
    Ok(())
}
```
```javascript
import { EdgeModel } from 'hanzo-edge-wasm';

// Load model bytes (from fetch, IndexedDB, etc.)
const model = await EdgeModel.new(modelBytes, tokenizerBytes);

// Streaming generation
model.generate_stream("Hello!", (token) => {
  process.stdout.write(token);
});

// Or await full response
const response = await model.generate("Explain quantum computing");
console.log(response);
```
```python
from openai import OpenAI

# Connect to local hanzo-edge server
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="local"
)

response = client.chat.completions.create(
    model="zen4-mini",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
```
```swift
import HanzoEdge

let model = try EdgeModel(path: "zen4-mini.gguf")

// Single generation
let response = try model.generate("Hello, world!")
print(response)

// Streaming
for try await token in model.stream("Explain quantum computing") {
    print(token, terminator: "")
}
```
```kotlin
import ai.hanzo.edge.EdgeModel

val model = EdgeModel.load("zen4-mini.gguf")

// Single generation
val response = model.generate("Hello, world!")
println(response)

// Streaming with coroutines
model.stream("Explain quantum computing").collect { token ->
    print(token)
}
```
From model selection to hardware-accelerated inference in minutes.
Choose from pre-optimized Zen models or bring any GGUF from Hugging Face. Models are downloaded and cached automatically.
Run via CLI, serve a REST API, embed in your app with a native SDK, or load in the browser with WASM.
Automatic hardware detection. Metal on macOS and iOS, CUDA on Linux, CPU everywhere. Zero configuration needed.
```sh
# Install Hanzo Edge
curl -sSL https://edge.hanzo.ai/install.sh | sh

# Or install via Cargo
cargo install hanzo-edge

# Run a Zen model (downloads automatically)
hanzo-edge run zenlm/zen4-mini-gguf

# Or run any GGUF from Hugging Face
hanzo-edge run user/my-custom-model-gguf

# Start a local OpenAI-compatible server
hanzo-edge serve zenlm/zen4-mini-gguf --port 8080
```
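The automatic hardware detection described above amounts to a platform check with a CPU fallback. A sketch of that idea in Python (assumed logic for illustration, not the runtime's source):

```python
import platform

def choose_backend(system: str, machine: str, has_cuda: bool) -> str:
    """Pick an inference backend the way the quickstart describes."""
    if system == "Darwin" and machine == "arm64":
        return "metal"   # Apple Silicon: Metal
    if system == "Linux" and has_cuda:
        return "cuda"    # Linux with an NVIDIA GPU: CUDA
    return "cpu"         # safe fallback everywhere else

backend = choose_backend(platform.system(), platform.machine(), has_cuda=False)
```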
Production-ready on desktop and server. Expanding to mobile, web, and embedded.
| Platform | SDK | Backend | Status |
|---|---|---|---|
| macOS (Apple Silicon) | Rust, CLI | Metal | Production |
| macOS (Intel) | Rust, CLI | CPU / Accelerate | Production |
| Linux x86_64 | Rust, CLI | CPU / CUDA | Production |
| Linux ARM64 | Rust, CLI | CPU | Production |
| Web | JavaScript / WASM | CPU | Beta |
| iOS | Swift | Metal / CoreML | Coming Soon |
| Android | Kotlin | Vulkan / NNAPI | Coming Soon |
Use the Hanzo Cloud API for server-side inference when you need it. Same Zen models, same API format, auto-scaling infrastructure.
```sh
curl https://api.hanzo.ai/v1/chat/completions \
  -H "Authorization: Bearer $HANZO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zen4-mini",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```
Edge for on-device, Engine for your servers, Cloud API for managed infrastructure.
| | Edge | Engine | Cloud API |
|---|---|---|---|
| Where | Your device | Your server | Hanzo servers |
| Models | GGUF quantized | All formats | All formats |
| Acceleration | Metal, CPU, WASM | CUDA, Metal | NVIDIA A100/H100 |
| Latency | ~0ms network | ~0ms network | ~50ms network |
| Privacy | Full local | Your infra | Hanzo managed |
| Scale | Single user | Multi-user | Auto-scaling |
| Pricing | Free / OSS | Free / OSS | Pay per token |
Install the runtime, the CLI, or the WASM binding independently. All open source.
Core inference runtime library. Model loading, tokenization, sampling, and hardware-accelerated tensor ops.
```sh
cargo add hanzo-edge-core
```

CLI binary. Interactive sessions, model inspection, benchmarking, and OpenAI-compatible HTTP server.

```sh
cargo install hanzo-edge
```

Browser inference via WebAssembly. Load models from URLs or IndexedDB and run completions client-side.

```sh
cargo add hanzo-edge-wasm
```