# ⚡️ mlx-server

A fast, native Swift inference server that serves [MLX](https://github.com/ml-explore/mlx) models behind a fully **OpenAI-compatible API**.

No Python runtime, no Global Interpreter Lock (GIL), no unnecessary memory copies. Just bare-metal Apple Silicon performance compiled to a single binary.

## 🚀 Features

- 🍎 **100% Native Apple Silicon**: Built directly on Metal and Swift.
- 🔌 **OpenAI-compatible**: Drop-in replacement for OpenAI SDKs (`/v1/chat/completions`, streaming, etc.).
- 🤗 **Direct HuggingFace Loading**: Loads HuggingFace-format models directly, with native Safetensors parsing.
- ⚡️ **TurboQuantization Integrated**: Custom low-level MLX Metal primitives that quantize the KV cache on the fly, out of the box.
- 💾 **SSD Expert Streaming**: *Experimental* zero-copy streaming that pages Mixture of Experts (MoE) layers directly from the NVMe SSD into the GPU command buffer without thrashing macOS unified memory (prevents watchdog kernel panics on 122B+ models).
- 🎛️ **Granular Memory Control**: Integrated layer partitioning (`--gpu-layers`) and Wisdom Auto-Calibration for squeezing massive models into RAM.

---

## 🆚 Why `mlx-server`? (vs. `llama.cpp` & Python `mlx-lm`)

| Feature | `mlx-server` (Swift/C++) | `llama.cpp` (Metal) | Python `mlx-lm` |
| :--- | :--- | :--- | :--- |
| **Backend math** | Official Apple MLX (Metal) | Custom Metal shaders | Official Apple MLX (Metal) |
| **Concurrency / GIL** | 🟢 **No GIL** (Swift async) | 🟢 **No GIL** (C++) | 🔴 **GIL-bottlenecked** (Python) |
| **Model format** | Native HuggingFace (Safetensors) | GGUF (requires conversion) | Native HuggingFace (Safetensors) |
| **MoE memory footprint** | 🟢 **Direct SSD streaming** | 🟡 CPU `mmap` swapping | 🔴 OS swap (high memory pressure) |
| **KV cache** | 🟢 **TurboQuantization** | 🟢 Aggressive quantization | 🟡 Standard Python hooks |
| **Dependencies** | None (single native binary) | None (single native binary) | Python runtime, `pip` packages |

**TL;DR:**
- Use **`llama.cpp`** if you prefer GGUF models or need cross-platform support on Windows/Linux.
- Use **Python `mlx-lm`** if you are prototyping ML code or data-science scripts in Python.
- Use **`mlx-server`** if you want maximum MLX inference performance on macOS for serving an API (multi-agent workflows, long-running REST services, local deployment) without the Python GIL blocking concurrent request streams.

---

## 🛠️ Quick Start

### Build

```bash
swift build -c release
```

### Run (downloads the model on first launch)

```bash
.build/release/mlx-server \
  --model mlx-community/Qwen2.5-3B-Instruct-4bit \
  --port 5413
```

*(Note: add `--stream-experts=true` when running oversized MoE models such as Qwen3.5 122B to bypass macOS virtual-memory swapping.)*

---

## 📡 API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Server health + loaded model capabilities |
| `/v1/models` | GET | List available models |
| `/v1/chat/completions` | POST | Chat completions (LLM and VLM support, multi-turn, system prompts) |

## 💻 Usage Examples

### Chat Completion (Streaming)

Drop-in compatible with standard OpenAI HTTP clients:

```bash
curl http://localhost:5413/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen2.5-3B-Instruct-4bit",
    "stream": true,
    "messages": [{"role": "user", "content": "Explain the speed of light."}]
  }'
```
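
When `stream` is `true`, the response is Server-Sent Events: each event is a `data: {...}` line carrying an OpenAI-style chunk, and the stream terminates with `data: [DONE]`. A small sketch of reassembling the generated text (the sample lines are illustrative, assuming the standard OpenAI chunk schema rather than captured server output):

```python
import json

def extract_deltas(sse_lines):
    """Pull the incremental content out of OpenAI-style SSE chunks."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alives and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            yield delta

sample = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print("".join(extract_deltas(sample)))  # prints: Hello
```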

---

## ⚙️ CLI Options

| Option | Default | Description |
|---|---|---|
| `--model` | (required) | HuggingFace model ID or local path |
| `--port` | `5413` | Port to listen on |
| `--host` | `127.0.0.1` | Host to bind |
| `--max-tokens` | `2048` | Maximum tokens generated per request |
| `--gpu-layers` | `model_default` | Limit the number of layers allocated to the GPU |
| `--stream-experts` | `false` | Enable experimental SSD streaming for MoE expert matrices |
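
For supervising the server from scripts or test harnesses, the flags above compose straightforwardly. A hypothetical helper (the binary path assumes the release build from Quick Start; flag spellings are taken from the table):

```python
def server_command(model, port=5413, host="127.0.0.1",
                   max_tokens=2048, gpu_layers=None, stream_experts=False):
    """Assemble an mlx-server argv from the CLI options above."""
    cmd = [
        ".build/release/mlx-server",
        "--model", model,
        "--port", str(port),
        "--host", host,
        "--max-tokens", str(max_tokens),
    ]
    if gpu_layers is not None:
        cmd += ["--gpu-layers", str(gpu_layers)]
    if stream_experts:
        cmd.append("--stream-experts=true")
    return cmd

cmd = server_command("mlx-community/Qwen2.5-3B-Instruct-4bit", gpu_layers=24)
# Launch with subprocess.Popen(cmd) once the binary is built.
```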

## 📦 Requirements

- macOS 14.0+
- Apple Silicon (M1/M2/M3/M4/M5)
- Xcode Command Line Tools
- Metal Toolchain (`xcodebuild -downloadComponent MetalToolchain`)
83 | 97 |
|
84 | | -## Dependencies |
| 98 | +## 📄 Dependencies & License |
85 | 99 |
|
| 100 | +Built entirely on the hard work of the Apple MLX community. |
86 | 101 | - [mlx-swift](https://github.com/ml-explore/mlx-swift) — Apple MLX framework for Swift |
87 | | -- [mlx-swift-lm](https://github.com/ml-explore/mlx-swift-lm) — Language model support |
88 | | -- [Hummingbird](https://github.com/hummingbird-project/hummingbird) — Swift HTTP server |
89 | | -- [swift-argument-parser](https://github.com/apple/swift-argument-parser) — CLI argument parsing |
90 | | - |
91 | | -## License |
| 102 | +- [Hummingbird](https://github.com/hummingbird-project/hummingbird) — Event-driven Swift HTTP server |
92 | 103 |
|
93 | | -MIT |
| 104 | +**MIT License** |