Commit ed2d2b0: docs: recreate README with mlx-server comparisons and architecture details
simbasimba authored and committed (1 parent: 9131ff7)
1 file changed: README.md (60 additions, 49 deletions)

---

# ⚡️ mlx-server

A blazingly fast, native Swift inference server that serves [MLX](https://github.com/ml-explore/mlx) models with a strict **OpenAI-compatible API**.

No Python runtime, no Global Interpreter Lock (GIL), no unnecessary memory copies. Just bare-metal Apple Silicon performance compiled to a single binary.

## 🚀 Features

- 🍎 **100% Native Apple Silicon**: Powered natively by Metal and Swift.
- 🔌 **OpenAI-compatible**: Drop-in replacement for OpenAI SDKs (`/v1/chat/completions`, streaming, etc.).
- 🧠 **Smart Model Routing**: Loads HuggingFace-format models directly, with native Safetensors parsing.
- ⚡️ **TurboQuantization Integrated**: Custom low-level MLX Metal primitives that apply extremely fast quantization for KV caching out of the box.
- 💾 **SSD Expert Streaming**: *Experimental* zero-copy streaming that swaps Mixture of Experts (MoE) layers directly from the NVMe SSD to the GPU command buffer without thrashing macOS Unified Memory (prevents watchdog kernel panics on 122B+ models).
- 🎛️ **Granular Memory Control**: Integrated layer partitioning (`--gpu-layers`) and Wisdom Auto-Calibration for squeezing massive models into RAM.

---
## 🆚 Why `mlx-server`? (vs. llama.cpp & python mlx-lm)

| Feature | `mlx-server` (Swift/C++) | `llama.cpp` (Metal) | `python mlx-lm` |
| :--- | :--- | :--- | :--- |
| **Backend Math** | Official Apple MLX (Metal) | Custom Metal Shaders | Official Apple MLX (Metal) |
| **Concurrency / GIL** | 🟢 **Zero GIL** (Swift async) | 🟢 **Zero GIL** (C++) | 🔴 **GIL Bottlenecked** (Python) |
| **Model Format** | Native HuggingFace (Safetensors) | GGUF (Requires Conversion) | Native HuggingFace (Safetensors) |
| **MoE Memory Footprint** | 🟢 **Direct SSD Streaming** | 🟡 CPU `mmap` Swapping | 🔴 OS Swap (High memory pressure) |
| **KV Cache** | 🟢 **TurboQuantization** | 🟢 Aggressive Quantization | 🟡 Standard Python Hooks |
| **Dependencies** | None (Single Native Binary) | None (Single Native Binary) | Python Runtime, `pip` packages |

**The TL;DR:**
- Use **`llama.cpp`** if you prefer GGUF formats and are running cross-platform on Windows/Linux.
- Use **`python mlx-lm`** if you are explicitly prototyping ML code or data science scripts in Python.
- Use **`mlx-server`** if you want the absolute maximum MLX inference performance on macOS for serving an API (e.g. for multi-agent workflows, long-running REST APIs, or local deployment) without the Python GIL blocking simultaneous request streaming.

---

## 🛠️ Quick Start

### Build

```bash
swift build -c release
```

### Run (downloads model on first launch)

```bash
.build/release/mlx-server \
  --model mlx-community/Qwen2.5-3B-Instruct-4bit \
  --port 5413
```

*(Note: add `--stream-experts=true` if you are attempting to run oversized MoE models such as Qwen3.5 122B, to bypass macOS virtual memory swapping.)*

---
## 📡 API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Server health + loaded model capabilities |
| `/v1/models` | GET | List available models |
| `/v1/chat/completions` | POST | Chat completions (LLM and VLM support, multi-turn, system prompts) |

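The same chat endpoint can be driven from any HTTP library, not just curl. Below is a stdlib-only Python sketch of a non-streaming call; the send itself is commented out because it assumes the server from the Quick Start is already listening on the default port, and the commented response access follows the standard OpenAI schema the server advertises:

```python
import json
import urllib.request

BASE = "http://localhost:5413"  # default --host/--port from the CLI options

# Standard OpenAI-style request body (multi-turn, with a system prompt).
body = {
    "model": "mlx-community/Qwen2.5-3B-Instruct-4bit",
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Hello!"},
    ],
}
req = urllib.request.Request(
    f"{BASE}/v1/chat/completions",
    data=json.dumps(body).encode(),
    headers={"Content-Type": "application/json"},
)
# With the server running, the reply follows the OpenAI schema:
# resp = json.load(urllib.request.urlopen(req))
# print(resp["choices"][0]["message"]["content"])
print(req.get_method(), req.full_url)
```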
## 💻 Usage Examples

### Chat Completion (Streaming)
Drop-in compatible with standard OpenAI HTTP consumers:
```bash
curl http://localhost:5413/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen2.5-3B-Instruct-4bit",
    "stream": true,
    "messages": [{"role": "user", "content": "Explain the speed of light."}]
  }'
```
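On the wire, the streamed response is OpenAI-style Server-Sent Events: each chunk arrives as a `data: {...}` line and the stream ends with a `data: [DONE]` sentinel. A minimal stdlib-only Python sketch of the client-side parsing (the sample lines below are illustrative of that schema, not captured from a real response):

```python
import json

def iter_deltas(sse_lines):
    """Yield content fragments from OpenAI-style SSE chunk lines."""
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        delta = json.loads(payload)["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]

# Illustrative chunk lines following the OpenAI streaming schema:
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": ", world!"}}]}',
    "data: [DONE]",
]
print("".join(iter_deltas(sample)))  # → Hello, world!
```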

---

## ⚙️ CLI Options

| Option | Default | Description |
|---|---|---|
| `--model` | (required) | HuggingFace model ID or local path |
| `--port` | `5413` | Port to listen on |
| `--host` | `127.0.0.1` | Host to bind |
| `--max-tokens` | `2048` | Max token limit per generation |
| `--gpu-layers` | `model_default` | Restrict the number of layers allocated to the GPU |
| `--stream-experts` | `false` | Enable experimental SSD streaming for MoE model expert matrices |

## 📦 Requirements

- macOS 14.0+
- Apple Silicon (M1/M2/M3/M4/M5)
- Xcode Command Line Tools
- Metal Toolchain (`xcodebuild -downloadComponent MetalToolchain`)

## 📄 Dependencies & License

Built entirely on the hard work of the Apple MLX community.

- [mlx-swift](https://github.com/ml-explore/mlx-swift) — Apple MLX framework for Swift
- [Hummingbird](https://github.com/hummingbird-project/hummingbird) — Event-driven Swift HTTP server

**MIT License**