Commit ab35fd2

docs: Add TurboQuant KV cache algorithm description to README

1 parent 7c20227 commit ab35fd2
1 file changed: README.md (28 additions & 0 deletions)
@@ -15,6 +15,34 @@ No Python runtime, no Global Interpreter Lock (GIL), no unnecessary memory copies

---

## ⚡️ TurboQuantization: KV Cache Compression
`mlx-server` implements **TurboQuant** (AISTATS 2026) for on-the-fly KV cache compression, enabling long-context inference with drastically reduced memory. At 3 bits per coordinate, the KV cache is compressed ~5.8× vs FP16 with near-zero accuracy loss.
The algorithm runs in two stages per KV vector:
**Stage 1 — PolarQuant (2 bits)** (sketched below):

1. Extract the L2 norm: `‖x‖`
2. Normalize: `x̂ = x / ‖x‖`
3. Rotate: `y = R @ x̂` (random orthogonal `R` applied via the Fast Walsh-Hadamard Transform in O(d log d))
4. Quantize each coordinate to the nearest Lloyd-Max centroid (optimal for the post-rotation Gaussian distribution)

- → Store: `(2-bit indices[d], float16 norm)`
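
A minimal NumPy sketch of Stage 1, independent of the actual `turboquant_plus` code. The rotation is assumed to be the common randomized-Hadamard construction `R = H·D` (`D` a shared random ±1 diagonal), since the text only says `R` is applied via the FWHT; `fwht`, `polar_quant_encode`, `polar_quant_decode`, and `LLOYD_MAX_4` are illustrative names, not the project's API:

```python
import numpy as np

# Classic 2-bit (4-level) Lloyd-Max codebook for a standard Gaussian N(0,1).
LLOYD_MAX_4 = np.array([-1.5104, -0.4528, 0.4528, 1.5104])

def fwht(x):
    """Orthonormal Fast Walsh-Hadamard Transform, O(d log d); len(x) must be a power of two."""
    y = x.astype(np.float64)
    h, d = 1, len(y)
    while h < d:
        for i in range(0, d, 2 * h):
            a, b = y[i:i + h].copy(), y[i + h:i + 2 * h].copy()
            y[i:i + h], y[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return y / np.sqrt(d)  # 1/sqrt(d) scaling keeps ||y|| == ||x||

def polar_quant_encode(x, d_signs):
    """Stage 1: 2-bit PolarQuant of one KV vector x. d_signs is a shared random ±1 vector."""
    norm = np.linalg.norm(x)
    y = fwht(d_signs * (x / norm))          # rotate the unit vector: y = H @ (D @ x_hat)
    # Coordinates of a rotated unit vector behave like N(0, 1/d); rescale by sqrt(d)
    # so they match the unit-variance Lloyd-Max codebook.
    z = y * np.sqrt(len(x))
    idx = np.abs(z[:, None] - LLOYD_MAX_4).argmin(axis=1)  # nearest-centroid search
    return idx.astype(np.uint8), np.float16(norm)

def polar_quant_decode(idx, norm, d_signs):
    """Reconstruct x from the 2-bit indices and the stored norm."""
    z = LLOYD_MAX_4[idx] / np.sqrt(len(idx))   # back to unit-vector scale
    x_hat = d_signs * fwht(z)                  # invert R: H is self-inverse, D^-1 = D
    return np.float32(norm) * x_hat / np.linalg.norm(x_hat)
```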
**Stage 2 — QJL residual (1 bit)** (sketched below):

1. Dequantize Stage 1 → `x̂_mse`
2. Compute the residual: `r = x - x̂_mse`
3. Project: `z = S @ r` (`S` an i.i.d. N(0,1) random matrix)
4. Sign-bit encode: `signs = sign(z) ∈ {+1, -1}`

- → Store: `(1-bit signs[d], float16 residual_norm)`
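
Stage 2 in the same illustrative style. The projection `S` has to be shared between compression and later score estimation; how `mlx-server` seeds or stores it is not stated, so the explicit matrix below is an assumption, and `qjl_encode` is a hypothetical name:

```python
def qjl_encode(x, x_mse, S):
    """Stage 2: 1-bit sign sketch of the Stage-1 residual r = x - x_mse."""
    r = x - x_mse
    signs = (S @ r) > 0                        # one sign bit per coordinate
    return np.packbits(signs), np.float16(np.linalg.norm(r))


# End-to-end encode of one key vector (illustrative values):
rng = np.random.default_rng(0)
d = 128
d_signs = rng.choice([-1.0, 1.0], size=d)      # Stage-1 rotation signs, shared
S = rng.standard_normal((d, d))                # Stage-2 projection, shared
x = rng.standard_normal(d)

idx, norm = polar_quant_encode(x, d_signs)     # 2 bits/coord + float16 norm
x_mse = polar_quant_decode(idx, norm, d_signs)
sign_bits, r_norm = qjl_encode(x, x_mse, S)    # 1 bit/coord + float16 norm
```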
**Total: 3 bits/coord + two float16 norms (32 bits) ≈ 5.8× compression vs FP16**
> *K cache uses full TurboQuant (Stage 1 + Stage 2) to preserve attention dot-product accuracy. V cache uses Stage 1 only (PolarQuant MSE) since MSE-optimal reconstruction doesn't need the QJL residual stage.*
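
To make the K-cache note concrete: sign bits under a Gaussian projection give an unbiased inner-product estimator, so attention scores against compressed keys can be corrected for the quantization residual. A hedged sketch reusing the helpers above; `attention_dot` is an illustrative name, not `mlx-server`'s API:

```python
def attention_dot(q, idx, norm, sign_bits, r_norm, d_signs, S):
    """Estimate <q, k> against a compressed key: Stage-1 reconstruction
    plus a QJL correction for the residual r = k - k_mse."""
    d = len(q)
    score = q @ polar_quant_decode(idx, norm, d_signs)
    signs = np.where(np.unpackbits(sign_bits, count=d), 1.0, -1.0)
    # For Gaussian rows s_i of S: E[<s_i, q> * sign(<s_i, r>)] = sqrt(2/pi) * <q, r> / ||r||,
    # so summing over the d rows and rescaling yields an unbiased estimate of <q, r>.
    score += float(r_norm) * np.sqrt(np.pi / 2) / d * float((S @ q) @ signs)
    return score
```

The V cache needs only the Stage-1 reconstruction used in the first line, which is why the note above says V skips the QJL stage.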
Reference implementation: [`turboquant_plus`](https://github.com/TheTom/turboquant_plus) (Python) | Paper: [TurboQuant, AISTATS 2026](https://aistats.org)
---
## 🆚 Why `mlx-server`? (vs. llama.cpp & python mlx-lm)

| Feature | `mlx-server` (Swift) | `llama.cpp` (Metal) | `python mlx-lm` |
