No Python runtime, no Global Interpreter Lock (GIL), no unnecessary memory copies.

---
## ⚡️ TurboQuantization: KV Cache Compression
`mlx-server` implements **TurboQuant** (AISTATS/ICLR 2026) for on-the-fly KV cache compression, enabling long-context inference with drastically reduced memory. At 3 bits/coordinate, the KV cache is compressed ~5.8× vs FP16 with near-zero accuracy loss.

The algorithm runs in two stages per KV vector:

**Stage 1 — PolarQuant (2 bits)** (see the sketch after this list):
1. Extract L2 norm: `‖x‖`
2. Normalize: `x̂ = x / ‖x‖`
3. Rotate: `y = R @ x̂` (random orthogonal R via Fast Walsh-Hadamard Transform — O(d log d))
4. Quantize each coordinate to nearest Lloyd-Max centroid (optimal for post-rotation Gaussian distribution)
   → Store: `(2-bit indices[d], float16 norm)`
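
A minimal NumPy sketch of Stage 1, for illustration only: the names (`fwht`, `polarquant_encode`, `polarquant_decode`), the fixed sign-flip vector standing in for the random orthogonal `R`, and the tabulated Lloyd-Max centroids are assumptions here, not mlx-server's actual (non-Python) implementation.

```python
import numpy as np

def fwht(x):
    """Orthonormal Fast Walsh-Hadamard Transform over the last axis, O(d log d).
    d must be a power of two; with the 1/sqrt(d) scaling, fwht is its own inverse."""
    y = np.array(x, dtype=np.float32)
    d = y.shape[-1]
    h = 1
    while h < d:
        for i in range(0, d, 2 * h):
            a = y[..., i:i + h].copy()
            b = y[..., i + h:i + 2 * h].copy()
            y[..., i:i + h] = a + b
            y[..., i + h:i + 2 * h] = a - b
        h *= 2
    return y / np.sqrt(d)

# Classic 2-bit Lloyd-Max quantizer levels for a standard Gaussian.
CENTROIDS = np.array([-1.510, -0.4528, 0.4528, 1.510], dtype=np.float32)

def polarquant_encode(x, signs):
    """Stage 1: vector -> (2-bit indices[d], float16 norm).
    `signs` is a fixed random ±1 vector; sign flips followed by the FWHT
    play the role of the random orthogonal rotation R."""
    d = x.shape[-1]
    norm = np.linalg.norm(x)              # 1. extract L2 norm
    y = fwht(signs * (x / norm))          # 2.-3. normalize, then rotate
    # 4. coordinates of a rotated unit vector are ~ N(0, 1/d); rescale by
    # sqrt(d) so the standard-Gaussian centroids apply, then snap to nearest.
    idx = np.abs(y[..., None] * np.sqrt(d) - CENTROIDS).argmin(axis=-1)
    # (a real cache would bit-pack these indices 4-per-byte to hit 2 bits/coord)
    return idx.astype(np.uint8), np.float16(norm)

def polarquant_decode(idx, norm, signs):
    """MSE reconstruction: centroid lookup, undo the rotation, restore the norm."""
    d = idx.shape[-1]
    y = CENTROIDS[idx] / np.sqrt(d)
    return signs * fwht(y) * np.float32(norm)
```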
**Stage 2 — QJL residual (1 bit)** (see the sketch after this list):
1. Dequantize Stage 1 → `x̂_mse`
2. Compute residual: `r = x - x̂_mse`
3. Project: `z = S @ r` (S ~ N(0,1) random matrix)
4. Store sign bits: `sign(z)` (1 bit/coordinate)
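
A matching sketch of Stage 2, reusing `polarquant_decode` from the Stage 1 sketch above. `S_proj` (an m×d Gaussian matrix), the helper names, and the stored float16 residual norm are assumptions; the sign-based inner-product estimator follows the published QJL construction, not a confirmed mlx-server kernel.

```python
def qjl_encode(x, idx, norm, signs, S_proj):
    """Stage 2: 1-bit sign sketch of the Stage-1 residual."""
    r = x - polarquant_decode(idx, norm, signs)   # 1.-2. dequantize, residual
    z = S_proj @ r                                # 3. Gaussian random projection
    # 4. keep only the signs (1 bit/coordinate) plus the residual norm,
    # which the dot-product estimator below needs (assumed storage detail).
    return z > 0, np.float16(np.linalg.norm(r))

def qjl_dot(q, sign_bits, r_norm, S_proj):
    """Estimate <q, r> from the sign sketch (standard QJL-style estimator)."""
    m = S_proj.shape[0]
    s = np.where(sign_bits, 1.0, -1.0)
    return float(r_norm) * np.sqrt(np.pi / 2.0) / m * np.dot(S_proj @ q, s)

# A K-cache attention score then decomposes as q @ k ≈ q @ k_mse + <q, r>, i.e.
#   q @ polarquant_decode(idx, norm, signs) + qjl_dot(q, sign_bits, r_norm, S_proj)
```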
> *K cache uses full TurboQuant (Stage 1 + Stage 2) to preserve attention dot-product accuracy. V cache uses Stage 1 only (PolarQuant MSE) since MSE-optimal reconstruction doesn't need the QJL residual stage.*
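
Back-of-envelope memory accounting under assumed sizes (head dimension d = 128, QJL projection dimension m = d, one float16 per stored norm); the exact overheads depend on the real metadata layout:

```python
d, m = 128, 128                       # head dim and QJL projection dim (assumed)
fp16_bits = 16 * d                    # uncompressed bits per K or V vector
k_bits = 2 * d + m + 2 * 16           # Stage-1 codes + sign bits + two fp16 norms
v_bits = 2 * d + 16                   # Stage 1 only: codes + one fp16 norm
print(2 * fp16_bits / (k_bits + v_bits))  # ~5.95x, near the ~5.8x figure above
```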