forked from ggml-org/llama.cpp
TurboQuant quality degradation on Q4_K_M base weight models + turbo4-V anomaly #27
Description
Update: Asymmetric K/V rescue and turbo4-V anomaly
Tested asymmetric K/V on Qwen2.5-7B-Instruct-Q4_K_M:
| K | V | PPL | Notes |
|---|---|---|---|
| q8_0 | q8_0 | 6.58 | baseline |
| q8_0 | turbo3 | 6.68 | +1.6% — quality rescued |
| turbo4 | turbo3 | 257 | K precision matters |
| turbo4 | turbo4 | 218 | symmetric turbo4 |
| turbo3 | turbo3 | 3556 | symmetric turbo3 |
| q8_0 | turbo4 | 45393 | turbo4-V catastrophic — possible bug |
Key findings:
- q8_0-K + turbo3-V recovers quality to within 1.6% of baseline even on Q4_K_M
- K quantization precision is the dominant quality factor
- turbo4-V is anomalously worse than turbo3-V (45393 vs 6.68 with same q8_0-K), suggesting a turbo4-specific V path issue
- Verifying turbo4-V on phi-4-Q8_0 now to determine if this is model-specific or a turbo4 bug
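The percentage deltas quoted above can be recomputed from the raw PPLs. A quick sketch (note: the two-decimal 6.68 figure yields +1.5% here, while the table's +1.6% presumably came from unrounded values):

```shell
# Percent PPL increase over the q8_0/q8_0 baseline (6.58) for each K/V
# configuration in the table above.
baseline=6.58
report=$(printf '%s\n' \
    'q8_0/turbo3   6.68' \
    'turbo4/turbo3 257' \
    'turbo4/turbo4 218' \
    'turbo3/turbo3 3556' \
    'q8_0/turbo4   45393' |
  awk -v base="$baseline" '{ printf "%-14s +%.1f%%\n", $1, ($2 - base) / base * 100 }')
printf '%s\n' "$report"
```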
BELOW IS FROM THE ORIGINAL ISSUE, KEPT FOR POSTERITY. NOW INVESTIGATING QUALITY DEGRADATION ON Q4_K_M BASE WEIGHT MODELS.
Description
turbo3 and turbo4 KV cache quantization produce catastrophic perplexity on M2 Pro (Apple8) hardware while working correctly on M5 Max (Apple10).
Reproduction
```sh
# M2 Pro — PPL 5321 (BROKEN)
./build/bin/llama-perplexity -m Qwen2.5-7B-Instruct-Q4_K_M.gguf \
    -ngl 99 -fa 1 -ctk turbo3 -ctv turbo3 -c 512 -f wiki.test.raw

# M5 Max — PPL 6.60 (correct)
# Same command, same commit, same model

# q8_0 on M2 Pro — PPL 8.03 (correct)
./build/bin/llama-perplexity -m Qwen2.5-7B-Instruct-Q4_K_M.gguf \
    -ngl 99 -fa 1 -ctk q8_0 -ctv q8_0 -c 512 -f wiki.test.raw
```
Environment
- M2 Pro 32GB (Mac Mini), MTLGPUFamilyApple8
- M5 Max 128GB, MTLGPUFamilyApple10
- Branch: feature/turboquant-kv-cache
- Tested on commits 43f7d3d, 3380d3c, 172fc85, 4cf7145 — all produce same bad PPL on M2
- Models tested: Qwen2.5-7B-Instruct-Q4_K_M (PPL 5321), Qwen2.5-1.5B-Instruct-Q4_K_M (PPL 8643)
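A small driver for collecting PPL across K/V cache-type pairs, as used for the asymmetric table in the update. This is a sketch: the `final_ppl`/`sweep` names are mine, and the parsing assumes llama-perplexity's usual `Final estimate: PPL = <value> +/- <err>` output line (verify against this branch):

```shell
#!/bin/sh
# Sweep -ctk/-ctv pairs and report one PPL per line. Only functions are
# defined here; the actual run needs the model and wiki.test.raw locally.

final_ppl() {
  # Extract the PPL value from llama-perplexity output on stdin.
  # Assumes the "Final estimate: PPL = ..." line format.
  sed -n 's/.*Final estimate: PPL = \([0-9.]*\).*/\1/p'
}

sweep() {
  model=$1; shift
  for pair in "$@"; do                  # pairs given as K:V, e.g. q8_0:turbo3
    k=${pair%%:*}; v=${pair##*:}
    ppl=$(./build/bin/llama-perplexity -m "$model" -ngl 99 -fa 1 \
            -ctk "$k" -ctv "$v" -c 512 -f wiki.test.raw 2>/dev/null | final_ppl)
    printf '%s\t%s\t%s\n' "$k" "$v" "$ppl"
  done
}

# Example invocation (not run here):
# sweep Qwen2.5-7B-Instruct-Q4_K_M.gguf q8_0:q8_0 q8_0:turbo3 turbo4:turbo3
```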
Investigation so far
- q8_0 PPL is fine on M2 — rules out general Metal FA bug
- Same commit gives correct PPL on M5 — M2-specific
- 4-mag LUT math verified correct — both 4-mag and 8-entry paths produce identical values
- Pre-signalnine-merge commit also broken — not introduced by PR #3 ("feat: CUDA port of TurboQuant3 KV cache — 3.47x compression, 98.5% of F16 decode speed on RTX 5090")
- Binary search in progress — going back through commit history to find introduction point
- Decode speed benchmarks were unaffected (speed was real, just computing on wrong attention scores)
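The commit binary search can be driven by `git bisect run` with a small classifier. A sketch (the 100-PPL cutoff is my choice; good runs sit near 6.6, broken ones in the hundreds to thousands):

```shell
#!/bin/sh
# Classifier for a `git bisect run` hunt over the M2 regression. Only the
# function runs here; the bisect wiring is shown in comments below.

ppl_is_sane() {
  # Anything under 100 counts as good: exit 0 = good commit, 1 = bad.
  awk -v ppl="$1" 'BEGIN { exit !(ppl + 0 < 100) }'
}

# Intended wiring (needs the model and wiki.test.raw; not executed here):
#   git bisect start <bad-commit> <good-commit>
#   git bisect run sh -c '
#     cmake --build build -j >/dev/null || exit 125   # 125 = skip unbuildable
#     ppl=$(./build/bin/llama-perplexity -m Qwen2.5-7B-Instruct-Q4_K_M.gguf \
#             -ngl 99 -fa 1 -ctk turbo3 -ctv turbo3 -c 512 -f wiki.test.raw |
#           sed -n "s/.*Final estimate: PPL = \([0-9.]*\).*/\1/p")
#     awk -v ppl="$ppl" "BEGIN { exit !(ppl + 0 < 100) }"'
```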
Possible causes under investigation
- M2 uses the `TURBO_USE_4MAG=1` path (pre-M5 hardware detection); M5 uses `TURBO_USE_4MAG=0`. While the dequant math is verified identical, there could be a precision or ordering issue in the FA kernel's inner loop that manifests differently.
- Metal WHT kernel hardcodes 128-element groups, ignoring `group_size` from op_params. On head_dim=128 models this should be fine, but needs verification.
- Some interaction between the 4-mag dequant and the Q pre-rotation or V inverse rotation that produces subtly wrong scores.
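On the WHT group-size hypothesis: both failing models should be unaffected by the hardcoded 128-element grouping, since each has head_dim = 128. Quick arithmetic check (hidden_size and head counts are assumptions from the published Qwen2.5 configs, 3584/28 for 7B and 1536/12 for 1.5B; verify against the GGUF metadata):

```shell
# head_dim = hidden_size / num_attention_heads; config values are assumed
# from the published Qwen2.5 configs, not read from the GGUF files.
hd7=$(awk 'BEGIN { print 3584 / 28 }')
hd15=$(awk 'BEGIN { print 1536 / 12 }')
printf 'Qwen2.5-7B-Instruct:   head_dim = %s\n' "$hd7"
printf 'Qwen2.5-1.5B-Instruct: head_dim = %s\n' "$hd15"
```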
Impact
- All turbo KV cache types are broken on M2/M1/M3/M4 (pre-M5) hardware
- Decode speed benchmarks were real but computed on wrong attention scores
- Asymmetric K/V work cannot be validated on M2 until this is fixed
Priority
High — blocks M2 turbo quality validation and any pre-M5 deployment.
🤖 Generated with Claude Code