TurboQuant quality degradation on Q4_K_M base weight models + turbo4-V anomaly #27

@TheTom

Description

Update: Asymmetric K/V rescue and turbo4-V anomaly

Tested asymmetric K/V on Qwen2.5-7B-Instruct-Q4_K_M:

| K      | V      | PPL   | Notes                                |
|--------|--------|-------|--------------------------------------|
| q8_0   | q8_0   | 6.58  | baseline                             |
| q8_0   | turbo3 | 6.68  | +1.6%, quality rescued               |
| turbo4 | turbo3 | 257   | K precision matters                  |
| turbo4 | turbo4 | 218   | symmetric turbo4                     |
| turbo3 | turbo3 | 3556  | symmetric turbo3                     |
| q8_0   | turbo4 | 45393 | turbo4-V catastrophic, possible bug  |

Key findings:

  • q8_0-K + turbo3-V recovers quality to within 1.6% of baseline even on Q4_K_M
  • K quantization precision is the dominant quality factor
  • turbo4-V is anomalously worse than turbo3-V (45393 vs 6.68 with same q8_0-K), suggesting a turbo4-specific V path issue
  • Verifying turbo4-V on phi-4-Q8_0 now to determine if this is model-specific or a turbo4 bug
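The table above can be regenerated with a small sweep over the `-ctk`/`-ctv` flags from the reproduction section. A hedged sketch that prints the six commands as a dry run (the turbo3/turbo4 cache-type names assume the feature/turboquant-kv-cache branch is built):

```shell
#!/bin/sh
# Sketch: emit one llama-perplexity command per K/V combination from the
# results table. Dry-run (prints commands) so it can be reviewed before
# spending GPU time; pipe into `sh` to actually execute.
MODEL=Qwen2.5-7B-Instruct-Q4_K_M.gguf

kv_sweep_cmds() {
  for combo in "q8_0 q8_0" "q8_0 turbo3" "turbo4 turbo3" \
               "turbo4 turbo4" "turbo3 turbo3" "q8_0 turbo4"; do
    k=${combo%% *}   # K cache type (first word)
    v=${combo##* }   # V cache type (second word)
    printf './build/bin/llama-perplexity -m %s -ngl 99 -fa 1 -ctk %s -ctv %s -c 512 -f wiki.test.raw\n' \
           "$MODEL" "$k" "$v"
  done
}

kv_sweep_cmds
```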

Below is the original issue, kept for posterity. Now investigating quality degradation on Q4_K_M base-weight models.

Description

turbo3 and turbo4 KV cache quantization produce catastrophic perplexity on M2 Pro (Apple8) hardware while working correctly on M5 Max (Apple10).

Reproduction

# M2 Pro — PPL 5321 (BROKEN)
./build/bin/llama-perplexity -m Qwen2.5-7B-Instruct-Q4_K_M.gguf \
  -ngl 99 -fa 1 -ctk turbo3 -ctv turbo3 -c 512 -f wiki.test.raw

# M5 Max — PPL 6.60 (correct)
# Same command, same commit, same model

# q8_0 on M2 Pro — PPL 8.03 (correct)
./build/bin/llama-perplexity -m Qwen2.5-7B-Instruct-Q4_K_M.gguf \
  -ngl 99 -fa 1 -ctk q8_0 -ctv q8_0 -c 512 -f wiki.test.raw

Environment

  • M2 Pro 32GB (Mac Mini), MTLGPUFamilyApple8
  • M5 Max 128GB, MTLGPUFamilyApple10
  • Branch: feature/turboquant-kv-cache
  • Tested on commits 43f7d3d, 3380d3c, 172fc85, 4cf7145 — all produce same bad PPL on M2
  • Models tested: Qwen2.5-7B-Instruct-Q4_K_M (PPL 5321), Qwen2.5-1.5B-Instruct-Q4_K_M (PPL 8643)

Investigation so far

  • q8_0 PPL is fine on M2 — rules out general Metal FA bug
  • Same commit gives correct PPL on M5 — M2-specific
  • 4-mag LUT math verified correct — both 4-mag and 8-entry paths produce identical values
  • Pre-signalnine-merge commit also broken — not introduced by PR #3 ("feat: CUDA port of TurboQuant3 KV cache — 3.47x compression, 98.5% of F16 decode speed on RTX 5090")
  • Binary search in progress — going back through commit history to find introduction point
  • Decode speed benchmarks were unaffected (speed was real, just computing on wrong attention scores)
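The in-progress binary search can be automated with `git bisect run`, using the PPL gap (roughly 6–8 on good commits vs hundreds-to-thousands on bad ones) as the classifier. A hedged sketch — the last-good commit, build step, and log-parsing pattern are assumptions, not from this issue:

```shell
#!/bin/sh
# Sketch: classify a commit by its measured perplexity so git bisect can
# drive the search. Threshold 20 is an assumption: well above the sane
# 6.58-8.03 range, well below the broken 257+ values.
ppl_ok() {
  awk -v p="$1" 'BEGIN { exit !(p < 20) }'  # exit 0 = sane PPL
}

# Hypothetical bisect driver (commented out; needs a built checkout):
#   git bisect start
#   git bisect bad 43f7d3d              # earliest listed known-bad commit
#   git bisect good <last-good-sha>     # assumption: pick a pre-turbo commit
#   git bisect run sh -c '
#     cmake --build build -j || exit 125   # 125 = skip unbuildable commits
#     ppl=$(./build/bin/llama-perplexity -m Qwen2.5-7B-Instruct-Q4_K_M.gguf \
#             -ngl 99 -fa 1 -ctk turbo3 -ctv turbo3 -c 512 -f wiki.test.raw \
#           | sed -n "s/.*PPL = \([0-9.]*\).*/\1/p")
#     awk -v p="$ppl" "BEGIN { exit !(p < 20) }"
#   '
```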

Possible causes under investigation

  1. M2 uses TURBO_USE_4MAG=1 path (pre-M5 hardware detection). M5 uses TURBO_USE_4MAG=0. While the dequant math is verified identical, there could be a precision or ordering issue in the FA kernel's inner loop that manifests differently.
  2. Metal WHT kernel hardcodes 128-element groups, ignoring group_size from op_params. On head_dim=128 models this should be fine, but needs verification.
  3. Some interaction between the 4-mag dequant and the Q pre-rotation or V inverse rotation that produces subtly wrong scores.
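Cause 2 could be probed directly with a model whose head_dim is not 128: if the Metal WHT kernel really hardcodes 128-element groups, such a model should misbehave even where head_dim=128 models work. A hedged dry-run sketch — the model choice and its head_dim are assumptions (the Qwen2.5 7B variant uses head_dim 128; the 0.5B variant is believed to use head_dim 64):

```shell
#!/bin/sh
# Sketch: print the probe command for a given model so head_dim=128 and
# head_dim!=128 runs can be compared under identical turbo3 settings.
probe_cmd() {
  printf './build/bin/llama-perplexity -m %s -ngl 99 -fa 1 -ctk turbo3 -ctv turbo3 -c 512 -f wiki.test.raw\n' "$1"
}

probe_cmd Qwen2.5-7B-Instruct-Q4_K_M.gguf    # head_dim 128: known behavior
probe_cmd Qwen2.5-0.5B-Instruct-Q4_K_M.gguf  # head_dim 64 (assumed): group_size check
```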

Impact

  • All turbo KV cache types are broken on M2/M1/M3/M4 (pre-M5) hardware
  • Decode speed benchmarks were real but computed on wrong attention scores
  • Asymmetric K/V work cannot be validated on M2 until this is fixed

Priority

High — blocks M2 turbo quality validation and any pre-M5 deployment.

🤖 Generated with Claude Code
