forked from ggml-org/llama.cpp
Experiment: CAT diagonal alignment for head_dim=128 #9
Hypothesis
Per-channel scaling applied before the FWHT reduces the head_dim=128 quality gap by aligning channel variances.
Background
head_dim=128 degrades PPL by 2-4% relative to head_dim=256. Root cause: fewer FWHT butterfly stages (7 vs. 8), hence weaker mixing and heavier tails in the post-rotation distribution. The CAT paper (arXiv:2603.04359) proposes a per-channel scale m_i = sqrt(E[k_i^2] / sum_j W_K_{ij}^2) applied before the rotation.
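As a concrete reading of that formula, the per-channel scales can be computed once from calibration statistics. This is a minimal sketch with assumed names (`cat_scales`, `ek2`, `w_k` are illustrative, not llama.cpp APIs):

```cpp
#include <cmath>
#include <vector>

// m_i = sqrt(E[k_i^2] / sum_j W_K[i][j]^2), one scale per key channel.
// ek2[i] holds the calibration estimate of E[k_i^2]; w_k[i] is row i of W_K.
std::vector<float> cat_scales(const std::vector<float>& ek2,
                              const std::vector<std::vector<float>>& w_k) {
    std::vector<float> m(ek2.size());
    for (size_t i = 0; i < ek2.size(); ++i) {
        float wsum = 0.0f;
        for (float w : w_k[i]) wsum += w * w;   // sum_j W_K[i][j]^2
        m[i] = std::sqrt(ek2[i] / wsum);
    }
    return m;
}
```

The scales depend only on calibration statistics and the frozen key projection, so they can be folded into the pipeline as a precomputed vector rather than recomputed per token.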
What to test
- Compute the per-channel variance calibration (128 multiplies per token)
- Apply the scaling before the WHT rotation in SET_ROWS
- Measure PPL on head_dim=128 models (Qwen2.5-7B, Llama-3.1-8B)
- Compare against the turbo3 baseline on the same models
- Measure the overhead of the extra multiply
Expected outcome
Close 25-39% of the head_dim=128 PPL gap, per the CAT paper. Minimal speed impact (one extra multiply per channel in SET_ROWS).
Priority
Medium — addresses the biggest open quality problem.
Source
AutoRepl: TODO-002 (buun, fork_dc582a), arXiv:2603.04359