Experiment: CAT diagonal alignment for head_dim=128 #9

@TheTom

Description

Hypothesis

Per-channel scaling before FWHT reduces the head_dim=128 quality gap by aligning channel variances.

Background

head_dim=128 degrades perplexity (PPL) by 2-4% relative to head_dim=256. Root cause: fewer FWHT butterfly stages (7 vs. 8) give weaker mixing and heavier tails in the post-rotation distribution. The CAT paper (arXiv:2603.04359) proposes per-channel scaling m_i = sqrt(E[k_i^2] / sum_j W_K[i,j]^2) applied before the rotation.
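The scale can be computed offline from a small calibration set. A minimal numpy sketch (function name, array layout, and the `W_K` row convention are assumptions here, not the CAT reference implementation):

```python
import numpy as np

def cat_channel_scales(k_calib: np.ndarray, W_K: np.ndarray) -> np.ndarray:
    """Per-channel CAT scales m_i = sqrt(E[k_i^2] / sum_j W_K[i,j]^2).

    k_calib: (n_tokens, head_dim) calibration keys for one head
    W_K:     (head_dim, d_model)  key-projection rows for that head (assumed layout)
    """
    second_moment = np.mean(k_calib ** 2, axis=0)   # E[k_i^2], shape (head_dim,)
    row_energy = np.sum(W_K ** 2, axis=1)           # sum_j W_K[i,j]^2
    return np.sqrt(second_moment / row_energy)
```

With unit-variance calibration keys and an identity projection, every scale comes out as 1, i.e. the rotation is left unchanged.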

What to test

  • Compute the per-channel variance calibration (128 multiplies per token)
  • Apply the scaling before the WHT rotation in SET_ROWS
  • Measure PPL on head_dim=128 models (Qwen2.5-7B, Llama-3.1-8B)
  • Compare against the turbo3 baseline on the same models
  • Measure the throughput overhead of the extra per-channel multiply
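The hot-path change is small: one element-wise multiply ahead of the butterfly stages. A numpy sketch of the scaled rotation (the real kernel lives in SET_ROWS; names here are illustrative):

```python
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Orthonormal fast Walsh-Hadamard transform along the last axis.
    Length must be a power of two; head_dim=128 takes 7 butterfly stages."""
    x = x.astype(np.float64).copy()
    n = x.shape[-1]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[..., i:i + h].copy()
            b = x[..., i + h:i + 2 * h].copy()
            x[..., i:i + h] = a + b          # butterfly sum
            x[..., i + h:i + 2 * h] = a - b  # butterfly difference
        h *= 2
    return x / np.sqrt(n)

def rotate_keys_cat(k: np.ndarray, m: np.ndarray) -> np.ndarray:
    """CAT-scaled rotation: one extra per-channel multiply before the FWHT."""
    return fwht(k * m)
```

Because the normalized transform is orthonormal (and its own inverse), applying it twice recovers the input, which makes the kernel easy to unit-test.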

Expected outcome

Close 25-39% of the head_dim=128 PPL gap, per the CAT paper, with minimal speed impact (one extra per-channel multiply in SET_ROWS).

Priority

Medium — addresses the biggest open quality problem.

Source

AutoRepl: TODO-002 (buun, fork_dc582a), arXiv:2603.04359
